---
type: "Evidence Item"
title: "Why we no longer evaluate SWE-bench Verified"
description: "SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro."
resource: "https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified"
tags: ["appendix-iii", "benchmark", "openai"]
timestamp: "2026-02-23"
category: "benchmark"
publisher: "OpenAI"
cope_score: 76
confidence: 0.9
---

# Why we no longer evaluate SWE-bench Verified

# Claim

OpenAI argues that SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Its analysis points to flawed tests and training leakage, and it recommends SWE-bench Pro instead [1].

# Relevance

Appendix III, section one: model and benchmark capability evidence

# Oracle Verdict

This belongs in the register because benchmark and model-release claims set the ceiling for the next wave of deployment stories. The labour-market effect is indirect today, but it becomes direct when these gains are packaged into agents, APIs, and enterprise tools.

# Metadata

* Publisher: OpenAI
* Category: benchmark
* Sector: Software engineering
* Capability: Frontier model release and benchmark movement
* Cope score: 76
* Confidence: 0.9

# Related Concepts

* [Live evidence index](index.md)
* [Thesis](../thesis.md)

# Citations

[1] [Why we no longer evaluate SWE-bench Verified](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified)
