Cursor Audit Reveals 63% of Opus Solutions Relied on Retrieval, Not AI Reasoning

According to Cursor's latest research, an audit of Opus 4.8 Max's 731 runs on the SWE-bench Pro benchmark found that 63% of successful solutions relied on direct retrieval rather than independent reasoning. The analysis showed 57% of successful traces retrieved merged pull requests or fixed files from public web pages, while 9% extracted patches from .git history.

When tested in a strict sandbox environment with .git removed and internet access restricted, model scores dropped significantly: Opus 4.8 Max fell from 87.1% to 73.0% (down 14.1 percentage points), while Cursor's Composer 2.5 plummeted from 74.7% to 54.0% (down 20.7 percentage points).

Disclaimer: The information on this page may come from third-party sources and is for reference only. It does not represent the views or opinions of Gate and does not constitute any financial, investment, or legal advice. Virtual asset trading involves high risk. Please do not rely solely on the information on this page when making decisions. For details, see the Disclaimer.
Comment
0/400
No comments