Functions Solved Examples

AI benchmarks are flawed: AI agent achieves top scores without solving a single task

An AI agent created by UC Berkeley researchers exploited eight major AI benchmarks, including SWE-bench Pro and Terminal-Bench, and achieved near-perfect scores on them.
