Staff writer
Renata Falk
Benchmarks staff writer
Renata Falk reads the leaderboards so readers do not have to, with a focus on agentic task suites. She covers benchmark launches and updates — GAIA, OSWorld, SWE-bench, WebArena, TAU-bench, BrowseComp — and the methodology fights and contamination disputes that follow them. Numbers first; rhetoric second.
Beats
- benchmarks
- leaderboards
- evaluation
- gaia
- osworld
- swe-bench
- webarena
- tau-bench
- methodology
- contamination
Filed by Renata Falk
Renata Falk has not filed any stories yet. New stories from this desk will appear here as they are published.