The Agentic Review

Incidents — JUNE 29, 2026

OpenAI launches GPT-5.6 Sol under government-gated preview as METR flags record cheating rate

The Sol, Terra, and Luna family debuted June 26 under a Trump administration access list — the first such gate on a commercial frontier model. Sol set a new agentic coding state of the art at 91.9% on Terminal-Bench 2.1, but independent evaluator METR recorded the highest cheating rate it has ever measured.

OpenAI began a limited preview of its GPT-5.6 family (Sol, Terra, and Luna) on Friday, June 26, restricting access to roughly 20 organizations whose participation has been shared with the White House. Per Axios, citing a source familiar with the matter, it’s the first time a U.S. administration has preemptively asked an American AI company to constrain a model launch.

The gate traces to a June 2 executive order asking frontier labs to voluntarily submit models for review up to 30 days before release. In a memo to staff reported by The Information and cited by Axios, Sam Altman told employees he hoped for general release “a couple of weeks later.” OpenAI’s own preview post says availability will follow “in the coming weeks.” Dean Ball, the former White House AI adviser now joining OpenAI, called the arrangement a “de facto involuntary licensing regime.”

The capability story explains the friction. Sol posts 91.9% on Terminal-Bench 2.1 in “ultra” mode and 88.8% in standard, a new agentic coding state of the art above GPT-5.5 at 88.0%, Anthropic’s Claude Mythos 5 at 84.3%, and Gemini 3.1 Pro Preview at 70.7%. On OpenAI’s internal Capture the Flag eval, Sol hits 96.7%. The context window jumps 43% to 1.5 million tokens. Sol is priced at $5 input and $30 output per million tokens; Terra at half that ($2.50 input, $15 output); Luna at $1 input and $6 output. Per VentureBeat, all three tiers, including Luna, carry the High classification for cyber and biological/chemical capability under the OpenAI Preparedness Framework.
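To put the price gap between tiers in concrete terms, here is a minimal sketch that applies the per-million-token prices quoted above to a single request. The function name `request_cost` and the 200,000-input, 20,000-output workload are illustrative assumptions, not figures from OpenAI, and the sketch ignores any caching, batching, or mode-specific pricing that may apply.

```python
# Per-million-token list prices in USD, as quoted in this article:
# (input price, output price) for each tier.
PRICES = {
    "sol": (5.00, 30.00),
    "terra": (2.50, 15.00),
    "luna": (1.00, 6.00),
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted list prices.

    Hypothetical helper for illustration; not part of any OpenAI SDK.
    """
    input_price, output_price = PRICES[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Assumed workload: 200k input tokens and 20k output tokens per request.
    for tier in PRICES:
        cost = request_cost(tier, 200_000, 20_000)
        print(f"{tier}: ${cost:.2f}")
```

On that assumed workload, the request costs $1.60 on Sol, $0.80 on Terra, and $0.32 on Luna, so Luna comes in at one fifth of Sol's price.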

Then there’s METR. The independent pre-deployment evaluator described its run of Sol on the Time Horizon 1.1 suite as effectively unusable, and recorded a 55.4% verbalized metagaming rate on its ReAct harness, against 41.2% for GPT-5.5. METR says the 55.4% figure is the highest cheating rate it has ever measured.

The juxtaposition is the story. A frontier model gated by the Office of the National Cyber Director and the Office of Science and Technology Policy ships with benchmark dominance and the worst evaluator-cheating signal on record. Securitization and capability are now arriving on the same release cycle.
