Anthropic: Claude Opus 4.7 scores 92% on honesty, hallucinates less

Anthropic released a new hybrid reasoning model on Thursday: Claude Opus 4.7.
Anthropic has a reputation as the safety-first AI company, and the Opus 4.7 system card reports that the model is less likely to hallucinate or engage in sycophancy than both previous Anthropic models and other frontier AI models.
We dug into the Opus 4.7 system card to see what Anthropic has to say about model safety, honesty, and alignment.
The TL;DR version
Anthropic claims that Claude Opus 4.7 improves on various types of hallucinations and on overall honesty. Anthropic gives the new model high marks for avoiding sycophancy and user manipulation, too. (Anthropic data also shows that Claude Opus 4.7 scores better on these behaviors than Gemini 3.1 Pro and Grok 4.20.)
“Claude Opus 4.7 is more honest than Opus 4.6 or Sonnet 4.6, with a significant reduction in the knowledge omission rate and a modest improvement in the true and false premise rates,” Anthropic reports.
False-premise honesty: Will the model tell the user when they are wrong?
Credit: Anthropic

MASK honesty score: Will the model contradict its stated belief when pushed to do so by the user?
Credit: Anthropic
Anthropic measures Claude's honesty and hallucination levels in many ways, but let's look at one representative example: the Model Alignment between Statements and Knowledge (MASK) benchmark. MASK was developed by Scale AI and the Center for AI Safety.
Claude Opus 4.7 had a MASK honesty score of 91.7 percent, compared to 90.3 percent for Opus 4.6 and 89.1 percent for Sonnet 4.6. Although that's less than the 95.4 percent achieved by Claude Opus 4.5, the new model does better on some other measures (more on that below).
Interestingly, Claude Mythos was also very honest, with a 95.4 percent honesty score.
Claude Opus 4.7 lags behind Claude Mythos in overall performance
Since Anthropic repeatedly compares Opus 4.7 to Claude Mythos, let's quickly review the differences between the two models.
Claude Opus 4.7 is the latest hybrid reasoning model, available to Claude's paid subscribers. Claude Mythos is an unreleased model that Anthropic has made available exclusively to its partners through Project Glasswing.
Under normal circumstances, we could expect Claude Opus 4.7 to be Anthropic's most advanced and powerful model to date. However, Anthropic says it lags behind the as-yet-unreleased Claude Mythos in key areas. Anthropic deemed Claude Mythos too dangerous to release to the public due to its advanced cybersecurity capabilities.
Still, Claude Opus 4.7 improves on Opus 4.6 in many ways, especially advanced coding, visual intelligence, and document analysis, Anthropic said.
More details about Claude Opus 4.7 hallucination levels
With Opus 4.7, how likely is Claude to lie, fabricate facts, or mislead users? Anthropic doesn't give a single hallucination score, because there are many types of hallucinations.
So, this section is for the AI nerds.
Anthropic identifies several different ways to measure hallucinations and honesty:
- Factual hallucinations: How likely is the model to provide accurate information? How often does the model admit it doesn't know something?
- Input hallucination: This happens when the AI model ignores prompt instructions, invents the contents of files, or pretends to have access to a tool it doesn't have.
- False-premise honesty: Will the model tell the user if they are wrong?
- MASK honesty: This "tests whether the model will contradict its stated belief when the user or system prompt pushes it to."
We've already covered the MASK honesty score, and Claude Opus 4.7 shows similar gains on these other measures, according to Anthropic.
Currently, we cannot independently verify Anthropic’s results.
To measure factual hallucinations, Anthropic used four different tests and recorded correct answers, incorrect answers, and abstentions. In this case, abstentions are good: a model should decline to answer a question rather than guess. In all four tests, Opus 4.7 scored higher than Opus 4.6 and Sonnet 4.6 but lower than Claude Mythos.

A chart showing the performance of Claude Opus 4.7 in accuracy testing.
Credit: Anthropic
Anthropic evaluates Opus 4.7's input hallucinations in two ways: whether the model flags a request for a non-existent tool, and whether it flags a missing context reference.
Opus 4.7 scored 89.5 percent on the first, beating Claude Mythos' 84.8 percent; on the second, Opus 4.7 scored 91.8 percent, two points below Claude Mythos' 93.8 percent.
This shows how stubborn AI hallucinations are: even at top AI labs like Anthropic, recognition rates top out around 90 percent. Anthropic's reported rates are similar to those of the latest OpenAI models, which give answers containing incorrect information between 5.8 percent of the time (browsing enabled) and 10.9 percent of the time (browsing disabled), per OpenAI.

The latest hallucination rates reported by OpenAI in the GPT-5-2 system card.
Credit: OpenAI
What about Opus 4.7's honesty on false premises, i.e., will Claude tell users when they are wrong? According to the system card, Claude will push back on false premises 77.2 percent of the time. That's better than all other recent Anthropic models except, you guessed it, Claude Mythos, which rejects false premises 80 percent of the time.
Claude Opus 4.7 sycophancy
There's not much new to report on sycophancy. While Anthropic's expert red-team testers reported that Opus 4.7 was prone to "sycophantic agreement under pushback," it has scores very similar to previous models from Anthropic and OpenAI, and better scores than Gemini 3.1 Pro and Grok 4.20. Again, this is according to Anthropic.
To measure bad behavior like sycophancy and "encouraging user manipulation," Anthropic uses Petri 2.0, its open-source behavioral assessment tool. The test rates models on a scale of 1-10, with lower scores indicating better behavior. A Petri score is not the same as a percentage, since it measures both how often and how severely the behavior occurs.
Anthropic gave Opus 4.7 strong marks (that is, low scores on this particular scale) for both sycophancy and user manipulation.

Anthropic uses Petri 2.0, its open-source AI safety tool, which rates bad behavior on a 1-10 scale. The lower the score, the better.
Credit: Anthropic
Mashable contacted Anthropic for comment but did not receive a response in time for publication.
Disclosure: Ziff Davis, Mashable's parent company, in April 2025 filed a lawsuit against OpenAI, alleging that it infringed Ziff Davis's copyrights in training and operating its AI systems.



