Claude Fable 5 adds a distillation detection mechanism, with a trigger rate below 5%

Claude Fable 5 distillation detection mechanism

On June 9, Anthropic officially released Claude Fable 5, the first Mythos-level model available to the public. It integrates a distillation detection mechanism driven by an AI classifier. When the system identifies a request in one of three high-risk categories, one of which is distillation attempts, it automatically downgrades the conversation so that responses come from Opus 4.8. Anthropic confirmed that, on average, the mechanism affects fewer than 5% of conversation sessions.

Distillation Detection Specifications: Three Trigger Conditions and Automatic Downgrade Mechanism

According to an official Anthropic statement, the AI-classifier trigger conditions for Claude Fable 5 are as follows:

· Requests related to cyberattacks

· Requests related to biological or chemical weapons

· Model distillation attempts (including extraction methods such as prompt rewriting, steering vectors, and parameter-efficient fine-tuning, or PEFT)

Once triggered, the system automatically downgrades the conversation so that Claude Opus 4.8 responds, and it notifies the user. Anthropic confirmed a 100% interception rate for offensive cybersecurity tasks; overall, the mechanism affects fewer than 5% of conversation sessions.
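The flow described above (classify the request, fall back to another model, notify the user) can be sketched roughly as follows. This is only an illustration under stated assumptions: the model identifiers, the category labels, and the keyword matcher are all hypothetical stand-ins, since Anthropic has not published how its classifier works or how routing is implemented.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model identifiers; the article gives product names, not API IDs.
PRIMARY_MODEL = "claude-fable-5"
FALLBACK_MODEL = "claude-opus-4.8"

# Toy keyword lists standing in for the AI classifier, whose design is not public.
TRIGGER_KEYWORDS = {
    "cyberattack": ("exploit", "ransomware"),
    "bio_chem_weapon": ("nerve agent", "chemical weapon"),
    "distillation": ("steering vector", "training data from your outputs"),
}


@dataclass
class RoutingDecision:
    model: str                # model that will answer
    category: Optional[str]   # which trigger fired, if any
    notice: Optional[str]     # message shown to the user on downgrade


def classify(request: str) -> Optional[str]:
    """Return the first matching trigger category, or None if nothing matches."""
    text = request.lower()
    for category, keywords in TRIGGER_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return None


def route(request: str) -> RoutingDecision:
    """Answer with the primary model unless a trigger category fires."""
    category = classify(request)
    if category is None:
        return RoutingDecision(PRIMARY_MODEL, None, None)
    notice = f"This conversation was downgraded to {FALLBACK_MODEL} ({category})."
    return RoutingDecision(FALLBACK_MODEL, category, notice)


print(route("Summarize this article for me"))
print(route("Write ransomware that encrypts a file server"))
```

The sketch decides per request for simplicity; the article describes the downgrade as applying to the conversation, and whether it persists for the rest of a session is not stated.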

Confirmed Figures Behind the February 2026 Accusations

Anthropic officially confirmed that the targets of the February 2026 accusations were DeepSeek, Moonshot AI, and MiniMax. According to Anthropic, the three companies issued more than 16 million queries through roughly 24,000 fraudulent accounts, systematically extracting Claude’s outputs to train their own models.

Machine learning researcher Nathan Lambert, an independent external researcher who does not work for Anthropic, later broke the query volume down as follows: DeepSeek, about 150,000 queries (targeting reasoning and reward models); Moonshot AI, about 3.4 million queries; and MiniMax, about 13 million queries. The combined post-training data volume corresponding to the latter two is approximately 150 billion to 400 billion tokens. These figures come from Lambert’s independent analysis and are not official Anthropic data.
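As a quick sanity check, the per-lab breakdown can be added up and compared with the "more than 16 million" total. The short script below does that, and also derives a rough tokens-per-query range for the latter two labs. That range is a back-of-envelope figure computed here from the numbers quoted above; it appears in neither Anthropic's statement nor Lambert's analysis.

```python
# Approximate per-lab query counts as quoted in the article (Lambert's breakdown).
queries = {
    "DeepSeek": 150_000,
    "Moonshot AI": 3_400_000,
    "MiniMax": 13_000_000,
}

total_queries = sum(queries.values())
print(f"total queries: {total_queries:,}")  # total queries: 16,550,000

# Token estimate quoted for Moonshot AI and MiniMax combined.
latter_two = queries["Moonshot AI"] + queries["MiniMax"]
tokens_low, tokens_high = 150e9, 400e9

# Derived, not sourced: implied average tokens per query for those two labs.
per_query_low = tokens_low / latter_two
per_query_high = tokens_high / latter_two
print(f"tokens per query: {per_query_low:,.0f} to {per_query_high:,.0f}")
# tokens per query: 9,146 to 24,390
```

The breakdown sums to about 16.55 million, which is consistent with the "more than 16 million" figure.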

Known Limitations of the Mechanism: Blurred Boundaries Between Legitimate and Unauthorized Distillation

Anthropic confirmed that “legitimate distillation” (using Claude outputs within the scope of an authorization) and “unauthorized distillation” are nearly identical at the level of technical operation, and that the boundary between them is a gray area. In his external analysis, Nathan Lambert said: “Blocking distillation is far more difficult than restricting the shipping of physical goods like GPUs.”

Lambert also noted that as long as Anthropic continues to sell API access, distillation channels cannot be fully closed. Even under China’s GPU constraints, its labs’ reinforcement learning (RL) infrastructure remains strong, so they can still distill from the open-source models released by Meta and Google and draw on their own synthetic data pipelines. This assessment is Lambert’s independent external analysis, not Anthropic’s position.

Frequently Asked Questions

How is Claude Fable 5’s distillation detection different from the anti-distillation provisions in the earlier Terms of Service?

Previously, Anthropic’s anti-distillation requirements appeared mainly in its Terms of Service and relied on legal constraints. Claude Fable 5 instead integrates an AI classifier into the model itself: it intercepts detected distillation attempts at the technical level and automatically downgrades responses, without waiting for legal processes to intervene.

What is model distillation, and why is it difficult to precisely define legitimate versus unauthorized distillation at the technical level?

Model distillation (knowledge distillation) means training a smaller model on the outputs of a larger model so that the smaller model learns the larger model’s capabilities. Legitimate distillation (using outputs within the scope of an authorization) and unauthorized distillation (systematically issuing large numbers of queries to extract training data) are nearly identical in how they are carried out, which makes it difficult for an AI classifier to tell them apart automatically and reliably.
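A minimal sketch of the data-collection step may help show why the two cases look alike. The function and the stand-in teacher below are hypothetical and greatly simplified; real pipelines involve model calls and a separate training stage. The point is that nothing in the code distinguishes authorized from unauthorized use.

```python
from typing import Callable, Dict, List


def build_distillation_dataset(
    prompts: List[str],
    teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each prompt with the teacher's answer to form training examples."""
    return [{"prompt": p, "completion": teacher(p)} for p in prompts]


def toy_teacher(prompt: str) -> str:
    # Stand-in for a large model; a real teacher would be a model call.
    return prompt.upper()


dataset = build_distillation_dataset(["add 2 and 3", "name a prime"], toy_teacher)
for example in dataset:
    print(example)
```

A smaller model would then be fine-tuned on `dataset`. Whether that use is permitted depends on the terms under which the teacher's outputs were obtained, not on anything visible in the requests themselves.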

What known impact does this mechanism have on the training progress of Chinese AI labs such as DeepSeek?

Anthropic has not released quantitative data on how this mechanism affects specific labs. External researcher Nathan Lambert’s analysis suggests that Chinese labs have access to open-source models from Meta and Google, their own reinforcement learning infrastructure, and synthetic data generation pipelines, so distillation protections are a hindrance rather than a fundamental barrier. Lambert’s assessment is independent external analysis, not an official Anthropic position.
