Anthropic has never released a technical paper detailing the Claude Mythos architecture. However, the research community has not been deterred from speculating about it. Kye Gomez recently launched an open-source project on GitHub called OpenMythos, aiming to construct a theoretical model of the Claude Mythos architecture from first principles. The project is built entirely in PyTorch and is grounded in peer-reviewed research.
OpenMythos is not a leaked model, a fine-tune, or a distillation. It is a hypothesis expressed in code, with a specific falsifiable claim that makes it intriguing.
The Main Claim: Claude Mythos Is a Recurrent-Depth Transformer
OpenMythos suggests that Claude Mythos falls under a category of architectures known as Recurrent-Depth Transformers (RDTs), also referred to as Looped Transformers in the literature. This concept differs significantly from standard transformer stacks.
In a traditional transformer model like GPT, LLaMA, or Mistral, the input passes through a sequence of distinct layers, each with its own independent weights; more capacity typically means more layers and more parameters. In a Recurrent-Depth Transformer, a single fixed set of weights is applied repeatedly across T loop steps within one forward pass. The depth of reasoning is determined not by the number of stored parameters but by the number of iterations at inference time.
Imagine it as more akin to revising a draft than reading a book: the model revisits the same computational block multiple times, refining its internal representation with each iteration.
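The core idea can be sketched in a few lines of PyTorch. This is a minimal illustration of weight-tied recurrent depth, not code from the OpenMythos repository; the class and variable names are invented for the example.

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """Minimal sketch of a recurrent-depth forward pass: one shared
    transformer layer reused at every loop step. Illustrative only."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # A single encoder layer whose weights are reused every iteration.
        self.block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )

    def forward(self, h: torch.Tensor, T: int = 4) -> torch.Tensor:
        # Depth comes from the iteration count T, not from stacked layers.
        for _ in range(T):
            h = self.block(h)
        return h

x = torch.randn(2, 10, 64)        # (batch, seq, d_model)
model = LoopedBlock()
shallow = model(x, T=2)           # fewer refinement passes
deep = model(x, T=8)              # more inference-time compute, same weights
print(shallow.shape, deep.shape)  # both torch.Size([2, 10, 64])
```

The same parameters produce either a shallow or a deep computation depending only on how many times the loop runs at inference.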
How the Architecture is Structured
OpenMythos structures this as a three-part system: Prelude → Recurrent Block → Coda. The Prelude and Coda are standard transformer layers executed once. The Recurrent Block serves as the computational core, iterated up to T=16 times.
At each loop step t, the hidden state is updated according to the following rule:
h_{t+1} = A·h_t + B·e + Transformer(h_t, e)

Here, h_t represents the hidden state after the t-th loop iteration, and e is the encoded input from the Prelude, deliberately reintroduced at each step. Without this reinjection, the hidden state would drift away from the original input signal across deep loops. The learned matrices A and B govern how much of the previous hidden state and of the encoded input carry forward at each step.
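The update rule can be sketched directly. Everything below is an assumption-laden illustration: A and B are modeled as linear maps, and the conditioning of the transformer block on e is approximated by the additive B·e term rather than explicit cross-attention.

```python
import torch
import torch.nn as nn

class RecurrentStep(nn.Module):
    """Sketch of h_{t+1} = A·h_t + B·e + Transformer(h_t, e).
    Layer choices and shapes are assumptions for illustration."""

    def __init__(self, d: int = 32):
        super().__init__()
        self.A = nn.Linear(d, d, bias=False)  # carries the prior hidden state forward
        self.B = nn.Linear(d, d, bias=False)  # reinjects the encoded input e each step
        self.transformer = nn.TransformerEncoderLayer(
            d_model=d, nhead=4, batch_first=True
        )

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # In the full design the block also conditions on e; here that
        # conditioning is folded into the additive B·e term for simplicity.
        return self.A(h) + self.B(e) + self.transformer(h)

step = RecurrentStep()
e = torch.randn(1, 8, 32)   # encoded input from the Prelude
h = torch.zeros_like(e)     # initial hidden state
for t in range(16):         # up to T=16 loop iterations
    h = step(h, e)
print(h.shape)
```

Because e is added back at every step, even a zero-initialized hidden state stays anchored to the input across all sixteen iterations.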
The Feedforward Neural Network (FFN) inside the Recurrent Block is not a standard feedforward layer. OpenMythos replaces it with a Mixture-of-Experts (MoE) layer, following the design introduced in DeepSeekMoE. This involves a large pool of fine-grained routed experts, with only a sparse top-K subset activated per token, alongside a small set of always-active shared experts that capture common cross-domain patterns. The router selects distinct expert subsets at each loop depth, ensuring that each iteration is computationally distinct despite sharing the same base weights. MoE offers domain breadth, while looping provides reasoning depth.
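The shared-plus-routed expert split can be sketched as follows. This is a deliberately naive token-by-token implementation of DeepSeekMoE-style routing (real implementations batch the dispatch); expert counts and sizes are illustrative.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Sketch of a DeepSeekMoE-style FFN: shared experts always run,
    routed experts are activated top-K per token. Sizes are illustrative."""

    def __init__(self, d: int = 32, n_routed: int = 8, n_shared: int = 2, k: int = 2):
        super().__init__()
        make_ffn = lambda: nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d)
        )
        self.routed = nn.ModuleList(make_ffn() for _ in range(n_routed))
        self.shared = nn.ModuleList(make_ffn() for _ in range(n_shared))
        self.router = nn.Linear(d, n_routed)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d). Shared experts capture common cross-domain patterns.
        out = sum(expert(x) for expert in self.shared)
        # The router picks a sparse top-K subset of routed experts per token.
        gates = torch.softmax(self.router(x), dim=-1)      # (tokens, n_routed)
        topk_vals, topk_idx = gates.topk(self.k, dim=-1)
        routed_out = torch.zeros_like(x)
        for token in range(x.shape[0]):
            for w, idx in zip(topk_vals[token], topk_idx[token]):
                routed_out[token] += w * self.routed[int(idx)](x[token])
        return out + routed_out

moe = SparseMoE()
x = torch.randn(5, 32)   # 5 tokens
y = moe(x)
print(y.shape)           # torch.Size([5, 32])
```

Each token touches only k of the eight routed experts, so capacity grows with the expert pool while per-token compute stays roughly constant.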
The attention mechanism defaults to Multi-head Latent Attention (MLA) from DeepSeek-V2, which caches a compressed low-rank key-value (KV) latent instead of full key/value tensors, significantly reducing KV-cache memory at scale.
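The memory argument is easy to see with a toy version of the compression. The projections below are a simplified sketch of the MLA idea (cache one small latent per token, reconstruct keys and values on demand); the dimensions are invented for illustration.

```python
import torch
import torch.nn as nn

d_model, d_latent, seq = 64, 8, 128   # illustrative sizes

# MLA-style compression: the cache stores one small latent per token
# instead of full keys and values. Projections are a simplified sketch.
W_down = nn.Linear(d_model, d_latent, bias=False)   # compress to the KV latent
W_up_k = nn.Linear(d_latent, d_model, bias=False)   # reconstruct keys
W_up_v = nn.Linear(d_latent, d_model, bias=False)   # reconstruct values

x = torch.randn(seq, d_model)
kv_latent = W_down(x)                  # this is all the cache needs to store
k, v = W_up_k(kv_latent), W_up_v(kv_latent)

full_cache = 2 * seq * d_model         # floats for a standard K and V cache
mla_cache = seq * d_latent             # floats for the compressed latent
print(f"cache size: {mla_cache} vs {full_cache} floats "
      f"({full_cache / mla_cache:.0f}x smaller)")
```

At these toy sizes the latent cache is 16x smaller than storing K and V outright, and the gap widens as model width grows.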
Reasoning in Continuous Latent Space
An essential feature of this architecture is that reasoning takes place entirely in continuous latent space. There is no emission of intermediate tokens between loop steps – the model does not generate text midway through a thought process and then re-read it. This distinguishes it structurally from chain-of-thought prompting, where reasoning is externalized as token sequences.
Saunshi et al. (2025) formally demonstrate that each loop iteration in an RDT is functionally equivalent to one step of chain-of-thought, but operates on real-valued vectors rather than discrete tokens. Continuous latent thoughts can encode multiple alternative next steps simultaneously, enabling a form of breadth-first search within the reasoning space in a single forward pass.
This characteristic also confers a tangible advantage. While a standard transformer trained on 5-hop reasoning chains struggles with 10-hop chains during testing – lacking the ability to extend its depth beyond the training phase – a Recurrent-Depth Transformer addresses this naturally. Additional inference-time loops extend the reasoning chain without the need for retraining. More complex problems receive more computation, while simpler ones terminate early.
Solving the Stability Problem
Training looped models has historically been challenging due to the possibility of the hidden state growing unbounded across iterations, a phenomenon known as residual explosion. OpenMythos tackles this issue by employing a Linear Time-Invariant (LTI) injection constraint inspired by the Parcae architecture (Prairie et al., 2026). This constraint ensures that the spectral radius of matrix A, denoted as ρ(A), remains below 1 by design, guaranteeing stability regardless of learning rate or gradient noise.
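A minimal way to see why ρ(A) < 1 matters is to enforce it by rescaling and iterate. This is a simplified stand-in for the LTI constraint described above, using the largest singular value as an upper bound on the spectral radius.

```python
import torch

def constrain_spectral_radius(A: torch.Tensor, rho_max: float = 0.99) -> torch.Tensor:
    # The largest singular value upper-bounds the spectral radius rho(A);
    # rescaling keeps it below rho_max. A simplified stand-in for the
    # LTI injection constraint, not the Parcae mechanism itself.
    sigma = torch.linalg.matrix_norm(A, ord=2)
    return A * torch.clamp(rho_max / sigma, max=1.0)

A_stable = constrain_spectral_radius(torch.randn(16, 16))

# With rho(A) < 1, repeated application contracts instead of exploding.
h = torch.randn(16)
start_norm = h.norm().item()
for _ in range(100):
    h = A_stable @ h
print(start_norm, h.norm().item())   # the norm shrinks across iterations
```

An unconstrained random matrix of this size would blow the hidden state up within a few dozen iterations; after the rescaling, one hundred applications still leave it bounded.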
A second failure mode is excessive recurrence: beyond a certain loop depth, predictions degrade as the hidden state drifts past the solution and into noise. This is referred to as the ‘overthinking’ problem. Adaptive Computation Time (ACT) halting resolves it with a learned scalar per position that dynamically determines when to stop looping. Positions that are harder to process receive more computation, while tokens that have already converged terminate early.
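Per-position halting can be sketched with a cumulative halt probability that caps the loop count. This is a simplified illustration of the ACT idea, not the exact scheme used by OpenMythos; the threshold and the toy refinement step are assumptions.

```python
import torch
import torch.nn as nn

class ACTHalting(nn.Module):
    """Sketch of adaptive halting: each loop step produces a halt
    probability per position, and a position stops looping once its
    cumulative probability crosses a threshold. Illustrative only."""

    def __init__(self, d: int = 32, threshold: float = 0.99):
        super().__init__()
        self.halt = nn.Linear(d, 1)   # learned scalar per position
        self.threshold = threshold

    def forward(self, h: torch.Tensor, step_fn, max_steps: int = 16):
        # h: (seq, d). cum tracks each position's accumulated halt mass.
        cum = torch.zeros(h.shape[0])
        steps_used = torch.zeros(h.shape[0], dtype=torch.long)
        for _ in range(max_steps):
            active = cum < self.threshold        # positions still looping
            if not active.any():
                break                            # everything has converged
            h = torch.where(active.unsqueeze(-1), step_fn(h), h)
            cum = cum + torch.sigmoid(self.halt(h)).squeeze(-1) * active
            steps_used = steps_used + active.long()
        return h, steps_used

halter = ACTHalting()
h = torch.randn(6, 32)
refined, steps = halter(h, step_fn=lambda x: 0.9 * x)  # toy refinement step
print(steps)   # positions may stop at different depths, capped at max_steps
```

Positions that cross the threshold early drop out of the loop while harder ones keep iterating, which is exactly the compute allocation the text describes.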
Additionally, Depth-Wise LoRA adapters introduce a small rank-r adaptation matrix at each iteration depth, providing slight behavioral variations at each loop step without significantly increasing parameters. This bridges the gap between weight-tying and distinct layers.
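The depth-wise LoRA idea reduces to a shared weight plus a tiny per-depth low-rank delta. The sketch below uses invented names and sizes; the zero-initialized B factor is a common LoRA convention that makes every adapter a no-op before training.

```python
import torch
import torch.nn as nn

class DepthWiseLoRA(nn.Module):
    """Sketch of depth-wise LoRA: one tied base weight W plus a small
    rank-r adapter (B_t @ A_t) per loop depth t. Illustrative only."""

    def __init__(self, d: int = 32, r: int = 2, max_depth: int = 16):
        super().__init__()
        self.W = nn.Linear(d, d)                             # tied base weights
        self.A = nn.Parameter(torch.randn(max_depth, r, d) * 0.01)
        self.B = nn.Parameter(torch.zeros(max_depth, d, r))  # zero-init: no-op at start

    def forward(self, x: torch.Tensor, depth: int) -> torch.Tensor:
        # Same base transform at every depth, plus a depth-specific
        # low-rank delta of rank at most r.
        delta = self.B[depth] @ self.A[depth]                # (d, d)
        return self.W(x) + x @ delta.T

layer = DepthWiseLoRA()
x = torch.randn(4, 32)
for t in range(16):            # each loop step gets its own adapter
    x = layer(x, depth=t)
print(x.shape)
```

With rank r = 2 and d = 32, each adapter adds only 2·r·d = 128 parameters per depth, so sixteen depths cost far less than sixteen distinct layers while still letting each iteration behave slightly differently.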
Why Parameter Efficiency Matters
The Parcae paper (Prairie et al., 2026) provides empirical evidence supporting the claim of efficiency. With 770M parameters, an RDT matches a 1.3B standard transformer trained on the same data – approximately half the parameters for equivalent downstream quality. Optimal recurrence and optimal token count exhibit consistent power-law scaling across scales, establishing predictable scaling laws for looped training.
This has significant implications: reasoning depth scales with inference-time compute rather than stored parameter count. This reframes a key assumption in the scaling debate. The crucial factor may not be parameter count during training, but loop depth during inference.
What OpenMythos Contributes
OpenMythos offers four distinct research artifacts: a fully customizable PyTorch implementation of the RDT hypothesis with MoE FFN and Multi-head Latent Attention; LTI-stable recurrent injection integrated as a primary training element; depth-wise LoRA adapters enabling behavioral differentiation per iteration; and a reproducible research baseline for exploring looped transformer dynamics and reasoning depth during inference.
Regardless of whether Mythos is truly an RDT, OpenMythos provides the research community with a tangible and executable resource – an implementation of an architecture class that the literature increasingly suggests is underexplored, potentially representing a fundamentally different approach to developing advanced AI beyond simply scaling up model sizes.