Thinking Transformer: How Mixture-of-Recursions Reshapes AI

Discover how Mixture-of-Recursions (MoR) gives AI token-level depth control—making models faster, cheaper, and more human-like in how they “think.”

Why future AI might think more like humans—looping, pausing, and prioritizing what matters.

What if your AI knew when to skim and when to stew?

When we think, we don’t give every thought the same weight. Simple stuff? We breeze through it. But the hard questions—the ones that touch on values, identity, ambiguity—we loop on those. We double back. We mull.

Most AI doesn’t do that.

Today’s language models treat every word with the same intensity. Whether you say “hello” or drop a quote from Kant, they apply the same depth of processing across the board. It’s like using a jackhammer to brush your teeth—clumsy, loud, and not quite right.

But a new approach is changing that. It’s called Mixture-of-Recursions, or MoR, and it could shift how AI allocates its mental effort—token by token, thought by thought. For the technical paper behind MoR, see Mixture-of-Recursions on arXiv.

This isn’t just about speed. It’s about giving AI a more human way to think.


Why Most Transformers Are Overkill for Easy Stuff

Every time you send a prompt to a modern language model, something odd happens under the hood.

Whether the model is evaluating the word “cat” or “metaphysics,” it pushes both through the exact same number of transformer layers—say, 48 or more. Every token gets the full ride, no matter how trivial.

Why? Because that’s how transformers were originally built: uniform, symmetrical, predictable.

But here’s the thing—humans don’t operate like that.

We triage. We scan the fluff and zoom in on the signal. We let obvious ideas pass with barely a nod while giving complex ones a full cognitive workout. We think recursively, looping back over tough material.

MoR takes that human strategy and gives it to machines.


The Problem with “Just Make It Bigger”

For years, the mantra in AI was simple: bigger is better.

More parameters. More data. More layers. And for a while, that worked. GPT-3, GPT-4, and other massive models dazzled the world by brute-forcing their way through language understanding.

But scale comes at a price. Massive FLOPs (floating point operations). Exploding inference costs. Sluggish latency. Soaring memory demands.

Even with clever tricks—quantization, pruning, better attention mechanisms—we’re still forcing every token through the same rigid pipeline. No flexibility. No finesse.

It’s like requiring every car to take the same route home, whether it’s next door or across the state.

MoR asks: what if the route changed depending on the passenger?


Mixture-of-Recursions: The Model That Thinks in Spirals, Not Staircases

Here’s the core idea behind Mixture-of-Recursions: let the model decide how deep to think—on a token-by-token basis.

Instead of marching every token through dozens of identical stacked layers, MoR introduces something clever: a small, shared set of recursive layers that can be looped through as needed.

Easy token? One pass and out. Tricky token? Loop through again. Still ambiguous? Take another lap.

This decision is handled by a lightweight router—a tiny network that acts like a mental triage nurse, directing each token to the right depth of processing.

Picture a spiral staircase. Some thoughts go down a few steps and stop. Others spiral deeper. Contrast that with the rigid floors of traditional transformers—everyone up, everyone down, no deviation.

MoR gives the model a choice. And choice is power.


Let’s Get Under the Hood (Just for a Minute)

MoR isn’t magic. It’s just smart engineering.

  • Recursive Layers: Rather than dozens of unique layers, MoR reuses a small core set. They’re looped through depending on how much effort each token needs. That saves both compute and memory.
  • Token-Level Router: After each recursive pass, the router decides: Does this token need to keep thinking? Or can it exit? It’s like a “stop or go” sign at every layer.
  • KV Sharing: The keys and values calculated during the first attention pass are cached and reused on later recursive passes. That means no redundant attention computation—just smart caching.
  • Dynamic Depth in Practice: Take the sentence:
    “Einstein’s theory of relativity revolutionized physics.”
    “Einstein”? Maybe one pass. “Relativity”? Loop three times. “Revolutionized”? Probably two. “Of”? Get outta here—one and done.
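Here’s a minimal sketch of that routing loop. The router and the shared block below are toy stand-ins (a simple sigmoid score and a halving function), not the trained networks a real MoR model would use:

```python
import math

def router_score(token_state):
    # Toy "confident enough to exit?" score. In a real MoR model this
    # is a small learned network, not a sigmoid over a sum.
    return 1.0 / (1.0 + math.exp(-sum(token_state)))

def recursive_block(token_state):
    # Stand-in for the shared transformer block that gets reused each pass.
    return [0.5 * x for x in token_state]

def mor_forward(tokens, max_recursions=3, exit_threshold=0.6):
    """Loop each token through the shared block until the router says stop.

    Returns how many passes each token took (a real model would also
    return the final hidden states).
    """
    depths = []
    for state in tokens:
        depth = 0
        while depth < max_recursions:
            state = recursive_block(state)
            depth += 1
            if router_score(state) > exit_threshold:
                break  # token exits early; no more compute spent on it
        depths.append(depth)
    return depths
```

With this toy router, an "easy" token exits after one pass while a "hard" one uses all three—exactly the per-token depth control described above.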

MoR doesn’t just save time. It saves thought.


So What Do We Get in Return?

First, let’s talk speed.

MoR is faster on inference because it avoids wasting cycles on easy tokens. That means leaner performance, faster responses, and smaller model sizes without sacrificing power.

Then there’s memory. By reusing the same few recursive layers, MoR drastically reduces the memory footprint of big models. This is a huge win, especially for deploying models on smaller devices.
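To see why weight sharing shrinks the footprint, compare parameter counts. The numbers below are purely illustrative (not from the paper): a model with 48 unique layers versus one that reuses 4 shared layers looped up to 12 times reaches the same maximum depth with a fraction of the layer weights.

```python
LAYER_PARAMS = 300_000_000        # assumed parameters per transformer layer

standard = 48 * LAYER_PARAMS      # 48 unique layers, each with its own weights
mor = 4 * LAYER_PARAMS            # 4 shared layers, looped up to 12 times

# Same maximum depth, 12x fewer layer weights to store.
ratio = standard // mor
```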

But here’s the kicker: performance actually improves.

MoR models show lower validation perplexity (meaning they’re better at guessing the next word), maintain competitive few-shot performance, and process more tokens per second than traditional designs.
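Perplexity, for reference, is just the exponential of the average negative log-likelihood the model assigns to the correct next tokens—lower means better predictions. A quick illustration with made-up probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the correct tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns higher probability to the observed tokens
# ends up with lower perplexity:
confident = perplexity([0.9, 0.8, 0.85])   # ~1.18
uncertain = perplexity([0.2, 0.3, 0.25])   # ~4.05
```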

In other words, they’re faster, cheaper, and smarter.

That’s not just a tradeoff. That’s a breakthrough.


What If AI Thought More Like You?

Here’s where it gets fun.

MoR doesn’t just mimic our thought process technically—it echoes it cognitively.

Humans don’t give every sentence equal weight. We gloss over small talk, but when someone asks something real—something vulnerable, complex, layered—we shift. Our brain clicks into deeper gear. We loop. We ruminate.

MoR does that too.

It knows when to go deeper. It knows when to move on.

Imagine an AI that doesn’t just reply quickly—but pauses when something meaningful shows up in your prompt. An assistant that knows when to linger and when to let go. One that matches your mental rhythm, not just your words.

That’s not just better design. That’s a better companion.


A Quick Look at the Competition

So how does MoR compare to the other architectures out there?

Here’s the snapshot:

| Feature | Standard Transformer | Recursive Transformer | Mixture-of-Recursions |
| --- | --- | --- | --- |
| Token-level control | ❌ | ⚠️ (fixed depth) | ✅ |
| Memory efficiency | ❌ | ✅ | ✅ |
| Computational cost | ❌ (high) | ⚠️ | ✅ (low) |
| Speed/latency | ❌ | ⚠️ | ✅ |
| Smart attention (KV reuse) | ❌ | ⚠️ | ✅ |

MoR isn’t just a tweak. It’s a rethink of what “depth” means in AI.


The Big Questions Still on the Table

Of course, no breakthrough comes without new challenges.

Training the router—the brain behind which token loops and which exits—is still a tricky business. Options include supervised learning, reinforcement learning, or hybrids. Each has pros and pitfalls.

MoR also has to prove itself at larger scales. Can it hold up in a 20B+ parameter model without breaking? Recursive gradients are harder to manage than linear stacks.

And then there are real-world tradeoffs. If your application is latency-critical (think: real-time translation), you might want fast exits. If accuracy is king (think: legal research), you’ll want deeper loops. MoR gives you control—but you have to know how to use it.
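In practice, that control might look like a pair of deployment profiles that cap recursion depth and tune the exit threshold. The names and numbers here are hypothetical, just to make the tradeoff concrete:

```python
def worst_case_block_calls(num_tokens, max_recursions):
    # Upper bound on shared-block invocations if every token loops fully.
    return num_tokens * max_recursions

FAST_EXIT = {"max_recursions": 2, "exit_threshold": 0.5}   # latency-critical
DEEP_LOOPS = {"max_recursions": 6, "exit_threshold": 0.9}  # accuracy-critical

# For a 1,000-token prompt, the fast profile caps compute at 2,000
# block calls versus 6,000 for the deep profile.
fast_cap = worst_case_block_calls(1000, FAST_EXIT["max_recursions"])
deep_cap = worst_case_block_calls(1000, DEEP_LOOPS["max_recursions"])
```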

Finally, there’s the subtle risk: biased routing. If the router overlearns patterns from biased data, it might under-think important topics or over-think irrelevant ones.

In other words, the loop is smart—but it’s still trained by us.


Where This Could Go Next

Mixture-of-Recursions is more than a model tweak—it’s a glimpse into AI’s next evolution.

It points toward a future of modular cognition: systems that adapt not by getting bigger, but by getting wiser. Like a brain with shifting gears.

Picture what happens when we combine MoR with other advances:

  • Multimodal AI: An image-language model that gives most visuals a glance—but loops deeply on subtle ones.
  • On-Device AI: Phones and edge devices with tiny models that punch above their weight thanks to smart recursion.
  • Truly Personalized Assistants: Over time, your AI could learn how you think—and sync its recursive patterns to your style of reasoning.

While the world races to build the next trillion-parameter model, MoR suggests something more elegant:

Don’t just scale up. Spiral in.


A More Reflective Machine

There’s something intimate about recursion. It’s not just repetition. It’s attention with memory. It’s thought that folds in on itself.

When someone really listens to you, they don’t just wait for their turn to talk. They reflect. They echo what you said and turn it into something deeper. They help you finish your meaning.

MoR moves us closer to that kind of interaction.

It’s a transformer that doesn’t just complete your sentence—it circles back, mid-thought, to help you find what you really meant to say.

Have you ever walked away from a conversation thinking, “I wish I’d gone deeper on that”?

What if your AI could feel that too?

What if it gently nudged you—Hey, that part? Let’s go one more layer.

That’s the architecture of empathy. And it starts with a spiral.


How to Think Deeper with Today’s Models

Even if your favorite AI doesn’t use MoR yet, you can still bring its spirit into your prompts. Here’s how:

  • Revisit the Input: Ask the model to re-read what it just wrote and refine it. Give it a second pass.
  • Scaffold the Task: Break up complexity. Use outlines, bullets, then prose. Think like a builder.
  • Force a Rethink: Ask for a summary. Then challenge it. “What’s missing? What’s a counterpoint?”
  • Use Multiple Mirrors: Run the same prompt through different models, or ask for different perspectives. Let the loops unfold across minds.
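The “revisit the input” pattern can be scripted against any chat model. The `ask` helper and the `model` callable below are stand-ins, not a real API:

```python
def ask(model, prompt):
    # Stand-in for a chat-completion call; `model` here is just a function.
    return model(prompt)

def two_pass(model, task):
    """Draft, critique, then revise—a manual second 'recursion'
    in the spirit of MoR."""
    draft = ask(model, task)
    critique = ask(model, f"What's missing from this answer? {draft}")
    return ask(model, f"Revise the answer.\nDraft: {draft}\nCritique: {critique}")
```

Swap the stub for a real client and the same three-call loop gives any model a deliberate second pass over its own output.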

These aren’t hacks. They’re scaffolds. They mirror what MoR does behind the scenes: reserving deeper attention for what matters most.

Because not every idea deserves the same depth.

Some thoughts… are just thicker.

And now, finally, so is the transformer.