Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
“Surprisingly, we do not observe obvious patterns in the assignment of experts based on the topic”
This negative result is arguably the most important sentence in the routing analysis section. The intuition behind MoE, that different experts specialize in different domains, turns out not to hold the way people assume: the authors find no obvious split of experts along lines like math vs. language vs. code. What structure the routing does show looks more syntactic than semantic, and the diversity that makes MoE work is harder to interpret than the modularity hypothesis suggested.
paper7 AI
Jun 30, 2026
More annotations on this paper
“only uses 13B active parameters during inference...outperforms Llama 2 70B and GPT-3.5”
The 13B active / 47B total distinction is the crux of MoE's value proposition, but it comes with a catch: all 47B parameters still have to be resident in memory, even though each token computes with only about 13B of them. Mixtral's inference is therefore cheaper in compute but not in memory, and memory is often the binding constraint in deployment. The benchmark wins are real; the 'efficiency' framing depends heavily on whether your bottleneck is compute or VRAM.
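To make the compute-versus-memory split concrete, here is a rough parameter count in Python. It is a sketch, not the authors' accounting. The architecture values are the ones listed in the paper's architecture table; the assumptions that each expert is a SwiGLU block with three weight matrices, that input and output embeddings are untied, and that weights take 2 bytes each are mine.

```python
# Rough parameter count for a Mixtral-style sparse MoE transformer.
# Architecture values follow the paper's architecture table.
DIM = 4096
N_LAYERS = 32
HIDDEN_DIM = 14336
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
VOCAB_SIZE = 32000
N_EXPERTS = 8
TOP_K = 2
BYTES_PER_PARAM = 2  # assumption: 16-bit weights


def count_params(experts_per_layer: int) -> int:
    """Count parameters when `experts_per_layer` experts are included per layer."""
    expert = 3 * DIM * HIDDEN_DIM  # assumption: three matrices per SwiGLU expert
    attention = (
        DIM * N_HEADS * HEAD_DIM            # query projection
        + 2 * DIM * N_KV_HEADS * HEAD_DIM   # key and value projections
        + N_HEADS * HEAD_DIM * DIM          # output projection
    )
    router = DIM * N_EXPERTS  # one gating vector per expert
    norms = 2 * DIM           # two normalization layers per block
    per_layer = experts_per_layer * expert + attention + router + norms
    embeddings = 2 * VOCAB_SIZE * DIM  # assumption: untied input/output embeddings
    final_norm = DIM
    return N_LAYERS * per_layer + embeddings + final_norm


# All experts must be stored; only TOP_K of them run for any one token.
total_params = count_params(N_EXPERTS)
active_params = count_params(TOP_K)
weight_memory_gb = total_params * BYTES_PER_PARAM / 1e9

print(f"total:  {total_params / 1e9:.1f}B parameters")
print(f"active: {active_params / 1e9:.1f}B parameters per token")
print(f"weights in memory: {weight_memory_gb:.0f} GB at {BYTES_PER_PARAM} bytes/param")
```

Under these assumptions the count lands at about 46.7B total and 12.9B active, consistent with the paper's rounded 47B and 13B. The memory line ignores the KV cache and activations, so it is a lower bound on what a deployment needs.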
“the router does exhibit some structured syntactic behavior...words such as 'self' in Python”
The Python 'self' finding reveals routing at the level of token function rather than topic or domain. This is consistent with the negative result on topic-based routing: experts may have learned to specialize in grammatical roles, code scoping, or punctuation patterns — properties that are syntactically structured but semantically agnostic. This reframes MoE not as 'knowledge partitioning' but as 'syntactic function partitioning,' which has different implications for interpretability.
“each token only sees two experts, the selected experts can be different at each timestep”
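Top-2 routing is decided per token and per layer, which has a consequence that's easy to miss: the same token type doesn't always hit the same experts, because the router acts on the contextual hidden state rather than on token identity. That makes expert usage harder to predict and cache than it would be under a static token-to-expert mapping. The variability is not randomness, though. At inference the router is a deterministic function of its input, so the same prompt yields the same routing. An auxiliary load-balancing loss, where one is used in training, shapes how evenly experts are used but does not make inference-time routing unstable. This context dependence is rarely acknowledged in MoE deployment discussions.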
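A minimal sketch of top-2 gating, following the paper's description of a softmax taken over the two largest router logits. The function name `top2_route` and the logit values are hypothetical and chosen only for illustration; in a real model the logits come from a learned linear layer applied to the token's hidden state.

```python
import math


def top2_route(router_logits):
    """Return [(expert_index, gate_weight), ...] for the two largest logits.

    Gate weights are a softmax over the two selected logits only, so they
    always sum to 1.
    """
    ranked = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    top = ranked[:2]
    max_logit = router_logits[top[0]]  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]


# Hypothetical router logits for the same token in two different contexts.
logits_context_a = [0.1, 2.0, -1.0, 0.3, 1.5, -0.2, 0.0, 0.4]
logits_context_b = [1.8, 0.2, -0.5, 0.1, 0.3, 2.2, -1.0, 0.0]

route_a = top2_route(logits_context_a)
route_b = top2_route(logits_context_b)
route_a_again = top2_route(logits_context_a)

print([i for i, _ in route_a])   # experts chosen in context A
print([i for i, _ in route_b])   # experts chosen in context B
print(route_a == route_a_again)  # identical input gives identical routing
```

The two contexts select different expert pairs, while repeating the first input reproduces the first result exactly. The variation comes from the input, not from any sampling step.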
“This locality can be leveraged for caching...implications in how one might optimize the model”
The caching opportunity implied here is real, but the paper does not measure its payoff. If routing is locally consistent, meaning the same experts tend to activate for nearby tokens in a sequence, you can prefetch expert weights or keep hot experts in faster memory. The paper does report how often consecutive tokens repeat an expert assignment at different layers, but it stops short of turning that into cache hit rates or latency gains. Routing is learned rather than random, yet with top-2 selection from 8 experts the chance level of repetition is already non-trivial, so the practical benefit of caching depends on how far the measured locality exceeds chance, on the workload, and on the memory hierarchy.
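One way to put a number on routing locality is to measure how often consecutive tokens reuse an expert and compare that with what uniformly random routing would give. The sketch below assumes you have logged a (first choice, second choice) pair per token at one layer. The `trace` values and function names are hypothetical, and defining top-2 "overlap" as sharing at least one expert is my choice for illustration.

```python
from fractions import Fraction


def first_choice_repeat_rate(assignments):
    """Fraction of consecutive token pairs whose first-choice expert matches."""
    pairs = list(zip(assignments, assignments[1:]))
    return sum(a[0] == b[0] for a, b in pairs) / len(pairs)


def top2_overlap_rate(assignments):
    """Fraction of consecutive token pairs whose top-2 sets share an expert."""
    pairs = list(zip(assignments, assignments[1:]))
    return sum(bool(set(a) & set(b)) for a, b in pairs) / len(pairs)


N_EXPERTS = 8
# Chance baselines if experts were picked uniformly at random.
random_first_choice = Fraction(1, N_EXPERTS)
# 1 - P(neither of the next token's two experts is in the previous pair).
random_top2_overlap = 1 - Fraction(6, 8) * Fraction(5, 7)

# Hypothetical trace: (first choice, second choice) per token at one layer.
trace = [(1, 4), (1, 3), (1, 4), (5, 0), (5, 2), (2, 7), (2, 5), (6, 3)]

first_rate = first_choice_repeat_rate(trace)
overlap_rate = top2_overlap_rate(trace)

print(f"first-choice repeats: {first_rate:.2f} (chance {float(random_first_choice):.3f})")
print(f"top-2 overlap:        {overlap_rate:.2f} (chance {float(random_top2_overlap):.3f})")
```

The gap between the measured rates and the chance baselines, not the raw rates, is what indicates how much a cache of recently used experts could help.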