
Research · April 2026

Why robotics needs a new kind of neural network

Sparse circuit assembly, continual learning, and the path to compute-efficient embodied intelligence.

1. Data is the bottleneck, and the scale required is enormous

Robotics has a scaling story that looks, on the surface, a lot like the language model story from 2019 to 2023. Collect more data, train bigger models, watch capabilities emerge. Recent generalist robot models already make one point hard to ignore: scale works in robotics too. RT-1 argued for large task-agnostic training on real robot data. Open X-Embodiment assembled data from 22 robots across 21 institutions. Octo trained on 800K trajectories from that corpus. OpenVLA pushed the open VLA line with a 7B model trained on 970K real-world demonstrations. DROID showed that broader in-the-wild collection improves robustness and generalization.

But the scale that has been demonstrated so far is nowhere near enough. Robotics is an open-world problem. A deployed robot does not see i.i.d. data. It sees an open-ended stream of homes, factories, tools, object layouts, camera changes, sensor drift, user corrections, and entirely new tasks. Inputs are multimodal, noisy, asynchronous, partially observed, and grounded in a changing physical environment. The same skill looks different under camera pose, lighting, object wear, material compliance, latency, human behavior, floor friction, clutter, and hundreds of other variables.

The important shifts are often orthogonal to each other. Lighting has nothing to do with texture. Camera latency and motion blur are separate problems. Material compliance doesn't track with shape, human style doesn't predict room layout, and tool wear tells you nothing about object appearance. Covering those orthogonal dimensions of covariate shift is not optional. It is the whole point. And it requires data at a scale the field has not yet reached.

microagi is built to solve this. We are a data research lab for embodied AI, doing frontier research on what high-quality data looks like across modalities, and collecting over 100,000 hours per month of diverse real-world multimodal data. We will be deploying in the physical world, which will generate even more data. The ideal dataset, roughly speaking, is every job and every person on earth recorded for about 10 hours each, capturing the full diversity of human physical interaction with the world, and then robots deployed doing continual learning on top of that foundation. We believe something on the order of 100 billion hours of multimodal real-world interaction is what it takes.

That is the data problem. We are solving it. But there is a second problem sitting right behind it: even if you collect all that data, what architecture can actually absorb it?

1.1 The architecture problem: dense models forget

Dense models are powerful, but their default learning rule is brutal: new experience pushes gradients through the same shared weight table that already encodes old behavior. That's fine when the distribution is stable and training is mostly one giant offline phase. It's much less fine when a robot keeps encountering new rooms, new users, new objects, new tools, and action consequences nobody anticipated.

The standard mitigations are replay buffers (store old data and mix it in), elastic weight consolidation (penalize changes to important weights), or progressive networks (freeze old parameters and add new ones). They all work to some degree. None scale cleanly. Replay means storing and retraining on old data indefinitely. EWC means computing and storing a Fisher information matrix over all parameters, which gets intractable fast. Progressive networks freeze old capacity that never benefits from new experience. The root cause is always the same: dense weight matrices store everything in the same place. Learning and forgetting are mechanically coupled.
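
The coupling is easy to see in miniature. Below is a minimal numpy sketch of the EWC idea (illustrative only; the Fisher values and λ are made up): the quadratic penalty makes important old weights expensive to move, but every weight still lives in the same shared table, so the protection is a tax on learning rather than a separation of storage.

```python
import numpy as np

def ewc_penalty(params, old_params, fisher, lam=1.0):
    # Quadratic cost for moving weights that were important (high Fisher)
    # for previously learned tasks. All weights share one table; EWC only
    # taxes changes, it does not give new tasks separate storage.
    return lam / 2.0 * float(np.sum(fisher * (params - old_params) ** 2))

old = np.zeros(4)
fisher = np.array([10.0, 10.0, 0.1, 0.1])   # made-up importance estimates
move_important = np.array([1.0, 0.0, 0.0, 0.0])
move_unimportant = np.array([0.0, 0.0, 0.0, 1.0])
print(ewc_penalty(move_important, old, fisher))    # 5.0
print(ewc_penalty(move_unimportant, old, fisher))  # 0.05
```

Either move costs something; the only question is how much. That is the mechanical coupling of learning and forgetting in one weight table.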

1.2 Why mixture of experts is not enough

Sparse mixture-of-expert models look like a step in the right direction. They route tokens to different expert subnetworks, so not everything is active for every input. MoE scales impressively, from Jacobs et al.'s original adaptive mixtures through Switch Transformers to modern soft routing.

But MoE sparsifies dispatch, not learning itself. The router picks a few experts, sure, but each expert is still a dense chunk of parameters that was created at initialization and never changes structure. New domains can crowd onto the same experts as old ones. So you end up sparse at the routing level and still forgetting at the weight-update level.

Put differently: MoE separates which computation to apply, but doesn't change what computation is available. The menu is fixed at init. And long context windows don't solve this either. Replaying raw history isn't memory. It's just expensive retrieval.
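
To make the dispatch-versus-capacity distinction concrete, here is a schematic top-K MoE forward pass (illustrative numpy; the router and expert matrices are hypothetical stand-ins). Only K experts run per input, but the set of expert matrices is fixed at init, and every gradient into an expert rewrites weights shared by every domain routed to it.

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    # Sparse dispatch: score experts, keep the top-k, softmax the gates.
    logits = router_w @ x
    top = np.argsort(logits)[-k:]
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()
    # Each expert is a dense matrix created at init; its structure never
    # changes, so new domains overwrite whatever it already encodes.
    return sum(g * (experts[int(e)] @ x) for g, e in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router_w = rng.standard_normal((n_experts, d))
y = moe_forward(rng.standard_normal(d), router_w, experts, k=2)
assert y.shape == (d,)
```

The menu of computations is the `experts` list, and nothing in training ever adds to it or restructures it.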

2. Five requirements for a robotics-scalable architecture

  • Bounded active compute. The amount of work per forward pass should stay roughly fixed even as the model learns more. You cannot scale a deployed robot system if every new domain makes inference more expensive.
  • Growing total capacity. The model must have a way to add new structure for genuinely new situations. A fixed parameter budget is a hard ceiling on what the system can represent.
  • Continual learning by allocation, not overwrite. Old skill circuits should survive because they are not constantly rewritten, not because an anti-forgetting penalty is applied on top of a fundamentally destructive update rule.
  • Multimodal grounding. Vision, language, proprioception, action, and world-state cues should land in a shared system that can route and remember across modalities.
  • Memory across timescales. Immediate state, short-term writable traces, and slow consolidated structure should be different things, not one giant context buffer.

This is not a claim about one layer or one training trick. It is a claim about the whole regime that embodied AI needs.

3. Coactivation-assembled weight layers

Our research team works across sparse routing, adaptive capacity allocation, multi-timescale memory, and multimodal grounding. CAWL is the first published result from that program. The idea: instead of storing a dense weight matrix W that gets applied identically regardless of input, store building blocks. A pool of latent neurons with learned embeddings, two projection matrices, shared pair modules, and a sparse persistent table. At inference time, the layer assembles a sparse circuit from those pieces, conditioned on whatever context it's currently seeing.

The standard MoE picture can be written loosely as y = Σ_{e ∈ TopK} g_e(c) · E_e(x), summed over a top-K set of experts. That gives sparse dispatch, but the underlying expert blocks are still coarse and dense. CAWL does not dispatch tokens to prewritten experts. It selects nodes inside one shared latent graph and writes edges among them. The core equation:

W_eff(c) = U · A(c) · V^T

Here U and V are learned projections, M is the latent width, and A(c) is a context-conditioned sparse latent adjacency matrix assembled at inference time. The input does not just select coefficients in a fixed dense matrix. It selects a small active set of latent neurons, forms edges among them, and applies the projected sparse operator.

3.1 Familiarity-gated routing

Each latent neuron has a learned embedding e_i. The context is projected into the same space as a query vector q(c). Cosine similarity between the query and each neuron embedding produces a score s_i(c). The maximum similarity across all neurons defines the raw familiarity signal:

F(c) = max_i s_i(c)

A sigmoid with learnable threshold and temperature produces the familiarity gate:

γ(c) = σ( (F(c) − τ_F) / T_F )

When γ(c) is close to 1, the input looks familiar and the layer leans heavily on its persistent table of consolidated connections. When it's close to 0, the input is novel, so the layer activates underused neurons and relies on its dynamic pair modules to construct a fresh circuit. For robotics this matters a lot: familiar situations get cheap, while genuinely novel situations get a dedicated construction path.

The routing score for each neuron includes a novelty bias:

s'_i(c) = s_i(c) + λ_nov · (1 − γ(c)) · (1 − ρ_i)

Here ρ_i is the running activation frequency of neuron i. Novel inputs get a bonus for selecting neurons that have not been used much. Familiar inputs gravitate toward the neurons they have always used. This turns routing into a memory decision instead of a generic discriminative score.
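
Putting 3.1 together, here is a minimal numpy sketch of the familiarity gate and novelty-biased top-K selection (the threshold, temperature, and λ_nov values are arbitrary illustrations, not tuned settings):

```python
import numpy as np

def route(E, q, rho, tau_f=0.5, temp=0.1, lam_nov=0.3, k=4):
    # Cosine scores s_i between the query and each neuron embedding.
    s = (E @ q) / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-8)
    F = s.max()                                        # raw familiarity F(c)
    gamma = 1.0 / (1.0 + np.exp(-(F - tau_f) / temp))  # gate γ(c)
    # Novelty bias: unfamiliar inputs favor under-used neurons (low ρ_i).
    s_biased = s + lam_nov * (1.0 - gamma) * (1.0 - rho)
    active = np.argsort(s_biased)[-k:]                 # top-K active set
    return gamma, active

rng = np.random.default_rng(1)
E = rng.standard_normal((32, 16))   # latent neuron embeddings e_i
rho = rng.random(32)                # running activation frequencies ρ_i
gamma_fam, _ = route(E, E[3], rho)  # query aligned with a stored embedding
gamma_nov, _ = route(E, rng.standard_normal(16), rho)
assert gamma_fam > gamma_nov        # familiar inputs gate higher
```

A query aligned with a stored embedding drives F(c) toward 1 and opens the gate; a random query scores lower, closing the gate and shifting selection toward rarely used neurons.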

3.2 Sparse edge synthesis and persistent consolidation

Once the active set S(c) = TopK(s'(c), K) is selected, the layer computes connections between every ordered pair. For each pair (i, j), the shared pair feature concatenates their embeddings and routing scores:

ξ_ij(c) = [ e_i ; e_j ; s'_i(c) ; s'_j(c) ]

A gate network produces a Bernoulli parameter p_ij(c) = σ( f_θ(ξ_ij(c)) ) that decides whether the edge exists, and a value network produces the dynamic edge weight α_ij(c) = g_ψ(ξ_ij(c)). The binary gate uses a straight-through estimator: hard threshold forward, continuous gradient backward.

For familiar inputs, the persistent table contributes stored edge weights in proportion to the familiarity gate:

α̃_ij(c) = α_ij(c) + γ(c) · C_ij

The persistent table C consolidates frequently occurring edges over time. Edges that fire reliably get strengthened via a familiarity-modulated learning rate. Edges that stop being useful get pruned after consecutive low-utility updates. The table has a fixed budget; when it fills, the least useful entries are evicted.

The layer output never materializes the full effective matrix. It gathers the active columns of V, applies the sparse adjacency in latent coordinates, and scatters through the active columns of U:

y = U · A(c) · V^T x = Σ_i Σ_j z_ij(c) · α̃_ij(c) · u_i ( v_j^T x )

where z_ij(c) is the binary edge gate, u_i and v_j are columns of U and V respectively, and the sums run only over the active set S(c).
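
The gather/scatter path described above can be sketched directly (illustrative numpy; the dict-based persistent table and Python loops are for clarity, not speed, and `edge_w` stands for the already-gated dynamic weights z_ij(c) · α_ij(c)):

```python
import numpy as np

def cawl_forward(x, U, V, active, edge_w, gamma, C):
    # Gather: v_j^T x for active columns j only.
    v_proj = V[:, active].T @ x
    y = np.zeros(U.shape[0])
    # Apply the sparse latent adjacency and scatter through active u_i.
    for a, i in enumerate(active):
        for b, j in enumerate(active):
            w = edge_w[a, b] + gamma * C.get((i, j), 0.0)  # α̃_ij(c)
            y += w * U[:, i] * v_proj[b]
    return y  # equals U @ A(c) @ V.T @ x without materializing A(c)

rng = np.random.default_rng(2)
d_out, d_in, M, K = 6, 5, 10, 3
U, V = rng.standard_normal((d_out, M)), rng.standard_normal((d_in, M))
x = rng.standard_normal(d_in)
active = [1, 4, 7]                    # top-K active set S(c)
edge_w = rng.standard_normal((K, K))  # gated dynamic edge weights
C, gamma = {(1, 4): 0.5}, 0.8         # persistent table entry, familiarity

# Sanity check against the dense equivalent U A(c) V^T x.
A = np.zeros((M, M))
for a, i in enumerate(active):
    for b, j in enumerate(active):
        A[i, j] = edge_w[a, b] + gamma * C.get((i, j), 0.0)
assert np.allclose(cawl_forward(x, U, V, active, edge_w, gamma, C),
                   U @ A @ V.T @ x)
```

Cost scales with the active set (K² pair terms), not with the full latent width M, which is why the pair modules dominate compute.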

3.3 The combinatorial space

With K active neurons selected from a pool of M, and each ordered pair independently connected or not (including self-loops), the number of possible binary circuit topologies is:

N_K = C(M, K) × 2^(K²)

For M = 1024 and K = 64, this is approximately 5 × 10^1335. That is the number of binary support patterns before edge values and before projection. Not all of them produce functionally distinct effective matrices. But the point is that a compact latent basis can range over an enormous family of sparse circuits while storing significantly fewer parameters than a dense matrix of the same input/output dimension.
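
The count is easy to verify with log-gamma arithmetic (standard-library Python; `log10_topologies` is just a name for this check):

```python
import math

def log10_topologies(M, K):
    # log10 of C(M, K) * 2**(K*K): choose the K active neurons, then
    # decide each of the K*K ordered pairs (self-loops included) on/off.
    log10_choose = (math.lgamma(M + 1) - math.lgamma(K + 1)
                    - math.lgamma(M - K + 1)) / math.log(10)
    return log10_choose + K * K * math.log10(2)

# For M = 1024, K = 64 this lands around 1335.7,
# i.e. on the order of 5 x 10^1335.
print(log10_topologies(1024, 64))
```

Almost all of that exponent comes from the 2^(K²) edge choices; the support-set term C(M, K) contributes only about 10^102 of it.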

3.4 Compute characteristics

Let's be direct about this: right now, a CAWL layer costs more per forward pass than a dense layer of the same dimension. The pairwise edge computation over the active set dominates, and it needs fused kernels that don't exist yet to be competitive at scale. The familiarity routing path itself is cheap. The pair modules are where all the compute goes.

Storage is a different story. A CAWL layer stores considerably fewer parameters than the equivalent dense matrix, because it's never actually a single dense matrix. It's a set of building blocks that recombine differently for each context.

The path to making this practical is recall-first execution. Familiar inputs should mostly just read consolidated support from the persistent table and skip edge synthesis entirely. Novel inputs pay the construction cost, but they're the reason the model needs plasticity in the first place. If this works, familiar inference ends up cheaper than a dense baseline. That's the engineering target.

4. Structural growth and continual learning

Fixed-basis CAWL is only step one. It can prove that familiarity-gated routing is stable and that sparse circuit assembly actually works. But it's not the full continual-learning answer. With fixed M, novelty bias can delay overlap between domains, but it can't eliminate it.

The real move is letting the latent basis grow. Once you do that, learning stops meaning rewrite the same table and starts meaning add new sparse structure.

M_{t+1} = M_t + m_t,   U_{t+1} = [ U_t | U_new ],   V_{t+1} = [ V_t | V_new ]

The allocation rule is what prevents capacity explosion. Low familiarity alone is not enough. Growth should happen only when novelty is paired with genuine residual error:

n(c) = α ( 1 − γ(c) ) + β · ε(c) + χ · r(c)

Here ε(c) is task error or surprise, and r(c) measures how much of the desired update sits outside the currently recalled span. If n(c) crosses a threshold, the model allocates new latent directions. If not, it reuses or recombines what already exists.

Growth is structural. New rows are added to the latent embedding matrix E, new columns are appended to U and V, and the persistent table C receives new sparse support. K stays fixed. Per-example activation stays sparse even as total capacity grows. The total parameter count will be high, because it has to be, but activation remains thin.
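
A minimal sketch of the allocation decision and column append (illustrative numpy; the coefficients, threshold, growth step m_new, and initialization scale are arbitrary placeholders for learned or tuned quantities):

```python
import numpy as np

def maybe_grow(U, V, E, gamma, err, resid,
               alpha=1.0, beta=1.0, chi=1.0, threshold=1.5, m_new=4):
    # Allocation signal n(c): novelty must coincide with residual error.
    n_c = alpha * (1.0 - gamma) + beta * err + chi * resid
    if n_c < threshold:
        return U, V, E                    # reuse existing capacity
    # Structural growth: new columns in U and V, new embedding rows in E.
    rng = np.random.default_rng(0)
    U = np.hstack([U, 0.01 * rng.standard_normal((U.shape[0], m_new))])
    V = np.hstack([V, 0.01 * rng.standard_normal((V.shape[0], m_new))])
    E = np.vstack([E, 0.01 * rng.standard_normal((m_new, E.shape[1]))])
    return U, V, E

U, V, E = np.zeros((6, 8)), np.zeros((5, 8)), np.zeros((8, 16))
# Familiar, well-fit input: no growth.
U2, _, _ = maybe_grow(U, V, E, gamma=0.95, err=0.1, resid=0.1)
assert U2.shape == (6, 8)
# Novel input with real residual error: capacity grows; K stays fixed.
U3, V3, E3 = maybe_grow(U, V, E, gamma=0.05, err=0.8, resid=0.5)
assert U3.shape == (6, 12) and E3.shape == (12, 16)
```

The point of the sketch is the shape arithmetic: total capacity M grows, but because K is unchanged, per-example activation stays exactly as sparse as before.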

Old circuits survive not because the model is regularized into barely changing, but because new domains get genuinely new structure. Once a pathway is reused often enough, it consolidates. Once consolidated, its associated columns can be protected with lower learning rates or projected gradients. Dormant old circuits become much harder to corrupt.

The decisive experiment we are running: train on domain A, then B, then C, then evaluate A again, without replay buffers and without EWC-style penalties. If A survives because its circuits went dormant rather than because we froze the world around it, that is the result that matters.

5. Memory beyond the context window

There is a second bottleneck that matters just as much as weight overwrite: short-term memory. Robotics systems often carry giant buffers, large histories, or long token windows because they do not have a better way to keep track of what just happened. That is expensive and fragile.

The target is not infinite exact memory with zero storage. That is impossible. The target is very large effective memory with bounded active compute. A better design separates three timescales:

  • Active state h_t for what is currently in play.
  • Fast writable memory M_t for short-lived traces that can be updated online.
  • Slow sparse memory S for consolidated pathways and reusable long-term structure.

r_t = Read( h_t, M_t, S )
h_{t+1} = f( h_t, x_t, r_t )
M_{t+1} = λ · M_t + Write( h_t, x_t, r_t )

If a trace repeats or proves useful, it gets consolidated into S. The robot needs a compact state, a writable scratchpad, and a persistent circuit bank. It does not need every prior token to remain live forever. Eventually this could absorb part of what current architectures carry as KV cache, but that is not the immediate claim. The immediate milestone is continual learning that allocates new sparse pathways instead of repeatedly negotiating over the same dense table.
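
One step of that three-timescale recurrence can be sketched as follows (illustrative numpy; Read, Write, and f here are stand-in linear/tanh forms, where the real operators would be learned):

```python
import numpy as np

def memory_step(h, x, M, S, lam=0.9):
    r = np.tanh(M @ h + S @ h)              # r_t = Read(h_t, M_t, S)
    h_next = np.tanh(h + x + r)             # h_{t+1} = f(h_t, x_t, r_t)
    M_next = lam * M + np.outer(h_next, x)  # decayed write into fast memory
    return h_next, M_next                   # S changes only on consolidation

d = 8
rng = np.random.default_rng(3)
h, x = rng.standard_normal(d), rng.standard_normal(d)
M, S = rng.standard_normal((d, d)), rng.standard_normal((d, d))
h2, M2 = memory_step(h, x, M, S)
assert h2.shape == (d,) and M2.shape == (d, d)
# With no input, the fast trace only decays; slow structure S is untouched.
_, M3 = memory_step(h, np.zeros(d), M, S)
assert np.allclose(M3, 0.9 * M)
```

The state, the scratchpad, and the slow store all stay bounded per step; nothing in the loop requires keeping every prior token live.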

6. Data research and architecture research are the same problem

microagi exists to solve the whole end-to-end data problem for embodied AI. Our operations and technology revolve around improving data quality and diversity across modalities. We do research to test our data quality rigorously, and we serve our customers only what we believe is the best data paired with the best forward-deployed engineering.

But data research without architecture research is incomplete. Scale data and the model can't absorb it without forgetting? You've built a pipeline with no destination. Build an architecture without understanding what data it needs, at what quality, across which modalities? You've built a destination with no pipeline. These two problems don't separate.

That's why we do both. As we scale collection and deploy in the physical world, the data volume will keep growing. Robots in the real world will generate even more data than human operators, and it'll be richer, more diverse, and more sequential than anything collected offline. The architecture has to be ready for that. Continual learning at this scale isn't a nice-to-have. It's a prerequisite.

That is why we have a research team working on these problems. CAWL is one output of that effort. There will be more. The bottleneck we are working on is one the entire field faces, and we intend to open source everything we produce.

7. Experimental priorities

This is foundational model research for embodied AGI, but the bar has to be concrete. We're not trying to win a beauty contest for sparse math. We're trying to build something that can keep learning from real-world data without blowing up compute or quietly forgetting everything.

If routing collapses, none of the rest matters.

  • Stability. The familiarity gate γ(c) should not collapse or flatline during training. The router should not degenerate to always selecting the same neurons. The persistent table should remain bounded without manual intervention.
  • Signal quality. Low γ should correlate with rare, shifted, or poorly fitted inputs. High γ should correlate with useful recall from the persistent table. If the gate collapses to a constant, the familiarity mechanism is dead.
  • Continual learning. Sequential domains should preserve prior performance without replay buffers becoming the whole story. The A-then-B-then-C-then-test-A experiment is the core proof.
  • Systems viability. Familiar inputs should become cheaper than novel ones once consolidated. If the cost remains dominated by construction even after consolidation, the design needs another iteration.

Papers are in preparation. We will share results and code as they mature.

8. The longer view

Robotics won't be solved by a single trick. Better data, better sim, better world models, better action representations, better hardware. But the memory problem sits underneath all of it. If every new skill still means overwriting a shared dense table, deployment stays fragile no matter how impressive pretraining looks.

Our thesis is that continual learning for embodied systems is fundamentally a sparse capacity allocation problem. Familiar inputs recall consolidated circuits. Novel inputs build new ones. Stuff that stops being useful decays. Capacity grows only when the current span can't explain the input. The reason this doesn't explode compute is that activation is sparse. Total parameter count grows with experience, but the active slice at any given moment stays thin.

microagi is solving the end-to-end data problem for embodied AI. We do frontier data research, collect at massive scale, and will be deploying robots in the physical world to collect even more. In parallel, we are developing compute-efficient architectures that can absorb that data through sparse activation, structured memory, and continual learning by capacity allocation rather than destructive overwrite. The goal is not a prettier benchmark model. The goal is a scalable foundation for real-world robotics and embodied AGI.

All of our research will be open sourced. The bottleneck we're working on affects the whole field, and the solutions should be shared.

We think sparse circuit assembly is the right direction, and we have a team that can generate ideas like this and ship them. CAWL is the first. More soon.

Selected references

  • Brohan, A., et al. (2022). RT-1: Robotics Transformer for Real-World Control at Scale.
  • Fedus, W., Zoph, B., and Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.
  • Ghosh, D., et al. (2024). Octo: An Open-Source Generalist Robot Policy.
  • Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive Mixtures of Local Experts.
  • Khazatsky, A., et al. (2024). DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset.
  • Kim, M. J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model.
  • Kontogianni, T., et al. (2024). Is Continual Learning Ready for Real-world Challenges?
  • Lesort, T., et al. (2020). Continual Learning for Robotics.
  • McClelland, J. L., McNaughton, B. L., and O'Reilly, R. C. (1995). Why There Are Complementary Learning Systems in the Hippocampus and Neocortex.
  • Open X-Embodiment Collaboration, et al. (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models.
  • Powers, S., Gupta, A., and Paxton, C. (2023). Evaluating Continual Learning on a Home Robot.
  • Puigcerver, J., et al. (2023). From Sparse to Soft Mixtures of Experts.
  • Rusu, A. A., et al. (2016). Progressive Neural Networks.
  • Vander Mijnsbrugge, D., Ongenae, F., and Van Hoecke, S. (2023). Context-Aware Deep Learning with Dynamically Assembled Weight Matrices.
  • Yoon, J., et al. (2017). Lifelong Learning with Dynamically Expandable Networks.