The Humanoid Data Problem: Why We Don't Have Enough

Humanoids won't scale until we fix how we collect, curate, and use data

Feb 2025 · 8 min read

Every major humanoid company is facing the same bottleneck right now, and it is not hardware. It is not compute. It is data. Specifically, the lack of high-quality manipulation data at the scale needed to train robots that can actually do useful things in the real world.

This is the problem I keep coming back to. Having worked on both the manufacturing side at Tesla and the research side with robotic arms and sim-to-real transfer, I have seen how data constraints shape what robots can and cannot do. Here is why I think the humanoid data problem is the most important unsolved challenge in robotics right now.

Self-Driving Had It Easy (Comparatively)

To understand why humanoid data is so hard, it helps to compare it to the one robotics domain that has actually scaled: self-driving cars.

Autonomous vehicles benefit from a near-perfect data flywheel. Every Tesla, Waymo, and Cruise vehicle on the road is passively collecting data just by driving around. The sensor suite (cameras, lidar, radar) captures everything. The task space is constrained: you are navigating a 2D surface with well-defined rules (lanes, signals, speed limits). And critically, there are millions of vehicles already on the road generating this data every day.

Humanoid manipulation has none of these advantages: there is no deployed fleet passively collecting data, the task space is high-dimensional and contact-rich rather than a constrained 2D surface, and every demonstration has to be actively produced instead of harvested as a byproduct of normal use.

The Numbers Are Brutal

Let me put some rough numbers on this. Tesla's Autopilot has been trained on billions of miles of driving data. The largest manipulation datasets, things like RT-2's training data or the Open X-Embodiment dataset, have on the order of a million demonstrations across all tasks, embodiments, and environments combined.

That is not a gap. That is a chasm. And it is not clear that the approaches that worked for scaling driving data will work for manipulation data.

The most common approach to collecting manipulation data right now is teleoperation: a human operator controls the robot through a demonstration, and the robot records the joint positions, forces, and camera images. This works, but it is painfully slow and expensive. A skilled teleoperator might collect one demonstration every few minutes. To get the kind of dataset diversity you need for robust generalization, you would need millions of demonstrations across thousands of tasks, objects, and environments.
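At its core, a teleoperated demonstration like the one described above is a time series of synchronized sensor readings and operator commands. A minimal sketch of such a schema in Python (field names and shapes are illustrative assumptions, not any particular robot's logging format):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DemoStep:
    """One timestep of a teleoperated demonstration (illustrative schema)."""
    timestamp: float              # seconds since the demo started
    joint_positions: List[float]  # one entry per actuated joint
    joint_torques: List[float]    # measured force/torque per joint
    camera_frame: bytes           # encoded RGB image for this step
    action: List[float]           # operator command applied at this step

@dataclass
class Demonstration:
    task: str                                   # e.g. "pick_mug" (hypothetical ID)
    steps: List[DemoStep] = field(default_factory=list)

    def duration(self) -> float:
        """Wall-clock length of the demo in seconds."""
        if not self.steps:
            return 0.0
        return self.steps[-1].timestamp - self.steps[0].timestamp
```

Even at this level of detail, the storage cost is obvious: every step carries a full camera frame, which is part of why manipulation data is so much more expensive per unit than driving logs.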

Do the math. Even at one demo per minute, running 24/7, a single teleoperator produces about 500,000 demos per year. You would need hundreds of teleoperators running constantly to build a dataset comparable to what self-driving has. The cost would be in the hundreds of millions of dollars.
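The arithmetic above can be made explicit. This is a back-of-envelope calculation under deliberately generous assumptions (one demo per minute, nonstop operation, and a hypothetical $1.50 fully-loaded cost per demo), so the results are upper bounds on throughput and lower bounds on nothing:

```python
# Back-of-envelope teleoperation throughput. All inputs are
# illustrative assumptions, not measured figures.
MINUTES_PER_YEAR = 60 * 24 * 365           # 525,600

demos_per_operator_year = MINUTES_PER_YEAR * 1   # 1 demo/minute, 24/7
print(f"{demos_per_operator_year:,} demos/operator/year")   # 525,600

target_demos = 100_000_000                 # assumed target, roughly self-driving scale
operators_needed = target_demos / demos_per_operator_year
print(f"~{operators_needed:.0f} operators running nonstop")  # ~190

cost_per_demo = 1.50                       # assumed fully-loaded cost, USD
total_cost_millions = target_demos * cost_per_demo / 1e6
print(f"~${total_cost_millions:.0f}M total")                 # ~$150M
```

Relax any of the generous assumptions (real operators work shifts and average a demo every few minutes) and the operator count and cost climb by an order of magnitude, which is where the "hundreds of millions of dollars" estimate comes from.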

Simulation Is Necessary but Not Sufficient

The obvious response is: just simulate it. And simulation is absolutely part of the answer. You can generate millions of demonstrations in simulation in the time it takes to collect a few hundred in the real world. Projects like Isaac Gym, MuJoCo, and various custom simulators have shown that sim-trained policies can transfer to real robots.

But sim-to-real transfer for manipulation is much harder than for locomotion or even driving. The reason is contact physics. When a robot hand grasps an object, the contact forces, friction, deformation, and slip dynamics are incredibly difficult to simulate accurately. A policy that works perfectly in simulation often fails in the real world because the contact dynamics are subtly different.

This is the sim-to-real gap, and for manipulation it is wide. The approaches that partially close it (domain randomization, system identification, adaptive simulation parameters) all help, but none fully solves it. You still need real-world data to fine-tune and validate.
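Domain randomization, the first of those approaches, can be sketched in a few lines: every training episode draws fresh physical parameters, so the policy never overfits to one simulator configuration. The ranges below are illustrative assumptions, and the `policy_update` callback is a hypothetical stand-in for a real rollout and gradient step in MuJoCo, Isaac, or similar:

```python
import random

def sample_contact_params():
    """Draw a fresh set of contact-physics parameters for one episode.
    All ranges are illustrative, not tuned values."""
    return {
        "friction": random.uniform(0.4, 1.2),        # sliding friction coeff.
        "object_mass_kg": random.uniform(0.05, 0.8),
        "restitution": random.uniform(0.0, 0.2),     # contact bounciness
        "sensor_latency_s": random.uniform(0.0, 0.03),
    }

def train_with_randomization(policy_update, num_episodes=1000):
    """Outer loop: re-randomize physics before every episode.
    `policy_update` stands in for (rollout in sim configured with
    `params`) + (gradient step) -- omitted here."""
    for _ in range(num_episodes):
        params = sample_contact_params()
        policy_update(params)
```

The intuition: a policy that works across thousands of slightly wrong simulators is more likely to work in the one real world, which is itself just another point in the randomized distribution.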

The companies that are making the most progress, like Toyota Research Institute with their diffusion policy work, are using simulation to get 80% of the way there and then using targeted real-world data to close the gap. But that "last 20%" still requires significant real-world data collection infrastructure.

Who Is Bottlenecked and How They Are Responding

Every major humanoid company is dealing with this differently:

Tesla (Optimus) has the biggest advantage here because of their manufacturing infrastructure. They can deploy prototypes in their own factories, in controlled environments where they can collect data at scale. They also have the compute infrastructure to train massive models. But even Tesla is constrained by the number of physical robots they have and the diversity of tasks in a factory setting.

Figure has been aggressive about teleoperation data collection and recently partnered with BMW to deploy robots in manufacturing. Their approach seems to be getting robots into real environments as fast as possible and iterating on data collection in situ.

1X (formerly Halodi) took an interesting approach by starting with wheeled humanoids (EVE) that could be deployed in security and facility management. This gave them a way to collect real-world navigation and simple manipulation data before tackling the harder full humanoid problem with NEO.

Apptronik is partnering with large enterprises to co-develop task-specific data collection pipelines. The idea is that the customer provides the tasks and environments, and Apptronik provides the robots and data infrastructure.

What all of these companies have in common is that they are building data engines, not just robots. The robot is the platform. The data it collects is the real product. This is the same insight that drove Tesla's self-driving program, and it is being applied to humanoids now.

What a Better Data Pipeline Looks Like

I think the humanoid data problem gets solved through a combination of approaches, not any single silver bullet:

  1. Scalable teleoperation. Better teleoperation interfaces that let non-expert operators collect high-quality demonstrations. VR-based systems, haptic feedback, and AI-assisted teleoperation that handles the low-level control while the human provides high-level task guidance.
  2. Human video as training data. There are billions of hours of video of humans doing manipulation tasks on YouTube alone. Approaches like UniSim, and the web-scale visual pretraining behind RT-2, suggest that useful manipulation knowledge can be extracted from human video and imagery, even though the embodiment is different. This is a massive untapped data source.
  3. Simulation with better physics. Closing the sim-to-real gap requires better contact simulation. Projects using learned physics models, trained on real-world contact data, to improve simulation fidelity are promising.
  4. Fleet learning. Once you have even a small fleet of humanoids deployed, every robot becomes a data collector. Shared learning across the fleet, where one robot's experience improves all the others, is how you eventually get the data flywheel spinning.
  5. Foundation models for manipulation. Large pretrained models like RT-2 and Octo that can generalize across tasks and embodiments reduce the amount of task-specific data you need. Instead of training from scratch for every task, you fine-tune a foundation model with a small amount of targeted data.
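Step 4 above is, at its core, a data-aggregation loop: every deployed robot feeds one shared dataset, and improvements flow back to the whole fleet. A structural sketch (class and method names are hypothetical, and the actual training step is omitted):

```python
class FleetTrainer:
    """Illustrative fleet-learning hub: pool experience from every
    robot, retrain centrally, publish new weights fleet-wide."""

    def __init__(self):
        self.buffer = []          # pooled trajectories from all robots
        self.model_version = 0    # version currently deployed to the fleet

    def ingest(self, robot_id, trajectories):
        # Every deployed robot contributes to one shared dataset.
        self.buffer.extend((robot_id, t) for t in trajectories)

    def train_and_publish(self):
        # Fine-tune on the pooled buffer (training call omitted),
        # then bump the version that every robot pulls down.
        self.model_version += 1
        return self.model_version

trainer = FleetTrainer()
trainer.ingest("robot_01", ["demo_a", "demo_b"])
trainer.ingest("robot_02", ["demo_c"])
new_version = trainer.train_and_publish()
```

The key property is that one robot's experience improves every other robot: with N robots, each individual unit effectively learns N times faster than it could alone, which is the flywheel dynamic the next section describes.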

The Race Is On

The company that solves the humanoid data problem first will have an almost insurmountable advantage. Data compounds. More data means better models, which means robots that can do more tasks, which means more deployments, which means more data. It is a flywheel, and once it starts spinning, it is very hard for competitors to catch up.

This is why I am convinced that the most important role in humanoid robotics right now is not the hardware engineer or the ML researcher. It is the person designing the data pipeline. The robot that learns fastest wins, and learning speed is a function of data quality, data diversity, and data volume.

We are in the early innings of solving this. But the teams that treat data as their primary product, not an afterthought, are the ones that will build the humanoids that actually work.
