Why touch, not just vision, will decide which humanoids actually work in the real world
Close your eyes and pick up a glass of water. You can do it effortlessly. You know exactly how hard to squeeze, you can feel the weight shift as water sloshes, you can tell if the glass is wet and adjust your grip before it slips. Now imagine doing that with thick oven mitts on both hands and your eyes closed. That is roughly how most humanoid robots interact with the world today.
The robotics industry has poured billions into better vision, better language understanding, and better locomotion. But the sense that matters most for manipulation, touch, remains almost completely neglected. I think this is the single biggest gap holding humanoids back, and I think the teams that solve it will leapfrog everyone else.
Here is a fact that surprises most people: the majority of humanoid robot hands have zero tactile sensing. Some have basic force/torque sensors at the wrist. A few have pressure-sensitive fingertips. But nothing close to the density and richness of human touch.
Human fingertips have roughly 2,500 mechanoreceptors per square centimeter. These receptors detect pressure, vibration, texture, and slip at incredibly high resolution and bandwidth (temperature is handled by separate thermoreceptors in the same skin). When you pick up an object, your brain is processing a continuous stream of tactile data that tells you everything about the contact: where it is, how strong it is, whether the object is slipping, what the surface feels like, and how the object is deforming under your grip.
Without this information, robots are doing manipulation with one hand tied behind their back. They rely almost entirely on vision to plan grasps, and then execute those grasps open-loop, with no feedback about what is actually happening at the contact surface. This is why robots drop things, crush delicate objects, and struggle with tasks that any human child can do.
The current approach in robotics is to throw more vision at the problem. Better cameras, more viewpoints, higher resolution, depth sensors. And vision is essential. But it has fundamental limitations for manipulation: the hand occludes the object at exactly the moment of contact, a camera cannot measure contact forces, and slip often begins before any visible motion of the object.
The analogy I keep coming back to is this: trying to do manipulation with only vision is like trying to walk using only your eyes, with no proprioception or sense of ground contact. You could maybe shuffle around on flat ground, but you would fall on any uneven surface. That is where robotic manipulation is today.
The good news is that tactile sensing hardware is improving rapidly. There are several promising approaches:
Vision-based tactile sensors like GelSight and DIGIT use a camera behind a soft, deformable membrane to capture high-resolution contact geometry. When the sensor touches something, the membrane deforms, and the camera captures the deformation pattern. These sensors give you incredible spatial resolution and can detect surface textures, contact shapes, and even small forces. The downside is they are bulky and hard to integrate into a dexterous hand.
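To make the idea concrete, here is a toy sketch of how a vision-based tactile frame might be reduced to contact features. The 5x5 "deformation map" and the threshold are illustrative assumptions, not the actual GelSight or DIGIT pipeline, which works on full camera images of the membrane.

```python
def contact_features(deformation, threshold=0.2):
    """Return (area, centroid) of the contact patch in a 2D deformation map.

    deformation: list of rows of floats in [0, 1], where larger values
    mean the membrane is pressed in further at that pixel.
    """
    pixels = [(r, c)
              for r, row in enumerate(deformation)
              for c, v in enumerate(row)
              if v > threshold]
    if not pixels:
        return 0, None
    area = len(pixels)
    cy = sum(r for r, _ in pixels) / area
    cx = sum(c for _, c in pixels) / area
    return area, (cy, cx)

# A toy frame: a small press near the center of a 5x5 sensor.
frame = [[0.0, 0.0, 0.0, 0.0, 0.0],
         [0.0, 0.3, 0.5, 0.3, 0.0],
         [0.0, 0.5, 0.9, 0.5, 0.0],
         [0.0, 0.3, 0.5, 0.3, 0.0],
         [0.0, 0.0, 0.0, 0.0, 0.0]]
area, centroid = contact_features(frame)  # 9 contact pixels, centered
```

Even this crude reduction, contact area plus centroid, is information a wrist camera cannot give you once the fingers close around the object.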
Capacitive and resistive arrays use grids of pressure-sensitive elements to create a spatial pressure map across the sensor surface. These are thinner and easier to integrate but typically have lower resolution than vision-based sensors. Companies like Pressure Profile Systems and research labs at MIT and Stanford are pushing the density of these arrays higher.
Piezoelectric and MEMS sensors can detect dynamic events like vibration and slip at very high frequencies. These are great for detecting the onset of slip, which is critical for stable grasping, but they typically do not give you the spatial contact geometry that vision-based sensors provide.
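A minimal sketch of why high-frequency channels matter for slip: slip shows up as a burst of vibration, i.e. large sample-to-sample differences, while steady grip force changes slowly. Real piezo/MEMS pipelines run in analog hardware at kilohertz rates; the window size and threshold here are made-up numbers for illustration.

```python
def slip_onset(samples, window=4, threshold=0.5):
    """Return the index where high-frequency energy first exceeds
    threshold, or None if no slip-like burst is detected."""
    # First difference acts as a crude high-pass filter: slow force
    # changes are suppressed, vibration bursts pass through.
    diffs = [abs(b - a) for a, b in zip(samples, samples[1:])]
    for i in range(len(diffs) - window + 1):
        energy = sum(d * d for d in diffs[i:i + window])
        if energy > threshold:
            return i
    return None

steady = [1.0, 1.01, 1.0, 0.99, 1.0, 1.01]       # stable grasp, quiet signal
slipping = steady + [1.4, 0.6, 1.5, 0.5, 1.3]    # vibration burst as slip begins
```

Because the burst precedes any visible motion of the object, a controller reacting to this signal can tighten its grip before a camera would notice anything wrong.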
The challenge is not building any one of these sensors. It is building a sensor that has high spatial resolution, high bandwidth, is durable enough for real-world use, small enough to fit in a robot fingertip, and cheap enough to deploy at scale. Nobody has fully cracked that combination yet, but the gap is closing fast.
This is where things get really exciting. The robotics AI community has been making huge progress with VLAs, Vision-Language-Action models. These are foundation models that take in camera images and language instructions and output robot actions. RT-2, Octo, OpenVLA, and similar models have shown impressive generalization across tasks and environments.
But VLAs are missing a critical input modality: touch. And this is not just a nice-to-have. For contact-rich manipulation tasks, tactile information is arguably more important than vision.
This is why I believe the next major step in robot foundation models is VTLAs: Vision-Touch-Language-Action models. These models would take in vision, tactile sensing, and language instructions, and output actions that are informed by all three modalities.
Think about what this enables: a robot that feels an object start to slip and tightens its grip before vision registers any motion, that modulates its force when told to handle something gently, and that keeps manipulating confidently even when its own hand blocks the camera's view.
The technical challenge is not trivial. Tactile data is high-dimensional, high-bandwidth, and fundamentally different in structure from visual data. You cannot just concatenate a tactile image to a visual image and call it a day. The model needs to learn cross-modal representations that capture the relationship between what the robot sees and what it feels.
But the same transformer architectures that learned to fuse vision and language are perfectly capable of incorporating a third modality. The architecture is ready. What we need is the data.
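A schematic of the "third modality" point: project each modality into a shared token width and feed the transformer one combined sequence. The dimensions and the random projections below are placeholders standing in for learned encoders, not a real VTLA architecture.

```python
import random

random.seed(0)
D = 8  # shared token width

def project(features, in_dim, out_dim=D):
    """Toy linear projection standing in for a learned per-modality encoder."""
    W = [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    return [[sum(w * x for w, x in zip(row, f)) for row in W] for f in features]

vision_patches = [[0.1] * 16 for _ in range(4)]   # 4 image patches, dim 16
tactile_taxels = [[0.5] * 6 for _ in range(3)]    # 3 tactile patches, dim 6
text_tokens    = [[1.0] * 12 for _ in range(2)]   # 2 language tokens, dim 12

# One sequence of 9 tokens, all width D, regardless of source modality;
# self-attention over this sequence is where cross-modal fusion happens.
sequence = (project(vision_patches, 16)
            + project(tactile_taxels, 6)
            + project(text_tokens, 12))
```

The point of the sketch is that nothing in the attention machinery cares how many modalities contributed tokens; the hard part is learning projections where "what I see" and "what I feel" about the same contact end up aligned.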
And here is the catch. Building a VTLA requires paired vision-touch-action datasets at scale, and these basically do not exist yet.
Collecting tactile data is harder than collecting visual data. Every demonstration needs a robot with tactile sensors actually making contact with objects. You cannot extract tactile data from YouTube videos or existing datasets. It has to be collected from scratch, on hardware that has tactile sensors, in environments with diverse objects and tasks.
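To show what "paired" means in practice, here is one way a vision-touch-action sample might be logged during teleoperation. The field names and shapes are hypothetical, chosen only to illustrate that every frame must bundle all three streams plus the action, synchronized in time.

```python
from dataclasses import dataclass, asdict

@dataclass
class VTLAFrame:
    timestamp: float   # seconds since episode start
    rgb: list          # camera image, e.g. flattened HxWx3 (toy size here)
    tactile: dict      # per-fingertip pressure maps
    instruction: str   # language command for the episode
    action: list       # commanded joint deltas / gripper force

frame = VTLAFrame(
    timestamp=12.5,
    rgb=[0] * 12,
    tactile={"thumb": [0.1, 0.4], "index": [0.0, 0.2]},
    instruction="pick up the glass without crushing it",
    action=[0.01, -0.02, 0.0],
)
record = asdict(frame)  # plain dict, ready to serialize with the episode
```

Note what is missing from every existing large-scale dataset: the tactile field. That is the column you cannot backfill from YouTube.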
This is a chicken-and-egg problem. Nobody is collecting large-scale tactile datasets because nobody has the hardware deployed to collect them. Nobody is deploying tactile hardware at scale because there are no models that can use the data effectively. Someone has to break this cycle.
I think the path forward looks like this: ship durable, fingertip-scale tactile sensors on hands that are actually deployed; log paired vision-touch-action data from every teleoperated and autonomous episode; train VTLA models on that growing corpus; and use those models to unlock tasks that justify deploying even more sensorized hands. That is how the chicken-and-egg cycle gets broken, one loop at a time.
The humanoid companies that integrate tactile sensing early will have a compounding advantage. Every day of tactile data collection is a day of training data that competitors without touch cannot replicate. The VTLA models trained on this data will be capable of tasks that vision-only systems simply cannot do.
I have been thinking about this problem for a while now, and I am increasingly convinced that tactile sensing is not just a nice sensor upgrade. It is the missing modality that separates robots that can do demos from robots that can do real work. The gap between picking up a block in a lab and handling real objects in a real kitchen or factory is largely a gap in tactile intelligence.
The teams that figure out how to build good tactile sensors, collect tactile data at scale, and train VTLA models on that data are going to build robots that make everything else on the market look like it is wearing oven mitts.
And that is a future worth building toward.