Why touch, not just vision, will decide which humanoids actually work in the real world
Close your eyes and pick up a glass of water. You can do it effortlessly. You know exactly how hard to squeeze, you can feel the weight shift as water sloshes, you can tell if the glass is wet and adjust your grip before it slips. Now imagine doing that with thick oven mitts on both hands and your eyes closed. That is roughly how most humanoid robots interact with the world today.
The robotics industry has poured billions into better vision, better language understanding, and better locomotion. But the sense that matters most for manipulation, touch, remains almost completely neglected. I think this is the single biggest gap holding humanoids back, and I think the teams that solve it will leapfrog everyone else.
Here is a fact that surprises most people: the majority of humanoid robot hands have zero tactile sensing. Some have basic force/torque sensors at the wrist. A few have pressure-sensitive fingertips. But nothing close to the density and richness of human touch.
Human fingertips have roughly 2,500 mechanoreceptors per square centimeter. These receptors detect pressure, vibration, texture, and slip at incredibly high resolution and bandwidth (temperature is handled by separate thermoreceptors in the same skin). When you pick up an object, your brain is processing a continuous stream of tactile data that tells you everything about the contact: where it is, how strong it is, whether the object is slipping, what the surface feels like, and how the object is deforming under your grip.
Without this information, robots are doing manipulation with one hand tied behind their back. They rely almost entirely on vision to plan grasps, and then execute those grasps open-loop, with no feedback about what is actually happening at the contact surface. This is why robots drop things, crush delicate objects, and struggle with tasks that any human child can do.
The current approach in robotics is to throw more vision at the problem. Better cameras, more viewpoints, higher resolution, depth sensors. And vision is essential. But it has fundamental limitations for manipulation: the hand occludes the object at exactly the moment of contact, a camera cannot measure contact forces, and slip often begins before any visible motion of the object.
The analogy I keep coming back to is this: trying to do manipulation with only vision is like trying to walk using only your eyes, with no proprioception or sense of ground contact. You could maybe shuffle around on flat ground, but you would fall on any uneven surface. That is where robotic manipulation is today.
The good news is that tactile sensing hardware is improving rapidly. There are several promising approaches:
Vision-based tactile sensors like GelSight and DIGIT use a camera behind a soft, deformable membrane to capture high-resolution contact geometry. When the sensor touches something, the membrane deforms, and the camera captures the deformation pattern. These sensors give you incredible spatial resolution and can detect surface textures, contact shapes, and even small forces. The downside is they are bulky and hard to integrate into a dexterous hand.
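To make the idea concrete, here is a toy sketch of how a vision-based tactile frame might be reduced to contact features. The 5x5 "deformation map" and the threshold are illustrative assumptions, not the actual GelSight or DIGIT pipeline, which works on full camera images of the membrane.

```python
def contact_features(deformation, threshold=0.2):
    """Return (area, centroid) of the contact patch in a 2D deformation map.

    deformation: list of rows of floats in [0, 1], where larger values
    mean the membrane is pressed in further at that pixel.
    """
    pixels = [(r, c)
              for r, row in enumerate(deformation)
              for c, v in enumerate(row)
              if v > threshold]
    if not pixels:
        return 0, None
    area = len(pixels)
    cy = sum(r for r, _ in pixels) / area
    cx = sum(c for _, c in pixels) / area
    return area, (cy, cx)

# A toy frame: a small press near the center of a 5x5 sensor.
frame = [[0.0, 0.0, 0.0, 0.0, 0.0],
         [0.0, 0.3, 0.5, 0.3, 0.0],
         [0.0, 0.5, 0.9, 0.5, 0.0],
         [0.0, 0.3, 0.5, 0.3, 0.0],
         [0.0, 0.0, 0.0, 0.0, 0.0]]
area, centroid = contact_features(frame)  # 9 contact pixels, centered
```

Even this crude reduction, contact area plus centroid, is information a wrist camera cannot give you once the fingers close around the object.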
Capacitive and resistive arrays use grids of pressure-sensitive elements to create a spatial pressure map across the sensor surface. These are thinner and easier to integrate but typically have lower resolution than vision-based sensors. Companies like Pressure Profile Systems and research labs at MIT and Stanford are pushing the density of these arrays higher.
Piezoelectric and MEMS sensors can detect dynamic events like vibration and slip at very high frequencies. These are great for detecting the onset of slip, which is critical for stable grasping, but they typically do not give you the spatial contact geometry that vision-based sensors provide.
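A minimal sketch of why high-frequency channels matter for slip: slip shows up as a burst of vibration, i.e. large sample-to-sample differences, while steady grip force changes slowly. Real piezo/MEMS pipelines run in analog hardware at kilohertz rates; the window size and threshold here are made-up numbers for illustration.

```python
def slip_onset(samples, window=4, threshold=0.5):
    """Return the index where high-frequency energy first exceeds
    threshold, or None if no slip-like burst is detected."""
    # First difference acts as a crude high-pass filter: slow force
    # changes are suppressed, vibration bursts pass through.
    diffs = [abs(b - a) for a, b in zip(samples, samples[1:])]
    for i in range(len(diffs) - window + 1):
        energy = sum(d * d for d in diffs[i:i + window])
        if energy > threshold:
            return i
    return None

steady = [1.0, 1.01, 1.0, 0.99, 1.0, 1.01]       # stable grasp, quiet signal
slipping = steady + [1.4, 0.6, 1.5, 0.5, 1.3]    # vibration burst as slip begins
```

Because the burst precedes any visible motion of the object, a controller reacting to this signal can tighten its grip before a camera would notice anything wrong.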
The challenge is not building any one of these sensors. It is building a sensor that has high spatial resolution, high bandwidth, is durable enough for real-world use, small enough to fit in a robot fingertip, and cheap enough to deploy at scale. Nobody has fully cracked that combination yet, but the gap is closing fast.
This is where things get really exciting. The robotics AI community has been making huge progress with VLAs, Vision-Language-Action models. These are foundation models that take in camera images and language instructions and output robot actions. RT-2, Octo, OpenVLA, and similar models have shown impressive generalization across tasks and environments.
But VLAs are missing a critical input modality: touch. And this is not just a nice-to-have. For contact-rich manipulation tasks, tactile information is arguably more important than vision.
This is why I believe the next major step in robot foundation models is VTLAs: Vision-Touch-Language-Action models. These models would take in vision, tactile sensing, and language instructions, and output actions that are informed by all three modalities.
Think about what this enables: a robot that feels an object start to slip and tightens its grip before vision registers any motion, that modulates its force when told to handle something gently, and that keeps manipulating confidently even when its own hand blocks the camera's view.
The technical challenge is not trivial. Tactile data is high-dimensional, high-bandwidth, and fundamentally different in structure from visual data. You cannot just concatenate a tactile image to a visual image and call it a day. The model needs to learn cross-modal representations that capture the relationship between what the robot sees and what it feels.
But the same transformer architectures that learned to fuse vision and language are perfectly capable of incorporating a third modality. The architecture is ready. What we need is the data.
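A schematic of the "third modality" point: project each modality into a shared token width and feed the transformer one combined sequence. The dimensions and the random projections below are placeholders standing in for learned encoders, not a real VTLA architecture.

```python
import random

random.seed(0)
D = 8  # shared token width

def project(features, in_dim, out_dim=D):
    """Toy linear projection standing in for a learned per-modality encoder."""
    W = [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    return [[sum(w * x for w, x in zip(row, f)) for row in W] for f in features]

vision_patches = [[0.1] * 16 for _ in range(4)]   # 4 image patches, dim 16
tactile_taxels = [[0.5] * 6 for _ in range(3)]    # 3 tactile patches, dim 6
text_tokens    = [[1.0] * 12 for _ in range(2)]   # 2 language tokens, dim 12

# One sequence of 9 tokens, all width D, regardless of source modality;
# self-attention over this sequence is where cross-modal fusion happens.
sequence = (project(vision_patches, 16)
            + project(tactile_taxels, 6)
            + project(text_tokens, 12))
```

The point of the sketch is that nothing in the attention machinery cares how many modalities contributed tokens; the hard part is learning projections where "what I see" and "what I feel" about the same contact end up aligned.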
And here is the catch. Building a VTLA requires paired vision-touch-action datasets at scale, and these basically do not exist yet.
Collecting tactile data is harder than collecting visual data. Every demonstration needs a robot with tactile sensors actually making contact with objects. You cannot extract tactile data from YouTube videos or existing datasets. It has to be collected from scratch, on hardware that has tactile sensors, in environments with diverse objects and tasks.
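To show what "paired" means in practice, here is one way a vision-touch-action sample might be logged during teleoperation. The field names and shapes are hypothetical, chosen only to illustrate that every frame must bundle all three streams plus the action, synchronized in time.

```python
from dataclasses import dataclass, asdict

@dataclass
class VTLAFrame:
    timestamp: float   # seconds since episode start
    rgb: list          # camera image, e.g. flattened HxWx3 (toy size here)
    tactile: dict      # per-fingertip pressure maps
    instruction: str   # language command for the episode
    action: list       # commanded joint deltas / gripper force

frame = VTLAFrame(
    timestamp=12.5,
    rgb=[0] * 12,
    tactile={"thumb": [0.1, 0.4], "index": [0.0, 0.2]},
    instruction="pick up the glass without crushing it",
    action=[0.01, -0.02, 0.0],
)
record = asdict(frame)  # plain dict, ready to serialize with the episode
```

Note what is missing from every existing large-scale dataset: the tactile field. That is the column you cannot backfill from YouTube.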
This is a chicken-and-egg problem. Nobody is collecting large-scale tactile datasets because nobody has the hardware deployed to collect them. Nobody is deploying tactile hardware at scale because there are no models that can use the data effectively. Someone has to break this cycle.
I think the path forward looks like this: ship durable, fingertip-scale tactile sensors on hands that are actually deployed; log paired vision-touch-action data from every teleoperated and autonomous episode; train VTLA models on that growing corpus; and use those models to unlock tasks that justify deploying even more sensorized hands. That is how the chicken-and-egg cycle gets broken, one loop at a time.
The humanoid companies that integrate tactile sensing early will have a compounding advantage. Every day of tactile data collection is a day of training data that competitors without touch cannot replicate. The VTLA models trained on this data will be capable of tasks that vision-only systems simply cannot do.
I have been thinking about this problem for a while now, and I am increasingly convinced that tactile sensing is not just a nice sensor upgrade. It is the missing modality that separates robots that can do demos from robots that can do real work. The gap between picking up a block in a lab and handling real objects in a real kitchen or factory is largely a gap in tactile intelligence.
The teams that figure out how to build good tactile sensors, collect tactile data at scale, and train VTLA models on that data are going to build robots that make everything else on the market look like it is wearing oven mitts.
And that is a future worth building toward.