SmolVLA on SO-ARM101
This project implements a Vision Language Action (VLA) model for the SO-ARM101 robotic arm. Built on SmolVLA, a lightweight and efficient VLA model, the system enables the robot to understand visual scenes, interpret natural language instructions, and execute precise manipulation tasks across the arm's six degrees of freedom.
Vision Language Action models represent a breakthrough in robotic manipulation by combining:

- visual perception of the scene,
- natural language understanding of the task instruction, and
- action generation for low-level motor control.
This unified model allows robots to learn from demonstrations and generalize to new tasks through language guidance.
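To make the pipeline concrete, below is a minimal inference sketch using the LeRobot library that distributes SmolVLA. It assumes the `lerobot/smolvla_base` checkpoint from the Hugging Face Hub; the import path can vary between LeRobot versions, and the batch keys (camera name, state layout, image resolution) are illustrative placeholders that must match the features of your recorded SO-ARM101 dataset.

```python
# Minimal SmolVLA inference sketch (assumes the LeRobot SmolVLA API;
# batch keys and shapes below are illustrative, not authoritative).
import torch
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Load the pretrained SmolVLA checkpoint from the Hugging Face Hub.
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()

# One observation step: a camera frame, the arm's joint state,
# and the natural language task instruction.
batch = {
    "observation.images.top": torch.rand(1, 3, 256, 256),  # placeholder camera frame
    "observation.state": torch.rand(1, 6),                 # 6 joint positions (SO-ARM101)
    "task": ["pick up the red cube"],                      # language instruction
}

with torch.no_grad():
    # select_action returns the next action for the arm's 6 motors;
    # internally the policy predicts a chunk of future actions and
    # replays it step by step.
    action = policy.select_action(batch)

print(action.shape)  # expected: (1, 6)
```

In a real control loop this call would run at the arm's control frequency, with `batch` filled from live camera frames and encoder readings rather than random tensors.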