Humanoid VLA: Vision-Language-Action Controlled Humanoid Robot

MuJoCo ROS 2 Jazzy ACT Unitree G1 Python 3.12 License: MIT

TL;DR: A natural-language-commanded Unitree G1 humanoid built on MuJoCo + ACT + ROS 2. Trained ACT (Action Chunking with Transformers) policies achieve 89% combined success across 5 manipulation tasks (89/100 episodes), including 100% on physics-based bimanual grasping. An OOD generalization study shows graceful degradation from 90% in-distribution to 55% under combined distribution shift.

Key Results

Task success

| Track | Success | Episodes |
|---|---|---|
| Single-arm (4 tasks) | 86.2% | 69 / 80 |
| Bimanual physics grasp | 100% | 20 / 20 |
| Combined | 89% | 89 / 100 |
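
The combined figure is the episode-weighted total of the two tracks, not the mean of the two percentages. A quick check of the arithmetic, using only the counts reported above:

```python
# Episode counts as reported in the table above: (successes, episodes).
tracks = {
    "single_arm": (69, 80),
    "bimanual": (20, 20),
}


def success_rate(successes: int, episodes: int) -> float:
    """Return the success rate as a percentage."""
    return 100.0 * successes / episodes


total_successes = sum(s for s, _ in tracks.values())
total_episodes = sum(n for _, n in tracks.values())
combined = success_rate(total_successes, total_episodes)

for name, (s, n) in tracks.items():
    print(f"{name}: {s}/{n} = {success_rate(s, n):.2f}%")
print(f"combined: {total_successes}/{total_episodes} = {combined:.2f}%")
```

Averaging the two track percentages instead would give a higher number, because the 20-episode bimanual track would count as much as the 80-episode single-arm track.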

OOD generalization (bimanual)

| Test distribution | Success |
|---|---|
| In-distribution | 90% |
| OOD position (1.5×) | 80% |
| OOD visual | 70% |
| OOD posture (1.5×) | 60% |
| OOD combined | 55% |

Author: Ozkan Ceylan · Full Report: PROJECT_REPORT.md


Demo Videos

Single-Arm Manipulation (4 Tasks: Reach + Grasp + Pick Up + Place)

Single-arm: reach → grasp → pick → place
Full videos: reach · grasp · pick · place · combined

Bimanual Physics-Based Grasping

Bimanual Box Lift: both hands squeeze the box via friction only (no weld constraints), with full mj_step dynamics
Bimanual lift demo
Full video

Videos show side-by-side overview camera (left) and robot's egocentric view (right). The ACT model receives only the ego camera image as visual input.


Architecture

User: "Pick up the red cube"
  │
  ├─── Telegram → OpenClaw (Charlie) → RosClaw ────┐
  │         (natural language interface)           │
  │                                                ▼
  │                              ┌─────────────────────────────┐
  │                              │  rosbridge (WebSocket:9090) │
  │                              └────────────┬────────────────┘
  │                                           ▼
  │              ┌────────────────────────────────────────────┐
  └─────────────►│  VLA Task Manager (ROS2 Node)              │
                 │                                            │
                 │  NL Parser: "pick up..." → single_arm,     │
                 │             task_id=2                      │
                 │                                            │
                 │  30Hz Control Loop:                        │
                 │    Camera (480×640 RGB) ──► ACT Model ──►  │
                 │    Joint State (58-d)   ──►  (15.6M)  ──►  │
                 │    Task Embedding       ──►           ──►  │
                 │                         20 joint actions   │
                 └────────────────┬───────────────────────────┘
                                  │ Joint commands
                 ┌────────────────▼───────────────────────────┐
                 │  MuJoCo Simulation                         │
                 │  Unitree G1 (29 DOF) + table + objects     │
                 │  Egocentric camera → 480×640 RGB           │
                 │  Physics: 500Hz (mj_step) / Kinematic      │
                 └────────────────────────────────────────────┘

Key insight: The VLA model runs in a tight 30Hz control loop (camera → action). RosClaw/OpenClaw operates at the task-dispatch level: it sends the command once and monitors completion.


Quick Start

Prerequisites

  • Ubuntu 24.04 (tested), or Ubuntu 22.04
  • NVIDIA GPU with CUDA (RTX 4050 6GB VRAM is sufficient)
  • Python 3.12+, ROS2 Jazzy (or Humble)

Installation

# 1. Clone
git clone https://github.com/ozkanceylan/humanoid_vla.git
cd humanoid_vla

# 2. Install ROS2
chmod +x install_ros2.sh && ./install_ros2.sh

# 3. Python dependencies
pip3 install --break-system-packages -r requirements.txt

# 4. Robot models
mkdir -p repos && cd repos
git clone https://github.com/unitreerobotics/unitree_mujoco
git clone https://github.com/google-deepmind/mujoco_menagerie
cd ..

# 5. Build ROS2 workspace
source /opt/ros/jazzy/setup.bash
cd ros2_ws && colcon build --symlink-install && cd ..
source ros2_ws/install/setup.bash

Full Pipeline (Train → Evaluate → Demo)

# 1. Generate training data (80 single-arm + 30 bimanual demos)
MUJOCO_GL=egl python3 scripts/generate_demos.py --all-tasks --episodes 20
MUJOCO_GL=egl python3 scripts/generate_bimanual_demos.py --episodes 30

# 2. Train ACT models (~2.5 hours total on RTX 4050)
python3 scripts/train_act.py --demos data/demos --epochs 300 --batch-size 32
python3 scripts/train_bimanual.py --epochs 300

# 3. Evaluate
MUJOCO_GL=egl python3 scripts/evaluate.py --checkpoint data/checkpoints/best.pt --episodes 20
MUJOCO_GL=egl python3 scripts/evaluate_bimanual.py --checkpoint data/bimanual_checkpoints/best.pt --episodes 20

# 4. Interactive demos (opens MuJoCo viewer)
python3 scripts/live_demo.py --checkpoint data/checkpoints/best.pt
python3 scripts/live_bimanual.py --checkpoint data/bimanual_checkpoints/best.pt

# 5. Record demo videos
MUJOCO_GL=egl python3 scripts/record_demo_videos.py

Run via ROS2 (Natural Language Interface)

# Terminal 1: Launch full VLA system (task manager + rosbridge)
ros2 launch vla_mujoco_bridge vla_system.launch.py

# Terminal 2: Send natural language commands
ros2 topic pub --once /vla/task_goal std_msgs/String "data: 'pick up the red cube'"
ros2 topic pub --once /vla/task_goal std_msgs/String "data: 'pick up the green box with both hands'"

# Terminal 3: Monitor status (JSON)
ros2 topic echo /vla/status

Evaluation Results

Single-Arm Manipulation (Phase C)

Trained for 300 epochs (93 min on RTX 4050, final loss: 0.000009). Evaluated with temporal ensembling and hierarchical task decomposition:

| Task | Success | Rate |
|---|---|---|
| Reach the red cube | 20/20 | 100% |
| Grasp the red cube | 18/20 | 90% |
| Pick up the red cube | 18/20 | 90% |
| Place the red cube on blue plate | 13/20 | 65% |
| Overall | 69/80 | 86.2% |

Bimanual Physics-Based Grasping (Phase C2)

Both hands squeeze a 20×15×15 cm box using friction only: no weld constraints, full mj_step dynamics, and PD torque control with gravity compensation.

| Metric | Value |
|---|---|
| Success rate | 20/20 (100%) |
| Lift | mean=8.5cm, min=6.5cm, max=10.4cm |
| Contact force | L=13.6N, R=13.3N (bilateral) |
| Physics | mj_step at 500Hz, control at 30Hz |
| Training | 300 epochs, 52 min, loss=0.000009 |

Combined: 5 Tasks, 89/100 (89%)

Key inference techniques:

  1. Temporal ensembling: overlapping action chunks are blended with exponential-decay weighting
  2. Hierarchical task decomposition: composite tasks switch task embedding at the grasp trigger
  3. Re-grasp prevention: a released flag prevents re-triggering after an intentional release
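
A minimal sketch of temporal ensembling as listed above, written in plain Python so it runs without the repo. It assumes the weighting used in the ACT paper, w_i = exp(-m·i) with i = 0 for the oldest prediction; the class name, the `decay` value, and the list-based chunk format are illustrative and not taken from this repository.

```python
import math
from collections import deque


class TemporalEnsembler:
    """Blend the predictions that overlapping action chunks make for the current step."""

    def __init__(self, chunk_size: int = 20, decay: float = 0.01):
        self.chunk_size = chunk_size
        self.decay = decay      # assumed value; decay=0 gives a plain average
        self.chunks = deque()   # entries are [age_in_steps, chunk], oldest first

    def step(self, new_chunk):
        """Register the chunk predicted at this step and return the blended action."""
        if len(new_chunk) != self.chunk_size:
            raise ValueError("chunk length must equal chunk_size")
        self.chunks.append([0, new_chunk])
        # Each live chunk predicted an action for "now" at index == its age.
        candidates = [chunk[age] for age, chunk in self.chunks]
        # i = 0 is the oldest chunk, so it gets the largest weight.
        weights = [math.exp(-self.decay * i) for i in range(len(candidates))]
        total = sum(weights)
        dim = len(candidates[0])
        action = [
            sum(w * c[d] for w, c in zip(weights, candidates)) / total
            for d in range(dim)
        ]
        # Age every chunk and drop the ones that have no predictions left.
        for entry in self.chunks:
            entry[0] += 1
        while self.chunks and self.chunks[0][0] >= self.chunk_size:
            self.chunks.popleft()
        return action
```

At 30Hz with 20-step chunks, up to 20 predictions contribute to each executed action once the buffer is full, which smooths the jumps that would otherwise appear at chunk boundaries.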

Generalization (Phase F: Domain Randomization)

The bimanual model was trained with position randomization (±5 cm), random arm starts, and visual domain randomization. It was then evaluated on out-of-distribution conditions never seen during training:

| Test Distribution | Success | Description |
|---|---|---|
| In-distribution | 90% | Same ranges as training, different seeds |
| OOD Position (1.5x) | 80% | Wider object positions |
| OOD Visual | 70% | Novel table/light colors |
| OOD Posture (1.5x) | 60% | Wider starting arm configs |
| OOD Combined | 55% | All OOD factors simultaneously |

Success degrades gradually from 90% to 55% rather than collapsing: the policy still completes more than half of the episodes when all three shifts are applied at once.
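
To make the position conditions concrete, here is a small sketch of how in-distribution and "1.5x" samples might be drawn. It assumes that 1.5x scales the ±5 cm half-range to ±7.5 cm and that offsets are uniform in the table plane; the function name and the uniform distribution are assumptions, not the repo's actual sampler.

```python
import random

TRAIN_HALF_RANGE_M = 0.05  # +/-5 cm position randomization, from the text above


def sample_object_offset(rng: random.Random, scale: float = 1.0):
    """Sample an (x, y) table-plane offset in metres.

    scale=1.0 reproduces the training range; scale=1.5 is the assumed
    meaning of the "OOD Position (1.5x)" condition.
    """
    half = TRAIN_HALF_RANGE_M * scale
    return rng.uniform(-half, half), rng.uniform(-half, half)


rng = random.Random(0)
in_dist = [sample_object_offset(rng) for _ in range(1000)]
ood = [sample_object_offset(rng, scale=1.5) for _ in range(1000)]

# Count OOD samples that land outside the square seen during training.
outside = sum(1 for x, y in ood if max(abs(x), abs(y)) > TRAIN_HALF_RANGE_M)
print(f"{outside / len(ood):.0%} of OOD samples fall outside the training range")
```

Under these assumptions only part of the widened range is actually unseen, so the OOD position condition mixes familiar and novel placements.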


Skills Demonstrated

ML for Manipulation

  • ACT (Action Chunking with Transformers) policy, 15.6M params, ~0.67s chunks
  • Frozen ResNet18 (layers 0–6) visual encoder + state + task-embedding fusion
  • Temporal ensembling: overlapping chunks with exponential decay weighting
  • Hierarchical task decomposition: composite tasks switch embeddings at grasp trigger
  • Re-grasp prevention via released latch in evaluation loop
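
The embedding switch and the released latch can be sketched as a small state machine. The 4 cm trigger comes from the Tasks table in this README; the `COMPOSITE_TASKS` mapping, the class, and the function names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

GRASP_DISTANCE_M = 0.04  # auto-grasp threshold from the Tasks table

# Hypothetical mapping: composite task -> (embedding before grasp, embedding after grasp).
COMPOSITE_TASKS = {
    2: (1, 2),  # pick: act as "grasp" until the trigger fires, then as "pick"
    3: (1, 3),  # place: act as "grasp" until the trigger fires, then as "place"
}


@dataclass
class GraspState:
    grasped: bool = False
    released: bool = False  # latch: once set, auto-grasp never fires again

    def update(self, hand_to_object_m: float, release_requested: bool = False) -> None:
        if self.grasped and release_requested:
            # Intentional release: open the hand and set the latch.
            self.grasped = False
            self.released = True
        elif (not self.grasped and not self.released
              and hand_to_object_m < GRASP_DISTANCE_M):
            # Proximity trigger, allowed only before any release.
            self.grasped = True


def active_task_id(task_id: int, state: GraspState) -> int:
    """Return the task embedding ID to feed the policy at this step."""
    if task_id not in COMPOSITE_TASKS:
        return task_id
    before, after = COMPOSITE_TASKS[task_id]
    return after if (state.grasped or state.released) else before
```

Without the latch, the hand would still be within 4 cm of the cube right after placing it, and the proximity trigger would immediately grasp it again.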

Robot Control & Physics

  • PD torque control with per-joint gravity compensation on Unitree G1 (29 DOF)
  • Friction-only bimanual grasping under full mj_step dynamics, with no weld constraints
  • Iterative Jacobian IK for scripted expert demo generation (single + bimanual)
  • Auto-grasp trigger on hand-to-object proximity for closed-loop manipulation
  • Bilateral contact force monitoring (≥2N per palm, ~13N measured in eval)
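
As an illustration of iterative Jacobian IK, the sketch below solves a planar two-link arm with damped least squares. It is a toy model, not the G1 arm: the link lengths, damping, step gain, and function names are all assumptions chosen for readability.

```python
import math

LINK_1_M, LINK_2_M = 0.30, 0.25  # hypothetical link lengths


def forward(q1: float, q2: float):
    """End-effector position of a planar two-link arm."""
    x = LINK_1_M * math.cos(q1) + LINK_2_M * math.cos(q1 + q2)
    y = LINK_1_M * math.sin(q1) + LINK_2_M * math.sin(q1 + q2)
    return x, y


def jacobian(q1: float, q2: float):
    """2x2 Jacobian d(x, y) / d(q1, q2), returned as two rows."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return ((-LINK_1_M * s1 - LINK_2_M * s12, -LINK_2_M * s12),
            (LINK_1_M * c1 + LINK_2_M * c12, LINK_2_M * c12))


def solve_ik(target, q=(0.3, 0.6), damping=0.05, gain=0.5, tol=1e-4, max_iters=200):
    """Iterate dq = gain * J^T (J J^T + damping^2 I)^-1 e until the error is below tol."""
    q1, q2 = q
    for _ in range(max_iters):
        x, y = forward(q1, q2)
        ex, ey = target[0] - x, target[1] - y
        if math.hypot(ex, ey) < tol:
            break
        (a, b), (c, d) = jacobian(q1, q2)
        # A = J J^T + damping^2 I is a symmetric 2x2 matrix.
        a11 = a * a + b * b + damping ** 2
        a12 = a * c + b * d
        a22 = c * c + d * d + damping ** 2
        det = a11 * a22 - a12 * a12
        # v = A^-1 e, using the closed-form 2x2 inverse.
        vx = (a22 * ex - a12 * ey) / det
        vy = (-a12 * ex + a11 * ey) / det
        # dq = J^T v, scaled by the step gain.
        q1 += gain * (a * vx + c * vy)
        q2 += gain * (b * vx + d * vy)
    return q1, q2


q1, q2 = solve_ik((0.35, 0.25))
print("joint angles (rad):", round(q1, 3), round(q2, 3))
```

The damping term keeps the update bounded when the arm is close to full extension, where the plain Jacobian inverse would produce very large joint steps.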

System Integration

  • ROS 2 Jazzy Task Manager node: NL command → ACT inference → MuJoCo at 30Hz
  • rosbridge WebSocket server (port 9090) for external clients (Telegram, JS, Python)
  • Natural language parser routes "pick up" / "lift box" → task ID + arm mode
  • Thread-safe inference loop with daemon threads and ROS2 status publishing
  • JSON-encoded /vla/status topic streaming step, progress, and result fields
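
The keyword routing in the list above could look roughly like the sketch below. Task IDs follow the Tasks table in this README (0 reach, 1 grasp, 2 pick, 3 place, 4 bimanual lift); the keyword lists and the `parse_command` name are assumptions, and the real parser in task_manager_node.py may differ.

```python
from typing import Optional, Tuple

# Checked in order, so "place" wins over "pick" if both words appear.
SINGLE_ARM_KEYWORDS = [
    ("place", 3),
    ("pick", 2),
    ("grasp", 1),
    ("reach", 0),
]
# Any of these routes the command to the bimanual policy (assumed keywords).
BIMANUAL_KEYWORDS = ("both hands", "bimanual", "box")
BIMANUAL_TASK_ID = 4


def parse_command(text: str) -> Optional[Tuple[str, int]]:
    """Map a free-text command to (arm mode, task ID), or None if nothing matches."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in BIMANUAL_KEYWORDS):
        return ("bimanual", BIMANUAL_TASK_ID)
    for keyword, task_id in SINGLE_ARM_KEYWORDS:
        if keyword in lowered:
            return ("single_arm", task_id)
    return None


for command in ("pick up the red cube", "lift the box", "bimanual grasp"):
    print(command, "->", parse_command(command))
```

Checking the bimanual keywords first matters: "bimanual grasp" contains "grasp" and would otherwise be routed to the single-arm policy.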

Evaluation & Methodology

  • OOD generalization study with progressive difficulty (position / visual / posture / combined)
  • Runtime visual domain randomization (table color, lighting) during training
  • Each OOD condition compared against an in-distribution baseline evaluated on seeds different from training
  • 50 documented engineering lessons (tasks/lessons.md, L001–L050)
  • Six deep-dive study documents covering each subsystem end-to-end

Simulation Engineering

  • MuJoCo MJCF authoring: G1 + table + cameras + objects in sim/g1_with_camera.xml
  • 500Hz mj_step physics decoupled from 30Hz control loop
  • Egocentric 480×640 RGB camera pipeline rendered headless via EGL
  • HDF5 demo recording + LeRobot format converter for portability
  • Two interactive viewers (live_demo.py, live_bimanual.py) for qualitative inspection
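
Because 500 / 30 is not an integer, the number of physics steps per control tick cannot be constant if simulated time is to stay aligned with the control clock. The sketch below shows one way to schedule the substeps; it is an assumption about how the decoupling could be done, and the repo may simply use a fixed substep count instead.

```python
PHYSICS_HZ = 500
CONTROL_HZ = 30


def substeps_per_control_tick(n_ticks: int) -> list:
    """Return how many physics steps to run before each control tick.

    The count alternates between 16 and 17 so that exactly 500 physics
    steps elapse for every 30 control ticks.
    """
    counts = []
    steps_done = 0
    for tick in range(1, n_ticks + 1):
        # Total physics steps that should have elapsed by this control tick.
        target = (tick * PHYSICS_HZ) // CONTROL_HZ
        counts.append(target - steps_done)
        steps_done = target
    return counts


schedule = substeps_per_control_tick(30)
print(schedule[:6], "... total:", sum(schedule))
```

In a MuJoCo loop, each entry would be the number of `mj_step` calls made between two policy queries, with the PD torques recomputed at every physics step.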

ROS2 Integration (Phase D)

The VLA Task Manager accepts natural language commands via ROS2 topics and runs ACT inference in a closed-loop MuJoCo simulation.

ROS2 Interfaces

| Direction | Topic | Type | Purpose |
|---|---|---|---|
| Subscribe | /vla/task_goal | std_msgs/String | Natural language command |
| Publish | /vla/status | std_msgs/String | JSON: step, progress, result |
| Publish | /camera/image_raw | sensor_msgs/Image | Ego camera during execution |
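
Since /vla/status carries JSON inside a std_msgs/String, a client has to decode the `data` field itself. A minimal sketch of that round trip follows; only the three field names come from the table above, while their types and the example values are assumptions.

```python
import json

REQUIRED_FIELDS = {"step", "progress", "result"}


def encode_status(step: int, progress: float, result: str) -> str:
    """Build the string that would go into the `data` field of std_msgs/String."""
    return json.dumps({"step": step, "progress": progress, "result": result})


def decode_status(data: str) -> dict:
    """Parse a status string and check that the documented fields are present."""
    status = json.loads(data)
    missing = REQUIRED_FIELDS - status.keys()
    if missing:
        raise ValueError(f"status message is missing fields: {sorted(missing)}")
    return status


# Hypothetical example values for illustration only.
raw = encode_status(step=120, progress=0.4, result="running")
print(decode_status(raw))
```

In an rclpy subscriber callback, the same `decode_status` call would be applied to `msg.data`.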

NL Command Examples

| Input | Mode | Task |
|---|---|---|
| "pick up the red cube" | single_arm | pick up the red cube |
| "reach" | single_arm | reach the red cube |
| "lift the box" | bimanual | pick up the green box with both hands |
| "bimanual grasp" | bimanual | pick up the green box with both hands |

rosbridge (WebSocket for External Systems)

The launch file co-starts rosbridge_server on port 9090, enabling any WebSocket client (RosClaw, JavaScript, Python) to send commands:

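import json
import websocket  # provided by the websocket-client package

# Connect to the rosbridge server started by the launch file.
ws = websocket.create_connection("ws://localhost:9090")
# Publish one std_msgs/String command to the Task Manager.
ws.send(json.dumps({
    "op": "publish",
    "topic": "/vla/task_goal",
    "msg": {"data": "pick up the red cube"}
}))
ws.close()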

The Robot: Unitree G1

| Property | Value |
|---|---|
| DOF | 29 torque-controlled joints |
| Control | PD: $\tau = K_p(q_{des} - q) - K_d\dot{q} + \tau_{gravity}$ |
| Camera | Egocentric RGB, 480×640, torso-mounted |
| Fixed base | Pelvis frozen at z=0.793m |
| Right arm | 7 DOF (shoulder pitch/roll/yaw, elbow, wrist p/r/y) |
| Left arm | 7 DOF (mirror configuration) |
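
The control law in the table can be written out per joint as below. This is a plain-Python sketch with made-up gains; in a MuJoCo loop the gravity term is commonly read from `data.qfrc_bias` (which also includes Coriolis terms), but whether this repo does so is an assumption.

```python
def pd_torque(q_des, q, qd, tau_gravity, kp, kd):
    """Per-joint torque: tau = Kp * (q_des - q) - Kd * qd + tau_gravity."""
    return [
        kp_i * (q_des_i - q_i) - kd_i * qd_i + g_i
        for q_des_i, q_i, qd_i, g_i, kp_i, kd_i
        in zip(q_des, q, qd, tau_gravity, kp, kd)
    ]


# Two joints with hypothetical gains and states.
tau = pd_torque(
    q_des=[0.5, -0.2],
    q=[0.3, -0.2],
    qd=[0.2, 0.0],
    tau_gravity=[1.5, 0.8],
    kp=[100.0, 60.0],
    kd=[5.0, 3.0],
)
print(tau)
```

The second joint is already at its target and at rest, so its command reduces to the gravity term alone; this is what lets the arm hold a pose without a steady-state position error.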

Tasks

| ID | Task | Description | Success Criterion |
|---|---|---|---|
| 0 | Reach | Move hand to the red cube | Hand within 6cm of cube |
| 1 | Grasp | Close hand around the cube | Auto-grasp triggered (hand < 4cm) |
| 2 | Pick | Lift the cube off the table | Cube z > 0.90m while grasped |
| 3 | Place | Move cube to the blue plate | Cube within 6cm of target, released |
| 4 | Bimanual Lift | Lift green box with both hands | Box lifted ≥3cm, dual contact, force ≥2N |
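
The success criteria in the table translate directly into a per-task check. The thresholds below are the ones listed above; the `Observation` container and its field names are hypothetical, and the real evaluation scripts may compute these quantities differently.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Quantities an evaluator would measure at the end of an episode (assumed names)."""
    hand_to_cube_m: float = float("inf")
    grasped: bool = False
    released: bool = False
    cube_z_m: float = 0.0
    cube_to_target_m: float = float("inf")
    box_lift_m: float = 0.0
    left_force_n: float = 0.0
    right_force_n: float = 0.0


def is_success(task_id: int, obs: Observation) -> bool:
    if task_id == 0:  # reach: hand within 6 cm of the cube
        return obs.hand_to_cube_m < 0.06
    if task_id == 1:  # grasp: auto-grasp has triggered
        return obs.grasped
    if task_id == 2:  # pick: cube above 0.90 m while still grasped
        return obs.grasped and obs.cube_z_m > 0.90
    if task_id == 3:  # place: cube within 6 cm of the target and released
        return obs.released and obs.cube_to_target_m < 0.06
    if task_id == 4:  # bimanual: lifted at least 3 cm with at least 2 N on both palms
        both_palms = min(obs.left_force_n, obs.right_force_n) >= 2.0
        return obs.box_lift_m >= 0.03 and both_palms
    raise ValueError(f"unknown task id: {task_id}")


# Mean values reported in the bimanual evaluation section.
mean_bimanual = Observation(box_lift_m=0.085, left_force_n=13.6, right_force_n=13.3)
print(is_success(4, mean_bimanual))
```

Requiring force on both palms rules out episodes where the box is pushed up or dragged by one hand rather than squeezed between the two.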

ACT Model Architecture

Image (480×640×3) ──► ResNet18 (frozen 0-6) ──► AvgPool ──► 512-d ──► Proj ──► 256-d ─┐
                                                                                      │
State (pos + vel) ──► MLP (→256→256) ─────────────────────────────────────────────────│──► Memory
                                                                                      │    (3 tokens)
Task ("pick up..") ──► Embedding ──► 256-d ───────────────────────────────────────────┘
                                                                              │
                                                                ┌─────────────▼──────────────┐
                                                                │  Transformer Decoder       │
                                                                │  4 layers, 4 heads, d=256  │
                                                                │  20 learnable query tokens │
                                                                └─────────────┬──────────────┘
                                                                              │
                                                                Action chunk: (20, action_dim)

| Variant | Params | Trainable | State | Actions | Tasks |
|---|---|---|---|---|---|
| Single-arm | 15.6M | 12.8M | 58 (29+29) | 29 | 4 |
| Bimanual | 15.6M | 12.8M | 28 (14+14) | 14 | 1 |

Chunk size: 20 timesteps (~0.67s). Training: AdamW (lr=1e-4), CosineAnnealing, MSE loss. VRAM: ~1.5GB.
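
For reference, the cosine annealing schedule named above has a simple closed form. The sketch assumes the schedule is stepped once per epoch over all 300 epochs and decays to a minimum of 0, which matches the default of PyTorch's `CosineAnnealingLR`; the actual settings in train_act.py are not stated here.

```python
import math


def cosine_lr(epoch: int, total_epochs: int = 300,
              base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Closed-form cosine annealing from base_lr at epoch 0 to min_lr at total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


for epoch in (0, 75, 150, 225, 300):
    print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.2e}")
```

The rate stays close to 1e-4 early on, passes through half that value at the midpoint, and flattens toward zero in the final epochs.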


Project Structure

humanoid_vla/
├── README.md                          # This file
├── CLAUDE.md                          # Project vision & phase plan
│
├── sim/                               # MuJoCo simulation
│   ├── g1_with_camera.xml             # Scene: G1 + table + objects + cameras
│   ├── models/g1_29dof.xml            # Robot model (29 torque-actuated DOF)
│   └── test_g1.py                     # Standalone sim test
│
├── scripts/                           # Training & evaluation pipeline
│   ├── act_model.py                   # ACT policy architecture + dataset
│   ├── train_act.py                   # Single-arm training
│   ├── train_bimanual.py              # Bimanual training
│   ├── evaluate.py                    # Single-arm evaluation
│   ├── evaluate_bimanual.py           # Bimanual evaluation (contact + lift)
│   ├── generate_demos.py              # Single-arm scripted expert (IK + weld)
│   ├── generate_bimanual_demos.py     # Bimanual demo generator (friction)
│   ├── physics_sim.py                 # Physics wrapper (mj_step, PD, contacts)
│   ├── domain_randomization.py        # Runtime visual domain randomization
│   ├── eval_generalization.py         # OOD generalization evaluation
│   ├── visualize_configs.py           # Render randomization grid
│   ├── visualize_perception_action.py # Trajectory strip visualization
│   ├── live_demo.py                   # Interactive viewer (single-arm)
│   ├── live_bimanual.py               # Interactive viewer (bimanual)
│   ├── record_demo_videos.py          # Generate demo clips for README
│   ├── visualize_demo.py              # Render videos from HDF5 demos
│   └── convert_to_lerobot.py          # LeRobot format converter
│
├── ros2_ws/src/vla_mujoco_bridge/     # ROS2 package
│   ├── vla_mujoco_bridge/
│   │   ├── task_manager_node.py       # VLA Task Manager (NL → ACT → MuJoCo)
│   │   ├── bridge_node.py             # Low-level MuJoCo ↔ ROS2 bridge
│   │   ├── mujoco_sim.py              # Physics engine wrapper
│   │   ├── teleop_node.py             # Full-body keyboard teleop
│   │   ├── arm_teleop_node.py         # Arm-only keyboard teleop
│   │   └── demo_recorder.py           # HDF5 demonstration recorder
│   └── launch/
│       └── vla_system.launch.py       # Launch: rosbridge + task manager
│
├── media/                             # Demo videos (committed to repo)
│   ├── reach.mp4, grasp.mp4           # Individual task demos
│   ├── pick.mp4, place.mp4            # Pick and place demos
│   ├── bimanual.mp4                   # Bimanual box lift demo
│   └── all_tasks.mp4                  # Combined montage
│
├── data/                              # Generated data (gitignored)
│   ├── demos/                         # Single-arm HDF5 episodes
│   ├── checkpoints/                   # Single-arm model weights
│   ├── bimanual_demos/                # Bimanual HDF5 episodes
│   └── bimanual_checkpoints/          # Bimanual model weights
│
├── study/                             # Deep-dive study documents
│   ├── 01_project_deep_dive.md        # MuJoCo, G1, ROS2, camera pipeline
│   ├── 02_scripted_expert_demo_generation.md  # IK, kinematic playback
│   ├── 03_act_training_and_evaluation.md      # ACT training, debugging
│   ├── 04_bimanual_physics_grasping.md        # Physics, PD, friction grasp
│   └── 05_system_integration.md       # Task Manager, rosbridge, NL parsing
│
├── tasks/                             # Project management
│   ├── todo.md                        # Phase tracker with milestones
│   └── lessons.md                     # Engineering lessons (L001-L050)
│
└── logs/                              # Training logs
    └── act_training_300ep.log

Documentation

Study Documents (Deep Dives)

| # | Document | Topics |
|---|---|---|
| 01 | Project Deep Dive | MuJoCo fundamentals, G1 robot, MJCF XML, PD control, gravity comp, ROS2 bridge, threading, camera pipeline, teleoperation, HDF5 format |
| 02 | Scripted Expert Demos | Inverse kinematics (iterative Jacobian), kinematic playback, weld constraint, trajectory design |
| 03 | ACT Training & Evaluation | ACT architecture, action chunking, ResNet18 encoder, task embedding, Transformer decoder, training, evaluation debugging |
| 04 | Bimanual Physics Grasping | mj_step vs mj_forward, PD torque control, contact physics, friction cones, compliance grasping, bimanual coordination |
| 05 | System Integration | ROS2 Task Manager, NL parsing, rosbridge WebSocket, thread-safe execution, temporal ensembling, full data flow |
| 06 | Domain Randomization | Visual augmentation, position noise, posture variation, generalization evaluation, ablation study |

Engineering Lessons

tasks/lessons.md contains 50 concise lessons learned:

  • L001–L008: Environment setup (torque actuators, meshdir, ROS2 Jazzy)
  • L009–L012: Phase B infrastructure (gravity comp, setuptools, cv_bridge)
  • L013–L016: Demo generation (ctrlrange, arm reach, kinematic IK, weld)
  • L017–L023: ACT training (standalone, action chunking, frozen ResNet, auto-grasp)
  • L024–L027: Evaluation (temporal ensembling, hierarchical decomposition, re-grasp)
  • L028–L033: Bimanual physics (leg drift, palm pad, IK, compliance grasping)
  • L034–L040: ROS2 integration (String+JSON, daemon threads, launch files)
  • L041–L050: Domain randomization (memorization, IK validation, progressive difficulty)

Development Phases

| Phase | Status | Duration | Summary |
|---|---|---|---|
| A: Sim + ROS2 | ✅ | 2 weeks | MuJoCo + G1 + camera + ROS2 bridge + teleop |
| B: Demo Generation | ✅ | 1 week | Scripted expert demos, IK pipeline, 80 episodes |
| C: ACT Training | ✅ | 2 weeks | 4-task ACT model, 86.2% success rate |
| C2: Bimanual | ✅ | 2 weeks | Physics-based bimanual grasping, 100% success |
| D: Integration | ✅ | 1 week | ROS2 Task Manager, NL commands, rosbridge |
| E: Polish | ✅ | 1 week | Demo videos, documentation, study docs |
| F: Generalization | ✅ | 1 week | Domain randomization, OOD evaluation, 90% in-dist / 55% OOD |

Hardware Requirements

| Component | Minimum | Tested On |
|---|---|---|
| GPU | NVIDIA with CUDA, 4GB+ VRAM | RTX 4050 Laptop (6GB) |
| RAM | 16 GB | 33 GB |
| OS | Ubuntu 22.04 or 24.04 | Ubuntu 24.04 |
| CUDA | 12.x | 12.8 |
| ROS2 | Humble or Jazzy | Jazzy |

Key References

Papers

  • ACT: Zhao et al., "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware", RSS 2023
  • GR00T N1: NVIDIA, "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots", 2025

Repositories


License

MIT
