Humanoid VLA — Vision-Language-Action Controlled Humanoid Robot

TL;DR — A natural-language-commanded Unitree G1 humanoid built on MuJoCo + ACT + ROS 2. Trained ACT (Action Chunking with Transformers) policies achieve 89% combined success across 5 manipulation tasks (89/100 episodes), including 100% on physics-based bimanual grasping. An OOD generalization study shows graceful degradation from 90% in-distribution to 55% under combined distribution shift.

Key Results

Task success

Track	Success	Episodes
Single-arm (4 tasks)	86.2%	69 / 80
Bimanual physics grasp	100%	20 / 20
Combined	89%	89 / 100

OOD generalization (bimanual)

Test distribution	Success
In-distribution	90%
OOD position (1.5×)	80%
OOD visual	70%
OOD posture (1.5×)	60%
OOD combined	55%

Author: Ozkan Ceylan · Full Report: PROJECT_REPORT.md

Demo Videos

Single-Arm Manipulation (4 Tasks: Reach + Grasp + Pick Up + Place)

Full videos: reach · grasp · pick · place · combined

Bimanual Physics-Based Grasping

Bimanual Box Lift — Both hands squeeze box via friction only (no weld constraints), full mj_step dynamics

Full video

Videos show side-by-side overview camera (left) and robot's egocentric view (right). The ACT model receives only the ego camera image as visual input.

Architecture

User: "Pick up the red cube"
  │
  ├─── Telegram → OpenClaw (Charlie) → RosClaw ───┐
  │         (natural language interface)           │
  │                                                ▼
  │                              ┌─────────────────────────────┐
  │                              │  rosbridge (WebSocket:9090) │
  │                              └────────────┬────────────────┘
  │                                           ▼
  │              ┌────────────────────────────────────────────┐
  └─────────────►│  VLA Task Manager (ROS2 Node)              │
                 │                                            │
                 │  NL Parser: "pick up..." → single_arm,     │
                 │             task_id=2                       │
                 │                                            │
                 │  30Hz Control Loop:                        │
                 │    Camera (480×640 RGB) ──► ACT Model ──►  │
                 │    Joint State (58-d)   ──►  (15.6M)  ──►  │
                 │    Task Embedding       ──►           ──►  │
                 │                         20-step action chunk│
                 └────────────────┬───────────────────────────┘
                                  │ Joint commands
                 ┌────────────────▼───────────────────────────┐
                 │  MuJoCo Simulation                         │
                 │  Unitree G1 (29 DOF) + table + objects     │
                 │  Egocentric camera → 480×640 RGB           │
                 │  Physics: 500Hz (mj_step) / Kinematic      │
                 └────────────────────────────────────────────┘

Key insight: The VLA model runs in a tight 30Hz control loop (camera → action). RosClaw/OpenClaw operates at the task dispatch level — it sends the command once and monitors completion.

Quick Start

Prerequisites

Ubuntu 24.04 (tested), or Ubuntu 22.04
NVIDIA GPU with CUDA (RTX 4050 6GB VRAM is sufficient)
Python 3.12+, ROS2 Jazzy (or Humble)

Installation

# 1. Clone
git clone https://github.com/ozkanceylan/humanoid_vla.git
cd humanoid_vla

# 2. Install ROS2
chmod +x install_ros2.sh && ./install_ros2.sh

# 3. Python dependencies
pip3 install --break-system-packages -r requirements.txt

# 4. Robot models
cd repos
git clone https://github.com/unitreerobotics/unitree_mujoco
git clone https://github.com/google-deepmind/mujoco_menagerie
cd ..

# 5. Build ROS2 workspace
source /opt/ros/jazzy/setup.bash
cd ros2_ws && colcon build --symlink-install && cd ..
source ros2_ws/install/setup.bash

Full Pipeline (Train → Evaluate → Demo)

# 1. Generate training data (80 single-arm + 30 bimanual demos)
MUJOCO_GL=egl python3 scripts/generate_demos.py --all-tasks --episodes 20
MUJOCO_GL=egl python3 scripts/generate_bimanual_demos.py --episodes 30

# 2. Train ACT models (~2.5 hours total on RTX 4050)
python3 scripts/train_act.py --demos data/demos --epochs 300 --batch-size 32
python3 scripts/train_bimanual.py --epochs 300

# 3. Evaluate
MUJOCO_GL=egl python3 scripts/evaluate.py --checkpoint data/checkpoints/best.pt --episodes 20
MUJOCO_GL=egl python3 scripts/evaluate_bimanual.py --checkpoint data/bimanual_checkpoints/best.pt --episodes 20

# 4. Interactive demos (opens MuJoCo viewer)
python3 scripts/live_demo.py --checkpoint data/checkpoints/best.pt
python3 scripts/live_bimanual.py --checkpoint data/bimanual_checkpoints/best.pt

# 5. Record demo videos
MUJOCO_GL=egl python3 scripts/record_demo_videos.py

Run via ROS2 (Natural Language Interface)

# Terminal 1: Launch full VLA system (task manager + rosbridge)
ros2 launch vla_mujoco_bridge vla_system.launch.py

# Terminal 2: Send natural language commands
ros2 topic pub --once /vla/task_goal std_msgs/String "data: 'pick up the red cube'"
ros2 topic pub --once /vla/task_goal std_msgs/String "data: 'pick up the green box with both hands'"

# Terminal 3: Monitor status (JSON)
ros2 topic echo /vla/status

Evaluation Results

Single-Arm Manipulation (Phase C)

Trained for 300 epochs (93 min on RTX 4050, final loss: 0.000009). Evaluated with temporal ensembling and hierarchical task decomposition:

Task	Success	Rate
Reach the red cube	20/20	100%
Grasp the red cube	18/20	90%
Pick up the red cube	18/20	90%
Place the red cube on blue plate	13/20	65%
Overall	69/80	86.2%

Bimanual Physics-Based Grasping (Phase C2)

Both hands squeeze a 20×15×15 cm box using friction only: no weld constraints, full mj_step dynamics, and PD torque control with gravity compensation.

Metric	Value
Success rate	20/20 (100%)
Lift	mean=8.5cm, min=6.5cm, max=10.4cm
Contact force	L=13.6N, R=13.3N (bilateral)
Physics	`mj_step` at 500Hz, control at 30Hz
Training	300 epochs, 52 min, loss=0.000009

Combined: 5 Tasks, 89/100 (89%)
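
The combined figure is the pooled episode count across both tracks, not an average of the per-task rates. A short check of that arithmetic, using only the numbers reported in the tables above (the variable and function names are illustrative, not taken from the repo):

```python
# Per-task (successes, episodes) as reported in the two result tables above.
results = {
    "reach": (20, 20),
    "grasp": (18, 20),
    "pick": (18, 20),
    "place": (13, 20),
    "bimanual_lift": (20, 20),
}


def pooled_rate(items):
    """Pool successes and episodes across tasks, then return the percentage."""
    items = list(items)
    wins = sum(s for s, _ in items)
    total = sum(n for _, n in items)
    return wins, total, 100.0 * wins / total


single_arm = [v for k, v in results.items() if k != "bimanual_lift"]
single_wins, single_total, single_pct = pooled_rate(single_arm)
all_wins, all_total, all_pct = pooled_rate(results.values())

print(f"single-arm: {single_wins}/{single_total} = {single_pct:.2f}%")
print(f"combined:   {all_wins}/{all_total} = {all_pct:.2f}%")
```

The single-arm rate works out to 86.25%, which the tables report as 86.2%.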

Key inference techniques:

Temporal ensembling — overlapping action chunks with exponential decay weighting
Hierarchical task decomposition — composite tasks switch task embedding at grasp trigger
Re-grasp prevention — released flag prevents re-triggering after intentional release
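
Temporal ensembling can be sketched in a few lines. The version below is a minimal illustration, not the code in scripts/evaluate.py: it assumes the weighting convention from the ACT paper (weight exp(-m·i), with i = 0 for the oldest prediction), and the decay value and the `chunks` data layout are assumptions made for the example.

```python
import math


def ensemble_action(chunks, t, decay=0.01):
    """Blend every prediction that was made for timestep t.

    chunks maps the step at which a chunk was predicted to that chunk,
    a list of per-step actions (each action is a list of floats).
    """
    candidates = [
        chunk[t - start]
        for start, chunk in sorted(chunks.items())
        if 0 <= t - start < len(chunk)
    ]
    if not candidates:
        raise ValueError(f"no chunk covers timestep {t}")
    # Index 0 is the oldest prediction and receives the largest weight.
    weights = [math.exp(-decay * i) for i in range(len(candidates))]
    total = sum(weights)
    dim = len(candidates[0])
    return [
        sum(w * c[d] for w, c in zip(weights, candidates)) / total
        for d in range(dim)
    ]


# Two overlapping 3-step chunks of 1-d actions, predicted at steps 0 and 1.
demo_chunks = {0: [[0.0], [0.0], [0.0]], 1: [[1.0], [1.0], [1.0]]}
blended = ensemble_action(demo_chunks, t=1, decay=0.0)  # equal weights
```

With a chunk of 20 steps and a new chunk predicted every control tick, up to 20 predictions overlap at each timestep, which smooths the executed trajectory at the cost of slower reaction to new observations.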

Generalization (Phase F — Domain Randomization)

Bimanual model trained with position randomization (±5 cm), randomized arm start postures, and visual domain randomization. Evaluated on out-of-distribution conditions never seen during training:

Test Distribution	Success	Description
In-distribution	90%	Same ranges as training, different seeds
OOD Position (1.5x)	80%	Wider object positions
OOD Visual	70%	Novel table/light colors
OOD Posture (1.5x)	60%	Wider starting arm configs
OOD Combined	55%	All OOD factors simultaneously

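Success falls gradually, from 90% in-distribution to 55% when all shifts are applied together, rather than collapsing under any single shift. Posture shift is the most damaging single factor (60%), followed by visual shift (70%) and position shift (80%).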
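
To make the size of each drop concrete, the sketch below computes the absolute loss in percentage points and the share of in-distribution success that is retained under each shift. It uses only the rates from the table above; the dictionary keys are illustrative names.

```python
# Success rates (%) from the generalization table above.
rates = {
    "in_distribution": 90,
    "ood_position": 80,
    "ood_visual": 70,
    "ood_posture": 60,
    "ood_combined": 55,
}

baseline = rates["in_distribution"]

# name -> (percentage points lost, % of baseline success retained)
summary = {
    name: (baseline - rate, round(100.0 * rate / baseline, 1))
    for name, rate in rates.items()
}

for name, (drop, retained) in summary.items():
    print(f"{name:16s} drop={drop:2d} pts  retained={retained:5.1f}%")
```

Under the combined shift the model loses 35 points and retains roughly 61% of its in-distribution success.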

Skills Demonstrated

ML for Manipulation

ACT (Action Chunking with Transformers) policy, 15.6M params, ~0.67s chunks
Frozen ResNet18 (layers 0–6) visual encoder + state + task-embedding fusion
Temporal ensembling: overlapping chunks with exponential decay weighting
Hierarchical task decomposition: composite tasks switch embeddings at grasp trigger
Re-grasp prevention via released latch in evaluation loop
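
The task-embedding switch and the released latch fit into one small state machine. This is a sketch under stated assumptions rather than the evaluation loop itself: the class name, the method signature, and the choice to keep the post-grasp embedding after release are all hypothetical. The 4 cm trigger radius and the task IDs come from the Tasks table.

```python
from dataclasses import dataclass


@dataclass
class TaskSwitcher:
    """Chooses which task embedding to feed the policy during a composite task."""

    pre_grasp_task: int     # embedding used while approaching (e.g. 1 = grasp)
    post_grasp_task: int    # embedding used once the object is held (e.g. 3 = place)
    grasped: bool = False
    released: bool = False  # latch: once set, auto-grasp never fires again

    def update(self, hand_to_object_m, grasp_radius_m=0.04, release=False):
        if release and self.grasped:
            # Intentional release: drop the object and set the latch.
            self.grasped = False
            self.released = True
        elif (not self.grasped and not self.released
              and hand_to_object_m < grasp_radius_m):
            # Proximity trigger fires only before the first release.
            self.grasped = True
        if self.grasped or self.released:
            return self.post_grasp_task
        return self.pre_grasp_task


switcher = TaskSwitcher(pre_grasp_task=1, post_grasp_task=3)
```

Without the latch, the hand would still be within the trigger radius right after placing the cube, and the auto-grasp would immediately pick it up again.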

Robot Control & Physics

PD torque control with per-joint gravity compensation on Unitree G1 (29 DOF)
Friction-only bimanual grasping under full mj_step dynamics — no weld constraints
Iterative Jacobian IK for scripted expert demo generation (single + bimanual)
Auto-grasp trigger on hand-to-object proximity for closed-loop manipulation
Bilateral contact force monitoring (≥2N per palm, ~13N measured in eval)

System Integration

ROS 2 Jazzy Task Manager node: NL command → ACT inference → MuJoCo at 30Hz
rosbridge WebSocket server (port 9090) for external clients (Telegram, JS, Python)
Natural language parser routes "pick up" / "lift box" → task ID + arm mode
Thread-safe inference loop with daemon threads and ROS2 status publishing
JSON-encoded /vla/status topic streaming step, progress, and result fields

Evaluation & Methodology

OOD generalization study with progressive difficulty (position / visual / posture / combined)
Runtime visual domain randomization (table color, lighting) during training
Each OOD condition compared against an in-distribution baseline evaluated on seeds not used in training
50 documented engineering lessons (tasks/lessons.md, L001–L050)
Six deep-dive study documents covering each subsystem end-to-end

Simulation Engineering

MuJoCo MJCF authoring: G1 + table + cameras + objects in sim/g1_with_camera.xml
500Hz mj_step physics decoupled from 30Hz control loop
Egocentric 480×640 RGB camera pipeline rendered headless via EGL
HDF5 demo recording + LeRobot format converter for portability
Two interactive viewers (live_demo.py, live_bimanual.py) for qualitative inspection
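
Since 500 / 30 is not an integer, decoupling the physics rate from the control rate needs some bookkeeping. The sketch below shows one way to do it, with integer arithmetic so the long-run ratio stays exact. It is an assumption about how such a loop could be scheduled, not a description of physics_sim.py, which may simply use a fixed number of substeps.

```python
PHYSICS_HZ = 500
CONTROL_HZ = 30


def substeps_per_tick(n_ticks, physics_hz=PHYSICS_HZ, control_hz=CONTROL_HZ):
    """Return how many physics steps to run on each of n_ticks control ticks."""
    counts = []
    done = 0
    for tick in range(1, n_ticks + 1):
        # Total physics steps that should have run by the end of this tick.
        target = tick * physics_hz // control_hz
        counts.append(target - done)
        done = target
    return counts


# One second of control: each entry is the number of mj_step calls
# that would follow one policy action.
counts = substeps_per_tick(CONTROL_HZ)
print(counts[:6], sum(counts))
```

Each control tick runs either 16 or 17 physics steps, and one second of control always adds up to exactly 500.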

ROS2 Integration (Phase D)

The VLA Task Manager accepts natural language commands via ROS2 topics and runs ACT inference in a closed-loop MuJoCo simulation.

ROS2 Interfaces

Direction	Topic	Type	Purpose
Subscribe	`/vla/task_goal`	`std_msgs/String`	Natural language command
Publish	`/vla/status`	`std_msgs/String`	JSON: step, progress, result
Publish	`/camera/image_raw`	`sensor_msgs/Image`	Ego camera during execution
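
Because /vla/status carries JSON inside a std_msgs/String, a subscriber has to decode the `data` field itself. A minimal sketch of that decoding follows. The field names (step, progress, result) come from the table above; the example values and the helper name are made up for illustration.

```python
import json


def parse_status(payload):
    """Decode the data string of one /vla/status message into a dict."""
    status = json.loads(payload)
    # Missing fields come back as None instead of raising KeyError.
    return {
        "step": status.get("step"),
        "progress": status.get("progress"),
        "result": status.get("result"),
    }


# Hypothetical payload: the values are invented for this example.
example = '{"step": 42, "progress": 0.35, "result": null}'
parsed = parse_status(example)
print(parsed)
```

In an rclpy subscriber callback, the same function would be applied to `msg.data`.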

NL Command Examples

Input	Mode	Task
"pick up the red cube"	single_arm	pick up the red cube
"reach"	single_arm	reach the red cube
"lift the box"	bimanual	pick up the green box with both hands
"bimanual grasp"	bimanual	pick up the green box with both hands

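The routing in the table above can be approximated with plain keyword matching. The sketch below is an assumption about how such a parser might look, not the code in task_manager_node.py. The keyword lists are illustrative, and the task IDs follow the Tasks table (0 reach, 1 grasp, 2 pick, 3 place, 4 bimanual lift).

```python
def parse_command(text):
    """Map a free-form command to (mode, task_id) by keyword matching."""
    words = text.lower()
    # Bimanual keywords are checked first so that "bimanual grasp"
    # is not routed to the single-arm grasp task.
    if any(k in words for k in ("both hands", "bimanual", "lift the box", "lift box")):
        return "bimanual", 4
    for keyword, task_id in (("place", 3), ("pick", 2), ("grasp", 1), ("reach", 0)):
        if keyword in words:
            return "single_arm", task_id
    raise ValueError(f"unrecognized command: {text!r}")


for command in ("pick up the red cube", "reach", "lift the box", "bimanual grasp"):
    print(command, "->", parse_command(command))
```

Keyword matching keeps dispatch deterministic and cheap, at the cost of rejecting phrasings that are not in the list.
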
rosbridge (WebSocket for External Systems)

The launch file co-starts rosbridge_server on port 9090, enabling any WebSocket client (RosClaw, JavaScript, Python) to send commands:

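import json

import websocket  # provided by the websocket-client package

ws = websocket.create_connection("ws://localhost:9090")
# Advertise the topic type first so the publish does not rely on type inference.
ws.send(json.dumps({"op": "advertise", "topic": "/vla/task_goal", "type": "std_msgs/msg/String"}))
ws.send(json.dumps({
    "op": "publish",
    "topic": "/vla/task_goal",
    "msg": {"data": "pick up the red cube"},
}))
ws.close()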

The Robot: Unitree G1

Property	Value
DOF	29 torque-controlled joints
Control	PD: $\tau = K_p(q_{des} - q) - K_d\dot{q} + \tau_{gravity}$
Camera	Egocentric RGB, 480×640, torso-mounted
Fixed base	Pelvis frozen at z=0.793m
Right arm	7 DOF (shoulder pitch/roll/yaw, elbow, wrist p/r/y)
Left arm	7 DOF (mirror configuration)
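
The PD law in the table can be written out per joint as follows. This is a sketch of the formula only: the function name and the gains in the example are made up, and the gravity term is passed in as a plain list. In MuJoCo a common choice is to read it from `data.qfrc_bias`, which also includes Coriolis and centrifugal terms; whether the repo does that is an assumption.

```python
def pd_torques(q_des, q, qd, tau_gravity, kp, kd):
    """tau = Kp * (q_des - q) - Kd * qd + tau_gravity, evaluated per joint."""
    return [
        kp_i * (q_des_i - q_i) - kd_i * qd_i + g_i
        for q_des_i, q_i, qd_i, g_i, kp_i, kd_i
        in zip(q_des, q, qd, tau_gravity, kp, kd)
    ]


# Two-joint example with made-up gains and gravity torques.
tau = pd_torques(
    q_des=[0.5, 0.0],
    q=[0.4, 0.1],
    qd=[0.2, 0.0],
    tau_gravity=[1.0, -0.5],
    kp=[40.0, 40.0],
    kd=[2.0, 2.0],
)
print(tau)
```

The gravity feedforward means the PD terms only have to correct tracking error, so the arm holds its pose without a steady-state droop.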

Tasks

ID	Task	Description	Success Criterion
0	Reach	Move hand to the red cube	Hand within 6cm of cube
1	Grasp	Close hand around the cube	Auto-grasp triggered (hand < 4cm)
2	Pick	Lift the cube off the table	Cube z > 0.90m while grasped
3	Place	Move cube to the blue plate	Cube within 6cm of target, released
4	Bimanual Lift	Lift green box with both hands	Box lifted ≥3 cm, contact on both palms, ≥2 N force per palm
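
As an illustration of how a criterion from this table turns into a check, here is a sketch for the bimanual lift. The thresholds (3 cm lift, 2 N per palm) come from the table; the function name, its arguments, and the starting height used in the example are hypothetical.

```python
def bimanual_lift_success(box_z, box_z_start, left_force_n, right_force_n,
                          min_lift_m=0.03, min_force_n=2.0):
    """True if the box rose by min_lift_m while both palms press with min_force_n."""
    lifted = (box_z - box_z_start) >= min_lift_m
    dual_contact = left_force_n >= min_force_n and right_force_n >= min_force_n
    return lifted and dual_contact


# Mean lift (8.5 cm) and contact forces reported in the bimanual results,
# with a made-up starting height of 0.80 m.
ok = bimanual_lift_success(0.885, 0.80, left_force_n=13.6, right_force_n=13.3)
print(ok)
```

Requiring force on both palms rules out episodes where the box is pushed or scooped up by one hand instead of being squeezed.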

ACT Model Architecture

Image (480×640×3) ──► ResNet18 (frozen 0-6) ──► AvgPool ──► 512-d ──► Proj ──► 256-d ─┐
                                                                                         │
State (pos + vel) ──► MLP (→256→256) ──────────────────────────────────────────────────│──► Memory
                                                                                         │    (3 tokens)
Task ("pick up..") ──► Embedding ──► 256-d ────────────────────────────────────────────┘
                                                                              │
                                                                ┌─────────────▼──────────────┐
                                                                │  Transformer Decoder        │
                                                                │  4 layers, 4 heads, d=256   │
                                                                │  20 learnable query tokens   │
                                                                └─────────────┬──────────────┘
                                                                              │
                                                                Action chunk: (20, action_dim)

Variant	Params	Trainable	State	Actions	Tasks
Single-arm	15.6M	12.8M	58 (29+29)	29	4
Bimanual	15.6M	12.8M	28 (14+14)	14	1

Chunk size: 20 timesteps (~0.67s). Training: AdamW (lr=1e-4), CosineAnnealing, MSE loss. VRAM: ~1.5GB.
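
The two numbers in that line can be reproduced directly. The sketch below computes the chunk duration at 30 Hz and the closed-form cosine annealing curve for the learning rate. The total of 300 epochs and the base rate of 1e-4 come from this README; annealing once per epoch down to a minimum of 0 is an assumption, since the scheduler arguments are not stated here.

```python
import math

CHUNK_STEPS = 20
CONTROL_HZ = 30

# 20 actions executed at 30 Hz.
chunk_seconds = CHUNK_STEPS / CONTROL_HZ


def cosine_lr(epoch, total_epochs=300, base_lr=1e-4, min_lr=0.0):
    """Cosine annealing from base_lr at epoch 0 to min_lr at total_epochs."""
    cos_term = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos_term)


print(f"chunk duration: {chunk_seconds:.2f} s")
for epoch in (0, 150, 300):
    print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.2e}")
```

Under these assumptions the rate is halved at the midpoint of training and reaches the minimum at the final epoch.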

Project Structure

humanoid_vla/
├── README.md                          # This file
├── CLAUDE.md                          # Project vision & phase plan
│
├── sim/                               # MuJoCo simulation
│   ├── g1_with_camera.xml             # Scene: G1 + table + objects + cameras
│   ├── models/g1_29dof.xml            # Robot model (29 torque-actuated DOF)
│   └── test_g1.py                     # Standalone sim test
│
├── scripts/                           # Training & evaluation pipeline
│   ├── act_model.py                   # ACT policy architecture + dataset
│   ├── train_act.py                   # Single-arm training
│   ├── train_bimanual.py              # Bimanual training
│   ├── evaluate.py                    # Single-arm evaluation
│   ├── evaluate_bimanual.py           # Bimanual evaluation (contact + lift)
│   ├── generate_demos.py              # Single-arm scripted expert (IK + weld)
│   ├── generate_bimanual_demos.py     # Bimanual demo generator (friction)
│   ├── physics_sim.py                 # Physics wrapper (mj_step, PD, contacts)
│   ├── domain_randomization.py        # Runtime visual domain randomization
│   ├── eval_generalization.py         # OOD generalization evaluation
│   ├── visualize_configs.py           # Render randomization grid
│   ├── visualize_perception_action.py # Trajectory strip visualization
│   ├── live_demo.py                   # Interactive viewer (single-arm)
│   ├── live_bimanual.py               # Interactive viewer (bimanual)
│   ├── record_demo_videos.py          # Generate demo clips for README
│   ├── visualize_demo.py              # Render videos from HDF5 demos
│   └── convert_to_lerobot.py          # LeRobot format converter
│
├── ros2_ws/src/vla_mujoco_bridge/     # ROS2 package
│   ├── vla_mujoco_bridge/
│   │   ├── task_manager_node.py       # VLA Task Manager (NL → ACT → MuJoCo)
│   │   ├── bridge_node.py             # Low-level MuJoCo ↔ ROS2 bridge
│   │   ├── mujoco_sim.py              # Physics engine wrapper
│   │   ├── teleop_node.py             # Full-body keyboard teleop
│   │   ├── arm_teleop_node.py         # Arm-only keyboard teleop
│   │   └── demo_recorder.py           # HDF5 demonstration recorder
│   └── launch/
│       └── vla_system.launch.py       # Launch: rosbridge + task manager
│
├── media/                             # Demo videos (committed to repo)
│   ├── reach.mp4, grasp.mp4           # Individual task demos
│   ├── pick.mp4, place.mp4            # Pick and place demos
│   ├── bimanual.mp4                   # Bimanual box lift demo
│   └── all_tasks.mp4                  # Combined montage
│
├── data/                              # Generated data (gitignored)
│   ├── demos/                         # Single-arm HDF5 episodes
│   ├── checkpoints/                   # Single-arm model weights
│   ├── bimanual_demos/                # Bimanual HDF5 episodes
│   └── bimanual_checkpoints/          # Bimanual model weights
│
├── study/                             # Deep-dive study documents
│   ├── 01_project_deep_dive.md        # MuJoCo, G1, ROS2, camera pipeline
│   ├── 02_scripted_expert_demo_generation.md  # IK, kinematic playback
│   ├── 03_act_training_and_evaluation.md      # ACT training, debugging
│   ├── 04_bimanual_physics_grasping.md        # Physics, PD, friction grasp
│   └── 05_system_integration.md       # Task Manager, rosbridge, NL parsing
│
├── tasks/                             # Project management
│   ├── todo.md                        # Phase tracker with milestones
│   └── lessons.md                     # Engineering lessons (L001-L050)
│
└── logs/                              # Training logs
    └── act_training_300ep.log

Documentation

Study Documents (Deep Dives)

#	Document	Topics
01	Project Deep Dive	MuJoCo fundamentals, G1 robot, MJCF XML, PD control, gravity comp, ROS2 bridge, threading, camera pipeline, teleoperation, HDF5 format
02	Scripted Expert Demos	Inverse kinematics (iterative Jacobian), kinematic playback, weld constraint, trajectory design
03	ACT Training & Evaluation	ACT architecture, action chunking, ResNet18 encoder, task embedding, Transformer decoder, training, evaluation debugging
04	Bimanual Physics Grasping	mj_step vs mj_forward, PD torque control, contact physics, friction cones, compliance grasping, bimanual coordination
05	System Integration	ROS2 Task Manager, NL parsing, rosbridge WebSocket, thread-safe execution, temporal ensembling, full data flow
06	Domain Randomization	Visual augmentation, position noise, posture variation, generalization evaluation, ablation study

Engineering Lessons

tasks/lessons.md — 50 concise lessons learned:

L001–L008: Environment setup (torque actuators, meshdir, ROS2 Jazzy)
L009–L012: Phase B infrastructure (gravity comp, setuptools, cv_bridge)
L013–L016: Demo generation (ctrlrange, arm reach, kinematic IK, weld)
L017–L023: ACT training (standalone, action chunking, frozen ResNet, auto-grasp)
L024–L027: Evaluation (temporal ensembling, hierarchical decomposition, re-grasp)
L028–L033: Bimanual physics (leg drift, palm pad, IK, compliance grasping)
L034–L040: ROS2 integration (String+JSON, daemon threads, launch files)
L041–L050: Domain randomization (memorization, IK validation, progressive difficulty)

Development Phases

Phase	Status	Duration	Summary
A — Sim + ROS2	✅	2 weeks	MuJoCo + G1 + camera + ROS2 bridge + teleop
B — Demo Generation	✅	1 week	Scripted expert demos, IK pipeline, 80 episodes
C — ACT Training	✅	2 weeks	4-task ACT model, 86.2% success rate
C2 — Bimanual	✅	2 weeks	Physics-based bimanual grasping, 100% success
D — Integration	✅	1 week	ROS2 Task Manager, NL commands, rosbridge
E — Polish	✅	1 week	Demo videos, documentation, study docs
F — Generalization	✅	1 week	Domain randomization, OOD evaluation, 90% in-distribution / 55% combined OOD

Hardware Requirements

Component	Minimum	Tested On
GPU	NVIDIA with CUDA, 4GB+ VRAM	RTX 4050 Laptop (6GB)
RAM	16 GB	33 GB
OS	Ubuntu 22.04 or 24.04	Ubuntu 24.04
CUDA	12.x	12.8
ROS2	Humble or Jazzy	Jazzy

Key References

Papers

ACT: Zhao et al., "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware", RSS 2023
GR00T N1: NVIDIA, "An Open Foundation Model for Humanoid Robots", 2025

Repositories

unitree_mujoco — G1/H1 simulation
mujoco_menagerie — Robot models
lerobot — Robot learning framework

License

MIT
