TL;DR: A natural-language-commanded Unitree G1 humanoid built on MuJoCo + ACT + ROS 2. Trained ACT (Action Chunking with Transformers) policies achieve 89% combined success across 5 manipulation tasks (89/100 episodes), including 100% on physics-based bimanual grasping. An OOD generalization study shows graceful degradation from 90% in-distribution to 55% under combined distribution shift.
Figures: Task success · OOD generalization (bimanual)
Author: Ozkan Ceylan · Full Report: PROJECT_REPORT.md

Full videos: reach · grasp · pick · place · combined
Bimanual Box Lift: both hands squeeze the box via friction only (no weld constraints) under full mj_step dynamics (full video).
Videos show side-by-side overview camera (left) and robot's egocentric view (right). The ACT model receives only the ego camera image as visual input.
User: "Pick up the red cube"
         │
┌─── Telegram → OpenClaw (Charlie) → RosClaw ───┐
│         (natural language interface)          │
│                                               ▼
│                                 ┌────────────────────────────┐
│                                 │ rosbridge (WebSocket:9090) │
│                                 └─────────────┬──────────────┘
│                                               ▼
│              ┌────────────────────────────────────────────┐
└─────────────►│ VLA Task Manager (ROS2 Node)               │
               │                                            │
               │ NL Parser: "pick up..." → single_arm,      │
               │            task_id=2                       │
               │                                            │
               │ 30Hz Control Loop:                         │
               │ Camera (480×640 RGB) ──► ACT Model ──►     │
               │ Joint State (58-d)   ──► (15.6M)   ──►     │
               │ Task Embedding       ──►           ──►     │
               │                      20-step action chunk  │
               └────────────────┬───────────────────────────┘
                                │ Joint commands
               ┌────────────────▼───────────────────────────┐
               │ MuJoCo Simulation                          │
               │ Unitree G1 (29 DOF) + table + objects      │
               │ Egocentric camera → 480×640 RGB            │
               │ Physics: 500Hz (mj_step) / Kinematic       │
               └────────────────────────────────────────────┘
Key insight: The VLA model runs in a tight 30Hz control loop (camera → action). RosClaw/OpenClaw operates at the task-dispatch level: it sends the command once, then monitors completion.
- Ubuntu 24.04 (tested), or Ubuntu 22.04
- NVIDIA GPU with CUDA (RTX 4050 6GB VRAM is sufficient)
- Python 3.12+, ROS2 Jazzy (or Humble)
# 1. Clone
git clone https://github.com/ozkanceylan/humanoid_vla.git
cd humanoid_vla
# 2. Install ROS2
chmod +x install_ros2.sh && ./install_ros2.sh
# 3. Python dependencies
pip3 install --break-system-packages -r requirements.txt
# 4. Robot models
cd repos
git clone https://github.com/unitreerobotics/unitree_mujoco
git clone https://github.com/google-deepmind/mujoco_menagerie
cd ..
# 5. Build ROS2 workspace
source /opt/ros/jazzy/setup.bash
cd ros2_ws && colcon build --symlink-install && cd ..
source ros2_ws/install/setup.bash
# 1. Generate training data (80 single-arm + 30 bimanual demos)
MUJOCO_GL=egl python3 scripts/generate_demos.py --all-tasks --episodes 20
MUJOCO_GL=egl python3 scripts/generate_bimanual_demos.py --episodes 30
# 2. Train ACT models (~2.5 hours total on RTX 4050)
python3 scripts/train_act.py --demos data/demos --epochs 300 --batch-size 32
python3 scripts/train_bimanual.py --epochs 300
# 3. Evaluate
MUJOCO_GL=egl python3 scripts/evaluate.py --checkpoint data/checkpoints/best.pt --episodes 20
MUJOCO_GL=egl python3 scripts/evaluate_bimanual.py --checkpoint data/bimanual_checkpoints/best.pt --episodes 20
# 4. Interactive demos (opens MuJoCo viewer)
python3 scripts/live_demo.py --checkpoint data/checkpoints/best.pt
python3 scripts/live_bimanual.py --checkpoint data/bimanual_checkpoints/best.pt
# 5. Record demo videos
MUJOCO_GL=egl python3 scripts/record_demo_videos.py
# Terminal 1: Launch full VLA system (task manager + rosbridge)
ros2 launch vla_mujoco_bridge vla_system.launch.py
# Terminal 2: Send natural language commands
ros2 topic pub --once /vla/task_goal std_msgs/String "data: 'pick up the red cube'"
ros2 topic pub --once /vla/task_goal std_msgs/String "data: 'pick up the green box with both hands'"
# Terminal 3: Monitor status (JSON)
ros2 topic echo /vla/status
The single-arm model was trained for 300 epochs (93 min on RTX 4050, final loss: 0.000009) and evaluated with temporal ensembling and hierarchical task decomposition:
| Task | Success | Rate |
|---|---|---|
| Reach the red cube | 20/20 | 100% |
| Grasp the red cube | 18/20 | 90% |
| Pick up the red cube | 18/20 | 90% |
| Place the red cube on blue plate | 13/20 | 65% |
| Overall | 69/80 | 86.2% |
Both hands squeeze a 20×15×15 cm box using friction only, with no weld constraints, full mj_step dynamics, and PD torque control plus gravity compensation:
| Metric | Value |
|---|---|
| Success rate | 20/20 (100%) |
| Lift | mean=8.5cm, min=6.5cm, max=10.4cm |
| Contact force | L=13.6N, R=13.3N (bilateral) |
| Physics | mj_step at 500Hz, control at 30Hz |
| Training | 300 epochs, 52 min, loss=0.000009 |
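The 500Hz physics rate and the 30Hz control rate in the table do not divide evenly (500/30 ≈ 16.67), so the number of physics steps per control tick has to vary. The sketch below shows one way to schedule this with integer arithmetic so the two clocks never drift apart. It is an illustration only: the function name is hypothetical, and `physics_sim.py` may handle the ratio differently.

```python
PHYSICS_HZ = 500  # physics rate, from the table above
CONTROL_HZ = 30   # control rate, from the table above


def substeps_per_control_tick(n_ticks, physics_hz=PHYSICS_HZ, control_hz=CONTROL_HZ):
    """Return the number of physics steps to run on each control tick.

    After k control ticks exactly floor(k * physics_hz / control_hz)
    physics steps have been scheduled, so rounding error never accumulates.
    """
    counts = []
    done = 0
    for k in range(1, n_ticks + 1):
        target = (k * physics_hz) // control_hz
        counts.append(target - done)
        done = target
    return counts


# One simulated second: 30 control ticks covering 500 physics steps.
counts = substeps_per_control_tick(30)
print(counts[:6], sum(counts))
```

In a real loop each entry would be the number of `mujoco.mj_step(model, data)` calls made before the next policy action is applied.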
Key inference techniques:
- Temporal ensembling: overlapping action chunks with exponential decay weighting
- Hierarchical task decomposition: composite tasks switch task embedding at grasp trigger
- Re-grasp prevention: a `released` flag prevents re-triggering after intentional release
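Temporal ensembling can be sketched as follows. Every chunk that covers the current timestep contributes its prediction, and the predictions are blended with weights `exp(-m * i)`, where `i = 0` is the oldest chunk (the ordering described in the ACT paper). The decay rate `m`, the data layout, and the function name are assumptions made for illustration; the repository's evaluation loop may differ.

```python
import math


def ensemble_action(chunks, t, m=0.1):
    """Blend every action that was predicted for timestep t.

    chunks: dict mapping start step s -> list of actions (lists of floats)
            predicted at step s for steps s, s+1, ..., s+len-1.
    m:      decay rate (assumed value); weight is exp(-m * i), with i = 0
            for the oldest chunk that still covers t.
    """
    preds = [chunk[t - s] for s, chunk in sorted(chunks.items())
             if 0 <= t - s < len(chunk)]
    if not preds:
        raise ValueError(f"no chunk covers timestep {t}")
    weights = [math.exp(-m * i) for i in range(len(preds))]
    total = sum(weights)
    dim = len(preds[0])
    return [sum(w * p[d] for w, p in zip(weights, preds)) / total
            for d in range(dim)]


# Two overlapping 3-step chunks with 1-D actions, predicted at steps 0 and 1.
chunks = {0: [[0.0], [1.0], [2.0]], 1: [[1.5], [2.5], [3.5]]}
print(ensemble_action(chunks, t=1))
```

With `m > 0` the older prediction dominates, which smooths the executed trajectory at the cost of reacting more slowly to new observations.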
The bimanual model was trained with position randomization (±5 cm), random arm starts, and visual domain randomization, then evaluated on out-of-distribution conditions never seen during training:
| Test Distribution | Success | Description |
|---|---|---|
| In-distribution | 90% | Same ranges as training, different seeds |
| OOD Position (1.5x) | 80% | Wider object positions |
| OOD Visual | 70% | Novel table/light colors |
| OOD Posture (1.5x) | 60% | Wider starting arm configs |
| OOD Combined | 55% | All OOD factors simultaneously |
Success degrades gradually from 90% to 55% instead of collapsing, which suggests the policy generalizes to unseen conditions and has not simply memorized the training distribution.
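To make the "1.5x" rows concrete, the sketch below scales the ±5 cm training range by a factor and samples object offsets from the widened range. The uniform distribution, the two-axis (x, y) offset, and both function names are assumptions; `eval_generalization.py` may sample differently.

```python
import random

TRAIN_HALF_RANGE_M = 0.05  # ±5 cm position randomization used in training


def sample_object_offset(scale=1.0, rng=random):
    """Sample an (x, y) offset in metres from the scaled range (assumed uniform)."""
    half = TRAIN_HALF_RANGE_M * scale
    return (rng.uniform(-half, half), rng.uniform(-half, half))


def is_outside_training_range(offset):
    """True if any coordinate lies beyond the ±5 cm training range."""
    return any(abs(c) > TRAIN_HALF_RANGE_M for c in offset)


rng = random.Random(0)
ood = [sample_object_offset(scale=1.5, rng=rng) for _ in range(1000)]
print(sum(is_outside_training_range(o) for o in ood), "of 1000 samples are truly OOD")
```

Under these assumptions a 1.5x range still overlaps the training range, so only part of each OOD evaluation set lies strictly outside what the model saw.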
- ACT (Action Chunking with Transformers) policy, 15.6M params, ~0.67s chunks
- Frozen ResNet18 (layers 0-6) visual encoder + state + task-embedding fusion
- Temporal ensembling: overlapping chunks with exponential decay weighting
- Hierarchical task decomposition: composite tasks switch embeddings at grasp trigger
- Re-grasp prevention via `released` latch in evaluation loop
- PD torque control with per-joint gravity compensation on Unitree G1 (29 DOF)
- Friction-only bimanual grasping under full `mj_step` dynamics, no weld constraints
- Iterative Jacobian IK for scripted expert demo generation (single + bimanual)
- Auto-grasp trigger on hand-to-object proximity for closed-loop manipulation
- Bilateral contact force monitoring (≥2N per palm, ~13N measured in eval)
- ROS 2 Jazzy Task Manager node: NL command → ACT inference → MuJoCo at 30Hz
- rosbridge WebSocket server (port 9090) for external clients (Telegram, JS, Python)
- Natural language parser routes "pick up" / "lift box" → task ID + arm mode
- Thread-safe inference loop with daemon threads and ROS2 status publishing
- JSON-encoded `/vla/status` topic streaming step, progress, and result fields
- OOD generalization study with progressive difficulty (position / visual / posture / combined)
- Runtime visual domain randomization (table color, lighting) during training
- Ablation against in-distribution baseline, holding seeds independent
- 50 documented engineering lessons (`tasks/lessons.md`, L001-L050)
- Six deep-dive study documents covering each subsystem end-to-end
- MuJoCo MJCF authoring: G1 + table + cameras + objects in `sim/g1_with_camera.xml`
- 500Hz `mj_step` physics decoupled from 30Hz control loop
- Egocentric 480×640 RGB camera pipeline rendered headless via EGL
- HDF5 demo recording + LeRobot format converter for portability
- Two interactive viewers (`live_demo.py`, `live_bimanual.py`) for qualitative inspection
The VLA Task Manager accepts natural language commands via ROS2 topics and runs ACT inference in a closed-loop MuJoCo simulation.
| Direction | Topic | Type | Purpose |
|---|---|---|---|
| Subscribe | `/vla/task_goal` | `std_msgs/String` | Natural language command |
| Publish | `/vla/status` | `std_msgs/String` | JSON: step, progress, result |
| Publish | `/camera/image_raw` | `sensor_msgs/Image` | Ego camera during execution |
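Because `/vla/status` carries JSON inside a `std_msgs/String`, a client that listens through rosbridge has to decode twice: once for the rosbridge envelope and once for the `data` payload. The sketch below builds a rosbridge `subscribe` request and unwraps an incoming message. The helper names are hypothetical, and the field values in the example are made up; only the field names (step, progress, result) come from the table above.

```python
import json


def subscribe_request(topic="/vla/status", msg_type="std_msgs/String"):
    """Build the rosbridge v2 JSON request that subscribes to a topic."""
    return json.dumps({"op": "subscribe", "topic": topic, "type": msg_type})


def parse_status(raw, topic="/vla/status"):
    """Unwrap a rosbridge message and decode the JSON status payload.

    Returns the status dict, or None if the message is not a publish
    on the expected topic.
    """
    envelope = json.loads(raw)
    if envelope.get("op") != "publish" or envelope.get("topic") != topic:
        return None
    return json.loads(envelope["msg"]["data"])


# Example envelope as rosbridge would forward it (values are illustrative).
payload = json.dumps({"step": 42, "progress": 0.5, "result": None})
raw = json.dumps({"op": "publish", "topic": "/vla/status", "msg": {"data": payload}})
print(parse_status(raw))
```

A WebSocket client would send `subscribe_request()` once after connecting and then pass each received frame to `parse_status`.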
| Input | Mode | Task |
|---|---|---|
| "pick up the red cube" | single_arm | pick up the red cube |
| "reach" | single_arm | reach the red cube |
| "lift the box" | bimanual | pick up the green box with both hands |
| "bimanual grasp" | bimanual | pick up the green box with both hands |
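The routing in the table above can be approximated with simple keyword matching. This is a minimal sketch, assuming substring rules and the task IDs listed in the task table (0 to 4); the keyword lists and the `route_command` name are illustrative and are not taken from `task_manager_node.py`.

```python
# Bimanual phrases are checked first so "bimanual grasp" is not routed
# to the single-arm grasp task.
BIMANUAL_KEYWORDS = ("both hands", "bimanual", "lift the box", "lift box")
BIMANUAL_TASK = ("bimanual", 4, "pick up the green box with both hands")

# (keyword, task_id, canonical task string) for the single-arm model.
SINGLE_ARM_TASKS = [
    ("place", 3, "place the red cube on blue plate"),
    ("pick", 2, "pick up the red cube"),
    ("grasp", 1, "grasp the red cube"),
    ("reach", 0, "reach the red cube"),
]


def route_command(text):
    """Map a free-form command to (mode, task_id, canonical_task) or None."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in BIMANUAL_KEYWORDS):
        return BIMANUAL_TASK
    for keyword, task_id, task in SINGLE_ARM_TASKS:
        if keyword in lowered:
            return ("single_arm", task_id, task)
    return None  # unknown command; the caller decides how to report it


print(route_command("Pick up the red cube"))
print(route_command("lift the box"))
```

Returning the canonical task string keeps the text fed to the task embedding identical to what the policy saw in training, however the user phrased the request.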
The launch file co-starts rosbridge_server on port 9090, enabling any WebSocket client (RosClaw, JavaScript, Python) to send commands:
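# Requires the websocket-client package (pip install websocket-client)
import json
import websocket

ws = websocket.create_connection("ws://localhost:9090")
# rosbridge "publish" op: deliver a std_msgs/String to /vla/task_goal
ws.send(json.dumps({
    "op": "publish",
    "topic": "/vla/task_goal",
    "msg": {"data": "pick up the red cube"}
}))
ws.close()
| Property | Value |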
|---|---|
| DOF | 29 torque-controlled joints |
| Control | PD torque control with per-joint gravity compensation |
| Camera | Egocentric RGB, 480×640, torso-mounted |
| Fixed base | Pelvis frozen at z=0.793m |
| Right arm | 7 DOF (shoulder pitch/roll/yaw, elbow, wrist p/r/y) |
| Left arm | 7 DOF (mirror configuration) |
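The control law named in the table, PD torque control with gravity compensation, has the general per-joint form `tau = kp * (q_target - q) - kd * qd + bias`. The sketch below implements that textbook form in plain Python. The gains, the optional torque limit, and the function name are hypothetical; in MuJoCo the bias term would typically be read from `data.qfrc_bias`, which also contains Coriolis and centrifugal forces.

```python
def pd_gravity_torque(q_target, q, qd, bias, kp, kd, limit=None):
    """Per-joint PD torque plus a feed-forward bias (gravity) term.

    All arguments are equal-length sequences of floats, one entry per joint.
    limit, if given, is a per-joint symmetric torque bound (assumed).
    """
    torques = []
    for i in range(len(q)):
        tau = kp[i] * (q_target[i] - q[i]) - kd[i] * qd[i] + bias[i]
        if limit is not None:
            tau = max(-limit[i], min(limit[i], tau))
        torques.append(tau)
    return torques


# At the target with zero velocity, only the gravity term remains.
print(pd_gravity_torque([0.5], [0.5], [0.0], bias=[2.0], kp=[100.0], kd=[5.0]))
```

The feed-forward term lets the PD gains stay low enough for compliant contact while the arm still holds its pose against gravity.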
| ID | Task | Description | Success Criterion |
|---|---|---|---|
| 0 | Reach | Move hand to the red cube | Hand within 6cm of cube |
| 1 | Grasp | Close hand around the cube | Auto-grasp triggered (hand < 4cm) |
| 2 | Pick | Lift the cube off the table | Cube z > 0.90m while grasped |
| 3 | Place | Move cube to the blue plate | Cube within 6cm of target, released |
| 4 | Bimanual Lift | Lift green box with both hands | Box lifted ≥3cm, dual contact, force ≥2N |
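The bimanual criterion in the last row combines three conditions. A minimal sketch of such a check is shown below, assuming the lift is measured relative to the box's starting height and that "dual contact" means both palms meet the 2N force threshold; the function name and signature are hypothetical and do not come from `evaluate_bimanual.py`.

```python
def bimanual_lift_success(box_z, box_z_start, left_force_n, right_force_n,
                          min_lift_m=0.03, min_force_n=2.0):
    """Return True if the box was lifted enough while both palms hold it."""
    lifted = (box_z - box_z_start) >= min_lift_m
    dual_contact = left_force_n >= min_force_n and right_force_n >= min_force_n
    return lifted and dual_contact


# Mean values reported in the bimanual results table: 8.5 cm lift, 13.6N / 13.3N.
print(bimanual_lift_success(0.885, 0.80, 13.6, 13.3))
```

Requiring force on both palms rules out episodes where the box is pushed up or balanced on one hand.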
Image (480×640×3) ──► ResNet18 (frozen 0-6) ──► AvgPool ──► 512-d ──► Proj ──► 256-d ──┐
                                                                                       │
State (pos + vel) ──► MLP (→256→256) ──────────────────────────────────────────────────┼──► Memory
                                                                                       │    (3 tokens)
Task ("pick up..") ──► Embedding ──► 256-d ────────────────────────────────────────────┘        │
                                                                                                │
                                                                                  ┌─────────────▼──────────────┐
                                                                                  │ Transformer Decoder        │
                                                                                  │ 4 layers, 4 heads, d=256   │
                                                                                  │ 20 learnable query tokens  │
                                                                                  └─────────────┬──────────────┘
                                                                                                │
                                                                                  Action chunk: (20, action_dim)
| Variant | Params | Trainable | State | Actions | Tasks |
|---|---|---|---|---|---|
| Single-arm | 15.6M | 12.8M | 58 (29+29) | 29 | 4 |
| Bimanual | 15.6M | 12.8M | 28 (14+14) | 14 | 1 |
Chunk size: 20 timesteps (~0.67s). Training: AdamW (lr=1e-4), CosineAnnealing, MSE loss. VRAM: ~1.5GB.
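The chunk duration and the learning-rate schedule follow from the numbers above. The sketch below computes both in plain Python using the standard closed-form cosine annealing curve. The assumptions are that the schedule spans all 300 epochs, is stepped once per epoch, and decays to zero; none of these settings is confirmed by the text.

```python
import math

CHUNK_SIZE = 20   # timesteps per action chunk
CONTROL_HZ = 30   # control loop rate
BASE_LR = 1e-4    # AdamW learning rate
EPOCHS = 300      # training length

# 20 steps at 30Hz gives the ~0.67s chunk duration quoted above.
chunk_seconds = CHUNK_SIZE / CONTROL_HZ


def cosine_lr(epoch, base_lr=BASE_LR, total=EPOCHS, eta_min=0.0):
    """Closed-form cosine annealing from base_lr down to eta_min (assumed 0)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / total))


print(f"chunk duration: {chunk_seconds:.2f}s")
for epoch in (0, 150, 300):
    print(epoch, cosine_lr(epoch))
```

The learning rate halves by the midpoint and approaches zero at the end, which is one common way to let an MSE loss settle to very small values late in training.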
humanoid_vla/
├── README.md                              # This file
├── CLAUDE.md                              # Project vision & phase plan
│
├── sim/                                   # MuJoCo simulation
│   ├── g1_with_camera.xml                 # Scene: G1 + table + objects + cameras
│   ├── models/g1_29dof.xml                # Robot model (29 torque-actuated DOF)
│   └── test_g1.py                         # Standalone sim test
│
├── scripts/                               # Training & evaluation pipeline
│   ├── act_model.py                       # ACT policy architecture + dataset
│   ├── train_act.py                       # Single-arm training
│   ├── train_bimanual.py                  # Bimanual training
│   ├── evaluate.py                        # Single-arm evaluation
│   ├── evaluate_bimanual.py               # Bimanual evaluation (contact + lift)
│   ├── generate_demos.py                  # Single-arm scripted expert (IK + weld)
│   ├── generate_bimanual_demos.py         # Bimanual demo generator (friction)
│   ├── physics_sim.py                     # Physics wrapper (mj_step, PD, contacts)
│   ├── domain_randomization.py            # Runtime visual domain randomization
│   ├── eval_generalization.py             # OOD generalization evaluation
│   ├── visualize_configs.py               # Render randomization grid
│   ├── visualize_perception_action.py     # Trajectory strip visualization
│   ├── live_demo.py                       # Interactive viewer (single-arm)
│   ├── live_bimanual.py                   # Interactive viewer (bimanual)
│   ├── record_demo_videos.py              # Generate demo clips for README
│   ├── visualize_demo.py                  # Render videos from HDF5 demos
│   └── convert_to_lerobot.py              # LeRobot format converter
│
├── ros2_ws/src/vla_mujoco_bridge/         # ROS2 package
│   ├── vla_mujoco_bridge/
│   │   ├── task_manager_node.py           # VLA Task Manager (NL → ACT → MuJoCo)
│   │   ├── bridge_node.py                 # Low-level MuJoCo ↔ ROS2 bridge
│   │   ├── mujoco_sim.py                  # Physics engine wrapper
│   │   ├── teleop_node.py                 # Full-body keyboard teleop
│   │   ├── arm_teleop_node.py             # Arm-only keyboard teleop
│   │   └── demo_recorder.py               # HDF5 demonstration recorder
│   └── launch/
│       └── vla_system.launch.py           # Launch: rosbridge + task manager
│
├── media/                                 # Demo videos (committed to repo)
│   ├── reach.mp4, grasp.mp4               # Individual task demos
│   ├── pick.mp4, place.mp4                # Pick and place demos
│   ├── bimanual.mp4                       # Bimanual box lift demo
│   └── all_tasks.mp4                      # Combined montage
│
├── data/                                  # Generated data (gitignored)
│   ├── demos/                             # Single-arm HDF5 episodes
│   ├── checkpoints/                       # Single-arm model weights
│   ├── bimanual_demos/                    # Bimanual HDF5 episodes
│   └── bimanual_checkpoints/              # Bimanual model weights
│
├── study/                                 # Deep-dive study documents
│   ├── 01_project_deep_dive.md            # MuJoCo, G1, ROS2, camera pipeline
│   ├── 02_scripted_expert_demo_generation.md  # IK, kinematic playback
│   ├── 03_act_training_and_evaluation.md  # ACT training, debugging
│   ├── 04_bimanual_physics_grasping.md    # Physics, PD, friction grasp
│   └── 05_system_integration.md           # Task Manager, rosbridge, NL parsing
│
├── tasks/                                 # Project management
│   ├── todo.md                            # Phase tracker with milestones
│   └── lessons.md                         # Engineering lessons (L001-L050)
│
└── logs/                                  # Training logs
    └── act_training_300ep.log
| # | Document | Topics |
|---|---|---|
| 01 | Project Deep Dive | MuJoCo fundamentals, G1 robot, MJCF XML, PD control, gravity comp, ROS2 bridge, threading, camera pipeline, teleoperation, HDF5 format |
| 02 | Scripted Expert Demos | Inverse kinematics (iterative Jacobian), kinematic playback, weld constraint, trajectory design |
| 03 | ACT Training & Evaluation | ACT architecture, action chunking, ResNet18 encoder, task embedding, Transformer decoder, training, evaluation debugging |
| 04 | Bimanual Physics Grasping | mj_step vs mj_forward, PD torque control, contact physics, friction cones, compliance grasping, bimanual coordination |
| 05 | System Integration | ROS2 Task Manager, NL parsing, rosbridge WebSocket, thread-safe execution, temporal ensembling, full data flow |
| 06 | Domain Randomization | Visual augmentation, position noise, posture variation, generalization evaluation, ablation study |
`tasks/lessons.md` contains 50 concise lessons learned:
- L001-L008: Environment setup (torque actuators, meshdir, ROS2 Jazzy)
- L009-L012: Phase B infrastructure (gravity comp, setuptools, cv_bridge)
- L013-L016: Demo generation (ctrlrange, arm reach, kinematic IK, weld)
- L017-L023: ACT training (standalone, action chunking, frozen ResNet, auto-grasp)
- L024-L027: Evaluation (temporal ensembling, hierarchical decomposition, re-grasp)
- L028-L033: Bimanual physics (leg drift, palm pad, IK, compliance grasping)
- L034-L040: ROS2 integration (String+JSON, daemon threads, launch files)
- L041-L050: Domain randomization (memorization, IK validation, progressive difficulty)
| Phase | Status | Duration | Summary |
|---|---|---|---|
| A: Sim + ROS2 | ✅ | 2 weeks | MuJoCo + G1 + camera + ROS2 bridge + teleop |
| B: Demo Generation | ✅ | 1 week | Scripted expert demos, IK pipeline, 80 episodes |
| C: ACT Training | ✅ | 2 weeks | 4-task ACT model, 86.2% success rate |
| C2: Bimanual | ✅ | 2 weeks | Physics-based bimanual grasping, 100% success |
| D: Integration | ✅ | 1 week | ROS2 Task Manager, NL commands, rosbridge |
| E: Polish | ✅ | 1 week | Demo videos, documentation, study docs |
| F: Generalization | ✅ | 1 week | Domain randomization, OOD evaluation, 90% in-dist / 55% combined OOD |
| Component | Minimum | Tested On |
|---|---|---|
| GPU | NVIDIA with CUDA, 4GB+ VRAM | RTX 4050 Laptop (6GB) |
| RAM | 16 GB | 33 GB |
| OS | Ubuntu 22.04 or 24.04 | Ubuntu 24.04 |
| CUDA | 12.x | 12.8 |
| ROS2 | Humble or Jazzy | Jazzy |
- ACT: Zhao et al., "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware", RSS 2023
- GR00T N1: NVIDIA, "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots", 2025
- unitree_mujoco: G1/H1 simulation
- mujoco_menagerie: Robot models
- lerobot: Robot learning framework
MIT
