name	remote-training
description	Manages remote training infrastructure on Nebius VMs. Use for building/pushing Docker images, starting/stopping VM machines (train, train2, train3), running training jobs, dataset generation, and starting inference servers.

Remote Training Infrastructure

Overview

This skill manages the Positronic training infrastructure on Nebius GPU VMs. It covers Docker image management, VM lifecycle, training jobs, dataset generation, and inference server deployment.

Prerequisites

Docker contexts configured for VMs: vm-train, vm-train2, vm-train3
AWS S3 access configured for checkpoint/dataset storage
Nebius CLI authenticated (for VM start/stop)

Available Machines

Context	GPU	Use Case
`desktop`	RTX 3060 (12GB)	Dataset generation, GR00T inference, lerobot training
`notebook`	RTX 4060 Laptop (8GB)	Light tasks, testing, dataset generation
`vm-train`	H100 (80GB)	GR00T/OpenPI training and inference
`vm-train2`	H100 (80GB)	GR00T/OpenPI training and inference
`vm-train3`	H100 (80GB)	GR00T/OpenPI training and inference

Important: Only GR00T training/inference and OpenPI training/inference require H100. Other jobs (dataset generation, lerobot) can run on desktop.

Docker Images

Image Overview

Image	Source	Depends On	Used For
`positro/positronic`	`positronic/docker/`	-	Dataset conversion, lerobot training/inference
`positro/gr00t`	`positronic/docker/`	`positro/gr00t-base`	GR00T training and inference
`positro/gr00t-base`	`gr00t/docker/`	-	Base image for GR00T
`positro/openpi`	`positronic/docker/`	`positro/openpi-base`	OpenPI training and inference
`positro/openpi-base`	`openpi/docker/`	-	Base image for OpenPI

Build Order for Cross-Repo Changes

If you modify code in ../gr00t or ../openpi:

For gr00t changes:

cd /home/vertix/dev/gr00t/docker
make push  # Pushes positro/gr00t-base
cd /home/vertix/dev/positronic/docker
make push-groot  # Rebuilds and pushes positro/gr00t with new base

For openpi changes:

cd /home/vertix/dev/openpi/docker
make push  # Pushes positro/openpi-base
cd /home/vertix/dev/positronic/docker
make push-openpi  # Rebuilds and pushes positro/openpi with new base

For positronic-only changes:

cd /home/vertix/dev/positronic/docker
make push-training  # Just positro/positronic
# Or for specific images:
make push-groot     # positro/gr00t
make push-openpi    # positro/openpi
make push           # All images

VM Machine Management

Start a VM

../internal/scripts/start.sh train
../internal/scripts/start.sh train2
../internal/scripts/start.sh train3

Note: Requires Nebius CLI authentication. Must be run from a terminal with browser access for OAuth flow.

Check VM Status

ssh -o ConnectTimeout=5 vertix@vm-train 'echo connected'
ssh -o ConnectTimeout=5 vertix@vm-train2 'echo connected'
ssh -o ConnectTimeout=5 vertix@vm-train3 'echo connected'

Docker Contexts

docker context ls                     # List available contexts
docker --context vm-train ps          # Check containers on vm-train
docker --context vm-train2 ps         # Check containers on vm-train2

Pipeline Overview

1. Data Collection (positronic-data-collection)
        ↓
2. Dataset Conversion (positronic-to-lerobot) [desktop]
        ↓
3. [OpenPI only] Generate Stats (openpi-stats) [desktop]
        ↓
4. Training (groot-train / openpi-train) [H100]
        ↓
5. Inference Server (groot-server / openpi-server) [H100 or desktop]
        ↓
6. Inference Client (positronic-inference) [local]

Dataset Generation

Convert Positronic Dataset to LeRobot Format

From docker/ directory (can run on desktop):

docker compose run --rm --pull always positronic-to-lerobot convert \
  --dataset=@positronic.cfg.phail.sim_stack_groot_ft \
  --dataset.observation=.groot_rot6d_joints \
  --dataset.action=.groot_rot6d \
  --output_dir=s3://interim/sim_ft/groot_rot6d_q/ \
  --fps=15

Observation/Action Configs

Observation	Description
`.groot`	EE pose (quaternion)
`.groot_joints`	EE pose + joint positions
`.groot_rot6d`	EE pose (6D rotation)
`.groot_rot6d_joints`	6D rotation + joint positions
`.eepose`	For OpenPI/ACT

Action	Description
`.groot`	EE delta (quaternion)
`.groot_rot6d`	EE delta (6D rotation)
`.absolute_position`	Absolute EE pose

GR00T Training

From docker/ directory, on H100 VM:

docker --context vm-train compose run --rm --pull=always groot-train \
  --input_path=s3://interim/sim_ft/groot_rot6d_q/ \
  --output_path=s3://checkpoints/sim_ft/groot_rot6d_q/ \
  --exp_name=YYMMDD \
  --num_train_steps=20000 \
  --save_steps=2000 \
  --num_workers=4 \
  --modality_config=ee_rot6d_q

GR00T Modality Configs

Config	Description
`ee`	End-effector pose (quaternion)
`ee_q`	EE pose + joint feedback
`ee_rot6d`	EE pose with 6D rotation
`ee_rot6d_q`	6D rotation + joint feedback
`ee_rot6d_rel`	6D rotation, relative actions
`ee_rot6d_q_rel`	6D rotation + joints, relative actions

OpenPI Training

From docker/ directory, on H100 VM:

# 1. Generate stats (can run on desktop)
docker compose run --rm openpi-stats \
  --input_path=s3://interim/my_lerobot_data \
  --output_path=s3://interim/openpi_assets

# 2. Train (requires H100)
docker --context vm-train compose run --rm --pull=always openpi-train \
  --input_path=s3://interim/my_lerobot_data \
  --stats_path=s3://interim/openpi_assets/assets/ \
  --output_path=s3://checkpoints/openpi \
  --exp_name=experiment_v1

Inference Servers

GR00T Server (requires GPU)

docker compose run --rm --service-ports groot-server \
  --checkpoints_dir=s3://checkpoints/sim_ft/groot_rot6d_q/040126/ \
  --modality_config=ee_rot6d_q \
  --port=9000

OpenPI Server (requires H100)

docker --context vm-train compose run --rm --service-ports openpi-server \
  --checkpoints_dir=s3://checkpoints/openpi/pi05_positronic_lowmem/experiment_v1/

LeRobot/ACT Server (can run on desktop)

docker compose run --rm --service-ports lerobot-server \
  --checkpoints_dir=s3://checkpoints/act/experiment_v1/

Inference Client

With GUI (requires display)

uv run positronic-inference sim \
  --policy=.groot_ee_rot6d_joints \
  --policy.base.host=desktop \
  --driver.show_gui

Headless (no display required)

MUJOCO_GL=egl uv run positronic-inference sim \
  --policy=.groot_ee_rot6d_joints \
  --policy.base.host=desktop \
  --driver.show_gui=False \
  --driver.simulation_time=10

Client-Server Config Mapping

Server Modality	Client Policy Config
`ee_rot6d_q`	`groot_ee_rot6d_joints`
`ee_rot6d_q_rel`	`groot_ee_rot6d_joints`
`ee_q`	`groot_ee_joints`
`ee`	`groot_ee`
OpenPI	`openpi_positronic`
LeRobot ACT	`act_absolute`

Monitoring Background Jobs

When running jobs in background:

# Check progress percentage
grep -o '[0-9]*%' /tmp/claude/-home-vertix-dev-positronic/tasks/<task_id>.output | tail -1

# View recent output
tail -50 /tmp/claude/-home-vertix-dev-positronic/tasks/<task_id>.output

# Check for completion/errors
grep -i "error\|complete\|finished" /tmp/claude/-home-vertix-dev-positronic/tasks/<task_id>.output

Common Issues

CUDA Out of Memory

Each GR00T server uses ~6GB GPU memory. On 12GB GPUs (desktop), only run one server at a time.

Port Already Allocated

docker ps -a | grep -E "groot-server|openpi-server"
docker stop <container_id> && docker rm <container_id>

VM Not Reachable

Start the VM: ../internal/scripts/start.sh train2
Verify SSH: ssh -o ConnectTimeout=5 vertix@vm-train2 'echo connected'

Parquet Object Array Error

If dataset generation fails with ValueError: setting an array element with a sequence, the fix is in positronic/dataset/vector.py - use np.stack() to convert object arrays to proper 2D arrays.

gladLoadGL Error (Headless)

Use MUJOCO_GL=egl environment variable for headless rendering:

MUJOCO_GL=egl uv run positronic-inference sim --driver.show_gui=False ...

Nebius Auth (Manual Flow for Headless Environments)

When running from a headless environment without browser access:

Start nebius in background with --no-browser:
```
nebius --no-browser --auth-timeout 5m iam whoami 2>&1
```
Run this in background and extract the auth URL from output.
Give the auth URL to the user - they click it and authenticate in their browser.
User's browser redirects to localhost URL like:
```
http://127.0.0.1:PORT/?code=XXX&state=YYY
```
The page won't load (expected). User copies this full URL from address bar.

Curl the localhost URL on the machine running nebius:

curl -s "http://127.0.0.1:PORT/?code=XXX&state=YYY"
# Returns: "Login is successful, you may close the browser tab"

Auth completes - nebius background process finishes, credentials are cached.

After authentication, VM start scripts will work:

../internal/scripts/start.sh train

remote-training

Install Skill

SKILL.md