
Norbert Schmidt 5a04a7133a Add fully local conversational AI pipeline for Reachy Mini

Local STT (Qwen3-ASR), VLM (Gemma 4 26B-A4B), and TTS (Spark-TTS) running
on Apple Silicon via MLX, with bracket-tag action system for nod, shake,
wiggle, dance, photo, and pre-recorded emotions.

2026-05-12 09:24:02 +02:00

reachy-mlx-vlm

Fully local conversational AI for the Reachy Mini robot — every model runs on your Mac using MLX. No cloud APIs, no realtime websockets, no usage tiers.

The robot's microphone, speaker and camera stream over WebRTC/IPC to the Mac. The pipeline runs end-to-end on Apple Silicon and sends the synthesized audio back to the robot.

 ┌──────────────┐    mic     ┌─────────────────────────────────────────────┐    speaker  ┌──────────────┐
 │  Reachy Mini │ ─────────▶ │  Mac (Apple Silicon)                        │ ──────────▶ │  Reachy Mini │
 │              │            │                                             │             │              │
 │  mic + cam   │   camera   │   Qwen3-ASR  →  Gemma 4 VLM  →  Spark-TTS   │             │   speaker    │
 │              │ ─────────▶ │     (STT)        (think)         (voice)    │             │              │
 │              │            │                                             │             │              │
 └──────────────┘            └─────────────────────────────────────────────┘             └──────────────┘
        ▲                                                                                       │
        └───────────────── head pose / antennas / dance / emotion / photo ───────────────────────┘

What it does

Listens via the robot's microphone, segmenting speech with an energy-based VAD and filtering out hallucinated transcripts.
Transcribes with Qwen3-ASR-1.7B (8-bit MLX).
Thinks with Gemma 4 26B-A4B (mlx-community/gemma-4-26b-a4b-it-bf16) — vision-capable, so it can describe what it sees on request.
Speaks with Spark-TTS-0.5B-bf16 streamed back through the robot's speaker.
Moves while talking — a multi-frequency head-wobble during speech, subtle breathing animation while idle.
Acts on bracketed tags emitted by the model: [nod], [shake], [wiggle], [dance], [photo], [emotion:NAME]. The tags are stripped from speech and executed in parallel with TTS.
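
One way to picture the multi-frequency head-wobble is as a sum of a few sinusoids with unrelated frequencies. The sketch below is only an illustration of that idea: the function name, frequencies and amplitudes are assumptions, not values taken from this project.

```python
import math

# Hypothetical (frequency in Hz, amplitude in degrees) pairs, chosen for illustration.
WOBBLE_COMPONENTS = [(0.7, 2.0), (1.9, 1.0), (3.1, 0.5)]


def wobble_angle(t: float, components=WOBBLE_COMPONENTS) -> float:
    """Return a head-pitch offset in degrees at time t (seconds).

    Summing sinusoids with unrelated frequencies gives a motion that
    repeats less obviously than a single sine wave.
    """
    return sum(amp * math.sin(2.0 * math.pi * freq * t) for freq, amp in components)


if __name__ == "__main__":
    # Sample the offset ten times over one second.
    for step in range(10):
        t = step / 10.0
        print(f"t={t:.1f}s offset={wobble_angle(t):+.2f} deg")
```

The offset never exceeds the sum of the amplitudes (3.5 degrees here), which keeps the motion subtle.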

Requirements

Hardware: Apple Silicon Mac (M1+) with ~32 GB RAM, Reachy Mini Wireless on the same Wi-Fi.
Software: macOS, Python 3.12+, uv, GStreamer (brew install gstreamer), sshpass (for the helper scripts: brew install sshpass).
Disk: ~30 GB for the model weights downloaded from Hugging Face on first run.

Setup

git clone https://github.com/Pocket-science/reachy-mlx-vlm.git
cd reachy-mlx-vlm
uv sync

That's it. The first time you run ./run.sh, MLX will pull the Qwen3-ASR, Gemma 4 VLM, Spark-TTS, and reachy-mini-emotions-library repos from Hugging Face (one-off, then cached).

Note on mlx-vlm versions. The pyproject.toml lock currently resolves to mlx-vlm 0.3.9 because reachy-mini 1.2.3 pins transformers below the floor required by mlx-vlm >= 0.5.0. The codebase works fine on 0.3.9–0.4.4. To run 0.5.0 on the same venv:
VIRTUAL_ENV=.venv uv pip install --no-deps mlx==0.31.2 mlx-metal==0.31.2 mlx-vlm==0.5.0
This sidesteps the dependency tree — your Mac never actually runs transformers to drive the robot, so the constraint isn't load-bearing.
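
If you swap versions by hand, a quick check against the range named above can help. This is a minimal sketch: the helper names are hypothetical, the bounds simply mirror the 0.3.9–0.4.4 range from the note, and the parser only handles plain numeric versions.

```python
from importlib import metadata

# Bounds copied from the note above; adjust them if the tested range changes.
TESTED_MIN = (0, 3, 9)
TESTED_MAX = (0, 4, 4)


def parse_version(text: str) -> tuple:
    """Turn '0.4.1' into (0, 4, 1); pre-release suffixes are not supported."""
    return tuple(int(part) for part in text.split("."))


def in_tested_range(version: str) -> bool:
    """True when the version falls inside the range the README reports as working."""
    return TESTED_MIN <= parse_version(version) <= TESTED_MAX


def installed_mlx_vlm_version() -> str:
    """Look up the installed mlx-vlm version (raises PackageNotFoundError if absent)."""
    return metadata.version("mlx-vlm")
```

For example, `in_tested_range(installed_mlx_vlm_version())` returns `False` after the manual 0.5.0 install, which is expected in that case.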

Run

./run.sh                          # default — vision on, talks to reachy-mini.local
./run.sh --no-camera              # audio-only mode
./run.sh --host 192.168.1.55      # explicit robot IP
./run.sh --energy-threshold 0.04  # more sensitive VAD
./run.sh --debug                  # verbose logs
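
To see why a lower `--energy-threshold` makes the VAD more sensitive, here is a minimal sketch of energy-based detection. It assumes the threshold is an RMS cut-off on samples normalized to [-1.0, 1.0]; the README does not confirm that detail, and the function names are hypothetical.

```python
import math
from typing import Iterable, List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_flags(frames: Iterable[Sequence[float]], threshold: float = 0.04) -> List[bool]:
    """Mark each frame as speech (True) when its RMS energy exceeds the threshold."""
    return [rms(frame) > threshold for frame in frames]


if __name__ == "__main__":
    quiet = [0.01] * 160       # background noise
    loud = [0.2, -0.2] * 80    # something closer to speech
    print(speech_flags([quiet, loud, quiet]))
```

With a lower threshold, quieter frames count as speech, which helps with soft voices but also lets more background noise through.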

The robot's daemon must be reachable at http://reachy-mini.local:8000. The wireless model ships with --no-autostart, so click the wake-up button in the dashboard once before launching, or hit:

curl -X POST "http://reachy-mini.local:8000/api/daemon/start?wake_up=true"
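
The same call can be made from Python with the standard library. This is a sketch equivalent to the curl command above; the function names are hypothetical, and only the URL and query parameter come from this README.

```python
import urllib.parse
import urllib.request


def build_wake_request(host: str = "reachy-mini.local", port: int = 8000) -> urllib.request.Request:
    """Build the POST request that starts the daemon with wake_up=true."""
    query = urllib.parse.urlencode({"wake_up": "true"})
    url = f"http://{host}:{port}/api/daemon/start?{query}"
    return urllib.request.Request(url, method="POST")


def wake_robot(host: str = "reachy-mini.local", timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code.

    Requires the robot to be reachable on the network.
    """
    with urllib.request.urlopen(build_wake_request(host), timeout=timeout) as response:
        return response.status
```

Call `wake_robot()` once before `./run.sh`, or pass an explicit IP such as `wake_robot("192.168.1.55")` if mDNS does not resolve.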

Tool / action system

The model can emit bracketed tags anywhere in its reply. They are parsed out, executed asynchronously, and stripped from the spoken text.

| Tag | What it does |
| --- | --- |
| `[nod]` | Three-pose pitch nod. |
| `[shake]` | Three-pose yaw shake. |
| `[wiggle]` | Three antenna oscillations. |
| `[dance]` | Plays a random move from `pollen-robotics/reachy-mini-dances-library`. |
| `[photo]` | Saves a JPEG snapshot from the camera to `photos/`. |
| `[emotion:NAME]` | Plays a named pre-recorded emotion (silent; sound track muted). |
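
The parse-and-strip step can be sketched with a single regular expression. This is an illustration under assumptions, not the project's actual parser: `TAG_PATTERN` and `split_tags` are hypothetical names, and the real code may handle whitespace and unknown tags differently.

```python
import re
from typing import List, Tuple

# Matches [nod], [shake], [wiggle], [dance], [photo] and [emotion:NAME].
TAG_PATTERN = re.compile(r"\[(nod|shake|wiggle|dance|photo|emotion)(?::([^\]]+))?\]")


def split_tags(reply: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (speakable_text, actions), where actions are (tag, argument) pairs."""
    actions = [(m.group(1), m.group(2) or "") for m in TAG_PATTERN.finditer(reply)]
    text = TAG_PATTERN.sub("", reply)
    # Collapse the whitespace left behind by removed tags.
    text = re.sub(r"\s+", " ", text).strip()
    return text, actions


if __name__ == "__main__":
    text, actions = split_tags("Sure! [nod] Let me look. [emotion:curious]")
    print(text)
    print(actions)
```

The cleaned text would go to TTS while the action list is dispatched separately, so speech and movement can overlap.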

The emotion library is listed in the system prompt at startup so the model can pick by name. To avoid overuse, the runtime drops [emotion:...] tags whenever the user's utterance does not include an emotion trigger word (happy, sad, look ..., etc.). Configure the trigger list in LocalConversationStream.EMOTION_TRIGGERS.
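
The gating rule can be sketched as follows. The trigger tuple here is illustrative only (the real list lives in `LocalConversationStream.EMOTION_TRIGGERS`), the function name is hypothetical, and whole-word matching is an assumption.

```python
import re

# Illustrative trigger words; not the project's actual list.
EXAMPLE_TRIGGERS = ("happy", "sad", "look")

# An [emotion:NAME] tag plus any whitespace directly before it.
EMOTION_TAG = re.compile(r"\s*\[emotion:[^\]]+\]")


def gate_emotion_tags(reply: str, user_utterance: str, triggers=EXAMPLE_TRIGGERS) -> str:
    """Drop [emotion:...] tags unless the user's utterance contains a trigger word."""
    words = set(re.findall(r"[a-z']+", user_utterance.lower()))
    if any(trigger in words for trigger in triggers):
        return reply
    return EMOTION_TAG.sub("", reply).strip()


if __name__ == "__main__":
    print(gate_emotion_tags("Hi there [emotion:cheerful]", "what time is it"))
    print(gate_emotion_tags("Yay [emotion:cheerful]", "are you happy today"))
```

Gating on the user's words rather than the model's output keeps the model from playing an emotion on every turn.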

Customizing the personality

The system prompt lives at the top of conversation.py (SYSTEM_PROMPT). It follows the same structure as Pollen Robotics' official prompt (identity / response rules / tools) and currently bakes in a household context (names, location). Edit those lines to match your environment — or wire it to read from a config file.

Helper scripts

| Script | Purpose |
| --- | --- |
| `run.sh "..."` | Launch the conversation loop. |
| `speak.sh "Hello"` | One-shot TTS: generate on the Mac, play through the robot. |
| `take_photo.sh out.jpg` | Grab a single frame from the robot's camera over SSH. |
| `record_voice.sh 12 voice_ref.wav` | Capture 12 s of audio through the robot's microphone. |
| `test_reachy.py` | Minimal SDK smoke test; wiggles the antennas. |
| `look_at_click.py` | Click the camera feed to make the robot look at a pixel. |

The helper scripts assume the default pollen / root SSH credentials shipped with ReachyMiniOS.

How this differs from the official conversation app

pollen-robotics/reachy_mini_conversation_app uses the OpenAI Realtime API for STT + LLM + TTS in a single websocket session, with a full function-calling tool system (each tool is a class with a JSON schema, dispatched async with cancellation).

This project makes a different trade-off:

Pros: zero cloud cost, fully offline, your voice never leaves the Mac, no rate limits.
Cons: ~3–6 s round-trip on a base M-series chip (the 26B VLM is the bottleneck). Tools are bracket-tag strings parsed from plain text, not strict JSON function calls, because MLX's local VLM stack doesn't expose native tool calling.

If you want real-time turn-taking, use the official app. If you want a robot that thinks entirely on your laptop, this is for you.

Acknowledgements

Pollen Robotics for the Reachy Mini SDK and the conversation-app architecture this riffs on.
mlx-audio for the STT / TTS adapters.
mlx-vlm for the Gemma 4 VLM runtime.
The mlx project at Apple ML Research.

License

MIT — see LICENSE.
