Local STT (Qwen3-ASR), VLM (Gemma 4 26B-A4B), and TTS (Spark-TTS) running on Apple Silicon via MLX, with a bracket-tag action system for nod, shake, wiggle, dance, photo, and pre-recorded emotions.
reachy-mlx-vlm
Fully local conversational AI for the Reachy Mini robot — every model runs on your Mac using MLX. No cloud APIs, no realtime websockets, no usage tiers.
The robot's microphone, speaker and camera stream over WebRTC/IPC to the Mac. The pipeline runs end-to-end on Apple Silicon and sends the synthesized audio back to the robot.
┌──────────────┐    mic     ┌─────────────────────────────────────────────┐   speaker   ┌──────────────┐
│ Reachy Mini  │ ─────────▶ │             Mac (Apple Silicon)             │ ──────────▶ │ Reachy Mini  │
│              │            │                                             │             │              │
│  mic + cam   │   camera   │     Qwen3-ASR → Gemma 4 VLM → Spark-TTS     │             │   speaker    │
│              │ ─────────▶ │       (STT)       (think)      (voice)      │             │              │
│              │            │                                             │             │              │
└──────────────┘            └─────────────────────────────────────────────┘             └──────────────┘
       ▲                                                                                        │
       └──────────────────── head pose / antennas / dance / emotion / photo ────────────────────┘
What it does
- Listens via the robot's microphone, detecting speech with an energy-based VAD plus hallucination filtering.
- Transcribes with Qwen3-ASR-1.7B (8-bit MLX).
- Thinks with Gemma 4 26B-A4B (`mlx-community/gemma-4-26b-a4b-it-bf16`), which is vision-capable, so it can describe what it sees on request.
- Speaks with Spark-TTS-0.5B-bf16, streamed back through the robot's speaker.
- Moves while talking — a multi-frequency head-wobble during speech, subtle breathing animation while idle.
- Acts on bracketed tags emitted by the model: `[nod]`, `[shake]`, `[wiggle]`, `[dance]`, `[photo]`, `[emotion:NAME]`. The tags are stripped from speech and executed in parallel with TTS.
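The energy-based VAD in the first bullet can be pictured with a small sketch. This is not the project's code: the frame length, the hangover count, and the `rms` / `speech_segments` helpers are assumptions chosen for illustration, and the hallucination filter is left out entirely.

```python
import math
from typing import List, Sequence, Tuple


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_segments(
    samples: Sequence[float],
    frame_len: int = 320,      # assumed frame size (20 ms at 16 kHz)
    threshold: float = 0.04,   # illustrative value, compare --energy-threshold
    hangover: int = 3,         # quiet frames tolerated inside one utterance
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of runs whose RMS reaches the threshold."""
    segments = []
    start = None
    last_loud_end = 0
    quiet = 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if rms(frame) >= threshold:
            if start is None:
                start = i              # a new utterance begins here
            last_loud_end = i + len(frame)
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover:       # silence lasted long enough: close the segment
                segments.append((start, last_loud_end))
                start = None
                quiet = 0
    if start is not None:              # audio ended while still inside an utterance
        segments.append((start, last_loud_end))
    return segments
```

Lowering `threshold` makes the detector more sensitive, which matches the direction of the `--energy-threshold` flag described under Run.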
Requirements
- Hardware: Apple Silicon Mac (M1+) with ~32 GB RAM, Reachy Mini Wireless on the same Wi-Fi.
- Software: macOS, Python 3.12+, uv, GStreamer (`brew install gstreamer`), `sshpass` (for the helper scripts: `brew install sshpass`).
- Disk: ~30 GB for the model weights downloaded from Hugging Face on first run.
Setup
git clone https://github.com/Pocket-science/reachy-mlx-vlm.git
cd reachy-mlx-vlm
uv sync
That's it. The first time you run ./run.sh, MLX will pull the Qwen3-ASR, Gemma 4 VLM, Spark-TTS, and reachy-mini-emotions-library repos from Hugging Face (one-off, then cached).
Note on `mlx-vlm` versions. The `pyproject.toml` lock currently resolves to `mlx-vlm 0.3.9` because `reachy-mini 1.2.3` pins `transformers` below the floor required by `mlx-vlm >= 0.5.0`. The codebase works fine on 0.3.9–0.4.4. To run 0.5.0 in the same venv:
VIRTUAL_ENV=.venv uv pip install --no-deps mlx==0.31.2 mlx-metal==0.31.2 mlx-vlm==0.5.0
This sidesteps the dependency tree. Your Mac never actually runs `transformers` to drive the robot, so the constraint isn't load-bearing.
Run
./run.sh # default — vision on, talks to reachy-mini.local
./run.sh --no-camera # audio-only mode
./run.sh --host 192.168.1.55 # explicit robot IP
./run.sh --energy-threshold 0.04 # more sensitive VAD
./run.sh --debug # verbose logs
The robot's daemon must be reachable at http://reachy-mini.local:8000. The wireless model ships with --no-autostart, so click the wake-up button in the dashboard once before launching, or hit:
curl -X POST "http://reachy-mini.local:8000/api/daemon/start?wake_up=true"
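The same wake-up request can be sent from Python before launching the loop. The sketch below only reuses the endpoint from the `curl` line above; the helper names are hypothetical, and treating any 2xx status as success is an assumption, since the daemon's response format is not described here.

```python
import http.client
import urllib.request


def daemon_start_url(host: str = "reachy-mini.local", port: int = 8000) -> str:
    """Build the wake-up URL used by the curl command above."""
    return f"http://{host}:{port}/api/daemon/start?wake_up=true"


def start_daemon(host: str = "reachy-mini.local", timeout: float = 5.0) -> bool:
    """POST to the daemon; return True on a 2xx reply, False if unreachable."""
    request = urllib.request.Request(daemon_start_url(host), data=b"", method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

Calling `start_daemon("192.168.1.55")` would target an explicit robot IP, mirroring the `--host` flag.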
Tool / action system
The model can emit bracketed tags anywhere in its reply. They are parsed out, executed asynchronously, and stripped from the spoken text.
| Tag | What it does |
|---|---|
| `[nod]` | Three-pose pitch nod. |
| `[shake]` | Three-pose yaw shake. |
| `[wiggle]` | Three antenna oscillations. |
| `[dance]` | Plays a random move from pollen-robotics/reachy-mini-dances-library. |
| `[photo]` | Saves a JPEG snapshot from the camera to photos/. |
| `[emotion:NAME]` | Plays a named pre-recorded emotion (silent; sound track muted). |
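A rough sketch of how such tags could be pulled out of a reply is shown below. It is an illustration under assumptions, not the parser in `conversation.py`: the regular expression, the `extract_tags` name, and the `(action, argument)` tuple shape are all invented for this example.

```python
import re
from typing import List, Optional, Tuple

# Matches the six tags from the table; the emotion name is captured separately.
TAG_PATTERN = re.compile(r"\[(nod|shake|wiggle|dance|photo|emotion:([^\]\s]+))\]")


def extract_tags(reply: str) -> Tuple[str, List[Tuple[str, Optional[str]]]]:
    """Split a model reply into speakable text and a list of (action, argument)."""
    actions = []
    for match in TAG_PATTERN.finditer(reply):
        if match.group(2) is not None:
            actions.append(("emotion", match.group(2)))
        else:
            actions.append((match.group(1), None))
    # Remove the tags, then collapse the whitespace they leave behind.
    spoken = TAG_PATTERN.sub("", reply)
    spoken = re.sub(r"\s{2,}", " ", spoken).strip()
    return spoken, actions
```

In this sketch the `spoken` string would go to TTS while the `actions` list is dispatched separately, which is one way to get the parallel execution described above. Unknown bracketed text is left in place here; the real runtime may behave differently.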
The emotion library is listed in the system prompt at startup so the model can pick by name. To avoid overuse, the runtime drops [emotion:...] tags whenever the user's utterance does not include an emotion trigger word (happy, sad, look ..., etc.). Configure the trigger list in LocalConversationStream.EMOTION_TRIGGERS.
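The gating rule can be sketched as follows. The trigger tuple here contains only the two words named above and stands in for the real `LocalConversationStream.EMOTION_TRIGGERS`; the `gate_emotion_tags` function and its whole-word matching are assumptions made for illustration.

```python
import re

# Placeholder list: only the words named in the text, not the project's full list.
EMOTION_TRIGGERS = ("happy", "sad")

EMOTION_TAG = re.compile(r"\[emotion:[^\]]+\]")


def gate_emotion_tags(user_text: str, reply: str, triggers=EMOTION_TRIGGERS) -> str:
    """Keep [emotion:...] tags only when the user's utterance contains a trigger word."""
    lowered = user_text.lower()
    triggered = any(
        re.search(rf"\b{re.escape(word)}\b", lowered) for word in triggers
    )
    if triggered:
        return reply
    # No trigger word: drop emotion tags but leave other tags untouched.
    stripped = EMOTION_TAG.sub("", reply)
    return re.sub(r"\s{2,}", " ", stripped).strip()
```

Matching on word boundaries means "unhappy" does not count as "happy" in this sketch; whether the real runtime matches substrings or whole words is not stated.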
Customizing the personality
The system prompt lives at the top of conversation.py (SYSTEM_PROMPT). It follows the same structure as Pollen Robotics' official prompt (identity / response rules / tools) and currently bakes in a household context (names, location). Edit those lines to match your environment — or wire it to read from a config file.
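One possible way to move the household details into a config file is sketched below. Everything in it is hypothetical: the template text is not the real `SYSTEM_PROMPT`, and the JSON keys (`names`, `location`), the defaults, and the function names are invented for the example.

```python
import json
from pathlib import Path
from string import Template

# Stand-in template; the actual SYSTEM_PROMPT in conversation.py is not reproduced.
PROMPT_TEMPLATE = Template(
    "You are Reachy Mini, a small robot living with $names in $location."
)

DEFAULTS = {"names": "your household", "location": "an unspecified place"}


def render_system_prompt(overrides: dict) -> str:
    """Fill the template, falling back to DEFAULTS for missing keys."""
    values = dict(DEFAULTS)
    values.update(overrides)
    return PROMPT_TEMPLATE.substitute(values)


def load_system_prompt(config_path: str) -> str:
    """Read overrides from a JSON file if it exists, else use the defaults."""
    path = Path(config_path)
    overrides = {}
    if path.is_file():
        overrides = json.loads(path.read_text(encoding="utf-8"))
    return render_system_prompt(overrides)
```

A file such as `{"names": "Ada and Grace", "location": "Utrecht"}` would then change the prompt without touching the source, and a missing file leaves the defaults in place.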
Helper scripts
| Script | Purpose |
|---|---|
| `run.sh "..."` | Launch the conversation loop. |
| `speak.sh "Hello"` | One-shot TTS: generate on the Mac, play through the robot. |
| `take_photo.sh out.jpg` | Grab a single frame from the robot's camera over SSH. |
| `record_voice.sh 12 voice_ref.wav` | Capture 12 s of audio through the robot's microphone. |
| `test_reachy.py` | Minimal SDK smoke test that wiggles the antennas. |
| `look_at_click.py` | Click the camera feed to make the robot look at a pixel. |
The helper scripts assume the default pollen / root SSH credentials shipped with ReachyMiniOS.
How this differs from the official conversation app
pollen-robotics/reachy_mini_conversation_app uses the OpenAI Realtime API for STT + LLM + TTS in a single websocket session, with a full function-calling tool system (each tool is a class with a JSON schema, dispatched async with cancellation).
This project trades that off:
- Pros: zero cloud cost, fully offline, your voice never leaves the Mac, no rate limits.
- Cons: ~3–6 s round-trip on a base M-series chip (the 26B VLM is the bottleneck). Tools are bracket-tag strings parsed from plain text, not strict JSON function calls, because MLX's local VLM stack doesn't expose native tool calling.
If you want real-time turn-taking, use the official app. If you want a robot that thinks entirely on your laptop, this is for you.
Acknowledgements
- Pollen Robotics for the Reachy Mini SDK and the conversation-app architecture this riffs on.
- `mlx-audio` for the STT / TTS adapters.
- `mlx-vlm` for the Gemma 4 VLM runtime.
- The `mlx` project at Apple ML Research.
License
MIT — see LICENSE.