reachy-mlx-vlm/README.md
Norbert Schmidt 5a04a7133a Add fully local conversational AI pipeline for Reachy Mini
Local STT (Qwen3-ASR), VLM (Gemma 4 26B-A4B), and TTS (Spark-TTS) running
on Apple Silicon via MLX, with bracket-tag action system for nod, shake,
wiggle, dance, photo, and pre-recorded emotions.
2026-05-12 09:24:02 +02:00

# reachy-mlx-vlm
Fully local conversational AI for the [Reachy Mini](https://www.pollen-robotics.com/reachy-mini) robot — every model runs on your Mac using [MLX](https://github.com/ml-explore/mlx). No cloud APIs, no realtime websockets, no usage tiers.
The robot's microphone and camera feeds stream over WebRTC/IPC to the Mac, where the pipeline runs end to end on Apple Silicon and then sends the synthesized audio back to the robot's speaker.
```
┌──────────────┐    mic     ┌─────────────────────────────────────────────┐   speaker   ┌──────────────┐
│ Reachy Mini  │ ─────────▶ │             Mac (Apple Silicon)             │ ──────────▶ │ Reachy Mini  │
│              │            │                                             │             │              │
│  mic + cam   │   camera   │   Qwen3-ASR  →  Gemma 4 VLM  →  Spark-TTS   │             │   speaker    │
│              │ ─────────▶ │     (STT)         (think)        (voice)    │             │              │
│              │            │                                             │             │              │
└──────────────┘            └─────────────────────────────────────────────┘             └──────────────┘
       ▲                                                                                        │
       └───────────────── head pose / antennas / dance / emotion / photo ───────────────────────┘
```
## What it does
- **Listens** via the robot's microphone, using energy-based voice-activity detection (VAD) plus hallucination filtering.
- **Transcribes** with Qwen3-ASR-1.7B (8-bit MLX).
- **Thinks** with Gemma 4 26B-A4B (`mlx-community/gemma-4-26b-a4b-it-bf16`) — vision-capable, so it can describe what it sees on request.
- **Speaks** with Spark-TTS-0.5B-bf16 streamed back through the robot's speaker.
- **Moves** while talking — a multi-frequency head-wobble during speech, subtle breathing animation while idle.
- **Acts** on bracketed tags emitted by the model: `[nod]`, `[shake]`, `[wiggle]`, `[dance]`, `[photo]`, `[emotion:NAME]`. The tags are stripped from speech and executed in parallel with TTS.
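
The energy-based VAD in the first bullet can be illustrated with a short sketch. This is not the project's implementation: the function names, the 480-sample frame (30 ms at an assumed 16 kHz), the hangover count, and the assumption that samples are floats in [-1, 1] are all hypothetical. A frame counts as speech when its RMS energy reaches the threshold, and a segment closes after a run of quiet frames.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def speech_segments(
    samples: Sequence[float],
    frame_len: int = 480,     # assumed: 30 ms at 16 kHz
    threshold: float = 0.04,  # assumed RMS scale; lower means more sensitive
    hangover: int = 10,       # quiet frames tolerated before a segment closes
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of regions judged to be speech."""
    segments: List[Tuple[int, int]] = []
    start = None
    quiet = 0
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i * frame_len
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hangover:
                # The segment ends right after the last loud frame.
                segments.append((start, (i - quiet + 1) * frame_len))
                start, quiet = None, 0
    if start is not None:
        # Audio ended while a segment was still open.
        segments.append((start, (n_frames - quiet) * frame_len))
    return segments
```

Under these assumptions, lowering `threshold` makes the detector trigger on quieter audio, which matches the "more sensitive" comment on `--energy-threshold 0.04` in the Run section.
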
## Requirements
- **Hardware**: Apple Silicon Mac (M1+) with ~32 GB RAM, Reachy Mini Wireless on the same Wi-Fi.
- **Software**: macOS, Python 3.12+, [uv](https://github.com/astral-sh/uv), GStreamer (`brew install gstreamer`), `sshpass` (for the helper scripts: `brew install sshpass`).
- **Disk**: ~30 GB for the model weights downloaded from Hugging Face on first run.
## Setup
```bash
git clone https://github.com/Pocket-science/reachy-mlx-vlm.git
cd reachy-mlx-vlm
uv sync
```
That's it. The first time you run `./run.sh`, the Qwen3-ASR, Gemma 4 VLM, Spark-TTS, and `reachy-mini-emotions-library` repos are downloaded from Hugging Face (a one-off download, then cached).
> **Note on `mlx-vlm` versions.** The `pyproject.toml` lock currently resolves to `mlx-vlm 0.3.9` because `reachy-mini 1.2.3` pins `transformers` below the floor required by `mlx-vlm >= 0.5.0`. The codebase works fine on 0.3.9 through 0.4.4. To run 0.5.0 on the same venv:
>
> ```bash
> VIRTUAL_ENV=.venv uv pip install --no-deps mlx==0.31.2 mlx-metal==0.31.2 mlx-vlm==0.5.0
> ```
>
> This sidesteps the dependency tree — your Mac never actually runs `transformers` to drive the robot, so the constraint isn't load-bearing.
## Run
```bash
./run.sh # default — vision on, talks to reachy-mini.local
./run.sh --no-camera # audio-only mode
./run.sh --host 192.168.1.55 # explicit robot IP
./run.sh --energy-threshold 0.04 # more sensitive VAD
./run.sh --debug # verbose logs
```
The robot's daemon must be reachable at `http://reachy-mini.local:8000`. The wireless model ships with `--no-autostart`, so click the wake-up button in the dashboard once before launching, or hit:
```bash
curl -X POST "http://reachy-mini.local:8000/api/daemon/start?wake_up=true"
```
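
If you would rather wake the daemon from a Python script, the same request can be sketched with the standard library. The endpoint and query string are taken from the curl command above; `build_wake_request` and `wake_daemon` are hypothetical helper names, not part of this repo or the Reachy Mini SDK.

```python
import urllib.request


def build_wake_request(host: str = "reachy-mini.local", port: int = 8000) -> urllib.request.Request:
    """Build the POST request that mirrors the curl command above."""
    url = f"http://{host}:{port}/api/daemon/start?wake_up=true"
    # An empty body plus an explicit method gives a plain POST.
    return urllib.request.Request(url, data=b"", method="POST")


def wake_daemon(host: str = "reachy-mini.local", port: int = 8000, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code (raises on network errors)."""
    with urllib.request.urlopen(build_wake_request(host, port), timeout=timeout) as resp:
        return resp.status
```

Calling `wake_daemon("192.168.1.55")` would target a robot by explicit IP, the same way `--host` does for `run.sh`.
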
## Tool / action system
The model can emit bracketed tags anywhere in its reply. They are parsed out, executed asynchronously, and stripped from the spoken text.
| Tag | What it does |
| ------------------ | --------------------------------------------------------------------- |
| `[nod]` | Three-pose pitch nod. |
| `[shake]` | Three-pose yaw shake. |
| `[wiggle]` | Three antenna oscillations. |
| `[dance]` | Plays a random move from `pollen-robotics/reachy-mini-dances-library`. |
| `[photo]` | Saves a JPEG snapshot from the camera to `photos/`. |
| `[emotion:NAME]` | Plays a named pre-recorded emotion (silent — sound track muted). |
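
A minimal sketch of how tags like these could be separated from the spoken text, assuming a regular-expression approach. `TAG_RE` and `split_reply` are hypothetical names; the parser in this repo may work differently.

```python
import re
from typing import List, Tuple

# Matches the six tag forms from the table above; anything else in
# square brackets is left in the text untouched.
TAG_RE = re.compile(r"\[(nod|shake|wiggle|dance|photo|emotion:[^\[\]]+)\]")


def split_reply(reply: str) -> Tuple[str, List[str]]:
    """Return (text_to_speak, tags_in_order) for one model reply."""
    tags = [m.group(1).strip() for m in TAG_RE.finditer(reply)]
    spoken = TAG_RE.sub("", reply)
    spoken = re.sub(r"\s+", " ", spoken).strip()     # collapse gaps left by removed tags
    spoken = re.sub(r"\s+([.,!?])", r"\1", spoken)   # no space before punctuation
    return spoken, tags
```

For example, `split_reply("Sure! [nod] Let me take a picture. [photo]")` yields the text `"Sure! Let me take a picture."` and the tag list `["nod", "photo"]`, which can then be handed to the TTS and action layers separately.
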
The emotion library is listed in the system prompt at startup so the model can pick by name. To avoid overuse, the runtime drops `[emotion:...]` tags whenever the user's utterance does not include an emotion trigger word (`happy`, `sad`, `look ...`, etc.). Configure the trigger list in `LocalConversationStream.EMOTION_TRIGGERS`.
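
The gating rule described above can be sketched as a small filter. The trigger tuple here is a placeholder built from the examples in the text, and `has_trigger` and `gate_emotion_tags` are hypothetical names; the project's actual list lives in `LocalConversationStream.EMOTION_TRIGGERS` and its matching logic may differ.

```python
import re
from typing import List, Sequence

# Placeholder list for illustration only.
EMOTION_TRIGGERS = ("happy", "sad", "look")


def has_trigger(user_text: str, triggers: Sequence[str] = EMOTION_TRIGGERS) -> bool:
    """True if any trigger appears at the start of a word in the utterance."""
    lowered = user_text.lower()
    return any(re.search(r"\b" + re.escape(t.lower()), lowered) for t in triggers)


def gate_emotion_tags(
    user_text: str, tags: Sequence[str], triggers: Sequence[str] = EMOTION_TRIGGERS
) -> List[str]:
    """Drop emotion tags unless the user's utterance contained a trigger word."""
    if has_trigger(user_text, triggers):
        return list(tags)
    return [t for t in tags if not t.startswith("emotion:")]
```

Matching at word starts is a design choice in this sketch: it lets `look` match "looking" while keeping `sad` from matching inside "ambassador".
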
## Customizing the personality
The system prompt lives at the top of `conversation.py` (`SYSTEM_PROMPT`). It follows the same structure as Pollen Robotics' official prompt (identity / response rules / tools) and currently bakes in a household context (names, location). Edit those lines to match your environment — or wire it to read from a config file.
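
One way to wire the prompt to a config file is sketched below, assuming a plain-text file whose path is passed in or read from an environment variable. The variable name `REACHY_SYSTEM_PROMPT_FILE`, the function `load_system_prompt`, and the fallback string are all hypothetical; none of them exist in this repo.

```python
import os
from pathlib import Path
from typing import Optional

# Placeholder fallback for illustration; not the project's actual prompt.
DEFAULT_SYSTEM_PROMPT = "You are Reachy Mini, a small friendly robot."


def load_system_prompt(path: Optional[str] = None, default: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Return the prompt from a text file if one is available, else the default."""
    candidate = path or os.environ.get("REACHY_SYSTEM_PROMPT_FILE")
    if candidate:
        prompt_file = Path(candidate)
        if prompt_file.is_file():
            text = prompt_file.read_text(encoding="utf-8").strip()
            if text:
                return text
    # Missing, unreadable-as-file, or empty: fall back to the built-in prompt.
    return default
```

In `conversation.py`, `SYSTEM_PROMPT` could then be assigned from `load_system_prompt()` so that household details stay out of the source file.
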
## Helper scripts
| Script | Purpose |
| ------------------------------------- | ------------------------------------------------------------------ |
| `run.sh "..."` | Launch the conversation loop. |
| `speak.sh "Hello"` | One-shot TTS — generate on the Mac, play through the robot. |
| `take_photo.sh out.jpg` | Grab a single frame from the robot's camera over SSH. |
| `record_voice.sh 12 voice_ref.wav` | Capture 12 s of audio through the robot's microphone. |
| `test_reachy.py` | Minimal SDK smoke test — wiggles the antennas. |
| `look_at_click.py` | Click the camera feed to make the robot look at a pixel. |
The helper scripts assume the default `pollen` / `root` SSH credentials shipped with ReachyMiniOS.
## How this differs from the official conversation app
[`pollen-robotics/reachy_mini_conversation_app`](https://github.com/pollen-robotics/reachy_mini_conversation_app) uses the OpenAI Realtime API for STT + LLM + TTS in a single websocket session, with a full function-calling tool system (each tool is a class with a JSON schema, dispatched async with cancellation).
This project makes a different trade-off:
- **Pros**: zero cloud cost, fully offline, your voice never leaves the Mac, no rate limits.
- **Cons**: ~36 s round-trip on a base M-series chip (the 26B VLM is the bottleneck). Tools are bracket-tag strings parsed from plain text, not strict JSON function calls, because MLX's local VLM stack doesn't expose native tool calling.
If you want real-time turn-taking, use the official app. If you want a robot that thinks entirely on your laptop, this is for you.
## Acknowledgements
- [Pollen Robotics](https://www.pollen-robotics.com/) for the Reachy Mini SDK and the conversation-app architecture this riffs on.
- [`mlx-audio`](https://github.com/Blaizzy/mlx-audio) for the STT / TTS adapters.
- [`mlx-vlm`](https://github.com/Blaizzy/mlx-vlm) for the Gemma 4 VLM runtime.
- The [`mlx`](https://github.com/ml-explore/mlx) project at Apple ML Research.
## License
MIT — see [LICENSE](LICENSE).