Add fully local conversational AI pipeline for Reachy Mini
Local STT (Qwen3-ASR), VLM (Gemma 4 26B-A4B), and TTS (Spark-TTS) running on Apple Silicon via MLX, with bracket-tag action system for nod, shake, wiggle, dance, photo, and pre-recorded emotions.
120
README.md
Normal file
@@ -0,0 +1,120 @@
# reachy-mlx-vlm

Fully local conversational AI for the [Reachy Mini](https://www.pollen-robotics.com/reachy-mini) robot: every model runs on your Mac using [MLX](https://github.com/ml-explore/mlx). No cloud APIs, no realtime websockets, no usage tiers.

The robot's microphone, speaker and camera stream over WebRTC/IPC to the Mac, where the pipeline runs end to end on Apple Silicon and sends the synthesized audio back to the robot.

```
┌──────────────┐    mic     ┌─────────────────────────────────────────────┐   speaker   ┌──────────────┐
│ Reachy Mini  │ ─────────▶ │             Mac (Apple Silicon)             │ ──────────▶ │ Reachy Mini  │
│              │            │                                             │             │              │
│  mic + cam   │   camera   │     Qwen3-ASR → Gemma 4 VLM → Spark-TTS     │             │   speaker    │
│              │ ─────────▶ │       (STT)       (think)      (voice)      │             │              │
│              │            │                                             │             │              │
└──────────────┘            └─────────────────────────────────────────────┘             └──────────────┘
       ▲                                                                                        │
       └───────────────── head pose / antennas / dance / emotion / photo ───────────────────────┘
```

## What it does

- **Listens** via the robot's microphone, using energy-based voice activity detection (VAD) plus hallucination filtering.
- **Transcribes** with Qwen3-ASR-1.7B (8-bit MLX).
- **Thinks** with Gemma 4 26B-A4B (`mlx-community/gemma-4-26b-a4b-it-bf16`), which is vision-capable, so it can describe what it sees on request.
- **Speaks** with Spark-TTS-0.5B-bf16, streamed back through the robot's speaker.
- **Moves** while talking: a multi-frequency head wobble during speech and a subtle breathing animation while idle.
- **Acts** on bracketed tags emitted by the model: `[nod]`, `[shake]`, `[wiggle]`, `[dance]`, `[photo]`, `[emotion:NAME]`. The tags are stripped from speech and executed in parallel with TTS.
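To make the "energy-based VAD" item above concrete, here is a minimal sketch of how such a detector can work. It is an illustration only, not the project's implementation: the function names, the `0.05` default threshold and the `hangover_frames` parameter are all assumptions.

```python
import math
from typing import Iterable, List, Optional, Sequence, Tuple


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(
    frames: Iterable[Sequence[float]],
    energy_threshold: float = 0.05,  # assumed default, not taken from the project
    hangover_frames: int = 3,        # quiet frames tolerated before a segment closes
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of segments judged to contain speech.

    `end` is exclusive and points just past the last loud frame.
    """
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None  # first loud frame of the open segment
    last_loud = -1               # most recent loud frame
    for i, frame in enumerate(frames):
        if rms(frame) >= energy_threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud > hangover_frames:
            # Enough consecutive quiet frames: close the segment.
            segments.append((start, last_loud + 1))
            start = None
    if start is not None:
        segments.append((start, last_loud + 1))
    return segments
```

Lowering `energy_threshold` makes the detector trigger on quieter audio, which is the sense in which a smaller value is "more sensitive". The hangover keeps short pauses inside a sentence from splitting it into several segments.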

## Requirements

- **Hardware**: Apple Silicon Mac (M1+) with ~32 GB RAM, Reachy Mini Wireless on the same Wi-Fi.
- **Software**: macOS, Python 3.12+, [uv](https://github.com/astral-sh/uv), GStreamer (`brew install gstreamer`), `sshpass` (for the helper scripts: `brew install sshpass`).
- **Disk**: ~30 GB for the model weights downloaded from Hugging Face on first run.

## Setup

```bash
git clone https://github.com/Pocket-science/reachy-mlx-vlm.git
cd reachy-mlx-vlm
uv sync
```

That's it. The first time you run `./run.sh`, the Qwen3-ASR, Gemma 4 VLM, Spark-TTS, and `reachy-mini-emotions-library` repos are pulled from Hugging Face (a one-off download that is cached afterwards).

> **Note on `mlx-vlm` versions.** The lock generated from `pyproject.toml` currently resolves to `mlx-vlm 0.3.9` because `reachy-mini 1.2.3` pins `transformers` below the floor required by `mlx-vlm >= 0.5.0`. The codebase works fine on 0.3.9–0.4.4. To run 0.5.0 in the same venv:
>
> ```bash
> VIRTUAL_ENV=.venv uv pip install --no-deps mlx==0.31.2 mlx-metal==0.31.2 mlx-vlm==0.5.0
> ```
>
> This sidesteps the dependency resolver. Your Mac never actually runs `transformers` to drive the robot, so the constraint isn't load-bearing.
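Because a `--no-deps` install bypasses the resolver, it can help to confirm what actually ended up in the venv. The snippet below is a small sketch using only the standard library; the helper name `installed_versions` and the list of packages checked are illustrative choices, not part of the project.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package list is an assumption based on the names mentioned in this README.
    report = installed_versions(
        ["mlx", "mlx-metal", "mlx-vlm", "transformers", "reachy-mini"]
    )
    for name, version in report.items():
        print(f"{name:<14} {version or 'not installed'}")
```

Running it inside the project environment (for example with `uv run python` on a file containing the snippet) shows whether the override took effect.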

## Run

```bash
./run.sh                          # default: vision on, talks to reachy-mini.local
./run.sh --no-camera              # audio-only mode
./run.sh --host 192.168.1.55      # explicit robot IP
./run.sh --energy-threshold 0.04  # more sensitive VAD
./run.sh --debug                  # verbose logs
```

The robot's daemon must be reachable at `http://reachy-mini.local:8000`. The wireless model ships with `--no-autostart`, so click the wake-up button in the dashboard once before launching, or call:

```bash
curl -X POST "http://reachy-mini.local:8000/api/daemon/start?wake_up=true"
```
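If you would rather trigger the wake-up from Python, for instance at the top of your own launcher, the same request can be sketched with the standard library. The endpoint is the one from the `curl` call above; the function names and the timeout value are assumptions made for this example.

```python
import urllib.request
from urllib.parse import urlencode


def build_wake_request(
    host: str = "reachy-mini.local", port: int = 8000
) -> urllib.request.Request:
    """Build the POST request that starts the daemon and wakes the robot."""
    query = urlencode({"wake_up": "true"})
    url = f"http://{host}:{port}/api/daemon/start?{query}"
    # An empty body plus method="POST" mirrors `curl -X POST` without data.
    return urllib.request.Request(url, data=b"", method="POST")


def wake_robot(host: str = "reachy-mini.local", timeout: float = 5.0) -> int:
    """Send the wake-up request and return the HTTP status code.

    Raises urllib.error.URLError if the daemon cannot be reached.
    """
    with urllib.request.urlopen(build_wake_request(host), timeout=timeout) as response:
        return response.status
```

Splitting request construction from sending keeps the URL logic checkable without a robot on the network.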

## Tool / action system

The model can emit bracketed tags anywhere in its reply. They are parsed out, executed asynchronously, and stripped from the spoken text.

| Tag              | What it does                                                          |
| ---------------- | --------------------------------------------------------------------- |
| `[nod]`          | Three-pose pitch nod.                                                 |
| `[shake]`        | Three-pose yaw shake.                                                 |
| `[wiggle]`       | Three antenna oscillations.                                           |
| `[dance]`        | Plays a random move from `pollen-robotics/reachy-mini-dances-library`. |
| `[photo]`        | Saves a JPEG snapshot from the camera to `photos/`.                   |
| `[emotion:NAME]` | Plays a named pre-recorded emotion (silent: the sound track is muted). |

The names in the emotion library are listed in the system prompt at startup so the model can pick one by name. To avoid overuse, the runtime drops `[emotion:...]` tags whenever the user's utterance does not include an emotion trigger word (`happy`, `sad`, `look ...`, etc.). Configure the trigger list in `LocalConversationStream.EMOTION_TRIGGERS`.
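The parse, strip and gate steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in `conversation.py`: the regular expression, the function names, the three trigger words and the emotion name `cheerful1` are all hypothetical.

```python
import re
from typing import List, Sequence, Tuple

# Matches [nod], [shake], [wiggle], [dance], [photo] and [emotion:NAME].
TAG_PATTERN = re.compile(r"\[(nod|shake|wiggle|dance|photo|emotion:[^\]\s]+)\]")

# Hypothetical trigger words; the real list is configured in
# LocalConversationStream.EMOTION_TRIGGERS.
EMOTION_TRIGGERS = ("happy", "sad", "look")


def extract_actions(reply: str) -> Tuple[str, List[str]]:
    """Split a model reply into speakable text and the action tags it contained."""
    actions = [match.group(1) for match in TAG_PATTERN.finditer(reply)]
    spoken = TAG_PATTERN.sub("", reply)
    spoken = re.sub(r"\s{2,}", " ", spoken)          # collapse gaps left by tags
    spoken = re.sub(r"\s+([.,!?])", r"\1", spoken)   # no space before punctuation
    return spoken.strip(), actions


def gate_emotions(
    actions: Sequence[str],
    user_utterance: str,
    triggers: Sequence[str] = EMOTION_TRIGGERS,
) -> List[str]:
    """Drop emotion tags unless the user's utterance contains a trigger word."""
    words = set(re.findall(r"[a-z']+", user_utterance.lower()))
    allowed = any(trigger in words for trigger in triggers)
    return [a for a in actions if allowed or not a.startswith("emotion:")]


if __name__ == "__main__":
    spoken, actions = extract_actions("Sure! [nod] Here you go. [emotion:cheerful1]")
    print(spoken)                                         # text handed to TTS
    print(gate_emotions(actions, "what time is it"))      # emotion tag dropped
    print(gate_emotions(actions, "show me a happy face")) # emotion tag kept
```

Unknown tags are left in the text untouched by this sketch; a real implementation would need to decide whether to strip or speak them.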

## Customizing the personality

The system prompt lives at the top of `conversation.py` (`SYSTEM_PROMPT`). It follows the same structure as Pollen Robotics' official prompt (identity / response rules / tools) and currently bakes in a household context (names, location). Edit those lines to match your environment, or wire it to read from a config file.

## Helper scripts

| Script                             | Purpose                                                     |
| ---------------------------------- | ----------------------------------------------------------- |
| `run.sh`                           | Launch the conversation loop.                               |
| `speak.sh "Hello"`                 | One-shot TTS: generate on the Mac, play through the robot.  |
| `take_photo.sh out.jpg`            | Grab a single frame from the robot's camera over SSH.       |
| `record_voice.sh 12 voice_ref.wav` | Capture 12 s of audio through the robot's microphone.       |
| `test_reachy.py`                   | Minimal SDK smoke test that wiggles the antennas.           |
| `look_at_click.py`                 | Click the camera feed to make the robot look at a pixel.    |

The helper scripts assume the default `pollen` / `root` SSH credentials shipped with ReachyMiniOS.

## How this differs from the official conversation app

[`pollen-robotics/reachy_mini_conversation_app`](https://github.com/pollen-robotics/reachy_mini_conversation_app) uses the OpenAI Realtime API for STT + LLM + TTS in a single websocket session, with a full function-calling tool system (each tool is a class with a JSON schema, dispatched async with cancellation).

This project makes a different trade-off:

- **Pros**: zero cloud cost, fully offline, your voice never leaves the Mac, no rate limits.
- **Cons**: ~3–6 s round-trip on a base M-series chip (the 26B VLM is the bottleneck). Tools are bracket-tag strings parsed from plain text, not strict JSON function calls, because MLX's local VLM stack doesn't expose native tool calling.

If you want real-time turn-taking, use the official app. If you want a robot that thinks entirely on your laptop, this is for you.

## Acknowledgements

- [Pollen Robotics](https://www.pollen-robotics.com/) for the Reachy Mini SDK and the conversation-app architecture this riffs on.
- [`mlx-audio`](https://github.com/Blaizzy/mlx-audio) for the STT / TTS adapters.
- [`mlx-vlm`](https://github.com/Blaizzy/mlx-vlm) for the Gemma 4 VLM runtime.
- The [`mlx`](https://github.com/ml-explore/mlx) project at Apple ML Research.

## License

MIT. See [LICENSE](LICENSE).