# reachy-mlx-vlm

Fully local conversational AI for the [Reachy Mini](https://www.pollen-robotics.com/reachy-mini) robot — every model runs on your Mac using [MLX](https://github.com/ml-explore/mlx). No cloud APIs, no realtime websockets, no usage tiers.

The robot's microphone and camera stream over WebRTC/IPC to the Mac, which runs the whole pipeline on Apple Silicon and sends the synthesized audio back to the robot's speaker.

```
 ┌──────────────┐    mic     ┌─────────────────────────────────────────────┐    speaker  ┌──────────────┐
 │  Reachy Mini │ ─────────▶ │  Mac (Apple Silicon)                        │ ──────────▶ │  Reachy Mini │
 │              │            │                                             │             │              │
 │  mic + cam   │   camera   │   Qwen3-ASR  →  Gemma 4 VLM  →  Spark-TTS   │             │   speaker    │
 │              │ ─────────▶ │     (STT)        (think)         (voice)    │             │              │
 │              │            │                                             │             │              │
 └──────────────┘            └─────────────────────────────────────────────┘             └──────────────┘
        ▲                                                                                       │
        └───────────────── head pose / antennas / dance / emotion / photo ───────────────────────┘
```

## What it does

- **Listens** via the robot's microphone, using energy-based voice-activity detection (VAD) plus hallucination filtering.
- **Transcribes** with Qwen3-ASR-1.7B (8-bit MLX).
- **Thinks** with Gemma 4 26B-A4B (`mlx-community/gemma-4-26b-a4b-it-bf16`) — vision-capable, so it can describe what it sees on request.
- **Speaks** with Spark-TTS-0.5B-bf16 streamed back through the robot's speaker.
- **Moves** while talking — a multi-frequency head-wobble during speech, subtle breathing animation while idle.
- **Acts** on bracketed tags emitted by the model: `[nod]`, `[shake]`, `[wiggle]`, `[dance]`, `[photo]`, `[emotion:NAME]`. The tags are stripped from speech and executed in parallel with TTS.
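
The energy-based VAD in the first bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's code: `frame_rms` and `is_speech` are hypothetical names, the threshold value is arbitrary, and a real pipeline would add hangover frames and the hallucination filter on top.

```python
import math
from typing import Sequence


def frame_rms(samples: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (floats in [-1.0, 1.0])."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(samples: Sequence[float], energy_threshold: float) -> bool:
    """Treat a frame as speech when its RMS energy exceeds the threshold.

    A lower threshold makes the detector more sensitive.
    """
    return frame_rms(samples) > energy_threshold


# Two 20 ms frames at 16 kHz: near-silence and a 440 Hz tone.
silence = [0.001] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
print(is_speech(silence, 0.04), is_speech(tone, 0.04))
```

With these inputs the quiet frame is rejected and the tone is accepted, because the tone's RMS (about 0.35) is well above the illustrative threshold.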

## Requirements

- **Hardware**: Apple Silicon Mac (M1+) with ~32 GB RAM, Reachy Mini Wireless on the same Wi-Fi.
- **Software**: macOS, Python 3.12+, [uv](https://github.com/astral-sh/uv), GStreamer (`brew install gstreamer`), `sshpass` (for the helper scripts: `brew install sshpass`).
- **Disk**: ~30 GB for the model weights downloaded from Hugging Face on first run.

## Setup

```bash
git clone https://github.com/Pocket-science/reachy-mlx-vlm.git
cd reachy-mlx-vlm
uv sync
```

That's it. The first time you run `./run.sh`, MLX will pull the Qwen3-ASR, Gemma 4 VLM, Spark-TTS, and `reachy-mini-emotions-library` repos from Hugging Face (one-off, then cached).

> **Note on `mlx-vlm` versions.** The `pyproject.toml` lock currently resolves to `mlx-vlm 0.3.9` because `reachy-mini 1.2.3` pins `transformers` below the floor required by `mlx-vlm >= 0.5.0`. The codebase works fine on 0.3.9–0.4.4. To run 0.5.0 on the same venv:
>
> ```bash
> VIRTUAL_ENV=.venv uv pip install --no-deps mlx==0.31.2 mlx-metal==0.31.2 mlx-vlm==0.5.0
> ```
>
> This sidesteps the dependency tree — your Mac never actually runs `transformers` to drive the robot, so the constraint isn't load-bearing.

## Run

```bash
./run.sh                          # default — vision on, talks to reachy-mini.local
./run.sh --no-camera              # audio-only mode
./run.sh --host 192.168.1.55      # explicit robot IP
./run.sh --energy-threshold 0.04  # more sensitive VAD
./run.sh --debug                  # verbose logs
```

The robot's daemon must be reachable at `http://reachy-mini.local:8000`. The wireless model ships with `--no-autostart`, so click the wake-up button in the dashboard once before launching, or call the daemon API directly:

```bash
curl -X POST "http://reachy-mini.local:8000/api/daemon/start?wake_up=true"
```
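
The same wake-up call can be made from Python with only the standard library. This is a minimal sketch that assumes the endpoint from the `curl` command above; `build_wake_request` and `wake_robot` are hypothetical helper names, not part of the Reachy Mini SDK.

```python
import urllib.request
from urllib.parse import urlencode


def build_wake_request(host: str = "reachy-mini.local", port: int = 8000) -> urllib.request.Request:
    """Build the POST request equivalent to the curl command (no network I/O)."""
    query = urlencode({"wake_up": "true"})
    url = f"http://{host}:{port}/api/daemon/start?{query}"
    # An empty body keeps the request a POST with Content-Length: 0.
    return urllib.request.Request(url, data=b"", method="POST")


def wake_robot(host: str = "reachy-mini.local", timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_wake_request(host), timeout=timeout) as resp:
        return resp.status


# Usage (requires the robot on the network):
#   wake_robot("192.168.1.55")
```

Splitting request construction from sending makes the URL easy to check without a robot on the network.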

## Tool / action system

The model can emit bracketed tags anywhere in its reply. They are parsed out, executed asynchronously, and stripped from the spoken text.

| Tag                | What it does                                                          |
| ------------------ | --------------------------------------------------------------------- |
| `[nod]`            | Three-pose pitch nod.                                                 |
| `[shake]`          | Three-pose yaw shake.                                                 |
| `[wiggle]`         | Three antenna oscillations.                                           |
| `[dance]`          | Plays a random move from `pollen-robotics/reachy-mini-dances-library`. |
| `[photo]`          | Saves a JPEG snapshot from the camera to `photos/`.                   |
| `[emotion:NAME]`   | Plays a named pre-recorded emotion (silent — sound track muted).      |

The emotion library is listed in the system prompt at startup so the model can pick by name. To avoid overuse, the runtime drops `[emotion:...]` tags whenever the user's utterance does not include an emotion trigger word (`happy`, `sad`, `look ...`, etc.). Configure the trigger list in `LocalConversationStream.EMOTION_TRIGGERS`.
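
The parse, strip, and gate steps described above could look roughly like this. It is a sketch, not the code in `conversation.py`: `TAG_RE`, `split_reply`, and `gate_emotions` are hypothetical names, the trigger tuple is a shortened stand-in for the real `EMOTION_TRIGGERS`, and plain substring matching is a simplification.

```python
import re
from typing import List, Sequence, Tuple

# Matches [nod], [shake], [wiggle], [dance], [photo] and [emotion:NAME].
TAG_RE = re.compile(r"\[(nod|shake|wiggle|dance|photo|emotion:[^\]\s]+)\]")

# Shortened, hypothetical trigger list for illustration only.
EMOTION_TRIGGERS: Sequence[str] = ("happy", "sad", "look")


def split_reply(reply: str) -> Tuple[str, List[str]]:
    """Return (text to speak, tags in order of appearance)."""
    tags = TAG_RE.findall(reply)
    spoken = TAG_RE.sub("", reply)
    # Collapse the gaps left behind by removed tags.
    spoken = re.sub(r"\s{2,}", " ", spoken).strip()
    return spoken, tags


def gate_emotions(tags: List[str], user_utterance: str) -> List[str]:
    """Drop emotion tags unless the user's utterance contains a trigger word."""
    lowered = user_utterance.lower()
    allowed = any(trigger in lowered for trigger in EMOTION_TRIGGERS)
    return [t for t in tags if allowed or not t.startswith("emotion:")]


spoken, tags = split_reply("[emotion:curious1] Sure! [nod] I can do that.")
print(spoken)                                       # Sure! I can do that.
print(gate_emotions(tags, "can you do that?"))      # ['nod']
print(gate_emotions(tags, "show me a happy face"))  # ['emotion:curious1', 'nod']
```

The spoken text would go to TTS while the surviving tags are dispatched separately, which is what lets actions run in parallel with speech.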

## Personalizing the prompt — this is where the project lives

A generic robot says "I am an AI assistant. How can I help you?" A robot that knows your house says "The dogs are out back, the kids should be home from hockey around six." This section is the difference, and it's the part that actually took time to get right.

The full `SYSTEM_PROMPT` lives at the top of `conversation.py`. It follows the same shape Pollen Robotics use in their official conversation app — IDENTITY / RESPONSE RULES / CORE TRAITS / HOUSEHOLD CONTEXT / BEHAVIOR RULES / TOOL & MOVEMENT RULES / FINAL REMINDER — but tuned for a local model that has no realtime memory and no native tool calling.

### What you edit

Open `conversation.py` and find the `## HOUSEHOLD CONTEXT` block. Replace the placeholder with your household:

```markdown
## HOUSEHOLD CONTEXT
You live with a family. Edit this section with the details you want the robot
to know about your household — for example: who lives there, where roughly,
ages of kids if relevant, pets, the family's hobbies or shared interests,
anything else worth remembering.
You cannot recognize voices or faces — you don't know who is talking.
Address the speaker as "you"; only use a name if they introduce themselves
this turn.
```

That single section is the difference between a chatbot and a robot that feels like it belongs in your room.

### Lessons learned tuning it

These rules are baked into the rest of `SYSTEM_PROMPT` because Gemma kept getting them wrong. They are worth keeping when you fork it:

- **Don't address the speaker by name.** Gemma will latch onto the first name in the household context and start every reply with it. The prompt explicitly forbids name-addressing unless the speaker introduces themselves *this turn*.
- **Don't project moods or appearance.** With a camera attached, Gemma will hallucinate "you look comfy" or "you seem tired" every turn. The prompt forbids it; the code only passes a frame to the VLM when the speaker uses a *look trigger* word (`see`, `show`, `what color`, `describe`, …).
- **Don't ask a follow-up question every reply.** Stock LLM behavior is exhausting in voice — every line ends with "What do you think?" The prompt caps follow-ups at roughly one in three turns.
- **Mirror short acks.** If the speaker says "ok", "mhm", or "cool", the robot replies with a one-word ack of its own, not a probing question.
- **Default English, follow on switch.** Multilingual is great until the model randomly switches mid-conversation. The rule is: mirror the speaker's language; default to English.
- **Gate the emotion tool.** The model loves emoting on every reply ("[emotion:curious1] I am Reachy Mini."). The prompt tells the model to hold back, and the runtime drops emotion tags unless the user's utterance contains an emotion trigger word; see `EMOTION_TRIGGERS` in `conversation.py`.
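
The look-trigger gate from the second bullet can be sketched as below. The names `LOOK_TRIGGERS`, `wants_vision`, and `select_frame` are assumptions made for illustration, and the trigger tuple contains only the examples listed above, so the real list is likely longer.

```python
import re
from typing import Optional, Sequence

# Hypothetical list built from the examples in this README.
LOOK_TRIGGERS: Sequence[str] = ("see", "show", "what color", "describe")


def wants_vision(utterance: str, triggers: Sequence[str] = LOOK_TRIGGERS) -> bool:
    """True when the utterance contains a trigger as a whole word or phrase."""
    lowered = utterance.lower()
    return any(re.search(rf"\b{re.escape(t)}\b", lowered) for t in triggers)


def select_frame(utterance: str, latest_frame: Optional[bytes]) -> Optional[bytes]:
    """Return the frame to pass to the VLM, or None for a text-only turn."""
    if latest_frame is not None and wants_vision(utterance):
        return latest_frame
    return None


print(wants_vision("What color is my shirt?"))   # True
print(wants_vision("We played on the seesaw"))   # False
print(select_frame("Tell me a joke", b"jpeg"))   # None
```

Matching on word boundaries avoids false positives such as "seesaw", and a text-only turn gives the model no image to comment on.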

### Action examples that actually fire

The tool tags only get used if the prompt shows the model what "good" looks like. The examples in `## RESPONSE EXAMPLES` use `[nod]`, `[shake]`, `[wiggle]` inline — keep them concrete, not abstract.

### Where the emotion list plugs into runtime

At startup `LocalConversationStream.__init__` loads `pollen-robotics/reachy-mini-emotions-library` and injects the list of available emotion names into the prompt via the `{emotion_list}` placeholder. So the model sees, by name, every emotion it can actually play. If you swap the library, the prompt updates itself.
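
A minimal sketch of that injection step, assuming the emotion names are already available as a list of strings. `PROMPT_TEMPLATE` and `render_prompt` are hypothetical, the template is a two-line stand-in for the real `SYSTEM_PROMPT`, and `sad1` and `happy2` are made-up names.

```python
from typing import Iterable

# Abbreviated, hypothetical template; the real SYSTEM_PROMPT is much longer.
PROMPT_TEMPLATE = (
    "## TOOL & MOVEMENT RULES\n"
    "Use [emotion:NAME] only with one of these names: {emotion_list}\n"
)


def render_prompt(template: str, emotion_names: Iterable[str]) -> str:
    """Fill the {emotion_list} placeholder with a sorted, comma-separated list."""
    names = ", ".join(sorted(set(emotion_names)))
    # str.replace leaves any other braces in the prompt untouched,
    # which str.format would try to interpret as fields.
    return template.replace("{emotion_list}", names)


print(render_prompt(PROMPT_TEMPLATE, ["sad1", "curious1", "happy2"]))
```

Because the list is rebuilt from whatever the library contains, swapping the library changes the prompt without any manual edits.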

## Helper scripts

| Script                                | Purpose                                                            |
| ------------------------------------- | ------------------------------------------------------------------ |
| `run.sh "..."`                        | Launch the conversation loop.                                      |
| `speak.sh "Hello"`                    | One-shot TTS — generate on the Mac, play through the robot.        |
| `take_photo.sh out.jpg`               | Grab a single frame from the robot's camera over SSH.              |
| `record_voice.sh 12 voice_ref.wav`    | Capture 12 s of audio through the robot's microphone.              |
| `test_reachy.py`                      | Minimal SDK smoke test — wiggles the antennas.                     |
| `look_at_click.py`                    | Click the camera feed to make the robot look at a pixel.           |

The helper scripts assume the default `pollen` / `root` SSH credentials shipped with ReachyMiniOS.

## How this differs from the official conversation app

[`pollen-robotics/reachy_mini_conversation_app`](https://github.com/pollen-robotics/reachy_mini_conversation_app) uses the OpenAI Realtime API for STT + LLM + TTS in a single websocket session, with a full function-calling tool system (each tool is a class with a JSON schema, dispatched async with cancellation).

This project trades that off:

- **Pros**: zero cloud cost, fully offline, your voice never leaves the Mac, no rate limits.
- **Cons**: ~3–6 s round-trip on a base M-series chip (the 26B VLM is the bottleneck). Tools are bracket-tag strings parsed from plain text, not strict JSON function calls, because MLX's local VLM stack doesn't expose native tool calling.

If you want real-time turn-taking, use the official app. If you want a robot that thinks entirely on your laptop, this is for you.

## Acknowledgements

- [Pollen Robotics](https://www.pollen-robotics.com/) for the Reachy Mini SDK and the conversation-app architecture this riffs on.
- [`mlx-audio`](https://github.com/Blaizzy/mlx-audio) for the STT / TTS adapters.
- [`mlx-vlm`](https://github.com/Blaizzy/mlx-vlm) for the Gemma 4 VLM runtime.
- The [`mlx`](https://github.com/ml-explore/mlx) project at Apple ML Research.

## License

MIT — see [LICENSE](LICENSE).