Norbert Schmidt d18e22cd2a Expand README: foreground the prompt-personalization workflow
The household context and tuning lessons are the main thing here.
Add a dedicated section showing what to edit, why each rule exists,
and how the emotion library plugs into the prompt at runtime.
2026-05-12 10:12:49 +02:00

reachy-mlx-vlm

Fully local conversational AI for the Reachy Mini robot — every model runs on your Mac using MLX. No cloud APIs, no realtime websockets, no usage tiers.

The robot's microphone, speaker and camera stream over WebRTC/IPC to the Mac, which runs the whole pipeline on Apple Silicon and sends the synthesized audio back to the robot.

 ┌──────────────┐    mic     ┌─────────────────────────────────────────────┐    speaker  ┌──────────────┐
 │  Reachy Mini │ ─────────▶ │  Mac (Apple Silicon)                        │ ──────────▶ │  Reachy Mini │
 │              │            │                                             │             │              │
 │  mic + cam   │   camera   │   Qwen3-ASR  →  Gemma 4 VLM  →  Spark-TTS   │             │   speaker    │
 │              │ ─────────▶ │     (STT)        (think)         (voice)    │             │              │
 │              │            │                                             │             │              │
 └──────────────┘            └─────────────────────────────────────────────┘             └──────────────┘
        ▲                                                                                       │
        └───────────────── head pose / antennas / dance / emotion / photo ───────────────────────┘

What it does

  • Listens via the robot's microphone, segmenting speech with energy-based voice-activity detection (VAD) plus hallucination filtering.
  • Transcribes with Qwen3-ASR-1.7B (8-bit MLX).
  • Thinks with Gemma 4 26B-A4B (mlx-community/gemma-4-26b-a4b-it-bf16) — vision-capable, so it can describe what it sees on request.
  • Speaks with Spark-TTS-0.5B-bf16 streamed back through the robot's speaker.
  • Moves while talking — a multi-frequency head-wobble during speech, subtle breathing animation while idle.
  • Acts on bracketed tags emitted by the model: [nod], [shake], [wiggle], [dance], [photo], [emotion:NAME]. The tags are stripped from speech and executed in parallel with TTS.
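
As a rough illustration of the energy-based VAD mentioned in the first bullet, here is a minimal sketch. It is not the project's implementation: the function names are hypothetical, the frame format (floats in [-1.0, 1.0]) is an assumption, and the 0.04 threshold is simply the value shown in the --energy-threshold example, not a confirmed default.

```python
import math

def rms_energy(samples):
    """Root-mean-square energy of one audio frame (floats in [-1.0, 1.0])."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_speech(samples, energy_threshold=0.04):
    """Hypothetical gate: a frame counts as speech when its RMS energy
    exceeds the threshold. Lowering the threshold makes the gate more sensitive."""
    return rms_energy(samples) > energy_threshold

# Two synthetic frames: near-silence and a loud square-ish wave.
quiet = [0.001] * 160
loud = [0.2, -0.2] * 80
print(is_speech(quiet), is_speech(loud))
```

A real detector would also need hangover frames and a minimum utterance length so short pauses do not split a sentence; those are left out here.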

Requirements

  • Hardware: Apple Silicon Mac (M1+) with ~32 GB RAM, Reachy Mini Wireless on the same Wi-Fi.
  • Software: macOS, Python 3.12+, uv, GStreamer (brew install gstreamer), sshpass (for the helper scripts: brew install sshpass).
  • Disk: ~30 GB for the model weights downloaded from Hugging Face on first run.

Setup

git clone https://github.com/Pocket-science/reachy-mlx-vlm.git
cd reachy-mlx-vlm
uv sync

That's it. The first time you run ./run.sh, MLX will pull the Qwen3-ASR, Gemma 4 VLM, Spark-TTS, and reachy-mini-emotions-library repos from Hugging Face (one-off, then cached).

Note on mlx-vlm versions. The pyproject.toml lock currently resolves to mlx-vlm 0.3.9 because reachy-mini 1.2.3 pins transformers below the floor required by mlx-vlm >= 0.5.0. The codebase works fine on 0.3.9 through 0.4.4. To run 0.5.0 in the same venv:

VIRTUAL_ENV=.venv uv pip install --no-deps mlx==0.31.2 mlx-metal==0.31.2 mlx-vlm==0.5.0

This sidesteps the dependency tree — your Mac never actually runs transformers to drive the robot, so the constraint isn't load-bearing.

Run

./run.sh                          # default — vision on, talks to reachy-mini.local
./run.sh --no-camera              # audio-only mode
./run.sh --host 192.168.1.55      # explicit robot IP
./run.sh --energy-threshold 0.04  # more sensitive VAD
./run.sh --debug                  # verbose logs

The robot's daemon must be reachable at http://reachy-mini.local:8000. The wireless model ships with --no-autostart, so click the wake-up button in the dashboard once before launching, or hit:

curl -X POST "http://reachy-mini.local:8000/api/daemon/start?wake_up=true"

Tool / action system

The model can emit bracketed tags anywhere in its reply. They are parsed out, executed asynchronously, and stripped from the spoken text.

Tag              What it does
[nod]            Three-pose pitch nod.
[shake]          Three-pose yaw shake.
[wiggle]         Three antenna oscillations.
[dance]          Plays a random move from pollen-robotics/reachy-mini-dances-library.
[photo]          Saves a JPEG snapshot from the camera to photos/.
[emotion:NAME]   Plays a named pre-recorded emotion (silent; sound track muted).
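
The parse-and-strip step described above could look roughly like the sketch below. It assumes a single regular expression over the tags in the table; TAG_RE and split_tags are hypothetical names, and the real parser in conversation.py may differ. Executing the actions is out of scope here.

```python
import re

# Matches the tags listed in the table; the emotion name is anything up to "]".
TAG_RE = re.compile(r"\[(nod|shake|wiggle|dance|photo|emotion:[^\]\s]+)\]")

def split_tags(reply):
    """Return (spoken_text, tags): the reply with tags removed, plus the
    tags in order of appearance."""
    tags = TAG_RE.findall(reply)
    spoken = TAG_RE.sub("", reply)
    # Collapse the gaps left behind by removed tags.
    spoken = re.sub(r"\s{2,}", " ", spoken).strip()
    return spoken, tags

spoken, tags = split_tags("[nod] Sure, one second. [emotion:curious1] Let me look.")
print(spoken)  # text that would go to TTS
print(tags)    # actions that would be dispatched alongside it
```

Unknown bracketed strings are left in the text on purpose in this sketch, so a stray "[citation]" would be spoken rather than silently dropped; whether the project does the same is not stated.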

The emotion library is listed in the system prompt at startup so the model can pick by name. To avoid overuse, the runtime drops [emotion:...] tags whenever the user's utterance does not include an emotion trigger word (happy, sad, look ..., etc.). Configure the trigger list in LocalConversationStream.EMOTION_TRIGGERS.
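
A minimal sketch of that gate, under stated assumptions: the trigger tuple below is an illustrative subset built from the words named above, matching is a plain case-insensitive substring test, and gate_emotions and the emotion name sad1 are hypothetical. The actual list lives in LocalConversationStream.EMOTION_TRIGGERS.

```python
import re

# Illustrative subset only; the real list is configured in the project.
EMOTION_TRIGGERS = ("happy", "sad", "look")
EMOTION_TAG_RE = re.compile(r"\[emotion:[^\]]*\]\s*")

def gate_emotions(user_text, reply):
    """Keep [emotion:...] tags only when the user's utterance contains a
    trigger word; otherwise remove them from the reply."""
    lowered = user_text.lower()
    if any(trigger in lowered for trigger in EMOTION_TRIGGERS):
        return reply
    return EMOTION_TAG_RE.sub("", reply).strip()

print(gate_emotions("Who are you?", "[emotion:curious1] I am Reachy Mini."))
print(gate_emotions("Can you look sad?", "[emotion:sad1] Like this."))
```

Substring matching is deliberately crude: "sad" also matches "saddle". Tightening it to whole words is a reasonable next step if false positives show up.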

Personalizing the prompt — this is where the project lives

A generic robot says "I am an AI assistant. How can I help you?" A robot that knows your house says "The dogs are out back, the kids should be home from hockey around six." This section is the difference, and it's the part that actually took time to get right.

The full SYSTEM_PROMPT lives at the top of conversation.py. It follows the same shape Pollen Robotics use in their official conversation app — IDENTITY / RESPONSE RULES / CORE TRAITS / HOUSEHOLD CONTEXT / BEHAVIOR RULES / TOOL & MOVEMENT RULES / FINAL REMINDER — but tuned for a local model that has no realtime memory and no native tool calling.

What you edit

Open conversation.py and find the ## HOUSEHOLD CONTEXT block. Replace the placeholder with your household:

## HOUSEHOLD CONTEXT
You live with a family. Edit this section with the details you want the robot
to know about your household — for example: who lives there, where roughly,
ages of kids if relevant, pets, the family's hobbies or shared interests,
anything else worth remembering.
You cannot recognize voices or faces — you don't know who is talking.
Address the speaker as "you"; only use a name if they introduce themselves
this turn.

That single section is the difference between a chatbot and a robot that feels like it belongs in your room.

Lessons learned tuning it

These quirks are baked into the rest of SYSTEM_PROMPT because Gemma kept doing them wrong. Worth keeping when you fork it:

  • Don't address the speaker by name. Gemma will latch onto the first name in the household context and start every reply with it. The prompt explicitly forbids name-addressing unless the speaker introduces themselves this turn.
  • Don't project moods or appearance. With a camera attached, Gemma will hallucinate "you look comfy" or "you seem tired" every turn. The prompt forbids it; the code only passes a frame to the VLM when the speaker uses a look trigger word (see, show, what color, describe, …).
  • Don't ask a follow-up question every reply. Stock LLM behavior is exhausting in voice — every line ends with "What do you think?" The prompt caps follow-ups at roughly one in three turns.
  • Mirror short acks. If the speaker says "ok", "mhm", or "cool", the robot replies with a one-word ack of its own, not a probing question.
  • Default English, follow on switch. Multilingual is great until the model randomly switches mid-conversation. The rule is: mirror the speaker's language; default to English.
  • Gate the emotion tool. The model loves emoting on every reply ("[emotion:curious1] I am Reachy Mini."). The prompt discourages this, and the runtime strips emotion tags unless the user's utterance contains an emotion trigger word; see EMOTION_TRIGGERS in conversation.py.
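
The camera rule in the second bullet (only pass a frame when the speaker uses a look trigger word) can be sketched as follows. The trigger tuple holds just the four examples named in that bullet, not the full list, and the whole-word matching, LOOK_RE, and frame_for_turn are assumptions made for illustration rather than the project's code.

```python
import re

# The four look triggers named in the text; the real list is longer.
LOOK_TRIGGERS = ("see", "show", "what color", "describe")
LOOK_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in LOOK_TRIGGERS) + r")\b",
    re.IGNORECASE,
)

def frame_for_turn(user_text, latest_frame):
    """Return the camera frame only when the utterance asks the robot to
    look at something; otherwise return None so the turn is text-only."""
    return latest_frame if LOOK_RE.search(user_text) else None

frame = b"fake-jpeg-bytes"
print(frame_for_turn("What color is my mug?", frame) is not None)
print(frame_for_turn("You seem nice", frame) is None)
```

Matching on word boundaries keeps "seem" from triggering on "see", which matters here because a false positive sends an image to the VLM and invites exactly the appearance comments the prompt tries to prevent.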

Action examples that actually fire

The tool tags only get used if the prompt shows the model what "good" looks like. The examples in ## RESPONSE EXAMPLES use [nod], [shake], [wiggle] inline — keep them concrete, not abstract.

Where the emotion library plugs into runtime

At startup LocalConversationStream.__init__ loads pollen-robotics/reachy-mini-emotions-library and injects the list of available emotion names into the prompt via the {emotion_list} placeholder. So the model sees, by name, every emotion it can actually play. If you swap the library, the prompt updates itself.
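
The placeholder substitution can be sketched like this. The template text, build_system_prompt, and the emotion name sad1 are hypothetical; only the {emotion_list} placeholder comes from the project, and loading the names from the emotions library is not shown.

```python
# Hypothetical stand-in for the relevant part of SYSTEM_PROMPT.
PROMPT_TEMPLATE = (
    "## TOOL & MOVEMENT RULES\n"
    "Available emotions: {emotion_list}\n"
)

def build_system_prompt(template, emotion_names):
    """Fill the {emotion_list} placeholder with a sorted, comma-separated
    list of emotion names."""
    emotion_list = ", ".join(sorted(emotion_names))
    # str.replace is used instead of str.format so other literal braces
    # in the prompt (if any) cannot raise a KeyError.
    return template.replace("{emotion_list}", emotion_list)

prompt = build_system_prompt(PROMPT_TEMPLATE, ["sad1", "curious1"])
print(prompt)
```

Because the list is built from whatever names are passed in, swapping the library changes the prompt without touching the template, which matches the behaviour described above.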

Helper scripts

Script                             Purpose
run.sh "..."                       Launch the conversation loop.
speak.sh "Hello"                   One-shot TTS: generate on the Mac, play through the robot.
take_photo.sh out.jpg              Grab a single frame from the robot's camera over SSH.
record_voice.sh 12 voice_ref.wav   Capture 12 s of audio through the robot's microphone.
test_reachy.py                     Minimal SDK smoke test; wiggles the antennas.
look_at_click.py                   Click the camera feed to make the robot look at a pixel.

The helper scripts assume the default pollen / root SSH credentials shipped with ReachyMiniOS.

How this differs from the official conversation app

pollen-robotics/reachy_mini_conversation_app uses the OpenAI Realtime API for STT + LLM + TTS in a single websocket session, with a full function-calling tool system (each tool is a class with a JSON schema, dispatched async with cancellation).

This project trades that off:

  • Pros: zero cloud cost, fully offline, your voice never leaves the Mac, no rate limits.
  • Cons: ~36 s round-trip on a base M-series chip (the 26B VLM is the bottleneck). Tools are bracket-tag strings parsed from plain text, not strict JSON function calls, because MLX's local VLM stack doesn't expose native tool calling.

If you want real-time turn-taking, use the official app. If you want a robot that thinks entirely on your laptop, this is for you.

Acknowledgements

  • Pollen Robotics for the Reachy Mini SDK and the conversation-app architecture this riffs on.
  • mlx-audio for the STT / TTS adapters.
  • mlx-vlm for the Gemma 4 VLM runtime.
  • The mlx project at Apple ML Research.

License

MIT — see LICENSE.
