Building a Fully Local Voice AI Agent on a Reachy Mini Robot

Community Article Published April 6, 2026

Written by Curtis Burkhalter, Ph.D. (HP), Matt Allard, Alyss Noland, Allen Bourgoyne, David Messina (NVIDIA), Jeff Boudier (Hugging Face), Remi Fabre (Pollen Robotics) March 2026

A patient walks up to a small robot on a desk. The robot raises its head, wiggles its antennas, and says: "Hi there! I'm your medical triage assistant. How can I help you today?"

The patient says they've had bad back pain for four days. The robot asks follow-up questions: how severe, where exactly, what triggered it. The conversation feels natural. The response time is under 2.5 seconds.

None of this touches the cloud. No audio leaves the room. No patient data hits an external API. The speech recognition, medical reasoning, and voice synthesis all run on a single local device sitting next to the robot.

This post walks through how we built it, what broke along the way, and the lessons about building custom AI applications on the Pollen Robotics Reachy Mini that aren't in any documentation yet. For general guidance on building Reachy Mini applications with the Reachy Mini SDK, there is extensive documentation at https://github.com/pollen-robotics/reachy_mini.

The Setup

The system has two physical components connected over a local network:

Reachy Mini - a Pollen Robotics desktop robot with an onboard mic, speaker, camera, and nine actuators driving its motorized head, body, and antennae. It runs Linux on a Raspberry Pi Compute Module 4, with a system daemon handling hardware I/O and a growing ecosystem of apps powered by Hugging Face Spaces.

HP ZGX Nano - an NVIDIA GB10 Grace Blackwell-based device with 128 GB of unified memory. It runs the AI inference stack inside a Docker container.

The robot records audio from its mic, sends it to the HP ZGX Nano over HTTP, and plays back the synthesized response through its speaker. The HP ZGX Nano runs three AI models in sequence:

  1. Speech-to-text: whisper.cpp transcribes the patient's speech
  2. Medical reasoning: Llama 3.1 8B Instruct (AWQ INT4 quantized) generates a triage response via vLLM. We used a smaller LLM to achieve lower overall latency.
  3. Text-to-speech: Piper synthesizes the response as audio

The full pipeline completes in under 2.5 seconds from end of speech to hearing the robot's response, all running locally on the HP ZGX Nano.

Reachy Mini (mic) ──POST /process──► HP ZGX Nano (Docker)
                                      ├── whisper.cpp STT
                                      ├── vLLM + Llama 3.1 8B
                                      └── Piper TTS
Reachy Mini (speaker) ◄──WAV audio──┘
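From the robot's side, the exchange above is a single HTTP round trip. A minimal client sketch - the /process endpoint name comes from our diagram, but the payload format (raw WAV in, WAV out), host, and port here are illustrative assumptions, not a documented API:

```python
import io
import urllib.request
import wave

ZGX_URL = "http://192.168.1.50:8000/process"  # hypothetical address and port

def pcm_to_wav(pcm_bytes, rate=16000, channels=2):
    """Wrap raw S16_LE PCM (as produced by arecord) in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)        # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(pcm_bytes)
    return buf.getvalue()

def ask_triage_agent(pcm_bytes):
    """POST the recording; the server runs STT -> LLM -> TTS and replies with WAV."""
    req = urllib.request.Request(
        ZGX_URL,
        data=pcm_to_wav(pcm_bytes),
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()  # WAV audio to play through the robot's speaker
```

The 30-second client timeout is deliberately generous: it covers the ~2.5 s pipeline plus model warm-up and network jitter on a congested hotspot.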

A browser-based dashboard polls the API and displays the conversation transcript in real time, allowing a clinician or demo observer to follow along.

Why Local Matters

The obvious answer is data privacy. Healthcare organizations bound by HIPAA, government agencies with data sovereignty requirements, and defense contractors in air-gapped environments need AI inference that never leaves the premises.

But there's a less obvious reason: deployment reliability. Cloud-dependent AI demos can fail depending on the network: hotel WiFi throttles API calls, and corporate networks block outbound traffic to inference endpoints. An air-gapped system running on local hardware such as the HP ZGX Nano just works, regardless of the network environment.

We learned this the hard way when deploying the demo at NVIDIA GTC. The venue WiFi had client isolation enabled, which blocks device-to-device traffic - the robot and the HP ZGX Nano couldn't see each other until we moved both onto a mobile hotspot.

Building on the Reachy Mini: What the Docs Don't Tell You

The Reachy Mini is a compelling platform. It's expressive, affordable, programmable in Python, and has an app ecosystem powered by Hugging Face Spaces. But building a production-quality app on it required solving several problems that aren't covered in the current documentation.

Audio Input: The SDK's Mic Detection Doesn't Always Work

The Reachy Mini has a USB audio device with both a microphone and speaker. The SDK provides methods like start_recording() and get_audio_sample() for audio capture.

In our testing, the SDK's GStreamer-based DeviceMonitor consistently failed to detect the audio input device when running from the daemon's app subprocess. The hardware was present and functional - just invisible to GStreamer in that execution context.

The workaround was to bypass the SDK entirely and record directly via ALSA using arecord through a subprocess call:

import subprocess

def record_audio(duration_seconds):
    """Record from the Reachy Mini mic via ALSA.
    Key: must use stereo (-c 2). Mono fails on this hardware."""
    cmd = [
        "arecord", "-D", "reachymini_audio_src",
        "-f", "S16_LE", "-r", "16000", "-c", "2",
        "-d", str(int(duration_seconds)),
        "-t", "raw", "-q", "-"
    ]
    result = subprocess.run(
        cmd, capture_output=True,
        timeout=duration_seconds + 5
    )
    return result.stdout  # Raw stereo PCM bytes

Two details that took time to discover: the ALSA device name is reachymini_audio_src (not the generic hw:0,0), and the device requires stereo recording. Passing -c 1 for mono fails with a cryptic "Channels count non available" error.

Audio playback through the SDK works reliably, even when input detection fails. We used push_audio_sample() with float32 numpy arrays for all output.

Note: This was a bug in earlier versions of the Reachy Mini software. On up-to-date versions, the SDK's audio input methods should work correctly. The ALSA approach was our workaround at the time.
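Whichever input path you use, the raw stereo S16_LE buffers usually need converting - to mono for speech-to-text, and to float32 for playback. A sketch of the capture-side conversion (the exact dtype and range the downstream STT code expects is our assumption):

```python
import numpy as np

def stereo_s16_to_mono_f32(pcm_bytes):
    """Convert interleaved stereo S16_LE PCM (from arecord -c 2)
    to a mono float32 array in [-1.0, 1.0]."""
    samples = np.frombuffer(pcm_bytes, dtype=np.int16)
    stereo = samples.reshape(-1, 2)   # one row per frame: [left, right]
    mono = stereo.mean(axis=1)        # average the two channels
    return (mono / 32768.0).astype(np.float32)
```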

Motor Control: The Start-Stop-Start Pattern

The robot starts in a sleep position with its head down. The SDK provides wake_up() to raise the head and initialize motors.

We observed that motor commands issued immediately after app startup would execute without errors but produce no physical movement. The motors appeared to need a "priming" cycle: start the app, stop it, wait a few seconds, then start it again. On the second launch, motors responded normally.

We built this into our launch script as an automated start/stop/start sequence. It adds about fifteen seconds to startup but guarantees the robot is physically responsive when the greeting plays.

We also found that passing numpy arrays for antenna positions (np.array([0.3, -0.3])) was more reliable than plain Python lists, though the SDK type hints accept both.

Typical Integration Problems

The Reachy Mini daemon runs apps as Python subprocesses. Variables set in /etc/environment on the robot are not available to the app process via os.getenv().

This matters when your app needs to know the IP address of an external server (like the HP ZGX Nano running the AI stack). That IP changes every time you connect to a new network.

Our solution was to use sed to patch the IP address directly in the app's source file on the robot before launching. It's crude but debuggable - you can SSH in and grep the file to see exactly what address the app is using. We automated this in the launch script so it runs automatically whenever the network changes.

The sed-based approach works but is fragile. Where mDNS ("Bonjour") name resolution works, you can address machines by hostname instead of IP (e.g. reachy-mini.local). If a literal IP address is needed, typical alternatives are CLI arguments or an .env file.
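If we were doing it again, a small resolution chain - CLI flag first, then environment variable, then an mDNS hostname - would replace the sed patching. A hedged sketch (the variable name and the zgx-nano.local hostname are hypothetical):

```python
import argparse
import os

def resolve_api_host():
    """Pick the AI server address: CLI flag > env var > mDNS default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--api-host", default=None,
                        help="IP or hostname of the HP ZGX Nano")
    args, _ = parser.parse_known_args()
    return (
        args.api_host
        or os.getenv("TRIAGE_API_HOST")  # note: won't propagate through the daemon subprocess
        or "zgx-nano.local"              # mDNS/Bonjour fallback (hypothetical name)
    )
```

The environment-variable step only helps when the app runs outside the daemon; on the robot itself, the CLI flag or hostname fallback does the work.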

Silent Crashes

When a Reachy Mini app crashes during import or within the first few seconds of execution, the daemon logs "App finished" with no traceback. No error message, no stack trace, just… finished.

The fix is to SSH into the robot and test the import manually:

sudo /venvs/apps_venv/bin/python3 -c "
from your_app.main import YourAppClass
"

This will print the actual Python exception. Common causes include missing packages in the robot's virtualenv, stale __pycache__ bytecode from a previous version, and network connectivity issues if the app tries to reach an external service during initialization.

The single most valuable debugging practice we adopted was logging every configuration value at app startup:

def run(self, reachy_mini, stop_event):
    logger.info("=" * 60)
    logger.info(f"  API URL: {API_URL}")
    logger.info(f"  Audio device: {ALSA_DEVICE}")
    logger.info("=" * 60)

When the demo broke at GTC, this was the first thing we checked - and it immediately showed the app was pointing at an old IP address.

Making the Robot Expressive

A voice agent that just talks is less compelling than one that moves. We added head and antenna movements to signal the robot's state:

  • Ready: antennas up, head neutral
  • Listening: head tilted slightly, antennas forward
  • Thinking: head tilted up, antennas down
  • Speaking: head leaning forward, gentle antenna movement

from reachy_mini.utils import create_head_pose
import numpy as np

def expr_thinking(reachy_mini):
    """Head up, antennas down - processing the patient's words."""
    head = create_head_pose(yaw=0, pitch=8, roll=0, degrees=True)
    reachy_mini.goto_target(
        head=head,
        antennas=np.array([-0.2, 0.2]),
        duration=0.4
    )

These transitions happen automatically in the conversation loop. When speech is detected, the robot shifts to listening pose. When audio is sent to the API, it shifts to thinking. When the response starts playing, it shifts to speaking. The effect is subtle but it makes the interaction feel significantly more natural - observers consistently commented on how "alive" the robot felt.

VAD-Based Recording

Fixed-duration recording (e.g., "record for 5 seconds") wastes time when the speaker finishes early and cuts off longer utterances. We implemented voice activity detection (VAD) using chunked recording:

  1. Record in 1-second chunks via arecord
  2. Compute RMS energy of each chunk
  3. When energy exceeds the threshold, speech has started
  4. Keep recording until 3 consecutive silent chunks are detected
  5. Send the accumulated audio to the API

import numpy as np

SILENCE_THRESHOLD = 400  # RMS for 16-bit PCM
SILENCE_CHUNKS = 3       # 3 consecutive silent chunks = done

def compute_rms(chunk_bytes):
    """RMS energy of a raw S16_LE PCM chunk."""
    samples = np.frombuffer(chunk_bytes, dtype=np.int16)
    if samples.size == 0:
        return 0.0
    return float(np.sqrt(np.mean(samples.astype(np.float64) ** 2)))

# Wait for speech
while not stop_event.is_set():
    chunk = record_chunk_alsa(1)  # 1-second chunk via arecord (see above)
    rms = compute_rms(chunk)
    if rms > SILENCE_THRESHOLD:
        break  # Speech detected

# Record until silence
chunks = [chunk]
silent_count = 0
while silent_count < SILENCE_CHUNKS:
    chunk = record_chunk_alsa(1)
    chunks.append(chunk)
    if compute_rms(chunk) < SILENCE_THRESHOLD:
        silent_count += 1
    else:
        silent_count = 0

This approach adapts to the speaker naturally. A short "yes" takes about 4 seconds total (1s speech + 3s silence detection). A longer description of symptoms takes as long as the speaker needs, plus 3 seconds.

Deploying at a Venue

The system was demonstrated at NVIDIA GTC. The deployment process we settled on:

  1. Connect both devices to the same network (mobile hotspot if venue WiFi has client isolation)
  2. Run a single launch script that auto-detects IPs, patches the robot's config, starts the Docker container, waits for the LLM to load, primes the motors, and launches the app
  3. Open the dashboard URL that the script prints

The launch script includes preflight checks that verify Docker is running, the AI models are present, the robot is reachable, SSH is configured, and the app is installed. If anything fails, it tells you exactly what's wrong and how to fix it.
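The preflight checks boil down to a handful of yes/no probes. A sketch of the idea (hosts, ports, and check names are placeholders, not our actual script):

```python
import shutil
import socket

def check_port(host, port, timeout=2.0):
    """Return True if a TCP service at host:port is accepting connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def preflight(robot_host, api_host):
    """Run go/no-go checks; return a list of (name, ok) pairs."""
    checks = [
        ("docker installed", shutil.which("docker") is not None),
        ("robot reachable (SSH)", check_port(robot_host, 22)),
        ("AI API up", check_port(api_host, 8000)),  # port 8000 is an assumption
    ]
    for name, ok in checks:
        print(f"{'OK ' if ok else 'FAIL'}  {name}")
    return checks
```

Failing fast on these three questions - before the LLM even starts loading - is what turns a venue mishap from a scramble into a one-line fix.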

Total time from cold start to "Hi there, how can I help you today?": about two minutes, mostly waiting for the LLM to load into GPU memory.

What We'd Do Differently

Log everything from day one. The hours we spent debugging silent crashes and wrong IP addresses would have been minutes if we'd built comprehensive startup logging from the start.

Don't use environment variables for configuration on the Reachy Mini. We wasted significant time on an approach that fundamentally doesn't work with the daemon's subprocess model. Patch the source file directly.

Test on the target network early. Our system worked perfectly in the office and broke at the venue. Network assumptions are the most common failure mode for multi-device demos.

Keep the AI simple, spend time on integration. The AI pipeline (Whisper + vLLM + Piper) worked on the first try. The robot integration took days. If you're building on the Reachy Mini, budget your time accordingly.

Get Started

We published everything from this project as open resources.

If you're building AI applications on the Reachy Mini, or exploring on-premise voice AI for regulated environments, we hope these resources save you some late nights.

Additional Thanks

Curtis Burkhalter would also like to thank Rick Gosalvez (Product Manager, HP) and Prashant Salgaocar (DevOps Manager, HP) for their assistance and troubleshooting of the Reachy Mini demo at NVIDIA GTC 2026.
