Building a Jarvis-inspired, voice-activated, LLM-powered virtual assistant

Just another day at the office

I’d like my computer to be smarter, more interactive, and able to handle boring stuff for me, and I’d also like to play around with some LLM / AI stuff… which brings me to this project.  I’ve got a ton of basic things I’d love for it to do – manage lists, reminders, some Outlook functions, some media functions – and then also be able to interact with me, all via voice commands.  Yes, you can do this with ChatGPT and probably others – but I am loath to provide any outside resource with more of “me” (DNA, biometrics, voice, ambient noises, etc.) than absolutely necessary.  Plus, I’ve been tinkering with these little LLMs for a while now and want to see just what I can build out of them and with their assistance.

I’m not great at Python1, so I admittedly enlisted the help of some very large LLMs.  I started the main project in conjunction with ChatGPT, used Gemini to answer basic questions about Python syntax and the like, and turned to Claude for random things.  The reason for keeping my general questions in Gemini rather than ChatGPT was so that I wouldn’t “pollute” the ChatGPT flow of discussion with irrelevant sidetracks.  This was the same reason for separating out the Claude discussions too.  I find Claude reasonably helpful for coding tasks, but the usage limits are too restrictive.

My kiddo asked me how much of the code was written by these models versus my own code.  I’d say the raw code was mostly written by LLMs – but I’m able to tinker, debug, and… above all, learn.  I’d rather be the one writing the code from scratch, but I’m treating these LLMs like water wings.  I know I’m not keeping myself fully afloat – but I’m actually the one treading water, putting it all together, and learning how to do it myself.  Also… said kiddo was interested in building one too – so I’m helping teach someone else manually, and learning more that way.2

Ingredients

As with many of my projects, I started by testing the individual pieces to see if I could get things working.  I validated them in order:

  • Could I get Python to record audio?
  • Could I get Python to transcribe that audio?
  • Could I get Python to use an API to run queries in LM Studio?
    • Yep!  Using the openai Python package, I could send queries to LM Studio after an LLM had been loaded into memory (see the first sketch after this list)
  • Could I get Python to get my computer to respond to a “wakeword”?
    • Yep!  There’s a Python module for wakeword detection using PocketSphinx (see the second sketch after this list).  This was an interesting romp.  I found that I had to really tinker with the data being sent to the wakeword detector for the wakeword to be properly recognized, and then fiddle with the timing to make sure whatever came after the wakeword was fully captured before being sent to the LLM.  Otherwise, “Jarvis, set a timer for 15 minutes” would become… “Jarvis, for 15 minutes” – the “Jarvis” would get picked up by the wakeword detector, but the rest wasn’t caught in time to be processed by Whisper.
  • Can I get Python to verbally recite statements out loud?
    • Yep!  I used text-to-speech via Piper.  However, this process took a while.  One thing I learned was that you need not just the voice model’s *.ONNX file, but also the *.JSON config file associated with it.
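
Here’s roughly what that LM Studio call looks like – a minimal sketch, not my actual script.  LM Studio exposes an OpenAI-compatible server on localhost, so the openai package can talk to it directly; the port is LM Studio’s default and the model name is a placeholder.

  from openai import OpenAI

  # LM Studio's local server speaks the OpenAI API; the key is ignored,
  # but the client insists on having one.
  client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

  response = client.chat.completions.create(
      model="local-model",  # LM Studio answers with whatever model is loaded
      messages=[{"role": "user", "content": "Set a timer for 15 minutes."}],
  )
  print(response.choices[0].message.content)

And a similarly hedged sketch of the wakeword piece, assuming PocketSphinx’s LiveSpeech helper in keyphrase mode (the threshold is just a starting point to tune):

  from pocketsphinx import LiveSpeech

  # Yields a result each time the keyphrase is spotted in the mic stream.
  for phrase in LiveSpeech(keyphrase="jarvis", kws_threshold=1e-20):
      print("Wakeword heard:", phrase)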

Until this point, I had been running LLMs with the training wheels of LM Studio’s API.  I really like the LM Studio program, but I don’t want to be dependent on their service when I’m trying to roll my own LLM interface.  Python can run LLMs directly using “llama-cpp-python” – except that it throws errors on the version of Python I was running (3.14) and was known to work with a prior version (3.11).
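
Once inside a Python 3.11 environment (described next), the direct call is pleasantly small.  A minimal sketch, with a placeholder model path:

  from llama_cpp import Llama

  # Load a local GGUF model directly -- no LM Studio in the middle.
  llm = Llama(model_path="models/some-small-model.gguf", n_ctx=2048, verbose=False)

  out = llm.create_chat_completion(
      messages=[{"role": "user", "content": "Summarize my day in one sentence."}]
  )
  print(out["choices"][0]["message"]["content"])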

This led me to learning about running “virtual environments” within Python, so that I can keep both versions of Python on my computer but run my code within a specific container tied to the version I need.  The first command below creates the virtual environment within my project folder; the second “activates” it.

  • py -3.11 -m venv .venv
    • This created the virtual environment, locked to Python 3.11
  • .venv\Scripts\activate
    • This activates the virtual environment, so I can start working inside it

Back to work!

The man’s got a job to do

Building a Pipeline

This is where things really seemed to take off.  I was able to disconnect my script from LM Studio and use Python to directly call the LLMs I’ve downloaded.  These steps were reasonably straightforward – and I was suddenly able to go from: wakeword -> Whisper-transcribed LLM query -> LLM response -> Piper-recited reply (sketched below).  Then it was reasonably easy to have the script listen for certain words and perform certain actions (setting timers was the first such instance).
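
Stitched together, the loop looks something like this sketch – wait_for_wakeword(), record_phrase(), start_timer(), ask_llm(), and speak() are hypothetical stand-ins for the PocketSphinx, recording, timer, llama-cpp-python, and Piper pieces above, not the actual functions in my script:

  import whisper

  model = whisper.load_model("base.en")

  while True:
      wait_for_wakeword()                  # block until "Jarvis" is heard
      audio_path = record_phrase()         # capture whatever follows the wakeword
      text = model.transcribe(audio_path)["text"].strip()

      if "timer" in text.lower():          # simple keyword shortcut, no LLM needed
          start_timer(text)
      else:
          speak(ask_llm(text))             # LLM reply, recited by Piper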

Optimizations, Problems, Solutions

Complicating factors

Building something that kind of worked brought me to new and interesting ideas, challenges, and problems:

  • The original cobbled-together process was something like:  record audio, transcribe through Whisper, delete the recording, pass the transcribed statement to the LLM, give the LLM’s reply to Piper, generate a new recording, play that recording.  However, this process has some obvious “slop” where I’m making and deleting two temporary audio files.  The solution was to find ways to feed the recording directly into Whisper and feed Piper’s response directly to the speakers, cutting out the two audio files (see the sketch below).
  • I realized that I wanted the script to do more than just shove everything I have to say / ask into an LLM – to be really useful, the script would have to do more than just be a verbal interface for a basic LLM.  This is where I started bolting on a few other things – like calling a very small LLM to parse the initial request into one of:
    1. Something that can be easily accomplished by a Python script (such as setting a timer)
    2. Something that needed to be handled by a larger LLM (summarize, translate, explain)
    3. Something that maybe a small model could address easily (provide simple answer to a simple question)
  • I ran into some problems at this point.  I spent a lot of time trying to constrain a small LLM3 to figure out what the user wanted and assign labels/tasks accordingly.  After a lot of fiddling, it turns out that an LLM is a “generative” model and it wants to “make” something.  Trying to force it to choose among only a dozen “words”4 kept bumping into problems: it would have trouble choosing between two options, choose inconsistently, and sometimes just make up new keywords.  Now, I could come up with a simple Python script which just did basic word-matching to sort the incoming phrases – but it seemed entirely counterproductive to build a Python word-matching process to help a tiny AI.  I then tried building a small “decision tree” of multiple small LLM calls to properly sort between “easy Python script call” and “better call a bigger LLM to help understand what this guy is talking about” – and quickly stopped.  Again, building a gigantic decision tree out of little LLM calls was proving to be a bigger task, adding latency and error with each call.  I was hoping to use a small LLM to make the voice interaction with the computer simple and seamless and then pass bigger tasks to a larger LLM for handling, sprinkling in little verbal acknowledgements and pauses to help everything feel more natural.  Instead, I was spending too much time building ways to make a small LLM stupider, doing this repeatedly, and still ending up with too much slop.
  • And, frankly, it felt weird to try to lobotomize a small LLM into doing something as simple as “does the user’s request best fall into one of 12 categories?”  Yes, small LLMs can easily start to hallucinate, lose track of a conversation, make mistakes, etc.  But to constrain one so tightly that I’m telling it that it may only reply with one of 12 words feels… odd?
Tell me what I want to hear and this can all stop
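
On the audio “slop” point above: Whisper will accept a float32 NumPy array directly, so the recording can skip the temporary file entirely.  A minimal sketch using sounddevice (a fixed-length recording, for simplicity):

  import sounddevice as sd
  import whisper

  SAMPLE_RATE = 16_000  # Whisper expects 16 kHz mono
  model = whisper.load_model("base.en")

  def transcribe_live(seconds: float = 5.0) -> str:
      audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                     channels=1, dtype="float32")
      sd.wait()  # block until the recording finishes
      return model.transcribe(audio.flatten())["text"].strip()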

Over the last few days I’ve been tinkering with building an “intent classifier” or “intent encoder” to do the kind of automatic sorting I was trying to force an LLM to do.  As I understand this process, you feed the classifier a bunch of example statements that have been pre-sorted into different “intent slugs.”  The benefit of a classifier is that it can only reply with one of these intent slugs and will never produce anything else.  It’s also way faster.  Calling a small5 LLM with a sorting question could produce a sometimes-reliable6 answer in about 0.2 s, which is almost unnoticeable.  Calling a classifier to sort should enable a ~97% reliable result within 0.05 s.  That is so fast it’s imperceptible.

I haven’t tried this yet.  I’ve built up a pile of “examples” from largely synthetic data to feed into a classifier, produce an ONNX file7, and try out.  However, I wanted to pause at this juncture to write up what I’ve been working on.  I say synthetic data because I certainly didn’t hand-write the more than 3,000 examples across some 50 different intent slugs.  I wrote the list of slugs, described what each one should be associated with, created a small set of examples, and then asked Gemini to produce reasonable-sounding examples based on this information.8  The list looked pretty good – but needed to be manually edited and tidied up.  I wanted to remove most of the punctuation and adjust the way numbers and statements showed up, because I’m simply not confident that Whisper’s “Add bananas to shopping list” will reliably match a training example like “Add bananas to ‘shopping list’” closely enough for the classifier to interpret it correctly.
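
A sketch of the classifier idea, using sentence-transformers embeddings feeding a scikit-learn classifier – the slugs and examples below are placeholders, not my actual ~50-slug training set, and the ONNX export step is left out:

  from sentence_transformers import SentenceTransformer
  from sklearn.linear_model import LogisticRegression

  examples = [
      ("set a timer for 15 minutes", "timer_set"),
      ("what timers are running", "timer_list"),
      ("add bananas to shopping list", "list_add"),
      ("read me the shopping list", "list_recite"),
  ]
  texts, slugs = zip(*examples)

  encoder = SentenceTransformer("all-MiniLM-L6-v2")
  clf = LogisticRegression(max_iter=1000).fit(encoder.encode(list(texts)), list(slugs))

  # Unlike the small LLM, this can only ever answer with a trained slug.
  query = encoder.encode(["please put milk on the shopping list"])
  print(clf.predict(query)[0])  # should land on "list_add"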

As I tinker with this project… I’m also looking at how I might extend it into further projects.  Not only might it be a great way to help me be more productive, but I might be able to create a really small version that could be put into a companion bot.  A little companion bot with limited space, power, inputs, and ability to emote could be far more lifelike, independent, and non-deterministic in its responses and actions.

Project Jarvis
  1. Building a Jarvis-inspired, voice-activated, LLM-powered virtual assistant


  1. Yet!!
  2. Thanks, Mr. Feynman!
  3. Giving it limited context windows, limited tokens to use, highly restrictive system prompts
  4. Make timer, list timers, make a reminder, add to a list, recite a list, media buttons, etc.
  5. ~1B parameters
  6. Let’s say 65% reliable
  7. Yes!  Just like the voice models!!
  8. I know, more self-reflecting LLM garbage…

Companion Robots and Maker Faire Season!

I’m super excited for Maker Faire Bay Area / Mare Island and Mini Maker Faire Rocklin.1  I’m not just excited to see everything, but also to show the things I’ve been working on for a while now.  It’s also time to pick up all the little dev boards I’ve somehow accumulated and see if I can make anything with them to show off.

  1. Project Boards
    1. Wemos D1 Mini.  A small, insanely cheap (~$3?!) WiFi-enabled dev board2, which has 4MB of onboard flash and can run Arduino.  I think it can also run MicroPython, but I haven’t tested this yet.
    2. Wemos W600 Pico.  An even smaller, even cheaper (~$2 when ordered from China) WiFi-enabled dev board that runs… MicroPython?  I think??  I’m saying “I think” because I haven’t been able to get it to do anything yet.
      1. Since starting this blog post, I found a guide on installing MicroPython on Wemos boards that seems promising.
        1. Flashing MicroPython on an ESP8266
        2. https://github.com/espressif/esptool/tree/master
        3. Arguing with Python to let me use “esptool.py”
          1. esptool -p COM13 -c esp8266 flash_id
      2. As promising as that series of blog posts looked, I eventually scrapped the Wemos because it was just too much of a pain to get going with MicroPython.  I think I could have made it work, but for $7 I could also just use the Adafruit QtPy I already have.  The advantage of simply uploading code over a USB cable into a virtual drive just can’t be overstated.
    3. Other Boards
      1. I have a bad habit of picking up dev boards.  I’ve got several Adafruit QtPys, several Adafruit Trinkets, an Adafruit FX Sound Board, a Raspberry Pi Pico (non-WiFi), various Digispark boards, a small handful of ATtiny85s, and an even weirder assortment of VERY small programmable circuit boards (ISD1806B-COB) designed to go in greeting cards (just 6 seconds of audio), etc.
  2. Companion Robot
    1. Background.
      1. I started this post at least a month ago, when I only had a vague idea of what I wanted to make and even fewer skills.  After seeing my kid’s companion robot take shape, I wanted to get in on the action and make my own.  I decided to make a really small companion robot with just some LEDs, a piezo, and a small microcontroller.  I’d taken a stab at making a companion robot a few years ago, but set it aside for a variety of reasons and never went back.
      2. The idea for this new robot was something a little less ambitious: more use of NeoPixels than in prior projects3, a little more interactivity, a chance to try out some CircuitPython, and… let’s be real… more pizzazz!
    2. Idea:  Friendly Cloud/Vapor/Flame
      1. I still really like the copper-toned PLA I’ve been using, since it has something of a steampunk flair to it.  I settled on repurposing a small plastic enclosure with a clear dome as the “body” for the robot.  I wanted it to look something like a small entrapped / captive / domesticated4 sentient cloud of vapor, or perhaps flame, held within a steampunk enclosure.
      2. As a very small, inexpensive board that could run either Arduino or CircuitPython, I decided on the Adafruit QtPy M0.  It can run NeoPixels, there are lots of cool guides for it and plenty of pinouts, and it could definitely fit within the confines of my enclosure.
    3. Enclosure:
      1. I started the enclosure by trying to design and 3D print a part to mate with the clear plastic dome.  It took a few tries.


      2. Once I had that, I extended the base so it could hold more electronics.  I could definitely have shoehorned everything into the dome, especially if I took up some of the space inside it, but even with an “elevated base” the whole thing was still plenty small and could use a battery pack rather than a rechargeable LiPo.
      3. Once I had a good design for the enclosure, I tried to make it work with an existing 3xAAA battery pack.  In the process I yanked off the connector and ended up soldering the battery pack leads directly into the circuit.
    4. Internal Electronics
      1. I’m just not a great electrical engineer and am still copy/pasting from various guides, tinkering, changing bits of code, swapping out parts, and using “close enough” resistors.  Wiring up some LEDs or a piezo to a project isn’t very difficult – it’s the more fiddly bits that get me.
      2. Piezo Element Speaker
        1. I wanted to use a piezo buzzer/speaker because they’re large and incredibly thin.  They’re not without their downsides, though.  The crystal wafer is also thin and a little fragile.  Without additional electronics, a piezo has the potential to act as a knock sensor and can generate a high-voltage spike which can fry a board.  And, without additional electronics, the piezo just isn’t very loud.  There are some Arduino libraries that basically double the volume of a piezo by connecting it to two pins and running each opposite the other, doubling the voltage swing across the element, but they only work on Arduino chips.5  The two-pin idea is sketched below.
        2. After searching for various ways to increase the sound of the piezo elements, I settled on trying the Adafruit piezo amp.  I bought two – and trying to desolder the terminal blocks completely ruined one.  The other worked great, but for the modest volume gain it was just too big for an already cramped enclosure.
        3. After searching around, I found some amplifier circuits using a small number of common parts.6
        4. Then I tried building an amplifier circuit using an NPN transistor.  After reviewing the datasheets for my NPN (and PNP) transistors and breadboarding the circuit with resistors, I sketched it a few times, laid it down with copper tape, soldered it in place with SMD resistors, then pulled it off, placed it onto a piece of Kapton tape, and put another piece on top – “laminating” it in place.
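
        A sketch of that two-pin trick in CircuitPython rather than the Arduino library – the pins (A1/A2) and the timing-by-sleep approach are just assumptions for illustration:

          import time
          import board
          import digitalio

          pin_a = digitalio.DigitalInOut(board.A1)
          pin_b = digitalio.DigitalInOut(board.A2)
          pin_a.direction = digitalio.Direction.OUTPUT
          pin_b.direction = digitalio.Direction.OUTPUT

          def tone(freq, duration):
              half_period = 1.0 / freq / 2
              for _ in range(int(duration * freq)):
                  pin_a.value, pin_b.value = True, False   # push one way...
                  time.sleep(half_period)
                  pin_a.value, pin_b.value = False, True   # ...then the other
                  time.sleep(half_period)

          tone(440, 0.5)  # A4 for half a second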
      3. Capacitive Touch
        1. Buttons are great and all, but with a capacitive touch pad I could add metallic elements to my robot rather than a much bulkier button.  I bought a few brass upholstery tacks because they looked great – but they just would not accept molten solder.  I ended up cutting the prongs short with wire cutters, wrapping the stub with copper tape, then soldering the wires to the tape.  I also added a little piece of heat-shrink tubing over the connection to help keep it together.  It’s been working well so far (reading the pad is sketched below).
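
        Reading the pad is pleasantly simple in CircuitPython with touchio – the pin here is just an example, any touch-capable pin works:

          import time
          import board
          import touchio

          touch = touchio.TouchIn(board.A0)

          while True:
              if touch.value:      # True while the brass tack is being touched
                  print("touched!")
              time.sleep(0.05)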
      4. LED Animations
        1. As we know from Phillip Burgess’ incredible “power sipping NeoPixel” guide, we can conserve power and increase the impact of the LEDs by reducing the number of LEDs, keeping max brightness around 20% (which still has a disproportionately large impact), lighting fewer LEDs at a time, and even running fewer colors at a time.  Between Phillip’s work, Todbot’s guide, and the specialized QtPy NeoPixel guide by Kattni Rembor, I was able to put together a few neat animations (one such animation is sketched below).
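
        A sketch of the kind of power-sipping animation this enables – the 12-pixel ring on pin A3 is an assumption – with brightness capped at 20% and only two pixels lit at a time:

          import time
          import board
          import neopixel

          pixels = neopixel.NeoPixel(board.A3, 12, brightness=0.2, auto_write=False)

          while True:
              for i in range(len(pixels)):
                  pixels.fill((0, 0, 0))
                  pixels[i] = (255, 100, 0)                    # one warm "ember"...
                  pixels[(i + 1) % len(pixels)] = (40, 10, 0)  # ...with a dim tail
                  pixels.show()
                  time.sleep(0.08)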
      5. Piezo Sounds
        1. I had a heck of a time getting the piezo buzzer to do anything interesting.  Fortunately, my kid helped convert the piano music for “Paint It Black” into tones for me.  I haven’t gotten all the note timings right, but I’m working on it!  (The tone-playing approach is sketched below.)
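
        The tone-playing part is simple enough with simpleio – the note/duration pairs below are just an example melody, not the actual “Paint It Black” transcription:

          import board
          import simpleio

          melody = [
              (659, 0.3),  # E5
              (587, 0.3),  # D5
              (659, 0.3),  # E5
              (698, 0.3),  # F5
              (659, 0.6),  # E5
          ]

          for freq, duration in melody:
              simpleio.tone(board.A1, freq, duration=duration)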
  3. Future Modifications
    1. More Accessible Enclosure.  Right now the “lid” with a hole for the LED ring just sits on the enclosure with a light friction fit.  One idea is a hinged lid, either with a conventional hinge or perhaps a hidden swivel hinge.  The problem with that, of course, is that it requires even more internal space.  Another idea is a ring on top that screws down, holding the top in place.  I’m crap at designing screw threads, so I’ve avoided this.

      Hinged lid for enclosure
    2. Piezo Knocks.  Perhaps the next version will include some kind of tap / double tap / knock sensors using one or more piezo elements.
    3. Knobs.  There’s not a ton of room inside the enclosure, but by including a gear within a gear, I might be able to rotate part of the case and have it manipulate a potentiometer.

      Offset gear within gear, manipulating an off-center internal potentiometer
    4. Motors.  A robot that just flashes lights and makes a few beeps can still be pretty interesting.  However, I have some neat potential features that could be added with just one or two motors.  There are some interesting limitations with the current incarnation of this robot and the QtPy: I’ve only got 10 pinouts7 – 1 for the NeoPixels, 1 for the piezo, and 6 in use for the capacitive touch sensors – leaving 2 for other potential tasks.8  Space is already tight, though, so even one or two micro servos would be a big commitment.  I’ve seen some really tiny micro servos that might work, but I have no idea where to source them.  One silly idea is a “weapons system” using a spring-loaded projectile activated by a very small servo.

      A small spring loaded projectile launcher, actuated by a small servo
    5. Creating a Tone Library.  The basic piezo tones are easy enough to play, but including the entire list of notes and the frequencies associated with them seems to eat up the poor little QtPy’s memory.  I think compressing them into a library might be the way around this issue.
    6. Playing WAV files.  WAV files are bulky, but that’s the only sound file format a QtPy M0 can play.  However, with the extra 2MB from the SPI flash chip installed, this shouldn’t be a huge problem.  I used Audacity to mix the sound clip down to mono, then to a 22 kHz sample rate.  My preliminary tests worked – but playback was incredibly quiet.  I haven’t run it through the audio amplifier yet, but I’m planning to.  (A playback sketch is below.)
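
      A sketch of that playback path, assuming a mono 22 kHz “clip.wav” on the CIRCUITPY drive and audio out on A0 (the SAMD21’s true DAC pin):

        import board
        import audioio
        import audiocore

        with open("clip.wav", "rb") as f:
            wav = audiocore.WaveFile(f)
            audio = audioio.AudioOut(board.A0)
            audio.play(wav)
            while audio.playing:
                pass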
    7. Sleep / Deep Sleep.  Ever since I swapped out the tiny LiPo for a 3xAAA battery pack, I’ve had a lot more battery life, so adding sleep / deep-sleep functions hasn’t been a priority.  However, the inclusion just couldn’t hurt.
  4. Other QtPy and CircuitPython Resources
    1. Adafruit’s QtPy CircuitPython PWM resource
    2. TodBot’s CircuitPython tricks
Companion Robots: Building Robot Friends
  1. Cephalopod Robot Friend, the story so far
  2. Cephalopod Robot Friend Progress
  3. CuttleBot Body and OpenSCAD Design Tips
  4. An Assembled CuttleBot Body
  5. Building the Monocle Top Hat Cat for #MicrobitVirtualConcert
  6. Companion Robots and Maker Faire Season!
  1. I just got a notice they’re no longer a “Mini”!
  2. pinouts for my future reference
  3. LED goggles and a Marvel Universe-inspired set of “Infinity Knuckles”
  4. OMG dome-sticated?!
  5. This is just my very basic understanding of how it works.  I’m entirely positive this is far too simplified.
  6. And one very long article about using a lot of parts
  7. 12 if you want to count the onboard NeoPixel
  8. Or 4…