Building a Jarvis-inspired, voice-activated, LLM-powered virtual assistant

Just another day at the office

I’d like my computer to be smarter, more interactive, and able to handle boring stuff for me, and I’d also like to play around with some LLM / AI stuff… which brings me to this project.  I’ve got a ton of basic things I’d love for it to do – manage lists, reminders, some Outlook functions, some media functions – and then also be able to interact with me, all via voice commands.  Yes, you can do this with ChatGPT and probably others – but I am loath to provide any outside resource with more of “me” (DNA, biometrics, voice, ambient noises, etc.) than absolutely necessary.  Plus, I’ve been tinkering with these little LLM’s for a while now and want to see just what I can build out of them and with their assistance.

I’m not great at Python1, so I admittedly enlisted the help of some very large LLM’s.  I started the main project in conjunction with ChatGPT, used Gemini to answer basic questions about Python syntax and the like, and turned to Claude for random things.  The reason for keeping my general questions in Gemini rather than ChatGPT was so that I wouldn’t “pollute” the ChatGPT flow of discussions with irrelevant sidetracks.  That was also the reason for separating out the Claude discussions.  I find Claude reasonably helpful for coding tasks, but the usage limits are too restrictive.

My kiddo asked me how much of the code was written by these models versus my own code.  I’d say the raw code was mostly written by LLM’s – but I’m able to tinker, debug, and… above all learn.  I’d rather be the one writing the code from scratch, but I’m treating these LLM’s like water wings.  I know I’m not keeping myself fully afloat – but I’m actually the one treading water, putting it all together, and learning how to do it myself.  Also… said kiddo was interested in building one too – so I’m helping teach someone else manually, and learning more that way.2

Ingredients

As with many of my projects, I started by testing the individual pieces to see if I could get things working.  In order, I validated each piece of the process:

  • Could I get Python to record audio?
  • Could I get Python to transcribe that audio?
  • Could I get Python to use an API to run queries in LM Studio?
    • Yep!  Using the openai Python package, I could send queries to LM Studio’s local server once an LLM had been loaded into memory (see the sketch after this list)
  • Could I get Python to get my computer to respond to a “wakeword”?
    • Yep!  There’s another Python module for “wakewords” using PocketSphinx.  This was an interesting romp.  I found that I had to really tinker with the audio data being sent to the wakeword detector for the wakeword to be properly recognized, and then fiddle with the timing to make sure whatever came after the wakeword was captured before being sent to the LLM.  Otherwise, “Jarvis, set a timer for 15 minutes” would become “Jarvis, for 15 minutes” – the “Jarvis” would get picked up by the wakeword detector, but the rest wasn’t caught in time to be processed by Whisper.
  • Can I get Python to verbally recite statements out loud?
    • Yep!  I used text-to-speech via Piper.  However, this took a while to get working.  One thing I learned was that you need not just the voice model’s *.ONNX file, but also the *.JSON file associated with it.
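
For reference, here’s a minimal sketch of that LM Studio step – assuming LM Studio’s local server is running on its default port (1234) with a model already loaded; the model name and prompt are placeholders:

  # Minimal sketch: querying LM Studio's OpenAI-compatible local server.
  # Assumes the server is at its default http://localhost:1234/v1 and a model
  # is already loaded; the model name below is a placeholder.
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

  response = client.chat.completions.create(
      model="local-model",  # LM Studio answers with whatever model is loaded
      messages=[{"role": "user", "content": "Set a timer for 15 minutes."}],
  )
  print(response.choices[0].message.content)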

Up until this point, I had been running LLM’s with the training wheels of LM Studio’s API.  I really like the LM Studio program, but I don’t want to be dependent upon their service when I’m trying to roll my own LLM interface.  Python can run LLM’s directly using “llama-cpp-python” – except that it threw errors on the version of Python I was running (3.14) and was known to work with an earlier version (3.11).

This led me to learning about running “virtual environments” within Python, so that I can keep both versions of Python on my computer but run my code within a specific container tied to the version I need.  The first command below created the virtual environment within my project folder; the second command “activates” that virtual environment.

  • py -3.11 -m venv .venv
    • This created the virtual environment (in a folder named .venv), locked to Python 3.11
  • .venv\Scripts\activate
    • This activates the virtual environment, so I can start working inside it
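
With the 3.11 environment active (and llama-cpp-python installed inside it), calling a downloaded model directly looks roughly like this – a sketch only, with the GGUF path and prompt as placeholders:

  # Minimal sketch: running a local GGUF model directly with llama-cpp-python,
  # no LM Studio required.  The model path below is a placeholder.
  from llama_cpp import Llama

  llm = Llama(model_path="models/my-model.gguf", n_ctx=2048, verbose=False)

  reply = llm.create_chat_completion(
      messages=[{"role": "user", "content": "Explain virtual environments in one sentence."}],
      max_tokens=128,
  )
  print(reply["choices"][0]["message"]["content"])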

Back to work!

The man’s got a job to do

Building a Pipeline

This is where things really seemed to take off.  I was able to disconnect my script from LM Studio and use Python to directly call the LLM’s I’ve downloaded.  This was reasonably straightforward – and I was suddenly able to go from: wakeword -> Whisper-transcribed LLM query -> LLM response -> Piper-recited reply.  Then, it was reasonably easy to have the script listen for certain words and perform certain actions (setting timers was the first such instance).
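
A stripped-down sketch of what that pipeline looked like at this stage – assuming pocketsphinx for the wakeword, sounddevice for recording, openai-whisper for transcription, llama-cpp-python for the model, and the piper CLI for speech; all the paths, thresholds, and durations are placeholders that needed tuning:

  # Rough sketch of the wakeword -> Whisper -> LLM -> Piper pipeline.
  # Model paths, the wakeword threshold, and the record duration are placeholders.
  import subprocess
  import sounddevice as sd
  import whisper
  from llama_cpp import Llama
  from pocketsphinx import LiveSpeech

  SAMPLE_RATE = 16000        # Whisper expects 16 kHz mono audio
  RECORD_SECONDS = 5         # how long to listen after the wakeword fires

  stt = whisper.load_model("base")
  llm = Llama(model_path="models/my-model.gguf", n_ctx=2048, verbose=False)

  def record_command():
      """Record a few seconds of audio right after the wakeword."""
      audio = sd.rec(int(RECORD_SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                     channels=1, dtype="float32")
      sd.wait()
      return audio.flatten()

  def speak(text):
      """Hand the reply text to the piper CLI, which writes a wav to play back."""
      subprocess.run(["piper", "--model", "voices/my-voice.onnx",
                      "--output_file", "reply.wav"], input=text.encode())
      # playback of reply.wav omitted here for brevity

  # LiveSpeech blocks and yields each time the keyphrase is heard;
  # the threshold controls how eagerly it triggers.
  for _ in LiveSpeech(lm=False, keyphrase="jarvis", kws_threshold=1e-20):
      command = stt.transcribe(record_command())["text"]
      result = llm.create_chat_completion(
          messages=[{"role": "user", "content": command}], max_tokens=256)
      speak(result["choices"][0]["message"]["content"])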

Optimizations, Problems, Solutions

Complicating factors

Building something that kind of worked brought me to new and interesting ideas, challenges, and problems:

  • The original cobbled-together process was something like:  record audio, transcribe it through Whisper, delete the recording, pass the transcribed statement to the LLM, give the LLM’s reply to Piper, generate a new recording, play that recording.  However, this process has some obvious “slop” where I’m making and deleting two temporary audio files.  The solution was to find ways to feed the recording directly into Whisper and feed Piper’s output directly to the speakers, cutting out the two audio files (see the sketch after this list).
  • I realized that I wanted the script to do more than just shove everything I have to say / ask into an LLM – to be really useful, the script would have to be more than a verbal interface for a basic LLM.  This is where I started bolting on a few other things – like calling a very small LLM to parse the initial request into one of:
    1. Something that can be easily accomplished by a Python script (such as setting a timer)
    2. Something that needed to be handled by a larger LLM (summarize, translate, explain)
    3. Something that maybe a small model could address easily (provide simple answer to a simple question)
  • I ran into some problems at this point.  I spent a lot of time trying to constrain a small LLM3 to figure out what the user wanted and assign labels/tasks accordingly.  After a lot of fiddling, it turns out that an LLM is a “generative” model and it wants to “make” something.  Forcing it to choose among only a dozen “words”4 kept running into problems: it would have trouble choosing between two options, choose inconsistently, and sometimes just make up new keywords.  Now, I could write a simple Python script that did basic word-matching to sort the incoming phrases – but it seemed entirely counterproductive to build a Python word-matching process to help a tiny AI.  I then tried building a small “decision tree” of multiple small LLM calls to properly sort between “easy Python script call” and “better call a bigger LLM to help understand what this guy is talking about” – and quickly stopped.  Again, building a gigantic decision tree out of little LLM calls was proving to be a bigger task, adding latency and error with each call.  I was hoping to use a small LLM to make the voice interaction with the computer simple and seamless and then pass bigger tasks to a larger LLM for handling, sprinkling in little verbal acknowledgements and pauses to help everything feel more natural.  Instead, I was spending too much time building ways to make a small LLM stupider, doing it repeatedly, and still ending up with too much slop.
  • And, frankly, it felt weird to try and lobotomize a small LLM into doing something as simple as “does the user’s request best fall into one of 12 categories?”  Yes, small LLM’s can easily start to hallucinate, they can lose track of a conversation, make mistakes, etc.  But, to constrain one so tightly that I’m telling it that it may only reply with one of 12 words feels… odd?
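
On the “slop” from the first bullet: the fix is basically to keep the audio in memory end to end.  A sketch of the idea, assuming sounddevice, numpy, openai-whisper, and the piper CLI’s raw-output mode (the voice path and Piper’s 22050 Hz output rate are assumptions):

  # Sketch: no temporary audio files.  Recorded audio goes straight to Whisper
  # as a numpy array, and Piper's raw PCM output goes straight to the speakers.
  import subprocess
  import numpy as np
  import sounddevice as sd
  import whisper

  stt = whisper.load_model("base")

  def transcribe_live(seconds=5, rate=16000):
      audio = sd.rec(int(seconds * rate), samplerate=rate, channels=1, dtype="float32")
      sd.wait()
      return stt.transcribe(audio.flatten())["text"]   # no wav file written

  def speak_direct(text, voice="voices/my-voice.onnx", rate=22050):
      # --output-raw streams 16-bit PCM to stdout instead of writing a wav file
      proc = subprocess.run(["piper", "--model", voice, "--output-raw"],
                            input=text.encode(), capture_output=True)
      pcm = np.frombuffer(proc.stdout, dtype=np.int16)
      sd.play(pcm, samplerate=rate)
      sd.wait()
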
Tell me what I want to hear and this can all stop

Over the last few days I’ve been tinkering with building an “intent classifier” or “intent encoder” to do the kind of automatic sorting I was trying to force an LLM to do.  As I understand the process, you feed the classifier a bunch of example statements that have been pre-sorted into different “intent slugs.”  The benefit of a classifier is that it can only reply with one of these intent slugs and will never produce anything else.  It’s also way faster.  Calling a small5 LLM with a sorting question could produce a sometimes-reliable6 answer in about 0.2 seconds, which is almost unnoticeable.  Calling a classifier to sort should enable a ~97% reliable result within 0.05 seconds, which is so fast it’s imperceptible.

I haven’t tried this yet.  I’ve built up a pile of “examples” from largely synthetic data to feed into a classifier, produce an ONNX file7, and try out.  However, I wanted to pause at this juncture to write up what I’ve been working on.  I say synthetic data because I didn’t hand-write the more than 3,000 examples across some 50 different intent slugs.  I wrote a list of slugs, described what each one should be associated with, created a small set of examples, and then asked Gemini to produce reasonable-sounding examples based on this information.8  The list looked pretty good – but it needed to be manually edited and tidied up.  I wanted to remove most of the punctuation and adjust the way numbers and statements showed up, because I’m simply not confident that Whisper will transcribe “Add bananas to shopping list” consistently enough (versus, say, “Add bananas to ‘shopping list’”) for the classifier to interpret it correctly.
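
For the record, here’s roughly the shape of the thing I’m planning to try – a sketch only, using scikit-learn’s TF-IDF + logistic regression as the classifier.  The slugs and example phrases are illustrative placeholders, not my real dataset, and the eventual ONNX export isn’t shown:

  # Sketch of an intent classifier: maps a transcribed phrase to exactly one
  # "intent slug" from a fixed list, with a confidence score for each guess.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  examples = [
      ("set a timer for 15 minutes", "timer_set"),
      ("start a ten minute timer", "timer_set"),
      ("add bananas to the shopping list", "list_add"),
      ("put milk on the grocery list", "list_add"),
      ("what is on my shopping list", "list_recite"),
      ("pause the music", "media_pause"),
  ]
  texts, labels = zip(*examples)

  clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
  clf.fit(texts, labels)

  phrase = "add eggs to the shopping list"
  probs = clf.predict_proba([phrase])[0]
  best = clf.classes_[probs.argmax()]
  # If the classifier isn't confident, hand the phrase off to the bigger LLM.
  if probs.max() < 0.6:
      print("unsure - pass this to the larger LLM")
  else:
      print(best)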

As I tinker with this project… I’m also looking at how I might be able to extend it into further projects.  Not only might it be a great way to help me be more productive, but I might be able to create a really small version that could be put into a companion bot.  A little companion bot with limited space, power, inputs, and abilities to emote could be far more lifelike, independent, and non-deterministic in its responses and actions.


  1. Yet!!
  2. Thanks Mr. Feynman!
  3. Giving it limited context windows, limited tokens to use, highly restrictive system prompts
  4. Make timer, list timers, make a reminder, add to a list, recite a list, media buttons, etc
  5. ~1B parameters
  6. Let’s say 65% reliable
  7. Yes!  Just like the voice models!!
  8. I know, more self-reflecting LLM garbage…

Prusa Lack Stack, LED Lighting, CircuitPython Tweaks

Much like those recipes on the internet where the author tells you their life story or inspiration, I’ve got a lot to share before I get to the punchline of this blog post (a bunch of CircuitPython tweaks).  Edit:  On second thought, here’s the punchline up front:

  • Keep the lines of code <250
  • Try using mpy-cross.exe to compress the *.py to a *.mpy file

This is a bit of a winding road, so buckle up.

Admission time – I bought a Prusa1 about three years ago, but never powered it on until about a month ago.  It was just classic analysis paralysis / procrastineering.  I wanted to set up the Prusa Lack enclosure – but most of the parts couldn’t be printed on my MonoPrice Mini Delta, which meant I had to set up the Prusa first and find a place to set it up.  But, I also wanted to install the Pi Zero W upgrade so I could connect to it wirelessly – but there was a Pi shortage and it was hard to find the little headers too.  Plus, that also meant printing a new plate to go over where the Pi Zero was installed, a plate that I could only print on the Prusa, but I didn’t have a place to set it up…

ANYHOW, we’ve since moved, so I set up the Prusa (without the Pi Zero installed yet) and printed a Prusa Lack stack connector to house/organize my printers.  Unlike with the official version, I didn’t have to drill any pilot holes or screw anything into the legs of the Lack tables.

Once the Lack tables were put together, I set about installing some addressable LEDs from Amazon.  I found a strip with the voltage (5V for USB power), density (60 LED’s per meter), and length (5 meters) I wanted at a pretty good price (<$14, shipped).  I did find one LED with a badly soldered SMD component, which caused a problem, but I cut the strip on either side of it and soldered it back together.  Faster and less wasteful than a return, at the cost of a single pixel and a bit of solder.

The Lack stack is three tables tall: extra filament lives under the first table, my trusty Brother laser printer sits on top of the first table, my trusty Monoprice Mini Delta (Roberto) on top of the second table, and the Prusa (as-yet-unnamed Futurama robot reference… Crushinator?) on top.  Since I don’t need to illuminate the laser printer, I didn’t run any LED’s above it.  I did run a bunch of LED’s around the bottom of the third printer…  this is difficult to explain, so I should just show a picture.

When Adafruit launched their QtPy board about four years ago, I picked up several of them.  I found CircuitPython was a million times easier for me to code than Arduino, not least because it meant I didn’t have to compile, upload, then run – I could just hit “save” in Mu and see whether the code worked.  I also started buying their 2MB flash chips to solder onto the backs of the QtPy’s for a ton of extra space.  Whenever I put a QtPy into a project, I would just buy another one (or two) to replace it.  There’s one in my Cloud-E robot and my wife’s octopus robot.  Now, there’s one powering the LED’s in my Lack Stack too.

I soldered headers and the 2MB chip onto one of the QtPy’s, which now basically lives in a breadboard so I can experiment with it before I commit those changes to a final project.  After I got some decent code to animate the 300 or so pixels, I soldered an LED connector directly onto a brand new QtPy and uploaded the code – and it worked!

Or, so I thought.  The code ran – which was good.  But it ran slowly, really slowly – which was bad.  The extra flash memory shouldn’t have impacted the little MCU’s processor or its onboard RAM – just given it more space to store files.  The only other difference I could think of was that the QtPy + SOIC chip required a different bootloader from the stock QtPy bootloader to recognize the chip.  I tried flashing the alternate “Haxpress” bootloader to the new QtPy, but that didn’t help either.  Having exhausted my limited abilities, I turned to the Adafruit Discord.

I’ll save you from my blind thrashing about and cut to the chase:

  • Two very kind people, Neradoc and anecdata, figured out that the unmodified QtPy was running slower because the QtPy + 2MB chip running Haxpress “puts the CIRCUITPY drive onto the flash chip, freeing a lot of space in the internal flash to put more things.”
    • This bit of code shows how to test how quickly the QtPy was able to update the LED strip.
      • from supervisor import ticks_ms
      • t0 = ticks_ms()
      • pixels.fill(0xFF0000)
      • t1 = ticks_ms()
      • print(t1 - t0, "ms")
    • It turns out the stock QtPy needed 192ms to update 300 LED’s.  This doesn’t seem like a lot, until you realize that’s about 1/5th of a second, or roughly 5 frames per second.  For animation to appear fluid, you need at least 24 frames per second.  If you watched a cartoon at 5 frames per second, it would look incredibly choppy.
    • The Haxpress QtPy with the 2MB chip could update 300 LED’s in just 2ms, or 500 frames per second.  This was more than enough for an incredibly fluid-looking animation.
    • Solution 1:  Just solder in my last 2MB chip.  Adafruit has been out of these chips for several months now.  My guess is they’re going to come out with a new version of the QtPy which has a lot more space on board.
      • Even so, I’ve got several QtPy’s and they could all use the speed/space boost.  I’m not great at reading/interpreting a component’s data sheet, but using the one on Adafruit, it looks like these on Digikey would be a good match.
  • The second item was that I kept running into a “memory allocation” error while writing animations for these LED’s.  This seemed pretty strange, since adding a single very innocuous line of code could send the QtPy into “memory allocation” errors.
    • Then I remembered that there’s a limit of about 250 lines of code.  Just removing vestigial code and removing some comments helped tremendously.
    • The next thing I could do would be to compress some of the animations from Python (*.py) code into *.mpy files, which use less memory.  I found a copy of the necessary compiler program on my computer (mpy-cross.exe), but it appeared to be out of date.  I didn’t save the location where I found the file, so I had to search for it all over again.  Only after giving up and moving on to search for “how many lines of code for circuitpython on a microcontroller” did I find the location again by accident.  Adafruit, of course.  :)
    • I’m pretty confident I will need to find the link to the latest mpy-cross.exe again in the future.  On that day, when I google for a solution I’ve already solved, I hope this post is the first result.  :)
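
For future-me, the workflow itself is short.  This is a sketch assuming a hypothetical animations.py module that code.py imports; the mpy-cross step runs on the PC, not on the board:

  # On the PC (in a terminal), compile the module to bytecode:
  #     mpy-cross.exe animations.py
  # That produces animations.mpy; copy it to the CIRCUITPY drive and delete the
  # old animations.py there so the smaller .mpy is the one that gets loaded.
  # On the QtPy, code.py imports it exactly as before -- CircuitPython treats
  # .mpy and .py modules the same way ("animations" is a hypothetical module name):
  import animations

  while True:
      animations.animate()   # hypothetical entry point in the animations module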

The animations for the Lack table are coming along.  I’ve got a nice “pulse” going, a rainbow pattern, color chases, color wipes, and a “matrix rain” / sparkle effect that mostly works.
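
The skeleton of that code is basically the standard adafruit_led_animation pattern – a sketch, with the pin, brightness, and timing values as assumptions rather than my exact setup:

  # Sketch of the Lack-stack lighting using the adafruit_led_animation library.
  # The data pin (A3), brightness, and timings are placeholders.
  import board
  import neopixel
  from adafruit_led_animation.animation.pulse import Pulse
  from adafruit_led_animation.animation.rainbow import Rainbow
  from adafruit_led_animation.animation.chase import Chase
  from adafruit_led_animation.animation.colorwipe import ColorWipe
  from adafruit_led_animation.sequence import AnimationSequence
  from adafruit_led_animation.color import RED, PURPLE, AMBER

  pixels = neopixel.NeoPixel(board.A3, 300, brightness=0.2, auto_write=False)

  animations = AnimationSequence(
      Pulse(pixels, speed=0.05, color=AMBER, period=3),
      Rainbow(pixels, speed=0.02, period=10),
      Chase(pixels, speed=0.05, color=PURPLE, size=3, spacing=6),
      ColorWipe(pixels, speed=0.02, color=RED),
      advance_interval=15,   # seconds each animation runs before the next one
      auto_clear=True,
  )

  while True:
      animations.animate()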


By the time I finally hit publish, roughly 7 months had passed since I started this blog post2.  After all that fuss, I ended up switching from CircuitPython (which I find easy to read, write, maintain, and update) to Arduino because it was able to hold more code and run more animations.  Besides the pulse animations, rainbow patterns, color chases, color wipes, and matrix rain, it’s also got a halo animation and some Nyan Cat-inspired chases, and it plays the animations at a lower brightness for 12 hours a day (which is intended to be less harsh at night).  I could probably add a light sensor, but I don’t really want to take everything apart to add one component.

  1. The i3 MK3S+!
  2. January 7, 2025

Python Practice with an LLM

I’ve been tinkering with Python more recently.  Whether it’s on an MCU1 or a PC, it’s such a nice experience being able to write some code, run it without having to compile, see what happens, and adjust as necessary.  Since I’m a newb at this, I’m getting help from… *shudder* LLM’s.2  In the past I’d turn to Googling, looking at reliable and friendly forums such as Adafruit and Arduino, but I’d invariably need to check out Stack Overflow as well.3

As you might imagine, Stack Overflow was something of a victim of its own success.  Its content was good enough to train the LLM’s of the world – and those LLM’s can parrot back all the insights gleaned from Stack Overflow without the caustic, haughty, condescending replies typical of the comment sections on Stack Overflow / SlashDot / HackADay.  Thus, it’s no wonder the following graphic was circulating on Reddit:

Stack Overflow vs Time

Where was I?  Oh, yeah…  I was using some LLM’s to help with Python.  I don’t have any fancy GPU’s, BitCoin mining rigs, etc., so I’m just using my non-gaming PC’s modest 16 GB of RAM to run the smaller local LLM’s.  I can run models up to about 8B parameters, like the various Llama flavors, at 8-bit quantization with reasonable speed.  I’ve found that, on my system, Qwen3 4B is fast, thoughtful, and helpful.

I’ve realized this blog post is woefully low on actual Python-related content.  Here are some things for future-me to remember:

  • pip list
    • Will give me the names (and versions) of all installed packages
  • pip install requests Pillow reportlab PyPDF2
    • Will install multiple packages, one after another
  1. Microcontroller unit
  2. Large language models such as
  3. I bought their April Fool’s joke keyboard turned real product and once I’d remapped the keys, got significant use out of it for a long time.  Between the construction, packaging, and accessories, at $30 this is still a total no-brainer if you need a small extra keyboard dedicated to some specific tasks.