This week I revisited Week 42 - Real Time Speech to Image and rebuilt it from scratch so it runs locally and much faster.
Tools I used:
Real Time Whisper - to run Whisper locally and fast
ComfyUI - a node based editor to control Stable Diffusion models locally
Stable Diffusion XL Turbo 1.0 - a fast image model
Redis - to pass text between the models
Tkinter - to display fullscreen images using Python
Context
I was planning to write about 3D printing again this week (I’ve been tinkering a lot with my new printer and have a particularly special project in the works), but it turns out there’s a decently steep learning curve to 3D modeling and 3D printing, and I need more time. I was attempting a two-part overnight print, and I woke up the next morning to this:

So instead I decided to switch gears and revisit my project from Week 42

and follow through on my next steps:

Process
I started by ditching TouchDesigner and the APIs, and seeing if I could write everything in Python to run locally on my laptop.
The 3 models I used last time were:
Voice-to-Text
An LLM
Text-to-Image
I decided that for now I didn’t need the LLM and could let the image model take the raw transcribed text as input. That left two models to get running on my machine.
The voice-to-text part I’d already mostly done, but previously it was pretty slow. Luckily, I found a kind soul who’d done the hard work for me with Real Time Whisper - it was very easy to get up and running with a model size of my choice (more on that later).
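Under the hood, the transcription loop looks roughly like this - a minimal sketch assuming the openai-whisper and SpeechRecognition packages (the actual repo handles audio buffering and silence detection more carefully than this):

```python
import speech_recognition as sr
import whisper

# Minimal sketch: capture short chunks from the mic and transcribe each one
model = whisper.load_model("small.en")  # model size trades accuracy for speed
recognizer = sr.Recognizer()

with sr.Microphone(sample_rate=16000) as source:
    recognizer.adjust_for_ambient_noise(source)
    while True:
        # Grab a short chunk of speech, write it out, hand it to Whisper
        audio = recognizer.listen(source, phrase_time_limit=3)
        with open("chunk.wav", "wb") as f:
            f.write(audio.get_wav_data())
        result = model.transcribe("chunk.wav", fp16=False)
        print(result["text"].strip())
```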
Generating images locally was less familiar territory. After a bit of research I discovered there were a few popular Stable Diffusion interfaces, the main two being AUTOMATIC1111 and ComfyUI. Both seemed well supported; I chose the latter because I’d read that it performed better on weaker machines (and I don’t have a beefy GPU). After installing a bunch of packages, I got it up and running with a familiar node-based interface:

The idea is that you can play with all the parameters to get your images exactly how you want. I didn’t actually need many of these features; I just wanted to have it running as a local endpoint so I could generate images programmatically. To do that, I first found this ComfyUI workflow recently uploaded by StabilityAI staff (they have a crazy fast GPU, so they were able to generate images almost instantaneously), and then followed a guide on controlling ComfyUI via API.
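The gist of the API part is that ComfyUI exposes an HTTP endpoint you can POST an exported workflow to. Here’s a minimal sketch, assuming ComfyUI is running on its default local port and the SDXL Turbo workflow has been exported in “API format” - the node id for the prompt is a placeholder and depends on your workflow:

```python
import json
import requests

COMFY_URL = "http://127.0.0.1:8188"  # ComfyUI's default local address

# Workflow exported from ComfyUI in "API format"
with open("workflow_api.json") as f:
    workflow = json.load(f)

def queue_image(prompt_text: str) -> None:
    # "6" is a placeholder node id for the positive prompt text node -
    # the real id depends on the exported workflow
    workflow["6"]["inputs"]["text"] = prompt_text
    requests.post(f"{COMFY_URL}/prompt", json={"prompt": workflow})

queue_image("a foggy harbour at dawn, oil painting")
```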
Cool - so now I could transcribe audio in real time and continuously generate images from text within a few seconds. I just needed a simple way to connect the two and then display the most recent image.
To pass the text from one to the other, I set up a Redis server (with just two lines of code) and published the transcribed text to a channel at regular intervals. Then I could subscribe to that channel from both the image generation script and a terminal, so I could see what was coming through in parallel to the images. Finally, I used Tkinter to display the most recently generated image in a window.
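The Redis part really is tiny - a sketch with redis-py, assuming a local Redis server on the default port:

```python
import redis

# Publisher side (in the transcription script): push each new chunk of text
r = redis.Redis()  # local server, default port 6379
r.publish("transcript", "a foggy harbour at dawn")

# Subscriber side (in the image generation script, or a terminal for debugging)
pubsub = r.pubsub()
pubsub.subscribe("transcript")
for message in pubsub.listen():
    if message["type"] == "message":
        print(message["data"].decode())
```

And the display side is a short Tkinter script that goes fullscreen and swaps in the newest image - a sketch assuming Pillow is installed and the generated images land in ComfyUI’s output folder:

```python
import glob
import os
import tkinter as tk
from PIL import Image, ImageTk  # Pillow

OUTPUT_DIR = "ComfyUI/output"  # wherever ComfyUI writes its images

root = tk.Tk()
root.attributes("-fullscreen", True)
label = tk.Label(root, bg="black")
label.pack(expand=True, fill="both")

def refresh():
    # Find the most recently written image and show it
    files = glob.glob(os.path.join(OUTPUT_DIR, "*.png"))
    if files:
        newest = max(files, key=os.path.getmtime)
        photo = ImageTk.PhotoImage(Image.open(newest))
        label.configure(image=photo)
        label.image = photo  # keep a reference so it isn't garbage collected
    root.after(1000, refresh)  # check again in a second

refresh()
root.mainloop()
```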

Here’s how it all fits together:

Now to test it. I tried a bunch of random gibberish words, but that gets difficult after a while, so I got an LLM to write me a short evocative story, which I read aloud while recording my screen in real time. I had to tweak a bunch of settings to get it fluid (and I’m sure I can tweak it more) - eg the time it allows for recording before transcribing, the amount of silence that counts as a pause, the number of lines of text to concatenate to generate the image, and the size of the Whisper model (a tradeoff between accuracy and speed). I haven’t tweaked the image output yet - eg it currently uses a static seed, so between tests I saw the same image a few times. Lots more to explore!
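For reference, these are the kinds of knobs involved (names and values here are illustrative, not the exact ones I settled on):

```python
# Illustrative settings - the real values came from trial and error
RECORD_TIMEOUT = 2.0        # seconds of audio captured before each transcription pass
PHRASE_TIMEOUT = 1.5        # seconds of silence that counts as the end of a phrase
LINES_TO_CONCAT = 3         # how many recent transcript lines feed one image prompt
WHISPER_MODEL = "small.en"  # bigger models are more accurate but slower
```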
Still, I’m pretty happy with my final result for this week (see the video at the top of the post). With a bit more work, I think it’s ready to trial at a live event, either behind a DJ or a storyteller, or dance, or something else!
Learnings
It’s exciting to build on top of what others have built
I’m at the point where one small upgrade or release inspires me - there are so many possibilities in combining different AI models
Next steps
Optimize it so I’m confident it can run without errors
Use it for live visuals behind a DJ or storytelling event
Perform with it myself
Add in the ability to set persistent or evolving image styles
Add in an LLM to process the text before image gen
Add audio - eg voice narration or sound effects
Add a camera and pose detection so it can be controlled with bodies and movement as well as voice (!)
Build a UI with buttons - eg toggling voice recording on/off, changing image styles