Project Roo 🤖

December 1, 2023

How AI can help more as a voice buddy

Roo is a hobby project I made for myself, and it is a long way from being mature enough for everyone to use.

Current:

Uses off-the-shelf trained neural models for voices – Piper
A Bash script pipes the text typed in the terminal to the model and on to the virtual audio device
State-of-the-art Whisper tiny.en for auto-transcription of others' voices.
Google's Flan-T5 LLM to generate possible auto-responses to others' voices
Personal factual data and repeated greetings are stored as text-expander shortcuts and spoken by the voice model. For example, typing ;intro makes the model speak Boson's pre-defined introductory greeting, and ;bio says what I am currently doing ...
Gives an automatic heads-up by saying “Roo” before delivering the whole speech of what Roo has to say
Repeats the previous message with one key.
Respells letter by letter when a word Roo spoke earlier was confusing
ElevenLabs integration for multilingual speech and as a fallback
Inserts random words or phrases into the speech – pig Latin for voice!
Can play music files
Funny sounds – laugh, cry, blow a raspberry, song lines, dialogues, ...
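
The text-expander and letter-by-letter respelling items above could be sketched roughly like this. It is only an illustration: the shortcut phrases in SNIPPETS and the function names expand and respell are made up for the example and are not taken from Roo's actual scripts.

```python
# Hypothetical shortcut table; the real greetings are not given in this post.
SNIPPETS = {
    ";intro": "Hi, I am Boson, and this is Roo speaking for me.",
    ";bio": "I am currently working on a hobby voice project.",
}


def expand(line, snippets=SNIPPETS):
    """Replace every whitespace-separated token that is a known shortcut.

    Note: runs of whitespace are collapsed to single spaces.
    """
    return " ".join(snippets.get(token, token) for token in line.split())


def respell(word):
    """Spell a word out letter by letter so the voice model reads it slowly."""
    return ", ".join(word.upper())


if __name__ == "__main__":
    print(expand("hello ;bio"))
    print(respell("Roo"))
```

The expanded string would then go through the same pipe to the voice model as any other typed text.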

Migrate to Coqui TTS v2 or LLVC
Integrate a dictionary for pronunciation correction of tricky words [maintain a cache of audio]
Combine all Piper, Whisper, and Flan-T5 in a single pipeline.
Auto-pause and repeat the voice when interrupted [auto-detected from the headphone output] – a kind of TCP backoff
Keep single instances of models pinned in RAM for inference – How? :cries:
Automatic typo correction in the typed text; a personal dictionary that constantly updates
Auto word-completion options while typing – basic or tiny LLM 🤔
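
On keeping a single model instance pinned in RAM, one possible direction is a long-running worker that loads the model once and then answers every request with that same instance, instead of starting a new process per message. The sketch below is an assumption, not how Roo works today: serve and load_model are hypothetical names, and the loader is a stand-in for whatever actually loads Piper, Whisper or Flan-T5.

```python
from typing import Callable, Iterable, Iterator


def serve(load_model: Callable[[], Callable[[str], str]],
          requests: Iterable[str]) -> Iterator[str]:
    """Load the model once, then reuse it for every incoming request."""
    model = load_model()  # the expensive step happens exactly once
    for text in requests:
        text = text.strip()
        if text:  # skip blank lines
            yield model(text)


if __name__ == "__main__":
    # Stand-in loader for illustration; a real one would load a neural model.
    def load_model():
        return lambda text: text.upper()

    for reply in serve(load_model, ["hello\n", "\n", "roo\n"]):
        print(reply)
```

In a real setup, requests could be sys.stdin or a named pipe that the Bash script writes to, so the process stays alive between messages.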

Sentence predictions with an optimised LLM fine-tuned on Boson's previous text patterns – Flan-T5 or dlite
Add integration for Indian languages
Auto Google Translate to speak directly in other languages while I chat in English
Auto translation of other languages in VCs back to English for text on my screen
Indian models from Google API again
Occasionally insert a joke, meme, or pun into the conversation.

Integrate API calls to wiki and knowledge-DB APIs to fact-check typed messages – Toolformer or GPT assistants
Add pose detection from the camera and rig the motions onto a 3D avatar of myself
Diarize transcription to make replies personalised from my past chats with each user
Train on my voice
Ask the LLM something within the same server VC with a trigger word, e.g. "hey roo, who solved Fermat's last theorem?"
Prank others by switching voices between the LLM voice replies of Roo, Boo and mine.
Clone others' voices in real time, with consent.
AI understands video and screen-shared content while making replies
Crossword buddy as it types automatically on the website, given all the above capabilities, while everyone talks and enjoys.
sell it and make millions. :rubs-hands:
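
For the trigger-word idea in the list above, a minimal sketch of picking the question out of a transcribed line might look like this. The function name extract_question and the exact pattern are assumptions for illustration; it only handles a "hey roo" prefix at the start of the transcript.

```python
import re
from typing import Optional

# Matches "hey roo" at the start of the line, plus any trailing punctuation.
TRIGGER = re.compile(r"^\s*hey\s+roo\b[\s,:!.-]*", re.IGNORECASE)


def extract_question(transcript: str) -> Optional[str]:
    """Return the text after the trigger phrase, or None if there is none."""
    match = TRIGGER.match(transcript)
    if match is None:
        return None
    question = transcript[match.end():].strip()
    return question or None


if __name__ == "__main__":
    print(extract_question("hey roo, who solved Fermat's last theorem?"))
```

The returned question would then be passed to the LLM, and the reply spoken back through the voice model.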