Project Roo
How AI can help more as a great voice buddy
Roo is a hobby project I made for myself, and it is a long way from being mature enough for everyone to use.
Current:
- Uses off-the-shelf trained neural models for voices: Piper
- bash script to pipe the text on the terminal to the model and the virtual audio device
- State-of-the-art Whisper tiny.en for auto-transcription of others' voices.
- Google's Flan-T5 LLM to generate possible auto-responses to others' voices
- Personal factual data and repeated greetings are handled by text expanders and spoken by the voice model. For example, typing ;intro will make the model speak Boson's pre-defined introductory greetings, and ;bio will say what I am doing currently ...
- Gives an automatic heads-up by saying "Roo" before sending a whole speech of what Roo has to say
- Repeats the previous message with one key.
- Respells letter by letter when a word Roo spoke earlier was confusing
- ElevenLabs integration for multilingual speech and as a fallback
- inserts random words or phrases into the speech: pig Latin for voice!
- can play music files
- funny sounds: laugh, cry, blow a raspberry, song lines, dialogues, ...
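The ;intro and ;bio text expanders and the letter-by-letter respelling from the list above could be sketched roughly as follows. This is only an illustration, not how Roo is actually wired up: the snippet texts are placeholders, and the names `SNIPPETS`, `expand`, and `spell_out` are made up for this example.

```python
import re

# Placeholder snippets; the real greetings and bio text are not given in these notes.
SNIPPETS = {
    ";intro": "Placeholder introductory greeting.",
    ";bio": "Placeholder line about what I am doing currently.",
}

# A trigger is a semicolon followed directly by word characters, e.g. ";intro".
TRIGGER = re.compile(r";\w+")


def expand(text, snippets=SNIPPETS):
    """Replace each known ;trigger with its snippet; unknown triggers stay as typed."""
    return TRIGGER.sub(lambda m: snippets.get(m.group(0), m.group(0)), text)


def spell_out(word):
    """Spell a word letter by letter, for repeating a word that was confusing."""
    return ", ".join(ch.upper() for ch in word if ch.isalnum())
```

The expanded text would then be piped to the voice model exactly like any other typed line.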
Immediate:
- migrate to Coqui TTS v2 or LLVC
- Integrate a dictionary for pronunciation correction of tricky words [maintain a cache of audio]
- Combine Piper, Whisper, and Flan-T5 into a single pipeline.
- Auto-pause and repeat the voice at interruptions [auto-detect on headphone output], a kind of TCP backoff
- Keep single instances of models pinned in RAM for inference. How? :cries:
- auto typo correction in the typed text; personal dictionary that constantly updates
- auto word completion options while typing: basic or tiny LLM
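One possible answer to the "pinned in RAM" question above is to run a single long-lived process and cache each loaded model inside it. The sketch below shows only the caching pattern: `load_model` is a hypothetical stand-in that returns a placeholder object, not a real Piper, Whisper, or Flan-T5 loader.

```python
from functools import lru_cache


def load_model(name):
    """Hypothetical stand-in for an expensive model load; counts how often it runs."""
    load_model.calls += 1
    return {"name": name}  # placeholder object, not a real model


load_model.calls = 0


@lru_cache(maxsize=None)
def get_model(name):
    """Load each named model once; later calls return the same in-memory object."""
    return load_model(name)
```

This only helps if one process stays alive and serves every request, for example a small daemon that the bash script sends text to. A script that starts a fresh process per utterance would still reload the model each time.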
Next:
- sentence predictions with an optimised LLM fine-tuned on Boson's previous text patterns: Flan-T5 or dlite
- Add integration for Indian languages
- Auto-translate with Google Translate to speak directly in other languages while I chat in English
- Auto-translate other languages in VCs back to English as text on my screen
- Indian-language models, again from the Google API
- Occasionally insert a joke, meme, or pun into the conversation.
Longterm:
- integrate API calls to wiki and knowledge DB APIs to fact-check typed messages: Toolformer or GPT assistants
- Add pose detection from the camera and rig motions onto a 3D avatar of myself
- Diarize transcriptions so that replies are personalised using my past chats with each user
- Train on my voice
- Ask the LLM something within the same server VC with a trigger word, e.g. "hey roo, who solved Fermat's last theorem?"
- prank others by switching voices between the LLM voice replies of Roo, Boo, and mine.
- clone others in real time, with consent.
- AI understands video and screen-shared content while making replies
- Crossword buddy that types automatically on the website, using all the above capabilities, while everyone talks and enjoys themselves.
- sell it and make millions. :rubs-hands:
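The trigger-word idea in the list above ("hey roo, ...") could start from something as simple as the sketch below, which pulls the question out of a transcribed line. It assumes the transcript arrives as plain text and that "hey roo" is the trigger phrase; the names `TRIGGER` and `extract_question` are made up for this example.

```python
import re

# Assumed trigger phrase, taken from the "hey roo" example; matched case-insensitively.
# Trailing whitespace and simple punctuation after the phrase are skipped.
TRIGGER = re.compile(r"\bhey\s+roo\b[\s,:!.-]*", re.IGNORECASE)


def extract_question(transcript):
    """Return the text after the trigger phrase, or None if there is no trigger or nothing follows it."""
    match = TRIGGER.search(transcript)
    if match is None:
        return None
    question = transcript[match.end():].strip()
    return question or None
```

Whatever `extract_question` returns would be passed to the LLM, and the reply spoken back through the voice model.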