TaSTT (modular avatar)

8 ratings

This is an OSC speech-to-text chatbox that you can attach to your avatar with Modular Avatar.

You will need the corresponding free OSC app (download). This app transcribes your voice locally using an optimized version of OpenAI's Whisper model, then sends the transcript to VRChat over OSC.
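To make "sends the transcript to VRChat over OSC" concrete, here is a minimal sketch of how a single avatar parameter could be encoded and sent using only the Python standard library. It is not the app's actual code. The function names and the parameter name in the usage note are hypothetical. The sketch relies only on the OSC 1.0 wire format and on VRChat's default OSC input (UDP port 9000, addresses under /avatar/parameters/).

```python
import socket
import struct


def osc_pad(raw: bytes) -> bytes:
    """Null-terminate and pad to a multiple of 4 bytes, as OSC 1.0 requires."""
    padded_len = (len(raw) // 4 + 1) * 4
    return raw.ljust(padded_len, b"\x00")


def osc_int_message(address: str, value: int) -> bytes:
    """Encode an OSC message that carries one 32-bit big-endian integer."""
    return osc_pad(address.encode("ascii")) + osc_pad(b",i") + struct.pack(">i", value)


def send_parameter(name: str, value: int, host: str = "127.0.0.1", port: int = 9000) -> None:
    """Send one integer avatar parameter to VRChat's default OSC input port."""
    packet = osc_int_message(f"/avatar/parameters/{name}", value)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
```

For example, send_parameter("Hypothetical_Letter_00", 72) would send the character code for "H" to a parameter of that (made-up) name. The real parameter names are defined by the prefab and are not listed here.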

Features:

Extremely low avatar performance impact - 1 material slot, 12 triangles, and as little as 300 KB of texture memory (default is 5.33 MB).

Readable in mirrors. The shader detects when it's in a mirror and flips the text.
World space lock: chatbox locks in world space when spawned (default) or when done speaking.
Adjustable size via radial control in game.
Phonemes: chatbox makes noise when you speak (optional, off by default).
Text filters: lowercase, uppercase, uwu, profanity.

Decent accuracy and latency. Cloud-based solutions like TTS Voice Wizard are better, but you have to pay to use them. Expect 1-5 seconds of latency when using this in game.
Fast transcription in 100 languages (speak language X, get transcript in language X). This is a native Whisper feature.
Slow translation into 200 languages (speak language X, get transcript in language Y). This is done by feeding Whisper's transcripts into Meta's NLLB translation model.
Free as in beer.
Open source under the MIT license (GitHub).

Requirements:

PCVR (no Quest compatibility)
Modular avatar
NVIDIA GPU
- Tested on 1000-4000 series GPUs.
2.2GB of disk space
- 900MB: cuDNN
- 470MB: transcription models
- 430MB: Python environment
- 320MB: Git (required to acquire Python dependencies at runtime)

2GB of memory
2GB of VRAM

Avatar performance:

Material slots: 1
Polygons: 12
Sync parameters: 108 bits
- 80 bits: used to send 10 characters at a time (8 bits each)
- 8 bits: active region selection
- 8 bits: chatbox scale (optional)
- 6 bits: in-game phonemes (optional)
- 1 bit: ellipsis animation (optional)
- 1 bit: enable/disable character animations (required for robustness)
- 1 bit: instantly clear board
- 1 bit: toggle on/off
- 1 bit: lock board in world space
- 1 bit: dummy parameter (tech debt; required now but will be removed in a later release)
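The sync budget above can be checked with a short sketch, which also shows how a transcript could be split into 10-character updates, each tagged with an 8-bit region index. This is an illustration under stated assumptions, not the app's implementation. The dictionary keys, the function name, the Latin-1 character encoding and the space padding are all hypothetical choices made for the example.

```python
# Bit counts copied from the listing; the key names are illustrative only.
SYNC_BITS = {
    "characters": 80,      # 10 characters x 8 bits
    "region_select": 8,
    "scale": 8,
    "phonemes": 6,
    "ellipsis": 1,
    "char_animations": 1,
    "clear_board": 1,
    "toggle": 1,
    "world_lock": 1,
    "dummy": 1,
}

CHARS_PER_UPDATE = 10  # 80 bits / 8 bits per character
MAX_REGIONS = 2 ** 8   # an 8-bit selector can address at most 256 regions


def paginate(text: str) -> list[tuple[int, list[int]]]:
    """Split text into (region index, ten 8-bit character codes) updates."""
    # Assumption: one byte per character via Latin-1; unencodable characters become "?".
    data = text.encode("latin-1", errors="replace")
    updates = []
    for region, start in enumerate(range(0, len(data), CHARS_PER_UPDATE)):
        if region >= MAX_REGIONS:
            break  # more text than the selector can address
        chunk = data[start:start + CHARS_PER_UPDATE].ljust(CHARS_PER_UPDATE, b" ")
        updates.append((region, list(chunk)))
    return updates
```

Each tuple corresponds to one synced update: the region index selects which part of the board to write, and the ten codes fill it.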

Texture memory: 5.33 MB
- May be lowered by reducing the font texture's resolution. The default is 2k; I recommend going as low as 512x512.
Audio sources: 5
- Optional. Delete them if you don't want them; it won't break anything.
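The texture memory figures can be roughly reproduced with a little arithmetic. The sketch below assumes a square texture stored at 1 byte per texel with a full mipmap chain. That storage format is an inference from the numbers, not something the listing states, and the function name is hypothetical.

```python
def texture_bytes(side: int, bytes_per_texel: int = 1, mipmaps: bool = True) -> int:
    """Bytes used by a square texture, optionally including every mip level down to 1x1."""
    total = 0
    while True:
        total += side * side * bytes_per_texel
        if not mipmaps or side == 1:
            break
        side //= 2  # each mip level halves the side length
    return total


MIB = 2 ** 20
default_mb = texture_bytes(2048) / MIB   # about 5.33, matching the listed default
reduced_kb = texture_bytes(512) / 1024   # about 341, near the "as little as 300 KB" figure
```

Under these assumptions, halving the resolution cuts memory to a quarter, which is why dropping from 2k to 512x512 saves about 94%.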

Strawman FAQ:

Why does this exist? To provide a baseline STT service for everyone.
Why is it free? Because it can be, and because speech shouldn't be paywalled.
Why should I use it instead of cloud options? Because it doesn't need a subscription. I resent having 10 million subscriptions and don't want yet another one, so I use this instead.
Does it lag me? When it's actively transcribing, it does reduce your framerate. You can use smaller models to get better performance at the cost of accuracy.
Are there other projects doing STT? Yes, see the README on GitHub.
How does the chatbox only have 12 polygons? I'm using a technique called raymarching to simulate more complicated geometry inside of a box (6 faces = 12 triangles). The raymarching implementation is heavily optimized and should not lag anyone.
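For readers unfamiliar with raymarching, here is a minimal CPU sketch of the general idea (sphere tracing against a signed distance function), written in Python for readability. It is not the chatbox shader, which runs on the GPU and is not reproduced here. All function names and constants are illustrative assumptions.

```python
import math


def sd_box(p, half_extents):
    """Signed distance from point p to an axis-aligned box centred at the origin."""
    q = [abs(p[i]) - half_extents[i] for i in range(3)]
    outside = math.sqrt(sum(max(c, 0.0) ** 2 for c in q))
    inside = min(max(q), 0.0)
    return outside + inside


def raymarch(origin, direction, sdf, max_steps=64, max_dist=20.0, eps=1e-4):
    """Step along the ray by the distance to the nearest surface until we hit or give up."""
    t = 0.0
    for _ in range(max_steps):
        p = [origin[i] + t * direction[i] for i in range(3)]
        d = sdf(p)
        if d < eps:
            return t  # hit: distance travelled along the ray
        t += d        # safe step: nothing is closer than d
        if t > max_dist:
            break
    return None       # miss


def unit_cube(p):
    return sd_box(p, (0.5, 0.5, 0.5))


hit = raymarch((0.0, 0.0, -3.0), (0.0, 0.0, 1.0), unit_cube)
miss = raymarch((0.0, 0.0, -3.0), (0.0, 1.0, 0.0), unit_cube)
```

Because the shape is defined by a function rather than by triangles, the rendered detail does not add polygons; the only real geometry is the bounding box the shader draws into.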

Legal:

This asset, as well as the OSC app, is licensed under the MIT license. Commercial use is allowed as long as a copy of the license is provided with any copies of the asset. The provided .unitypackage includes the LICENSE file, so there's nothing to worry about when using it.

249 downloads

.unitypackage with modular avatar prefab
