The goal of this project was initially to customize my announcer for the VcNarrator plugin for Vencord. However, in that process I realized the plugin just calls whatever engine speech-dispatcher uses. This led me down a relatively straightforward path to getting my own custom voice for my entire system. The whole project was completed within a day, including two hours of fine-tuning.
My process for this project was far from ideal (pip sucks), so I will only go over the high-level steps involved; you can tackle each one in whatever way works best for your use case.
Training Data
Downloading the content
Ideally you want an hour or two of single-speaker clips of your ‘target’ voice. I used yt-dlp and exported to m4a. Find the available formats with:
yt-dlp https://your.info/here --list-formats
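From there the actual download is one more command. This is just a sketch of a reasonable invocation, not exactly what I ran; pick whatever format --list-formats showed you:
yt-dlp -f bestaudio -x --audio-format m4a \
  -o 'source.%(ext)s' https://your.info/here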
I then manually trimmed off the intro and outro sequences with Audacity and exported as WAV. The audio also has to be in the format the Piper TTS engine expects (mono, 22050 Hz). You can convert it with this ffmpeg command:
ffmpeg -i source.wav -ac 1 -ar 22050 compliant.wav
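If you want to sanity-check the result, ffprobe will print the sample rate and channel count:
ffprobe -hide_banner compliant.wav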
Transcription and formatting
Use Whisper to transcribe the audio.
I downloaded the whisper.cpp repo and built it myself. If you can get it through your package manager, you should probably do that; I wasted more time getting GPU acceleration working than it would have taken to just let the Python CPU version chew through two hours of audio. Make sure to download the ggml-base.en.bin model file. The command should look something like this (the JSON output lands next to the input file, e.g. data/compliant.wav.json):
whisper-cli data/compliant.wav \
  --model /path/to/models/ggml-base.en.bin \
  --language en --output-json
Clip the audio
Now we have to split the audio into chunks. I did so with a janky Python script; if you’re aware of a better way, go with that.
import json
import subprocess
import os

os.makedirs('data/wavs', exist_ok=True)

# these files are your whisper speech-to-text output
with open('data/compliant.wav.json', 'r') as f:
    data = json.load(f)

with open('data/metadata.csv', 'w') as meta:
    clip_num = 0
    for segment in data['transcription']:
        start_ms = segment['offsets']['from']
        end_ms = segment['offsets']['to']
        start = start_ms / 1000.0
        end = end_ms / 1000.0
        duration = end - start
        text = segment['text'].strip()
        if duration < 1.0 or duration > 15.0:
            continue
        output_file = f'data/wavs/{clip_num:05d}.wav'
        cmd = [
            'ffmpeg', '-i', 'data/compliant.wav',
            '-ss', str(start),
            '-to', str(end),
            '-ar', '22050',
            '-ac', '1',
            '-acodec', 'pcm_s16le',
            '-y',
            output_file
        ]
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode == 0:
            meta.write(f'wavs/{clip_num:05d}.wav|{text}\n')
            print(f"{clip_num}: ({duration:.1f}s) {text[:60]}...")
            clip_num += 1
        else:
            print(f"ERROR processing segment {clip_num}")

print(f"\nCreated {clip_num} clips!")
print("Check data/wavs/ and data/metadata.csv")
Fine-tuning and testing
Download a base model
Download whatever base model you want from https://huggingface.co/rhasspy/piper-voices/. I went with Ryan (medium), but pick one that sounds close to your target’s voice.
Set up your training environment
I was struggling to get the Piper recording studio working with my system’s Python configuration. I have no patience for pip left in me, so I just went with a Docker container I could pass my GPU through to use PyTorch’s CUDA acceleration. I don’t really recommend this since it adds a bunch of extra headaches, and for that reason I’ll leave this step vague.
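For what it’s worth, the kind of container invocation I mean looks roughly like this. It assumes the NVIDIA container toolkit is installed and uses the stock pytorch/pytorch image, so treat it as a sketch rather than my exact setup:
docker run --rm -it --gpus all \
  -v "$(pwd)":/workspace -w /workspace \
  pytorch/pytorch:latest bash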
Clone the piper repo https://github.com/rhasspy/piper.git or download the release version.
cd into the src/python dir and install the module:
pip install -e .
Then build the monotonic align extension:
./build_monotonic_align.sh
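One step worth calling out explicitly: before training, the dataset (metadata.csv plus the wavs/ directory) has to be run through Piper’s preprocessing to generate the training config and phoneme data. Going from memory of piper’s TRAINING.md, it looks roughly like this; adjust the paths and --language for your case:
python3 -m piper_train.preprocess \
  --language en-us \
  --input-dir /path/to/data \
  --output-dir /output/dir \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050
The --dataset-dir you pass to piper_train below should point at this --output-dir.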
Fine-tuning
How exactly you run the piper_train module depends on your configuration. Since I did mine in a Docker environment I don’t have a record of the exact arguments I passed, but at a minimum it should look something like the command below. A few notes on the flags: use --accelerator cpu if you aren’t using ROCm or CUDA, drop --batch-size to 16 for lower RAM/VRAM usage, and set --precision 16 if 32 errors out. --resume_from_checkpoint is the .ckpt you downloaded from Hugging Face, and --checkpoint-epochs controls how often checkpoints are written (I would check each checkpoint for regressions). For --max_epochs I recommend about 500 more than the checkpoint’s current epoch value; since the model is already trained, that value is already high, and I don’t really know a better way to find it than to run the piper_train module and let it yell at you for exceeding the max epochs. Just add 500 to that number.
python3 -m piper_train \
  --dataset-dir /output/dir \
  --accelerator gpu \
  --devices 1 \
  --batch-size 32 \
  --validation-split 0.0 \
  --num-test-examples 0 \
  --max_epochs 500 \
  --resume_from_checkpoint /data/model/base_checkpoint.ckpt \
  --checkpoint-epochs 50 \
  --precision 32
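To actually listen to a checkpoint, it first has to be exported from the Lightning .ckpt format to ONNX. Piper ships an exporter module for this; the paths here are placeholders (Lightning writes checkpoints under lightning_logs/ in the dataset directory):
python3 -m piper_train.export_onnx \
  /output/dir/lightning_logs/version_0/checkpoints/your_checkpoint.ckpt \
  /output/dir/checkpoint.onnx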
You can check out the model as it trains by copying the baseline JSON file from the Hugging Face repo and renaming it to match your current model’s name. It’s worth noting the JSON filename MUST include the .onnx extension! Here’s an example:
echo "Testing testing 1 2 3" | \
piper-tts --model checkpoint.onnx \
--output-raw | aplay -r 22050 -f S16_LE -c 1
In the case above, the JSON file should be named checkpoint.onnx.json.
Implementation
Depends on what you want…
Depending on what your goals are you can do a number of things. However, as far as I’m aware you’re required to use the piper-tts engine (or another project that understands Piper’s VITS format) to use the model. I would recommend just plugging the piper-tts engine into speech-dispatcher so it replaces your default system voice.
Example: speech-dispatcher piper-tts engine
Here’s an example of what to add to or alter in your speechd.conf file:
DefaultVoiceType "MALE1" #or female, whatever you set
DefaultVoice "YOURVOICEFILEHERE" #don't include extension
DefaultLanguage en
AddModule "piper-tts-generic" "sd_generic" "piper-tts-generic.conf"
DefaultModule piper-tts-generic
Add your onnx and json files to the piper-voices directory. Mine was /usr/share/piper-voices/en/en_US/voice/type/ where type is high, medium, or low. You also need to add a conf file in the speech-dispatcher/modules directory.
Example: ~/.config/speech-dispatcher/modules/piper-tts-generic.conf
GenericExecuteSynth "echo '$DATA' | piper-tts --model /usr/share/piper-voices/en/en_US/VOICENAME/TYPE/voicefile.onnx -f - | mpv --volume=80 --no-terminal --keep-open=no -"
AddVoice "en_US" "MALE1" "voicefile" #don't include extension
DefaultVoice "voicefile" #don't include extension
Restart the speech-dispatcher service and you can test with:
spd-say "testing testing 1 2 3"
It should come out in your new voice. If not, a system reboot could be required; for some reason mine didn’t work until I had restarted the service three times, so I think it’s a little janky.
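If it’s not picking up the new default, you can also tell spd-say to target the module explicitly, which helps narrow down whether the module itself works or the defaults just aren’t being applied:
spd-say -o piper-tts-generic "testing testing 1 2 3"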
Enjoy
The benefit of using speech-dispatcher is that application speech and accessibility tools will use this voice without extra work. You can also add multiple voices and select between them per application.