The goal of this project was initially to customize my announcer for the VcNarrator plugin for Vencord. However, in that process I realized the plugin just calls whatever engine speech-dispatcher uses. This led me down a relatively straightforward path to getting my own custom voice for my entire system. The whole project was completed within a day, including two hours of fine-tuning.
My process for this project was far from ideal (pip sucks), so I will only go over the high-level steps involved; you can tackle each one in whatever way works best for your use case.
Training Data
Downloading the content
Ideally you want an hour or two of single-speaker clips of your ‘target’ voice. I used yt-dlp and exported to m4a. Find the available formats with:
yt-dlp https://your.info/here --list-formats
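From there the actual download is one more command. This is just a sketch of a reasonable invocation, not exactly what I ran; pick whatever format --list-formats showed you:
yt-dlp -f bestaudio -x --audio-format m4a \
  -o 'source.%(ext)s' https://your.info/here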
I then manually trimmed off the intro and outro sequences with Audacity and exported as WAV. The audio also has to be in the format the Piper TTS engine expects (mono, 22050 Hz). You can convert it with this ffmpeg command:
ffmpeg -i source.wav -ac 1 -ar 22050 compliant.wav
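If you want to sanity-check the result, ffprobe will print the sample rate and channel count:
ffprobe -hide_banner compliant.wav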
Transcription and formatting
Use Whisper to transcribe the audio.
I downloaded the whisper.cpp repo and built it myself. If you can get it through your package manager, you should probably do that; I wasted more time getting GPU acceleration working than it would have taken to just let the Python CPU version chew through two hours of audio. Make sure to download the ggml-base.en.bin model file. The command should look something like this (the JSON output lands next to the input file, e.g. data/compliant.wav.json):
whisper-cli data/compliant.wav \
  --model /path/to/models/ggml-base.en.bin \
  --language en --output-json
Clip the audio
Now we have to split the audio into chunks. I did so with a janky Python script; if you’re aware of a better way, go with that.
import json
import subprocess
import os

os.makedirs('data/wavs', exist_ok=True)

# these files are your whisper speech-to-text output
with open('data/compliant.wav.json', 'r') as f:
    data = json.load(f)

with open('data/metadata.csv', 'w') as meta:
    clip_num = 0
    for segment in data['transcription']:
        start_ms = segment['offsets']['from']
        end_ms = segment['offsets']['to']
        start = start_ms / 1000.0
        end = end_ms / 1000.0
        duration = end - start
        text = segment['text'].strip()
        if duration < 1.0 or duration > 15.0:
            continue
        output_file = f'data/wavs/{clip_num:05d}.wav'
        cmd = [
            'ffmpeg', '-i', 'data/compliant.wav',
            '-ss', str(start),
            '-to', str(end),
            '-ar', '22050',
            '-ac', '1',
            '-acodec', 'pcm_s16le',
            '-y',
            output_file
        ]
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode == 0:
            meta.write(f'wavs/{clip_num:05d}.wav|{text}\n')
            print(f"{clip_num}: ({duration:.1f}s) {text[:60]}...")
            clip_num += 1
        else:
            print(f"ERROR processing segment {clip_num}")

print(f"\nCreated {clip_num} clips!")
print("Check data/wavs/ and data/metadata.csv")
Fine-tuning and testing
Download a base model
Download whatever base model you want from https://huggingface.co/rhasspy/piper-voices/. I went with Ryan (medium), but pick one that sounds close to your target’s voice.
Set up your training environment
I was struggling to get the Piper recording studio working with my system’s Python configuration. I have no patience for pip left in me, so I just went with a Docker container I could pass my GPU through to use PyTorch’s CUDA acceleration. I don’t really recommend this since it adds a bunch of extra headaches, and for that reason I’ll leave this step vague.
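For what it’s worth, the kind of container invocation I mean looks roughly like this. It assumes the NVIDIA container toolkit is installed and uses the stock pytorch/pytorch image, so treat it as a sketch rather than my exact setup:
docker run --rm -it --gpus all \
  -v "$(pwd)":/workspace -w /workspace \
  pytorch/pytorch:latest bash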
Clone the piper repo https://github.com/rhasspy/piper.git or download the release version.
cd into the src/python dir and install the module:
pip install -e .
Then build the monotonic align extension:
./build_monotonic_align.sh
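One step worth calling out explicitly: before training, the dataset (metadata.csv plus the wavs/ directory) has to be run through Piper’s preprocessing to generate the training config and phoneme data. Going from memory of piper’s TRAINING.md, it looks roughly like this; adjust the paths and --language for your case:
python3 -m piper_train.preprocess \
  --language en-us \
  --input-dir /path/to/data \
  --output-dir /output/dir \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050
The --dataset-dir you pass to piper_train below should point at this --output-dir.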
Fine-tuning
How exactly you run the piper_train module depends on your configuration. Since I did mine in a Docker environment I don’t have a record of the exact arguments I passed, but at a minimum it should look something like the command below. A few notes on the flags: use --accelerator cpu if you aren’t using ROCm or CUDA, drop --batch-size to 16 for lower RAM/VRAM usage, and set --precision 16 if 32 errors out. --resume_from_checkpoint is the .ckpt you downloaded from Hugging Face, and --checkpoint-epochs controls how often checkpoints are written (I would check each checkpoint for regressions). For --max_epochs I recommend about 500 more than the checkpoint’s current epoch value; since the model is already trained, that value is already high, and I don’t really know a better way to find it than to run the piper_train module and let it yell at you for exceeding the max epochs. Just add 500 to that number.
python3 -m piper_train \
  --dataset-dir /output/dir \
  --accelerator gpu \
  --devices 1 \
  --batch-size 32 \
  --validation-split 0.0 \
  --num-test-examples 0 \
  --max_epochs 500 \
  --resume_from_checkpoint /data/model/base_checkpoint.ckpt \
  --checkpoint-epochs 50 \
  --precision 32
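To actually listen to a checkpoint, it first has to be exported from the Lightning .ckpt format to ONNX. Piper ships an exporter module for this; the paths here are placeholders (Lightning writes checkpoints under lightning_logs/ in the dataset directory):
python3 -m piper_train.export_onnx \
  /output/dir/lightning_logs/version_0/checkpoints/your_checkpoint.ckpt \
  /output/dir/checkpoint.onnx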
You can check out the model as it trains by copying the baseline JSON file from the Hugging Face repo and renaming it to match your current model’s name. It’s worth noting the JSON filename MUST include the .onnx extension! Here’s an example:
echo "Testing testing 1 2 3" | \
piper-tts --model checkpoint.onnx \
--output-raw | aplay -r 22050 -f S16_LE -c 1
In the case above, the JSON file should be named checkpoint.onnx.json.
Implementation
Depends on what you want…
Depending on what your goals are you can do a number of things. However, as far as I’m aware you’re required to use the piper-tts engine (or another project that understands Piper’s VITS format) to use the model. I would recommend just plugging the piper-tts engine into speech-dispatcher so it replaces your default system voice.
Example: speech-dispatcher piper-tts engine
Here’s an example of what to add to or alter in your speechd.conf file:
DefaultVoiceType "MALE1" #or female, whatever you set
DefaultVoice "YOURVOICEFILEHERE" #don't include extension
DefaultLanguage en
AddModule "piper-tts-generic" "sd_generic" "piper-tts-generic.conf"
DefaultModule piper-tts-generic
Add your onnx and json files to the piper-voices directory. Mine was /usr/share/piper-voices/en/en_US/voice/type/ where type is high, medium, or low. You also need to add a conf file in the speech-dispatcher/modules directory.
Example: ~/.config/speech-dispatcher/modules/piper-tts-generic.conf
GenericExecuteSynth "echo '$DATA' | piper-tts --model /usr/share/piper-voices/en/en_US/VOICENAME/TYPE/voicefile.onnx -f - | mpv --volume=80 --no-terminal --keep-open=no -"
AddVoice "en_US" "MALE1" "voicefile" #don't include extension
DefaultVoice "voicefile" #don't include extension
Restart the speech-dispatcher service and you can test with:
spd-say "testing testing 1 2 3"
It should come out in your new voice. If not, a system reboot could be required; for some reason mine didn’t work until I had restarted the service three times, so I think it’s a little janky.
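If it’s not picking up the new default, you can also tell spd-say to target the module explicitly, which helps narrow down whether the module itself works or the defaults just aren’t being applied:
spd-say -o piper-tts-generic "testing testing 1 2 3"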
Enjoy
The benefit of using speech-dispatcher is that application speech and accessibility tools will use this voice without extra work. You can also add multiple voices and select between them per application.