Speech Synthesis

The olaverse.speech module provides a modular Text-to-Speech (TTS) architecture: an end-to-end pipeline plus abstract base classes for plugging in your own acoustic models and vocoders.

TTS Pipeline

The TTSPipeline coordinates the entire flow from raw text to output waveform. It handles text normalization and diacritic restoration automatically before passing the text to your acoustic model, whose output is then rendered to audio by the vocoder.

olaverse.speech.TTSPipeline

End-to-end Text-to-Speech pipeline.

Orchestrates the entire flow:

1. Text Normalization (numbers, abbreviations)
2. Tone Restoration (diacritization)
3. Acoustic Modeling (text -> Mel-spectrogram)
4. Vocoding (Mel-spectrogram -> audio waveform)

Functions

synthesize(text)

Synthesize raw text into an audio waveform.
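
To make the four-stage order concrete, here is a minimal, self-contained sketch of how such a pipeline could chain its stages. It does not use olaverse itself: `SketchPipeline`, `normalize`, `restore_tones`, `acoustic_model`, and `vocoder` are all hypothetical stand-ins, and the real TTSPipeline constructor and internals may differ.

```python
from typing import Any, Callable, List, Sequence


# All names below are hypothetical stand-ins, not part of olaverse.
def normalize(text: str) -> str:
    """Toy text normalization: expand one digit and one abbreviation."""
    expansions = {"2": "two", "Dr.": "Doctor"}
    return " ".join(expansions.get(token, token) for token in text.split())


def restore_tones(text: str) -> str:
    """Placeholder for diacritic restoration; a real restorer would add tone marks."""
    return text


def acoustic_model(text: str) -> List[List[float]]:
    """Fake Mel-spectrogram: one 2-bin frame per input character."""
    return [[float(ord(ch) % 7), 1.0] for ch in text]


def vocoder(mel: List[List[float]]) -> List[float]:
    """Fake waveform: 4 samples per frame, scaled from the first bin."""
    return [frame[0] / 10.0 for frame in mel for _ in range(4)]


class SketchPipeline:
    """Illustrative only: mirrors the documented stage order, not olaverse internals."""

    def __init__(self, stages: Sequence[Callable[[Any], Any]]):
        self.stages = list(stages)

    def synthesize(self, text: str) -> Any:
        data: Any = text
        for stage in self.stages:  # each stage consumes the previous stage's output
            data = stage(data)
        return data


pipeline = SketchPipeline([normalize, restore_tones, acoustic_model, vocoder])
waveform = pipeline.synthesize("Dr. Ade has 2 cats")
print(len(waveform))
```

The point of the sketch is the data flow: text stays text through the first two stages, becomes frame-level features in the third, and becomes samples only in the last.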

Model Interfaces

If you are training or integrating custom acoustic models or vocoders, make sure they inherit from these base classes.

olaverse.speech.BaseAcousticModel

Bases: ABC

Abstract base class for Acoustic Models (e.g., FastSpeech, Tacotron). These models convert normalized/diacritized phonetic text into acoustic features like Mel-spectrograms.

Functions

load_weights(path) abstractmethod

Load PyTorch/ONNX model weights from the specified path.

forward(text) abstractmethod

Convert text into acoustic features.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| text | str | Phonetically normalized text. | required |

Returns: Acoustic features (e.g., a Mel-spectrogram tensor).
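
As a rough illustration of implementing this interface, the sketch below defines a toy subclass. To stay runnable without olaverse installed, it declares a local stand-in `BaseAcousticModel` with the two documented abstract methods; in real use you would presumably import the class from olaverse.speech instead. `ToyAcousticModel`, its `n_mels` parameter, and the weights path are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class BaseAcousticModel(ABC):
    """Local stand-in mirroring the documented interface (assumption, not the real class)."""

    @abstractmethod
    def load_weights(self, path):
        ...

    @abstractmethod
    def forward(self, text):
        ...


class ToyAcousticModel(BaseAcousticModel):
    """Hypothetical model: emits one fixed-size frame per input character."""

    def __init__(self, n_mels: int = 4):
        self.n_mels = n_mels
        self.weights_path: Optional[str] = None

    def load_weights(self, path: str) -> None:
        # A real model would deserialize PyTorch/ONNX weights here;
        # this toy only records the path and never opens it.
        self.weights_path = path

    def forward(self, text: str) -> List[List[float]]:
        # Output shape: (len(text), n_mels), values in [0.0, 0.9]
        return [
            [((ord(ch) + k) % 10) / 10.0 for k in range(self.n_mels)]
            for ch in text
        ]


model = ToyAcousticModel(n_mels=4)
model.load_weights("weights/toy.bin")  # hypothetical path
mel = model.forward("bawo")


# A subclass that skips an abstract method cannot be instantiated.
class Incomplete(BaseAcousticModel):
    def load_weights(self, path):
        pass


try:
    Incomplete()
    abstract_enforced = False
except TypeError:
    abstract_enforced = True
```

Inheriting from an ABC means a missing `forward` or `load_weights` fails at construction time rather than midway through synthesis.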

olaverse.speech.BaseVocoder

Bases: ABC

Abstract base class for Vocoders (e.g., HiFi-GAN, WaveGlow). These models convert acoustic features (like Mel-spectrograms) into raw audio waveforms.

Functions

load_weights(path) abstractmethod

Load PyTorch/ONNX model weights from the specified path.

generate(acoustic_features) abstractmethod

Convert acoustic features into a raw audio waveform.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| acoustic_features | | Output from an Acoustic Model. | required |

Returns: Audio waveform array/tensor.
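
The vocoder side can be sketched the same way. The block below again uses a local stand-in `BaseVocoder` so it runs without olaverse; `ToySineVocoder` and its `hop_length`, `sample_rate`, and `freq` parameters are hypothetical, and the sine rendering is a placeholder for a neural vocoder such as HiFi-GAN, not an approximation of one.

```python
import math
from abc import ABC, abstractmethod
from typing import List, Optional


class BaseVocoder(ABC):
    """Local stand-in mirroring the documented interface (assumption, not the real class)."""

    @abstractmethod
    def load_weights(self, path):
        ...

    @abstractmethod
    def generate(self, acoustic_features):
        ...


class ToySineVocoder(BaseVocoder):
    """Hypothetical vocoder: each frame becomes a short sine burst
    whose amplitude is the mean of that frame's bins."""

    def __init__(self, hop_length: int = 8, sample_rate: int = 16000, freq: float = 440.0):
        self.hop_length = hop_length
        self.sample_rate = sample_rate
        self.freq = freq
        self.weights_path: Optional[str] = None

    def load_weights(self, path: str) -> None:
        # Nothing to load for a sine generator; record the path only.
        self.weights_path = path

    def generate(self, acoustic_features: List[List[float]]) -> List[float]:
        waveform: List[float] = []
        for frame in acoustic_features:
            amplitude = sum(frame) / len(frame)
            for _ in range(self.hop_length):
                n = len(waveform)  # absolute sample index keeps the phase continuous
                phase = 2.0 * math.pi * self.freq * n / self.sample_rate
                waveform.append(amplitude * math.sin(phase))
        return waveform


vocoder = ToySineVocoder(hop_length=8)
mel = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.0]]  # three toy frames
audio = vocoder.generate(mel)
print(len(audio))
```

Each input frame expands to `hop_length` output samples, which is the usual relationship between frame-level features and sample-level audio.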