NLP & Tokenization

The olaverse.nlp module provides tools for text cleaning, PII masking, speech-ready normalization, language detection, diacritization for African languages, and tokenization.


Text Preprocessing & Cleaning

Utilities to strip HTML markup, clean up whitespace, and remove URLs from raw text.

from olaverse.nlp import clean_text

clean_text("Read more at <a href='https://olaverse.ai'>Olaverse</a>!   ")
# → 'Read more at Olaverse!'

olaverse.nlp.clean_text(text, remove_urls=True, remove_html=True)

General text cleaning utility. Strips extra whitespace and optionally removes URLs and HTML tags.
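
To illustrate the kind of steps such a cleaner chains together, here is a minimal standard-library sketch. It is not the olaverse implementation: the name clean_text_sketch and the regular expressions are assumptions made for illustration, and the real clean_text may apply different rules.

```python
import html
import re

# Hypothetical patterns for illustration only.
_TAG_RE = re.compile(r"<[^>]+>")
_URL_RE = re.compile(r"https?://\S+|www\.\S+")
_WS_RE = re.compile(r"\s+")


def clean_text_sketch(text, remove_urls=True, remove_html=True):
    """Optionally strip HTML tags and URLs, then collapse whitespace."""
    if remove_html:
        text = _TAG_RE.sub("", text)   # drop tags but keep their inner text
        text = html.unescape(text)     # turn entities such as &amp; into characters
    if remove_urls:
        text = _URL_RE.sub("", text)
    # Collapse runs of whitespace and trim both ends.
    return _WS_RE.sub(" ", text).strip()


print(clean_text_sketch("Read more at <a href='https://olaverse.ai'>Olaverse</a>!   "))
# → Read more at Olaverse!
```

Tags are removed before URLs in this sketch so that a link inside an href attribute disappears with its tag rather than leaving a fragment behind.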


Global PII Masking

Redacts sensitive personal identifiers (email addresses, international phone numbers, credit card numbers, and Social Security numbers) from text, replacing each with a placeholder tag.

from olaverse.nlp import mask_pii

mask_pii("Contact me at support@olaverse.co.uk or call +1-800-555-0199")
# → 'Contact me at [EMAIL] or call [PHONE]'

mask_pii("My card is 4111-1111-1111-1111 and SSN 123-45-6789")
# → 'My card is [CREDIT_CARD] and SSN [SSN]'

olaverse.nlp.mask_pii(text)

Masks common Personally Identifiable Information (PII) in text. Email addresses, credit card numbers, Social Security numbers (SSN), and phone numbers are replaced with placeholder tags such as [EMAIL] and [PHONE].
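
Placeholder-based masking of this kind is commonly built from an ordered list of regular expressions. The sketch below shows that general technique under stated assumptions: mask_pii_sketch and every pattern in it are hypothetical, much narrower than what a production masker needs, and not taken from the olaverse source.

```python
import re

# Illustrative patterns only. Order matters: the specific card and SSN
# patterns run before the permissive phone pattern, which would otherwise
# swallow any long run of digits and dashes.
_PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("[CREDIT_CARD]", re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def mask_pii_sketch(text):
    """Apply each pattern in turn, replacing matches with its placeholder."""
    for placeholder, pattern in _PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(mask_pii_sketch("Contact me at support@olaverse.co.uk or call +1-800-555-0199"))
# → Contact me at [EMAIL] or call [PHONE]
```

A regex-only approach will miss unusual formats and can over-match ordinary numbers, so it is best treated as a first pass rather than a guarantee.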


Text Normalization (TTS)

TTSNormalizer converts raw text into a form suitable for speech synthesis — expanding numbers, currencies, abbreviations, and date formats into their spoken equivalents.

from olaverse.nlp import TTSNormalizer

norm = TTSNormalizer()
print(norm.normalize("She paid $1,200.50 on 12/25/2023"))
# → 'She paid one thousand two hundred dollars and fifty cents on December twenty-fifth twenty twenty-three'

olaverse.nlp.TTSNormalizer

Normalizes text for TTS processing by expanding numbers, symbols, and abbreviations into full phonetic/spoken words based on the target language.

Functions

expand_numbers(text)

Expands digits into spoken words. The current implementation reads numbers digit by digit as a fallback; it can be extended with full localized number expansion.
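
A digit-by-digit reading like the fallback described above can be sketched in a few lines. The function name expand_digits_sketch and the English word list are assumptions for illustration; the library's own method may differ in detail.

```python
import re

_DIGIT_WORDS = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]


def expand_digits_sketch(text):
    """Replace each run of ASCII digits with its digits spelled out one by one."""
    def spell(match):
        return " ".join(_DIGIT_WORDS[int(d)] for d in match.group())
    return re.sub(r"[0-9]+", spell, text)


print(expand_digits_sketch("Room 204 is open"))
# → Room two zero four is open
```

Reading digit by digit is safe for identifiers such as room or phone numbers, but quantities like prices and years need place-value expansion to sound natural.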

normalize(text)

Applies all TTS normalization pipelines sequentially.
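
Applying steps sequentially usually means feeding each step's output into the next. The sketch below shows that composition pattern with three hypothetical steps (expand_abbreviations, expand_symbols, collapse_whitespace); the step names, their order, and the tiny lookup tables are assumptions, not the contents of TTSNormalizer.

```python
import re


def expand_abbreviations(text):
    # Tiny illustrative table; a real normalizer would use a per-language lexicon.
    table = {"Dr.": "Doctor", "St.": "Street"}
    for abbr, full in table.items():
        text = text.replace(abbr, full)
    return text


def expand_symbols(text):
    # Surrounding spaces keep the spoken word separate from its neighbours.
    return text.replace("%", " percent").replace("&", " and ")


def collapse_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()


# Order matters: whitespace cleanup runs last to tidy what earlier steps insert.
PIPELINE = [expand_abbreviations, expand_symbols, collapse_whitespace]


def normalize_sketch(text):
    """Run every step in order, passing each result to the next step."""
    for step in PIPELINE:
        text = step(text)
    return text


print(normalize_sketch("Dr. Ade & Bola scored 90%"))
# → Doctor Ade and Bola scored 90 percent
```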


Language Detection

LIDLite5 — Lightweight (Zero-dependency)

Model Card: olaverse/lid-lite-5

LIDLite5 uses a custom TF-IDF (unigram + bigram) feature extractor with a Logistic Regression classifier. The entire model is a single ~1.1 MB JSON file, and no deep learning framework is required.

Property     | Value
Model Type   | TF-IDF + Logistic Regression
Vocabulary   | 5,000 top n-gram terms
File Size    | 1.10 MB
Accuracy     | 98.12% (Macro)
Latency      | 0.014 ms/sentence
Dependencies | Pure Python + NumPy

pip install olaverse  # no extra dependencies needed

from olaverse import LIDLite5

detector = LIDLite5()

# Predict dominant language
lang = detector.predict("Bawo ni, se daadaa ni?")
print(lang)  # → 'yor'

# Get confidence scores across all 5 classes
probs = detector.predict_proba("How far, wetin dey happen?")
print(probs)
# → {'eng': 0.006, 'hau': 0.001, 'ibo': 0.002, 'pcm': 0.989, 'yor': 0.002}

detect_language — Quick one-liner

from olaverse.nlp import detect_language

detect_language("E kaaro, bawo ni?")  # → 'yor'

olaverse.nlp.detect_language(text, model_path='lid-lite-5.json')

Detect the language of the given text using LIDLite5. Returns: 'yor' (Yoruba), 'hau' (Hausa), 'ibo' (Igbo), 'pcm' (Pidgin), or 'eng' (English).

olaverse.nlp.LIDLite5

Lightweight TF-IDF + Logistic Regression language detector with no dependencies beyond NumPy. Covers 5 languages: Yoruba ('yor'), Hausa ('hau'), Igbo ('ibo'), Pidgin ('pcm'), and English ('eng').

Functions

predict(text)

Predict the language of the given text. Returns: 'yor', 'hau', 'ibo', 'pcm', or 'eng'.

predict_proba(text)

Returns a probability for each language, computed as a softmax over the classifier's logits.
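
The inference path of a TF-IDF + Logistic Regression classifier is short enough to sketch in pure Python. Everything below is a toy under stated assumptions: the word-level unigram and bigram tokenization, the two-label setup, and all IDF values and weights are made up for illustration and are not taken from the lid-lite-5 model file.

```python
import math
import re


def extract_ngrams(text):
    """Lowercased word unigrams plus adjacent-word bigrams (assumed tokenization)."""
    words = re.findall(r"\w+", text.lower())
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy model with made-up numbers.
LABELS = ["eng", "pcm"]
IDF = {"how": 1.0, "are": 1.2, "wetin": 2.0, "how far": 2.5}
WEIGHTS = {                      # term -> one weight per label, in LABELS order
    "how": [0.3, 0.1],
    "are": [1.5, -0.5],
    "wetin": [-1.0, 2.0],
    "how far": [-0.5, 1.8],
}
BIAS = [0.0, 0.0]


def predict_proba_sketch(text):
    # Term counts restricted to the known vocabulary.
    counts = {}
    for term in extract_ngrams(text):
        if term in IDF:
            counts[term] = counts.get(term, 0) + 1
    # TF-IDF features with L2 normalization (norm falls back to 1 if empty).
    feats = {t: c * IDF[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in feats.values())) or 1.0
    # Logits are the bias plus the weighted sum of normalized features.
    logits = list(BIAS)
    for term, value in feats.items():
        for i, weight in enumerate(WEIGHTS[term]):
            logits[i] += weight * value / norm
    return dict(zip(LABELS, softmax(logits)))


def predict_sketch(text):
    probs = predict_proba_sketch(text)
    return max(probs, key=probs.get)


print(predict_sketch("How far, wetin dey happen?"))
# → pcm
```

Because inference is only a sparse dot product followed by a softmax, a model of this shape can be stored as plain vocabulary and weight tables, which is consistent with a small JSON file and sub-millisecond latency.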


Diacritization

Yoruba Diacritizer

Model Card: olaverse/diacnet-yor · olaverse/diacnet-yor-x

Two complementary Yoruba diacritization models are available:

Model                                  | Architecture          | Word Accuracy | Character Accuracy | Size
diacnet-yor (dot-below only)           | BiLSTM                | -             | 93.35%             | 2.4 MB
diacnet-yor-x (full tonal + dot-below) | AfriBERTa Transformer | 82.46%        | -                  | 503 MB

The Diacritizer class uses a Viterbi decoding approach on top of the transformer model for full tonal restoration, achieving 90.0% word accuracy on evaluation sets.
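
Viterbi decoding picks the best whole sequence instead of the best candidate at each position in isolation. The sketch below is a generic version of the algorithm; how the Diacritizer actually derives its scores from the transformer is not documented here, so the candidate forms, emission scores, and transition function are all made-up assumptions.

```python
def viterbi(emissions, transition):
    """Return the highest-scoring label sequence.

    emissions: non-empty list of dicts, one per position, label -> log-score.
    transition: function (prev_label, label) -> log-score.
    """
    # best[label] = (score of the best path ending in label, that path)
    best = {label: (score, [label]) for label, score in emissions[0].items()}
    for scores in emissions[1:]:
        step = {}
        for label, emit in scores.items():
            # Choose the predecessor that maximizes path score plus transition.
            prev_label, (prev_score, prev_path) = max(
                best.items(),
                key=lambda item: item[1][0] + transition(item[0], label),
            )
            total = prev_score + transition(prev_label, label) + emit
            step[label] = (total, prev_path + [label])
        best = step
    return max(best.values(), key=lambda pair: pair[0])[1]


# Toy example with invented scores: two candidates per input word.
GO, TO = "lọ", "sí"
emissions = [
    {"lo": -0.4, GO: -0.6},   # position 0: the bare form scores slightly higher
    {"si": -0.7, TO: -0.5},   # position 1
]


def transition(prev, cur):
    # Hypothetical bigram preference for the pair (GO, TO).
    return 0.0 if (prev, cur) == (GO, TO) else -1.0


best_path = viterbi(emissions, transition)
print(" ".join(best_path))
```

In this toy setup a greedy per-position choice would keep the undiacritized first word, while the sequence-level search prefers the pair that fits together, which is the motivation for decoding over the whole sentence.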

from olaverse.nlp import diacritize_yoruba, diacritize_yoruba_dot_below

# Full tonal diacritics (uses diacnet-yor-x with Viterbi decoding)
diacritize_yoruba("Ojo lo si oja lana")
# → 'Ọjọ́ ló sí ọjà lànà'

# Dot-below only (fast, uses diacnet-yor BiLSTM)
diacritize_yoruba_dot_below("Ojo lo si oja")
# → 'Ọjọ lo si ọja'

Igbo Diacritizer

from olaverse.nlp import diacritize_igbo

diacritize_igbo("Kedu ka i mere")
# → 'Kedụ ka ị mere'

olaverse.nlp.Diacritizer

Unified interface for restoring diacritics in African languages using model IDs.

olaverse.nlp.diacritize_yoruba(text, model_path=None)

olaverse.nlp.diacritize_igbo(text, model_path=None)


Tokenization — OTK-BPE-50k

Model Card: olaverse/otk-bpe-50k

Tokenizer wraps the OTK-BPE-50k family of Byte-Level BPE tokenizers, each trained on dedicated African language corpora. They achieve 0% out-of-vocabulary (OOV) tokens via raw UTF-8 byte mapping.
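
The reason byte-level BPE cannot produce an out-of-vocabulary token is that every string first becomes a sequence of UTF-8 bytes, and all 256 byte values are in the base vocabulary. The sketch below demonstrates that idea with a single hypothetical merge; the helper names and the merge itself are assumptions and do not reflect the trained OTK-BPE-50k merge table.

```python
def to_byte_ids(text):
    """Map text to base token IDs 0-255, one per UTF-8 byte."""
    return list(text.encode("utf-8"))


def apply_merge(ids, pair, new_id):
    """Replace every adjacent occurrence of `pair` with `new_id` (one BPE merge)."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def from_ids(ids, merges):
    """Expand merged tokens (new_id -> pair) back to bytes, then decode."""
    def expand(token):
        if token in merges:
            left, right = merges[token]
            return expand(left) + expand(right)
        return [token]
    raw = [b for token in ids for b in expand(token)]
    return bytes(raw).decode("utf-8")


text = "Ẹ kú àbọ̀"
ids = to_byte_ids(text)
assert all(0 <= i < 256 for i in ids)      # nothing falls outside the base vocabulary

# Hypothetical merge: fuse the first two bytes of the text into token 256.
pair = (ids[0], ids[1])
merged = apply_merge(ids, pair, 256)
assert len(merged) < len(ids)              # merging shortens the sequence
assert from_ids(merged, {256: pair}) == text
```

A trained tokenizer learns tens of thousands of such merges from its corpus, which is how frequent diacritized syllables and words end up as single tokens.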

Model             | Language        | Vocab | Token count vs. GPT-4
otk-bpe-50k-yo    | Yoruba          | 50k   | 63% fewer tokens
otk-bpe-50k-ig    | Igbo            | 50k   | ~60% fewer tokens
otk-bpe-50k-ha    | Hausa           | 50k   | ~58% fewer tokens
otk-bpe-50k-pcm   | Nigerian Pidgin | 50k   | ~55% fewer tokens
otk-bpe-50k-naija | Unified (all 4) | 50k   | Balanced

pip install olaverse  # tokenizer library included

from olaverse.nlp import Tokenizer

# Load by short code: "yo", "ig", "ha", "pcm", or "naija"
tok = Tokenizer("yo")

tokens = tok.encode("Ẹ kú àbọ̀")
print(tokens)   # → [124, 381]

decoded = tok.decode(tokens)
print(decoded)  # → 'Ẹ kú àbọ̀'

olaverse.nlp.Tokenizer

A unified BPE Tokenizer for Nigerian languages. Supports 'yoruba', 'igbo', 'hausa', 'pidgin', and 'naija' (unified).

Functions

encode(text)

Encode input text into a list of token IDs.

decode(ids)

Decode a list of token IDs back into a string.