Overview
This page provides accuracy information for each tool in lexprep. Where official benchmarks exist, they are cited with references. Where no formal evaluation exists, estimates are clearly marked.
English Tools
G2P (Grapheme-to-Phoneme)
g2p-en
How it works:
- Spells out numbers and currency symbols
- Disambiguates homographs using POS tags
- Looks up CMU Pronouncing Dictionary (~134,500 words)
- Predicts OOV words using neural seq2seq model
Dictionary words: 100% accuracy relative to the CMU Pronouncing Dictionary (by definition, since the output is a dictionary lookup)
OOV words: predicted by the neural model (no published benchmark)
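The lookup-then-predict order above determines which accuracy statement applies to a given word. The following is a minimal sketch of that order, not g2p-en's actual code: it uses a toy dictionary and a caller-supplied stand-in for the neural model. The names TOY_DICT, TOY_HOMOGRAPHS, transcribe, and fake_model are hypothetical and exist only for illustration. It tags each output with its source so that OOV predictions can be reviewed separately.

```python
from typing import Callable, Dict, List, Tuple

Phones = List[str]

# Toy stand-ins for CMUdict entries (ARPAbet). Illustration only.
TOY_DICT: Dict[str, Phones] = {
    "cat": ["K", "AE1", "T"],
    "two": ["T", "UW1"],
}

# Toy homograph table keyed by (word, POS tag). Illustration only.
TOY_HOMOGRAPHS: Dict[Tuple[str, str], Phones] = {
    ("read", "VB"): ["R", "IY1", "D"],
    ("read", "VBD"): ["R", "EH1", "D"],
}


def transcribe(
    word: str, pos: str, predict: Callable[[str], Phones]
) -> Tuple[Phones, str]:
    """Return (phonemes, source); source records how the result was obtained."""
    key = word.lower()
    if (key, pos) in TOY_HOMOGRAPHS:
        return TOY_HOMOGRAPHS[(key, pos)], "homograph"
    if key in TOY_DICT:
        return TOY_DICT[key], "dictionary"
    # Out-of-vocabulary: defer to the model. These outputs have no published accuracy.
    return predict(key), "predicted"


def fake_model(word: str) -> Phones:
    """Placeholder for a neural seq2seq model: emits one symbol per letter."""
    return [ch.upper() for ch in word]


if __name__ == "__main__":
    for w, pos in [("cat", "NN"), ("read", "VBD"), ("blorp", "NN")]:
        print(w, *transcribe(w, pos, fake_model))
```

Keeping the source label alongside each transcription makes it straightforward to report what share of a word list fell outside the dictionary.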
Syllable Count
pyphen
Uses TeX-compatible hyphenation dictionaries from LibreOffice (Hunspell format). Hyphenates words using language-specific patterns; syllable count is inferred from hyphenation segments.
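As a rough illustration of how a count can be inferred from hyphenation segments, the sketch below counts the pieces of an already-hyphenated string. The function name syllables_from_hyphenation is hypothetical, and the code does not reproduce lexprep's implementation. In practice the hyphenated string would come from pyphen, for example from Pyphen(lang="en_US").inserted(word).

```python
def syllables_from_hyphenation(hyphenated: str, sep: str = "-") -> int:
    """Estimate syllables as the number of hyphenation segments.

    `hyphenated` is a word with break points already marked by `sep`,
    such as the string returned by pyphen's Pyphen.inserted().
    """
    segments = [part for part in hyphenated.split(sep) if part]
    # A word with no break points still has at least one syllable.
    return max(1, len(segments))


if __name__ == "__main__":
    for marked in ["hy-phen-ation", "cat", "dic-tio-nary"]:
        print(marked, syllables_from_hyphenation(marked))
```

Hyphenation patterns mark acceptable line-break points, not every syllable boundary, so a count obtained this way can undercount, particularly for short words. Words that already contain a literal hyphen would need the separator changed or the parts counted individually.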
POS Tagging
spaCy
Model: en_core_web_sm (trained on OntoNotes 5)
| Metric | Score |
|---|---|
| Token accuracy | 97% |
| POS accuracy | 97% |
| Sentence segmentation F1 | 91% |
Persian Tools
G2P (Grapheme-to-Phoneme)
PersianG2p
How it works:
- Normalizes text
- Looks up dictionary (~48,000 words with use_large=True)
- Predicts unknown words using neural network
Dictionary size: ~48,000 Persian words
OOV handling: Neural model prediction
Syllable Count
Heuristic
Methods available:
- Orthographic: Counts vowel patterns in Persian spelling (estimated accuracy ~65-80%)
- Phonetic: Counts vowels in G2P output (estimated accuracy ~90-95%, depends on G2P accuracy)
No formal benchmark exists. These are estimates. The heuristic may fail on loanwords, compound words, and words without diacritics. For research, manually verify a sample.
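The phonetic method relies on the fact that every Persian syllable has exactly one vowel nucleus, so counting vowels in a transcription gives the syllable count. The sketch below assumes a romanized transcription in which the six vowels are written a, e, o, A, i, u (with A for the long back vowel). That symbol set is an assumption made for illustration and may not match the output format of PersianG2p, and count_syllables_phonetic is a hypothetical name.

```python
# Assumed romanization of the six Persian vowels; adjust to the G2P tool's symbols.
PERSIAN_VOWELS = frozenset("aeoAiu")


def count_syllables_phonetic(transcription: str) -> int:
    """Count vowel symbols in a phonemic transcription (one per syllable)."""
    return sum(1 for symbol in transcription if symbol in PERSIAN_VOWELS)


if __name__ == "__main__":
    # Hypothetical transcriptions in the assumed romanization.
    for word in ["ketAb", "dAneSgAh", "madrese"]:
        print(word, count_syllables_phonetic(word))
```

Because the count is taken directly from the transcription, any G2P error that adds or drops a vowel changes the syllable count by the same amount.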
POS Tagging
Stanza
Official benchmarks (Stanza v1.5.1 on UD v2.12):
| Dataset | Tokens | UPOS | Lemma |
|---|---|---|---|
| Persian Seraji | 100% | 97.69% | 98.18% |
Output: Universal POS tags (UPOS) + Lemma
Treebank: UD Persian Seraji
Japanese Tools
POS Tagging (Stanza)
Stanza
Official benchmarks (Stanza v1.5.1 on UD v2.12):
| Dataset | Tokens | UPOS | Lemma |
|---|---|---|---|
| Japanese GSD | 97.37% | 96.38% | 96.02% |
| Japanese GSDLUW | 96.32% | 95.10% | 94.67% |
Output: Universal POS tags (17 categories)
Best for: Cross-lingual research, UD compatibility
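For cross-lingual comparisons it can help to confirm that every tag in an output really belongs to the 17-tag UPOS inventory before aggregating. The sketch below lists the Universal Dependencies UPOS tags and builds a tag distribution. The helper name upos_distribution is hypothetical and the tag sequence is made up for illustration.

```python
from collections import Counter
from typing import Sequence

# The 17 Universal POS tags defined by Universal Dependencies.
UPOS_TAGS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def upos_distribution(tags: Sequence[str]) -> Counter:
    """Count tags, rejecting anything outside the UPOS inventory."""
    unknown = set(tags) - UPOS_TAGS
    if unknown:
        raise ValueError(f"Not UPOS tags: {sorted(unknown)}")
    return Counter(tags)


if __name__ == "__main__":
    print(upos_distribution(["NOUN", "ADP", "NOUN", "VERB", "PUNCT"]))
```

Language-specific tags, such as the detailed UniDic categories, would be rejected by this check and need a mapping to UPOS before any comparison.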
POS Tagging (UniDic)
Fugashi + UniDic
MeCab morphological analyzer with the UniDic dictionary, developed by NINJAL (National Institute for Japanese Language and Linguistics).
Output: Detailed Japanese POS tags
Best for: Japanese linguistics, detailed morphological analysis
Recommendations for Researchers
- Report tool versions in your papers (e.g., "spaCy v3.7", "Stanza v1.5.1")
- Validate on your domain - accuracy varies by text type
- Cite underlying tools - see each library's citation format on the References page
- For Persian syllables: manually verify a sample since no formal benchmark exists
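When manually verifying a sample, reporting an interval alongside the observed accuracy makes clear how much a small sample can tell you. The following is a minimal sketch using the Wilson score interval at roughly 95% confidence. The function name and the example numbers are illustrative, not results from lexprep.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an accuracy estimate (z=1.96 is about 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Hypothetical check: 90 of 100 manually verified items were correct.
    low, high = wilson_interval(90, 100)
    print(f"accuracy 0.90, interval {low:.3f} to {high:.3f}")
```

With 90 correct out of 100, the interval runs from roughly 0.83 to 0.94, which shows that a sample of 100 items cannot separate a tool that is 85% accurate from one that is 93% accurate.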