Overview

This page provides accuracy information for each tool in lexprep. Where official benchmarks exist, they are cited with references. Where no formal evaluation exists, estimates are clearly marked.

| Tool | Library | Accuracy | Source |
|---|---|---|---|
| English G2P | g2p-en | High | Park & Kim (2019) |
| English Syllables | pyphen | No benchmark | Pyphen |
| English POS | spaCy | 96-97% | Explosion |
| Persian G2P | PersianG2p | No benchmark | ~48K dictionary |
| Persian Syllables | Heuristic | ~85-95% | Estimate |
| Persian POS | Stanza | 97.4% | Qi et al. (2020) |
| Japanese POS (Stanza) | Stanza | 96-97% | Qi et al. (2020) |
| Japanese POS (UniDic) | Fugashi+UniDic | High | NINJAL Standard |

English Tools

G2P (Grapheme-to-Phoneme)

g2p-en
High accuracy

How it works:

  1. Spells out numbers and currency symbols
  2. Disambiguates homographs using POS tags
  3. Looks up each word in the CMU Pronouncing Dictionary (~134,500 words)
  4. Predicts OOV words using neural seq2seq model

Dictionary words: 100% agreement with the CMU Pronouncing Dictionary (pronunciations are read directly from it)
OOV words: Neural prediction (no published benchmark)
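
Because only dictionary hits carry the lookup guarantee, it can help to know what share of a word list falls through to the predictor. The following is a minimal sketch of the lookup-then-predict flow described above, not g2p-en's own code: `TOY_DICT`, `placeholder_predictor`, `transcribe`, and `oov_rate` are hypothetical names, the dictionary has only two entries, and the predictor is a trivial stand-in for the neural model.

```python
from typing import Callable, Dict, List, Tuple

# Toy stand-in for the CMU Pronouncing Dictionary (hypothetical, two entries only).
TOY_DICT: Dict[str, List[str]] = {
    "cat": ["K", "AE1", "T"],
    "dog": ["D", "AO1", "G"],
}


def placeholder_predictor(word: str) -> List[str]:
    """Stand-in for a neural OOV model: spells the word out letter by letter."""
    return [ch.upper() for ch in word]


def transcribe(
    word: str,
    dictionary: Dict[str, List[str]] = TOY_DICT,
    predict: Callable[[str], List[str]] = placeholder_predictor,
) -> Tuple[List[str], str]:
    """Return (phonemes, source), where source is 'dictionary' or 'predicted'."""
    key = word.lower()
    if key in dictionary:
        return dictionary[key], "dictionary"
    return predict(key), "predicted"


def oov_rate(words: List[str]) -> float:
    """Share of words that fell through to the predictor."""
    if not words:
        return 0.0
    predicted = sum(1 for w in words if transcribe(w)[1] == "predicted")
    return predicted / len(words)
```

Reporting the OOV rate alongside results makes clear how much of the output rests on unbenchmarked neural predictions.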

Syllable Count

pyphen
No benchmark

Uses TeX-compatible hyphenation dictionaries from LibreOffice (Hunspell format). Words are hyphenated using language-specific patterns, and the syllable count is inferred from the number of hyphenation segments.
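
A minimal sketch of counting syllables from hyphenation segments. The function names are hypothetical and this is not necessarily how lexprep implements it; the only pyphen calls used are `pyphen.Pyphen(lang=...)` and its `inserted()` method, which returns the word with hyphens placed at the break points.

```python
def syllables_from_hyphenation(hyphenated: str, sep: str = "-") -> int:
    """Count non-empty segments in a hyphenated word, e.g. 'hy-phen-ation' -> 3."""
    return len([segment for segment in hyphenated.split(sep) if segment])


def count_syllables(word: str, lang: str = "en_US") -> int:
    """Hyphenate with pyphen (third-party) and count the resulting segments."""
    import pyphen  # imported lazily so the helper above works without it

    hyphenator = pyphen.Pyphen(lang=lang)
    return syllables_from_hyphenation(hyphenator.inserted(word))
```

A word that the patterns do not split counts as one syllable, and a word that already contains a hyphen gains a segment at that hyphen, so both cases are worth checking by hand.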

POS Tagging

spaCy
96-97% accuracy

Model: en_core_web_sm (trained on OntoNotes 5)

| Metric | Score |
|---|---|
| Token accuracy | 97% |
| POS accuracy | 97% |
| Sentence segmentation F1 | 91% |

Persian Tools

G2P (Grapheme-to-Phoneme)

PersianG2p
No formal benchmark

How it works:

  1. Normalizes text
  2. Looks up each word in the dictionary (~48,000 words with use_large=True)
  3. Predicts unknown words using neural network

Dictionary size: ~48,000 Persian words
OOV handling: Neural model prediction

Syllable Count

Heuristic
~85-95% estimated

Methods available:

  • Orthographic: Counts vowel patterns in Persian spelling (~65-80%)
  • Phonetic: Counts vowels in G2P output (~90-95%, depends on G2P accuracy)

No formal benchmark exists. These are estimates. The heuristic may fail on loanwords, compound words, and words without diacritics. For research, manually verify a sample.
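
The phonetic method can be sketched as a vowel count over the G2P output, relying on the standard analysis that each Persian syllable has exactly one vowel nucleus. This is an illustration under stated assumptions, not lexprep's implementation: the romanization (one character per vowel, with `A` for the long back vowel) and the names `PERSIAN_VOWELS` and `count_syllables_phonetic` are hypothetical and should be adapted to the actual G2P output format.

```python
# Assumed romanization (hypothetical): one character per vowel,
# with 'A' standing for the long back vowel. Adjust to your G2P output.
PERSIAN_VOWELS = frozenset("aeoAiu")


def count_syllables_phonetic(phonemes: str, vowels: frozenset = PERSIAN_VOWELS) -> int:
    """Count vowel symbols in a phoneme string; one vowel is taken as one syllable."""
    return sum(1 for symbol in phonemes if symbol in vowels)
```

Under this assumed romanization, `count_syllables_phonetic("ketAb")` counts the two vowels `e` and `A`. Any error in the G2P transcription carries straight through to the count, which is why the estimate depends on G2P accuracy.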

POS Tagging

Stanza
97.4% accuracy

Official benchmarks (Stanza v1.5.1 on UD v2.12):

| Dataset | Tokens | UPOS | Lemma |
|---|---|---|---|
| Persian Seraji | 100% | 97.69% | 98.18% |

Output: Universal POS tags (UPOS) + Lemma
Treebank: UD Persian Seraji

Japanese Tools

POS Tagging (Stanza)

Stanza
96-97% accuracy

Official benchmarks (Stanza v1.5.1 on UD v2.12):

| Dataset | Tokens | UPOS | Lemma |
|---|---|---|---|
| Japanese GSD | 97.37% | 96.38% | 96.02% |
| Japanese GSDLUW | 96.32% | 95.10% | 94.67% |

Output: Universal POS tags (17 categories)
Best for: Cross-lingual research, UD compatibility
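
Since UPOS is a closed inventory of 17 tags, a quick sanity check on tagged output can catch language-specific tags (for example, UniDic-style labels) mixed into a UPOS column before cross-lingual comparison. A minimal sketch; `summarize_upos` is a hypothetical helper and not part of Stanza or lexprep.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# The 17 Universal POS tags defined by Universal Dependencies.
UPOS_TAGS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def summarize_upos(tags: Iterable[str]) -> Tuple[Counter, List[str]]:
    """Return (tag frequencies, sorted list of tags outside the UPOS inventory)."""
    counts = Counter(tags)
    unknown = sorted(tag for tag in counts if tag not in UPOS_TAGS)
    return counts, unknown
```

An empty second element means every tag belongs to the UPOS inventory.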

POS Tagging (UniDic)

Fugashi + UniDic
Widely used

Fugashi is a Python wrapper around the MeCab morphological analyzer, used here with the UniDic dictionary developed by NINJAL (National Institute for Japanese Language and Linguistics).

Output: Detailed Japanese POS tags
Best for: Japanese linguistics, detailed morphological analysis

Recommendations for Researchers

  1. Report tool versions in your papers (e.g., "spaCy v3.7", "Stanza v1.5.1")
  2. Validate on your own domain, since accuracy varies by text type and the figures above come from standard benchmark test sets
  3. Cite underlying tools - see each library's citation format on the References page
  4. For Persian syllables: manually verify a sample since no formal benchmark exists
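
Validating on a hand-checked sample, as recommended above, can be sketched as follows. The helper names `draw_sample` and `accuracy` are hypothetical; the idea is to draw a reproducible random sample, annotate it manually, and compare the tool output against those gold labels.

```python
import random
from typing import List, Sequence, Tuple


def draw_sample(items: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Draw up to k items without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(list(items), min(k, len(items)))


def accuracy(pairs: Sequence[Tuple[object, object]]) -> float:
    """Fraction of (predicted, gold) pairs that match exactly."""
    if not pairs:
        raise ValueError("need at least one (predicted, gold) pair")
    correct = sum(1 for predicted, gold in pairs if predicted == gold)
    return correct / len(pairs)
```

For example, syllable counts of 2, 3, 1, and 4 checked against gold counts of 2, 2, 1, and 4 give `accuracy([(2, 2), (3, 2), (1, 1), (4, 4)])`, which is 0.75. Reporting the sample size and seed alongside the score lets others reproduce the check.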