What is lexprep?
lexprep is an open-source toolkit for processing wordlists in linguistic research. It provides core NLP features designed for researchers who work with isolated words.
Unlike traditional NLP libraries, which focus on sentence and document analysis, lexprep treats each word independently. This makes it well suited to experimental stimulus preparation, lexical database annotation, and controlled wordlist studies.
✨ Key Features
- G2P Transcription — Convert words to phonetic representations (IPA, ARPAbet)
- Syllable Counting — Count syllables using orthographic or phonetic methods
- POS Tagging — Assign part-of-speech tags to isolated words
- Stratified Sampling — Sample wordlists by frequency or other scores
- Multi-format Support — Read and write Excel, CSV, TSV, and plain text files
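
To make the idea of stratified sampling concrete, here is a minimal standard-library sketch of sampling a wordlist by frequency band. It is not lexprep's API: the function name `stratified_sample`, its parameters, and the toy frequency values are all hypothetical and exist only for illustration.

```python
import random
from typing import Dict, List, Tuple


def stratified_sample(
    scored_words: List[Tuple[str, float]],
    n_strata: int,
    per_stratum: int,
    seed: int = 0,
) -> Dict[int, List[str]]:
    """Split words into equal-sized score bands and draw from each band.

    Hypothetical helper for illustration; not part of lexprep.
    """
    if n_strata < 1 or per_stratum < 0:
        raise ValueError("n_strata must be >= 1 and per_stratum >= 0")

    rng = random.Random(seed)  # seeded so the draw is reproducible
    ranked = sorted(scored_words, key=lambda pair: pair[1])  # low to high score
    total = len(ranked)

    sample: Dict[int, List[str]] = {}
    for i in range(n_strata):
        # Integer arithmetic keeps the bands contiguous and near-equal in size.
        start = i * total // n_strata
        end = (i + 1) * total // n_strata
        band = [word for word, _ in ranked[start:end]]
        # Never ask for more words than the band contains.
        sample[i] = rng.sample(band, min(per_stratum, len(band)))
    return sample


# Toy wordlist with made-up frequency scores (assumption, not real corpus data).
wordlist = [
    ("the", 50000.0), ("of", 30000.0), ("house", 500.0),
    ("river", 300.0), ("garden", 200.0), ("window", 150.0),
    ("quartz", 5.0), ("zephyr", 2.0), ("fjord", 1.5),
]

picked = stratified_sample(wordlist, n_strata=3, per_stratum=2, seed=42)
for band, words in picked.items():
    print(band, words)
```

Band 0 holds the lowest-scoring words and the last band the highest, so drawing the same number from each keeps rare and common words equally represented in the final stimulus set.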
🌍 Supported Languages
- Persian (فارسی), code fa: G2P transcription, syllable counting (orthographic & phonetic), and POS tagging with Stanza
- English, code en: ARPAbet G2P, syllable counting with pyphen, and POS tagging with spaCy
- Japanese (日本語), code ja: POS tagging with UniDic (detailed tags) or Stanza (universal tags)
👥 Who is this for?
- Psycholinguists designing experimental stimuli and wordlists
- Linguists annotating lexical databases with phonetic and grammatical features
- Researchers creating controlled wordlists for reading research
- Language scientists doing cross-linguistic comparisons
- Students learning linguistic analysis techniques
💡 Why lexprep?
Most NLP tools (spaCy, Stanza, NLTK) are designed for processing running text: sentences, paragraphs, and whole documents. They work well for tasks such as named entity recognition, dependency parsing, and document classification.
But psycholinguistic research often involves wordlists: spreadsheets of isolated words that need phonetic transcription, syllable counts, or POS tags. Standard NLP tools can struggle here because they expect sentence context, which an isolated word does not provide.
lexprep bridges this gap by providing wordlist processing with a unified interface across languages. Use the visual web interface for quick processing or the command-line tools for batch automation. Either way, you can prepare research materials without writing custom scripts for each task.
📖 Open Source
lexprep is released under the MIT License, making it free to use, modify, and distribute for both academic and commercial purposes. The project welcomes contributions from the community.
The source code is available on GitHub.