A simple normalizing token counter written in Python.
This tool normalizes text and outputs the frequency of each token. Each word is one token, and punctuation marks are removed during processing. There are six normalizers to choose from:
- Lower casing
- Upper casing
- Stopword removal
- Stemming
- Lemmatization
- Expand contractions
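The core tokenize-and-count step can be sketched in plain Python. This is a minimal illustration, not the actual token_counter.py code; it assumes punctuation (including apostrophes) is simply stripped, which is consistent with tokens like "weve" in the sample output further below:

```python
import re
from collections import Counter

def count_tokens(text, lower=True):
    """Normalize and count tokens: one word per token, punctuation dropped."""
    if lower:
        text = text.lower()
    # Drop apostrophes first so "we've" becomes the single token "weve".
    text = text.replace("'", "")
    pattern = r"[a-z0-9]+" if lower else r"[A-Za-z0-9]+"
    return Counter(re.findall(pattern, text))

sample = "\"Christmas won't be Christmas without any presents,\" grumbled Jo"
for token, freq in count_tokens(sample).items():
    print(token, freq)
```

Without stemming or contraction expansion this prints raw lowercased tokens, e.g. "christmas 2" and "wont 1"; the extra flags transform the tokens further before counting.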
Before running the program, make sure the contractions package is installed.
To install:
pip install contractions
Download the token_counter.py file and run the command below:
python token_counter.py <path to txt file> <arguments>
Available arguments:
- --lower -> lower casing
- --upper -> upper casing
- --stop -> stopword removal
- --stem -> stemming
- --lemm -> lemmatization
- --cont -> expand contractions
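The flag list above maps naturally onto the standard library's argparse. This is a sketch of how such a command line could be parsed, not necessarily how token_counter.py does it:

```python
import argparse

parser = argparse.ArgumentParser(description="Normalized token counter")
parser.add_argument("path", help="path to the input .txt file")
parser.add_argument("--lower", action="store_true", help="lower casing")
parser.add_argument("--upper", action="store_true", help="upper casing")
parser.add_argument("--stop", action="store_true", help="stopword removal")
parser.add_argument("--stem", action="store_true", help="stemming")
parser.add_argument("--lemm", action="store_true", help="lemmatization")
parser.add_argument("--cont", action="store_true", help="expand contractions")

# Simulate: python token_counter.py test.txt --lower --stem --cont
args = parser.parse_args(["test.txt", "--lower", "--stem", "--cont"])
print(args.path, args.lower, args.stem, args.cont)
```

Each flag is an independent boolean, so any combination of normalizers can be applied in one run, as the examples below demonstrate.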
Normalizes the text by lowercasing all characters, stemming, and expanding contractions.
python token_counter.py "test.txt" --lower --stem --cont
Text is from the book "Little Women"
"Christmas won't be Christmas without any presents," grumbled Jo, lying
on the rug.
christma 2
will 1
not 1
be 1
without 1
ani 1
present 1
grumbl 1
jo 1
lie 1
on 1
the 1
rug 1
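The output above shows why contraction expansion must run before punctuation removal: "won't" becomes the two tokens "will" and "not" rather than the single token "wont". The tool uses the contractions package for this; a simplified stand-in with a hand-written lookup table illustrates the idea:

```python
# Simplified stand-in for the contractions package: a tiny lookup table.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "can not",
    "we've": "we have",
}

def expand_contractions(text):
    """Replace known contractions with their expanded forms."""
    for short, long in CONTRACTIONS.items():
        text = text.replace(short, long)
    return text

print(expand_contractions("christmas won't be christmas"))
```

Running expansion first means the later punctuation-stripping step never sees the apostrophe, so "will" and "not" are counted as separate tokens.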
Normalizes the text by lowercasing all characters, removing stopwords, and lemmatizing.
python token_counter.py "test.txt" --lower --stop --lemm
Text is from the book "Little Women"
"We've got father and mother and each other," said Beth contentedly,
from her corner.
weve 1
got 1
father 1
mother 1
said 1
beth 1
contentedly 1
corner 1
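Stopword removal explains why "and", "each", "other", "from", and "her" are missing from the counts above. A sketch with a small hand-picked stopword set (real tools commonly use a full list such as NLTK's English stopwords, which is an assumption here, not something token_counter.py is documented to do):

```python
# Illustrative stopword removal with a tiny hand-picked set.
STOPWORDS = {"and", "each", "other", "from", "her", "on", "the"}

tokens = ["weve", "got", "father", "and", "mother", "and", "each",
          "other", "said", "beth", "contentedly", "from", "her", "corner"]
kept = [t for t in tokens if t not in STOPWORDS]
print(kept)
```

Note that "weve" survives: the apostrophe-stripped form of "we've" is not a stopword. Adding --cont so the contraction is expanded first would likely cause "we" to be filtered out instead.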