Normalized Token Counter

A simple normalized token counter written in Python.

Description

This tool normalizes text and outputs the frequency of each token. Each word counts as one token, and punctuation marks are removed during processing. There are six normalizers to choose from:

Lower casing
Upper casing
Stopword removal
Stemming
Lemmatization
Expand contractions
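
The basic counting step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in token_counter.py: the function name count_tokens and its lower/upper parameters are hypothetical, only ASCII punctuation is stripped, and stopword removal, stemming, lemmatization, and contraction expansion are left out.

```python
import string
from collections import Counter


def count_tokens(text, lower=False, upper=False):
    """Strip ASCII punctuation, optionally change case, and count tokens.

    Hypothetical helper for illustration; not taken from token_counter.py.
    """
    # Delete every ASCII punctuation character from the text
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    if lower:
        cleaned = cleaned.lower()
    elif upper:
        cleaned = cleaned.upper()
    # One whitespace-separated word is one token
    return Counter(cleaned.split())


if __name__ == "__main__":
    counts = count_tokens("The cat saw the other cat.", lower=True)
    for token, freq in counts.items():
        print(token, freq)
```

Without a casing normalizer, "The" and "the" would be counted as two different tokens, which is why casing is usually applied before counting.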

Setup

Prerequisites

Before running the program, make sure the contractions package is installed.
To install it:

pip install contractions

To run

Download the token_counter.py file and run the command below:

python token_counter.py <path to txt file> <arguments>

Available arguments:

--lower -> lower casing
--upper -> upper casing
--stop -> stopword removal
--stem -> stemming
--lemm -> lemmatization
--cont -> expand contractions
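
For readers curious how such flags might be wired up, here is a minimal sketch using the standard argparse module. It assumes each flag is a simple on/off switch and that the file path is a single positional argument; the function name build_parser and the destination name "path" are hypothetical, and the actual script may parse its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="token_counter.py",
        description="Normalize a text file and count its tokens.",
    )
    # Positional argument: the text file to process
    parser.add_argument("path", help="path to the input txt file")
    # Each normalizer is an independent boolean switch
    parser.add_argument("--lower", action="store_true", help="lower casing")
    parser.add_argument("--upper", action="store_true", help="upper casing")
    parser.add_argument("--stop", action="store_true", help="stopword removal")
    parser.add_argument("--stem", action="store_true", help="stemming")
    parser.add_argument("--lemm", action="store_true", help="lemmatization")
    parser.add_argument("--cont", action="store_true", help="expand contractions")
    return parser


if __name__ == "__main__":
    # Parse a sample command line instead of sys.argv for demonstration
    args = build_parser().parse_args(["test.txt", "--lower", "--stem", "--cont"])
    print(args.path, args.lower, args.stem, args.cont)
```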

Examples

Command

Normalizes text by lowercasing all the characters, stemming, and expanding contractions.

python token_counter.py "test.txt" --lower --stem --cont

Input

Text is from the book "Little Women"

"Christmas won't be Christmas without any presents," grumbled Jo, lying
on the rug.

Output

christma 2
will 1
not 1
be 1
without 1
ani 1
present 1
grumbl 1
jo 1
lie 1
on 1
the 1
rug 1
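
The "will" and "not" tokens in this output come from expanding "won't" before punctuation is removed. The sketch below illustrates why that ordering matters. It is an assumption-laden stand-in, not the tool's code: the CONTRACTIONS table is a tiny hypothetical substitute for the contractions package, the function name normalize is made up for this example, and stemming is omitted.

```python
import string
from collections import Counter

# Hypothetical stand-in for the contractions package: a tiny lookup table
CONTRACTIONS = {"won't": "will not", "we've": "we have"}


def normalize(text, expand_contractions):
    """Lowercase, optionally expand contractions, strip punctuation, count."""
    text = text.lower()
    if expand_contractions:
        # Expand while the apostrophe is still present in the text
        for short, full in CONTRACTIONS.items():
            text = text.replace(short, full)
    # Punctuation removal comes last; it would otherwise turn "won't" into "wont"
    text = text.translate(str.maketrans("", "", string.punctuation))
    return Counter(text.split())


if __name__ == "__main__":
    sample = '"Christmas won\'t be Christmas without any presents," grumbled Jo'
    print(normalize(sample, expand_contractions=True))
    print(normalize(sample, expand_contractions=False))
```

With expansion switched off, the apostrophe is simply dropped and the contraction survives as a single token such as "wont".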

Command

Normalizes text by lowercasing all the characters, removing stopwords, and lemmatizing.

python token_counter.py "test.txt" --lower --stop --lemm

Input

Text is from the book "Little Women"

"We've got father and mother and each other," said Beth contentedly,
from her corner.

Output

weve 1
got 1
father 1
mother 1
said 1
beth 1
contentedly 1
corner 1
