What Does It Do?

- Tokenization: Breaking down Arabic text into individual words or tokens.
- Normalization: Standardizing text by converting characters to their base forms and removing non-Arabic characters. For example, the alef variants أ, إ, and آ are unified to ا, alef maqsura (ى) and ya (ي) are unified, ta marbuta (ة) and ha (ه) are unified, and diacritics and tatweel are stripped, so that الْعَرَبِيَّة and العـــربية both become العربية.
- Stop Word Removal: Eliminating common, less informative words such as articles and conjunctions. (You can also review the text file containing the stop words in the `src` directory and modify it as needed.)
- Stemming (التجذيع): Reducing words to their root forms to enhance text analysis and information retrieval.
These preprocessing steps are essential for enhancing the quality and usability of Arabic text data in various NLP
and machine learning tasks.
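The pipeline above can be sketched in a few lines of Python. This is an illustrative example only, not the repository's actual code: the function names, the tiny stop-word list, and the deliberately naive light stemmer are all stand-ins for the real implementation.

```python
import re

# Diacritics (fathatan .. sukun) and the tatweel (elongation) character.
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")
TATWEEL = "\u0640"

# Tiny illustrative stop-word set; the repository ships a fuller list in src.
SAMPLE_STOP_WORDS = {"في", "من", "على", "و"}

def normalize(text: str) -> str:
    text = ARABIC_DIACRITICS.sub("", text)   # strip diacritics
    text = text.replace(TATWEEL, "")         # strip tatweel (العـــربية -> العربية)
    text = re.sub("[أإآ]", "ا", text)        # unify alef variants
    text = text.replace("ى", "ي")            # unify alef maqsura with ya
    text = text.replace("ة", "ه")            # unify ta marbuta with ha
    return text

def tokenize(text: str) -> list[str]:
    # Keep only runs of Arabic letters; punctuation and non-Arabic
    # characters are dropped in the process.
    return re.findall(r"[\u0621-\u064A]+", text)

def remove_stop_words(tokens: list[str], stop_words=SAMPLE_STOP_WORDS) -> list[str]:
    return [t for t in tokens if t not in stop_words]

def naive_stem(token: str) -> str:
    # Toy light stemmer: strip at most one common prefix and one common
    # suffix, keeping at least three letters. A real stemmer is far richer.
    for prefix in ("ال", "و", "ب", "ك", "ف"):
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in ("ها", "ات", "ون", "ين", "ه"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[:-len(suffix)]
            break
    return token

def preprocess(text: str) -> list[str]:
    tokens = tokenize(normalize(text))
    return [naive_stem(t) for t in remove_stop_words(tokens)]

print(preprocess("الْعَرَبِيَّة في العـــالم"))  # -> ['عربي', 'عالم']
```

Each stage is kept as a separate function so that, for instance, the stop-word list can be swapped for the one in `src` without touching the rest of the pipeline.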
Feel free to use this repository and contribute to it to advance Arabic text processing projects.