
SssiiiSssiii/ArabicTextCleaner

ArabicTextCleaner

What It Does

  • Tokenization: Breaking down Arabic text into individual words or tokens.

  • Normalization: Standardizing text by converting characters to their base forms, stripping diacritics and tatweel (elongation marks), and removing non-Arabic characters.

    from to
    أ-إ-آ ا
    ى ي
    ة ه
    الْعَرَبِيَّة العربيه
    العـــربية العربيه
  • Stop Word Removal: Eliminating common, less informative words such as articles and conjunctions. (You can review the text file containing the stop words in src and modify it as needed.)


  • Stemming (Arabic: التجذيع): Reducing words to their root forms to improve text analysis and information retrieval.
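
The normalization table and the tokenization step above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's actual code: the function names `normalize` and `tokenize`, the exact diacritic range, and the choice to treat anything outside U+0621..U+064A as non-Arabic are all assumptions made for the example.

```python
import re

# Assumption: diacritics are the harakat U+064B..U+0652 plus the superscript
# alef U+0670. The repository may strip a different set.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # the elongation character, as in "العـــربية"

# Assumption: everything outside the basic Arabic letter range (and
# whitespace) counts as "non-Arabic" and is replaced by a space.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")

# Character mappings taken from the table above. Keys are written as escapes
# so each one is guaranteed to be a single code point.
CHAR_MAP = str.maketrans({
    "\u0623": "\u0627",  # أ -> ا
    "\u0625": "\u0627",  # إ -> ا
    "\u0622": "\u0627",  # آ -> ا
    "\u0649": "\u064A",  # ى -> ي
    "\u0629": "\u0647",  # ة -> ه
})


def normalize(text: str) -> str:
    """Strip diacritics and tatweel, map letters to base forms, drop non-Arabic."""
    text = DIACRITICS.sub("", text)       # must run before the non-Arabic filter
    text = text.replace(TATWEEL, "")
    text = text.translate(CHAR_MAP)
    text = NON_ARABIC.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list:
    """Normalize the text, then split it into tokens on whitespace."""
    return normalize(text).split()
```

Diacritics are removed before the non-Arabic filter runs; otherwise each mark would be replaced by a space and split the word it belongs to.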

These preprocessing steps are essential for enhancing the quality and usability of Arabic text data in various NLP and machine learning tasks.
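
As a rough illustration of the stop word step, the sketch below loads a stop-word list from a text file and filters a list of tokens. The helper names `load_stop_words` and `remove_stop_words`, the one-word-per-line file layout, and the sample words are hypothetical; the repository's own file and format may differ.

```python
from pathlib import Path

# Hypothetical sample, written in already-normalized form (for example
# "علي" rather than "على"). The repository ships its own list in a text file.
STOP_WORDS_SAMPLE = {"في", "من", "عن", "علي", "الي"}


def load_stop_words(path):
    """Read stop words from a UTF-8 text file, assuming one word per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def remove_stop_words(tokens, stop_words=STOP_WORDS_SAMPLE):
    """Return the tokens that are not in the stop-word set, keeping order."""
    return [token for token in tokens if token not in stop_words]
```

If tokens are normalized before filtering, the stop-word list has to be normalized the same way. Under the ى to ي mapping in the table above, a list entry spelled "على" would never match the normalized token "علي".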

Test

Input

[Image: sample input text]

Output

[Image: cleaned output text]

Note

Feel free to use and contribute to this repository to advance our Arabic text processing projects.