featExtract
The AQMAR Arabic Tagger was primarily developed to support the named
entity detection experiments described in

    Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer,
    and Noah A. Smith (2012), Recall-Oriented Learning of Named
    Entities in Arabic Wikipedia. Proceedings of EACL.

For the paper and resources used, see http://www.ark.cs.cmu.edu/AQMAR/

This directory contains scripts for extracting features on new data in
the manner of the named entity (NE) system described in the paper.
Feature extraction is implemented as a preprocessing step that must
precede training or predicting with the (Java-based) tagger, and
requires a local installation of MADA+TOKAN (a toolkit for Arabic
morphological processing which can be obtained from
http://www1.ccls.columbia.edu/MADA/).

Feature extraction for the Arabic NE tagger requires two files:

1. An NE-tagged file with one token per line, where the word and its
   tag (B, I, or O) are separated by a space, i.e.:

       word1 tag
       word2 tag
       word3 tag
       ...

   Sentences should be separated with blank lines. See sample.bio for
   an example.

2. The .mada file generated by MADA (the experiments described in the
   paper used MADA version 3.1). See sample.mada for an example.

To extract NE features, first ensure that the MADA_HOME environment
variable points to the MADA installation directory. Then (supposing the
second file described above is called sample.mada) run

    ./featExtract.sh sample

This will extract MADA's features by calling a MADA utility on the
.mada file, then pass those and the .bio file described above to
featExtraction.py to produce features of the form used in the named
entity tagger, as described below. The resulting file
sample.bio.nerFeats should be provided as input to the Java tagger.

-----------------------
Named Entity Featureset

The feature extraction pipeline extracts affix, MADA and additional
lexical features from Arabic Wikipedia NE and non-NE lexicons. The
.nerFeats file includes the following features:

 0. Token (T1,T2,...Tn-1, Tn)
 1. T1
 2. Tn
 3. T1,T2
 4. Tn-1,Tn
 5. T1,T2,T3
 6. Tn-2,Tn-1,Tn
 7. T2,T3
 8. Tn-2,Tn-1
 9. T2,T3,T4
10. Tn-3,Tn-2,Tn-1
11. T3,T4
12. Tn-4,Tn-3
13. token length
14. is the word's gloss translation capitalized?
15. POS
16. case
17. aspect
18. number
19. person
20. gender
21. state
22. normalized spelling (romanized)
23. is there a mada analysis
24. is the word Arabic UTF-8 (vs. Latin)
25. is base form same as the normalized form
26. Wikipedia: is this a hyperlink or regular text (noisy; turned off)
27. Wikipedia dict: noisy (noisy; turned off)
28. is the token utf-8 or ascii (noisy; turned off)
29. does the term have any vowels (noisy; turned off)
30. is the token in Wiki NE dict
31. is curr+next token in Wiki NE dict
32. is curr+prev token in Wiki NE dict
33. is the token in Wiki non-NE dict
34. is curr+next token in Wiki non-NE dict
35. is curr+prev token in Wiki non-NE dict
36. class

A few of the extracted features are highly noisy, and were found not to
help NE tagging performance. When training and predicting with the
tagger it is therefore recommended to omit these features via the
--excludeFeatures option. Alternatively, this may be accomplished by
specifying --properties with the included sample.properties file.
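
Features 1 through 12 in the list above read as character n-grams taken
from the start and end of the token. The following is a minimal sketch
of that idea, not the code in featExtraction.py. It assumes that
T1...Tn are the token's characters and that a feature whose positions
fall outside a short token is emitted as an empty string; the real
script may handle short tokens differently. The names SPECS and
affix_features are hypothetical.

```python
# Each spec is (side, first, last).
# side == "start": characters T<first>..T<last>, counted from 1.
# side == "end":   characters Tn-<first>..Tn-<last>.
SPECS = [
    ("start", 1, 1),  # 1.  T1
    ("end", 0, 0),    # 2.  Tn
    ("start", 1, 2),  # 3.  T1,T2
    ("end", 1, 0),    # 4.  Tn-1,Tn
    ("start", 1, 3),  # 5.  T1,T2,T3
    ("end", 2, 0),    # 6.  Tn-2,Tn-1,Tn
    ("start", 2, 3),  # 7.  T2,T3
    ("end", 2, 1),    # 8.  Tn-2,Tn-1
    ("start", 2, 4),  # 9.  T2,T3,T4
    ("end", 3, 1),    # 10. Tn-3,Tn-2,Tn-1
    ("start", 3, 4),  # 11. T3,T4
    ("end", 4, 3),    # 12. Tn-4,Tn-3
]


def affix_features(token):
    """Return the 12 affix strings for `token`, in the order listed above.

    Assumption: a span that does not fit inside the token yields "".
    """
    n = len(token)
    feats = []
    for side, first, last in SPECS:
        if side == "start":
            lo, hi = first - 1, last          # 0-based half-open slice
        else:
            lo, hi = n - 1 - first, n - last  # Tn-k sits at index n-1-k
        feats.append(token[lo:hi] if lo >= 0 and hi <= n else "")
    return feats


if __name__ == "__main__":
    for number, value in enumerate(affix_features("abcdef"), start=1):
        print(number, value)
```

Python strings are indexed by Unicode code point, so the same function
applies to Arabic-script tokens without any byte-level handling.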
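
Because featExtract.sh expects the NE-tagged input in the two-column
format described earlier (word, space, tag, with blank lines between
sentences), it can help to check a file before running the pipeline.
The following is a rough sketch of such a check and is not part of the
distributed scripts. It assumes the only valid tags are exactly B, I
and O, as stated above; read_bio and VALID_TAGS are hypothetical names.

```python
VALID_TAGS = {"B", "I", "O"}  # assumption: tags are exactly B, I, or O


def read_bio(lines):
    """Parse 'word tag' lines into sentences of (word, tag) pairs.

    Blank lines separate sentences. Raises ValueError on a line that
    does not have exactly two fields or carries an unknown tag.
    """
    sentences, current = [], []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        if len(fields) != 2 or fields[1] not in VALID_TAGS:
            raise ValueError("line %d is not 'word tag': %r" % (lineno, raw))
        current.append((fields[0], fields[1]))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    demo = ["word1 B", "word2 I", "word3 O", "", "word4 O"]
    print(len(read_bio(demo)), "sentences")
```

To check a real file, pass an open file object, for example
read_bio(open("sample.bio", encoding="utf-8")).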