featExtract

The AQMAR Arabic Tagger was primarily developed to support the named entity 
detection experiments described in

  Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah A. Smith (2012),
  Recall-Oriented Learning of Named Entities in Arabic Wikipedia. Proceedings of EACL.

For the paper and resources used, see

  http://www.ark.cs.cmu.edu/AQMAR/

This directory contains scripts for extracting features on new data in the manner of 
the named entity (NE) system described in the paper. Feature extraction is implemented 
as a preprocessing step that must precede training or predicting with the (Java-based) 
tagger, and requires a local installation of MADA+TOKAN (a toolkit for Arabic 
morphological processing which can be obtained from http://www1.ccls.columbia.edu/MADA/).

Feature extraction for the Arabic NE tagger requires two files:

1. An NE-tagged file with one token per line, where the word and its tag 
(B, I, or O) are separated by a single space, in the following format:

word1 tag
word2 tag
word3 tag
...

Sentences should be separated with blank lines. See sample.bio for an example.

2. The .mada file generated by MADA (the experiments described in the 
paper used MADA version 3.1). See sample.mada for an example.
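
As an illustration of the first file's format, the following is a minimal Python 
sketch that groups "word tag" lines into sentences. It is not part of the 
distributed scripts; the function name read_bio is hypothetical, and the sketch 
assumes the tag is the last space-separated field on each line. It does not 
validate the tag values.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_bio(lines: Iterable[str]) -> List[Sentence]:
    """Group "word tag" lines into sentences; a blank line ends a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: close the sentence in progress, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last space so the tag is always the final field.
        parts = line.rsplit(" ", 1)
        if len(parts) != 2:
            raise ValueError(f"expected 'word tag', got {line!r}")
        current.append((parts[0], parts[1]))
    # Keep a final sentence that is not followed by a blank line.
    if current:
        sentences.append(current)
    return sentences
```

A file in this format could then be loaded with, for example, 
read_bio(open("sample.bio", encoding="utf-8")).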


To extract NE features, first ensure that the MADA_HOME environment variable 
points to the MADA installation directory. Then (supposing the two files 
described above are called sample.bio and sample.mada) run

 ./featExtract.sh sample

This will extract MADA's features by calling a MADA utility on the .mada 
file, then pass those and the .bio file described above to featExtraction.py 
to produce features of the form used in the named entity tagger, as described 
below. The resulting file sample.bio.nerFeats should be provided as input to 
the Java tagger.
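
Before running the script, it can be useful to confirm that the prerequisites 
described above are in place. The following Python sketch is one way to do that; 
it is not part of the distributed scripts. The function name 
missing_prerequisites is hypothetical, and the sketch assumes that the .bio and 
.mada files share the same prefix and sit in the same directory.

```python
import os
from pathlib import Path
from typing import List


def missing_prerequisites(prefix: str, workdir: str = ".") -> List[str]:
    """List problems that would likely prevent running ./featExtract.sh <prefix>."""
    problems: List[str] = []

    # MADA_HOME must point to the MADA installation directory.
    mada_home = os.environ.get("MADA_HOME")
    if not mada_home:
        problems.append("MADA_HOME is not set")
    elif not Path(mada_home).is_dir():
        problems.append(f"MADA_HOME is not a directory: {mada_home}")

    # Assumption: both input files are named <prefix>.bio and <prefix>.mada.
    for ext in (".bio", ".mada"):
        path = Path(workdir) / (prefix + ext)
        if not path.is_file():
            problems.append(f"missing input file: {path}")

    return problems
```

An empty list from missing_prerequisites("sample") suggests the inputs are in 
place; it does not check the MADA version or the contents of either file.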

-----------------------

Named Entity Featureset

The feature extraction pipeline extracts affix features, MADA features, and 
additional lexical features derived from Arabic Wikipedia NE and non-NE 
lexicons.  The .nerFeats file includes the following features:

0. Token (characters T1,T2,...,Tn-1,Tn)
1. T1
2. Tn
3. T1,T2
4. Tn-1,Tn
5. T1,T2,T3
6. Tn-2,Tn-1,Tn
7. T2,T3
8. Tn-2,Tn-1
9. T2,T3,T4
10. Tn-3,Tn-2,Tn-1
11. T3,T4
12. Tn-4,Tn-3
13. token length
14. is the word's gloss translation capitalized?
15. POS
16. case
17. aspect
18. number
19. person
20. gender
21. state
22. normalized spelling (romanized)
23. is there a MADA analysis
24. is the word Arabic UTF-8 (vs. Latin)
25. is base form same as the normalized form
26. Wikipedia: is this a hyperlink or regular text (noisy; turned off)
27. Wikipedia dict (noisy; turned off)
28. is the token UTF-8 or ASCII (noisy; turned off)
29. does the term have any vowels (noisy; turned off)
30. is the token in Wiki NE dict
31. is curr+next token in Wiki NE dict
32. is curr+prev token in Wiki NE dict
33. is the token in Wiki non-NE dict
34. is curr+next token in Wiki non-NE dict
35. is curr+prev token in Wiki non-NE dict
36. class
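
Features 0-13 can be illustrated with a short Python sketch, reading T1...Tn as 
the characters of the token. This is not the code in featExtraction.py: the 
names _span and affix_features are hypothetical, the use of None for a token 
too short to supply an n-gram is an assumption, and the sketch counts Unicode 
code points, which may differ from how the actual script treats Arabic 
diacritics or romanized forms. Feature 12 follows the list above as written.

```python
from typing import List, Optional, Union


def _span(chars: str, start: int, end: int) -> Optional[str]:
    """Return characters T_start..T_end (1-indexed, inclusive), or None if out of range."""
    if start < 1 or end > len(chars):
        return None
    return chars[start - 1:end]


def affix_features(token: str) -> List[Union[str, int, None]]:
    """Build features 0-13: the token, twelve character n-grams, and the length."""
    n = len(token)
    # (start, end) positions for features 1-12, in the order listed above.
    spans = [
        (1, 1), (n, n),              # 1. T1            2. Tn
        (1, 2), (n - 1, n),          # 3. T1,T2         4. Tn-1,Tn
        (1, 3), (n - 2, n),          # 5. T1,T2,T3      6. Tn-2,Tn-1,Tn
        (2, 3), (n - 2, n - 1),      # 7. T2,T3         8. Tn-2,Tn-1
        (2, 4), (n - 3, n - 1),      # 9. T2,T3,T4     10. Tn-3,Tn-2,Tn-1
        (3, 4), (n - 4, n - 3),      # 11. T3,T4       12. Tn-4,Tn-3
    ]
    feats: List[Union[str, int, None]] = [token]          # feature 0
    feats.extend(_span(token, s, e) for s, e in spans)    # features 1-12
    feats.append(n)                                       # feature 13
    return feats
```

For a seven-character token, every n-gram is defined; for a two-character 
token, only features 1-4 are, and the remaining n-grams come back as None.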

A few of the extracted features (26-29 above, marked "noisy; turned off") are 
highly noisy and were found not to help NE tagging performance. When training 
and predicting with the tagger, it is therefore recommended to omit these 
features via the --excludeFeatures option. Alternatively, this may be 
accomplished by specifying --properties with the included sample.properties file.