arabic-tagger/featExtract at master · nschneid/arabic-tagger

Files in this directory:

lexicons
README.txt
bio.labels
featExtract.sh
featExtraction.py
sample.bio
sample.bio.nerFeats
sample.labels
sample.mada
sample.madaFeats
sample.nerFeats

README.txt

The AQMAR Arabic Tagger was primarily developed to support the named entity
detection experiments described in

Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah A. Smith (2012),
Recall-Oriented Learning of Named Entities in Arabic Wikipedia. Proceedings of EACL.

For the paper and resources used, see

http://www.ark.cs.cmu.edu/AQMAR/

This directory contains scripts for extracting features on new data in the manner of
the named entity (NE) system described in the paper. Feature extraction is implemented
as a preprocessing step that must precede training or predicting with the (Java-based)
tagger, and requires a local installation of MADA+TOKAN (a toolkit for Arabic
morphological processing which can be obtained from http://www1.ccls.columbia.edu/MADA/).

Feature extraction for the Arabic NE tagger requires two files:

1. An NE-tagged file with one token per line, where the word and its tag
(B, I, or O) are separated by a single space, as follows:

word1 tag
word2 tag
word3 tag
...

Sentences should be separated with blank lines. See sample.bio for an example.

2. The .mada file generated by MADA (the experiments described in the
paper used MADA version 3.1). See sample.mada for an example.
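
As an illustration of the first input format, the following is a minimal
Python sketch of a reader for such a file. It is not part of the distributed
scripts; the function name read_bio is hypothetical, and the sketch assumes
the layout described above (one "word tag" pair per line, blank lines between
sentences).

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_bio(lines: Iterable[str]) -> List[Sentence]:
    """Group 'word tag' lines into sentences split on blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last space so the tag is always the final field.
        # A line without a space raises ValueError, flagging malformed input.
        word, tag = line.rsplit(" ", 1)
        current.append((word, tag))
    if current:
        # The file may not end with a blank line.
        sentences.append(current)
    return sentences
```

For example, read_bio(open("sample.bio", encoding="utf-8")) would return a
list of sentences, each a list of (word, tag) pairs. The UTF-8 encoding is
an assumption here.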

To extract NE features, first ensure that the MADA_HOME environment variable
points to the MADA installation directory. Then (supposing the two files
described above are called sample.bio and sample.mada) run

./featExtract.sh sample

This will extract MADA's features by calling a MADA utility on the .mada
file, then pass those and the .bio file described above to featExtraction.py
to produce features of the form used in the named entity tagger, as described
below. The resulting file sample.bio.nerFeats should be provided as input to
the Java tagger.
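
Before running the script it can help to verify the prerequisites listed
above. The following Python sketch is one way to do so; it is not part of
the distributed scripts. The helper names preflight and run_feat_extract are
hypothetical, and the sketch assumes that the script reads <prefix>.bio and
<prefix>.mada from the current directory.

```python
import os
import subprocess
from typing import List, Mapping


def preflight(prefix: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of problems that would prevent feature extraction."""
    problems: List[str] = []
    mada_home = env.get("MADA_HOME")
    if not mada_home:
        problems.append("MADA_HOME is not set")
    elif not os.path.isdir(mada_home):
        problems.append(f"MADA_HOME is not a directory: {mada_home}")
    # Assumption: both input files share the prefix given to the script.
    for suffix in (".bio", ".mada"):
        path = prefix + suffix
        if not os.path.isfile(path):
            problems.append(f"missing input file: {path}")
    return problems


def run_feat_extract(prefix: str) -> None:
    """Run ./featExtract.sh on the given prefix after checking the inputs."""
    problems = preflight(prefix)
    if problems:
        raise RuntimeError("; ".join(problems))
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(["./featExtract.sh", prefix], check=True)
```

Calling run_feat_extract("sample") is then equivalent to the command above,
with an early error message if MADA_HOME or an input file is missing.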

-----------------------

Named Entity Featureset

The feature extraction pipeline extracts affix features, MADA features, and
additional lexical features derived from Arabic Wikipedia NE and non-NE
lexicons. The .nerFeats file includes the following features:

0. Token (T1,T2,...Tn-1, Tn)
1. T1
2. Tn
3. T1,T2
4. Tn-1,Tn
5. T1,T2,T3
6. Tn-2,Tn-1,Tn
7. T2,T3
8. Tn-2,Tn-1
9. T2,T3,T4
10. Tn-3,Tn-2,Tn-1
11. T3,T4
12. Tn-4,Tn-3
13. token length
14. is the word's gloss translation capitalized?
15. POS
16. case
17. aspect
18. number
19. person
20. gender
21. state
22. normalized spelling (romanized)
23. is there a MADA analysis
24. is the word Arabic UTF-8 (vs. Latin)
25. is base form same as the normalized form
26. Wikipedia: is this a hyperlink or regular text (noisy; turned off)
27. Wikipedia dict (noisy; turned off)
28. is the token UTF-8 or ASCII (noisy; turned off)
29. does the term have any vowels (noisy; turned off)
30. is the token in Wiki NE dict
31. is curr+next token in Wiki NE dict
32. is curr+prev token in Wiki NE dict
33. is the token in Wiki non-NE dict
34. is curr+next token in Wiki non-NE dict
35. is curr+prev token in Wiki non-NE dict
36. class
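
To make the affix features (1-12) and the token length (13) concrete, here
is a minimal Python sketch. It is not the code in featExtraction.py: the
function name affix_features is hypothetical, and the sketch assumes that
T1...Tn denote the characters of the token, with T1 the first and Tn the
last. How the actual script treats tokens shorter than a given affix is not
specified here; in this sketch such slices simply come out shorter or empty.

```python
from typing import List


def affix_features(token: str) -> List[str]:
    """Return features 1-13 from the list above for a single token."""
    return [
        token[0:1],      # 1.  T1
        token[-1:],      # 2.  Tn
        token[0:2],      # 3.  T1,T2
        token[-2:],      # 4.  Tn-1,Tn
        token[0:3],      # 5.  T1,T2,T3
        token[-3:],      # 6.  Tn-2,Tn-1,Tn
        token[1:3],      # 7.  T2,T3
        token[-3:-1],    # 8.  Tn-2,Tn-1
        token[1:4],      # 9.  T2,T3,T4
        token[-4:-1],    # 10. Tn-3,Tn-2,Tn-1
        token[2:4],      # 11. T3,T4
        token[-5:-3],    # 12. Tn-4,Tn-3 (as listed above)
        str(len(token)), # 13. token length
    ]
```

Python strings are indexed by Unicode code point, so the same slices apply
to Arabic-script tokens; note that a diacritic stored as a separate code
point counts as its own character under this assumption.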

A few of the extracted features (26-29 above, marked "noisy; turned off")
are highly noisy, and were found not to help NE tagging performance. When
training and predicting with the tagger it is therefore recommended to omit
these features via the --excludeFeatures option. Alternatively, this may be
accomplished by specifying --properties with the included sample.properties
file.
