How to make valid parallel corpus? #510
You're right on the bleeding edge of what we're doing with BlackLab! There is no user interface to query parallel corpora yet; I'm working on that. You can use the included QueryTool (in blacklab-tools) to test parallel corpora. I will write indexing documentation soon, but here's an untested example (UPDATE: tested it, had to fix some stuff, updated below). Command line (make sure to use integrated index!):
Configuration file

# Indexing a simple parallel corpus

# For displaying in user interface (optional)
displayName: "Parallel test format"

corpusConfig:
  specialFields:
    pidField: pid

# Use Saxon for XInclude and better XPath support
# (required for indexing relations, such as parallel corpus alignment relations)
processor: saxon

# What element starts a new document?
# (the only absolute XPath; the rest is relative)
documentPath: //doc

# Annotated, CQL-searchable fields.
# We usually have just one, named "contents".
annotatedFields:

  # Dutch version of the document
  contents__nl:
    containerPath: text[@lang = 'nl']
    wordPath: .//w
    tokenIdPath: "@xml:id" # remember id so we can refer to it in standoff annotations
    punctPath: .//text()[not(ancestor::w)] # = "all text nodes (under containerPath) not inside a <w/> element"
    annotations:
      - name: word
        displayName: Word
        valuePath: .
        sensitivity: sensitive_insensitive
    standoffAnnotations:
      - path: ancestor-or-self::doc/alignment/link[matches(@target, '^#nl')]
        type: relation
        relationClass: al # alignment relation
        targetVersionPath: "replace(./@target, '^.+ #(\\w+)\\..+$', '$1')"
        valuePath: "@type" # relation type
        sourcePath: "replace(./@target, '^#(.+) .+$', '$1')"
        targetPath: "replace(./@target, '^.+ #(.+)$', '$1')"
    inlineTags:
      - path: .//s # sentence
        tokenIdPath: "@xml:id" # remember id so we can refer to it in standoff annotations

  # English version of the document
  contents__en:
    containerPath: text[@lang = 'en']
    wordPath: .//w
    tokenIdPath: "@xml:id" # remember id so we can refer to it in standoff annotations
    punctPath: .//text()[not(ancestor::w)] # = "all text nodes (under containerPath) not inside a <w/> element"
    annotations:
      - name: word
        displayName: Word
        valuePath: .
        sensitivity: sensitive_insensitive
    standoffAnnotations:
      - path: ancestor-or-self::doc/alignment/link[matches(@target, '^#en')]
        type: relation
        relationClass: al # alignment relation
        targetVersionPath: "replace(./@target, '^.+ #(\\w+)\\..+$', '$1')"
        valuePath: "@type" # relation type
        sourcePath: "replace(./@target, '^#(.+) .+$', '$1')"
        targetPath: "replace(./@target, '^.+ #(.+)$', '$1')"
    inlineTags:
      - path: .//s # sentence
        tokenIdPath: "@xml:id" # remember id so we can refer to it in standoff annotations

# Our (embedded) document metadata
metadata:
  - containerPath: .
    fields:
      # Use the document id as our persistent identifier
      - name: pid
        type: untokenized
        valuePath: "@xml:id"

Input document

<?xml version="1.0" encoding="UTF-8" ?>
<corpus>
  <doc xml:id='d1'>
    <text lang="nl">
      <s xml:id="nl.d1.s1">
        <w xml:id="nl.d1.w1">De</w>
        <w xml:id="nl.d1.w2">snelle</w>
        <w xml:id="nl.d1.w3">bruine</w>
        <w xml:id="nl.d1.w4">vos</w>.
      </s>
    </text>
    <text lang="en">
      <s xml:id="en.d1.s1">
        <w xml:id="en.d1.w1">The</w>
        <w xml:id="en.d1.w2">quick</w>
        <w xml:id="en.d1.w3">brown</w>
        <w xml:id="en.d1.w4">fox</w>.
      </s>
    </text>
    <alignment>
      <link type="sentence-alignment" target="#nl.d1.s1 #en.d1.s1"/>
      <link type="word-alignment" target="#nl.d1.w1 #en.d1.w1"/>
      <link type="word-alignment" target="#nl.d1.w2 #en.d1.w2"/>
      <link type="word-alignment" target="#nl.d1.w3 #en.d1.w3"/>
      <link type="word-alignment" target="#nl.d1.w4 #en.d1.w4"/>
      <link type="sentence-alignment" target="#en.d1.s1 #nl.d1.s1"/>
      <link type="word-alignment" target="#en.d1.w1 #nl.d1.w1"/>
      <link type="word-alignment" target="#en.d1.w2 #nl.d1.w2"/>
      <link type="word-alignment" target="#en.d1.w3 #nl.d1.w3"/>
      <link type="word-alignment" target="#en.d1.w4 #nl.d1.w4"/>
    </alignment>
  </doc>
</corpus>

I hope this helps you to get started with this. Stay tuned for the UI.
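
To see what the three `replace()` expressions in the configuration extract from a `link/@target` value, here is a small Python sketch that applies the same regular expressions with `re.sub`. The function name `split_link_target` is made up for illustration, and Python's regex engine stands in for Saxon's XPath `replace()`; for these simple patterns the two should behave the same, but that is an assumption rather than something tested against BlackLab.

```python
import re

def split_link_target(target: str) -> dict:
    """Apply the config's three replace() patterns to one @target value.

    Hypothetical helper for illustration only; not part of BlackLab.
    """
    return {
        # sourcePath: the id before the space, without the leading '#'
        "source": re.sub(r"^#(.+) .+$", r"\1", target),
        # targetPath: the id after the space, without the leading '#'
        "target": re.sub(r"^.+ #(.+)$", r"\1", target),
        # targetVersionPath: the prefix of the second id, up to the first '.'
        "target_version": re.sub(r"^.+ #(\w+)\..+$", r"\1", target),
    }

print(split_link_target("#nl.d1.s1 #en.d1.s1"))
# {'source': 'nl.d1.s1', 'target': 'en.d1.s1', 'target_version': 'en'}
```

The `target_version` value (`en` here) matches the suffix of the annotated field name `contents__en`, which is why the ids in the input document start with the language code.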
Thank you! Could you please support the Pharaoh alignment format, which is produced by some deep learning libraries such as awesome-align (https://github.com/neulab/awesome-align)? With that, we could make much better use of AI tools.
Good luck! Unfortunately, we have no plans to add support for specific non-XML formats. You can either convert this format to a suitable XML version that BlackLab can index, or implement a DocIndexer in Java to handle this format.
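
The conversion route could look roughly like the Python sketch below. It is only a sketch under several assumptions: the Pharaoh input is one line of zero-based `i-j` pairs (as in `0-0 1-1`), each document holds a single sentence per language, and the ids follow the `lang.doc.wN` pattern from the example document above. The function names (`text_element`, `link_elements`, `build_doc`) are made up for illustration and are not part of BlackLab or awesome-align.

```python
from xml.sax.saxutils import escape

def text_element(lang: str, doc_id: str, tokens: list[str]) -> str:
    """Build one <text> holding a single sentence (assumption: one sentence per doc)."""
    words = "".join(
        f'<w xml:id="{lang}.{doc_id}.w{n}">{escape(tok)}</w>\n'
        for n, tok in enumerate(tokens, start=1)  # example ids are one-based
    )
    return f'<text lang="{lang}">\n<s xml:id="{lang}.{doc_id}.s1">\n{words}</s>\n</text>\n'

def link_elements(pharaoh: str, src: str, tgt: str, doc_id: str) -> str:
    """Turn a Pharaoh line into <link/> elements in both directions."""
    # Pharaoh indices are zero-based, so shift them to match w1, w2, ...
    pairs = [tuple(int(n) + 1 for n in p.split("-")) for p in pharaoh.split()]
    out = []
    for a, b, swap in ((src, tgt, False), (tgt, src, True)):
        out.append(
            f'<link type="sentence-alignment" target="#{a}.{doc_id}.s1 #{b}.{doc_id}.s1"/>\n'
        )
        for i, j in pairs:
            if swap:
                i, j = j, i  # reverse direction: target-language word comes first
            out.append(
                f'<link type="word-alignment" target="#{a}.{doc_id}.w{i} #{b}.{doc_id}.w{j}"/>\n'
            )
    return "".join(out)

def build_doc(doc_id, src, src_tokens, tgt, tgt_tokens, pharaoh) -> str:
    """Assemble one <doc> in the shape of the example input document."""
    return (
        f'<doc xml:id="{doc_id}">\n'
        + text_element(src, doc_id, src_tokens)
        + text_element(tgt, doc_id, tgt_tokens)
        + "<alignment>\n"
        + link_elements(pharaoh, src, tgt, doc_id)
        + "</alignment>\n</doc>\n"
    )

print(build_doc("d1", "nl", ["De", "vos"], "en", ["The", "fox"], "0-0 1-1"))
```

Links are written in both directions because the example configuration selects links starting with `#nl` for `contents__nl` and links starting with `#en` for `contents__en`. The output would still need to be wrapped in a `<corpus>` root element before indexing.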
It only says how to index and query a parallel corpus here. But how do I make a valid parallel corpus?
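
This thread does not spell out what BlackLab itself requires of a "valid" parallel corpus. As a basic sanity check for input in the shape of the example document, the Python sketch below lists every `link` target that does not resolve to an `xml:id` inside the same `doc`. The function name `dangling_link_targets` is made up for illustration, and passing this check is an assumption about what is useful, not a guarantee that indexing will succeed.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def dangling_link_targets(xml_text: str) -> list[str]:
    """Return link references (e.g. '#nl.d1.w9') that no element in the same <doc> declares."""
    missing = []
    for doc in ET.fromstring(xml_text).iter("doc"):
        # Collect every xml:id declared in this document
        known = {el.get(XML_ID) for el in doc.iter() if el.get(XML_ID)}
        for link in doc.iter("link"):
            # @target holds two space-separated '#id' references
            for ref in link.get("target", "").split():
                if ref.lstrip("#") not in known:
                    missing.append(ref)
    return missing

sample = """<corpus><doc xml:id="d1">
<text lang="nl"><s xml:id="nl.d1.s1"><w xml:id="nl.d1.w1">De</w></s></text>
<text lang="en"><s xml:id="en.d1.s1"><w xml:id="en.d1.w1">The</w></s></text>
<alignment>
<link type="word-alignment" target="#nl.d1.w1 #en.d1.w1"/>
<link type="word-alignment" target="#nl.d1.w2 #en.d1.w1"/>
</alignment>
</doc></corpus>"""

print(dangling_link_targets(sample))  # ['#nl.d1.w2']
```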