How to make valid parallel corpus? #510

Closed
fishfree opened this issue Apr 16, 2024 · 3 comments

Comments

@fishfree

The documentation here only explains how to index and query a parallel corpus. But how do I create a valid parallel corpus in the first place?

@jan-niestadt
Member

jan-niestadt commented Apr 16, 2024

You're right on the bleeding edge of what we're doing with BlackLab! There is no user interface to query parallel corpora yet; I'm working on that. You can use the included QueryTool (in blacklab-tools) to test parallel corpora.

I will write indexing documentation soon, but here's an example (UPDATE: I tested it, had to fix a few things, and updated it below).

Command line (make sure to use integrated index!):

java -cp "blacklab*.jar:lib/*" nl.inl.blacklab.tools.IndexTool create --index-type integrated index test.xml par

Configuration file par.blf.yaml:

# Indexing a simple parallel corpus

# For displaying in user interface (optional)
displayName: "Parallel test format"

corpusConfig:
  specialFields:
    pidField: pid

# Use Saxon for XInclude and better XPath support
# (required for indexing relations, such as parallel corpus alignment relations)
processor: saxon

# What element starts a new document?
# (the only absolute XPath; the rest is relative)
documentPath: //doc

# Annotated, CQL-searchable fields.
# We usually have just one, named "contents".
annotatedFields:

  # Dutch version of the document
  contents__nl:
    containerPath: text[@lang = 'nl']
    wordPath: .//w
    tokenIdPath: "@xml:id" # remember id so we can refer to it in standoff annotations
    punctPath: .//text()[not(ancestor::w)]   # = "all text nodes (under containerPath) not inside a <w/> element"

    annotations:
    - name: word
      displayName: Word
      valuePath: .
      sensitivity: sensitive_insensitive

    standoffAnnotations:
    - path: ancestor-or-self::doc/alignment/link[matches(@target, '^#nl')]
      type: relation
      relationClass: al  # alignment relation
      targetVersionPath: "replace(./@target, '^.+ #(\\w+)\\..+$', '$1')"
      valuePath: "@type"   # relation type
      sourcePath: "replace(./@target, '^#(.+) .+$', '$1')"
      targetPath: "replace(./@target, '^.+ #(.+)$', '$1')"

    inlineTags:
    - path: .//s   # sentence
      tokenIdPath: "@xml:id" # remember id so we can refer to it in standoff annotations


  # English version of the document
  contents__en:
    containerPath: text[@lang = 'en']
    wordPath: .//w
    tokenIdPath: "@xml:id" # remember id so we can refer to it in standoff annotations
    punctPath: .//text()[not(ancestor::w)]   # = "all text nodes (under containerPath) not inside a <w/> element"

    annotations:
    - name: word
      displayName: Word
      valuePath: .
      sensitivity: sensitive_insensitive

    standoffAnnotations:
    - path: ancestor-or-self::doc/alignment/link[matches(@target, '^#en')]
      type: relation
      relationClass: al  # alignment relation
      targetVersionPath: "replace(./@target, '^.+ #(\\w+)\\..+$', '$1')"
      valuePath: "@type"   # relation type
      sourcePath: "replace(./@target, '^#(.+) .+$', '$1')"
      targetPath: "replace(./@target, '^.+ #(.+)$', '$1')"

    inlineTags:
    - path: .//s   # sentence
      tokenIdPath: "@xml:id" # remember id so we can refer to it in standoff annotations


# Our (embedded) document metadata
metadata:
  - containerPath: .
    fields:
    # Use the document id as our persistent identifier
    - name: pid
      type: untokenized
      valuePath: "@xml:id"
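
The three replace() expressions under standoffAnnotations are easier to follow with a concrete value. Here is a small Python sketch that mimics them with re.sub, assuming a target attribute of the form "#nl.d1.w1 #en.d1.w1" (two "#id" references separated by one space). The function name parse_link_target is made up for illustration, and Python's regex dialect is only a stand-in for the XPath one that Saxon applies.

```python
import re


def parse_link_target(target):
    """Mimic sourcePath, targetPath and targetVersionPath from the config."""
    return {
        # sourcePath: the id before the space, without the leading '#'
        "source": re.sub(r"^#(.+) .+$", r"\1", target),
        # targetPath: the id after the space, without the leading '#'
        "target": re.sub(r"^.+ #(.+)$", r"\1", target),
        # targetVersionPath: first dot-separated part of the target id
        "target_version": re.sub(r"^.+ #(\w+)\..+$", r"\1", target),
    }


print(parse_link_target("#nl.d1.w1 #en.d1.w1"))
# {'source': 'nl.d1.w1', 'target': 'en.d1.w1', 'target_version': 'en'}
```

So the version that a relation points to ("en" here) comes from the prefix of the target id, which is why the ids in this example start with the language code.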

Input document test.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<corpus>

    <doc xml:id='d1'>

        <text lang="nl">
            <s xml:id="nl.d1.s1">
                <w xml:id="nl.d1.w1">De</w>
                <w xml:id="nl.d1.w2">snelle</w>
                <w xml:id="nl.d1.w3">bruine</w>
                <w xml:id="nl.d1.w4">vos</w>.
            </s>
        </text>

        <text lang="en">
            <s xml:id="en.d1.s1">
                <w xml:id="en.d1.w1">The</w>
                <w xml:id="en.d1.w2">quick</w>
                <w xml:id="en.d1.w3">brown</w>
                <w xml:id="en.d1.w4">fox</w>.
            </s>
        </text>

        <alignment>
            <link type="sentence-alignment" target="#nl.d1.s1 #en.d1.s1"/>
            <link type="word-alignment" target="#nl.d1.w1 #en.d1.w1"/>
            <link type="word-alignment" target="#nl.d1.w2 #en.d1.w2"/>
            <link type="word-alignment" target="#nl.d1.w3 #en.d1.w3"/>
            <link type="word-alignment" target="#nl.d1.w4 #en.d1.w4"/>

            <link type="sentence-alignment" target="#en.d1.s1 #nl.d1.s1"/>
            <link type="word-alignment" target="#en.d1.w1 #nl.d1.w1"/>
            <link type="word-alignment" target="#en.d1.w2 #nl.d1.w2"/>
            <link type="word-alignment" target="#en.d1.w3 #nl.d1.w3"/>
            <link type="word-alignment" target="#en.d1.w4 #nl.d1.w4"/>
        </alignment>

    </doc>

</corpus>
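
Before indexing, it can help to sanity-check a document in this layout. Below is a rough Python sketch (not part of BlackLab; check_alignment and SAMPLE are names invented for this example) that verifies every link target refers to an existing xml:id in the same doc. It also reports links that lack a counterpart in the opposite direction. That second check simply mirrors the sample above, which lists each link both ways; treat it as an assumption about what you want rather than a documented requirement.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def check_alignment(xml_text):
    """Return a list of problems found in the alignment links of each <doc>."""
    problems = []
    root = ET.fromstring(xml_text)
    for doc in root.iter("doc"):
        ids = {el.get(XML_ID) for el in doc.iter() if el.get(XML_ID)}
        links = {tuple(l.get("target", "").split()) for l in doc.iter("link")}
        for link in sorted(links):
            if len(link) != 2 or not all(t.startswith("#") for t in link):
                problems.append(f"malformed target: {' '.join(link)}")
                continue
            src, tgt = link[0][1:], link[1][1:]
            for ref in (src, tgt):
                if ref not in ids:
                    problems.append(f"unknown id: {ref}")
            if (link[1], link[0]) not in links:
                problems.append(f"missing reverse link for: {src} -> {tgt}")
    return problems


# Deliberately broken sample: the second link points at a non-existent word
SAMPLE = """<corpus><doc xml:id="d1">
  <text lang="nl"><s xml:id="nl.d1.s1"><w xml:id="nl.d1.w1">De</w></s></text>
  <text lang="en"><s xml:id="en.d1.s1"><w xml:id="en.d1.w1">The</w></s></text>
  <alignment>
    <link type="word-alignment" target="#nl.d1.w1 #en.d1.w1"/>
    <link type="word-alignment" target="#en.d1.w1 #nl.d1.w2"/>
  </alignment>
</doc></corpus>"""

for problem in check_alignment(SAMPLE):
    print(problem)
```

An empty result means all references resolve and every link has its mirror image.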

I hope this helps you to get started with this. Stay tuned for the UI.

@fishfree
Author

Thank you! Could you please support the Pharaoh alignment format, which is used by some deep learning libraries, e.g. https://github.com/neulab/awesome-align? That would let us make much better use of AI tools.

@jan-niestadt
Member

jan-niestadt commented Apr 16, 2024

Good luck!

Unfortunately, we have no plans to add support for specific non-XML formats. You can either convert this format to a suitable XML version that BlackLab can index, or implement a DocIndexer in Java to handle this format.
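
For the conversion route, something along these lines might be a starting point. It is a minimal Python sketch, assuming whitespace-tokenized sentence pairs and Pharaoh lines made of 0-based "i-j" index pairs (source index first). It writes one doc element in the layout of the test.xml example above. The function name pharaoh_to_xml, the running word-id scheme, and the choice to emit every link in both directions are assumptions made for this illustration, not anything BlackLab prescribes.

```python
from xml.sax.saxutils import escape


def pharaoh_to_xml(sentences, doc_id="d1", src="nl", tgt="en"):
    """Build one <doc> from (source_text, target_text, pharaoh_line) triples."""
    body = {src: [], tgt: []}   # <s> elements per language
    count = {src: 0, tgt: 0}    # words emitted so far, used for running ids
    links = []                  # (type, source_id, target_id)

    for s_no, (src_text, tgt_text, pharaoh) in enumerate(sentences, start=1):
        start = dict(count)     # word offsets at the start of this sentence
        for lang, text in ((src, src_text), (tgt, tgt_text)):
            words = []
            for tok in text.split():
                count[lang] += 1
                words.append(
                    f'<w xml:id="{lang}.{doc_id}.w{count[lang]}">{escape(tok)}</w>'
                )
            body[lang].append(
                f'<s xml:id="{lang}.{doc_id}.s{s_no}">' + " ".join(words) + "</s>"
            )
        links.append(("sentence-alignment",
                      f"{src}.{doc_id}.s{s_no}", f"{tgt}.{doc_id}.s{s_no}"))
        for pair in pharaoh.split():
            i, j = (int(n) for n in pair.split("-"))
            links.append(("word-alignment",
                          f"{src}.{doc_id}.w{start[src] + i + 1}",
                          f"{tgt}.{doc_id}.w{start[tgt] + j + 1}"))

    out = [f'<doc xml:id="{doc_id}">']
    for lang in (src, tgt):
        out.append(f'<text lang="{lang}">' + "".join(body[lang]) + "</text>")
    out.append("<alignment>")
    # Emit each link in both directions, as the sample document does
    for a, b in ((1, 2), (2, 1)):
        for link in links:
            out.append(f'<link type="{link[0]}" target="#{link[a]} #{link[b]}"/>')
    out.append("</alignment>")
    out.append("</doc>")
    return "\n".join(out)


print(pharaoh_to_xml([("De snelle bruine vos", "The quick brown fox",
                       "0-0 1-1 2-2 3-3")]))
```

Wrap the result in a corpus root element before indexing. Pharaoh variants that mark "possible" links with a "p" instead of "-" are not handled here.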
