How to make valid parallel corpus? #510

fishfree · 2024-04-16T00:46:04Z

The documentation here only explains how to index and query a parallel corpus. But how do I create a valid parallel corpus in the first place?

jan-niestadt · 2024-04-16T08:49:53Z

You're right on the bleeding edge of what we're doing with BlackLab! There is no user interface to query parallel corpora yet; I'm working on that. You can use the included QueryTool (in blacklab-tools) to test parallel corpora.

I will write indexing documentation soon, but here's an untested example (UPDATE: tested it, had to fix some stuff, updated below).

Command line (make sure to use integrated index!):

java -cp "blacklab*.jar:lib/*" nl.inl.blacklab.tools.IndexTool create --index-type integrated index test.xml par

Configuration file par.blf.yaml:

# Indexing a simple parallel corpus

# For displaying in user interface (optional)
displayName: "Parallel test format"

corpusConfig:
  specialFields:
    pidField: pid

# Use Saxon for XInclude and better XPath support
# (required for indexing relations, such as parallel corpus alignment relations)
processor: saxon

# What element starts a new document?
# (the only absolute XPath; the rest is relative)
documentPath: //doc

# Annotated, CQL-searchable fields.
# A monolingual corpus usually has just one, named "contents".
# Here we define one per language version (contents__nl, contents__en).
annotatedFields:

  # Dutch version of the document
  contents__nl:
    containerPath: text[@lang = 'nl']
    wordPath: .//w
    tokenIdPath: "@xml:id" # remember id so we can refer to it in standoff annotations
    punctPath: .//text()[not(ancestor::w)]   # = "all text nodes (under containerPath) not inside a <w/> element"

    annotations:
    - name: word
      displayName: Word
      valuePath: .
      sensitivity: sensitive_insensitive

    standoffAnnotations:
    - path: ancestor-or-self::doc/alignment/link[matches(@target, '^#nl')]
      type: relation
      relationClass: al  # alignment relation
      targetVersionPath: "replace(./@target, '^.+ #(\\w+)\\..+$', '$1')"
      valuePath: "@type"   # relation type
      sourcePath: "replace(./@target, '^#(.+) .+$', '$1')"
      targetPath: "replace(./@target, '^.+ #(.+)$', '$1')"

    inlineTags:
    - path: .//s   # sentence
      tokenIdPath: "@xml:id" # remember id so we can refer to it in standoff annotations


  # English version of the document
  contents__en:
    containerPath: text[@lang = 'en']
    wordPath: .//w
    tokenIdPath: "@xml:id" # remember id so we can refer to it in standoff annotations
    punctPath: .//text()[not(ancestor::w)]   # = "all text nodes (under containerPath) not inside a <w/> element"

    annotations:
    - name: word
      displayName: Word
      valuePath: .
      sensitivity: sensitive_insensitive

    standoffAnnotations:
    - path: ancestor-or-self::doc/alignment/link[matches(@target, '^#en')]
      type: relation
      relationClass: al  # alignment relation
      targetVersionPath: "replace(./@target, '^.+ #(\\w+)\\..+$', '$1')"
      valuePath: "@type"   # relation type
      sourcePath: "replace(./@target, '^#(.+) .+$', '$1')"
      targetPath: "replace(./@target, '^.+ #(.+)$', '$1')"

    inlineTags:
    - path: .//s   # sentence
      tokenIdPath: "@xml:id" # remember id so we can refer to it in standoff annotations


# Our (embedded) document metadata
metadata:
  - containerPath: .
    fields:
    # Use the document id as our persistent identifier
    - name: pid
      type: untokenized
      valuePath: "@xml:id"
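
The three replace() expressions in the standoffAnnotations sections are easier to follow when run against a concrete value. The sketch below is not part of BlackLab; it uses Python's re module as a stand-in for the XPath replace() function (an assumption that holds for these simple patterns) and a @target value taken from the sample input document.

```python
import re

# Example @target value, as found on a <link> element in the input document.
target = "#nl.d1.s1 #en.d1.s1"

# The path filter matches(@target, '^#nl') keeps links whose source is Dutch.
source_is_nl = re.search(r"^#nl", target) is not None

# Python equivalents of the three replace() expressions from the config.
# (In the YAML, "\\w" inside double quotes denotes the regex \w.)
target_version = re.sub(r"^.+ #(\w+)\..+$", r"\1", target)  # targetVersionPath
source_id = re.sub(r"^#(.+) .+$", r"\1", target)            # sourcePath
target_id = re.sub(r"^.+ #(.+)$", r"\1", target)            # targetPath

print(source_is_nl)    # True
print(target_version)  # en
print(source_id)       # nl.d1.s1
print(target_id)       # en.d1.s1
```

So each link yields a source id, a target id, and the version ("en" or "nl") whose annotated field holds the target.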

Input document test.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<corpus>

    <doc xml:id='d1'>

        <text lang="nl">
            <s xml:id="nl.d1.s1">
                <w xml:id="nl.d1.w1">De</w>
                <w xml:id="nl.d1.w2">snelle</w>
                <w xml:id="nl.d1.w3">bruine</w>
                <w xml:id="nl.d1.w4">vos</w>.
            </s>
        </text>

        <text lang="en">
            <s xml:id="en.d1.s1">
                <w xml:id="en.d1.w1">The</w>
                <w xml:id="en.d1.w2">quick</w>
                <w xml:id="en.d1.w3">brown</w>
                <w xml:id="en.d1.w4">fox</w>.
            </s>
        </text>

        <alignment>
            <link type="sentence-alignment" target="#nl.d1.s1 #en.d1.s1"/>
            <link type="word-alignment" target="#nl.d1.w1 #en.d1.w1"/>
            <link type="word-alignment" target="#nl.d1.w2 #en.d1.w2"/>
            <link type="word-alignment" target="#nl.d1.w3 #en.d1.w3"/>
            <link type="word-alignment" target="#nl.d1.w4 #en.d1.w4"/>

            <link type="sentence-alignment" target="#en.d1.s1 #nl.d1.s1"/>
            <link type="word-alignment" target="#en.d1.w1 #nl.d1.w1"/>
            <link type="word-alignment" target="#en.d1.w2 #nl.d1.w2"/>
            <link type="word-alignment" target="#en.d1.w3 #nl.d1.w3"/>
            <link type="word-alignment" target="#en.d1.w4 #nl.d1.w4"/>
        </alignment>

    </doc>

</corpus>

I hope this helps you to get started with this. Stay tuned for the UI.
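
Before indexing, it can help to sanity-check that every alignment link points at an id that actually exists in the right language version. The following is a minimal sketch, not a BlackLab tool: check_alignment is a hypothetical helper, and it assumes the layout of test.xml above (ids prefixed with the language code, two space-separated "#id" references per target attribute).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def check_alignment(doc):
    """Return a list of problems found in one <doc> element (hypothetical helper)."""
    # Collect the xml:id of every element inside each <text lang="...">.
    ids_by_lang = {}
    for text in doc.findall("text"):
        ids = {el.get(XML_ID) for el in text.iter() if el.get(XML_ID)}
        ids_by_lang[text.get("lang")] = ids

    problems = []
    for link in doc.findall("alignment/link"):
        parts = link.get("target", "").split()
        if len(parts) != 2 or not all(p.startswith("#") for p in parts):
            problems.append(f"malformed target: {link.get('target')!r}")
            continue
        for ref in parts:
            ident = ref[1:]                # strip the leading '#'
            lang = ident.split(".", 1)[0]  # version prefix, e.g. "nl"
            if ident not in ids_by_lang.get(lang, set()):
                problems.append(f"unknown id: {ident}")
    return problems

sample = """<doc xml:id="d1">
  <text lang="nl"><s xml:id="nl.d1.s1"><w xml:id="nl.d1.w1">De</w></s></text>
  <text lang="en"><s xml:id="en.d1.s1"><w xml:id="en.d1.w1">The</w></s></text>
  <alignment>
    <link type="word-alignment" target="#nl.d1.w1 #en.d1.w1"/>
    <link type="word-alignment" target="#nl.d1.w1 #en.d1.w9"/>
  </alignment>
</doc>"""

problems = check_alignment(ET.fromstring(sample))
print(problems)  # ['unknown id: en.d1.w9']
```

For a full corpus file, the same function could be applied to each element returned by ET.parse("test.xml").getroot().iter("doc").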

fishfree · 2024-04-16T12:56:00Z

Thank you! Could you please add support for the Pharaoh alignment format? It is supported by some deep learning libraries, e.g. awesome-align (https://github.com/neulab/awesome-align). With that, we could make good use of AI tools.

jan-niestadt · 2024-04-16T13:23:52Z

Good luck!

Unfortunately, we have no plans to add support for specific non-XML formats. You can either convert this format to a suitable XML version that BlackLab can index, or implement a DocIndexer in Java to handle this format.
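
The conversion route can be sketched in a few lines of Python. This is only an illustration under stated assumptions: pharaoh_to_doc is a hypothetical helper, the alignment line is assumed to consist of 0-based "i-j" pairs (source index, target index) separated by spaces, each document is treated as a single sentence, and every token (including punctuation) becomes a <w> element. The output mimics the test.xml layout shown earlier in this thread, with links emitted in both directions as in that sample.

```python
from xml.sax.saxutils import escape

def pharaoh_to_doc(doc_id, src_tokens, tgt_tokens, pharaoh, src="nl", tgt="en"):
    """Build one <doc> element as a string from tokens plus a Pharaoh line like "0-0 1-1"."""
    def text_block(lang, tokens):
        words = "\n".join(
            f'      <w xml:id="{lang}.{doc_id}.w{i}">{escape(tok)}</w>'
            for i, tok in enumerate(tokens, start=1)
        )
        return (f'  <text lang="{lang}">\n'
                f'    <s xml:id="{lang}.{doc_id}.s1">\n{words}\n    </s>\n'
                f'  </text>')

    def link(kind, a, b):
        return f'    <link type="{kind}" target="#{a} #{b}"/>'

    pairs = [tuple(int(n) for n in p.split("-")) for p in pharaoh.split()]
    links = []
    # Emit each relation in both directions, mirroring the sample document.
    for a_lang, b_lang, flip in ((src, tgt, False), (tgt, src, True)):
        links.append(link("sentence-alignment",
                          f"{a_lang}.{doc_id}.s1", f"{b_lang}.{doc_id}.s1"))
        for i, j in pairs:
            if flip:
                i, j = j, i
            # Pharaoh indices are 0-based; the word ids here are 1-based.
            links.append(link("word-alignment",
                              f"{a_lang}.{doc_id}.w{i + 1}",
                              f"{b_lang}.{doc_id}.w{j + 1}"))

    return "\n".join([
        f'<doc xml:id="{doc_id}">',
        text_block(src, src_tokens),
        text_block(tgt, tgt_tokens),
        "  <alignment>",
        *links,
        "  </alignment>",
        "</doc>",
    ])

doc_xml = pharaoh_to_doc("d1", ["De", "vos"], ["The", "fox"], "0-0 1-1")
print(doc_xml)
```

Wrapping the generated <doc> elements in a single <corpus> root gives a file in the same shape as test.xml, which the par.blf.yaml configuration above should then be able to index.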

jan-niestadt closed this as completed Apr 18, 2024
