How to index CoNLL-U sub-features? #515

Closed
fishfree opened this issue Apr 18, 2024 · 5 comments

@fishfree

I noticed that they are commented out here. What's your roadmap, or are there any workarounds I could hack together? Many thanks!

@jan-niestadt
Member

You should be able to configure this yourself in your own version of conll-u.blf.yaml. Use process instructions, specifically the replace instruction, which does a regex replace. This way you can isolate each sub-feature and index it in a separate annotation.

@fishfree
Author

@jan-niestadt Could you please show me an example?

@jan-niestadt
Member

If your features column contains values like Number=Plur|Person=3|Tense=Pres, you should be able to index Number and Person like this (untested, but I hope you get the idea):

- name: feats
  displayName: Features
  valuePath: 6
  multipleValues: true

- name: number
  valuePath: 6
  process:
  - action: replace
    find: "^.*Number=([^\\|]+).*$"
    replace: "$1"

- name: person
  valuePath: 6
  process:
  - action: replace
    find: "^.*Person=([^\\|]+).*$"
    replace: "$1"
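
To check what these find/replace pairs produce, here is a minimal Python sketch. It assumes Python's `re` module behaves like the regex engine behind the replace action for these particular patterns (this is not verified against BlackLab itself), and it writes the replacement as `\1` where the config uses `$1`.

```python
import re

# Sample FEATS value taken from the comment above
feats = "Number=Plur|Person=3|Tense=Pres"

# Same patterns as the "find" entries in the config, after YAML unescaping.
# Python uses \1 for the first capture group where the config uses $1.
number = re.sub(r"^.*Number=([^\|]+).*$", r"\1", feats)
person = re.sub(r"^.*Person=([^\|]+).*$", r"\1", feats)

print(number)  # Plur
print(person)  # 3
```

Each pattern matches the whole string and keeps only the captured sub-feature value, which is why every sub-feature gets its own annotation with its own replace step.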

@fishfree
Author

@jan-niestadt Thank you very much! If some of my corpus CoNLL-U files have no FEATS part (because the language doesn't support this output), will it be automatically ignored and bypassed, so that the other corpus files that do have FEATS keep working?

@jan-niestadt
Member

If column 6 is empty in those files, the regex won't match, so nothing will be replaced, so the original empty value will be indexed for Person. That shouldn't be a problem.

What could be a problem is when FEAT sometimes contains Person and sometimes contains only other features (but not Person). For a value without Person, the replace action would do nothing, so the entire original FEAT value would be indexed.

Maybe you could solve this with an extra replace action that empties any FEAT value where Person doesn't occur (using negative lookahead):

- name: person
  valuePath: 6
  process:
  # If string doesn't contain "Person=", make it empty
  - action: replace
    find: "^(?!.*Person=).*$"   # matches any string that doesn't contain "Person="
    replace: ""
  # Remove everything except the value of the Person FEAT (or leave the string unmodified if regex doesn't match)
  - action: replace
    find: "^.*Person=([^\\|]+).*$"    # match the value for Person in group 1
    replace: "$1"
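
The effect of chaining the two replace actions can be simulated outside the indexer. The following is only a sketch: it uses Python's `re` module as a stand-in for the regex engine behind the replace action, and the helper names `one_step` and `two_step` are hypothetical, not part of BlackLab.

```python
import re

def one_step(feats: str, name: str) -> str:
    """Only the extracting replace: leaves the value unchanged if there is no match."""
    key = re.escape(name)
    return re.sub(rf"^.*{key}=([^|]+).*$", r"\1", feats)

def two_step(feats: str, name: str) -> str:
    """First empty values that lack '<name>=', then extract the sub-feature value."""
    key = re.escape(name)
    # Step 1: negative lookahead empties any value that does not contain "<name>="
    value = re.sub(rf"^(?!.*{key}=).*$", "", feats)
    # Step 2: keep only the captured value (a no-op on the now-empty string)
    return one_step(value, name)

with_person = "Number=Plur|Person=3|Tense=Pres"
without_person = "Number=Sing|Tense=Past"

print(repr(one_step(without_person, "Person")))  # 'Number=Sing|Tense=Past'
print(repr(two_step(without_person, "Person")))  # ''
print(repr(two_step(with_person, "Person")))     # '3'
print(repr(two_step("", "Person")))              # ''
```

With only the extracting step, a value that lacks Person passes through unchanged and would be indexed whole; with the extra lookahead step it becomes empty, while values that do contain Person still yield just the sub-feature value.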
