Relations between words #411

Merged
merged 29 commits into from
Sep 14, 2023

@jan-niestadt jan-niestadt commented Mar 29, 2023

Allows searching for (dependency) relations between words.

More details in plan-relations.md.

Done (relations):

  • Support for indexing and searching relations (such as dependency relations) was added. Relations have a class (e.g. "dep" for dependency relations, also the default class), a type (e.g. "det" for a determiner relation), a source and target (both spans of adjacent tokens), and may have arbitrary attributes (just as inline tags can have attributes).
  • Both relations and inline tags are indexed in the _relation annotation (which replaces the ill-named starttag annotation)
  • Inline tags are now indexed as a special relation type pointing from the start to the end tag position (both 0-length).
  • The API for indexing inline tags from a DocIndexer was abstracted, so no implementation details need to live in the DocIndexers.
  • Both low-level relations functions and a high-level dependency relations operator were added to our CQL implementation. See CQL extension for dependency relations
  • A DocIndexer for the CoNLL-U format was added, which can include dependency relations
  • Capture groups were generalized to "match info", also capturing any inline tags or relations that were matched, including their type and attributes. BLS will return matched capture groups, relations and inline tags separately in the response. Use new matchinfo parameter with value captures to get the old behaviour (for compatibility).
  • Match info is now taken into account for hit uniqueness. This means that several hits with the same start and end position may be returned, if they have different match info (e.g. several different ways groups can be captured in this same hit span)
  • When matching relations, there is always one "active" relation, which is the one used in subsequent relation matching. You can adjust the current span to the source, target, or full span of the active relation. You can also adjust it to cover all matched relations if needed.
  • The rmatch() operation is used in place of & (AND) when matching relations clauses. rmatch() ensures that all matched relations are unique, so we can find e.g. two different "det" relations with the same source, and not get a result with the same "det" relation twice. It also ensures that sets of matched relations are unique, so we don't get separate hits matching A;B and B;A. All this ensures that things work as expected for users.
  • Constraints (after :: operator) may now use e.g. start(A) > start(B) to check the ordering of captures. end(A) also works. Previously :: constraints were only allowed at the top-level of the query, but they may now also occur in parenthesized subqueries, as long as they don't refer to groups captured outside that subquery.
  • Relation queries are optimized where possible. Relations whose source or target is only specified as []{m,n} (e.g. [] for one word) are resolved using a new length filter instead of finding all n-grams (this is used for any applicable AND(NOT) query). If you capture a source/target specified as an n-gram, this should be optimized as well, by lifting the capture operation so the AND can be optimized as described.
  • group by capture was improved as relations queries may capture spans outside of the requested hit+context. This is now no longer a problem.
  • You can change the name of captured relations and inline tags to avoid collisions if you're capturing the same relation or tag twice in a query. Use e.g. _ A:-det-> _ to capture all det relations as A and A:<s/> to capture all sentences as A. The latter is a slightly incompatible change as this would formerly have been captured as a regular span, but is now captured as the inline tag (with tag name and attributes, in inlineTags section in BLS). If you really need the old behaviour, you can use A:_ident(<s/>). _ident() is a debug function that will not affect the meaning of the query, just prevent certain query rewrites from being applied. The third parameter of rel() can now be used to name the capture group as well. If the empty string is used, a name is automatically assigned (default behaviour). The 4th param is the direction filter that used to be the 3rd param.
  • capture all relations for matched sentence with e.g. rcapture(sen:<s/>, 'sen', 'sen-rels'). The /snippet operation does this automatically if you specify context=s.
  • _ is now the "don't care" / "default" value for the relation operator(s) as well as for function calls. So _ --> _ now means "all relations". That is, _ is interpreted as []* here.
  • if a query matches several relations with same regex (e.g. multiple -->), each will get a unique capture name if not assigned explicitly in the query.
  • support Parlamint TEI with dependency relations
  • If relations query doesn't have rspan() call as root, automatically add rspan(..., 'all')
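
The relation model described in the list above (class, type, source and target spans, attributes, with inline tags stored as a special zero-length case) can be pictured with a small sketch. This is not BlackLab's Java code; the `Span`, `Relation` and `inline_tag` names are hypothetical, and only the `__tag` class value is taken from this PR's commit notes.

```python
from dataclasses import dataclass

TAG_CLASS = "__tag"  # special relation class for inline tags, as named in this PR


@dataclass(frozen=True)
class Span:
    start: int  # first token position (inclusive)
    end: int    # token position just past the span (exclusive)

    def length(self) -> int:
        return self.end - self.start


@dataclass(frozen=True)
class Relation:
    rel_class: str          # e.g. "dep" for dependency relations
    rel_type: str           # e.g. "det" for a determiner relation
    source: Span
    target: Span
    attributes: tuple = ()  # (name, value) pairs; a tuple keeps the relation hashable

    def full_span(self) -> Span:
        # Smallest span covering both source and target
        return Span(min(self.source.start, self.target.start),
                    max(self.source.end, self.target.end))


def inline_tag(name, start, end, attributes=()):
    # Hypothetical helper: a tag becomes a relation from its start position
    # to its end position, with zero-length source and target.
    return Relation(TAG_CLASS, name, Span(start, start), Span(end, end), tuple(attributes))


# "the cat": the head "cat" (token 1) points at its determiner "the" (token 0)
det = Relation("dep", "det", source=Span(1, 2), target=Span(0, 1))
# A sentence tag around tokens 0..4
sentence = inline_tag("s", 0, 5)
```

Here `det.full_span()` covers both tokens, and `sentence.full_span()` recovers the tagged span even though its source and target are empty.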

Done (other):

  • context=s will return a whole sentence as context; context=5:10 will return 5 words before and 10 after the hit (context replaces wordsaroundhit, which still works for now). Also works for /snippet.
  • /snippet response is now same as a hit in hits results
  • Tests were switched to API v4
  • Weight.isCachable() was implemented for various Weight classes to allow a bit more caching.
  • inconsistent Unicode character normalization (canonical vs decomposed) will be corrected during indexing. Also soft hyphen and zero-width space are automatically removed. This prevents problems with sorting/grouping later.
  • proxy supports POST requests as well, useful for very long queries
  • HitProperties that use context from the forward index now manage their own context instead of the complicated system with needsContext, context indices, etc. They only fetch the context tokens they actually need, which could be slightly faster. The context is disposed when it's no longer needed to save memory.
  • API v4 is transitional (98% compatible with v3, adds stuff like new /corpora/NAME endpoints), experimental API v5 is new API (removes deprecated v3 stuff, changes some keys, some structures, XML more in line with JSON). Use api=3|4|exp to select API to use. Don't rely on any of the new API stuff until BlackLab 4.0 is released.
  • CI saves responses for external and integrated separately (mainly because matchInfo differs; there may be more differences in the future)
  • CorpusQL query can be translated to/from a JSON textpattern. You may pass such a JSON structure as the patt parameter, and the JSON structure for CQL query will be returned in the response as well. There's also a new endpoint /parse-pattern that can parse a CQL query into a JSON response and vice versa. Useful for implementing a query builder.
  • Unify VTD and Saxon indexers, give Saxon feature parity
  • Refactored DocumentFormats to simplify the code and make it easier to track errors.

TODO (see issue #449):

  • Add proper relations tests (include some lassy data for tests? or something synthetic?)
  • write proper documentation for relations (once features and syntax finalized)

MAYBE

  • Include rcapture alternative? E.g. "de" within (<s/> collect -->)
  • OPT: maybe drop the distinct relations requirement for rmatch if relationtypes in query are already distinct?
  • Relation descendants: what operations would be needed? would indexing extra "multi-relations" help? (you could index all paths from the root to a leaf, with the relation types included in the indexed term, and the positions included in the payload) or do we need to reconstruct the full tree structure and perform the entire operation on that?
  • it might be nice if you could do something like rspan but for any match info, i.e. "give me the span that matches group A" or "give me the span for the sentence tag".
  • MAYBE: syntax to capture different spanMode without changing span? So instead of e.g. rspan(A:rel('flat', _, 'target'), 'source') you could write rcap(rel('flat'), 'A', 'target') ("capture target of rel('flat') as A, but resulting span is still the same as rel('flat')"), so rcap(query, name, spanMode='target')

Closes #405 and #201.

@jan-niestadt jan-niestadt linked an issue Mar 29, 2023 that may be closed by this pull request
@jan-niestadt jan-niestadt linked an issue Apr 17, 2023 that may be closed by this pull request
@jan-niestadt jan-niestadt force-pushed the feature/relations branch 2 times, most recently from 2087971 to 21768fe Compare May 23, 2023 07:52
@jan-niestadt jan-niestadt linked an issue May 24, 2023 that may be closed by this pull request
@jan-niestadt jan-niestadt force-pushed the feature/relations branch 2 times, most recently from 42820db to 8714228 Compare May 31, 2023 09:38
@jan-niestadt jan-niestadt force-pushed the feature/relations branch 4 times, most recently from 7e1296e to d61d704 Compare June 21, 2023 13:15
@jan-niestadt jan-niestadt linked an issue Jul 6, 2023 that may be closed by this pull request
@jan-niestadt jan-niestadt force-pushed the feature/relations branch 2 times, most recently from 6b6ec0b to 93306c5 Compare July 24, 2023 11:46
@jan-niestadt jan-niestadt force-pushed the feature/relations branch 2 times, most recently from 8f13d2f to dcfe2e9 Compare September 12, 2023 13:01
A relation is simply a labelled, directed arrow between two words (or word groups). We will use this primitive to implement dependency relations search, among others. We've also updated how spans (e.g. inline XML tags) are indexed and searched to use this same primitive.

The tests now all use API version 4.0. See https://inl.github.io/BlackLab/server/rest-api/ for the (minor) differences.

In the integrated index format, the 'starttag' annotation (where spans used to be indexed) is now called '_relation' and will be where both spans and word (dependency) relations will be indexed.

Custom DocIndexer classes may need to be updated to use the new indexInlineTag method instead of adding values to the starttag annotation manually. If not, the DocIndexer will still mostly work, except for inline tags.

A parser for the CoNLL-U format that includes dependency relations was implemented.

A new QueryExtensions mechanism was created that can be used to add extension functions to CQL. This mechanism is now used to add existing debug functions like _FI1(), but also for the relation primitives such as rel(), rspan(), etc. Default parameter values are supported (omit at the end or use _ in the middle).

Global constraints in CQL (:: operator) are now also allowed to be used within parentheses, making them local constraints.

Some bugs were fixed, including a bug finding doc length for any token queries.
… store captures.

MatchInfo is now kept in the HitsInternal classes instead of in an external class (CapturedGroupsImpl, with map lookups). This is more efficient and allows us to easily take MatchInfo into account when comparing hits.
Two-phase iterators provide a way to go through Spans more efficiently, as it allows us to entirely skip documents that don't contain the required terms at all before actually fetching all the terms lists.
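
As a rough illustration of the two-phase idea (not Lucene's or BlackLab's actual API), the sketch below first checks cheaply whether a document contains all required terms, and only then reads position lists to verify a two-word phrase. The index layout and the `find_phrase` function are assumptions made for this example.

```python
def find_phrase(index, first, second):
    """Find `first` immediately followed by `second`.

    index: doc id -> {term: sorted list of token positions} (toy layout).
    Returns (hits, verified_docs), where verified_docs lists the documents
    that reached the expensive second phase.
    """
    hits = []
    verified_docs = []
    for doc_id in sorted(index):
        postings = index[doc_id]
        # Phase 1 (approximation): only ask which terms occur at all
        if first not in postings or second not in postings:
            continue
        # Phase 2 (verification): now actually read the position lists
        verified_docs.append(doc_id)
        second_positions = set(postings[second])
        for pos in postings[first]:
            if pos + 1 in second_positions:
                hits.append((doc_id, pos, pos + 2))
    return hits, verified_docs


index = {
    0: {"the": [0, 3], "cat": [1], "sat": [2]},
    1: {"a": [0], "dog": [1]},
    2: {"cat": [0], "the": [1], "mat": [2]},
}
hits, verified = find_phrase(index, "the", "cat")
```

Document 1 is skipped without touching any positions; document 2 passes the approximation but yields no hit in the verification phase.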

We've converted our Spans classes to enable two-phase iterators. We're also sharing more code between the various Spans classes. We've adopted Lucene's FilterSpans and ConjunctionSpans and created BL-versions of them (to use BLSpans and to be able to override methods marked as final in Lucene) and derived several of our Spans classes from them.

Aside from the normal unit and integration tests, we've also run comparative tests on a larger corpus to make sure results are identical.
SpansAnd didn't deal correctly with clauses that produce hits with the same doc, start and end (but different match info). Now it has a default, slower version that correctly deals with this, and the previous version (SpansAndSimple) is used where possible.

SpansRepetition had a bug that skipped certain matches if the repetition consisted of non-consecutive hits. Again, there's now a slower implementation that correctly deals with this and SpansRepetitionSimple is the previous version that will be used when we know this problem cannot occur. A new guarantee, hitsCanOverlap, was added for this.

Spans classes were documented in the BlackLab internals doc, and QueryTool was made more suitable for larger-scale correctness testing.

SpansInBucketsPerDocumentSorted now uses sort indexes to sort hits, so match info is also sorted along with the starts/ends. Incurs a slight performance hit, but probably not too bad.
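
A minimal sketch of sorting via sort indexes, assuming hits are kept in parallel lists (the function name and data layout are hypothetical): instead of sorting starts and ends directly, we sort a list of indexes and apply that permutation to every parallel list, so match info stays attached to its hit.

```python
def sort_hits_with_match_info(starts, ends, match_info):
    # Compute the permutation once, ordered by (start, end)...
    order = sorted(range(len(starts)), key=lambda i: (starts[i], ends[i]))
    # ...then apply it to all parallel lists, including match info
    return ([starts[i] for i in order],
            [ends[i] for i in order],
            [match_info[i] for i in order])


sorted_starts, sorted_ends, sorted_info = sort_hits_with_match_info(
    [5, 2, 5], [7, 4, 6], ["a", "b", "c"])
```

The extra indirection is the likely source of the slight performance cost mentioned above.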

Additional tests were added for the above.
- BLSpans and SpansInBuckets now also have guarantees like BLSpanQuery had (e.g. hitsAllSameLength, etc.). They are available through a method guarantees() that returns a SpanGuarantees object. This allows better optimization (e.g. avoid unnecessary sorting/uniqueing) and better validation (e.g. ensure that clauses are sorted)
- Care was taken to include match info when removing duplicate hits. So hits with the same doc, start and end are still considered distinct if they have different match info (captured groups or relations).
- Matching problems with SpansRelations (non-consecutive repetitions) and SpansAndNot (identical hits with different match info) were resolved, as well as a number of smaller bugs.
- No positionsCost if asTwoPhaseIt() never returns null.
- Flag indicates default source/target length.
- Moved relations-related methods from AnnotatedFieldNameUtil to RelationUtil.
- Implemented getRelationInfo() to get active relation info for all BLSpans classes.
- spanMode 'all' adjusts hits to cover all matched relations.
- Docs. QueryTool tweak.
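
The duplicate-removal rule from the list above (hits with the same doc, start and end stay distinct when their match info differs) could look roughly like this. The tuple layout of a hit and the `unique_hits` name are assumptions for illustration only.

```python
def unique_hits(hits):
    """Drop exact duplicates; hits are (doc, start, end, match_info dict)."""
    seen = set()
    result = []
    for doc, start, end, info in hits:
        # Match info is part of the identity of a hit
        key = (doc, start, end, tuple(sorted(info.items())))
        if key not in seen:
            seen.add(key)
            result.append((doc, start, end, info))
    return result


hits = [
    (0, 3, 5, {"A": (3, 4)}),
    (0, 3, 5, {"A": (4, 5)}),  # same span, different capture: kept
    (0, 3, 5, {"A": (3, 4)}),  # exact duplicate of the first: dropped
]
deduplicated = unique_hits(hits)
```
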
Almost all of our SpanWeight objects should be cacheable as they don't
rely on DocValues or global statistics etc. The only one we're not sure about is
SpanWeightFiSeq, which might run into trouble with the global forward index API
which we plan to remove eventually. For now, we won't cache SpanWeightFiSeq.
- MatchInfo with captures, relations in BLS response. RelationInfo and SpanInfo both implement MatchInfo. RelationInfo is used both during indexing and matching and contains more data (source and target spans, relation type) than SpanInfo, which only represents a span during matching (e.g. a captured group).
- BLS response now includes relations, and Proxy handles them as well.
- Fix SpansInBuckets.setHitQueryContext being skipped.
BLFilterDocsSpans: handle SpansInBuckets properly.
- Separate MatchInfo type INLINE_TAG. Inline tags are indexed as relations, but are kind of a special case.
Their source and target are length 0 and the relation class is a special value, __tag, to avoid problems with "real" relations. It doesn't make sense to report these like relations, so we've made them a separate type that can be handled separately.
- Fixed test failures because inline tags were being reported as relations.
- Update plan-relations (rmatch).
- Improve CorpusQL parser definition: Enable top-level NOT; separate capture and sequencepart in cql.jj.
- Fix relations getting incorrectly sorted.
- Fix active relation SpansInBuckets.
- Fix capturing multiple rels of same type.
- Query eq/hashCode. New rel(),rtree().
- Fix tests.
- rtree renamed to rmatch.
There were two duplicates problems with relations:
1. a relations query with a source (parent) and two target (child) clauses might match the same relation for both clauses (e.g. it matches some relation A twice for one hit), but the user expects different relations.
2. the same type of query might produce two hits that have the same relations in different order (e.g. one hit matches A and B and the other matches B and A). Again, these are duplicates to the user.

SpansAndMulti was created first to solve problem 1 by checking that matched relations are unique.

Eventually SpansAndMultiUniqueRelations was created to return unique sets of relations.
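
Both duplicate problems can be shown on a toy example. The sketch below is not the SpansAndMultiUniqueRelations implementation; it only assumes that each clause yields a list of candidate relations (here plain tuples) and combines them, rejecting combinations that reuse a relation and combinations that are merely a reordering of an earlier one.

```python
from itertools import product


def combine_unique(clause_matches):
    """clause_matches: one list of candidate relations per clause."""
    seen_sets = set()
    results = []
    for combo in product(*clause_matches):
        # Problem 1: the same relation matched by more than one clause
        if len(set(combo)) < len(combo):
            continue
        # Problem 2: the same set of relations in a different order
        key = frozenset(combo)
        if key in seen_sets:
            continue
        seen_sets.add(key)
        results.append(combo)
    return results


# Two "det" relations sharing a source, as (type, source, target)
a = ("det", 2, 0)
b = ("det", 2, 1)
combos = combine_unique([[a, b], [a, b]])
```

Of the four raw combinations (a;a, a;b, b;a, b;b), only a;b survives.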

Also:
- Fix bug in CoNLLU parser.
jan-niestadt and others added 19 commits September 14, 2023 13:23
Global constraints: comparison ops, start/end(). These allow us to compare capture ordering (e.g. to see if a relation points forward or backward), e.g. using something like A:_ --> B:_ :: start(A) < start(B) will find forward relations.
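
To make the meaning of a constraint such as `start(A) < start(B)` concrete, here is a small sketch that applies it as a filter over hits with captured groups. The dict-based hit representation is an assumption; in BlackLab the constraint is evaluated during matching, not as a post-filter.

```python
def start(span):
    # A span is modelled as a (start, end) pair of token positions
    return span[0]


def forward_only(hits):
    """Keep hits whose capture A begins before capture B."""
    return [hit for hit in hits if start(hit["A"]) < start(hit["B"])]


hits = [
    {"A": (1, 2), "B": (4, 5)},  # A before B: forward relation
    {"A": (6, 7), "B": (3, 4)},  # A after B: backward relation
]
forward = forward_only(hits)
```
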
We switched to indexing relations at the source because the relation operator yields the source, so we won't have to sort the hits in that case; they will already be sorted.

A matching bug was fixed. Many assertions were added to the Spans classes to make detecting and tracking down matching bugs easier.
Relations operators were added to CorpusQL, including one to match root relations and a negation.

Inline tags are now captured and included in the response, with tag name and attributes.

Also:
- Fix MatchFilterCompare.hashCode/equals.
- Fix SpanQueryRelations.equals/hashCode.
- Fix AND/[] bug. Add regex/any rel op.
- AND optimization: remove match all if all clauses same length.
- Fix AND/[] optimization, RelationSpanAdjust bug.
- Various optimizations. Improve guarantees.
- Add -v option (verbose) to QueryTool.
- Fix HitPropertyCaptureGroup.
- CoNLL-U: show line number on error.
- Fix NPE in QueryTool.
- Fix relation class prefix only applying to part of regex.
- Fix rel() not sorting clauses.
- Refactor SpanQueryAndNot.rewrite(). Lift captures.
- SpansConstraint should ask only for clause's match info.
- Fix group by capture by retrieving from FI directly.
- Document query rewriting.
- Fixes, assertions.
- Clear atFirstInCurrentDocument in twoPhaseCurrentDocumentMatches().
- More consistently name atFirstInCurrentDocument.
- Bugfix in SpansSequenceWithGap constructor.
- Added assertions to Spans classes.
- Fix bug with rspan if no relations matched.
Because our Parlamint data set contains inconsistent Unicode
normalization (some canonical accented characters, some decomposed),
we now normalize to canonical in the CoNLL-U indexer. Maybe all
indexers should do this? Inconsistent normalization causes issues
in TermsReaders, probably because we use TERTIARY collator instead of
IDENTICAL.

Also adds (disabled) code to validate the terms sort on init.

Squashed commit of the following:

commit ca15b4c3fd170639a8419d76a2014370d340d839
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Tue Jul 4 13:21:00 2023 +0200

    Normalize values while indexing CoNLL-U (Parlamint).

commit 9a3cc08
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Mon Jul 3 16:50:59 2023 +0200

    WIP figure out why term sorting seems broken.
context replaces wordsaroundhit. You can specify 5 (5 words before
and after), 5:10 (5 before, 10 after), or s (sentences).
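
The three forms of the context value named above could be parsed along these lines. This is only a sketch: `parse_context` and its return shape are hypothetical, and it handles just the forms `5`, `5:10` and `s`.

```python
def parse_context(value):
    """Return (kind, before, after) for a context parameter value."""
    if value == "s":
        # Whole sentence: no word counts apply
        return ("sentence", None, None)
    before, sep, after = value.partition(":")
    if not sep:
        # A single number means the same count on both sides
        after = before
    return ("words", int(before), int(after))
```

For example, `parse_context("5:10")` yields 5 words before and 10 after, while `parse_context("5")` yields 5 on each side.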
- We now remove unprintable characters like zero-width space and soft hyphen
while indexing.
- We ensure canonical Unicode composition.
- Instead of String.trim() we use StringUtil.trimWhitespace() to trim all space
  characters from start and end.
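
The three cleanup steps listed above can be sketched with Python's standard library. This is an illustration rather than the indexer's Java code: `normalize_value` is a hypothetical name, and `str.strip()` stands in for StringUtil.trimWhitespace() because it also trims Unicode whitespace such as the no-break space.

```python
import unicodedata

# Characters removed outright: soft hyphen and zero-width space
REMOVED = {"\u00ad", "\u200b"}


def normalize_value(value):
    cleaned = "".join(ch for ch in value if ch not in REMOVED)
    # NFC is the canonical composed form: "e" + combining acute becomes "é"
    composed = unicodedata.normalize("NFC", cleaned)
    # Trim all whitespace characters from both ends
    return composed.strip()
```

With this, a decomposed and a precomposed spelling of the same word end up as the same string, which is what sorting and grouping rely on.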
Lucene's optional ~ operator can be used to find relations
whose type doesn't match something. Useful for exclusions.

We haven't enabled this everywhere because NFA matching would also
have to support this operator. Right now the regexes are assumed to
be compatible, which is not strictly true (e.g. escaping rules for Lucene
are stricter than for Java's regex engine), but this would present real
challenges.
- HitPropertyCaptureGroup equals/hashCode.
- Proxy: handle & forward POST requests.
- Fix markdown layout.
rcapture() captures all relations in a captured span.

e.g. rcapture(<s/>, 's', 'rels') captures sentences and their relations. It relies on a capture 's' existing, which it will if your query contains <s/>. You can also explicitly name your capture, e.g. rcapture('the' within X:<s/>, 'X', 'rels').

Also:
- allow _ (don't care / default) to be used with relation operator.
- simplify CQL parser definition a little bit.
Changes:
- for requests, before/after also work (as synonyms for left/right), but BLS response still calls them left/right so we don't break compatibility.
- complex needsContext/contextIndices removed. HitProperty will fetch its own context when needed. Call disposeContext() to free memory (standard group/sort/filter does this automatically).
- before:word:i:3 now uses 3 words before hit. wordleft/wordright now translate into before/after with n==1.
- ctx:word:i:E1 is a single-part version of context:word:i:H1;E1 (latter will be translated into multiple of the former; uses new class HitPropertyContextPart)
- classes HitPropertyContextWords/WordLeft/WordRight were removed (no longer needed because of the translations mentioned).
- HitPropertyCaptureGroup should be faster as it now batches context fetching just like the other classes do.
- Add HP(Before|After)Hit. Deprecate HP(Left|Right)Context. This continues the process of BlackLab being more reading direction
agnostic. Of course, BLS responses still refer to left/right for
compatibility. That will be addressed in a future version of the API.
You can now explicitly name relations captures using e.g. A:-det-> to capture these relations under the name A.

Similarly, you can name captured tags, e.g. A:<s/> will capture not just the span start/end but also the tag name and attributes.

Switched relation operators to single dash, e.g. -det->

Also:
- update BARKs.
- fix and re-enable ignored tests.
- fix proxy. Match info as map.
- unified matchInfo with simple, verbose XML.
- update preindexed external files index for test.
Switch current API to v4. Add experimental v5 (stricter/cleaner). You can set the default API version to use or switch per request using the api parameter (api=4 / api=5-exp)

The goal for API v5 is to improve the API by removing and changing endpoints, parameters and response structure. For now, v4 provides a transitional API that includes stuff from both v3 and v5. Anything related to v5 should still be considered experimental.

The most important change in v4/5 is the addition of /corpora/CORPUSNAME/xxx endpoints that work slightly (not radically) differently. v4 also retains the /CORPUSNAME/xxx endpoints, but those are removed in v5.

v5 also changes XML output to a format that's easier to maintain. XML is not used as often as JSON, but we would like to keep supporting it without too much extra work.

See api-versions.md for more info.
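
As a sketch of switching the API version per request, the helper below builds a request URL with the `api` parameter. The `/corpora/CORPUSNAME/...` path shape and the `patt`, `api` and `context` parameters come from this PR; the `hits` operation name, the base URL and the `hits_url` helper itself are assumptions for illustration.

```python
from urllib.parse import urlencode


def hits_url(base_url, corpus, patt, api="4", context=None):
    """Build a search URL that selects the API version per request."""
    params = {"patt": patt, "api": api}
    if context is not None:
        params["context"] = context
    # urlencode takes care of escaping the quotes in the CQL pattern
    return f"{base_url}/corpora/{corpus}/hits?{urlencode(params)}"


url = hits_url("http://localhost:8080/blacklab-server", "mycorpus",
               '"the"', api="4", context="s")
```

For very long patterns, the same parameters could be sent in a POST body instead, which the proxy now forwards.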
Refactored ResponseStreamer so hit and snippet use the same code and return the same structure, including e.g. matchInfo.

Snippet also supports context=s to get whole sentence. Made sure max snippet size is enforced for context=s as well. Relations can be captured with snippet, e.g. all relations in sentence.

Tests were split per index type, mainly because integrated will return more matchInfo than classic external.

Also fixed NPE in Hits.java.
Search response summary now includes a pattern object with a json key that contains a JSON structure corresponding to the query. The patt parameter can also take JSON now. There is a new endpoint, CORPUS/parse-pattern, that will just parse a (CorpusQL or JSON) pattern. Useful for e.g. implementing query builders.

Also:
- CI test results are stable now (object keys are stored in alphabetical order).
- latest-test-output is stored (.gitignored); this allows us to easily compare mismatches and update the saved-responses if necessary.
- add TextPatternRelationTarget/Match.
- deprecate TextPatternAnnotation/Sensitive (now part of ~Regex/Term; simplifies structure, closer to CorpusQL)
- deprecate TextPatternPrefix/Wildcard (just use regex).
- deprecate a few other TextPattern classes.
- removed already-deprecated double dash relation operators (e.g. --det-->, ---->, etc.).
- Improve unique capture name assignment.
- Add rspan(..,'all') automatically.
- Update plan-relations (index at source).
It happens when annotatedfield doesn't contain any annotations.
When indexing linked metadata, a dummy annotatedField "metadata" is created. Several parts in the code check for this and ignore it. But IndexMetadataExternal (and perhaps internal too?) let it through. This seems to be required because the IndexMetadata object is used to pre-create the contentstore. If we don't do that, the content store for the metadata won't exist, and the indexing process will throw an exception.

So for now, just let it through, and ignore it when serializing. A more thorough solution might have to be implemented in the future. (We probably DO want to have the field exist in the IndexMetadata object, or indexing new documents into the same corpus later won't work?)
This also triggered a problem where
it would be concluded that there weren't enough hits for a full window
(of say, 20 hits), then decided that we will just return all the available hits,
which in this case turned out to be thousands.
The Saxon indexer should now have full feature parity with
the VTD one. Use `processor: saxon` at the top-level of your
.blf.yaml file to use Saxon. Saxon is faster and supports
XPath 3.1.

It is also possible to index relations using standoffAnnotations
now.

Squashed commit of the following:

commit 4bf5782ee35915bcf5eaf79e9e1647f9ace7691a
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Thu Sep 7 15:23:48 2023 +0200

    Update changelog.md.

commit ba32ca9
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Thu Sep 7 14:27:49 2023 +0200

    Small edit in xpath-examples.md.

commit 4cc51cb
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Thu Sep 7 11:13:47 2023 +0200

    processor: saxon. Docs.

commit 0ac6d8e
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Thu Sep 7 09:23:54 2023 +0200

    Deal with string values correctly.

commit ad26da4
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Wed Sep 6 16:02:41 2023 +0200

    Real Parlamint TEI parses.

commit 4a51c6c
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Wed Sep 6 13:13:57 2023 +0200

    Basic Parlamint TEI test works.

commit 78374cd
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Thu Aug 31 14:24:09 2023 +0200

    All but annotations (?) pulled up.

commit 9bc8225
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Thu Aug 31 10:42:40 2023 +0200

    Tweaks, before separate punct/inline (VTD).

commit 8ecd57f
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Tue Aug 29 17:38:53 2023 +0200

    Pulled some methods up.

commit 7ec8be5
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Tue Aug 29 14:29:09 2023 +0200

    Looking quite similar now, about to pull more up.

commit 9189dc7
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Tue Aug 29 13:59:29 2023 +0200

    Pull metadata code up.

commit 3ddf45d
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Tue Aug 29 13:03:51 2023 +0200

    Refactor VTD/Saxon

commit 0b1d9cf
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Tue Aug 29 11:14:22 2023 +0200

    Refactor VTD/Saxon

commit d220669
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Thu Aug 24 18:18:51 2023 +0200

    Continue refactoring VTD indexer.

commit ae70e94
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Thu Aug 24 15:28:04 2023 +0200

    WIP simplify annotation/relation processing.

commit ab84529
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Thu Aug 24 14:30:30 2023 +0200

    Intermediate commit.

commit 5e97b2b
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Thu Aug 24 13:42:10 2023 +0200

    Intermediate commit.

commit 904035f
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Wed Aug 23 16:20:32 2023 +0200

    WIP refactoring DocIndexerXPath/VTD/Saxon.

commit fb9fa2b
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Wed Aug 23 11:50:43 2023 +0200

    Refactoring, trying to share code between Saxon/VTD.

commit 0e1fb75
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Tue Aug 22 16:23:00 2023 +0200

    Simplify inline tags/punct handling.

commit 2e210f5
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Tue Aug 22 16:08:53 2023 +0200

    More refactoring.

commit 56765b3
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Mon Aug 21 16:21:44 2023 +0200

    Refactored SaxonHelper into smaller classes.

commit 2476de4
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Mon Aug 21 16:05:05 2023 +0200

    Intermediate commit.

commit 07951a4
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Mon Aug 21 13:40:23 2023 +0200

    Start on Parlamint-TEI.
@jan-niestadt jan-niestadt marked this pull request as ready for review September 14, 2023 11:28
- CoNLL-U: don't index multitoken lines; index multitokens in mwt annotation.
- Warn about baseFormat deprecation.
- VTD XPath warning, named entity refs.
- Docs: Saxon, XPath.
- Remove unused subprops param. Cleanups.
- DocumentFormats refactor.

DocumentFormats keeps track of all loaded InputFormats.
InputFormat represents a format and can create DocIndexers.
FinderInputFormats can locate input formats. There's one
Finder for classes and one for user formats (BLS).
Format configs are located once and added, with the config
file lazy-loaded only when needed. Applications can add
formats manually too, as needed (e.g. IndexTool).
We keep track of errors with formats, either in the InputFormat
implementation or in the special InputFormatError class.
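
The division of labour described above (a registry, formats whose config is loaded lazily, finders that locate formats, and errors recorded on the format) can be sketched as follows. The class names mirror the ones mentioned in this commit message, but the methods and behaviour shown here are assumptions, not the Java implementation.

```python
class InputFormat:
    def __init__(self, identifier, load_config):
        self.identifier = identifier
        self._load_config = load_config  # called only when the config is needed
        self._config = None
        self.error = None                # remembers why loading failed, if it did

    def config(self):
        if self._config is None and self.error is None:
            try:
                self._config = self._load_config()
            except Exception as exc:
                self.error = str(exc)
        return self._config


class DocumentFormats:
    def __init__(self):
        self._formats = {}
        self._finders = []

    def add_finder(self, finder):
        # A finder maps an identifier to an InputFormat, or returns None
        self._finders.append(finder)

    def add(self, fmt):
        # Applications can also register formats manually
        self._formats[fmt.identifier] = fmt

    def get(self, identifier):
        if identifier not in self._formats:
            for finder in self._finders:
                fmt = finder(identifier)
                if fmt is not None:
                    self.add(fmt)
                    break
        return self._formats.get(identifier)


loads = []


def load_tei():
    loads.append("tei")
    return {"processor": "saxon"}


def load_broken():
    raise ValueError("bad yaml")


formats = DocumentFormats()
formats.add_finder(lambda name: InputFormat(name, load_tei) if name == "tei" else None)
formats.add(InputFormat("broken", load_broken))
tei = formats.get("tei")
```

Locating the `tei` format does not read its config; that happens on the first `config()` call, and a failing load is kept on the format object instead of being raised.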
@jan-niestadt jan-niestadt merged commit 473bc96 into dev Sep 14, 2023
3 checks passed
@jan-niestadt jan-niestadt deleted the feature/relations branch September 21, 2023 08:42
2 participants