Relations between words #411
Merged
A relation is simply a labelled, directed arrow between two words (or word groups). We will use this primitive to implement dependency relation search, among other things. We've also updated how spans (e.g. inline XML tags) are indexed and searched to use this same primitive.

The tests now all use API version 4.0. See https://inl.github.io/BlackLab/server/rest-api/ for the (minor) differences.

In the integrated index format, the 'starttag' annotation (where spans used to be indexed) is now called '_relation'; both spans and word (dependency) relations are indexed there. Custom DocIndexer classes may need to be updated to use the new indexInlineTag method instead of adding values to the starttag annotation manually. If they are not updated, the DocIndexer will still mostly work, except for inline tags.

A parser for the CoNLL-U format, which includes dependency relations, was implemented.

A new QueryExtensions mechanism was created that can be used to add extension functions to CQL. This mechanism is now used to add existing debug functions like _FI1(), but also for the relation primitives such as rel(), rspan(), etc. Default parameter values are supported (omit them at the end or use _ in the middle).

Global constraints in CQL (:: operator) may now also be used within parentheses, making them local constraints.

Some bugs were fixed, including a bug in finding the doc length for any-token queries.
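To make the "labelled, directed arrow" idea concrete, here is a minimal Python sketch of the data model. It is an illustration only, not BlackLab's actual classes: the names `Span`, `Relation`, `inline_tag` and the relation class `dep` are hypothetical. The zero-length source/target and the `__tag` relation class for inline tags follow the description given elsewhere in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # first token position (inclusive)
    end: int    # position just past the last token (exclusive)


@dataclass(frozen=True)
class Relation:
    rel_class: str  # e.g. "dep" (hypothetical) or the special "__tag"
    rel_type: str   # label on the arrow, e.g. "det", or a tag name
    source: Span
    target: Span

    def points_forward(self) -> bool:
        # The arrow points forward if the source starts before the target.
        return self.source.start < self.target.start

    def full_span(self) -> Span:
        # Smallest span that covers both source and target.
        return Span(min(self.source.start, self.target.start),
                    max(self.source.end, self.target.end))


def inline_tag(name: str, start: int, end: int) -> Relation:
    # An inline tag modelled as a relation from its start to its end,
    # with zero-length source and target.
    return Relation("__tag", name, Span(start, start), Span(end, end))
```

With this model, a span such as a sentence tag and a dependency arrow such as `det` are both just `Relation` values, which is why one index annotation can hold both.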
… store captures. MatchInfo is now kept in the HitsInternal classes instead of in an external class (CapturedGroupsImpl, with map lookups). This is more efficient and allows us to easily take MatchInfo into account when comparing hits.
Two-phase iterators provide a way to go through Spans more efficiently, as they allow us to entirely skip documents that don't contain the required terms at all before actually fetching all the term lists. We've converted our Spans classes to enable two-phase iterators. We're also sharing more code between the various Spans classes. We've adopted Lucene's FilterSpans and ConjunctionSpans and created BL-versions of them (to use BLSpans and to be able to override methods marked as final in Lucene) and derived several of our Spans classes from them. Aside from the normal unit and integration tests, we've also run comparative tests on a larger corpus to make sure results are identical.
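The two-phase idea can be sketched in a few lines of Python. This is a simplified illustration, not the Lucene or BlackLab implementation: documents are plain token lists, and `two_phase_search` and `the_cat` are hypothetical names. Phase 1 is a cheap check that the required terms occur at all; phase 2 does the expensive positional matching only for documents that pass.

```python
def two_phase_search(docs, required_terms, find_matches):
    """docs maps doc id -> list of tokens; returns (hits, verified_docs)."""
    hits = []
    verified_docs = []  # documents that reached the expensive phase
    for doc_id in sorted(docs):
        tokens = docs[doc_id]
        # Phase 1 (approximation): skip documents missing a required term.
        if not required_terms.issubset(tokens):
            continue
        # Phase 2 (verification): positional matching within the document.
        verified_docs.append(doc_id)
        hits.extend((doc_id, pos) for pos in find_matches(tokens))
    return hits, verified_docs


def the_cat(tokens):
    # Positions where "the" is immediately followed by "cat".
    return [i for i in range(len(tokens) - 1)
            if tokens[i] == "the" and tokens[i + 1] == "cat"]
```

A document can pass phase 1 and still produce no hits (both terms occur, but not adjacent), which is why the second phase is still needed.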
SpansAnd didn't deal correctly with clauses that produce hits with the same doc, start and end (but different match info). Now it has a default, slower version that correctly deals with this, and the previous version (SpansAndSimple) is used where possible. SpansRepetition had a bug that skipped certain matches if the repetition consisted of non-consecutive hits. Again, there's now a slower implementation that correctly deals with this and SpansRepetitionSimple is the previous version that will be used when we know this problem cannot occur. A new guarantee, hitsCanOverlap, was added for this. Spans classes were documented in the BlackLab internals doc, and QueryTool was made more suitable for larger-scale correctness testing. SpansInBucketsPerDocumentSorted now uses sort indexes to sort hits, so match info is also sorted along with the starts/ends. Incurs a slight performance hit, but probably not too bad. Additional tests were added for the above.
- BLSpans and SpansInBuckets now also have guarantees like BLSpanQuery had (e.g. hitsAllSameLength, etc.). They are available through a method guarantees() that returns a SpanGuarantees object. This allows better optimization (e.g. avoid unnecessary sorting/uniqueing) and better validation (e.g. ensure that clauses are sorted).
- Care was taken to include match info when removing duplicate hits. So hits with the same doc, start and end are still considered distinct if they have different match info (captured groups or relations).
- Matching problems with SpansRelations (non-consecutive repetitions) and SpansAndNot (identical hits with different match info) were resolved, as well as a number of smaller bugs.
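The duplicate-removal rule described above (same doc, start and end, but different match info, means distinct hits) can be sketched as follows. The `Hit` type and `unique_hits` function are hypothetical and only illustrate which fields take part in the comparison.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    doc: int
    start: int
    end: int
    # Match info as a hashable tuple of (capture name, start, end) entries.
    match_info: Tuple[Tuple[str, int, int], ...] = ()


def unique_hits(hits: List[Hit]) -> List[Hit]:
    # Keep the first occurrence of each hit; the match info is part of the
    # identity, so hits that differ only in their captures are both kept.
    seen = set()
    result = []
    for hit in hits:
        if hit not in seen:
            seen.add(hit)
            result.append(hit)
    return result
```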
- No positionsCost if asTwoPhaseIt() never returns null.
- Flag indicates default source/target length.
- Moved relations-related methods from AnnotatedFieldNameUtil to RelationUtil.
- Implemented getRelationInfo() to get active relation info for all BLSpans classes.
- spanMode 'all' adjusts hits to cover all matched relations.
- Docs. QueryTool tweak.
Almost all of our SpanWeight objects should be cacheable as they don't rely on DocValues or global statistics etc. The only one we're not sure about is SpanWeightFiSeq, which might run into trouble with the global forward index API, which we plan to remove eventually. For now, we won't cache SpanWeightFiSeq.
- MatchInfo with captures, relations in BLS response. RelationInfo and SpanInfo both implement MatchInfo. RelationInfo is used both during indexing and matching and contains more data (source and target spans, relation type) than SpanInfo, which only represents a span during matching (e.g. a captured group).
- BLS response now includes relations, and Proxy handles them as well.
- Fix SpansInBuckets.setHitQueryContext being skipped. BLFilterDocsSpans: handle SpansInBuckets properly.
- Separate MatchInfo type INLINE_TAG. Inline tags are indexed as relations, but are kind of a special case. Their source and target are length 0 and the relation class is a special value, __tag, to avoid problems with "real" relations. It doesn't make sense to report these like relations, so we've made them a separate type that can be handled separately.
- Fixed test failures because inline tags were being reported as relations.
- Update plan-relations (rmatch).
- Improve CorpusQL parser definition: enable top-level NOT; separate capture and sequencepart in cql.jj.
- Fix relations getting incorrectly sorted.
- Fix active relation SpansInBuckets.
- Fix capturing multiple rels of same type.
- Query eq/hashCode. New rel(), rtree().
- Fix tests.
- rtree renamed to rmatch.
There were two duplicate problems with relations:
1. A relations query with a source (parent) and two target (children) clauses might match the same relation for both (e.g. it matches some relation A twice for one hit), but the user expects different relations.
2. The same type of query might produce two hits that have the same relations in a different order (e.g. one hit matches A and B and the other matches B and A). Again, these are duplicates to the user.

SpansAndMulti was created first to solve problem 1 by checking that matched relations are unique. Eventually SpansAndMultiUniqueRelations was created to return unique sets of relations.

Also:
- Fix bug in CoNLLU parser.
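Both duplicate problems can be illustrated with a short Python sketch. This is not how SpansAndMultiUniqueRelations is implemented; `unique_relation_sets` is a hypothetical helper that treats each hit as a tuple of matched relation identifiers.

```python
def unique_relation_sets(hits):
    """hits is a list of tuples of relation ids, one tuple per candidate hit."""
    seen = set()
    result = []
    for relations in hits:
        # Problem 1: the same relation matched twice within one hit.
        if len(set(relations)) != len(relations):
            continue
        # Problem 2: the same set of relations in a different order.
        key = frozenset(relations)
        if key in seen:
            continue
        seen.add(key)
        result.append(relations)
    return result
```

For example, given the candidates `("A", "A")`, `("A", "B")`, `("B", "A")` and `("A", "C")`, the first is dropped for reusing relation A and the third for repeating the set {A, B}.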
Global constraints: comparison ops, start/end(). These allow us to compare capture ordering (e.g. to see if a relation points forward or backward), e.g. using something like A:_ --> B:_ :: start(A) < start(B) will find forward relations.
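As a rough illustration of what a constraint like `start(A) < start(B)` checks, the sketch below filters hits by comparing the start positions of two captures. The representation of a hit as a dict of capture name to `(start, end)` and the function names are assumptions made for this example, not BlackLab's API.

```python
def start(capture):
    # Start token position of a capture given as (start, end).
    return capture[0]


def end(capture):
    # End token position of a capture given as (start, end).
    return capture[1]


def filter_forward(hits):
    # Keep only hits whose capture A starts before capture B,
    # i.e. the equivalent of the constraint start(A) < start(B).
    return [caps for caps in hits if start(caps["A"]) < start(caps["B"])]
```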
We switched to indexing relations at the source because the relation operator yields the source, so we won't have to sort the hits in that case; they will already be sorted. A matching bug was fixed. Many assertions were added to the Spans classes to make detecting and tracking down matching bugs easier.
Relation operators were added to CorpusQL, including one to match root relations and a negation. Inline tags are now captured and included in the response, with tag name and attributes.

Also:
- Fix MatchFilterCompare.hashCode/equals.
- Fix SpanQueryRelations.equals/hashCode.
- Fix AND/[] bug. Add regex/any rel op.
- AND optimization: remove match all if all clauses same length.
- Fix AND/[] optimization, RelationSpanAdjust bug.
- Various optimizations. Improve guarantees.
- Add -v option (verbose) to QueryTool.
- Fix HitPropertyCaptureGroup.
- CoNLL-U: show line number on error.
- Fix NPE in QueryTool.
- Fix relation class prefix only applying to part of regex.
- Fix rel() not sorting clauses.
- Refactor SpanQueryAndNot.rewrite(). Lift captures.
- SpansConstraint should ask only for clause's match info.
- Fix group by capture by retrieving from FI directly.
- Document query rewriting.
- Fixes, assertions.
- Clear atFirstInCurrentDocument in twoPhaseCurrentDocumentMatches().
- More consistently name atFirstInCurrentDocument.
- Bugfix in SpansSequenceWithGap constructor.
- Added assertions to Spans classes.
- Fix bug with rspan if no relations matched.
Because our Parlamint data set contains inconsistent Unicode normalization (some canonical accented characters, some decomposed), we now normalize to canonical in the CoNLL-U indexer. Maybe all indexers should do this? Inconsistent normalization causes issues in TermsReaders, probably because we use TERTIARY collator instead of IDENTICAL. Also adds (disabled) code to validate the terms sort on init.

Squashed commit of the following:

commit ca15b4c3fd170639a8419d76a2014370d340d839 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Tue Jul 4 13:21:00 2023 +0200 Normalize values while indexing CoNLL-U (Parlamint).
commit 9a3cc08 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Mon Jul 3 16:50:59 2023 +0200 WIP figure out why term sorting seems broken.
context replaces wordsaroundhit. You can specify 5 (5 words before and after), 5:10 (5 before, 10 after), or s (sentences).
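The three forms of the context value can be read as follows. This parser is only a sketch of the syntax described above; `parse_context` and the shape of its return value are hypothetical and say nothing about how BLS parses the parameter internally.

```python
def parse_context(value: str) -> dict:
    # "s" asks for whole sentences as context.
    if value == "s":
        return {"unit": "sentence"}
    # "5" means 5 words before and after; "5:10" means 5 before, 10 after.
    before, sep, after = value.partition(":")
    n_before = int(before)
    n_after = int(after) if sep else n_before
    return {"unit": "words", "before": n_before, "after": n_after}
```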
- We now remove unprintable characters like zero-width space and soft hyphen while indexing.
- We ensure canonical Unicode composition.
- Instead of String.trim() we use StringUtil.trimWhitespace() to trim all space characters from start and end.
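The three cleanup steps can be sketched in Python with the standard `unicodedata` module. This is an analogue of the behaviour described above, not the BlackLab code: `clean_value` is a hypothetical name, the set of removed characters here contains only the two examples mentioned, and "canonical composition" is assumed to mean NFC.

```python
import unicodedata

# Characters removed while indexing: zero-width space and soft hyphen.
UNPRINTABLE = {"\u200b", "\u00ad"}


def clean_value(value: str) -> str:
    # 1. Drop unprintable characters.
    value = "".join(ch for ch in value if ch not in UNPRINTABLE)
    # 2. Normalize to canonical composition (NFC), so a decomposed
    #    "e" + combining acute becomes the single character "é".
    value = unicodedata.normalize("NFC", value)
    # 3. Trim all Unicode whitespace (not just ASCII) from both ends.
    return value.strip()
```

After this step, a composed and a decomposed spelling of the same word yield identical strings, so they end up as the same term.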
Lucene's optional ~ operator can be used to find relations whose type doesn't match something. Useful for exclusions. We haven't enabled this everywhere because NFA matching would also have to support this operator. Right now the regexes are assumed to be compatible, which is not strictly true (e.g. escaping rules for Lucene are stricter than for Java's regex engine), but this would present real challenges.
- HitPropertyCaptureGroup equals/hashCode.
- Proxy: handle & forward POST requests.
- Fix markdown layout.
rcapture() captures all relations in a captured span, e.g. rcapture(<s/>, 's', 'rels') captures sentences and their relations. It relies on a capture 's' existing, which it will if your query contains <s/>. You can also explicitly name your capture, e.g. rcapture('the' within X:<s/>, 'X', 'rels').

Also:
- allow _ (don't care / default) to be used with relation operator.
- simplify CQL parser definition a little bit.
Changes:
- for requests, before/after also work (as synonyms for left/right), but BLS response still calls them left/right so we don't break compatibility.
- complex needsContext/contextIndices removed. HitProperty will fetch its own context when needed. Call disposeContext() to free memory (standard group/sort/filter does this automatically).
- before:word:i:3 now uses 3 words before hit. wordleft/wordright now translate into before/after with n==1.
- ctx:word:i:E1 is a single-part version of context:word:i:H1;E1 (latter will be translated into multiple of the former; uses new class HitPropertyContextPart).
- classes HitPropertyContextWords/WordLeft/WordRight were removed (no longer needed because of the translations mentioned).
- HitPropertyCaptureGroup should be faster as it now batches context fetching just like the other classes do.
- Add HP(Before|After)Hit. Deprecate HP(Left|Right)Context.

This continues the process of making BlackLab more reading-direction agnostic. Of course, BLS responses still refer to left/right for compatibility. That will be addressed in a future version of the API.
You can now explicitly name relation captures using e.g. A:-det-> to capture these relations under the name A. Similarly, you can name captured tags, e.g. A:<s/> will capture not just the span start/end but also the tag name and attributes. Switched relation operators to single dash, e.g. -det->.

Also:
- update BARKs.
- fix and re-enable ignored tests.
- fix proxy. Match info as map.
- unified matchInfo with simple, verbose XML.
- update preindexed external files index for test.
Switch current API to v4. Add experimental v5 (stricter/cleaner). You can set the default API version to use or switch per request using the api parameter (api=4 / api=5-exp).

The goal for API v5 is to improve the API by removing and changing endpoints, parameters and response structure. For now, v4 provides a transitional API that includes stuff from both v3 and v5. Anything related to v5 should still be considered experimental.

The most important change in v4/5 is the addition of /corpora/CORPUSNAME/xxx endpoints that work slightly (not radically) differently. v4 also retains the /CORPUSNAME/xxx endpoints, but those are removed in v5. v5 also changes XML output to a format that's easier to maintain. XML is not used as often as JSON, but we would like to keep supporting it without too much extra work.

See api-versions.md for more info.
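Selecting the API version per request then comes down to adding the api parameter to a URL under /corpora/CORPUSNAME/. The helper below is a hypothetical client-side sketch: the base URL, the corpus name and the `hits` operation are assumptions chosen for the example; only the `/corpora/` prefix and the `api` parameter values come from the description above.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def corpus_url(base: str, corpus: str, operation: str,
               api: Optional[str] = None, **params: str) -> str:
    # Build a /corpora/CORPUSNAME/<operation> URL, optionally choosing
    # the API version for this request only (e.g. api="4" or "5-exp").
    query = dict(params)
    if api is not None:
        query["api"] = api
    url = f"{base.rstrip('/')}/corpora/{quote(corpus)}/{quote(operation)}"
    return f"{url}?{urlencode(query)}" if query else url
```

For instance, `corpus_url("http://localhost:8080/blacklab-server", "mycorpus", "hits", api="5-exp", patt='"the"')` yields a URL whose query string is `patt=%22the%22&api=5-exp`.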
Refactored ResponseStreamer so hit and snippet use the same code and return the same structure, including e.g. matchInfo. Snippet also supports context=s to get the whole sentence. Made sure the max snippet size is enforced for context=s as well. Relations can be captured with snippet, e.g. all relations in a sentence. Tests were split per index type, mainly because integrated will return more matchInfo than classic external. Also fixed NPE in Hits.java.
Search response summary now includes a pattern object with a json key that contains a JSON structure corresponding to the query. The patt parameter can also take JSON now. There is a new endpoint, CORPUS/parse-pattern, that will just parse a (CorpusQL or JSON) pattern. Useful for e.g. implementing query builders.

Also:
- CI test results are stable now (object keys are stored in alphabetical order).
- latest-test-output is stored (.gitignored); this allows us to easily compare mismatches and update the saved-responses if necessary.
- add TextPatternRelationTarget/Match.
- deprecate TextPatternAnnotation/Sensitive (now part of ~Regex/Term; simplifies structure, closer to CorpusQL).
- deprecate TextPatternPrefix/Wildcard (just use regex).
- deprecate a few other TextPattern classes.
- removed already-deprecated double dash relation operators (e.g. --det-->, ---->, etc.).
- Improve unique capture name assignment.
- Add rspan(..,'all') automatically.
- Update plan-relations (index at source).
It happens when annotatedfield doesn't contain any annotations. When indexing linked metadata, a dummy annotatedField "metadata" is created. Several parts in the code check for this and ignore it. But IndexMetadataExternal (and perhaps internal too?) let it through. This seems to be required because the IndexMetadata object is used to pre-create the contentstore. If we don't do that, the content store for the metadata won't exist, and the indexing process will throw an exception. So for now, just let it through, and ignore it when serializing. A more thorough solution might have to be implemented in the future. (We probably DO want to have the field exist in the IndexMetadata object, or indexing new documents into the same corpus later won't work?)
This also triggered a problem where it would be concluded that there weren't enough hits for a full window (of, say, 20 hits), and then decided that we would just return all the available hits, which in this case turned out to be thousands.
The Saxon indexer should now have full feature parity with the VTD one. Use `processor: saxon` at the top-level of your .blf.yaml file to use Saxon. Saxon is faster and supports XPath 3.1. It is also possible to index relations using standoffAnnotations now.

Squashed commit of the following:

commit 4bf5782ee35915bcf5eaf79e9e1647f9ace7691a Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Thu Sep 7 15:23:48 2023 +0200 Update changelog.md.
commit ba32ca9 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Thu Sep 7 14:27:49 2023 +0200 Small edit in xpath-examples.md.
commit 4cc51cb Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Thu Sep 7 11:13:47 2023 +0200 processor: saxon. Docs.
commit 0ac6d8e Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Thu Sep 7 09:23:54 2023 +0200 Deal with string values correctly.
commit ad26da4 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Wed Sep 6 16:02:41 2023 +0200 Real Parlamint TEI parses.
commit 4a51c6c Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Wed Sep 6 13:13:57 2023 +0200 Basic Parlamint TEI test works.
commit 78374cd Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Thu Aug 31 14:24:09 2023 +0200 All but annotations (?) pulled up.
commit 9bc8225 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Thu Aug 31 10:42:40 2023 +0200 Tweaks, before separate punct/inline (VTD).
commit 8ecd57f Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Tue Aug 29 17:38:53 2023 +0200 Pulled some methods up.
commit 7ec8be5 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Tue Aug 29 14:29:09 2023 +0200 Looking quite similar now, about to pull more up.
commit 9189dc7 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Tue Aug 29 13:59:29 2023 +0200 Pull metadata code up.
commit 3ddf45d Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Tue Aug 29 13:03:51 2023 +0200 Refactor VTD/Saxon
commit 0b1d9cf Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Tue Aug 29 11:14:22 2023 +0200 Refactor VTD/Saxon
commit d220669 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Thu Aug 24 18:18:51 2023 +0200 Continue refactoring VTD indexer.
commit ae70e94 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Thu Aug 24 15:28:04 2023 +0200 WIP simplify annotation/relation processing.
commit ab84529 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Thu Aug 24 14:30:30 2023 +0200 Intermediate commit.
commit 5e97b2b Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Thu Aug 24 13:42:10 2023 +0200 Intermediate commit.
commit 904035f Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Wed Aug 23 16:20:32 2023 +0200 WIP refactoring DocIndexerXPath/VTD/Saxon.
commit fb9fa2b Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Wed Aug 23 11:50:43 2023 +0200 Refactoring, trying to share code between Saxon/VTD.
commit 0e1fb75 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Tue Aug 22 16:23:00 2023 +0200 Simplify inline tags/punct handling.
commit 2e210f5 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Tue Aug 22 16:08:53 2023 +0200 More refactoring.
commit 56765b3 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Mon Aug 21 16:21:44 2023 +0200 Refactored SaxonHelper into smaller classes.
commit 2476de4 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Mon Aug 21 16:05:05 2023 +0200 Intermediate commit.
commit 07951a4 Author: Jan Niestadt <jan.niestadt@ivdnt.org> Date: Mon Aug 21 13:40:23 2023 +0200 Start on Parlamint-TEI.
- CoNLL-U: don't index multitoken lines; index multitokens in mwt annotation.
- Warn about baseFormat deprecation.
- VTD XPath warning, named entity refs.
- Docs: Saxon, XPath.
- Remove unused subprops param. Cleanups.
- DocumentFormats refactor.

DocumentFormats keeps track of all loaded InputFormats. InputFormat represents a format and can create DocIndexers. FinderInputFormats can locate input formats. There's one Finder for classes and one for user formats (BLS). Format configs are located once and added (with the config file lazy-loaded only when needed). Applications can add formats manually too, as needed (e.g. IndexTool). We keep track of errors with formats, either in the InputFormat implementation or in the special InputFormatError class.
Allows searching for (dependency) relations between words.
More details in plan-relations.md.
Done (relations):
- … `_relation` annotation (which replaces the ill-named `starttag` annotation)
- … `matchinfo` parameter with value `captures` to get the old behaviour (for compatibility).
- Global constraints (`::` operator) may now use e.g. `start(A) > start(B)` to check the ordering of captures. `end(A)` also works. Previously `::` constraints were only allowed at the top level of the query, but they may now also occur in parenthesized subqueries, as long as they don't refer to groups captured outside that subquery.
- … `[]{m,n}`, e.g. `[]` for one word, are resolved using a new length filter instead of finding all n-grams (this is used for any applicable AND(NOT) query). If you capture the source/target specified as an n-gram, this should be optimized as well by lifting the capture operation so the AND can be optimized as described.
- … `_ A:-det-> _` to capture all `det` relations as `A` and `A:<s/>` to capture all sentences as `A`. The latter is a slightly incompatible change as this would formerly have been captured as a regular span, but is now captured as the inline tag (with tag name and attributes, in `inlineTags` section in BLS). If you really need the old behaviour, you can use `A:_ident(<s/>)`. `_ident()` is a debug function that will not affect the meaning of the query, just prevent certain query rewrites from being applied.
- The third parameter of `rel()` can now be used to name the capture group as well. If the empty string is used, a name is automatically assigned (default behaviour). The 4th param is the direction filter that used to be the 3rd param.
- … `rcapture(sen:<s/>, 'sen', 'sen-rels')`
- `_` is now the "don't care" / "default" value for the relation operator(s) as well as for function calls. So `_ --> _` now means "all relations". That is, `_` is interpreted as `[]*` here.
- … /snippet operation does this automatically if you specify `context=s`.
- … `-->`), each will get a unique capture name if not assigned explicitly in the query.
- … `rspan()` call as root, automatically add `rspan(..., 'all')`
Done (other):
- `context=s` will return a whole sentence as context; `context=5:10` will return 5 words before and 10 after the hit (`context` replaces `wordsaroundhit`, which still works for now). Also works for /snippet.
- … `/corpora/NAME` endpoints), experimental API v5 is new API (removes deprecated v3 stuff, changes some keys, some structures, XML more in line with JSON). Use `api=3|4|exp` to select API to use. Don't rely on any of the new API stuff until BlackLab 4.0 is released.
- … `patt` parameter, and the JSON structure for CQL query will be returned in the response as well. There's also a new endpoint `/parse-pattern` that can parse a CQL query into a JSON response and vice versa. Useful for implementing a query builder.

TODO (see issue #449):
- MAYBE … `rcapture` alternative? E.g. `"de" within (<s/> collect -->)`
- … `rmatch` if relation types in query are already distinct?
- … `rspan(A:rel('flat', _, 'target'), 'source')` you could write `rcap(rel('flat'), 'A', 'target')` ("capture target of rel('flat') as A, but resulting span is still the same as rel('flat')"), so `rcap(query, name, spanMode='target')`
Closes #405 and #201.