Change alignment operator ==>TGT to look at overlapping relations.
Before, it would only find relations contained within the source and target spans;
now it finds any relations that overlap both the source and target spans.
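The difference between the old and new behaviour can be sketched as two predicates. This is only an illustration: the class and method names are hypothetical, and it assumes spans are half-open intervals (start inclusive, end exclusive), which is consistent with the comparisons in the diff below.

```java
class SpanOverlapSketch {

    /** Old rule: the relation's span must lie entirely inside the hit span. */
    static boolean containedIn(int relStart, int relEnd, int hitStart, int hitEnd) {
        return relStart >= hitStart && relEnd <= hitEnd;
    }

    /** New rule: the relation's span only needs to share at least one position with the hit span. */
    static boolean overlaps(int relStart, int relEnd, int hitStart, int hitEnd) {
        return relEnd > hitStart && relStart < hitEnd;
    }

    public static void main(String[] args) {
        // Hypothetical data: a one-word hit at position 3 and a
        // sentence-level relation whose source covers positions 0..9.
        System.out.println(containedIn(0, 10, 3, 4)); // false: the old rule skips the relation
        System.out.println(overlaps(0, 10, 3, 4));    // true: the new rule captures it
    }
}
```

Under the containment rule, a relation wider than the hit (such as a sentence alignment around a single-word hit) is never captured; under the overlap rule it is.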

Also fixed a parser tokenizer bug with the alignment operator, where two
alignment operators in a single query would be parsed as a single very long
alignment operator. Some characters, such as spaces and quotes, are now
disallowed inside relation operator regexes to prevent problems like this.
If this proves to be a problem, we could add a special version like =/.../=>
that allows anything, but it's probably not necessary.
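The tokenizer problem can be reproduced with plain java.util.regex. The two patterns below are assumptions made for illustration, not the token definitions from the actual grammar: LOOSE allows any character between the "=" and the "=>", while STRICT disallows whitespace, quotes and square brackets.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlignmentOperatorTokenSketch {

    // Hypothetical "loose" operator pattern: anything may appear inside the operator.
    static final Pattern LOOSE = Pattern.compile("=.*?=>");

    // Hypothetical "strict" operator pattern: no whitespace, quotes or square brackets inside.
    static final Pattern STRICT = Pattern.compile("=[^\\s'\"\\[\\]]*=>");

    /** Returns the first substring of the query that the pattern matches, or null if there is none. */
    static String firstMatch(Pattern pattern, String query) {
        Matcher m = pattern.matcher(query);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String query = "[word='the'] =test=> _";
        System.out.println(firstMatch(LOOSE, query));  // ='the'] =test=>
        System.out.println(firstMatch(STRICT, query)); // =test=>
    }
}
```

With the loose pattern, the match starts at the "=" in word='the' and runs up to the real operator's "=>", which is the failure described above. Disallowing quotes and spaces makes that first attempt fail immediately, so only =test=> is recognised.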

Squashed commit of the following:

commit 8271393
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Mon May 27 15:17:21 2024 +0200

    Guard against null config.

commit 7172e15
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Tue May 21 11:25:08 2024 +0200

    Fix parser tokenizer bug (alignment operator).

    Alignment operator was too "loose", i.e. in the query
    [word='the'] =test=> _, "='the'] =test=>" was seen as an
    alignment operator.

    Several special characters are now disallowed inside the operator
    regular expression to prevent this problem. Relation types
    should generally consist of normal characters, not quotes, spaces,
    brackets, etc. If really needed, we could add an option to put
    regexes between =/.../=> , but that's probably not necessary.

commit 796cc5e
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Tue May 21 10:19:30 2024 +0200

    Enable DEBUG_PARSER.

commit 6de7c86
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Fri May 17 20:33:07 2024 +0200

    Add failing test.

commit 823f2c8
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Fri May 17 13:28:13 2024 +0200

    Alignment operator source adjust.

commit 3628f9f
Author: Jan Niestadt <jan.niestadt@ivdnt.org>
Date:   Thu May 16 16:01:31 2024 +0200

    Alignment operator looks at overlapping relations.
jan-niestadt committed Jun 3, 2024
1 parent e270083 commit 1641331
Showing 5 changed files with 222 additions and 146 deletions.
@@ -12,13 +12,13 @@
/**
* Captures all relations between the hits from two clauses.
*
* This is used to capture cross-field relations in a parallel corpus.
* This is used to capture cross-field (alignment) relations in a parallel corpus.
*
* @@@ PROBLEM: right now, subsequent spans from the source spans may not overlap!
* FIXME ? right now, subsequent spans from the source spans may not overlap!
* If they do overlap, some relations may be skipped over.
* We should cache (some) relations from the source span so we can be sure we return all
* of them, even if the source spans overlap. Use SpansInBuckets or maybe a rewindable
* Spans view on top of that class?
* Spans view on top of that class? See @@@ comment below for a possible solution.
*/
class SpansCaptureRelationsBetweenSpans extends BLFilterSpans<BLSpans> {

@@ -117,6 +117,12 @@ public String toString() {
/** List of relations captured for current hit */
private List<RelationInfo> capturedRelations = new ArrayList<>();

/** Start of current (source) hit (covers all sources of captured relations) */
private int adjustedStart;

/** End of current (source) hit (covers all sources of captured relations) */
private int adjustedEnd;

/**
* Construct a SpansCaptureRelationsWithinSpan.
*
@@ -128,6 +134,24 @@ public SpansCaptureRelationsBetweenSpans(BLSpans source, List<Target> targets) {
this.targets = targets;
}

@Override
public int startPosition() {
if (atFirstInCurrentDoc)
return -1;
if (startPos == -1 || startPos == NO_MORE_POSITIONS)
return startPos;
return adjustedStart;
}

@Override
public int endPosition() {
if (atFirstInCurrentDoc)
return -1;
if (startPos == -1 || startPos == NO_MORE_POSITIONS)
return startPos;
return adjustedEnd;
}

@Override
protected FilterSpans.AcceptStatus accept(BLSpans candidate) throws IOException {
// Prepare matchInfo so we can add captured relations to it
@@ -139,33 +163,45 @@ protected FilterSpans.AcceptStatus accept(BLSpans candidate) throws IOException
candidate.getMatchInfo(matchInfo);

// Find current source span
int sourceStart = startPosition();
int sourceEnd = endPosition();
int sourceStart = candidate.startPosition();
int sourceEnd = candidate.endPosition();

// Our final (source) span will cover all captured relations.
adjustedStart = sourceStart;
adjustedEnd = sourceEnd;

for (Target target: targets) {
// Capture all relations with source inside this span.

// Capture all relations with source overlapping this span.
capturedRelations.clear();
int targetPosMin = Integer.MAX_VALUE;
int targetPosMax = Integer.MIN_VALUE;
int docId = target.relations.docID();
if (docId < candidate.docID())
docId = target.relations.advance(candidate.docID());
if (docId == candidate.docID()) {
// @@@ TODO: make rewindable Spans view on top of SpansInBucketsPerDocument for this?

// @@@ make rewindable Spans view on top of SpansInBucketsPerDocument for this?
// (otherwise we might miss relations if the source spans overlap)
// if (target.relations.startPosition() > sourceStart)
//
// // Rewind relations if necessary
// if (target.relations.endPosition() > sourceStart)
// target.relations.rewindStartPosition(sourceStart);
if (target.relations.startPosition() < sourceStart)
target.relations.advanceStartPosition(sourceStart);

// Advance relations such that the relation source end position is after the
// current start position (of the query source), i.e. they may overlap.
while (target.relations.endPosition() <= sourceStart) {
if (target.relations.nextStartPosition() == NO_MORE_POSITIONS)
break;
}
while (target.relations.startPosition() < sourceEnd) {
if (target.relations.endPosition() <= sourceEnd) {
// Source of this relation is inside our source hit.
RelationInfo relInfo = target.relations.getRelationInfo().copy();
capturedRelations.add(relInfo);
// Keep track of the min and max target positions so we can quickly reject targets below.
targetPosMin = Math.min(targetPosMin, relInfo.getTargetStart());
targetPosMax = Math.max(targetPosMax, relInfo.getTargetEnd());
}
// Source of this relation overlaps our source hit.
RelationInfo relInfo = target.relations.getRelationInfo().copy();
capturedRelations.add(relInfo);
// Keep track of the min and max target positions so we can quickly reject targets below.
targetPosMin = Math.min(targetPosMin, relInfo.getTargetStart());
targetPosMax = Math.max(targetPosMax, relInfo.getTargetEnd());

target.relations.nextStartPosition();
}
}
@@ -190,10 +226,11 @@ protected FilterSpans.AcceptStatus accept(BLSpans candidate) throws IOException
target.targetField);
}

updateSourceStartEndWithCapturedRelations(); // update start/end to cover all captured relations
continue;
}

// Find the smallest target span that covers the highest number of the relations we just captured.
// Find the smallest target span that overlaps the highest number of the relations we just captured.
int targetDocId = target.target.docID();
if (targetDocId < candidate.docID()) {
targetDocId = target.target.advance(candidate.docID());
@@ -213,9 +250,9 @@ protected FilterSpans.AcceptStatus accept(BLSpans candidate) throws IOException
continue;
}
// There is some overlap between the target span and the relations we captured.
// Find out which relations are inside this target span, so we can pick the best target span.
// Find out which relations overlap this target span, so we can pick the best target span.
int relationsCovered = (int) capturedRelations.stream()
.filter(r -> r.getTargetStart() >= targetStart && r.getTargetEnd() <= targetEnd)
.filter(r -> r.getTargetEnd() > targetStart && r.getTargetStart() < targetEnd)
.count();
int length = targetEnd - targetStart;
if (relationsCovered > targetRelationsCovered
@@ -229,13 +266,15 @@ protected FilterSpans.AcceptStatus accept(BLSpans candidate) throws IOException
// A valid hit must have at least one matching relation in each target.
return FilterSpans.AcceptStatus.NO;
}
// Only keep the relations that match the target span we found.
// Only keep the relations that overlap the target span we found.
int finalTargetIndex = targetIndex;
capturedRelations.removeIf(r -> r.getTargetStart() < target.target.startPosition(finalTargetIndex)
|| r.getTargetEnd() > target.target.endPosition(finalTargetIndex));
capturedRelations.removeIf(r -> r.getTargetEnd() <= target.target.startPosition(finalTargetIndex)
|| r.getTargetStart() >= target.target.endPosition(finalTargetIndex));
capturedRelations.sort(RelationInfo::compareTo);
matchInfo[target.captureAsIndex] = RelationListInfo.create(capturedRelations, getOverriddenField());
target.target.getMatchInfo(finalTargetIndex, matchInfo); // also perform captures on the target

updateSourceStartEndWithCapturedRelations(); // update start/end to cover all captured relations
} else {
// Target document has no matches. Reject this hit.
return FilterSpans.AcceptStatus.NO;
@@ -245,6 +284,17 @@ protected FilterSpans.AcceptStatus accept(BLSpans candidate) throws IOException
return FilterSpans.AcceptStatus.YES;
}

private void updateSourceStartEndWithCapturedRelations() {
// Our final (source) span will cover all captured relations, so that
// e.g. "the" =sentence-alignment=>nl "de" will have the aligned sentences as hits, not just single words.
capturedRelations.forEach(r -> {
if (r.getSourceStart() < adjustedStart)
adjustedStart = r.getSourceStart();
if (r.getSourceEnd() > adjustedEnd)
adjustedEnd = r.getSourceEnd();
});
}

@Override
public String toString() {
return "==>(" + in + ", " + targets + ")";
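
The span adjustment done by updateSourceStartEndWithCapturedRelations() above can be tried in isolation with the following stand-alone sketch. It is not the BlackLab code: the class name, the expand method and the int[] {sourceStart, sourceEnd} representation of a relation are all hypothetical simplifications.

```java
import java.util.List;

class SourceSpanExpansionSketch {

    /**
     * Widens the hit [hitStart, hitEnd) so that it covers the source span of
     * every captured relation. Each relation is given as {sourceStart, sourceEnd}.
     */
    static int[] expand(int hitStart, int hitEnd, List<int[]> relationSources) {
        int start = hitStart;
        int end = hitEnd;
        for (int[] r : relationSources) {
            start = Math.min(start, r[0]);
            end = Math.max(end, r[1]);
        }
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        // Hypothetical data: a one-word hit at position 3 and a captured
        // sentence alignment whose source covers positions 0..9.
        int[] expanded = expand(3, 4, List.of(new int[] { 0, 10 }));
        System.out.println(expanded[0] + "-" + expanded[1]); // 0-10
    }
}
```

This mirrors the intent stated in the method's comment: for a query like "the" =sentence-alignment=>nl "de", the reported hit becomes the aligned sentence rather than the single word.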