Find some words near each other #396

Open
jan-niestadt opened this issue Feb 21, 2023 · 2 comments
@jan-niestadt
Member

jan-niestadt commented Feb 21, 2023

BlackLab (and CQL) don't currently support ordinary "near searches", e.g. "find dog, cat and hamster within 20 words of each other".

Lucene does support these kinds of searches though, even in span form, so this shouldn't be too difficult to add. We'd probably add a function to CQL, something like:

near(list_of_queries, slop, in_order)

so you could for example query like this:

near(list("dog", [lemma="cat"], [pos="ADJ"][word="hamster"]), 20, false)
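
To make the intended behaviour concrete, here is a minimal Python sketch of what such a near search could compute. It is not BlackLab or Lucene code. The helper `find_spans`, the span representation, and the way `slop` is counted (the number of unmatched tokens between the first and last clause) are all assumptions made for illustration; the real implementation may count slop differently.

```python
from itertools import product


def find_spans(tokens, pattern):
    """Return (start, end) spans where the word sequence `pattern` occurs.

    Hypothetical helper: stands in for running one clause query.
    """
    n = len(pattern)
    return [(i, i + n) for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == pattern]


def near(span_lists, slop, in_order):
    """Combine one span per clause when the clauses lie close together.

    Assumption: `slop` is the maximum number of tokens inside the combined
    hit that are not covered by any clause match.
    """
    hits = set()
    for combo in product(*span_lists):
        # Keep the clause order if required, otherwise sort by position.
        ordered = list(combo) if in_order else sorted(combo)
        # Reject overlapping matches and (for in_order) out-of-order matches.
        if any(a[1] > b[0] for a, b in zip(ordered, ordered[1:])):
            continue
        start, end = ordered[0][0], ordered[-1][1]
        matched = sum(e - s for s, e in combo)
        if end - start - matched <= slop:
            hits.add((start, end))
    return sorted(hits)


tokens = "the dog chased a small hamster while the cat slept".split()
clauses = [find_spans(tokens, [w]) for w in ("dog", "cat", "hamster")]
print(near(clauses, 20, False))  # [(1, 9)]
print(near(clauses, 20, True))   # [] because cat comes after hamster
```

This brute-force version tries every combination of clause matches, which is fine for illustration but would be far too slow on a real index.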
jan-niestadt added this to the v4.0 milestone Feb 21, 2023
jan-niestadt self-assigned this Feb 21, 2023
@jan-niestadt
Member Author

jan-niestadt commented Feb 22, 2023

BTW I had a look at CWB and Sketch Engine to see what the most compatible syntax would be.

Sketch Engine has a meet function that does something similar, e.g.:

(meet [tag = "N.*"] [tag = "VB.*"] -3 3)

This Lisp-like syntax makes it difficult to pass more complex queries (because whitespace is already used as the "sequence operator" in CQL, so we can't tell where one query ends and the next starts without extra parentheses). Also, it's probably less familiar to our users, who are more likely to know e.g. Python than Lisp.

CWB has several function-like syntaxes, e.g. /codist[...] for macros, A = intersection B C for set operations, dist(...) for constraints, MU(meet|union ...) for meet/union. Having all of these different syntaxes does not seem like a good idea to me.

I think simple imperative-style function calls as shown in my previous comment are the most pragmatic choice. This will make these kinds of features consistent and easy to use, at the cost of slightly worse CQL compatibility with other corpus engines. But as CQL is already a collection of dialects as opposed to a standard, I feel this is okay. We should document how BlackLab CQL differs from the most popular alternatives at some point.
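
As a rough illustration of why comma-separated function calls are easy to delimit, the following Python sketch splits a call into its arguments at top-level commas only. The function name `split_args` is hypothetical and this is not how BlackLab's CQL parser works; it also ignores escaped quotes inside strings.

```python
def split_args(call):
    """Split 'name(arg1, arg2, ...)' into (name, [args]) at top-level commas.

    Commas inside quotes, parentheses or square brackets are left alone.
    """
    name, _, rest = call.partition("(")
    if not rest.endswith(")"):
        raise ValueError("expected a closing parenthesis")
    body = rest[:-1]
    args, current, depth, in_string = [], [], 0, False
    for ch in body:
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch in "([":
                depth += 1
            elif ch in ")]":
                depth -= 1
            elif ch == "," and depth == 0:
                # A top-level comma ends the current argument.
                args.append("".join(current).strip())
                current = []
                continue
        current.append(ch)
    args.append("".join(current).strip())
    return name.strip(), args


query = 'near(list("dog", [lemma="cat"], [pos="ADJ"][word="hamster"]), 20, false)'
name, args = split_args(query)
print(name)                 # near
print(args[1:])             # ['20', 'false']
print(split_args(args[0]))  # ('list', ['"dog"', '[lemma="cat"]', '[pos="ADJ"][word="hamster"]'])
```

Because the comma never occurs as an operator between tokens in a sequence, the third list element, itself a two-token sequence, stays in one piece without extra parentheses.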

@jan-niestadt
Member Author

(More or less) "pluggable" extension functions have been implemented in the feature/relations branch, so this should probably be done there as well. We need to add support for list() to pass a list of values as a parameter (this should probably be a special operator for now), but other than that it's straightforward.
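
For readers unfamiliar with the idea, a pluggable function mechanism with a special-cased list() might look roughly like the Python sketch below. Everything here is hypothetical (the names `REGISTRY`, `extension_function`, `call_function` and the dictionary returned by `near_query`); it does not reflect the actual code in the feature/relations branch.

```python
# Hypothetical registry mapping query function names to implementations.
REGISTRY = {}


def extension_function(name):
    """Decorator that registers a query function under `name`."""
    def register(func):
        REGISTRY[name] = func
        return func
    return register


@extension_function("near")
def near_query(queries, slop, in_order):
    # Placeholder result: a real engine would build a span query here.
    return {"type": "near", "clauses": list(queries),
            "slop": slop, "in_order": in_order}


def call_function(name, *args):
    """Evaluate a function call; list() is handled as a special operator."""
    if name == "list":
        return list(args)
    if name not in REGISTRY:
        raise KeyError(f"unknown query function: {name}")
    return REGISTRY[name](*args)


clauses = call_function("list", '"dog"', '[lemma="cat"]')
print(call_function("near", clauses, 20, False))
```

Handling list() outside the registry mirrors the suggestion to treat it as a special operator for now, so that registered functions only ever receive ready-made argument values.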
