Added TextP coverage #269
spmallette committed Nov 15, 2023
1 parent f572270 commit c83d58c
Showing 2 changed files with 37 additions and 112 deletions.
132 changes: 24 additions & 108 deletions book/Section-Beyond-Basic-Queries.adoc
@@ -4587,35 +4587,38 @@ Using regular expressions to do fuzzy searches

Let's take a look at one case where use of closures might be helpful. It is a
common requirement when working with any kind of database to want to do some
sort of fuzzy text search or even to search using a regular expression. TinkerPop
itself does not provide direct support for this. In other words there
currently is no sophisticated text search method beyond the basic 'has()' type steps
we have looked at above. However, the underlying graph store can still expose
such capabilities.
sort of fuzzy text search or even to search using a regular expression. Gremlin
offers a series of text-related predicates for these types of searches. Where
standard predicates are part of the 'P' class, the text-specific predicates can be
found on the 'TextP' class.

NOTE: Most TinkerPop enabled graph stores that you are likely to use for any sort of
serious deployment will also be backed by an indexing technology like Solr or
Elasticsearch. In those cases some amount of more sophisticated search methods will
likely be made available to you. You should always check the documentation for the
system you are using to see what is recommended.

When working with TinkerGraph and the Gremlin console if we want to do any
sort of text search beyond very basic things like 'city == "Dallas"' then we
will have to fall back on the Lambda function concept to take advantage of
underlying Groovy and Java features. Note that even in graph
systems backed by a real index the examples we are about to look at should
still work but may not be the preferred way.

So let's look at some examples. First of all, every airport in the air routes
graph contains a description which will be something like 'Dallas Fort Worth
International Airport' in the case of DFW. If we wanted to search the vertices in
the graph for any airport that has the word 'Dallas' in the description we
could take advantage of the Groovy 'String.contains()' method and do it like this.
the graph for any airport whose description starts with the letter "D" we could use
the 'startingWith' predicate.

[source,groovy]
----
// Airport descriptions starting with 'D' - this is case sensitive
g.V().has('airport', 'desc', TextP.startingWith('D'))
----

NOTE: There is an analogous 'endingWith' predicate for testing the end of a string.

If we wanted to search the vertices in the graph for any airport that has the word
'Dallas' in the description we could use 'TextP.containing'.

[source,groovy]
----
// Airport descriptions containing the word 'Dallas'
g.V().hasLabel('airport').filter{it.get().property('desc').value().contains('Dallas')}
g.V().has('airport', 'desc', TextP.containing('Dallas'))
----
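
As a rough illustration of what these three predicates test, the following plain Java sketch applies the equivalent `String` methods to the DFW description quoted above. The class and method names are hypothetical and only mirror the predicate names; this is not TinkerPop's implementation, just a way to see the case-sensitive prefix, suffix and substring behaviour without a graph.

```java
// Hypothetical stand-ins for the text predicates, built on java.lang.String.
// This only illustrates the kind of comparison involved; it does not use TinkerPop.
final class TextPredicateSketch {

    // Analogue of 'startingWith': case-sensitive prefix test.
    static boolean startingWith(String value, String prefix) {
        return value.startsWith(prefix);
    }

    // Analogue of 'endingWith': case-sensitive suffix test.
    static boolean endingWith(String value, String suffix) {
        return value.endsWith(suffix);
    }

    // Analogue of 'containing': case-sensitive substring test.
    static boolean containing(String value, String fragment) {
        return value.contains(fragment);
    }

    public static void main(String[] args) {
        String desc = "Dallas Fort Worth International Airport";
        System.out.println(startingWith(desc, "D"));      // upper-case prefix
        System.out.println(startingWith(desc, "d"));      // lower-case prefix, differs by case
        System.out.println(endingWith(desc, "Airport"));
        System.out.println(containing(desc, "Dallas"));
    }
}
```

Because the comparisons are case sensitive, a lower-case 'd' does not match the description even though an upper-case 'D' does.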

Where things get even more interesting is when you want to use a regular
@@ -4629,103 +4632,16 @@ Dalcahue, Dalat and Dalanzadgad!.
[source,groovy]
----
// Using a filter to search using a regular expression
g.V().has('airport','type','airport').filter{it.get().property('city').value ==~/Dallas|Austin/}.values('code')
g.V().has('airport','type','airport').has('city', TextP.regex('Dallas|Austin')).values('code')
// A regular expression to find any airport with a city name that begins with "Dal"
g.V().has('airport','type','airport').filter{it.get().property('city').value()==~/^Dal\w*/}.values('city')
----

So in summary it is useful to know about closures and the way you can use them
with filters but as stated above - use them sparingly and only when a "pure
Gremlin" alternative does not present itself.

NOTE: We could actually go one step further and create a custom predicate (see
next section) that handles regular expressions for us.

[[pred]]
Creating custom tests (predicates)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TinkerPop comes with a set of built in methods that can be used for testing values.
These methods are commonly referred to as 'predicates'. Examples of existing Gremlin
predicates include methods like 'gte()', 'lte()' and 'neq()'. Sometimes, however, it
is useful to be able to define your own custom predicate that can be passed in to a
'has()', 'where()' or 'filter()' step as part of a Gremlin query.

The following example uses the Groovy closure syntax to define a custom predicate,
called 'f', that tests the two values passed in to see if 'x' is greater than twice
'y'. This new predicate can then be used as part of a 'has()' step by using it as a
parameter to the 'test()' method. When 'f' is called, it will be passed two
parameters. The first one will be the value returned in response to asking 'has()' to
return the property called 'longest'. The second parameter passed to 'f' will be the
value of 'a' that we provide. This is a simple example, but shows the flexibility
that Gremlin provides for extending the basic predicates.

[source,groovy]
g.V().has('airport','type','airport').has('city', TextP.regex('^Dal\\w*')).values('city')
----
// Find the average longest runway length.
a = g.V().hasLabel('airport').values('longest').mean().next()

// Define a custom predicate
f = {x,y -> x > y*2}
// Find airports with runways more than twice the average maximum length.
g.V().hasLabel('airport').has('longest',test(f,a)).values('code')
----

Creating a regular expression predicate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the previous section we used a closure to filter values using a regular
expression. Now that we know how to create our own predicates we could go one
step further and create a predicate that accepts regular expressions for us.

[source,groovy]
----
// Create our method
f = {x,y -> x ==~ y}
// Use it to find any vertices where the description string starts with 'Dal'
g.V().has('desc',test(f,/^Dal.*/)).values('desc')
----

We can actually go one step further and create a custom method called 'regex'
rather than use the 'test' method directly. If the following code seems a bit
unclear don't worry too much. It works and that may be all you need to know.
However if you want to understand the TinkerPop API in more detail the
documentation that can be found on the Apache TinkerPop web page explains things
like 'P' in detail. Also remember that Gremlin is written in Groovy/Java and we
take advantage of that here as well.

In the following example, rather than use 'test' directly we use the
'BiPredicate' functional interface that is part of Java 8. 'BiPredicate' is
sometimes referred to as a 'two-arity' predicate as it takes two parameters. We
will create an implementation of the interface called 'bp'. The interface
requires that we provide one method called 'test' that does the actual
comparison between two objects and returns a simple true or false result. Like
we did in the previous section we simply perform a regular expression compare
using the '==~' operator.

We can then use our 'bp' implementation to build a named closure that we will call
'regex'. TinkerPop includes a predicate class P that is an implementation of the Java
Predicate functional interface. We can use 'P' to build our new 'regex' method. We
can then pass 'regex' directly to steps like 'has'.

[source,groovy]
----
// Create a new BiPredicate that handles regular expression pattern matching
bp = new java.util.function.BiPredicate<String, String>() {
boolean test(String val, String pattern) {
return val ==~ pattern }}
// Create a new closure we can use for regular expression pattern matching.
regex = {new P(bp, it)}
// Use our new closure to find descriptions that start with 'Dal'. As this
// unwinds, the contents of 'desc' are passed to the test method as the first parameter
// and the regex pattern as the second parameter.
g.V().has('desc', regex(/^Dal.*/)).values('desc')
----
Gremlin adheres to the regex syntax prescribed by the Java `Pattern` class documented
at https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html.
The Java regular expression syntax may be different from the one you are used to, so
it is worth taking a few minutes to study the documentation at that URL.
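
One detail of the Java `Pattern` API worth seeing in isolation is the difference between matching a whole string and finding a match somewhere inside it. The sketch below uses only `java.util.regex`. The helper names are hypothetical and the test strings are made up for illustration. Which of the two behaviours a given predicate uses is something to confirm in its own documentation, so anchoring a pattern with '^' and '$' is a safe way to be explicit.

```java
import java.util.regex.Pattern;

// Illustrates whole-string matching versus partial matching with java.util.regex.
// Helper names and sample strings are illustrative assumptions only.
final class RegexMatchSketch {

    // True if the pattern matches anywhere inside the value.
    static boolean findsIn(String regex, String value) {
        return Pattern.compile(regex).matcher(value).find();
    }

    // True only if the pattern matches the entire value.
    static boolean matchesWhole(String regex, String value) {
        return Pattern.compile(regex).matcher(value).matches();
    }

    public static void main(String[] args) {
        // An alternation finds "Dallas" inside a longer string but does not match all of it.
        System.out.println(findsIn("Dallas|Austin", "North Dallas"));
        System.out.println(matchesWhole("Dallas|Austin", "North Dallas"));

        // The '^' anchor restricts the match to the start of the string.
        // Note the doubled backslash needed inside a Java string literal.
        System.out.println(findsIn("^Dal\\w*", "Dalat"));
        System.out.println(findsIn("^Dal\\w*", "North Dallas"));
    }
}
```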

[[graphvars]]
Using graph variables to associate metadata with a graph
17 changes: 13 additions & 4 deletions book/Section-Janus-Graph.adoc
@@ -1358,11 +1358,20 @@ Losuia
Regular expression predicates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is worth considering a bit of history when discussing regular expression
predicates for JanusGraph. JanusGraph introduced text-based predicates many years
before TinkerPop added them to Gremlin in version 3.6.0. As a result, there are
text predicates that are JanusGraph specific which have similar functionality to the
ones officially exposed by Gremlin itself. This section describes the
JanusGraph-specific text predicates. You can learn more about the official Gremlin
text predicates in the <<fuzzyregs,"Using regular expressions to do fuzzy searches">>
section.

The JanusGraph regular expression predicates recognize the syntax defined as part of
the Java 1.8 Pattern class that is documented at
https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html. The Java
regular expression syntax may be different than the one you are used to so it is
worth taking a few minutes to study the documentation at that URL.
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html.
The Java regular expression syntax may be different from the one you are used to, so
it is worth taking a few minutes to study the documentation at that URL.

The query below uses a 'textContainsRegex' predicate to search for any city name that
contains a word starting with 'for', while ignoring case.
@@ -1520,7 +1529,7 @@ Fuzzy search predicates
These predicates use the
https://en.wikipedia.org/wiki/Levenshtein_distance[Levenshtein distance] method to
decide if a piece of text is 'close enough' to the pattern being looked for. This is
based on assessing how many characterss would have to change in the pattern word to
based on assessing how many characters would have to change in the pattern word to
achieve a match in the text being inspected. For example 'pall' would match 'palm',
'paul' and 'palm'.
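
To make the idea of edit distance concrete, here is a minimal Java sketch of the classic dynamic-programming Levenshtein calculation. The class name is hypothetical, and the sketch only computes the distance between two words. How JanusGraph turns a distance into a match or no-match decision is not shown here and should be taken from the JanusGraph documentation.

```java
// Minimal Levenshtein distance using two rolling rows of the DP table.
// Illustrative only; this is not JanusGraph's fuzzy matching code.
final class LevenshteinSketch {

    // Number of single-character insertions, deletions or substitutions
    // needed to turn 'a' into 'b'.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning an empty prefix of 'a' into the first j characters of 'b' takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of 'a'
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the arrays for the next row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("pall", "palm"));
        System.out.println(distance("pall", "paul"));
    }
}
```

Both 'palm' and 'paul' differ from 'pall' by a single substituted character, which is why they count as close matches in the example above.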
