Gremlin Recipes: 9 – Pattern Matching
Part 9 of 10 for the series Gremlin Recipes. The purpose is to explain the internal of Gremlin and give people a deeper insight into the query language to master it.
This blog post is the 9th from the series Gremlin Recipes. It is recommended to read the previous blog posts first:
- Gremlin as a Stream
- SQL to Gremlin
- Recommendation Engine traversal
- Recursive traversals
- Path object
- Projection and selection
- Variable handling
- sack() operator
I KillrVideo dataset
To illustrate this series of recipes, you need first to create the schema for KillrVideo and import the data. See here for more details.
The graph schema of this dataset is :
II Pattern Matching in Gremlin
A. Description
Until now we always use descriptive traversals with Gremlin. By descriptive traversals I mean we describe all the navigation in the graph using standard steps like in(), out(), ...
However, Gremlin also offers a declarative way to do traversals, using pattern matching. Basically you declare all the properties of your vertices/edges and alias them with label then Gremlin will compute the most appropriate traversals to match your pattern.
A pattern-matching traversal uses the match()
step and look like:
gremlin>g.V().match(
__.as("label")....., // predicate 1
__.as("label")....., // predicate 2
...
__.as("label").where(...) // matching predicate
)
...
Please note that each predicate must start with a labelling step as()
. Each predicate can end with a labelling step as()
but it is optional
All the labels you declare in your predicates are global and can be re-called outside of the match()
step. Those labels also serve the purpose of matching different predicates together
There are a few things you should know to exploit fully pattern matching with Gremlin:
- all the predicates inside the
match()
are executed concurrently and independently. Consequently the lexical order of the declared predicates does not matter during execution time - variables(alias) binding need to hold true across multiple predicate. The traversers that succeeded all the variable findings in all predicate are the ones that survive the declared pattern and thus are the result of the
match()
step - in other languages like Cypher or SPARQL, reducing barriers (
sum(),count()
…) or collecting barriers (order(), group()
, …) are not allowed. In Gremlin pattern matching they are allowed but their scope is local only, meaning that if you have anorder()
clause inside a predicate, it is only useful for this predicate but this ordering cannot be used for the other predicates or outside of thematch()
step
The explanation for point 3. is that since each predicate is executed in any order and concurrently, inside a single predicate, at the time the traverser reaches the order().by()
step, all the data are being processed concurrently in other predicates so we don’t have all the data to perform a global ordering. Consequently the ordering can only be local to the current predicate
Examples are better than words
B. Director who played in their own movies
gremlin>g.V().match(
__.as("directors").hasLabel("person"), // 1st predicate
__.as("directors").in("director").as("directed_movies"),
// 2nd predicate
__.as("directed_movies").out("actor").as("directors")
// 3rd predicate ).
select("directors","directed_movies"). // Iterator
==>Keenen Ivory Wayans=[Scary Movie]
==>Terry Jones=[Monty Python's Life of Brian, Monty Python's The Meaning of Life, Monty Python and the Holy Grail]
==>Kenneth Branagh=[Much Ado About Nothing]
==>Kevin Costner=[Open Range, Dances with Wolves]
==>Kevin Smith=[Chasing Amy, Clerks]
==>Buster Keaton=[The General]
==>Dany Boon=[Welcome to the Sticks]
==>Ben Stiller=[The Cable Guy, Zoolander] ...
- the 1st predicate is labeled as directors and picks all person vertices
- in the 2nd predicate, we pick the previous directors movies and label them as directed_movies
- in the 3rd predicate, we pick the previous directed_movies and find their actors and label them as directors. This implicitly means that the actors playing in movies directed_movies should be the previously labeled directors themselves!
At the step select("directors","directed_movies") we have an Iterator> as follow:
==>{directors=v[{~label=person, community_id=1325149184, member_id=418}], directed_movies=v[{~label=movie, community_id=1757220352, member_id=30}]}
==>{directors=v[{~label=person, community_id=652999808, member_id=184}], directed_movies=v[{~label=movie, community_id=704287872, member_id=7}]}
==>{directors=v[{~label=person, community_id=652999808, member_id=184}], directed_movies=v[{~label=movie, community_id=1026409472, member_id=23}]}
Each map contains a single entry director=director vertex and directed_movies=ONE of the movie he directed/acted in.
The step group().by(select("directors").by("name")).by(select("directed_movies").by("title").fold())
will transform the previous stream into an Iterator==movies>>
. Please notice the interesting usage of the select()
operator to pick the right map entry by its key. The fold()
step in the 2nd by()
modulator is necessary to pick ALL the movies for each director/actor
C. Director who played in their own Sci-Fi movies
Now we add an extra condition to the previous pattern matching, we only want to look at Sci-Fi movies:
gremlin>g.V().match(
__.as("directors").hasLabel("person"), // 1st predicate
__.as("directors").in("director").as("directed_movies"), // 2nd predicate
__.as("directed_movies").out("actor").as("directors"), // 3rd predicate
__.as("directed_movies").out("belongsTo").as("genres"), // 4th predicate
__.as("genres").filter(values("name").is(eq("Sci-Fi"))) // 5th predicate
).
select("directors","directed_movies").
dedup().
group().
by(select("directors").by("name")).
by(select("directed_movies").by("title").fold()). unfold()
==>David Cronenberg=[The Fly]
==>M. Night Shyamalan=[Signs]
==>Kurt Wimmer=[Equilibrium]
==>George Lucas=[Star Wars: Episode III Revenge of the Sith]
We added 2 extra predicates (4th and 5th) at the end to force the genres of the movies to contain “Sci-Fi”. The results are now limited to only 4 movies.
D. Movies that are both Sci-Fi and Action
gremlin>g.V().match(
__.as("movies").hasLabel("movie"),
__.as("movies").filter(out("belongsTo").values("name").is(eq("Sci-Fi"))).as("scifi_movies"), __.as("scifi_movies").filter(out("belongsTo").values("name").is(eq("Action"))).as("scifi_movies")
).
select("scifi_movies").
project("title","genres").
by("title").
by(out("belongsTo").values("name").fold())
==>{title=Total Recall, genres=[Sci-Fi, Action]}
==>{title=Battlefield Earth, genres=[Sci-Fi, Action]}
==>{title=The Avengers, genres=[Sci-Fi, Action, Fantasy]}
==>{title=Source Code, genres=[Sci-Fi, Thriller, Action]}
==>{title=R.O.T.O.R., genres=[Sci-Fi, Action]}
==>{title=Inception, genres=[Mystery, Sci-Fi, Thriller, Action]} ==>{title=Ghost in the Shell, genres=[Sci-Fi, Action, Animation]} ...
It’s quite explanatory, nothing complicated here
E. Top 5 well rated Sci-Fi movies
gremlin>g.V().match(
__.as("movies").hasLabel("movie"),
__.as("movies").filter(out("belongsTo").values("name").is(eq("Sci-Fi"))).as("scifi_movies"), __.as("scifi_movies").dedup().order().by(inE("rated").values("rating").mean(), decr).as("well_rated_scifi") ). select("well_rated_scifi").limit(5). project("movie","rating").
by("title").
by(inE("rated").values("rating").mean())
==>{movie=Alien Resurrection, rating=5.543478260869565}
==>{movie=Total Recall, rating=6.714285714285714}
==>{movie=Battlefield Earth, rating=2.5}
==>{movie=Eternal Sunshine of the Spotless Mind, rating=7.597701149425287}
==>{movie=Back to the Future. Part III, rating=6.604651162790698}
Strangely, the results are incorrectly ordered. In fact we are hitting here the limitation I mentioned above. The ordering can only be local to the current step so in __.as("scifi_movies").dedup().order().by(inE("rated").values("rating").mean(), decr).as("well_rated_scifi")
the traverser enters the predicate from a single sci-fi movie so there is only 1 mean rating for this single movie and ordering on it is useless. Consequently scifi_movies == well_rated_scifi
If we want the ordering, it should be done outside of the match()
step
gremlin>g.V().match(
__.as("movies").hasLabel("movie"),
__.as("movies").filter(out("belongsTo").values("name").is(eq("Sci-Fi"))).as("scifi_movies") ). select("scifi_movies"). order().by(inE("rated").values("rating").mean(), decr). limit(5). project("movie","rating"). by("title"). by(inE("rated").values("rating").mean())
==>{movie=Metropolis, rating=8.25}
==>{movie=A Clockwork Orange, rating=8.215686274509803}
==>{movie=Blade Runner, rating=8.20353982300885}
==>{movie=Star Wars. Episode V: The Empire Strikes Back, rating=8.121739130434783}
==>{movie=Stalker, rating=8.03191489361702}
F. Recommendation engine with pattern-matching
We can re-write our classical movies recommendation engine seen in previous blog posts using pattern matching
gremlin>g.V().match( __.as("blade_runner").has("movie", "title", "Blade Runner"), __.as("blade_runner").map(inE("rated").values("rating").mean()).as("avg_rating"), __.as("blade_runner").map(out("belongsTo").values("name").fold()).as("genres"), __.as("blade_runner").inE("rated").filter(values("rating").where(gte("avg_rating"))).outV().as("raters_of_blade_runner"),
__.as("raters_of_blade_runner").outE("rated").filter(values("rating").where(gte("avg_rating"))).inV().where(neq("blade_runner")).as("potential_movies"), __.as("potential_movies").where("potential_movies", gte("avg_rating")).by(inE("rated").values("rating").mean()).by().as("filtered_movies1"), __.as("filtered_movies1").filter(out("belongsTo").has("name", where(within("genres")))).as("matching_movies")
).
select("matching_movies").
project("movie", "average_rating", "genres").
by("title").
by(inE("rated").values("rating").mean()).
by(out("belongsTo").values("name").fold())
Neither the sideEffects, map, nor path has a genres-key: WherePredicateStep(within([genres]))
Type ':help' or ':h' for help.
Against, the error is due to a bug in Gremlin, reported in TINKERPOP-1762.
A simple work-around right now is to add an extra dedup()
after the select()
step
gremlin>g.V().match( __.as("blade_runner").has("movie", "title", "Blade Runner"), __.as("blade_runner").map(inE("rated").values("rating").mean()).as("avg_rating"), __.as("blade_runner").map(out("belongsTo").values("name").fold()).as("genres"), __.as("blade_runner").inE("rated").filter(values("rating").where(gte("avg_rating"))).outV().as("raters_of_blade_runner"), __.as("raters_of_blade_runner").outE("rated").filter(values("rating").where(gte("avg_rating"))).inV().where(neq("blade_runner")).as("potential_movies"), __.as("potential_movies").where("potential_movies", gte("avg_rating")).by(inE("rated").values("rating").mean()).by().as("filtered_movies1"), __.as("filtered_movies1").filter(out("belongsTo").has("name", where(within("genres")))).as("matching_movies") ).
select("matching_movies").dedup().
project("movie", "average_rating", "genres").
by("title").
by(inE("rated").values("rating").mean()).
by(out("belongsTo").values("name").fold())
==>{movie=Pulp Fiction, average_rating=8.581005586592179, genres=[Thriller, Action]}
==>{movie=Seven Samurai, average_rating=8.470588235294118, genres=[Action, Adventure, Drama]}
==>{movie=A Clockwork Orange, average_rating=8.215686274509803, genres=[Sci-Fi, Drama]}
And that’s all folks! Do not miss the other Gremlin recipes in this series.
If you have any question about Gremlin, find me on the datastaxacademy.slack.com, channel dse-graph. My id is @doanduyhai