TechnologySeptember 24, 2017

Gremlin Recipes: 9 – Pattern Matching

Gremlin Recipes: 9 – Pattern Matching

Part 9 of 10 for the series Gremlin Recipes. The purpose is to explain the internal of Gremlin and give people a deeper insight into the query language to master it.

This blog post is the 9th from the series Gremlin Recipes. It is recommended to read the previous blog posts first:

  1. Gremlin as a Stream
  2. SQL to Gremlin
  3. Recommendation Engine traversal
  4. Recursive traversals
  5. Path object
  6. Projection and selection
  7. Variable handling
  8. sack() operator
     

I KillrVideo dataset

To illustrate this series of recipes, you need first to create the schema for KillrVideo and import the data. See here for more details.

The graph schema of this dataset is :

II Pattern Matching in Gremlin

A. Description

Until now we always use descriptive traversals with Gremlin. By descriptive traversals I mean we describe all the navigation in the graph using standard steps like in(), out(), ...

However, Gremlin also offers a declarative way to do traversals, using pattern matching. Basically you declare all the properties of your vertices/edges and alias them with label then Gremlin will compute the most appropriate traversals to match your pattern.

A pattern-matching traversal uses the match() step and look like:

gremlin>g.V().match(
__.as("label").....,      // predicate 1
__.as("label").....,      // predicate 2
...
__.as("label").where(...) // matching predicate

)
...

Please note that each predicate must start with a labelling step as(). Each predicate can end with a labelling step as() but it is optional

All the labels you declare in your predicates are global and can be re-called outside of the match() step. Those labels also serve the purpose of matching different predicates together

There are a few things you should know to exploit fully pattern matching with Gremlin:

  1. all the predicates inside the match() are executed concurrently and independently. Consequently the lexical order of the declared predicates does not matter during execution time
  2. variables(alias) binding need to hold true across multiple predicate. The traversers that succeeded all the variable findings in all predicate are the ones that survive the declared pattern and thus are the result of the match() step
  3. in other languages like Cypher or SPARQL, reducing barriers (sum(),count() …) or collecting barriers (order(), group(), …) are not allowed. In Gremlin pattern matching they are allowed but their scope is local only, meaning that if you have an order() clause inside a predicate, it is only useful for this predicate but this ordering cannot be used for the other predicates or outside of the match() step

The explanation for point 3. is that since each predicate is executed in any order and concurrently, inside a single predicate, at the time the traverser reaches the order().by() step, all the data are being processed concurrently in other predicates so we don’t have all the data to perform a global ordering. Consequently the ordering can only be local to the current predicate

Examples are better than words

B. Director who played in their own movies

gremlin>g.V().match(
    __.as("directors").hasLabel("person"), // 1st predicate
    __.as("directors").in("director").as("directed_movies"),
                                           // 2nd predicate
    __.as("directed_movies").out("actor").as("directors")
                                           // 3rd predicate ).
    select("directors","directed_movies"). // Iterator>
    dedup().
    group().
        by(select("directors").by("name")).
        by(select("directed_movies").by("title").fold()).
 
                                          // Iterator=movies>>
unfold()

                                           // Iterator=movies>>

==>Charles Chaplin=[The Gold Rush, Modern Times, City Lights, The Great Dictator, The Kid]
==>Keenen Ivory Wayans=[Scary Movie]
==>Terry Jones=[Monty Python's Life of Brian, Monty Python's The Meaning of Life, Monty Python and the Holy Grail]
==>Kenneth Branagh=[Much Ado About Nothing]
==>Kevin Costner=[Open Range, Dances with Wolves]
==>Kevin Smith=[Chasing Amy, Clerks]
==>Buster Keaton=[The General]

==>Dany Boon=[Welcome to the Sticks]
==>Ben Stiller=[The Cable Guy, Zoolander] ...

  • the 1st predicate is labeled as directors and picks all person vertices
  • in the 2nd predicate, we pick the previous directors movies and label them as directed_movies
  • in the 3rd predicate, we pick the previous directed_movies and find their actors and label them as directors. This implicitly means that the actors playing in movies directed_movies should be the previously labeled directors themselves!

At the step select("directors","directed_movies") we have an Iterator> as follow:

==>{directors=v[{~label=person, community_id=1325149184, member_id=418}], directed_movies=v[{~label=movie, community_id=1757220352, member_id=30}]}

==>{directors=v[{~label=person, community_id=652999808, member_id=184}], directed_movies=v[{~label=movie, community_id=704287872, member_id=7}]}

==>{directors=v[{~label=person, community_id=652999808, member_id=184}], directed_movies=v[{~label=movie, community_id=1026409472, member_id=23}]}

Each map contains a single entry director=director vertex and directed_movies=ONE of the movie he directed/acted in.

The step group().by(select("directors").by("name")).by(select("directed_movies").by("title").fold())

will transform the previous stream into an Iterator==movies>>. Please notice the interesting usage of the select() operator to pick the right map entry by its key. The fold() step in the 2nd by() modulator is necessary to pick ALL the movies for each director/actor

C. Director who played in their own Sci-Fi movies

Now we add an extra condition to the previous pattern matching, we only want to look at Sci-Fi movies:

gremlin>g.V().match(
    __.as("directors").hasLabel("person"), // 1st predicate
    __.as("directors").in("director").as("directed_movies"), // 2nd predicate
    __.as("directed_movies").out("actor").as("directors"), // 3rd predicate
    __.as("directed_movies").out("belongsTo").as("genres"), // 4th predicate
    __.as("genres").filter(values("name").is(eq("Sci-Fi"))) // 5th predicate
    ).
    select("directors","directed_movies").
    dedup().
    group().
        by(select("directors").by("name")).
        by(select("directed_movies").by("title").fold()). unfold()
==>David Cronenberg=[The Fly]
==>M. Night Shyamalan=[Signs]
==>Kurt Wimmer=[Equilibrium]
==>George Lucas=[Star Wars: Episode III Revenge of the Sith]

We added 2 extra predicates (4th and 5th) at the end to force the genres of the movies to contain “Sci-Fi”. The results are now limited to only 4 movies.

D. Movies that are both Sci-Fi and Action

gremlin>g.V().match(
    __.as("movies").hasLabel("movie"),
  __.as("movies").filter(out("belongsTo").values("name").is(eq("Sci-Fi"))).as("scifi_movies"), __.as("scifi_movies").filter(out("belongsTo").values("name").is(eq("Action"))).as("scifi_movies")
).
select("scifi_movies").
project("title","genres").
   by("title").
   by(out("belongsTo").values("name").fold())
==>{title=Total Recall, genres=[Sci-Fi, Action]}
==>{title=Battlefield Earth, genres=[Sci-Fi, Action]}
==>{title=The Avengers, genres=[Sci-Fi, Action, Fantasy]}
==>{title=Source Code, genres=[Sci-Fi, Thriller, Action]}
==>{title=R.O.T.O.R., genres=[Sci-Fi, Action]}
==>{title=Inception, genres=[Mystery, Sci-Fi, Thriller, Action]} ==>{title=Ghost in the Shell, genres=[Sci-Fi, Action, Animation]} ...

It’s quite explanatory, nothing complicated here

E. Top 5 well rated Sci-Fi movies

gremlin>g.V().match(
    __.as("movies").hasLabel("movie"),
 __.as("movies").filter(out("belongsTo").values("name").is(eq("Sci-Fi"))).as("scifi_movies"),   __.as("scifi_movies").dedup().order().by(inE("rated").values("rating").mean(), decr).as("well_rated_scifi") ). select("well_rated_scifi").limit(5). project("movie","rating").
        by("title").
        by(inE("rated").values("rating").mean())
==>{movie=Alien Resurrection, rating=5.543478260869565}
==>{movie=Total Recall, rating=6.714285714285714}
==>{movie=Battlefield Earth, rating=2.5}
==>{movie=Eternal Sunshine of the Spotless Mind, rating=7.597701149425287}
==>{movie=Back to the Future. Part III, rating=6.604651162790698}

Strangely, the results are incorrectly ordered. In fact we are hitting here the limitation I mentioned above. The ordering can only be local to the current step so in __.as("scifi_movies").dedup().order().by(inE("rated").values("rating").mean(), decr).as("well_rated_scifi") the traverser enters the predicate from a single sci-fi movie so there is only 1 mean rating for this single movie and ordering on it is useless. Consequently scifi_movies == well_rated_scifi

If we want the ordering, it should be done outside of the match() step

gremlin>g.V().match(
__.as("movies").hasLabel("movie"),
__.as("movies").filter(out("belongsTo").values("name").is(eq("Sci-Fi"))).as("scifi_movies") ). select("scifi_movies"). order().by(inE("rated").values("rating").mean(), decr). limit(5). project("movie","rating"). by("title"). by(inE("rated").values("rating").mean())
==>{movie=Metropolis, rating=8.25}
==>{movie=A Clockwork Orange, rating=8.215686274509803}
==>{movie=Blade Runner, rating=8.20353982300885}
==>{movie=Star Wars. Episode V: The Empire Strikes Back, rating=8.121739130434783}
==>{movie=Stalker, rating=8.03191489361702}

F. Recommendation engine with pattern-matching

We can re-write our classical movies recommendation engine seen in previous blog posts using pattern matching

gremlin>g.V().match( __.as("blade_runner").has("movie", "title", "Blade Runner"), __.as("blade_runner").map(inE("rated").values("rating").mean()).as("avg_rating"), __.as("blade_runner").map(out("belongsTo").values("name").fold()).as("genres"), __.as("blade_runner").inE("rated").filter(values("rating").where(gte("avg_rating"))).outV().as("raters_of_blade_runner"),

__.as("raters_of_blade_runner").outE("rated").filter(values("rating").where(gte("avg_rating"))).inV().where(neq("blade_runner")).as("potential_movies"), __.as("potential_movies").where("potential_movies", gte("avg_rating")).by(inE("rated").values("rating").mean()).by().as("filtered_movies1"), __.as("filtered_movies1").filter(out("belongsTo").has("name", where(within("genres")))).as("matching_movies")
).
select("matching_movies").
project("movie", "average_rating", "genres").

   by("title").
   by(inE("rated").values("rating").mean()).
   by(out("belongsTo").values("name").fold())
Neither the sideEffects, map, nor path has a genres-key: WherePredicateStep(within([genres]))
Type ':help' or ':h' for help.

Against, the error is due to a bug in Gremlin, reported in TINKERPOP-1762.

A simple work-around right now is to add an extra dedup() after the select() step

gremlin>g.V().match( __.as("blade_runner").has("movie", "title", "Blade Runner"), __.as("blade_runner").map(inE("rated").values("rating").mean()).as("avg_rating"), __.as("blade_runner").map(out("belongsTo").values("name").fold()).as("genres"), __.as("blade_runner").inE("rated").filter(values("rating").where(gte("avg_rating"))).outV().as("raters_of_blade_runner"), __.as("raters_of_blade_runner").outE("rated").filter(values("rating").where(gte("avg_rating"))).inV().where(neq("blade_runner")).as("potential_movies"), __.as("potential_movies").where("potential_movies", gte("avg_rating")).by(inE("rated").values("rating").mean()).by().as("filtered_movies1"), __.as("filtered_movies1").filter(out("belongsTo").has("name", where(within("genres")))).as("matching_movies") ). 
   select("matching_movies").dedup().
   project("movie", "average_rating", "genres").
      by("title").
      by(inE("rated").values("rating").mean()).
      by(out("belongsTo").values("name").fold())
==>{movie=Pulp Fiction, average_rating=8.581005586592179, genres=[Thriller, Action]}
==>{movie=Seven Samurai, average_rating=8.470588235294118, genres=[Action, Adventure, Drama]}
==>{movie=A Clockwork Orange, average_rating=8.215686274509803, genres=[Sci-Fi, Drama]}

And that’s all folks! Do not miss the other Gremlin recipes in this series.

If you have any question about Gremlin, find me on the datastaxacademy.slack.com, channel dse-graph. My id is @doanduyhai

Discover more
Gremlin
Share

One-stop Data API for Production GenAI

Astra DB gives JavaScript developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.