TechnologySeptember 22, 2017

Gremlin Recipes: 1 - Understanding Gremlin Traversals

Gremlin Recipes: 1 - Understanding Gremlin Traversals

Part 1 of 5 for the series Gremlin Recipes. The purpose is to explain the internals of Gremlin and provide a deeper insight into the query language to master it.

This blog post belongs to a series called Gremlin Recipes, published by Duyhai Doan on his blog. The purpose is to explain the internals of Gremlin and give people a deeper insight into the query language to master it.

I. KillrVideo dataset

To illustrate this series of recipes, you need first to create the schema for KillrVideo and import the data. The graph schema of this dataset is:

INSERTING DATA

First, open the Gremlin console or Datastax Studio (whichever works fine) and execute the following statements:

Open-source Gremlin Console config:

:remote connect tinkerpop.server conf/remote.yaml session-manage
:remote config timeout max
:remote console
system.graph('KillrVideo').create()
:remote config alias g KillrVideo.g

KillrVideo schema & data loading:

schema.clear();
// Property keys 
schema.propertyKey("genreId").Text().create(); 
schema.propertyKey("personId").Text().create(); 
schema.propertyKey("userId").Text().create(); 
schema.propertyKey("movieId").Text().create(); 
schema.propertyKey("name").Text().create(); 
schema.propertyKey("age").Int().create(); 
schema.propertyKey("gender").Text().create(); 
schema.propertyKey("title").Text().create(); 
schema.propertyKey("year").Int().create(); 
schema.propertyKey("duration").Int().create(); 
schema.propertyKey("country").Text().create(); 
schema.propertyKey("production").Text().multiple().create(); 
schema.propertyKey("rating").Int().create();
 
// Vertex labels
schema.vertexLabel("genre").properties("genreId","name").create();
schema.vertexLabel("person").properties("personId","name").create();
schema.vertexLabel("user").properties("userId","age","gender").create();
schema.vertexLabel("movie").properties("movieId","title","year","duration","country","production").create();
 
// Edge labels
schema.edgeLabel("knows").connection("user","user").create();
schema.edgeLabel("rated").single().properties("rating").connection("user","movie").create();
schema.edgeLabel("belongsTo").single().connection("movie","genre").create();
schema.edgeLabel("actor").connection("movie","person").create();
schema.edgeLabel("director").single().connection("movie","person").create();
 
// Vertex indexes
schema.vertexLabel("genre").index("genresById").materialized().by("genreId").add();
schema.vertexLabel("genre").index("genresByName").materialized().by("name").add();
schema.vertexLabel("person").index("personsById").materialized().by("personId").add();
schema.vertexLabel("person").index("personsByName").materialized().by("name").add();
schema.vertexLabel("user").index("usersById").materialized().by("userId").add();
schema.vertexLabel("user").index("search").search().by("age").by("gender").asString().add();
schema.vertexLabel("movie").index("moviesById").materialized().by("movieId").add();
schema.vertexLabel("movie").index("moviesByTitle").materialized().by("title").add();
schema.vertexLabel("movie").index("search").search().by("year").by("country").asString().add();
 
// Edge indexes
schema.vertexLabel("user").index("toMoviesByRating").outE("rated").by("rating").add();
schema.vertexLabel("movie").index("toUsersByRating").inE("rated").by("rating").add();
 
schema.config().option("tx_autostart").set(true);
 
// Load data from file KillrVideo-small.kryo
graph.io(IoCore.gryo()).readGraph("/path/to/KillrVideo-small.kryo");

To be able to perform a full scan on this small dataset, you need to add

schema.config().option('graph.allow_scan').set('true');

The file KillrVideo-small.kryo can be downloaded here

II. Gremlin, Iterators, and Streams

Usually, any graph traversal starts with g.V(), but what is the type of this expression ? To know that, just execute the query below:

gremlin> g.V().getClass()
==>class org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.DefaultGraphTraversal

The type of g.V() is indeed a DefaultGraphTraversal. According to the Gremlin API, DefaultGraphTraversal implements Iterator. In fact DefaultGraphTraversal is an Iterator of Vertex.

If we had done g.E().getClass() it would be an Iterator of Edge

To confirm that g.V() is an iterator of vertex, let’s just execute this code

gremlin> g.V().next().getClass()
==>class com.datastax.bdp.graph.impl.element.vertex.DsegCachedVertexImpl

Since it’s an iterator, we can call method next() on it and call getClass() to get its type. As we can see DsegCachedVertexImpl is just an implementation of Gremlin Vertex interface. Of course since it’s an iterator, we can invoke as well other methods on Iterator interface like hasNext(), tryNext(), next(n), limit(n), tail(n), toList(), toSet()

So far so good, now let’s analyse a very simple traversal:

gremlin>g.
    V().                            
    hasLabel("person").             
    has("name", "Harrison Ford").   
    next()
==>v[{~label=person, community_id=1425165696, member_id=527}]

Below is a matrix giving the equivalent stream processing for each Gremlin step:

Gremlin step Stream equivalent
g.V() Iterator iterator = …
.hasLabel(“person”) iterator.stream().filter(vertex -> vertex instanceof Person) == Iterator
.has(“name”, “Harrison Ford”) iterator.stream().filter(person -> person.getName().equals(“Harrison Ford”)) == Iterator
.next() iterator.findFirst().get()

The above traversal just fetch the whole vertex Harrison Ford. What if we only want to retrieve the name instead of the complete vertex ?

gremlin>g.
    V().                            
    hasLabel("person").             
    has("name", "Harrison Ford"). 
    values("name").  // iterator.map(person -> person.getName()) == Iterator<String>
    next()
==>Harrison Ford

The step values("name") is equivalent to a map operation of Java stream to extract only the name property of the vertex.

Now, if you wish to retrieve more properties from the vertex, you can either use values("property1", "property2", ...,"propertyN") or use the step valueMap("property1", "property2", ..., "propertyN")

gremlin>g.
    V().                            
    hasLabel("person").             
    has("name", "Harrison Ford"). 
    valueMap("name", "personId"). // iterator.map(person -> ImmutableMap.of("name",person.getName(), "id", person.getPersonId()))
    next()
==>name=[Harrison Ford]
==>personId=[p3211]

The step valueMap("name", "personId") will transform the vertex object into a Map structure. The key of the map is the property label, the value of the Map is the corresponding value.

To prove it, just call getClass() again

gremlin>g.
    V().                            
    hasLabel("person").             
    has("name", "Harrison Ford"). 
    valueMap("name", "personId").
    next().
    getClass()
==>class java.util.HashMap

We get a java.util.HashMap as expected.

III. Grouping data


Now let’s do something crazier, let’s just count the number of movies released by year in our dataset. For that we use this traversal

gremlin>g.
    V().                    // Iterator<Vertex>    
    hasLabel("movie").      // Iterator<Movie>   
    groupCount().by("year") // Iterator<Map<Int == year of release, Long = count>>
==>{1921=1, 1925=1, 1926=1, 1927=2, 1931=1, 1933=1, 1934=1, 1935=2, 1936=1, 1937=2, 1938=2, 1939=4, 1940=6, 1941=3, 1942=3, 1943=1, 1944=2, 1945=2, 1946=3, 1948=8, 1949=4, 1950=8, 1951=6, 1952=8, 1953=8, 1954=9, 1955=5, 1956=5, 1957=7, 1958=4, 1959=8, 1960=7, 1961=8, 1962=6, 1963=3, 1964=8, 1965=4, 1966=6, 1967=6, 1968=8, 1969=6, 1970=9, 1971=4, 1972=6, 1973=5, 1974=4, 1975=7, 1976=4, 1977=5, 1978=2, 1979=8, 1980=8, 1981=3, 1982=10, 1983=4, 1984=13, 1985=8, 1986=7, 1987=6, 1988=12, 1989=12, 1990=13, 1991=10, 1992=8, 1993=16, 1994=21, 1995=27, 1996=16, 1997=28, 1998=29, 1999=26, 2000=24, 2001=30, 2002=23, 2003=36, 2004=45, 2005=35, 2006=40, 2007=32, 2008=39, 2009=28, 2010=20, 2011=19, 2012=19, 2013=12, 2014=4, 2015=1, 1903=1}

So the interesting step here is the groupCount().by("year"). This steps performs 2 actions in one: a grouping followed by a counting. The grouping is executed on the property year as grouping key. The result of the grouping is Iterator>>. The counting step will transform this iterator into Iterator>

One interesting thing to notice here is that we have a nested collection structure. The outer collection is the Iterator and the inner collection is the Map.

Because all the grouping is result is stored inside the map, the iterator contains only a single element, which is the map itself. We can prove it with:

gremlin>g.
    V().                     // Iterator<Vertex>    
    hasLabel("movie").       // Iterator<Movie>   
    groupCount().by("year"). // Iterator<Map<Int == year of release, Long = count>>
    count()
==>1

Now, what it we want to access the content of the inner collection(the map here) inside the iterator ? Of course we can always invoke next() but Gremlin also offers an interesting alternative: unfold(). It is equivalent to the Java stream flatMap() method. Let’s see it in action:

gremlin>g.
    V().                     // Iterator<Vertex>    
    hasLabel("movie").       // Iterator<Movie>   
    groupCount().by("year"). // Iterator<Map<Int == year of release, Long = count>>
    unfold().                // iterator.stream().flatMap() == Iterator<MapEntry<Int,Long>>
    next()
==>1921=1

IV. Advanced grouping in Gremlin

The grouping we have seen above (groupCount().by("year")) is just a special case of a more general grouping form: group().by(xxx).by(yyy)

The first by(xxx) defines the grouping key e.g. the property/value on which the grouping is done. It can be a property of the vertex/edge itself or something more complicated like a complete traversal.

The second by(yyy) can be:

  1. either a reducing step like count() to reduce the collection of matching vertices for each grouping key
  2. or a projection step to transform the matching collection of vertices into a collection of something else. Something else can be a scalar value or vertices/edges

Let’s put all this into practice. So first, the equivalent of our previous groupCount().by("year") can be rewritten as:

gremlin>g.
    V().               // Iterator<Vertex>    
    hasLabel("movie"). // Iterator<Movie>   
    group().
      by("year").      // iterator = Iterator<Map<Int == year of release, Collection<Movie>>
      by(count())      // iterator.stream().map(mapStructure -> mapStructure.stream().collect(Collectors.toMap(key -> key, value -> value.size()))) 
==>{1921=1, 1925=1, 1926=1, 1927=2, 1931=1, 1933=1, 1934=1, 1935=2, 1936=1, 1937=2, 1938=2, 1939=4, 1940=6, 1941=3, 1942=3, 1943=1, 1944=2, 1945=2, 1946=3, 1948=8, 1949=4, 1950=8, 1951=6, 1952=8, 1953=8, 1954=9, 1955=5, 1956=5, 1957=7, 1958=4, 1959=8, 1960=7, 1961=8, 1962=6, 1963=3, 1964=8, 1965=4, 1966=6, 1967=6, 1968=8, 1969=6, 1970=9, 1971=4, 1972=6, 1973=5, 1974=4, 1975=7, 1976=4, 1977=5, 1978=2, 1979=8, 1980=8, 1981=3, 1982=10, 1983=4, 1984=13, 1985=8, 1986=7, 1987=6, 1988=12, 1989=12, 1990=13, 1991=10, 1992=8, 1993=16, 1994=21, 1995=27, 1996=16, 1997=28, 1998=29, 1999=26, 2000=24, 2001=30, 2002=23, 2003=36, 2004=45, 2005=35, 2006=40, 2007=32, 2008=39, 2009=28, 2010=20, 2011=19, 2012=19, 2013=12, 2014=4, 2015=1, 1903=1}

As expected the result is identical to what we get earlier.

Now let’s say we want to perform a projection of the collection of movies instead of a reduction, we want to have, for each year, a list of movies title instead of movies count, the previous traversal becomes:

gremlin>g.
    V().               // Iterator<Vertex>    
    hasLabel("movie"). // Iterator<Movie>   
    group().
      by("year").      // Iterator<Map<Int == year of release, Collection<Movie>>
      by("title").     // Iterator<Map<Int == year of release, Collection<String == movie title>>
      unfold().        // Iterator<MapEntry<Int == year of release, Collection<String == movie title>>
      take(10)
==>1921=[The Kid]
==>1925=[The Gold Rush]
==>1926=[The General]
==>1927=[Metropolis, Sunrise]
==>1931=[City Lights]
==>1933=[Duck Soup]
==>1934=[It Happened One Night]
==>1935=[A Night at the Opera, Top Hat]
==>1936=[Modern Times]
==>1937=[Snow White and the Seven Dwarfs, Captains Courageous]

So far, so good. Now instead of doing a simple projection on a Movie property, let’s do a traversal instead. We want to group the movies by year and for each year, we display the list of movie director name:

gremlin>g.
    V().                                  // Iterator<Vertex>    
    hasLabel("movie").                    // Iterator<Movie>   
    group().
      by("year").                         // Iterator<Map<Int == year of release, Collection<Movie>>
      by(out("director").values("name")). // Iterator<Map<Int == year of release, Collection<String == director name>>
      unfold().                           // Iterator<MapEntry<Int == year of release, Collection<String == movie title>>
      take(10)
==>1921=Charles Chaplin
==>1925=Charles Chaplin
==>1926=Buster Keaton
==>1927=F.W. Murnau
==>1931=Charles Chaplin
==>1933=Leo McCarey
==>1934=Frank Capra
==>1935=Mark Sandrich
==>1936=Charles Chaplin
==>1937=Victor Fleming

Strangely enough, we only get 1 director name for each year, which is not correct. It looks like a bug in Gremlin but it isn’t.

When we traverse the edge “director” from the vertex “movie”, it is a 1-to-N relationship because we can have for each movie more than 1 directors so we’ll end up having Collection> and thus a combinatory explosion. By design, Gremlin only call next() on inner traversal, maybe to avoid such explosion (this is my assumption, to be confirmed).

To force Gremlin to be exhaustive on the inner traversal out("director").values("name"), we can use the fold() step

gremlin>g.
    V().                                         // Iterator<Vertex>    
    hasLabel("movie").                           // Iterator<Movie>   
    group().
      by("year").                                // Iterator<Map<Int == year of release, Collection<Movie>>
      by(out("director").values("name").fold()). // Iterator<Map<Int == year of release, String == director names>>
      unfold().                                  // Iterator<MapEntry<Int == year of release, Collection<String == movie title>>
      take(10)
==>1921=[Charles Chaplin]
==>1925=[Charles Chaplin]
==>1926=[Buster Keaton, Clyde Bruckman]
==>1927=[Fritz Lang, F.W. Murnau]
==>1931=[Charles Chaplin]
==>1933=[Leo McCarey]
==>1934=[Frank Capra]
==>1935=[Sam Wood, Mark Sandrich]
==>1936=[Charles Chaplin]
==>1937=[David Hand, Victor Fleming]

Here we are!

Finally, we can also replace the grouping step by a complete traversal instead of a simple vertex property. Let’s say we want to group movies by director name and display the movies title for each director:

gremlin>g.
    V().                                  // Iterator<Vertex>    
    hasLabel("movie").                    // Iterator<Movie>   
    group().
      by(out("director").values("name")). // Iterator<Map<String == director name, Collection<Movie>>
      by("title").                        // Iterator<Map<String == director name, Collection<String> == movie title>>
      unfold().                           // Iterator<MapEntry<String == director name, Collection<String == movie title>>
      take(10)
==>Kwak Jae-young=[My Sassy Girl]
==>Peter Jackson=[The Lord of the Rings: The Two Towers, The Lord of the Rings: The Fellowship of the Ring, The Hobbit: The Desolation of Smaug, The Lord of the Rings: The Return of the King, King Kong, The Hobbit: An Unexpected Journey]
==>Alejandro Agresti=[The Lake House]
==>George Pan Cosmatos=[Tombstone]
==>Pitof=[Catwoman]
==>Raman Hui=[Shrek the Third]
==>Michael Bay=[Transformers, Armageddon, The Island, Pearl Harbor, Bad Boys, The Rock]
==>Santiago Segura=[Torrente, el brazo tonto de la ley, Torrente 2: Mission in Marbella, Torrente 3]
==>Robert Stevenson=[Mary Poppins]
==>Clare Kilner=[The Wedding Date]

Again, putting out("director").values("name") inside the first by() instead of “year” will do the trick.

And that’s all folks! Do not miss the other Gremlin recipes in this series.

If you have any question about Gremlin, find me on the datastaxacademy.slack.com, channel dse-graph. My id is @doanduyhai

Discover more
Gremlin
Share

One-Stop Data API for Production GenAI

Astra DB gives developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.