Gremlin Recipes: 1 - Understanding Gremlin Traversals
Part 1 of 5 for the series Gremlin Recipes. The purpose is to explain the internals of Gremlin and provide a deeper insight into the query language to master it.
This blog post belongs to a series called Gremlin Recipes, published by Duyhai Doan on his blog. The purpose is to explain the internals of Gremlin and give people a deeper insight into the query language to master it.
I. KillrVideo dataset
To illustrate this series of recipes, you need first to create the schema for KillrVideo and import the data. The graph schema of this dataset is:
INSERTING DATA
First, open the Gremlin console or Datastax Studio (whichever works fine) and execute the following statements:
Open-source Gremlin Console config:
:remote connect tinkerpop.server conf/remote.yaml session-manage
:remote config timeout max
:remote console
system.graph('KillrVideo').create()
:remote config alias g KillrVideo.g
KillrVideo schema & data loading:
schema.clear();
// Property keys
schema.propertyKey("genreId").Text().create();
schema.propertyKey("personId").Text().create();
schema.propertyKey("userId").Text().create();
schema.propertyKey("movieId").Text().create();
schema.propertyKey("name").Text().create();
schema.propertyKey("age").Int().create();
schema.propertyKey("gender").Text().create();
schema.propertyKey("title").Text().create();
schema.propertyKey("year").Int().create();
schema.propertyKey("duration").Int().create();
schema.propertyKey("country").Text().create();
schema.propertyKey("production").Text().multiple().create();
schema.propertyKey("rating").Int().create();
// Vertex labels
schema.vertexLabel("genre").properties("genreId","name").create();
schema.vertexLabel("person").properties("personId","name").create();
schema.vertexLabel("user").properties("userId","age","gender").create();
schema.vertexLabel("movie").properties("movieId","title","year","duration","country","production").create();
// Edge labels
schema.edgeLabel("knows").connection("user","user").create();
schema.edgeLabel("rated").single().properties("rating").connection("user","movie").create();
schema.edgeLabel("belongsTo").single().connection("movie","genre").create();
schema.edgeLabel("actor").connection("movie","person").create();
schema.edgeLabel("director").single().connection("movie","person").create();
// Vertex indexes
schema.vertexLabel("genre").index("genresById").materialized().by("genreId").add();
schema.vertexLabel("genre").index("genresByName").materialized().by("name").add();
schema.vertexLabel("person").index("personsById").materialized().by("personId").add();
schema.vertexLabel("person").index("personsByName").materialized().by("name").add();
schema.vertexLabel("user").index("usersById").materialized().by("userId").add();
schema.vertexLabel("user").index("search").search().by("age").by("gender").asString().add();
schema.vertexLabel("movie").index("moviesById").materialized().by("movieId").add();
schema.vertexLabel("movie").index("moviesByTitle").materialized().by("title").add();
schema.vertexLabel("movie").index("search").search().by("year").by("country").asString().add();
// Edge indexes
schema.vertexLabel("user").index("toMoviesByRating").outE("rated").by("rating").add();
schema.vertexLabel("movie").index("toUsersByRating").inE("rated").by("rating").add();
schema.config().option("tx_autostart").set(true);
// Load data from file KillrVideo-small.kryo
graph.io(IoCore.gryo()).readGraph("/path/to/KillrVideo-small.kryo");
To be able to perform a full scan on this small dataset, you need to add
schema.config().option('graph.allow_scan').set('true');
The file KillrVideo-small.kryo can be downloaded here
II. Gremlin, Iterators, and Streams
Usually, any graph traversal starts with g.V(), but what is the type of this expression ? To know that, just execute the query below:
gremlin> g.V().getClass()
==>class org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.DefaultGraphTraversal
The type of g.V() is indeed a DefaultGraphTraversal. According to the Gremlin API, DefaultGraphTraversal implements Iterator. In fact DefaultGraphTraversal is an Iterator of Vertex.
If we had done g.E().getClass() it would be an Iterator of Edge
To confirm that g.V() is an iterator of vertex, let’s just execute this code
gremlin> g.V().next().getClass()
==>class com.datastax.bdp.graph.impl.element.vertex.DsegCachedVertexImpl
Since it’s an iterator, we can call method next() on it and call getClass() to get its type. As we can see DsegCachedVertexImpl is just an implementation of Gremlin Vertex interface. Of course since it’s an iterator, we can invoke as well other methods on Iterator interface like hasNext(), tryNext(), next(n), limit(n), tail(n), toList(), toSet() …
So far so good, now let’s analyse a very simple traversal:
gremlin>g.
V().
hasLabel("person").
has("name", "Harrison Ford").
next()
==>v[{~label=person, community_id=1425165696, member_id=527}]
Below is a matrix giving the equivalent stream processing for each Gremlin step:
Gremlin step | Stream equivalent |
---|---|
g.V() | Iterator iterator = … |
.hasLabel(“person”) | iterator.stream().filter(vertex -> vertex instanceof Person) == Iterator |
.has(“name”, “Harrison Ford”) | iterator.stream().filter(person -> person.getName().equals(“Harrison Ford”)) == Iterator |
.next() | iterator.findFirst().get() |
The above traversal just fetch the whole vertex Harrison Ford. What if we only want to retrieve the name instead of the complete vertex ?
gremlin>g.
V().
hasLabel("person").
has("name", "Harrison Ford").
values("name"). // iterator.map(person -> person.getName()) == Iterator<String>
next()
==>Harrison Ford
The step values("name")
is equivalent to a map operation of Java stream to extract only the name property of the vertex.
Now, if you wish to retrieve more properties from the vertex, you can either use values("property1", "property2", ...,"propertyN")
or use the step valueMap("property1", "property2", ..., "propertyN")
gremlin>g.
V().
hasLabel("person").
has("name", "Harrison Ford").
valueMap("name", "personId"). // iterator.map(person -> ImmutableMap.of("name",person.getName(), "id", person.getPersonId()))
next()
==>name=[Harrison Ford]
==>personId=[p3211]
The step valueMap("name", "personId")
will transform the vertex object into a Map structure. The key of the map is the property label, the value of the Map is the corresponding value.
To prove it, just call getClass()
again
gremlin>g.
V().
hasLabel("person").
has("name", "Harrison Ford").
valueMap("name", "personId").
next().
getClass()
==>class java.util.HashMap
We get a java.util.HashMap
as expected.
III. Grouping data
Now let’s do something crazier, let’s just count the number of movies released by year in our dataset. For that we use this traversal
gremlin>g.
V(). // Iterator<Vertex>
hasLabel("movie"). // Iterator<Movie>
groupCount().by("year") // Iterator<Map<Int == year of release, Long = count>>
==>{1921=1, 1925=1, 1926=1, 1927=2, 1931=1, 1933=1, 1934=1, 1935=2, 1936=1, 1937=2, 1938=2, 1939=4, 1940=6, 1941=3, 1942=3, 1943=1, 1944=2, 1945=2, 1946=3, 1948=8, 1949=4, 1950=8, 1951=6, 1952=8, 1953=8, 1954=9, 1955=5, 1956=5, 1957=7, 1958=4, 1959=8, 1960=7, 1961=8, 1962=6, 1963=3, 1964=8, 1965=4, 1966=6, 1967=6, 1968=8, 1969=6, 1970=9, 1971=4, 1972=6, 1973=5, 1974=4, 1975=7, 1976=4, 1977=5, 1978=2, 1979=8, 1980=8, 1981=3, 1982=10, 1983=4, 1984=13, 1985=8, 1986=7, 1987=6, 1988=12, 1989=12, 1990=13, 1991=10, 1992=8, 1993=16, 1994=21, 1995=27, 1996=16, 1997=28, 1998=29, 1999=26, 2000=24, 2001=30, 2002=23, 2003=36, 2004=45, 2005=35, 2006=40, 2007=32, 2008=39, 2009=28, 2010=20, 2011=19, 2012=19, 2013=12, 2014=4, 2015=1, 1903=1}
So the interesting step here is the groupCount().by("year")
. This steps performs 2 actions in one: a grouping followed by a counting. The grouping is executed on the property year
as grouping key. The result of the grouping is Iterator>>
. The counting step will transform this iterator into Iterator>
One interesting thing to notice here is that we have a nested collection structure. The outer collection is the Iterator
and the inner collection is the Map
.
Because all the grouping is result is stored inside the map, the iterator contains only a single element, which is the map itself. We can prove it with:
gremlin>g.
V(). // Iterator<Vertex>
hasLabel("movie"). // Iterator<Movie>
groupCount().by("year"). // Iterator<Map<Int == year of release, Long = count>>
count()
==>1
Now, what it we want to access the content of the inner collection(the map here) inside the iterator ? Of course we can always invoke
but Gremlin also offers an interesting alternative:next()
. It is equivalent to the Java streamunfold()
method. Let’s see it in action:flatMap()
gremlin>g.
V(). // Iterator<Vertex>
hasLabel("movie"). // Iterator<Movie>
groupCount().by("year"). // Iterator<Map<Int == year of release, Long = count>>
unfold(). // iterator.stream().flatMap() == Iterator<MapEntry<Int,Long>>
next()
==>1921=1
IV. Advanced grouping in Gremlin
The grouping we have seen above (groupCount().by("year"))
is just a special case of a more general grouping form: group().by(xxx).by(yyy)
The first by(xxx)
defines the grouping key e.g. the property/value on which the grouping is done. It can be a property of the vertex/edge itself or something more complicated like a complete traversal.
The second by(yyy)
can be:
- either a reducing step like
count()
to reduce the collection of matching vertices for each grouping key - or a projection step to transform the matching collection of vertices into a collection of something else. Something else can be a scalar value or vertices/edges
Let’s put all this into practice. So first, the equivalent of our previous groupCount().by("year")
can be rewritten as:
gremlin>g.
V(). // Iterator<Vertex>
hasLabel("movie"). // Iterator<Movie>
group().
by("year"). // iterator = Iterator<Map<Int == year of release, Collection<Movie>>
by(count()) // iterator.stream().map(mapStructure -> mapStructure.stream().collect(Collectors.toMap(key -> key, value -> value.size())))
==>{1921=1, 1925=1, 1926=1, 1927=2, 1931=1, 1933=1, 1934=1, 1935=2, 1936=1, 1937=2, 1938=2, 1939=4, 1940=6, 1941=3, 1942=3, 1943=1, 1944=2, 1945=2, 1946=3, 1948=8, 1949=4, 1950=8, 1951=6, 1952=8, 1953=8, 1954=9, 1955=5, 1956=5, 1957=7, 1958=4, 1959=8, 1960=7, 1961=8, 1962=6, 1963=3, 1964=8, 1965=4, 1966=6, 1967=6, 1968=8, 1969=6, 1970=9, 1971=4, 1972=6, 1973=5, 1974=4, 1975=7, 1976=4, 1977=5, 1978=2, 1979=8, 1980=8, 1981=3, 1982=10, 1983=4, 1984=13, 1985=8, 1986=7, 1987=6, 1988=12, 1989=12, 1990=13, 1991=10, 1992=8, 1993=16, 1994=21, 1995=27, 1996=16, 1997=28, 1998=29, 1999=26, 2000=24, 2001=30, 2002=23, 2003=36, 2004=45, 2005=35, 2006=40, 2007=32, 2008=39, 2009=28, 2010=20, 2011=19, 2012=19, 2013=12, 2014=4, 2015=1, 1903=1}
As expected the result is identical to what we get earlier.
Now let’s say we want to perform a projection of the collection of movies instead of a reduction, we want to have, for each year, a list of movies title instead of movies count, the previous traversal becomes:
gremlin>g.
V(). // Iterator<Vertex>
hasLabel("movie"). // Iterator<Movie>
group().
by("year"). // Iterator<Map<Int == year of release, Collection<Movie>>
by("title"). // Iterator<Map<Int == year of release, Collection<String == movie title>>
unfold(). // Iterator<MapEntry<Int == year of release, Collection<String == movie title>>
take(10)
==>1921=[The Kid]
==>1925=[The Gold Rush]
==>1926=[The General]
==>1927=[Metropolis, Sunrise]
==>1931=[City Lights]
==>1933=[Duck Soup]
==>1934=[It Happened One Night]
==>1935=[A Night at the Opera, Top Hat]
==>1936=[Modern Times]
==>1937=[Snow White and the Seven Dwarfs, Captains Courageous]
So far, so good. Now instead of doing a simple projection on a Movie property, let’s do a traversal instead. We want to group the movies by year and for each year, we display the list of movie director name:
gremlin>g.
V(). // Iterator<Vertex>
hasLabel("movie"). // Iterator<Movie>
group().
by("year"). // Iterator<Map<Int == year of release, Collection<Movie>>
by(out("director").values("name")). // Iterator<Map<Int == year of release, Collection<String == director name>>
unfold(). // Iterator<MapEntry<Int == year of release, Collection<String == movie title>>
take(10)
==>1921=Charles Chaplin
==>1925=Charles Chaplin
==>1926=Buster Keaton
==>1927=F.W. Murnau
==>1931=Charles Chaplin
==>1933=Leo McCarey
==>1934=Frank Capra
==>1935=Mark Sandrich
==>1936=Charles Chaplin
==>1937=Victor Fleming
Strangely enough, we only get 1 director name for each year, which is not correct. It looks like a bug in Gremlin but it isn’t.
When we traverse the edge “director” from the vertex “movie”, it is a 1-to-N relationship because we can have for each movie more than 1 directors so we’ll end up having Collection> and thus a combinatory explosion. By design, Gremlin only call next()
on inner traversal, maybe to avoid such explosion (this is my assumption, to be confirmed).
To force Gremlin to be exhaustive on the inner traversal out("director").values("name")
, we can use the fold()
step
gremlin>g.
V(). // Iterator<Vertex>
hasLabel("movie"). // Iterator<Movie>
group().
by("year"). // Iterator<Map<Int == year of release, Collection<Movie>>
by(out("director").values("name").fold()). // Iterator<Map<Int == year of release, String == director names>>
unfold(). // Iterator<MapEntry<Int == year of release, Collection<String == movie title>>
take(10)
==>1921=[Charles Chaplin]
==>1925=[Charles Chaplin]
==>1926=[Buster Keaton, Clyde Bruckman]
==>1927=[Fritz Lang, F.W. Murnau]
==>1931=[Charles Chaplin]
==>1933=[Leo McCarey]
==>1934=[Frank Capra]
==>1935=[Sam Wood, Mark Sandrich]
==>1936=[Charles Chaplin]
==>1937=[David Hand, Victor Fleming]
Here we are!
Finally, we can also replace the grouping step by a complete traversal instead of a simple vertex property. Let’s say we want to group movies by director name and display the movies title for each director:
gremlin>g.
V(). // Iterator<Vertex>
hasLabel("movie"). // Iterator<Movie>
group().
by(out("director").values("name")). // Iterator<Map<String == director name, Collection<Movie>>
by("title"). // Iterator<Map<String == director name, Collection<String> == movie title>>
unfold(). // Iterator<MapEntry<String == director name, Collection<String == movie title>>
take(10)
==>Kwak Jae-young=[My Sassy Girl]
==>Peter Jackson=[The Lord of the Rings: The Two Towers, The Lord of the Rings: The Fellowship of the Ring, The Hobbit: The Desolation of Smaug, The Lord of the Rings: The Return of the King, King Kong, The Hobbit: An Unexpected Journey]
==>Alejandro Agresti=[The Lake House]
==>George Pan Cosmatos=[Tombstone]
==>Pitof=[Catwoman]
==>Raman Hui=[Shrek the Third]
==>Michael Bay=[Transformers, Armageddon, The Island, Pearl Harbor, Bad Boys, The Rock]
==>Santiago Segura=[Torrente, el brazo tonto de la ley, Torrente 2: Mission in Marbella, Torrente 3]
==>Robert Stevenson=[Mary Poppins]
==>Clare Kilner=[The Wedding Date]
Again, putting out("director").values("name")
inside the first by()
instead of “year” will do the trick.
And that’s all folks! Do not miss the other Gremlin recipes in this series.
If you have any question about Gremlin, find me on the datastaxacademy.slack.com, channel dse-graph. My id is @doanduyhai