Technology•September 25, 2017

Gremlin Recipes: 7 – Variables Handling

Part 7 of 10 for the series Gremlin Recipes. The purpose is to explain the internal of Gremlin and give people a deeper insight into the query language to master it.

This blog post is the 7th from the series Gremlin Recipes. It is recommended to read the previous blog posts first:

I KillrVideo dataset

To illustrate this series of recipes, you need first to create the schema for KillrVideo and import the data. See here for more details.

The graph schema of this dataset is:

II Variable handling operators

In Gremlin we distinguish 2 types of operators:

for setting variables
1. as side effect
2. as label in path history (path object)
for accessing variables

Below are the list of these operators

Operator

Side-Effect/Label

Setting/reading variable

Data type

store("variable")

side effect

setting

BulkSet implements Set

aggregate("variable")

side effect

setting

BulkSet implements Set

group("variable").by(grouping_axis).by(projection_axis)

side effect

setting

Map

groupCount("variable").by(projection_axis)

side effect

setting

Map

as("label")

label

setting

Traversal

subgraph("variable")

side effect

setting

TinkerGraph or Traversal depending on how you access it

select("variable_or_label")

side effect or label

reading

depends on the variable

cap("variable")

side effect or label

reading

depends on the variable

First let’s see how many users have rated Blade Runner more than 9/10:A. store(variable)

gremlin>g.V(). has("movie", "title", "Blade Runner"). inE("rated").has("rating", gte(9)). count() ==>45

Ok, now let’s save all those 45 ratings using store() operator:

gremlin>g.V(). has("movie", "title", "Blade Runner"). // Iterator<Movie> inE("rated").has("rating", gte(9)). // Iterator<Rated_Edge> store("rated_edges"). // Store rated edges inside a Collection outV(). // Iterator<User> select("rated_edges"). // Iterator<BulkSet<Rated_Edge>> next(). // BulkSet<Rated_Edge> size() ==>5

Strangely we only have 5 edges stored in the “rated_edges” collection. Why ? The reason is that store() is a lazy operator and only appends the traversed edges lazily into the “rated_edges” collection. The time the traverser reaches the next().size() step, only 5 edges have been accumulated into the “rated_edges” collection so far.

To force Gremlin to fill up completely the “rated_edges” collection so that next().size() returns the expected 45 count, we can use the barrier() instruction. In this case Gremlin will perform a breadth-first evaluation of all rated edges before moving to the next step

The concept of depth-first and breadth-first will be discussed in a future blog post

gremlin>g.V(). has("movie", "title", "Blade Runner"). // Iterator inE("rated").has("rating", gte(9)). // Iterator store("rated_edges"). // Store rated edges inside a Collection outV(). // Iterator barrier(). // Force eager evaluation of all previous steps select("rated_edges"). // Iterator> next(). // BulkSet size() ==>45

Optionally, instead of using select("rated_edges") we could use cap("rated_edges") instead. cap() will force an eager evaluation of all previous step, pretty similar to barrier(). Indeed you can see cap(xxx) as a shorthand for barrier().select(xxx)

gremlin>g.V(). has("movie", "title", "Blade Runner"). // Iterator inE("rated").has("rating", gte(9)). // Iterator store("rated_edges"). // Store rated edges inside a Collection outV(). // Iterator cap("rated_edges"). // Iterator> next(). // BulkSet size() ==>45

B. aggregate(variable)

Aggregate works like store() but in an eager fashion. Consequently with aggregate() you don’t need to resort to neither barrier() nor to cap()

gremlin>g.V(). has("movie", "title", "Blade Runner"). // Iterator inE("rated").has("rating", gte(9)). // Iterator aggregate("rated_edges"). // Store rated edges inside a Collection outV(). // Iterator select("rated_edges"). // Iterator> next(). // BulkSet size() ==>45

aggregate(x).select(x) == store(x).cap(x) == store(x).barrier().select(x)

C. group(variable).by(grouping_axis).by(projection_axis)

The group(variable) allows you to store the partial grouping result (which is a Map structure) as a side effect of your traversal.

Let’s say we want to save all fans of Blade Runner and Inception and group them by their age and project on their userId:

Strangely the query timed out. Is it related to any lazy evaluation rule or something like that ? Not at all. The explanation is given inside the 1st comment on TINKERPOP-1741:

You can’t select('a'). You have to cap('a'). This is because GroupStep requires a “on completion” computation. Why is it like this? When should reduction happen? select() just grabs the side-effect and if it hasn’t been reduced (because it might be reduced later), then that’s that. Why not have it reduced at group('a') – nope, you can’t cause you typically use side-effects repeatedly (e.g. in a repeat()). If you wanted it reduced after group('a'), you would use group().store('a'). Thus, the only step that we have the forces reduction on a side-effect is cap()

Long story short, when you’re doing group(variable) the map structure is not closed and not available for reading yet until you call cap(variable)>. The reason not to close the map structure and let it open is because we may accumulate more data into it later in the traversal and that’s what exactly happens with the above traversal.

First we cumulate fans of Blade Runner into “scifi_fans” map and we re-use it later to store Inception fans.

Please notice the use of V(). in the middle of our traversal. Once we have grouped all fans of Blade Runner, we restart a new traversal by jumping back to the original V(). iterator on vertices.

gremlin>g.V(). has("movie", "title", "Blade Runner"). // Iterator inE("rated").has("rating", gte(9)).outV(). // Iterator group("scifi_fans"). // Group fans of Blade Runner by("age"). // By age by("userId"). // Project on userId V(). // Restart a new traversal has("movie", "title", "Inception"). // Iterator inE("rated").has("rating", gte(9)).outV(). // Iterator group("scifi_fans"). // Group fans of Inception by("age"). // By age by("userId"). // Project on userId cap("scifi_fans") // Recall the Map><Int==age, Collection<String==userId>> ==>{64=[u408, u476, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u411, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452, u452], 18=[u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978, u978], ...

The result is ugly and there are a lot of userId duplicate for each age. We need a de-duplication:

The result is shorter but also wrong. There is only 1 user for each age group. Why ? This is again a classic caveat. For all inner traversals Gremlin will call by default next() to get the result/output of this inner traversal. Since values("userId").dedup() is a traversal, we need to use fold() to collect all the userId into a collection structure:

gremlin>g.V(). has("movie", "title", "Blade Runner"). // Iterator inE("rated").has("rating", gte(9)).outV(). // Iterator group("scifi_fans"). // Group fans of Blade Runner by("age"). // By age by(values("userId").dedup().fold()). // Project on userId V(). // Restart a new traversal has("movie", "title", "Inception"). // Iterator inE("rated").has("rating", gte(9)).outV(). // Iterator group("scifi_fans"). // Group fans of Inception by("age"). // By age by(values("userId").dedup().fold()). // Project on userId cap("scifi_fans") // Recall the Map><Int==age, Collection<String==userId>> ==>{64=[u452, u408, u411, u476], 18=[u978], 19=[u1031, u292], 20=[u637, u483], 21=[u1070, u689, u239, u305], 22=[u356, u501, u456], 23=[u390, u678, u841], 24=[u355, u692, u718], 25=[u255, u724, u963], 26=[u628, u848, u598], 27=[u685, u339, u240, u445], 28=[u979, u497], 29=[u420, u861, u474, u1000], 32=[u473], 33=[u548, u547], 34=[u262], 35=[u785, u603, u223, u511], 36=[u744, u267], 37=[u398, u386, u434, u646], 39=[u365, u896], 40=[u693, u764], 41=[u513, u498], 42=[u347, u536], 43=[u287, u389, u251, u327], 44=[u442], 45=[u565], 46=[u335], 47=[u591, u466], 48=[u622], 49=[u455], 50=[u555, u252], 51=[u507], 52=[u416, u263], 53=[u328], 54=[u268, u524], 55=[u332], 56=[u530, u477, u396], 57=[u522, u210], 58=[u551], 59=[u349, u202], 60=[u630, u218, u491, u378], 61=[u617]}

So far so good.

D. groupCount(variable).by(grouping_axis)

groupCount() is just a special case of group() so it works exactly the same:

gremlin>g.V(). has("movie", "title", "Blade Runner"). // Iterator inE("rated").has("rating", gte(9)).outV(). // Iterator groupCount("scifi_fans"). // Group fans of Blade Runner by("age"). // By age V(). // Restart a new traversal has("movie", "title", "Inception"). // Iterator inE("rated").has("rating", gte(9)).outV(). // Iterator groupCount("scifi_fans"). // Group fans of Inception by("age"). // By age cap("scifi_fans") // Recall the Map ==><Int==age, Long==user count>
==>{64=92, 18=45, 19=46, 20=90, 21=92, 22=92, 23=91, 24=47, 25=135, 26=47, 27=92, 28=90, 29=48, 32=45, 33=46, 34=1, 35=92, 36=91, 37=92, 39=47, 40=46, 41=46, 42=2, 43=92, 44=45, 45=1, 46=1, 47=46, 48=1, 49=45, 50=90, 51=45, 52=90, 53=1, 54=47, 55=45, 56=91, 57=46, 58=1, 59=46, 60=92, 61=45}

E. as(label)

The case of as(label) is interesting. It allows you to label any step in your traversal to refer to it later. Used in combination with select() you can jump back and forth inside a traversal.

Taking the previous examples of Blade Runner fans:

has("movie", "title", "Blade Runner"). // Iterator inE("rated").has("rating", gte(9)). // Iterator as("rated_edges"). // Label this step "rated_edges" outV(). // Iterator select("rated_edges"). // should be Iterator next() // Rated_Edge ==>e[{~label=rated, ~out_vertex={~label=user, community_id=227508608, member_id=635}, ~in_vertex={~label=movie, community_id=285248128, member_id=2}, ~local_id=c2ef7ea4-7066-11e7-897a-edc6e0130fc0}][{~label=user, community_id=227508608, member_id=635}-rated->{~label=movie, community_id=285248128, member_id=2}

It is also possible to assign multiple labels and recall them together. In this traversal we display the fans of Blade Runner as well as their rating:

gremlin>g.V(). has("movie", "title", "Blade Runner"). // Iterator<Movie> inE("rated").has("rating", gte(9)). // Iterator as("rated_edges"). // Label this step "rated_edges" outV(). // Iterator<User> as("fans"). // Label this step "fans" select("fans","rated_edges"). // Select "fans" and "rated_edges" by("userId"). // project "fans" on userId by("rating"). // project "rated_edges" on rating ==>{fans=u622, rated_edges=9} ==>{fans=u628, rated_edges=9} ==>{fans=u223, rated_edges=9} ==>{fans=u434, rated_edges=9} ==>{fans=u1031, rated_edges=9} ==>{fans=u718, rated_edges=9} ==>{fans=u524, rated_edges=9} ==>{fans=u365, rated_edges=9} ==>{fans=u689, rated_edges=9} ==>{fans=u390, rated_edges=9} ...

Easy. Now what if the 2 labeled steps are independent e.g. there is a mid traversal ?

gremlin>g.V(). has("movie", "title", "Blade Runner"). // Iterator inE("rated").has("rating", gte(9)).outV(). // Iterator as("fans_blade_runner"). // Label this step "fans_blade_runner" V(). // new traversal has("movie", "title", "Inception"). // Iterator inE("rated").has("rating", gte(9)).outV(). // Iterator as("fans_inception"). // Label this step "fans_inception" select("fans_blade_runner","fans_inception"). // Select "fans_blade_runner" and "fans_inception" by("userId"). // project "fans_blade_runner" on userId by("userId"). // project "fans_inception" on userId limit(10) ==>{fans_blade_runner=u622, fans_inception=u744} ==>{fans_blade_runner=u622, fans_inception=u978} ==>{fans_blade_runner=u622, fans_inception=u963} ==>{fans_blade_runner=u622, fans_inception=u598} ==>{fans_blade_runner=u622, fans_inception=u252} ==>{fans_blade_runner=u622, fans_inception=u513} ==>{fans_blade_runner=u622, fans_inception=u555} ==>{fans_blade_runner=u622, fans_inception=u764} ==>{fans_blade_runner=u622, fans_inception=u378} ==>{fans_blade_runner=u622, fans_inception=u332}

As we can see Gremlin is performing a kind of cartesian product between the 2 sets of users. This is explained by the fact that we have a mid traversal step V() which is equivalent to having 2 separated traversals combined together, thus the cartesian product effect.

F. subgraph(variable)

subgraph() allows you to save a partial edge-induced subgraph. Consequently the subgraph() step can only be placed after an edge step, not after a vertex step:

gremlin>g.V(). has("movie", "title", "Blade Runner"). inE("rated").has("rating", gte(9)).outV(). subgraph("fans_blade_runner") com.datastax.bdp.graph.impl.element.vertex.DsegCachedVertexImpl cannot be cast to org.apache.tinkerpop.gremlin.structure.Edge Type ':help' or ':h' for help. Display stack trace? [yN]

To fix this:

gremlin>g.V(). has("movie", "title", "Blade Runner"). // Iterator inE("rated").has("rating", gte(9)). // Iterator subgraph("fans_blade_runner"). // Save the subgraph connected by those edges outV(). // Iterator cap("fans_blade_runner"). // Iterator next(). // TinkerGraph traversal(). // DefaultTraversal V(). // Iterator count() // count number of vertex ==>46

Please note the interesting sequence here

Similar to group() subgraph() needs a termination step and in this case cap("fans_blade_runner")is mandatory
cap("fans_blade_runner") yields a Iterator
next() yields a TinkerGraph
traversal() yields a DefaultTraversal

We found 46 vertices, which correspond to 1 Blade RunnerMovie vertex + 45 fans User vertices.

There is no much usage of subgraph() as in-stream variable since it cannot be re-used later in the traversal itself.

G. select(variable_or_label)

There is no much to say about select() step. It can be used to pick side effect variables as well as saved labels in the path history. The only thing you should pay attention to is that any reducer step (sum(), mean(), …) will destroy path history and thus you loose all previously labeled steps.

H. cap(variable)

Not much to say, every thing has been said about cap() above.

III In-stream variable

In normal Gremlin pipeline, it is not unusual to create intermediate variables and re-use them later in subsequent traversal. Let’s take the recommendation engine traversal we wrote in some the previous post.

has("movie", "title", "Blade Runner"). // Iterator inE("rated").values("rating"). // Iterator mean().next() // Double ==>8.20353982300885 gremlin>def genres = g.V(). has("movie", "title", "Blade Runner"). // Iterator out("belongsTo"). // Iterator values("name"). // Iterator fold().next() // Collection ==>Sci-Fi ==>Action gremlin> g.V(). ......1> has("movie", "title", "Blade Runner").as("blade_runner"). ......2> inE("rated").filter(values("rating").is(gte(avg_rating))).outV(). ......3> outE("rated").filter(values("rating").is(gte(avg_rating))).inV(). ......4> where(neq("blade_runner")). ......5> filter(inE("rated").values("rating").mean().is(gte(avg_rating))). ......6> filter(out("belongsTo").has("name", within(genres))). ......7> dedup(). ......8> project("movie", "average_rating", "genres"). ......9> by("title"). .....10> by(inE("rated").values("rating").mean()). .....11> by(out("belongsTo").values("name").fold()) ==>{movie=Pulp Fiction, average_rating=8.581005586592179, genres=[Thriller, Action]} ==>{movie=Seven Samurai, average_rating=8.470588235294118, genres=[Action, Adventure, Drama]} ==>{movie=A Clockwork Orange, average_rating=8.215686274509803, genres=[Sci-Fi, Drama]}

Fine, but what if we want to avoid defining intermediate variables like avg_rating and genres ? Can we have our recommendation traversal and define those variables inside the traversal itself to make it one-liner ? The answer is YES!!! That’s what’s I call in-stream variables e.g. variables created during previous traversal steps to be re-used later.

1st solution using aggregate()

gremlin>g.V(). has("movie", "title", "Blade Runner").as("blade_runner"). // Label blade_runner sideEffect(inE("rated").values("rating").mean().aggregate("avg_rating")). // Save the avg_rating in a sideEffect() step sideEffect(out("belongsTo").values("name").fold().aggregate("genres")). // Save the genres in a sideEffect() step inE("rated").filter(values("rating").where(gte("avg_rating")).by(unfold())).outV(). // Recall "avg_rating" using where() outE("rated").filter(values("rating").where(gte("avg_rating")).by(unfold())).inV(). // Recall "avg_rating" using where() where(neq("blade_runner")). filter(inE("rated").values("rating").mean().where(gte("avg_rating")).by(unfold())). // Recall "avg_rating" using where() filter(out("belongsTo").has("name", where(within("genres")).by(unfold()))). // Recall "genres" using where() dedup(). project("movie", "average_rating", "genres"). by("title"). by(inE("rated").values("rating").mean()). by(out("belongsTo").values("name").fold())

A little bit of explanation is required.

sideEffect(inE(" rated").values("rating").mean().aggregate("avg_rating")) : in this step we save the average rating of Blade Runner using aggregate("avr_rating"). It means that the type of “avg_rating” is now a BulkSet with a single element. The reducing barrier step mean() is performed inside the sideEffect() step leaving the outside path history untouched and preserving our “blade_runner” label.

Similarly all the genres of Blade Runner are saved in “genres” using sideEffect(out("belongsTo").values("name").fold().aggregate("genres")). The type of “genres” is a BulkSet with 2 elements: “Sci-Fi” & “Action”

Now let’s analyse inE("rated").filter(values("rating").where(gte("avg_rating")).by(unfold())).outV(): we’re using where(gte("avg_rating")) to perform variable resolution for “avg_rating”. However, what is the by(unfold()) ??? Indeed it is a projection modulator for the where() clause. The data type of “avg_rating” BulkSet so we cannot compare a Double coming from values("rating") with this. by(unfold()) will project the BulkSet using unfold() and then we obtain a Double as a result.

The same remarks apply to filter(out("belongsTo").has("name", where(within("genres")).by(unfold())))

A 2nd solution using labeled step as():

gremlin>g.V(). has("movie", "title", "Blade Runner").as("blade_runner"). // Label "blade_runner" map(inE("rated").values("rating").mean()).as("avg_rating"). // Label "avg_rating" step select("blade_runner"). // Jump back map(out("belongsTo").values("name").fold()).as("genres"). // Label "genres" step select("blade_runner"). // Jump back inE("rated").filter(values("rating").where(gte("avg_rating"))).outV(). outE("rated").filter(values("rating").where(gte("avg_rating"))).inV(). where(neq("blade_runner")). filter(map(inE("rated").values("rating").mean()).where(gte("avg_rating"))). filter(out("belongsTo").has("name", where(within("genres")))). dedup(). project("movie", "average_rating", "genres"). by("title"). by(inE("rated").values("rating").mean()). by(out("belongsTo").values("name").fold())

So to decorticate this giant traversal.

map(inE("rated").values("rating").mean()).as("avg_rating"): classic trick, wrap the reducing barrier mean() inside a map() to avoid loosing path history and assign the result a label. “avg_rating” is now an Iterator with a single element

Same strategy for map(out("belongsTo").values("name").fold()).as("genres"): the data type of “genres” is Iterator with 2 elements.

Please notice the repeated usage of select("blade_runner") to force Gremlin to backtrack to the original movie vertex to restart a new traversal from there.

inE("rated").filter(values("rating").where(gte("avg_rating"))).outV(): unlike previously when we were using aggregate() we don’t need any by() modulator here since the average rating is not nested inside any collection. where(gte("avg_rating")) will just pop the average rating value out from the Iterator, calling implicitly next() on it.

The rest of the traversal is quite clear. And that’s all folks! Do not miss the other Gremlin recipes in this series.Gremlin, find me on the datastaxacademy.slack.com, channel dse-graph. My id is @doanduyhai

Discover more

Gremlin

JUMP TO SECTION

More Technology

View All

How to Create a Local LangChain Vector Database

Technology • December 20, 2024

One-stop Data API for Production GenAI

Astra DB gives JavaScript developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.

Learn More

Get Started for Free

Gremlin Recipes: 7 – Variables Handling

I KillrVideo dataset

II Variable handling operators