Gremlin Recipes: 8 – Sack() Operator
Part 8 of 10 for the series Gremlin Recipes. The purpose is to explain the internal of Gremlin and give people a deeper insight into the query language to master it.
This blog post is the 8th from the series Gremlin Recipes. It is recommended to read the previous blog posts first:
- Gremlin as a Stream
- SQL to Gremlin
- Recommendation Engine traversal
- Recursive traversals
- Path object
- Projection and selection
- Variable handling
I KillrVideo dataset
To illustrate this series of recipes, you need first to create the schema for KillrVideo and import the data. See here for more details.
The graph schema of this dataset is :
II Introduction to sack()
The sack()
step was introduced recently to help people solving some problem more efficiently.
In some case people are writing Gremlin traversals that make use of data aggregation using path information. Typically, people will use path()
and then perform “reduction” of the data in the path to get the specific result.
Unfortunately, this is inefficient as path calculations are expensive, not mergable. The same ca be now achieved using sack()
.
A sack is a local datastructure relative to each traverser, unlike global datastructure as when using aggregate()/store()/group(x)/subgraph(x)....
It mean that each traverser is now equipped with his own sack. A sack can contain any type of data structure:
- primitives: Int, Double, Long, …
- collections: List/Set/Map\
A. Defining the sack
The definition of this data structure should be placed right after the traversal object:
gremlin>g.withSack(1.0). // Define a sack containing an initial double value
V().
...
gremlin>g.withSack([] as Set). // Define a sack containing an initial empty Set
V().
...
gremlin>g.withSack([1,2,3] as Set). // Define a sack containing a pre-initialized Set
V().
...
gremlin>g.withSack([] as List). // Define a sack containing an initial empty List
V().
...
gremlin>g.withSack([:]). // Define a sack containing an initial empty Map
V().
...
gremlin>g.withSack([:]. // Define a sack containing an initial empty Map
withDefault{key -> [] as List}). // with default values being an empty list
V().
...
The complete signature of the withSack()
step is withSack(default_data_structure, split_operator, merge_operator)
- split operator is used to split the data structure whenever traverser is facing a branch and needs to duplicate itself. Example:
V().has("user","userId", "u861").out("rated")
. From the User vertex now there are as many traversers as there are ratings to different movie. Thus the sack structure at User vertex get duplicated - merge operator is used to merge different sack data structures together whenever all traversers are converging to a single point. Example:
g.V().hasLabel("movie").out("director").has("name", "Woody Allen")
. The traversers from many Movie vertices are converging to the single Person vertex which is Woody Allen. The sack at vertex Woody Allen is the merge of all sack data structures coming from his movies. If you do not define any merge operator, the sacks cannot be merged
Note: for primitive data type, split operator == value copy. For other data types like collection, if you don’t define any split operator, the same object reference will be copied to all branching traversers. Consequently all those traversers will shared the same data structure. Be very wary of this
Some example of defining collection as data structure for sack:
// Define a sack with sum() as merge operator
// The split operator is implicitly copy-by-value for primitives
gremlin>g.withSack(1.0, sum).
V().
...
// Define a sack with initial empty Set
// Split function == Set::clone()
// Merge function == Set::appendAll()
g.withSack([] as Set){set-> set.clone()}{setA,setB -> setA.appendAll(setB); setA}.
V()
...
// Define a sack with initial empty Map
// Split function == Map::clone()
// Merge function == Map::putAll() gremlin>g.withSack([:]){it.clone}{mapA,mapB -> mapA.putAll(mapB); mapA}.
V().
...
B. Mutating the sack
To mutate the content of the sack, you can use the step sack(bi-function).by(projection_axis)
.
- bi-function is a function with signature:
DATA_STRUCTURE apply(DATA_STRUCTURE current, X projected_value)
- DATA_STRUCTURE is the type of the data structure inside the sack (primitives, collections, …) initially defined with
g.withSack(DATA_STRUCTURE)
- X is the type of the projected value on the current vertex/edge using the modulator
by()
- DATA_STRUCTURE is the type of the data structure inside the sack (primitives, collections, …) initially defined with
- projection_axis is the property/inner traversal result applied on a vertex/edge
C. Accessing the sack
The sack data structure can be accessed at any time by calling the step sack()
. The data type returned by this step is Iterator
III Usage of sack()
gremlin>g.withSack(0).V().
has("movie", "title", "Blade Runner"). // Iterator
==>6
==>5
==>8
==>8
==>8
==>9
...
In the above traversal the interesting step is inE("rated").sack(sum).by("rating").outV()
. On each edge “rated”, we add the value “rating” into the sack value. Please note the usage of by("rating")
acting as projection for the sack()
. Sum is the bi-function Int apply(Int current, Int projected_value)
When the traversal reaches step outV()
there are as many traversers as there are users rating Blade Runner and each of those traversers carries a sack containing the value of the rating. When calling sack()
the content of each sack is extracted and therefor we have an Iterator
Computing Blade Runner average rating with sack()
is really easy:
gremlin>g.withSack(0).V().
has("movie", "title", "Blade Runner"). // Iterator
Now let’s see how we can use sack()
for useful purposes. First you need to import the schema of London Tube as described in this previous post.
Then switch g alias to London_Tube
gremlin>:remote config alias g London_Tube.g
==>g=London_Tube.g
One of the goal in this blog post was to find the shortest path between South Kensington and Marble Arch tube stations. Shortest path can mean either least stations count or least line changes.
Shortest path by least stations count
gremlin>g.withSack(0).V().
has("station", "name", "South Kensington"). // Iterator
ASCENDING
==>[South Kensington, -->, Knightsbridge, -->, Hyde Park Corner, -->, Green Park, -->, Bond Street, -->, Marble Arch]
The interesting pieces are:
sack(sum).by(constant(1))
: for each traversed station, we increment the counter by +1order().by(sack(),incr)
: order each journey by the counter in each sack, ASCENDING order
Shortest path by least line changes
gremlin>g.withSack([] as Set){it.clone()}{a,b -> a.addAll(b); a}.
V().
has("station", "name", "South Kensington"). // Iterator
==>[South Kensington, -->, Gloucester Road, -->, High Street Kensington, -->, Notting Hill Gate, -->, Queensway, -->, Lancaster Gate, -->, Marble Arch]
Let’s dig into the traversal:
withSack([] as Set)
: defines a sack with an empty Set{it.clone()}
: defines the split operator for the set, which is just cloning the set itself{a,b -> a.addAll(b); a}
: defines the merge operator for the set, which is just appending all contents of the sets together- Please notice the Groovy closure syntax to pass lambdas as arguments to the
withSack()
method. More details about this syntax is given here sack{set,val -> set.add(val); set}.by("line")
: on each traversed “connectedTo” edge, we add the “line” property to our set inside the sackorder().by(sack().count(local),incr)
: we order the journeys by the number of lines inside each sack. Please notice the usage ofsack().count(local)
. Since the sack() type is now anIterator>
we need the local keyword to count the number of element in the nested set
And that’s all folks! Do not miss the other Gremlin recipes in this series.
If you have any question about Gremlin, find me on the datastaxacademy.slack.com, channel dse-graph. My id is @doanduyhai