Gremlin DSLs in Java with DSE Graph
When we think about our friend Gremlin of Apache TinkerPop™, we typically imagine him traversing a graph, bounding from vertex to edge to vertex, all the while aggregating, filtering, sacking, matching into complex recursions that ultimately provide the answer to the question we asked of him. Gremlin is quite adept at his job of graph traversing, but he is also quite adaptable to the domain of the graph he is traversing. In being adaptable, for one example, Gremlin can become "Dr. Gremlin" for a healthcare domain, thus comprehending a traversal like:
Underneath this healthcare syntax, Gremlin relies on his low-level knowledge of the graph and the various steps that allow his navigation of it, but to users of this language those complexities can be hidden. By now it should be clear that “Dr. Gremlin” simply promotes imagery for a well known concept: Domain Specific Languages (DSLs).
Several years ago a blog post was authored that described how one could develop DSLs in Gremlin. That older post applied to TinkerPop 2, which has long since been eclipsed by the now widely adopted TinkerPop 3, and therefore has minimal relevance to building a DSL today. Today’s blog post seeks to make DSLs relevant to current times using TinkerPop 3 programming paradigms with an emphasis on their implementation with DSE Graph.
The Importance of DSLs
A good argument for the importance of DSLs in the development of graph applications is Gremlin itself. Gremlin is a graph traversal language, or in other words, a DSL for the graph domain. It "speaks" in the language of property graphs, capturing notions of vertices, edges and properties, and then constrains the actions applied to those domain objects (e.g. out(), inE(), has()) to the form of a traversal. The benefit manifests as a far more succinct, robust, and manageable way to interact with a graph structure as compared with an attempt to do so with a general purpose language.
It is the job of the graph application developer to encode their application’s domain into the vertices and edges of a graph. In other words, they define some form of schema by which vertices and edges represent the various aspects of their domain. With knowledge of the schema it then becomes possible to write Gremlin, using graph language, to insert and extract domain data to and from that encoding.
As an example, consider the KillrVideo dataset from the DataStax Academy 330 course. KillrVideo defines a DSE Graph schema that encodes a “movie” domain into the graph. For example, a “movie” vertex and a “person” vertex each have a number of properties:
and there is an "actor" relationship between these two vertex types:
It is with this knowledge that the Gremlin graph domain language can be used to find all the actors who were in "Young Guns":
    g.V().has("movie", "title", "Young Guns").out("actor")
In the above statement, Gremlin is told to get the vertices, filter on the vertex label "movie" and a property called "title" and then for the vertices that are allowed by that filter, traverse on out edges labeled "actor" to the adjacent vertex. Neither the Gremlin code nor the description of what it is doing is especially daunting to follow, but it focuses heavily on graph language and the graph schema to interpret it. Someone who is familiar with both of these things wouldn't have much trouble expressing any traversal that they liked, but those who are less versed in these particulars would have a higher barrier for working with the graph. If the level of abstraction were changed so that those with this higher barrier could express their queries in language more familiar to them (in this case, the KillrVideo language), then their efforts to interact with the graph are simplified.
A KillrVideo DSL, a language for working with elements of the movie domain, could create this higher level of abstraction by allowing the same traversal as above to be written as follows:
    killr.movies("Young Guns").actors()
The first thing to notice in the above traversal is that the language of the graph is now hidden. The traversal internally holds a "movie" vertex and travels over edges, but none of that is especially evident by just reading the code. It simply states: "get me a movie named 'Young Guns' and then find me the actors on that movie". The second thing to note is that the need to understand the schema and logic of the graph is reduced. Obviously a user of the KillrVideo DSL can’t be completely ignorant of what the graph contains, but the developer of that DSL who is more knowledgeable can design away pitfalls that less knowledgeable users would encounter. Some of those pitfalls are covered in the IDE with intelligent code completion, which would prevent mistypes of string-based property keys and edge labels, but it would also be possible to add validation logic into the DSL to ensure proper usage.
A typical use of validation logic would occur when parameters are supplied to a step. A simple check in the example above would be to ensure that the string passed to the movies() step was not null or empty. Therefore, a traversal constructed as killr.movies(null).actors() would immediately throw an exception. Parameterization of steps and the validation of those parameters go hand-in-hand when building a DSL. Complex traversal algorithms can be hidden behind a single step and made flexible by a body of parameters that can provide runtime tweaks to their execution. Parameters could trigger additional filters, limit returned results, define ranges and depths of execution, or expose any other algorithm feature that might be unknown at design time.
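As a rough illustration of the kind of check described above, the following plain-Java sketch validates the argument of a movies()-style step before any traversal would be built. The class MovieStepArguments and its requireTitle helper are hypothetical names made up for this sketch; they are not part of TinkerPop or the KillrVideo project. In a real DSL a check like this would presumably sit at the top of the step method itself.

```java
// Hypothetical helper for illustration; not part of TinkerPop or the KillrVideo project.
final class MovieStepArguments {
    private MovieStepArguments() { }

    // Returns the title unchanged when it is usable; otherwise fails fast.
    static String requireTitle(String title) {
        if (title == null || title.trim().isEmpty()) {
            throw new IllegalArgumentException("The movie title must not be null or empty");
        }
        return title;
    }
}

class MovieStepArgumentsDemo {
    public static void main(String[] args) {
        // A usable title passes straight through.
        System.out.println(MovieStepArguments.requireTitle("Young Guns"));

        // A null title is rejected before any traversal could be constructed.
        try {
            MovieStepArguments.requireTitle(null);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing at construction time keeps the error close to the call site, rather than surfacing later as an empty or misleading result.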
The DSL also creates a buffer that could protect against schema changes. The current schema design calls for a "movie" vertex to have an outgoing "actor" edge to a "person" to define the people who act in a movie. Should that data model change to one where the "actor" edge was promoted to a vertex to resolve the relationship between "movie" and "person", the logic for traversing this relationship is protected by the DSL and this revised traversal logic would only need to be changed within its bounds. In other words, the same results would be derived from killr.movies("Young Guns").actors() irrespective of the nature of the schema.
A final benefit of DSLs is that they can lead to more focused testing. A DSL will typically establish a fair number of "small" reusable steps (not all will be a few lines of Gremlin, however), each of which is straightforward to independently unit test. We can then use these tested steps with confidence elsewhere in the DSL in higher-ordered steps. The tests of these lower-level DSL steps would help provide assurance that an application will behave well after undergoing schema change, without having to wait for errors in application level tests where the Gremlin may be more complex to debug.
Implementation
In TinkerPop 2, Gremlin was heavily driven by the Groovy programming language. Groovy supports metaprogramming which provided a natural fit for building DSLs. Gremlin was outfitted with some helpful utilities to build new DSL steps, which hid the specifics of the metaprogramming that was going on underneath. For TinkerPop 3.x, Gremlin is not bound to Groovy. It is instead supported natively in Java and is extended on the JVM in projects like Gremlin Groovy or Gremlin Scala and off the JVM in projects like Gremlin Python. Since metaprogramming is not an available feature of all of these languages, a new method for building DSLs needed to be devised.
Each Gremlin Language Variant has its own method of DSL development, but the recommended pattern for implementation is largely the same and is rooted in simple inheritance. Reviewing the basic class structure of the Traversal API, there are:
- GraphTraversal - The interface that defines the step methods for the graph traversal DSL (e.g. out(), in(), select(), etc.).
- GraphTraversalSource - A class that spawns GraphTraversal instances (i.e. the variable that is normally denoted by “g” that starts a traversal, as in g.V()).
- __ - A class that spawns anonymous GraphTraversal instances, mostly used as inner traversals.
At the simplest level, creating a DSL involves extending these interfaces and classes. Programming languages, like Python, that are not extraordinarily restrictive on types allow DSLs to be built with limited effort. Java, on the other hand, makes things a bit more difficult. The following is a skeleton of a Java version of the KillrVideo DSL that directly extends GraphTraversalSource:
Note that it returns a KillrVideoTraversal, so when the DSL is used and a call is made to:
    killr.movies("Young Guns")
the return value is a KillrVideoTraversal. As a side note, do not be confused by the use of "killr" for the variable name in place of "g". The familiar "g" could still be used, but to clarify that the code is using the KillrVideo DSL instead of the graph DSL, the "killr" variable name is used. Consider, however, what happens when the traversal “drops down” from the KillrVideo DSL to the graph DSL:
The in() step does not return a KillrVideoTraversal; it returns a GraphTraversal. Therefore, it is not possible to get back to the KillrVideo DSL without casting, as shown below:
It isn’t necessarily hard to figure out what needs to be done to resolve this problem, but it is not one that is solved by simply extending a class.
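The return-type problem, and one general way around it, can be sketched without TinkerPop at all. The classes below are stand-ins invented for this illustration (BaseTraversal plays the role of the graph DSL; the two subclasses play the role of a domain DSL). The fix shown, re-declaring each inherited step with a narrower return type, is an assumption about the general technique rather than a description of any code that TinkerPop generates.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration only; these are NOT the TinkerPop classes.
class BaseTraversal {
    protected final List<String> steps = new ArrayList<>();

    // A "graph" step, declared to return the base type.
    BaseTraversal in(String label) {
        steps.add("in(" + label + ")");
        return this;
    }

    List<String> steps() {
        return steps;
    }
}

// Simply extending the base class: inherited steps still return BaseTraversal.
class PlainDslTraversal extends BaseTraversal {
    PlainDslTraversal actors() {
        steps.add("actors()");
        return this;
    }
}

// Extending and re-declaring the inherited step with a covariant return type.
class CovariantDslTraversal extends BaseTraversal {
    CovariantDslTraversal actors() {
        steps.add("actors()");
        return this;
    }

    @Override
    CovariantDslTraversal in(String label) {
        super.in(label);
        return this;
    }
}

class ReturnTypeDemo {
    public static void main(String[] args) {
        // Without the override, the static type after in() is BaseTraversal, so a cast is needed.
        PlainDslTraversal casted = (PlainDslTraversal) new PlainDslTraversal().actors().in("actor");
        casted.actors();

        // With the override, DSL steps stay reachable after a graph step.
        CovariantDslTraversal chained = new CovariantDslTraversal().actors().in("actor").actors();
        System.out.println(chained.steps());
    }
}
```

Doing this by hand for every step of a large interface is tedious and easy to get out of sync, which is why some form of code generation is attractive here.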
To make DSL building a bit easier, TinkerPop has a GremlinDsl annotation which can help streamline the process of DSL building in Java. The GremlinDsl annotation can be applied to a template interface that extends GraphTraversal. The annotation marks the interface as one to be processed by the Java Annotation Processor, which will generate some boilerplate code at compilation, thus providing the KillrVideoTraversalSource (and related classes/interfaces) that is passed to graph.traversal() to begin using the DSL.
The KillrVideoTraversalSource will have its own methods to start a traversal. For example, rather than starting a traversal with g.V() it could be started with killr.movies(). To allow that to happen the annotation must be updated:
Adding the traversalSource parameter specifies the class used to help generate the KillrVideoTraversalSource class. The KillrVideoTraversalSourceDsl template class referenced above looks like this:
Both template classes, KillrVideoTraversalDsl and KillrVideoTraversalSourceDsl, will contain all the custom DSL methods that will drive the language. It is important to only use existing Gremlin steps (or other DSL steps conforming to that requirement) within these methods to build the traversal so that it remains compatible for remoting, serialization, traversal strategies and other aspects of the TinkerPop stack. It is also worth keeping in mind that the code within these DSL steps is meant for traversal construction. Attempting to include methods that do not meet the expected signature (e.g. adding a method that returns something other than a Traversal) may lead to unexpected problems for the annotation processor.
To this point, we've discussed the workings of a movies() step and an actors() step. These steps can be added to this DSL scaffolding as follows:
The movies() method demonstrates how a DSL can hide logic around a traversal’s construction. The traversal always filters on the label for movies but only adds filters by title if titles are present. These new steps can now be put to use in the following code, where the DSE Java Driver is used to connect to DSE Graph:
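Setting the driver connection aside, the conditional construction that movies() is described as performing can be sketched in plain Java. The class MovieFilterSketch is a hypothetical stand-in that records step descriptions as strings instead of building a real traversal; the step text it records (including the use of a "within" style filter for titles) is an assumption made for illustration and is not taken from the KillrVideo project.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a traversal source; it records step descriptions instead of touching a graph.
final class MovieFilterSketch {
    private MovieFilterSketch() { }

    // Always filters on the "movie" label; adds a title filter only when titles are supplied.
    static List<String> movies(String... titles) {
        List<String> steps = new ArrayList<>();
        steps.add("V()");
        steps.add("hasLabel(movie)");
        if (titles != null && titles.length > 0) {
            steps.add("has(title, within(" + String.join(", ", titles) + "))");
        }
        return steps;
    }
}

class MovieFilterSketchDemo {
    public static void main(String[] args) {
        // No titles: only the label filter is recorded.
        System.out.println(MovieFilterSketch.movies());
        // One title: the title filter is appended after the label filter.
        System.out.println(MovieFilterSketch.movies("Young Guns"));
    }
}
```

The caller sees a single step either way; whether a title filter is present is a construction-time detail hidden inside the method.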
Java Annotation Processor at Work
When building the KillrVideo DSL project, the Java Annotation Processor will detect the GremlinDsl annotation and generate the appropriate code. Assuming usage of Maven, the code will be generated to target/generated-sources/annotations and will produce four files:
The generated code is lengthy and too much to display here, however the following snippets may help with understanding how everything fits together:
The full KillrVideo DSL project includes additional steps and documentation and can be found in the DataStax graph-examples repository. This repository not only contains the Maven-based DSL project, but also includes instructions for loading the KillrVideo DSL data into DSE Graph using the DSE Graph Loader.
One of the interesting use cases presented there is for graph mutations. The project shows how to enable this syntax:
The four lines of code above perform a number of tasks:
- The mutation steps movie() and actor() are meant to "get or create/update" the relevant graph elements. Therefore, when we use movie() the traversal first determines if the movie is present. If it is, it simply returns it and updates its properties with those specified. If the movie is not present, it adds the vertex first with the presented properties and returns it.
- With the actor() step the complexity is even greater because it must first detect if the "person" vertex is present, and if not add it. It then must also detect if that person already has an "actor" edge to the movie and if not, add it.
- Both mutation steps contain validation or sensible defaults if values are not provided to enforce data integrity. As this code is bound to the steps, the logic is centralized which is convenient for testing and maintainability.
- The ensure() step is an alias to the standard sideEffect() step. As an alias it provides for a more readable language in the KillrVideo domain. By wrapping the mutation steps in ensure(), the mutations become side-effects so that the "movie" vertex passes through those steps, which allows actor() steps to be chained together.
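The "get or create/update" behavior described for movie() can be illustrated with an in-memory stand-in. The sketch below uses a plain map keyed by title in place of "movie" vertices; the class MovieUpsertSketch and the "year" property are hypothetical and chosen only for this example, and the code makes no claim about how the KillrVideo project implements the step in Gremlin.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for "movie" vertices keyed by title; not DSE Graph or TinkerPop code.
final class MovieUpsertSketch {
    private final Map<String, Map<String, Object>> moviesByTitle = new HashMap<>();

    // Returns the existing movie or creates it, then applies the supplied properties.
    Map<String, Object> movie(String title, Map<String, ?> properties) {
        if (title == null || title.isEmpty()) {
            throw new IllegalArgumentException("The movie title must not be null or empty");
        }
        Map<String, Object> movie = moviesByTitle.computeIfAbsent(title, t -> {
            Map<String, Object> created = new HashMap<>();
            created.put("title", t);
            return created;
        });
        movie.putAll(properties); // update path: overwrite with the values provided
        return movie;
    }

    int count() {
        return moviesByTitle.size();
    }
}

class MovieUpsertSketchDemo {
    public static void main(String[] args) {
        MovieUpsertSketch store = new MovieUpsertSketch();
        // The first call creates the movie; the second finds it and updates it.
        store.movie("Young Guns", Collections.singletonMap("year", 1987));
        Map<String, Object> movie = store.movie("Young Guns", Collections.singletonMap("year", 1988));
        System.out.println(store.count() + " movie(s): " + movie.get("title") + ", year " + movie.get("year"));
    }
}
```

Because the same call is safe to repeat, a caller does not need to know whether the element already exists, which is what lets such steps be chained freely.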
The syntax of the DSL is highly readable with respect to its intentions and the steps demonstrate their flexibility and power to be re-used and chained. Compare the above example to the actual graph traversal that is being executed underneath:
Conclusion
This blog post discussed the benefits of Gremlin DSLs and re-established their development patterns with Gremlin for TinkerPop 3.x. As we conclude, let’s return to the "Dr. Gremlin" example from the introduction:
While the graph language is hidden, we've already seen where it is quite possible to drop back into that language at any point along the way if desired. Assuming the interactions() DSL step returned "interaction" vertices, it would be simple enough to filter those with a graph-based has() step. The power of the DSL approach is that the essence of Gremlin still remains. Each step, whether from the graph language or the DSL, still serves to mutate the type of the object passed to it from the previous step and all standard Gremlin semantics remain intact.