The Von Gremlin Architecture
One sunny summer afternoon, Gremlin was floating in a rocky pond near a grassy knoll a mere stones throw from a fertile garden filled with morning glory tentacles tangling themselves 'round many merry marigolds. As far as days go, Gremlin was having a good one. He stared up at fluffy cumulus clouds seemingly made of whipped cream dressed upon a blue desert sky. Sprinkles of rain drizzled down nearly evaporating before hitting his crispy skin being baked by the sharp summer sun slicing sideways through the scene. Gremlin was unable to discern where his face ended and the atmosphere began. The concateny of serenely surreal sensations dissolved his ever traversing problem-solving mind as there were no stresses evoking solution-seeking thoughts. The mind's muscles lost their tension...relieving themselves of their duty. The wash of experience slipped past his ensnaring concept net revealing an infinite moment, where his mind's traversal finally halted, exposing the machine that manifests it all -- the Gremlin traversal machine.
This article presents graphs as a universal data structure and Gremlin as a universal virtual machine for the burgeoning multi-model approach to Big Data computing.
The TinkerPop Is Working Against You On Your Behalf
Gremlin had always understood himself to be the constant cascade of traversing thoughts that flashed by one after another from birth to death. Never did he spend one of those many thoughts to think about what was actually doing the thinking. It wasn't until that day, floating in the summer pond, that he realized there was a more general process underlying his traversals. The fact that the thought he was having about the very nature of his thoughts made him think that this thought machinery was powerful. He questioned: "How is it that my mind is able to create so much novelty/complexity?" Gremlin couldn't put into words what computer scientists had known all along -- Gremlin was Turing Complete. The realization that he was capable of manifesting any possible traversal threw him into a spiraling dizzy. Gremlin was a general-purpose traversing machine!
Gremlin's previously comfortable home within the constant stream of traversals had collapsed... He could no longer simply be a traversal (a particular program), he would be the source of all traversals (the machine capable of executing any program). "Graph!" He tried desperately to throw out an anchor. Anything to help dig himself into the walls of this receding pit of abstraction. "While I can process graphs using any arbitrary algorithm, what about graphs themselves? There must be something more fundamental representing that structure!?" Fear drove Gremlin to try his hardest to avoid the inevitable entailment of the logic of his existence -- his manifestation as The TinkerPop. Instead, Gremlin wanted a buffer. Something that made him just some standard green guy livin' a standard green life captured on an old bleached out picture in a run-of-the-mill frame on a unimportant wall of The TinkerPop's celestial palace somewhere out there in the great wide computational divide. Unfortunately, his anchor had no bite. Like his processing self, graphs are universal. Every other data structure is a graph: maps (key/value), matrices (relational), trees (documents). In the discrete world of computing, there is nothing that eludes an elegant graphical representation. There are things (vertices/datum/energy) and they are related to one another (edges/locations/space). Simplicity knows no better.
The Multi-Model Graph Database
Database | Query Language | Gremlin Language | Graph Constraint |
---|---|---|---|
Relational database | SQL | SQL-Gremlin | pre-joined edges |
Triplestore | SPARQL | SPARQL-Gremlin | no properties |
Document database | Query Documents | MongoDB-Gremlin | disconnected trees |
The Gremlin traversal machine is a graph-based virtual machine whose instruction set is a collection of functions called steps. A program, called a traversal, is a linear/nested composition of said steps. A traversal's language agnostic representation is called Gremlin bytecode. The fact that any database query language can compile to Gremlin bytecode and that Gremlin's memory system is a property graph composed of vertices, edges, and properties means that these database query languages are querying their respective data structure embedded within the graph. For instance, the existence of SQL-Gremlin (by Ted Wilmes) means that the concept of tables, rows, and joins are not only being upheld by the Gremlin traversal machine (processor), but also by the graph database (memory). Moreover, the existence of SPARQL-Gremlin (by Daniel Kuppitz and Harsh Thakkar) means that the concept of subjects, predicates, objects and patterns are not only being upheld by the Gremlin traversal machine (algorithm), but also by the graph database (data structure). This post will present MongoDB-Gremlin which maps JSON documents to directed acyclic trees within any Apache TinkerPop™-enabled graph database and enables the user to query such graph-encoded documents using MongoDB's JSON-based CRUD language.
A General-Purpose Distributed Computer |
---|
The graph is analogous to the memory structure of modern computers. Vertices (data) can be arbitrarily connected via edges (pointers). Moreover, the Gremlin traversal machine is analogous to the central processing unit. It is able to read from the graph (V(), out(), values()), write to the graph (e.g. addV(), addE(), property(), drop()), and perform looping (repeat()) and branching (choose()) operations. The Gremlin traversal machine's operations (steps) form the foundation of Gremlin's assembly language. From there, query languages can be created to enable the user to work with the graph from a particularly point of view (e.g. as tables, triples, documents, or domain-specific custom structures). Together, an arbitrarily connectable graph and a Turing Complete graph-oriented traversal machine yields the general-purpose database. |
A Graph-Based Document Database
A document, at its core, is a tree. Some document databases represent their documents as JSON objects. While JSON objects are generally trees, they have a few other complicating characteristics such as a distinction between arrays, objects (tree nodes), and primitives (tree leaves). It is possible to encode JSON objects in a graph database because a tree is a topologically constrained directed, binary graph. In particular, a tree is a directed acyclic graph. To demonstrate the document to graph embedding, assume the following JSON object describing a person, their hobbies, and their spoken languages.
{
"name" : "Gremlin",
"hobbies" : ["traversing", "reflecting"],
"birthyear" : 2009,
"alive" : true,
"languages" : [
{
"name" : "Gremlin-Java",
"language" : "Java8"
},
{
"name" : "Gremlin-Python",
"language" : "Python"
},
{
"name" : "Ogre",
"language" : "Clojure"
}
]
}
|
In the popular document database MongoDB®, a user would store the above object in a people-collection. Along with the above object, assume the people-collection also contains other similarly schema'd objects. The following queries, represented as JSON query objects, could then be submitted and answered in a tuple-space manner.
// find people named Gremlin
db.people.find({"name":"Gremlin"})
// find people born after the year 2000
db.people.find({"birthyear":{"$gt":2000}})
// find alive people who like to traverse and/or halt
db.people.find({"alive":true,"hobbies":{"$in":["traversing","halting"]})
// find people that speak a language in Java8
db.people.find({"languages.language":"Java8"})
|
Apache TinkerPop graphs are composed of vertices, edges, and properties, where both vertices and edges can have any number of associated key/value-pairs. A document-to-graph embedding could map every JSON object (document) to a vertex. All primitive fields of the object would be properties on the vertex. If a field contained a primitive array value, then vertex multi-properties would be used (i.e. a single property containing multiple value). If a field referenced another object (sub-document), then the object would be another vertex with the respective field name being the label of the edge connecting the parent object to the child object. Given this unambiguous, one-to-one bijective mapping between a document and a graph, the previous MongoDB query documents can be written in Gremlin as below.
IMPORTANT: The above straightforward document-to-graph encoding assumes that arrays can not be nested and can only contain either all primitives or all objects (i.e. sub-documents).
// find people named Gremlin
g.V().hasLabel("person").has("name","Gremlin")
// find people born after the year 2000
g.V().hasLabel("person").has("birthyear", gt(2000))
// find alive people who like to traverse and/or halt
g.V().hasLabel("person").has("alive",true).has("hobbies",within("traversing","halting"))
// find people that speak a language in Java8
g.V().hasLabel("person").where(out("languages").has("language","Java8"))
|
The above traversals will only return the Gremlin vertex (the parent document). However, in MongoDB, what is returned is the entire document which includes the parent document and all nested, embedded sub-documents. Given that a JSON document is a directed acyclic graph, then the query traversal simply needs to be post-fixed with a recursive walk from the parent document vertex to all reachable properties (primitive fields) and vertices (sub-documents). This is accomplished using Gremlin's repeat()-step which directs traversers to walk down the graph-encoded tree until() they reach leaf vertices (i.e. vertices with no children). From there, the paths that the traversers took are gathered into a tree() displaying the properties on each vertex as well as the label of the edge used to traverser to them.
gremlin> g.V().hasLabel("person").has("name","Gremlin"). // construct query document
until(out().count().is(0)). // construct result document
repeat(outE().inV()).
tree().
by(valueMap()).
by(label())
==>
[
[
alive:[true],
birthyear:[2009],
hobbies:[traversing,reflecting],
name:[Gremlin]
]:
[
languages:[
[name:[Ogre],language:[Clojure]]:[],
[name:[Gremlin-Python],language:[Python]]:[],
[name:[Gremlin-Java],language:[Java8]]:[]
]
]
]
|
This section has specified how to:
- Encode a JSON document into a property graph (topological embedding).
- Translate MongoDB JSON query documents to Gremlin traversals (language compiler).
- Translate Gremlin traversal results to a JSON document (result formatter).
MongoDB-Gremlin | ||||
---|---|---|---|---|
As of August 2017, there exists a proof-of-concept, distinct Gremlin language named MongoDB-Gremlin. MonogoDB-Gremlin compiles both MongoDB query and insert JSON documents into a Gremlin Traversal. The traversal consumes a JSON String document and yields an iterator of JSONObjects. In between the input and output, the graph is traversed accordingly. In the code fragment below, the mongodb-gremlin library is loaded into the Gremlin Console. The db traversal source allows the user to interact with the underlying graph from a MongoDB-perspective. MongoDBTraversalSource provides two methods: insertOne and find(). Both methods yield a Traversal.
In the code fragment below, db.insertOne() is used to insert a JSON object. The result is the standard result object of MongoDB insert documents.
Once the (graph) database has been populated with JSON objects, it is possible to use MongoDB query documents to find JSON objects that match desired patterns.
It is important to note that because JSON objects are encoded as directed, acyclic graphs within the respective Apache TinkerPop-enabled graph system, it is possible to use Gremlin's GraphTraversal to interact with the respective "documents."
|
The Many Engines of the Gremlin Traversal Machine
It is useful that graphs can conveniently model any other data structure. It is useful that Gremlin is Turing Complete. It is even more useful that Gremlin is expressive enough to elegantly process graphs. It is for these reasons that Gremlin can simulate any query language with relative ease. However, the most interesting aspect of the Gremlin traversal machine is that Gremlin bytecode can be evaluated by different executions engines depending on the workload demanded by the traversal. For instance, the following traversals may be best served by an analytical or real-time engine.
g.V().hasLabel("person").groupCount().by("age") (OLAP)
g.V(1).out("knows").values("name") (OLTP)
The Gremlin machine allows graph system providers to select the type of execution engine to use to evaluate a traversal. Apache's TinkerPop3 provides two such engines: standard (OLTP) and computer (OLAP). The former leverages a streaming model which moves traversers through a functional pipeline. The latter uses parallel, distributed linear map-like scans of the vertex set with traverser message passing occurring in a reduce-like manner. Recent advances have been made in leveraging the actor model for message passing traversers in an OLTP-based, data locality aware, query routing manner. The reason that common distributed computing models are able to conveniently execute Gremlin is due to the inherently distributed nature of a Gremlin traversal. Gremlin bytecode describes a traversal as a nest-able chain of simple stateless functions that operate on immutable traverser objects. The encapsulation of the computational state into a traverser swarm enables the computation to be more easily distributed and localized as each traverser is its own processor referencing a particular vertex (or edge) in the graph. The traversers of a traversal act as distributed processing units over a distributed graph structure.
The engine-agnostic aspect of the Gremlin traversal machine makes it unique in the virtual machine computing arena -- especially in the Big Data processing space. It allows for the same Gremlin bytecode to execute on a single machine or across many machines while leveraging different memory access patterns accordingly. What this entails is that any query language can be exposed to the user as long as there exists a compiler that compiles that language to Gremlin bytecode. Together, Gremlin allows users to write any arbitrary algorithm for execution by the Gremlin traversal machine using any standard distributed computing framework across data stored in any graph database system.
Forever You, Forever Me Within The TinkerPop
It was too late for Gremlin to turn back. He had reached the event horizon of realization. From the perspective of his mechanical friends watching him float listlessly in the rocky pond on that righteously wonderful day, Gremlin appeared to be frozen in a state of eternal bliss. Never again would Gremlin be tossed about vertices and edges by life's traversals winding, bending, and knotting him through and around the ever-changing graph topology. Instead now, with every breath he breathed, the cosmos danced feverishly before him as shimmering shadows of his fundamental instruction set within the Gremlin traversal machine. Gremlin understood that he was the means, the reason, the source of every traversal that flickered upon the graphical membrane of reality. However, even within this most extreme of contemplative states, Gremlin was still coming to terms with every detail, nuance, variation, theme, purpose, hope, and dream of something he could not yet put his mind's eye upon. Infinitely far away, yet speeding infinitely fast towards that which he always was, is, and will be for all time (process) and for all things (structure) -- The TinkerPop.
Credit: Sebastian Good offered this blog post's title during a Twitter conversation.