Technology | October 25, 2017

Gremlin DSLs in Python with DSE Graph

Stephen Mallette

Gremlin-based DSLs provide a powerful way to develop graph applications and to perform graph analysis. This post describes a series of guidelines to consider when developing DSLs and uses Gremlin Python and the KillrVideo dataset to demonstrate these techniques.

While Frames seems to hold a relatively enigmatic existence in the land of Apache TinkerPop™, his relationship to Gremlin is no less important than Gremlin's relationship to his other machine friends. The role of Frames is one of context. Frames provides Gremlin with the ability to interpret the graph in the context of the domain of that graph. Frames is like a viewer that helps Gremlin see vertices and edges as the real-world objects they represent and to therefore be fluent in the language of that space. Frames represents the mediation between the graph and its domain, which can be commonly managed through the implementation of Gremlin-based Domain Specific Languages (DSLs).

Gremlin is a DSL for querying a graph, and developers can choose to extend the Gremlin graph language with new custom steps to better fit the language of their domain, thus producing their own higher-level DSLs that abstract away the language of vertices and edges. The benefits of Gremlin-based DSLs, along with implementation patterns in Java, were detailed in a previous blog post entitled "Gremlin DSLs in Java with DSE Graph". This post focuses on understanding the design choices involved in writing the DSL itself, using DSL implementation patterns in the Python Gremlin Language Variant (GLV).

While the examples presented here are in Python, the concepts are not restricted to Python and are generally applicable across all programming languages in which Gremlin is available. The source code repository for this DSL blog post series also contains the Java version of the Python code presented here.

[Figure: "Giga Gremlin Python vs. Mega Frames"]

As with the earlier blog post, this one will use the KillrVideo dataset from the DataStax Academy 330 course. As a first step to designing a DSL, it is important that the developer understand their schema, as it provides the encoding of their domain into a property graph of vertices and edges. The KillrVideo schema is fairly succinct and can be examined most easily through DSE Studio. The schema viewer provides a clear, visual representation of how the KillrVideo graph is modeled. Please spend a few moments getting familiar with the schema before proceeding further.

Anatomy of the Traversal API

Irrespective of the host programming language (e.g., Java, Python, etc.), Gremlin is defined by the Graph Traversal API, which specifies the steps that make up the language. The Graph Traversal API essentially consists of four main components, which experienced Gremlin users utilize constantly, perhaps without being directly aware of them. These four components represent the building blocks for the higher-level languages that can be built on top of Gremlin. To identify these components, consider the following diagram:

  1. GraphTraversalSource - Even those just starting with Gremlin will immediately recognize the "g" variable as something significant to the language. The "g" represents a GraphTraversalSource instance which holds important configurations for starting traversals. As the entry point to a traversal, it also contains the start steps that will begin Gremlin's walk through the graph. The most common entry point for Gremlin is of course the V() start step, which signifies that the traversal will begin with the vertices of the graph.
  2. GraphTraversal - In the example above, the start step of g.V() returns a GraphTraversal instance. GraphTraversal methods are the Gremlin steps that make up the language. The steps themselves return a GraphTraversal so that the steps may be chained together in a fluent fashion. Therefore the has() step will return a GraphTraversal, such that outE() can be called, which itself will return a GraphTraversal, and so on.
  3. Anonymous Traversal - A traversal that is "anonymous" is one that is not bound to a GraphTraversalSource instance directly and is typically used as a sub-traversal within a parent traversal. As a convention, anonymous traversals use the double underscore, as in __, for their class name (as different languages will allow) and are usually exposed as standalone functions (e.g. statically importing in Java) to improve Gremlin readability.
  4. Expression - Outside of steps, Gremlin has a number of "expressions" that are used in conjunction with the aforementioned components. These "expressions" are not specific classes or interfaces as the previously described components are, but instead refer to string tokens, enums or shortcuts to Predicate values. Like anonymous traversals, they are typically exposed in a way that does not require the class or enum name to be referenced.

Together, these components of the Traversal API become the foundational elements of any Gremlin-based DSL. Designing the DSL within this framework will ensure that it maintains the form of Gremlin itself.
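To make the four components concrete, here is a toy model in plain Python. It is not gremlin_python: the names ToySource, ToyTraversal and gt are hypothetical stand-ins, and the traversal only records step names rather than touching a graph.

```python
class ToyTraversal:
    """Stand-in for GraphTraversal: every step records itself and returns self."""

    def __init__(self, source=None):
        self.source = source   # None means "anonymous" (not bound to a source)
        self.steps = []

    def _add(self, name, *args):
        self.steps.append((name, args))
        return self            # returning self is what makes chaining fluent

    def V(self, *ids):
        return self._add("V", *ids)

    def has(self, key, value):
        return self._add("has", key, value)

    def outE(self, *labels):
        return self._add("outE", *labels)


class ToySource:
    """Stand-in for GraphTraversalSource: the 'g' that spawns traversals."""

    def V(self, *ids):
        return ToyTraversal(source=self).V(*ids)


def gt(value):
    """Stand-in for an expression, such as a shortcut to a predicate."""
    return ("gt", value)


g = ToySource()
bound = g.V().has("age", gt(30)).outE("knows")   # source -> traversal -> steps
anonymous = ToyTraversal().outE("knows")          # not bound to any source

print([name for name, _ in bound.steps])   # ['V', 'has', 'outE']
print(anonymous.source is None)            # True
```

A DSL built on the real library extends the same roles: new start steps go on the source, new steps go on the traversal, and both are mirrored for anonymous use.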

Start Steps

Thinking about the entry points to a DSL is a good start for the design process. Consider how users will begin their traversals. The ever-present g.V() provides a hint as to where to start looking for start steps, as most traversals will start with vertices. The g.V() is typically followed by one or more has() steps (or perhaps the V() takes a direct vertex identifier), which serve to filter those vertices to a single vertex or an otherwise smaller set of vertices. In many cases, these initial lookup steps will map quite closely to the vertex labels in the schema, and the challenge will lie in determining the most flexible and robust way for users of the DSL to get their initial list of vertices to traverse from.

Recall that entry points to the DSL are really just the start steps (i.e. methods) on a GraphTraversalSource. Therefore, when considering the KillrVideo schema and the four available vertex labels, some obvious start steps might be:

killr.movies()  # equivalent to g.V().hasLabel('movie')
killr.users()   # equivalent to g.V().hasLabel('user')
killr.persons() # equivalent to g.V().hasLabel('person')
killr.genres()  # equivalent to g.V().hasLabel('genre')

Now consider the usability of these steps. Like g.V(), they should take arguments to further filter their values, preferably binding each lookup to an index. For purposes of this imaginary use case, the assumption will be that users will typically start traversals from users and movies, therefore the start steps will look like this:

killr.movies('Young Guns') # equivalent to g.V().has('movie', 'title', 'Young Guns')
killr.users('u460')        # equivalent to g.V().has('user', 'userId', 'u460')

The flexibility of these two start steps might be improved even further if they allowed for multiple parameters, so that DSL users can look up several movies or users as part of the start step:

killr.movies('Young Guns', 'Hot Shots!') # equivalent to g.V().has('movie', 'title', within('Young Guns', 'Hot Shots!'))
killr.users('u460', 'u461', 'u462')      # equivalent to g.V().has('user', 'userId', within('u460', 'u461', 'u462'))

These two start steps can be implemented in Python by creating a KillrVideoTraversalSource class that extends GraphTraversalSource as follows:

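from gremlin_python.process.graph_traversal import GraphTraversal, GraphTraversalSource
from gremlin_python.process.traversal import P

class KillrVideoTraversalSource(GraphTraversalSource):

   def __init__(self, *args, **kwargs):
      super(KillrVideoTraversalSource, self).__init__(*args, **kwargs)
      self.graph_traversal = KillrVideoTraversal # tells the "source" the type of Traversal to spawn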

   def movies(self, *args):
      traversal = self.get_graph_traversal().V().hasLabel("movie")

      if len(args) == 1: traversal = traversal.has("title", args[0])
      elif len(args) > 1: traversal = traversal.has("title", P.within(args))

      return traversal

   def users(self, *args):
      traversal = self.get_graph_traversal().V().hasLabel("user")

      if len(args) == 1:
         traversal = traversal.has("userId", args[0])
      elif len(args) > 1:
         traversal = traversal.has("userId", P.within(args))

      return traversal
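
The branching on the number of arguments is the core of both start steps. As a library-free sketch, the hypothetical helper lookup_filter below returns a plain tuple describing the filter that would be applied. It does not build a real traversal and is not part of the KillrVideo DSL.

```python
def lookup_filter(key, *values):
    """Describe the filter a start step would apply for the given values.

    Hypothetical helper for illustration only: returns None (label lookup
    only), an equality filter, or a 'within' filter.
    """
    if len(values) == 0:
        return None                                   # e.g. killr.movies()
    if len(values) == 1:
        return ("has", key, values[0])                # e.g. killr.movies('Young Guns')
    return ("has", key, ("within", list(values)))     # several values at once


print(lookup_filter("title"))                    # None
print(lookup_filter("title", "Young Guns"))      # ('has', 'title', 'Young Guns')
print(lookup_filter("userId", "u460", "u461"))   # ('has', 'userId', ('within', ['u460', 'u461']))
```

Keeping the zero-argument case valid means the same step serves both as a label lookup and as an indexed lookup.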

Now that there are methods to spawn a DSL version of GraphTraversal (discussed in the next section), the next task will be to consider the DSL steps that it will support.

Traversal Steps

Traversal steps represent the continuation of the traversal from the initial start step. Each step transforms the data provided to it from the previous step. As with start steps, there are no specific rules as to how traversal steps should be designed, but three guidelines to consider would be:

  • Think small and atomic - Design steps that simplify and optimize the writing of obvious traversal paths.
  • Hide complexity - Design steps that hide complexity of many Gremlin steps behind a single DSL step and validate input to that step to prevent misuse.
  • Maximize flexibility - Design steps that make heavy use of parameters to suit a variety of different execution options.

All of the above guidelines are meant to be taken in concert with one another, but for purposes of discussion they will each be explored individually in the following sub-sections.

Think Small and Atomic

It is often good to think of DSL steps as building blocks to other more complex steps. In fact, if there is already a decision to build a Gremlin-based DSL, then that thought process is already in place, as Gremlin's graph steps are being used as building blocks to the DSL's more complex steps. Within the DSL, some of the most basic building blocks reside in abstracting away the graph language itself and encoding the low-level structure of the graph into the domain language. To be more specific, look to the edge labels and consider the common traversal paths that a user would follow and turn those into steps.

Using KillrVideo as an example, recall that it's easy to find movie vertices with killr.movies(). A common traversal path might be to find the actors within a movie or to get a list of ratings for a movie.

killr.movies('Young Guns').out('actor') # equivalent to g.V().has('movie','title','Young Guns').out('actor')
killr.movies('Young Guns').inE('rated') # equivalent to g.V().has('movie','title','Young Guns').inE('rated')

The out('actor') and inE('rated') graph language could be hidden by the DSL as follows:

killr.movies('Young Guns').actors()  # equivalent to g.V().has('movie','title','Young Guns').out('actor')
killr.movies('Young Guns').ratings() # equivalent to g.V().has('movie','title','Young Guns').inE('rated')

and would be implemented by extending GraphTraversal as follows:

class KillrVideoTraversal(GraphTraversal):
   def actors(self):
      return self.out("actor").hasLabel('person') # extra verification that a 'person' is obtained

   def ratings(self):
      return self.inE('rated')

In this case, hiding the graph language behind actors() and ratings() has provided several benefits:

  • The DSL user need only recall that movies have actors and ratings from an API perspective. They don't need to recall edge direction, edge labels or property names.
  • Additional validation of the traversal path can be put in place to reduce the possibility of error, as shown with the addition of hasLabel('person'), which would guard against a mistakenly added "actor" edge. Practically speaking, DSE Graph would prevent such a mistake at the time the bad "actor" edge was inserted; however, the general concept remains sound. The DSL developer presumably has knowledge of the graph schema and its data and would be aware of areas open to inconsistency. Protecting against such problems by encoding that knowledge into the DSL itself is a smart tactic.
  • These base steps of the DSL represent foundational steps that can be re-used within the DSL itself.

A Note on Documentation

Documentation of any code library is important, and DSLs are no different. Use standard code documentation patterns (e.g. pydoc, javadoc, etc.) to annotate steps and related expressions to explain the inputs and outputs expected. For example, the actors() step and the ratings() step both expect an incoming "movie" vertex for the step to be functional:

class KillrVideoTraversal(GraphTraversal):
   """The KillrVideo Traversal class which exposes the available steps of the DSL."""

   def actors(self):
      """Finds the actors in a movie by traversing from a "movie" to a "person" over the "actor" edge."""

      return self.out("actor").hasLabel('person') # extra verification that a 'person' is obtained

   def ratings(self):
      """Finds the ratings in a movie by traversing from a "movie" to a "rated" edge."""

      return self.inE('rated')

Note that for brevity, documentation will not be shown in the remainder of this blog post, but is present in the full source code of the actual KillrVideo Python DSL project.
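
One practical payoff of docstrings is that they remain available at runtime. The sketch below uses a hypothetical stand-in class, DocumentedSteps, instead of a real GraphTraversal subclass so that it runs without a graph library. It lists each step with the first line of its documentation using the standard inspect module.

```python
import inspect


class DocumentedSteps:
    """Hypothetical stand-in for a DSL traversal class (no graph library needed)."""

    def actors(self):
        """Finds the actors in a movie. Expects an incoming "movie" vertex."""
        raise NotImplementedError("illustration only")

    def ratings(self):
        """Finds the ratings of a movie. Expects an incoming "movie" vertex."""
        raise NotImplementedError("illustration only")


# Collect the first docstring line of every public step, sorted by name.
summary = {
    name: inspect.getdoc(member).splitlines()[0]
    for name, member in inspect.getmembers(DocumentedSteps, inspect.isfunction)
    if not name.startswith("_")
}

for name, doc in summary.items():
    print("{}(): {}".format(name, doc))
```

The same docstrings are what help() and most editors show to a user who is exploring the DSL interactively.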

Hide Complexity

Once a complex traversal algorithm is perfected, it would certainly become a DSL step candidate. The KillrVideo data set contains all the data required to build a basic movie recommendation algorithm. The following code shows an example of such an algorithm utilizing the users() start step of the KillrVideo DSL:

(killr.users('u460').
   outE('rated').has('rating', P.gt(7)).inV().
   aggregate("seen").
   local(outE('actor').sample(3).inV().fold()).
   unfold().in_('actor').where(P.without(["seen"])).
   groupCount().
   order(Scope.local).
      by(Column.values, Order.decr).
   limit(Scope.local, 5).
   select(Column.keys).
   unfold())

The DSL can be extended to encapsulate this complexity by extending it to allow for:

killr.users('u460').recommend()

which is implemented as follows:

class KillrVideoTraversal(GraphTraversal):
   def actors(self):
      return self.out("actor").hasLabel('person') # extra verification that a 'person' is obtained

   def ratings(self):
      return self.inE('rated')

   def recommend(self):
      return (self.outE('rated').has('rating', P.gt(7)).inV().
         aggregate("seen").
         local(outE('actor').sample(3).inV().fold()).
         unfold().in_('actor').where(P.without(["seen"])).
         groupCount().
         order(Scope.local).
            by(Column.values, Order.decr).
         limit(Scope.local, 5).
         select(Column.keys).
         unfold())
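
For readers less fluent in Gremlin, the logic that this traversal expresses can be approximated in plain Python over in-memory dictionaries. This is only a rough analogy: the data is made up for illustration (it is not KillrVideo data), the function is not part of the DSL, and the sample(3) step is skipped so that the result is deterministic. The default arguments mirror the hard-coded values 5 and 7.

```python
from collections import Counter

# Made-up data for illustration only (not from the KillrVideo dataset).
RATINGS = {"u1": {"A": 9, "B": 4}}   # user -> {movie: rating}
CAST = {                              # movie -> actors
    "A": ["p1", "p2"],
    "B": ["p3"],
    "C": ["p1", "p2"],
    "D": ["p1"],
    "E": ["p3"],
}


def recommend(user, recommendations=5, minimum_rating=7):
    # outE('rated').has('rating', gt(7)).inV().aggregate('seen')
    seen = [movie for movie, rating in RATINGS[user].items() if rating > minimum_rating]

    counts = Counter()
    for movie in seen:
        for actor in CAST[movie]:                        # actors of a liked movie (no sampling here)
            for other, cast in CAST.items():             # in_('actor'): movies with that actor
                if actor in cast and other not in seen:  # where(without('seen'))
                    counts[other] += 1                   # groupCount()

    # order by count descending, limit, and keep only the keys
    return [movie for movie, _ in counts.most_common(recommendations)]


print(recommend("u1"))   # ['C', 'D']
```

Movie C shares two actors with the liked movie A and D shares one, so C ranks first. Movie B was rated too low to seed anything, and E shares no actors with A.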

Maximize Flexibility

The recommend() step was a nice addition to the KillrVideo DSL, but it doesn't leave the user any options to modify its execution. Leaving recommend() as it is makes the step rigid and limits its general usefulness. Adding a few parameters to the step quickly removes that limitation:

class KillrVideoTraversal(GraphTraversal):
   def actors(self):
      return self.out("actor").hasLabel('person') # extra verification that a 'person' is obtained

   def ratings(self):
      return self.inE('rated')

   def recommend(self, recommendations, minimum_rating):

      if minimum_rating < 0 or minimum_rating > 10:
         raise ValueError('minimum_rating must be a value between 0 and 10')
      if recommendations <= 0:
         raise ValueError('recommendations must be greater than zero')

      return (self.outE('rated').has('rating', P.gt(minimum_rating)).inV().
            aggregate("seen").
            local(outE('actor').sample(3).inV().fold()).
            unfold().in_('actor').where(P.without(["seen"])).
            groupCount().
            order(Scope.local).
               by(Column.values, Order.decr).
            limit(Scope.local, recommendations).
            select(Column.keys).
            unfold())
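
The guard clauses run in Python when the step is called, so bad input is rejected before any traversal is constructed. The sketch below isolates that behavior in a hypothetical helper, validate_recommend_args, which is not part of the DSL. The is_valid wrapper exists only to make the outcome easy to print.

```python
def validate_recommend_args(recommendations, minimum_rating):
    """Hypothetical helper mirroring the guard clauses of recommend()."""
    if minimum_rating < 0 or minimum_rating > 10:
        raise ValueError('minimum_rating must be a value between 0 and 10')
    if recommendations <= 0:
        raise ValueError('recommendations must be greater than zero')


def is_valid(recommendations, minimum_rating):
    """Return True when the arguments pass validation, False otherwise."""
    try:
        validate_recommend_args(recommendations, minimum_rating)
        return True
    except ValueError:
        return False


print(is_valid(5, 7))    # True
print(is_valid(0, 7))    # False: at least one recommendation is required
print(is_valid(5, 11))   # False: ratings range from 0 to 10
```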

With the addition of the recommendations and minimum_rating parameters, the user of the DSL can now control the number of recommendations they get back as well as the minimum user rating required for the initial set of movies used as the basis for the recommendation. There are, however, other ways to tweak the execution of the algorithm in meaningful and useful ways. Just prior to the groupCount() step, Gremlin has collected the list of movies that will be grouped and ranked. It is the ideal place for a filter that will allow the user to further constrain the results of the recommendation. Movies might be filtered in a variety of ways. There could be a direct filter based on the properties of the "movie" vertex itself, such as production date or country of origin. There might be a more indirect filter like "genre", where a traversal is required to determine that value. Users may also combine the previously mentioned options into something more complex. If the DSL is to be open to all of these options, the easiest way to allow for that is to add a Traversal argument to the recommend method:

class KillrVideoTraversal(GraphTraversal):
   def actors(self):
      return self.out("actor").hasLabel('person') # extra verification that a 'person' is obtained

   def ratings(self):
      return self.inE('rated')

   def recommend(self, recommendations, minimum_rating, include):

      if minimum_rating < 0 or minimum_rating > 10:
         raise ValueError('minimum_rating must be a value between 0 and 10')
      if recommendations <= 0:
         raise ValueError('recommendations must be greater than zero')

      return (self.outE('rated').has('rating', P.gt(minimum_rating)).inV().
            aggregate("seen").
            local(outE('actor').sample(3).inV().fold()).
            unfold().in_('actor').
            where(P.without(["seen"])).
            where(include).
            groupCount().
            order(Scope.local).
               by(Column.values, Order.decr).
            limit(Scope.local, recommendations).
            select(Column.keys).
            unfold())

By adding the include argument, the user can introduce their own constraints to the algorithm without having to know much about how the algorithm itself works. They need only be aware of the fact that the include parameter is fed a "movie" vertex as its start:

killr.users('u460').recommend(5, 7, has('country','USA')) # recommend movies from 'USA'
 

killr.users('u460').recommend(5, 7, has('country','USA').has('duration',gt(90))) # recommend movies from 'USA' longer than 90 minutes

killr.users('u460').recommend(5, 7, out('belongsTo').has('name', 'Comedy')) # recommend movies that are comedies

It is worth considering "genre" vertices and the "belongsTo" edge as shown in the final traversal above. The recommend() step takes the include parameter to filter movies that could possibly be recommended. As the graph language is creeping into the DSL with out('belongsTo').has('name', 'Comedy'), and this would likely be considered a common traversal path, it makes sense to provide a step that wraps this up so that it is possible to do:

killr.users('u460').recommend(5, 7, genre('Comedy')) # recommend movies that are comedies

That basic step would look like this:

class KillrVideoTraversal(GraphTraversal):
   # ... omitting other DSL methods to focus on genre()

   def genre(self, *args):
      if len(args) < 1:
         raise ValueError('There must be at least one genre')

      if len(args) == 1:
         return self.out('belongsTo').has('name', args[0])
      elif len(args) > 1:
         return self.out('belongsTo').has('name', P.within(args))

While that code is the proper implementation of the genre() step, it does not make it possible to call that step as part of an anonymous traversal, as is shown in the previous syntax. There is a second step to making that happen. Following Gremlin graph language patterns, all steps added to the KillrVideoTraversal should also be added to an extension of __ as "static" methods. Given the steps that have been added thus far to KillrVideoTraversal, the __ class should look like this:

from gremlin_python.process.graph_traversal import __ as AnonymousTraversal
from gremlin_python.process.traversal import Bytecode

class __(AnonymousTraversal):
   graph_traversal = KillrVideoTraversal

   @classmethod
   def actors(cls):
      return cls.graph_traversal(None, None, Bytecode()).actors()

   @classmethod
   def ratings(cls):
      return cls.graph_traversal(None, None, Bytecode()).ratings()

   @classmethod
   def genre(cls, *args):
      return cls.graph_traversal(None, None, Bytecode()).genre(*args)

   @classmethod
   def recommend(cls, *args):
      return cls.graph_traversal(None, None, Bytecode()).recommend(*args)

It is perhaps up for debate as to whether or not every DSL step needs to be exposed as an anonymous traversal. With the Gremlin graph language, the steps are largely general purpose enough for that replication to make sense. If there is a step that is clearly not useful in an anonymous fashion, it might do users a favor to remove it from the possible list of steps they may employ. On the other hand, consistency of the Gremlin language patterns calls for all steps to be available on both the GraphTraversal and __ extensions. That consistency may create the expectation for them to be present in the DSL in the same way.

Expressions

Recall that "expressions" are String tokens, enums or other functions that integrate into the DSL to make it easier to read and generally more convenient. The most basic expressions that should be made available in a project (even without DSLs) are global variables that contain the vertex labels, the edge labels, and the property keys. For Python, a simple file like this will suffice:

VERTEX_MOVIE = 'movie'
VERTEX_PERSON = 'person'
VERTEX_USER = 'user'
VERTEX_GENRE = 'genre'

EDGE_ACTOR = 'actor'
EDGE_RATED = 'rated'
EDGE_BELONGS_TO = "belongsTo"
EDGE_KNOWS = "knows"

KEY_AGE = 'age'
# ... omitted for brevity
KEY_YEAR = 'year'

These global variables (i.e. String tokens) can be used both within the DSL and within projects using the DSL and greatly help in code maintainability (i.e. changing a property key name only has to occur in one place as opposed to doing a global find and replace). The following code shows the DSL appropriately updated to use these tokens:

class KillrVideoTraversal(GraphTraversal):
   def actors(self):
      return self.out(EDGE_ACTOR).hasLabel(VERTEX_PERSON)

   def ratings(self):
      return self.inE(EDGE_RATED)

   def recommend(self, recommendations, minimum_rating, include):

      if minimum_rating < 0 or minimum_rating > 10:
         raise ValueError('minimum_rating must be a value between 0 and 10')
      if recommendations <= 0:
         raise ValueError('recommendations must be greater than zero')

      return (self.outE(EDGE_RATED).has(KEY_RATING, P.gt(minimum_rating)).inV().
            aggregate("seen").
            local(outE(EDGE_ACTOR).sample(3).inV().fold()).
            unfold().in_(EDGE_ACTOR).
            where(P.without(["seen"])).
            where(include).
            groupCount().
            order(Scope.local).
               by(Column.values, Order.decr).
            limit(Scope.local, recommendations).
            select(Column.keys).
            unfold())

   def genre(self, *args):
      if len(args) < 1:
         raise ValueError('There must be at least one genre')

      if len(args) == 1:
         return self.out(EDGE_BELONGS_TO).has(KEY_NAME, args[0])
      elif len(args) > 1:
         return self.out(EDGE_BELONGS_TO).has(KEY_NAME, P.within(args))

To further examine the value of expressions, think back to the state of the recommend() step. An important part of maximizing flexibility for this algorithm is to give the user more control over its speed. Graph traversals can often trade off some quality of results for an improvement in the speed of retrieving those results. For the recommend() algorithm, much of that balance is handled by local(outE(EDGE_ACTOR).sample(3).inV().fold()), which uses the sample() step to control the number of actors from a rated movie that feed into the basis of the recommendation. Having that behavior hard-coded into the algorithm doesn't allow the user a choice as to how that sampling should occur. A further change to the recommend() function can rectify that:

class KillrVideoTraversal(GraphTraversal):

   # ... omitted other DSL steps to focus on recommend() step

   def recommend(self, recommendations, minimum_rating, include, recommender):

      if minimum_rating < 0 or minimum_rating > 10:
         raise ValueError('minimum_rating must be a value between 0 and 10')
      if recommendations <= 0:
         raise ValueError('recommendations must be greater than zero')

      return (self.outE(EDGE_RATED).has(KEY_RATING, P.gt(minimum_rating)).inV().
            aggregate("seen").
            local(recommender).
            unfold().in_(EDGE_ACTOR).
            where(P.without(["seen"])).
            where(include).
            groupCount().
            order(Scope.local).
               by(Column.values, Order.decr).
            limit(Scope.local, recommendations).
            select(Column.keys).
            unfold())

Of course, now the user needs to know how to write an appropriate recommender, which is largely core to the proper workings of the recommendation algorithm itself. With the include argument, it was reasonable to leave the writing of the filter open, as most users of Gremlin would find basic has() usage to be well within their comfort zone. For the recommender argument, it is best to close off the available possibilities so that users aren't frustrated with providing this input and are prevented from subverting what would likely be the tested and accepted traversal algorithm, thus getting "bad" recommendations. If there is a known set of possible recommender options that work, provide them as some form of static enum. In this way, the user can choose the option that best fits their situation. In Python, it could be implemented as follows:

from aenum import Enum

class Recommender(Enum):

   SMALL_SAMPLE = 1
   LARGE_SAMPLE = 2

   @property
   def traversal(self):

      switcher = {
         Recommender.SMALL_SAMPLE: outE(EDGE_ACTOR).sample(3).inV().fold(),
         Recommender.LARGE_SAMPLE: outE(EDGE_ACTOR).sample(10).inV().fold()
      }

      return switcher.get(self)

class KillrVideoTraversal(GraphTraversal):

   # ... omitted other DSL steps to focus on recommend() step

   def recommend(self, recommendations, minimum_rating, include, recommender):

      if minimum_rating < 0 or minimum_rating > 10:
         raise ValueError('minimum_rating must be a value between 0 and 10')
      if recommendations <= 0:
         raise ValueError('recommendations must be greater than zero')
      if not isinstance(recommender, Recommender):
         raise ValueError('recommender argument must be of type Recommender')

      return (self.outE(EDGE_RATED).has(KEY_RATING, P.gt(minimum_rating)).inV().
            aggregate("seen").
            local(recommender.traversal).   # traversal is a property, so it is read without parentheses
            unfold().in_(EDGE_ACTOR).
            where(P.without(["seen"])).
            where(include).
            groupCount().
            order(Scope.local).
               by(Column.values, Order.decr).
            limit(Scope.local, recommendations).
            select(Column.keys).
            unfold())

Now the DSL can be used as follows:

killr.users('u460').recommend(5, 7, has('country','USA'), Recommender.SMALL_SAMPLE) # recommend movies from 'USA' with a small sample
killr.users('u460').recommend(5, 7, has('country','USA'), Recommender.LARGE_SAMPLE) # recommend movies from 'USA' with a large sample

The two options for the Recommender enum are a bit contrived and are mostly for the purpose of demonstrating how to introduce static traversal options to a DSL. Obviously those traversals could have been added as steps (like actors(), for example), but doing so would have polluted the available "steps" with something that is not really re-usable in any context other than the recommend() step. Pulling it aside into an enum or some similar class gives it a clear purpose and can enforce type safety when a certain type of input for a specific step is required.
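
The pattern of an enum that carries its own behavior can be tried without a graph. The sketch below is a simplified analogy using the standard library enum module rather than aenum: the hypothetical SampleSize enum maps each member to a plain number instead of a traversal, and describe() stands in for a step that accepts only that type.

```python
from enum import Enum


class SampleSize(Enum):
    """Hypothetical closed set of options; each member maps to a sample size."""

    SMALL = 1
    LARGE = 2

    @property
    def actors_per_movie(self):
        # A property is read without parentheses: SampleSize.SMALL.actors_per_movie
        return {SampleSize.SMALL: 3, SampleSize.LARGE: 10}[self]


def describe(option):
    """Reject anything that is not one of the predefined options."""
    if not isinstance(option, SampleSize):
        raise ValueError('option must be of type SampleSize')
    return "sample({})".format(option.actors_per_movie)


print(describe(SampleSize.SMALL))   # sample(3)
print(describe(SampleSize.LARGE))   # sample(10)
```

Passing a raw value such as describe(3) raises ValueError, which is the type safety the enum is meant to provide.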

As a last adjustment to the recommend() step, providing some sensible defaults can greatly improve readability and usability:

class KillrVideoTraversal(GraphTraversal):

   # ... omitted other DSL steps to focus on recommend() step

   def recommend(self, recommendations, minimum_rating, include=AnonymousTraversal.__(),
                 recommender=Recommender.SMALL_SAMPLE):

      if minimum_rating < 0 or minimum_rating > 10:
         raise ValueError('minimum_rating must be a value between 0 and 10')
      if recommendations <= 0:
         raise ValueError('recommendations must be greater than zero')
      if not isinstance(recommender, Recommender):
         raise ValueError('recommender argument must be of type Recommender')

      return (self.outE(EDGE_RATED).has(KEY_RATING, P.gt(minimum_rating)).inV().
            aggregate("seen").
            local(recommender.traversal).   # traversal is a property, so it is read without parentheses
            unfold().in_(EDGE_ACTOR).
            where(P.without(["seen"])).
            where(include).
            groupCount().
            order(Scope.local).
               by(Column.values, Order.decr).
            limit(Scope.local, recommendations).
            select(Column.keys).
            unfold())

Now the step can be called in a variety of ways that will work nicely without the user needing full knowledge of the more advanced parameters:

killr.users('u460').recommend(5, 7)
killr.users('u460').recommend(5, 7, has('country','USA'))
killr.users('u460').recommend(5, 7, has('country','USA'), Recommender.LARGE_SAMPLE)

As a final note on expressions, reconsider the genre() step. It takes one or more String values that should map to the "name" property on a "genre" vertex. The KillrVideo dataset contains just over a dozen genres and should not change often. Turning genre into a first class expression of the DSL would be quite helpful to users, as it reduces the chance of mistyping one of the genre names, prevents misuse of the genre() step by enforcing a specific type as the argument, and improves overall code maintainability (for the same reasons as the global String tokens). For Python, usage of an enum makes sense:

from aenum import Enum

class Genre(Enum):
   ACTION = "Action"
   # ... omitted all genres in KillrVideo for brevity
   THRILLER = "Thriller"
   WAR = "War"
   WESTERN = "Western"

class KillrVideoTraversal(GraphTraversal):

   # ... omitting other DSL methods to focus on genre()

   def genre(self, *args):
      if len(args) < 1:
         raise ValueError('There must be at least one genre')

      if not all(isinstance(genre, Genre) for genre in args):
         raise ValueError('The arguments to genre() step must all be of type Genre')

      if len(args) == 1:
         return self.out(EDGE_BELONGS_TO).has(KEY_NAME, args[0].value)
      elif len(args) > 1:
         genres = [genre.value for genre in args]
         return self.out(EDGE_BELONGS_TO).has(KEY_NAME, P.within(genres))

These changes allow for genre() to now be used as:

killr.users('u460').recommend(5, 7, genre(Genre.COMEDY, Genre.ANIMATION)) # recommend movies that are comedies or animations
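
The argument handling of genre() can be checked on its own. In the sketch below, genre_names is a hypothetical helper (not part of the DSL) that applies the same checks and returns the stored names. It uses the standard library enum module rather than aenum and defines only a few members for illustration.

```python
from enum import Enum


class Genre(Enum):
    # Only a few members, for illustration.
    ACTION = "Action"
    COMEDY = "Comedy"
    WESTERN = "Western"


def genre_names(*genres):
    """Hypothetical helper: validate the arguments and return the stored names."""
    if len(genres) < 1:
        raise ValueError('There must be at least one genre')
    if not all(isinstance(g, Genre) for g in genres):
        raise ValueError('The arguments must all be of type Genre')
    return [g.value for g in genres]


print(genre_names(Genre.COMEDY, Genre.WESTERN))   # ['Comedy', 'Western']

try:
    genre_names("Comedy")   # a raw string is rejected, even if the spelling is right
except ValueError as error:
    print(error)            # The arguments must all be of type Genre
```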

Conclusion

From the outset of this blog post, the killr variable has been used to represent the KillrVideo DSL's GraphTraversalSource instance. Now that the DSL is written, it is possible to initialize that variable to make use of the DSL in connection with DSE Graph over the DataStax Python Driver:

from killrvideo_dsl.dsl import KillrVideoTraversalSource
from gremlin_python.structure.graph import Graph
from dse.cluster import Cluster
from dse_graph import DSESessionRemoteGraphConnection

# connect to DSE Graph with the DataStax driver
c = Cluster()
session = c.connect()

# initialize the 'killr' with the DSL
killr = Graph().traversal(KillrVideoTraversalSource).withRemote(DSESessionRemoteGraphConnection(session, "killrvideo"))

The usefulness of the DSL is directly tied to how its steps are designed as they will largely determine how much graph language the user must rely on to accomplish their tasks. Finding that correct balance is dependent on the domain, but also on the expected skills of users. Users with greater Gremlin expertise will not mind falling back to the graph language to write their traversals, which means that the DSL can be developed with more finely grained steps and more open parameters. On the other hand, those with less Gremlin expertise will depend more heavily on the DSL to abstract the graph language away from their usage, which means that the DSL should focus on more coarsely grained steps with more restrictive parameters.

There really aren't specific rules for building DSLs. This blog post merely offers guidelines to consider when building one and it is up to the DSL designer to apply them as they see fit. Application of these guidelines in virtually any capacity should help lead to all the benefits that DSLs promise.
