TechnologyMarch 6, 2018

Gremlin DSLs In .NET With DSE Graph

Gremlin DSLs In .NET With DSE Graph

Gremlin-based Domain Specific Languages (DSLs) provide a powerful way to develop graph applications and to perform graph analysis. This post continues the "Gremlin DSL" series by discussing a pattern for forming Gremlin results and demonstrating that pattern's implementation with Gremlin .NET.

The techniques and patterns for developing Domain Specific Languages (DSLs) with Gremlin and the related benefits to be had from such an effort, have been previously explored in two separate blog posts covering two different Gremlin Language Variants (GLVs). The first blog post titled, "Gremlin DSLs in Java with DSE Graph" presented the foundations for building Gremlin-based DSLs, while the second blog post, titled "Gremlin DSLs in Python with DSE Graph", discussed the anatomy of DSLs and various guidelines helpful to their development. This series on Gremlin DSLs continues with this topic by presenting a pattern that developers should find generally useful. The mechanics of this pattern will be demonstrated using the C# GLV for .NET.

As a quick reminder, Gremlin is the graph traversal language of Apache TinkerPop™ and since DataStax Enterprise Graph (DSE Graph) is TinkerPop-enabled, Gremlin is the query language of DSE Graph. In one of the aforementioned and related blog posts, Gremlin is described as "a DSL for querying a graph [which] developers can choose to extend . . . with new custom steps to better fit the language of their domain, thus producing their own higher level DSLs that abstract away the language of the vertices and edges." In addition, Gremlin DSLs help to create easily maintainable and highly readable graph query code. As with previous posts in this series, this blog post will use the KillrVideo dataset from the DataStax Academy 330 course as an example.

If unfamiliar with the previously mentioned blog posts in this series, it is recommended that those posts be read prior to this one as the concepts and examples build upon one another. Also note that while the examples presented here are in C#, the concepts are not restricted to C# and are generally applicable across all programming languages that Gremlin is available in. The source code repository for this DSL blog post series also contains the Java and Python versions of the C# code presented here.

Enrichment

"Gremlin Riches" - based on: Generation Grundeinkommen

An Enrichment Pattern

Gremlin traversal results typically require "shaping" so that they fit some need of an application or some analytical view. In the most simple terms, shaping might mean limiting the property keys returned from a vertex or edge. For example, a common traversal such as:

g.V().Has("movie","title","Young Guns"). ValueMap(true)

would be much better written as:

g.V().Has("movie","title","Young Guns"). ValueMap(true, "title", "year")

Young Guns Laserdisc Coverwhere rather than return all the properties on the vertex, the Gremlin specifies the explicit properties to retrieve. If all the property values are required then all the keys should be included. The logic for taking the extra time to write all the properties is quite similar to the reasoning for typically avoiding SQL statements such as SELECT * FROM table, which return all columns in a table. In graph terms, imagine what might happen if it was one day decided to add a new multi-property to the "movie" vertex which had the potential to hold thousands of values. Suddenly, a traversal that didn't specify the required properties would instantly become much more expensive and impact any parts of an application that relied on it.

In more complex data shaping cases, common traversal activities might mean performing some form of collection manipulation, data aggregation, or other similar operation. To think about what a complex case might mean, continue with the previous example which returns a Dictionary that looks like this:

{ "year": [ 1988 ], "label": "movie", "id": { "member_id": 431, "community_id": 148705792, "~label": "movie" }, "title": [ "Young Guns" ] }

While this result has all the data that was requested, the format is a bit unweildy, as ValueMap() step wraps property values, such as "title" and "age", in lists given the expectation of potential multi-properties. This result would be much easier to consume if that format was simplified to remove the unnecessary list instances (i.e. unwrap the value from the list) and just have a simple key and value pair. Gremlin can manipulate that Dictionary into the desired format by doing:

g.V().Has("movie", "title", "Young Guns"). Map(ValueMap(true,"title","year"). Unfold(). Group(). By(Keys). By(Choose(Select(Column.Keys).Is(T.Id), Select(Column.Values), Select(Column.Values).Unfold())))

Gremlin New Sheriff in TownIn the above code, the vertex that makes it past the Has() filter will be locally transformed with ValueMap(). Each of those returned Dictionary instances will be unfolded to the local traversal stream as key/value pairs, then regrouped on those keys. The critical part to this transformation lies in the final By() modulator which checks if the key is a T.Id instance and if so, simply passes the value through unchanged. That step is necessary as the DSE Graph identifier is a complex object which should not undergo the same processing as the property values which are wrapped in a list. The alternative (i.e. the false side of the Choose() step) demonstrates how the value is unwrapped from the list. Basically, all property values and the T.Label values are unfolded to the local stream with the first item extracted from that stream to be used as the value in the final output. It yields the following result:

{ "year": 1988, "label": "movie", "id": { "member_id": 431, "community_id": 148705792, "~label": "movie" }, "title": "Young Guns" }

Without those extra single value lists, the result is much easier to consume. Of course, now that the required property keys are being specified up front, ValueMap() may not be the right step to use anymore, especially if flattening property values is desired. A better step to use in this case is Project(). All the transformation steps that flatten and reconstruct Dictionary objects can be removed in favor of:

g.V().Has("movie", "title", "Young Guns"). Project("title","year","id","label"). By("title"). By("year"). By(T.Id). By(T.Label))

The use of Project() drastically reduces the Gremlin complexity and eliminates a significant amount of Dictionary construction and destruction, while at the same time produces the exact output as before. Unfortunately, it's still a bit verbose with the redundant property key specifications in both the Project() step and the associated By() modulators. It is also fairly easy to see just how much larger such a statement would get if there were many more properties to be returned from the vertex. This complexity might further increase should a need arise where this output requires additional information appended to it. For example, perhaps the requirement is to append the degree distribution to that Dictionary as one of the key/value pairs. That traversal could be written as follows:

g.V().Has("movie", "title", "Young Guns").     Project("title","year","id","label","degree").
 By("title").
 By("year").
 By(T.Id).
 By(T.Label).
 By(BothE().Count())

The output for the above traversal is:

{
 "year": 1988,
 "degree": 56,
 "label": "movie",
 "id": {
   "member_id": 431,
   "community_id": 148705792,
   "~label": "movie"
},
 "title": "Young Guns"
}

The Project() step does a good job of managing the complexity of new fields in the outputted Dictionary, but "degree" makes for a generally useful traversal that would have to be written over and over again when needed. It would be nice to make all of this Gremlin much more reusable and concise, which is where DSLs can help.

The DSL Solution

Gremlin KillrVideo

In .NET, Gremlin DSLs are implemented via Extension Methods. Simply create extension methods as needed for the GraphTraversalSource and GraphTraversal. Note that for anonymous traversals, extension methods won't work because extension methods cannot be added to static classes and the __ class is defined that way. It is recommended to simply create a class prefixed with the double underscore (e.g. __KillrVideo) and then spawn traversals from that with static methods. More details about building DSLs in .NET can be found in the Apache TinkerPop Reference Documentation.

There are two basic goals in implementing this pattern as part of a new DSL step which will be called Enrich(). The first goal is to make it easier to specify the property keys to extract from a Vertex thereby avoiding some redundancy for the user. The second goal is to isolate and hide the nature of By() modulators with expressions to make Gremlin easier to read and more concise in terms of what the expectations are for the API. Consider the Recommender enum (see code here), which provided predetermined "sampling engines" that were designed to allow the user to introduce greater control into how the recommendation algorithm would work. The basic benefit was that the user didn't have to know how to write a "sampling engine" traversal and could simply focus on how to use the engines in the DSL. In addition, the recommendation step of the DSL only accepted a Recommender enum as an argument (see code here), which meant that the user knew exactly what was expected from the API and there were known working arguments that would provide a tested result. A similar, but slightly different approach will be described now to help support the new Enrich() step.

An enum doesn't work quite well for the Enrich() step, because functionally, the user needed to be able to potentially provide parameters to Enrichment objects on creation. This need is evident when considering how ValueMap() is called in the earlier examples (i.e. with keys). Therefore, instead of an enum, Enrichment is implemented as a class with some static methods (similar to Gremlin's P class):

public class Enrichment 
{
    private List<string> projectedKeys;
    private List<GraphTraversal<object,object>> traversals;

    private Enrichment(string projectedKeys, GraphTraversal<object,object> traversal) : 
        this(new List() { projectedKeys }, new List>() {traversal }) 
    {
    }

    private Enrichment(List<string>, List<GraphTraversal<object,object>>  traversals) 
    {
        this.projectedKeys = projectedKeys;
        this.traversals = traversals;
    }

    public List<GraphTraversal<object,object>> GetTraversals() 
    {
        return traversal;
    }

    public List<string> GetProjectedKeys() 
    {
        return projectedKeys;
    }

    public static Enrichment Degree() 
    { 
        return new Enrichment(KeyDegree, __.Map<object>(__.BothE().Count()));
    } 

    public static Enrichment Keys(params String[] propertyKeys) 
    { 
        var valueTraversals = keys.Select(k => __.Values<object>(k)).ToList();
            return new Enrichment(keys.ToList(), valueTraversals);
    } 
}

The Degree() and Keys() methods provide ways to construct predefined Enrichment objects that calculate the vertex degree and gather property values respectively. The results of these methods can be passed to this Enrich extension method on the KillrVideoGraphTraversalExtension class:

public static GraphTraversal> Enrich(this GraphTraversal t, bool includeIdLabel, params Enrichment[] enrichments) {
   // get the traversals associated with each enrichment and flatten them to the traversals that
   // will be passed to the by() modulators
   var projectTraversals = enrichments.Select(e => e.GetTraversals()).SelectMany(i => i).ToList();
   if (includeIdLabel)
   {
      projectTraversals.Add(__.Id());  
      projectTraversals.Add(__.Map(__.Label()));
   }
   // the keys should occur in the same order as the by() modulators - like the traversals above the list of
   // keys per enrichment are flattened so as to be passed to the project() step
   var keys = enrichments.Select(e => e.GetProjectedKeys()).SelectMany(i => i).ToList(); if (includeIdLabel)
   {
      keys.Add("id");
      keys.Add("label");
   }
   // the project() step has a signature that does not allow for an empty set of keys. the first key in the list
   // needs to be passed explicitly followed by a the rest of the list as varargs
   var projectedKeys = keys.GetRange(1, keys.Count() - 1).ToArray();
   var te = t.Project(keys.First(), projectedKeys);
   foreach (GraphTraversal projectTraversal in projectTraversals)
{

   te = te.By(projectTraversal);
}
   return te;
}

With this approach, an arbitrary number of Enrichment objects can be given to the Enrich() step yielding a flexible way for users to control and extend data returned from vertices. It also presents for a nice maintainable way to add new enrichment expressions. The full example code extends that class to include additional methods like InDegree, OutDegree and others. With the DSL, the last Gremlin example in this post that included the degree with the Project() output simplifies down to just the following:

killr.Movies("Young Guns"). Enrich(true, Keys("title", "year"), Degree())

Depending on the target user, another possible approach to the definition of the Enrich() step might have been to drop type safety and write the API as follows:

public static GraphTraversal> Enrich(this GraphTraversal t, params object[] args)

By convention, this step could be called with any or all of the following: T values, Enrichment values, or string instances representative of property keys. The code for this step could validate that each of the "args" was one of these types and would internally build the Project() step depending on the type detected. The API might be more fluid in this sense and could be called as follows:

killr.Movies("Young Guns"). Enrich(Id, Label, "title", "year", Degree())

Neither approach is necessarily "wrong", but for purposes of this blog post and the example, the design was based on the assumption that type safety was important for the imaginary user of the DSL. With type safety, available in the former approach, there is no guesswork about what should be passed to the Enrich() step. With the generic use of object in the latter approach, there is a certain level of reliance on convention, documentation and runtime error checking. Decide who the DSL is for to help determine what API would be most suitable.

As a final note, it is worth mentioning that the Degree() expression and the other examples provided, act as a good step for demonstration of this topic, but keep in mind that degree counts can be expensive in practice (i.e. counting edges of a supernode will be costly). Consider the cost of traversals that are provided to users of the DSL and determine if they can be reasonably executed before making them available. At a minimum, be sure that users are aware of the implications of employing a costly expression in their usage. Remember, just because a slow traversal is tucked into a simple DSL step, won't make that same slow traversal run any faster.

Conclusion

To use the DSL with the DataStax .NET Driver, do something like this:

IDseCluster cluster = DseCluster.Builder(). AddContactPoint("127.0.0.1"). WithGraphOptions(new GraphOptions().SetName("killrvideo")). Build(); IDseSession session = cluster.Connect(); var killr = DseGraph.Traversal(session);

Assuming that the DSL classes with the appropriate Extension Methods are present, the killr GraphTraversalSource and the GraphTraversal instances that it spawns should make available the DSL steps that were created.

KillrVideo Github

The Enrich() DSL step is really just a permutation of the Project() step with some of the shortcut qualities of ValueMap() step. In some ways, this step could be treated as a general "graph" step and not necessarily one that is specific to the domain of the graph itself. In other words, a DSL might expose steps that are generally useful to graphs of any domain. Think of steps like this as a way to extend to the graph language of Gremlin itself. General purpose graph steps in a DSL could be just as powerful and time-saving as application domain specific ones.

Acknowledgements

Daniel Kuppitz reviewed the newly added Enrich() step, as well as the draft of this blog post, and provided feedback for both.

One-Stop Data API for Production GenAI

Astra DB gives developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.