Three Common Mistakes When Building Graph Applications
At DataStax, we help customers build some of the largest production applications on graph databases in the world. From these experiences, we’ve collected a set of three common pitfalls where teams frequently misstep when getting started with graph technology.
These themes happen to parallel one of my favorite video games of all time: SimCity 2000.* SimCity 2000 is a game created in the early 1990s that requires a significant amount of trial and error to be successful. As a new player of SimCity 2000, you are naive to the effects of your decisions, and you only learn the consequences through multiple iterations of game play.
This blog will draw parallels. Following are the three most common mistakes my graph team sees with building graph applications, plus advice on how to avoid them so that you can save time, skip iterations, and get a head start on building amazing applications using graph data.
1. Not understanding branching factor
To effectively build graph applications, you need to understand what branching factor is and how it affects query runtime.
The introduction of graph data into your application brings a new paradigm of data modeling known as “relationship-first design” (as opposed to “entity-first design”). The transition to relationship-first design principles introduces a new set of rules to consider when thinking about your application’s performance.
If like me, you are a fan of SimCity 2000, then this should look familiar:
When you start a new game in SimCity 2000, you are introduced to the different game modes and tools available. Just like graph data modeling, the first step to being successful is to examine the options and determine what are the biggest and most important dials you can tweak.
Even though there are many tricks to graph data modeling, your graph’s branching factor is the most commonly overlooked dial of your graph’s schema. A graph’s branching factor is the expected or average number of edges that are traversed when you walk from one vertex to another.
Unsure what I mean by this? Consider the animation below.
In this animation, the goal is to walk the graph until you find your destination. Above, we start at the left-most vertex (the one in green) until we get to the destination vertex on the far right (the one in red).
For every vertex you walk (traverse) through from left to the right, you have to explore two more edges to get to the next level. Moving from one vertex to two edges creates two new paths to explore. This causes the number of traversers to double between each level. Each vertex’s branching factor causes the split from one to two traversers during your query.
A graph’s branching factor creates exponential growth in the number of traversers required to walk from one vertex to another. The growth in the number of traversers directly correlates to the computational overhead required to process a graph traversal:
Frequently, when we troubleshoot slow-running queries, the root cause is a graph data model that creates a higher-than-anticipated branching factor. A high branching factor is one of the main contributing factors to a poor performing Gremlin query.
2. Not planning for or monitoring the growth of supernodes
Relationship-first data modeling can create a sleeping time bomb in your graph data—namely, supernodes. A supernode is a node that contains a disproportionately high number of incident edges.
Just like in SimCity 2000, high volumes of progress without proper planning will eventually introduce a catastrophe.
For the advanced SimCity 2000 player, these catastrophes show up in your game as disasters and monsters. These are unplanned events that decimate your metropolis and bring your city’s advancement to a grinding halt. These time bombs appear as you are traversing your graph, and a bad data model brings your traversal to a grinding halt. These are your supernodes.
A supernode is any vertex in your graph that has approximately 100,000 or more edges. You will need to track, mitigate, and eliminate the potential for supernodes within your applications.
To find the most likely supernodes in your graph database, you should use your analytics engine to look for the top 10 vertices with the most connections in your graph. You can do this with DataStax via:
The results of your query above will let you know if you currently have a supernode in your graph.
You should monitor the results of that job periodically to detect and mitigate any potential supernode problems as your graph grows over the lifetime of your application.
Monitoring your graph’s health is a great way to reactively handle supernodes in your graph, but the best way to handle them is to be proactive and model your data in a way that mitigates any chance of the supernodes forming.
3. Not fitting the technology to the problem
Just like learning the tools and rules for building a successful city in SimCity 2000, you will inevitably make some mistakes. Consider this example below:
Naively, I was trying to connect the city in the upper right to the open land in the lower right and lower left. If you brute-force your infrastructure at this point in the game, you don’t end up building what you intended. Instead of connected land, I built broken roads and disconnected bridges. From this I learned, a bit too late, that the ground leveling tool is the place to start before hammering roads over my map.
When playing SimCity 2000, it is really easy to spot your errors—they look like broken roads or bridges to nowhere.
When starting to integrate graph technology into your stack, it is also inevitable that you will make mistakes. But unlike in SimCity 2000, there are no broken roads or bridges to easily locate these mistakes.
I’ve found that people often try to use graph to solve non-graph problems, and typically when my team dives in to troubleshoot a customer’s graph cluster, these are the three most common red herrings that we encounter:
- Indexes on every property on every vertex
- Comparing properties to match vertices
- Putting a graph database behind a BI dashboard
Having search indexes on every property of a vertex is an indication that the end use of the application is more about searching data than leveraging the relationships within the data. This type of usage pattern lends itself to using a dedicated search technology, such as DSE Search, which is better optimized to handle these types of questions.
The first step in saving your data in a graph is figuring out what determines a unique person, place, or thing within your data. Commonly, teams do not determine this before they load their raw data. Then, in their Gremlin query, they are left with trying to decide which vertices represent the same unique item. This is detrimental to your queries due to the additional computational overhead required and resulting branching factor explosion that it creates.
Instead of determining this uniqueness each time you run a query, we recommend using DSE Analytics to match and merge your data before you load it into DSE Graph.
Lastly, we know that insights and metrics drive the business decisions that you make on a daily basis. If the end goal of your application is to create a BI dashboard for tracking these metrics, there are a myriad of tools and technologies out there that are well-suited for business intelligence or business analytics. It isn’t very often that a graph database is the right tool for serving up global insights into your company’s analytics. What I mean is: the data from a graph might feed into that insight, and a graph is rarely the best way to serve out the business dashboard. If you have one, please let me know. I am still looking for a good example in this area.
Where to go for more graph database insights
The seasoned graph experts at DataStax help our customers create graph models that mitigate this risk. If you are looking for more on supernodes, we recommend this great talk by Jonathan Lacefield, our Sr. Director of Product Management.
DataStax continues to lead the charge with the most innovative enterprises in building production applications backed by distributed technologies. To accelerate innovation, DataStax formed their Graph Practice, a team of experts focused on growing, advocating, and enabling our customers and the graph industry. As the practice grows, keep watching for our posts, videos, and content on the lessons we learn from the frontlines.
We want to hear from and collaborate with you on your graph problem and interesting uses of graph technology. Reach out to us through this blog, check out our contributions to DataStax Academy, or come find us when we are at an event near you.