TechnologyApril 10, 2017

Reducing Computational Complexity with Correlate Traversals

Reducing Computational Complexity with Correlate Traversals

image 1

The field of graph computing is unique in that not only is it applied to querying structured data in a manner analogous to the database domain, but also, graphs can be subjected to an ever growing set of statistical algorithms developed in the disciplines of graph theory and network science. This distinction is made salient by two canonical graph uses: “Friend-of-a-Friend” (querying) and PageRank (statistics). Statistical techniques are primarily concerned with the shape of the graph and, more specifically, with what a particular vertex (or set of vertices) contributes to that overall shape/structure. A graph-centric statistic will typically analyze a graph structure and yield a single descriptive number as the statistic. For instance, the diameter of the graph is the longest shortest path between any two vertices in the graph, where diameter: G → ℕ+ maps () a graph (G) to a positive natural number (+). The concept of diameter is predicated on the concept of eccentricity. Eccentricity is a vertex-centric statistic that maps every vertex to a single number representing the longest shortest path from it to every other vertex. Thus, eccentricity: G → ℕ|V|, where |V| is a Map<Vertex,Integer> and diameter = max(eccentricity(G)). Eccentricity is a sub-statistic of a more computationally complex graph statistic known as betweenness centrality which measures for each vertex, the number of shortest paths it exists on between every pair of vertices in the graph. Vertices that are more “central” are those that tend to connect other vertices by short paths. In spoken language, this makes good sense as a centrality measure. However, in practice, betweenness centrality has a computational complexity reaching 𝒪(|V|3) and as graphs scale beyond the confines of a single machine, 𝒪(|V|3) algorithms become unfeasible. Even 𝒪(|V|2) are undesirable. In fact, algorithmic approaches to database analysis are trying to achieve sub-linear times in order to come to terms with the ever growing size of modern data sets. This article will present a technique that can reduce the computational complexity of a graph statistic by mapping it to a correlate statistic in a lower complexity class. In essence:

If traversalx yields the same results as traversaly and traversaly executes faster than traversalx, then use traversaly instead of traversalx.

Correlating Centrality Measures

centrality function

There exists numerous centrality measures that can be applied to a graph in order to rank its vertices according to how “centrally located” they are. Such measures are generally defined as:

centralityx: G → ℝ+|V|

A centrality measure will map () a graph (G) to a positive real number vector/array (+) of size |V|. The resultant vector contains the “centrality” score for each vertex in the graph — Map<Vertex,Double>.

Five common centrality measures are presented in the table below. Along with each measure is the means by which the measure is calculated. Degree-based centralities are founded on the degree of a vertex and for PageRank and eigenvector centrality, in a recursive manner, on the degree of its adjacent vertices. Closeness and betweenness centrality are geodesic-based centralities which leverage shortest paths through a graph in their calculation and thus, are not based on vertex degree — though in many situations, recursively defined degrees and shortest paths can be and often are correlated. Finally, for each metric, the computational complexity is provided denoting the time cost of executing the measure against a graph, where |V| and |E| are the size of the graph’s vertex and edge sets, respectively.

Centrality Measure Topological Aspect Computational Complexity
Degree Centrality Degree-Based 𝒪(|V|)
PageRank Centrality Degree-Based 𝒪(|V|+|E|)
Eigenvector Centrality Degree-Based 𝒪(|V|+|E|)
Closeness Centrality Geodesic-Based 𝒪(|V|*(|V|+|E|))
Betweenness Centrality Geodesic-Based 𝒪(|V|3)

degree geodesic types

Understanding the Complexity of Betweenness Centrality

 
The betweenness centrality of vertex v is defined as

formula

where σst is the number of shortest paths between vertex s and vertex t and σ(v)st is the number of those paths that contain vertex v. As such, it is not possible to compute betweenness centrality without first computing the shortest paths between all pairs of vertices. The Floyd-Warshall algorithm computes all pairs shortest paths on weighted graphs and has a cubic runtime because it computes the shortest paths for a subgraph of size n << |V| and then adds a single vertex and adjusts. Adding the vertex takes 𝒪(|V|2) time as it must check every pair of vertices in the subgraph to see if a new shortest path exists through the recently added vertex. This is done |V|-times and thus, 𝒪(|V|2) * 𝒪(|V|) = 𝒪(|V|3).

Conceptually, the simplest way to compute all shortest paths is to run Dijkstra’s algorithm from each vertex in the graph which, for any one vertex has a runtime of 𝒪(|E|+(|V|*log(|V|))). In the worst case, |E| = |V|2 and Dijkstra runs in 𝒪(|V|2). Running it |V|-times means that all shortest paths requires 𝒪(|V|3), which is identical to Floyd-Warshall.

This callout box was written by Anthony Cozzie (DSEGraph Engineer).

 

It should be apparent that geodesic-based centralities are more expensive than degree-based centralities. It is for this reason that in large-scale graph computations, it is important to avoid geodesics if possible. In other words, if another centrality measure yields the same or analogous rank vector but is cheaper to execute, then it should be leveraged in production. For instance, assume the following 1000 vertex scale-free graph generated using iGraph for the R:Statistics language.

g <- sample_pa(1000,directed=FALSE)
plot(g, layout=layout.fruchterman.reingold, vertex.size=3, vertex.color="red", edge.arrow.size=0, vertex.label=NA)

The five aforementioned centrality measures can be applied to this synthetically generated graph. Each centrality yields a vector with 1000 elements (i.e. |V|). Each of these vectors can be correlated with one another to create a correlation matrix. Given that the most important aspect of a centrality measure is the relative ordering of the vertices, the Spearman rank order correlation is used.

matrix <- cbind(centr_degree(g)$res,page.rank(g)$vector,eigen_centrality(g)$vector,closeness(g),betweenness(g))
colnames(matrix) <- c("degree","pagerank","eigenvector","closeness","betweenness")
cor(matrix,method="spearman")

scale free graph

  Degree PageRank Eigenvector Closeness Betweenness
Degree 1.0000000 0.88429222 0.31932223 0.30950586 0.9927900
PageRank 0.8842922 1.00000000 0.05497317 0.04750024 0.8720052
Eigenvector 0.3193222 0.05497317 1.00000000 0.99336752 0.3218426
Closeness 0.3095059 0.04750024 0.99336752 1.00000000 0.3120138
Betweenness 0.9927900 0.87200520 0.32184261 0.31201375 1.0000000

 

In the table above, for the specific synthetically generated scale-free graph, degree centrality and betweenness centrality are nearly perfectly correlated with ρ=0.9928. It would be prudent to use degree centrality instead of betweenness centrality as the computational complexity goes from 𝒪(|V|3) to 𝒪(|V|) and still yields a nearly identical rank vector. It is important to stress that this correlation matrix is specific to the presented graph and each graph instance will need to create it own correlation matrix. Thus, it isn’t always the case that degree and betweenness centrality are so strongly correlated. Identifying centrality correspondences must be determined on a case-by-case basis. As a counter example, a 1024 vertex 2-dimensional lattice graph has the following correlation matrix.

g <- make_lattice(length=32, dim=2)
plot(g, layout=layout.grid, vertex.size=3, vertex.color="red", edge.arrow.size=0, vertex.label=NA)
matrix <- cbind(centr_degree(g)$res,page.rank(g)$vector,eigen_centrality(g)$vector,closeness(g),betweenness(g))
colnames(matrix) <- c("degree","pagerank","eigenvector","closeness","betweenness"
cor(matrix,method="spearman")

lattice graph

  Degree PageRank Eigenvector Closeness Betweenness
Degree 1.0000000 0.5652051 0.5451981 0.4773026 0.5652057
PageRank 0.5652051 1.00000000 -0.3729564 -0.4237888 -0.3582003
Eigenvector 0.5451981 -0.3729564 1.00000000 0.9890972 0.9960206
Closeness 0.4773026 -0.4237888 0.9890972 1.00000000 0.9752467
Betweenness 0.5652057 -0.3582003 0.9960206 0.9752467 1.0000000

 

In the above lattice graph correlation matrix, degree centrality and betweenness centrality have a ρ=0.565 which is a positive, though not a strongly correlated relationship. Furthermore, notice that in this particular lattice, the popular PageRank algorithm is negatively correlated with most of the other centrality measures. In this way, lattices do not offer a good alternative for the semantics of PageRank centrality.

Understanding Correlation Methods

 
Any two vectors (arrays) of equal length can be correlated using a correlation method. Correlations return a real number (double) between -1 and 1, where -1 states that the two vectors are inversely related (negatively correlated), 0 denotes that there is no relationship (uncorrelated), and +1 states the vectors are related (positively correlated). The two most common correlation methods are Pearson and Spearman. Pearson correlation is used when the relative difference between the vector values is important. For instance, if the degree centrality values double in one vector, are the betweenness centrality values doubling in the other vector? If accounting for that linear relationship is important/meaningful, then a Pearson correlation should be used. On the other hand, Spearman should be used when either the vector values are ordinal or do not necessarily relate proportionally. Typically, for centrality measures, the relative difference is not as important as the ranking between the vertices in the two vectors.

spearman vs pearson

 

Diagram provided under Creative Commons by Wikimedia Foundation.

Using Scale-Free Subgraphs for Centrality Correlations

Facebook full graph

A Facebook social graph was derived via survey administered by Julian McAuley and Jure Leskovec and made publicly available on the Stanford SNAP archive. This real-world graph (i.e. not synthetically generated) is scale-free in that most vertices have few edges and few vertices have many edges. The graph’s correlation matrix for the 5 aforementioned centrality measures is provided below using the Spearman rank order correlation.

  Degree PageRank Eigenvector Closeness Betweenness
Degree 1.0000000 0.8459264 0.5270274 0.4300193 0.7881420
PageRank 0.8459264 1.00000000 0.2122095 0.2367609 0.7602193
Eigenvector 0.5270274 0.2122095 1.00000000 0.4484385 0.4133148
Closeness 0.4300193 0.2367609 0.4484385 1.00000000 0.4791568
Betweenness 0.7881420 0.7602193 0.4133148 0.4791568 1.0000000

 

Note that degree centrality and betweenness centrality are strongly correlated (ρ=0.788). Furthermore, degree centrality and PageRank centrality are strongly correlated (ρ=0.846). This means, that if Facebook were interested in determining the betweenness or PageRank centrality of their social graph, it would be much cheaper (computationally) to simply use degree centrality given the strength of the correlations between the respective measures.

Facebook community graph

In many real-world situations, graphs can be so large that to determine the above correlation matrix is expensive and furthermore, once the centrality measures are computed for correlation, it would be prudent to simply use the desired centrality measure’s result. However, one of the interesting aspects of scale-free graphs is that they are fractal in nature. This means that any connected subset of the full graph maintains the same (or similar) statistical properties. Much like the Mandelbrot set , it is possible to “zoom into” a scale-free graph and bear witness to yet another scale-free graph with analogous structural features.

Statistics Workflow

This self-similar aspect of scale-free graphs makes it possible to calculate a correlation matrix on a smaller subgraph of the larger graph and maintain confidence that the subgraph correlations will generalize to the full graph. Thus, for instance, instead of calculating the centrality correlation matrix on a 1 billion vertex graph, it is possible to calculate it on a 10,000 vertex subgraph, where the latter subgraph will be much easier to handle both in terms of time (computational complexity) and space (data set size). In order to demonstrate this point, the 10 largest communities in the presented Facebook graph are extracted using the random walk inspired walktrap community detection algorithm. Each subgraph community is then subjected to the 5 centrality measures in order to yield a respective correlation matrix. Finally, all 11 correlation matrices (full graph and 10 communities) are reduced to a single variance matrix which denotes the stability of the correlations across all matrices/subgraphs. This statistical workflow yields a result that is in accord with the theory of scale-free graphs as there is very little variance amongst the correlations across its fractal components.

# (1) isolate subgraphs using the walktrap community detection algorithm
communities <- cluster_walktrap(g)
membership <- membership(communities)
histogram <- hist(membership,breaks=max(membership))
sorted_communities <- sort(histogram$counts, decreasing=TRUE, index.return=TRUE)
 
# (2) create correlation matrices for the 10 largest communities
list <- list(cor(matrix,method="spearman"))
for(i in sorted_communities$ix[1:10]) {
  subgraph <- induced_subgraph(g,communities[[i]],impl="create_from_scratch")
  submatrix <- cbind(centr_degree(subgraph)$res,page.rank(subgraph)$vector,
                     eigen_centrality(subgraph)$vector,closeness(subgraph),
                     betweenness(subgraph))
  colnames(submatrix) <- c("degree","pagerank","eigenvector","closeness","betweenness")
  list <- append(list,list(cor(submatrix,method="spearman")))
}
 
# (3) reduce the correlation matrices to a single variance matrix
apply(simplify2array(list),1:2, var)
  Degree PageRank Eigenvector Closeness Betweenness
Degree 0.000000000 0.004386645 0.01923375 0.03285931 0.01544800
PageRank 0.004386645 0.000000000 0.06718792 0.06625113 0.00799879
Eigenvector 0.019233753 0.067187923 0.000000000 0.03008761 0.05339487
Closeness 0.032859313 0.066251132 0.03008761 0.000000000 0.03480015
Betweenness 0.015448004 0.007998790 0.05339487 0.03480015 0.000000000

 

Conclusion

The fields of graph theory and network science offer numerous statistical algorithms that can be used to project a complex graph structure into a smaller space in order to make salient some topological feature of the graph. The centrality measures presented in this article project a |V|2-space (i.e. an adjacency matrix) into a |V|-space (i.e. a centrality vector). All centrality measures share a similar conceptual theme — they all score the vertices in the graph according to how “central” they are relative to all other vertices. It is this unifying concept which can lead different algorithms to yield the same or similar results. Strong, positive correlations can be taken advantage of by the graph system architect, enabling them to choose a computationally less complex metric when possible.

One-Stop Data API for Production GenAI

Astra DB gives developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.