Gremlin Recipes: 3 – Recommendation Engine Traversal
Part 3 of 10 for the series Gremlin Recipes. The purpose is to explain the internal of Gremlin and give people a deeper insight into the query language to master it.
This blog post is the 3rd from the series Gremlin Recipes. It is recommended to read the previous blog posts first:
I KillrVideo dataset
To illustrate this series of recipes, you need first to create the schema for KillrVideo and import the data. See here for more details.
The graph schema of this dataset is:
II Recommendation engine
In this post we want to build a simple collaborative filtering engine with Gremlin traversal. Let’s say we want to find all movies in which Harrison Ford has played as an actor and order them by their average rating. The traversal is:
gremlin>g.
V().
has("person", "name", "Harrison Ford"). // Fetch Harrison Ford: Iterator<Person>
in("actor"). // acting as actor in movies: Iterator<Movie>
order().
by(inE("rated").values("rating").mean(), decr). // order movies by average ratings DESC
project("movie","average_rating"). // Display those movies
by("title"). // by title
by(inE("rated").values("rating").mean()) // by average rating: Iterator<Tuple2<String=title,Double=avg rating>>
==>{movie=Blade Runner, average_rating=8.20353982300885}
==>{movie=Star Wars. Episode V: The Empire Strikes Back, average_rating=8.121739130434783}
==>{movie=Star Wars. Episode VI: Return of the Jedi, average_rating=7.900990099009901}
==>{movie=Star Wars IV: A New Hope, average_rating=7.885964912280702}
==>{movie=Indiana Jones: Raiders of the Lost Ark, average_rating=7.872881355932203}
==>{movie=Indiana Jones and the Last Crusade, average_rating=7.809523809523809}
==>{movie=Indiana Jones and the Temple of Doom, average_rating=7.402061855670103}
==>{movie=The Fugitive, average_rating=7.0}
==>{movie=Indiana Jones and the Kingdom of the Crystal Skull (Indiana Jones 4), average_rating=5.5}
==>{movie=Six Days Seven Nights, average_rating=5.132075471698113}
Gremlin recipes: 3 – Recommendation Engine traversal
So it seems that the Harrison Ford movie having the best average rating is Blade Runner and not one of the Star Wars series.I will not explain into the detail the above traversal, if you have some difficulty to understand it, please read the 2 previous blog posts.
Now suppose that you are an user of an online video club (à-la Netflix) and you have just watched Blade Runner. You really enjoyed it so you rated it 9/10.
The challenge for the video club is now to find a movie similar to Blade Runner that could potentially suit your taste.
One of the classical recommendation engine technique, also called collaborative filtering, works as follows:
What has user XXX liked? Who else has liked those things? What have they liked that XXX hasn’t already liked?
With our graph schema, the verb “like” is translated into “has rated with xxx rating”.
So we will have our Gremlin traversal starting as:
gremlin>g.
V().
has("movie", "title", "Blade Runner"). // Fetch Blade Runner: Iterator<Movie>
as("blade_runner") // Save the movie under the label "blade_runner"
...
The as("blade_runner")
step will save the movie using the alias “blade_runner” for later re-use. From this Blade Runner movie, we want to find all the users who rated this movie more than its average rating e.g. more than 8.203 (rounding from 8.20353982300885). The corresponding traversal is:
nE("rated"). // Iterator<Rated_Edge>
where(values("rating").is(gte(8.203))). // iterator.filter(rated -> rated.getRating() >= 8.2)
outV() // Iterator<User>
From those users (who rated Blade Runner more than 8.203) we want to find all the movies they rated more than 8.203 which are NOT Blade Runner itself :
outE("rated"). // Iterator<Rated_Edge>
where(values("rating").is(gte(8.203))). // iterator.filter(rated -> rated.getRating() >= 8.2)
inV(). // Iterator<Movie>
where(neq("blade_runner")). // iterator.filter(movie -> !movie.eq("blade_runner"))
dedup() // Remove possible duplicated movies
If you have trouble to following the direction of the traversal (inE(), outV()
, …) just throw an eye to the graph schema and pay attention to the direction of each edge(arrow).
We then need to project the resulting movies by displaying their title, average rating and genres:
roject("title","average_rating","genres").
by(values("title").
by(outE("rated").values("rating").mean()).
by(out("belongsTo").values("name").fold())
The complete traversal is then:
gremlin>g.
V().
has("movie", "title", "Blade Runner"). // Fetch Blade Runner: Iterator<Movie>
as("blade_runner"). // Save Blade Runner under the label "blade_runner"
inE("rated"). // Iterator<Rated_Edge>
where(values("rating").is(gte(8.203))). // iterator.filter(rated -> rated.getRating() >= 8.2)
outV(). // Iterator<User>
outE("rated"). // Iterator<Rated_Edge>
where(values("rating").is(gte(8.203))). // iterator.filter(rated -> rated.getRating() >= 8.2)
inV(). // Iterator<Movie>
where(neq("blade_runner")). // iterator.filter(movie -> !movie.eq("blade_runner"))
dedup(). // Remove possible duplicated movies
project("title","average_rating","genres"). // Project on
by(values("title")). // movie's title
by(inE("rated").values("rating").mean()). // movie average rating
by(out("belongsTo").values("name").fold()). // movie's genres
limit(10)
==>{title=Inglourious Basterds, average_rating=7.801526717557252, genres=[War, Action, Comedy]}
==>{title=Untouchable, average_rating=8.141176470588235, genres=[Comedy, Drama]}
==>{title=Batman, average_rating=6.801980198019802, genres=[Thriller, Action, Fantasy]}
==>{title=The Band Wagon, average_rating=7.8, genres=[Musical]}
==>{title=Silverado, average_rating=6.697916666666667, genres=[Western]}
==>{title=The Last Samurai, average_rating=6.767123287671233, genres=[Action, Adventure]}
==>{title=Life Is Beautiful, average_rating=8.502923976608187, genres=[Comedy, Drama]}
==>{title=Love's a Bitch, average_rating=7.656716417910448, genres=[Drama]}
==>{title=Requiem for a Dream, average_rating=7.79047619047619, genres=[Drama]}
==>{title=The Matrix, average_rating=7.895348837209302, genres=[Sci-Fi, Thriller, Action, Fantasy]}
>If we look at the first 10 movies, we can see that they are not at all a good match for our Blade Runner fan.
- some of the movies have a poor average rating: 6.8 for Batman
- some of the movies do not belong to neither Sci-Fi nor Action genres: Life Is Beautiful
Indeed our previous traversal contains several caveats. First, the query outE("rated").where(values("rating").is(gte(8.203)))inV() means “give me all the movies rated more than 8.203 by those users” (those users == those who rated Blade Runner more than 8.203). The problem is that even if those users may have rated those movie more than 8.203, it doesn’t mean necessarily that those movies have received a good rating from all other users. We want to retain only a subset of those movies with an average rating >= 8.203. For this we need to filter further the results set with:
where(neq("blade_runner")). // iterator.filter(movie -> !movie.eq("blade_runner"))
where(inE("rated").values("rating").mean().is(gte(8.203))). // iterator.filter(movie -> getAvgRating(movie) >= 8.203))
But this is not sufficient enough. We can have movies with good average rating (>= 8.203) but the genres do not match at all those of Blade Runner. We want also to enforce that the matching movies belong to either Action or Sci-Fi genres. Again, an extra filtering step is necessary:
where(neq("blade_runner")). // iterator.filter(movie -> !movie.eq("blade_runner"))
where(inE("rated").values("rating").mean().is(gte(8.203))). // iterator.filter(movie -> getAvgRating(movie) >= 8.203))
where(out("belongsTo").has("name", within("Sci-Fi", "Action"))). // iterator.filter(movie -> getGenres(movie).contains("Sci-Fi") || getGenres(movie).contains("Action"))
The correct traversal for this collaborative filtering is then:
gremlin>g.
V().
has("movie", "title", "Blade Runner"). // Fetch Blade Runner: Iterator<Movie>
as("blade_runner"). // Save Blade Runner under the label "blade_runner"
inE("rated"). // Iterator<Rated_Edge>
where(values("rating").is(gte(8.203))). // iterator.filter(rated -> rated.getRating() >= 8.2)
outV(). // Iterator<User>
outE("rated"). // Iterator<Rated_Edge>
where(values("rating").is(gte(8.203))). // iterator.filter(rated -> rated.getRating() >= 8.2)
inV(). // Iterator<Movie>
where(neq("blade_runner")). // iterator.filter(movie -> !movie.eq("blade_runner"))
where(inE("rated").values("rating").mean().
is(gte(8.203))). // iterator.filter(movie -> getAvgRating(movie) >= 8.203))
where(out("belongsTo").has("name",
within("Sci-Fi", "Action"))). // iterator.filter(movie -> getGenres(movie).contains("Sci-Fi") || getGenres(movie).contains("Action"))
dedup(). // Remove possible duplicated movies
project("title","average_rating","genres"). // Project on
by(values("title")). // movie's title
by(inE("rated").values("rating").mean()). // movie average rating
by(out("belongsTo").values("name").fold()). // movie's genres
limit(10)
==>{title=Pulp Fiction, average_rating=8.581005586592179, genres=[Thriller, Action]}
==>{title=Seven Samurai, average_rating=8.470588235294118, genres=[Action, Adventure, Drama]}
==>{title=A Clockwork Orange, average_rating=8.215686274509803, genres=[Sci-Fi, Drama]}
The results are now more satisfactory. All 3 movies have average rating >= 8.203 and each of them belong to either Action or Sci-Fi genre.
And that’s all folks! Do not miss the other Gremlin recipes in this series.
If you have any question about Gremlin, find me on the datastaxacademy.slack.com, channel dse-graph. My id is @doanduyhai