Freebase Meets Materialize 1: Introduction

I recently started working at Materialize. Friday here is called “Skunkworks Friday” and is reserved for personal/professional development, moonshot projects, and other things that don’t get priority as part of the normal product+engineering planning cycle. I’ve decided to use my first few to prototype using Materialize as a generalized replacement for some hand-rolled infrastructure microservices that we had at a previous company.

Background #

For several years, I worked at Foursquare, back when they were mostly a consumer tech company. I was on the monetization team, but most people worked on the user facing app and website. Foursquare, like most apps at the time, kept data in a database but encapsulated this in a REST API. This API is what the mobile apps and the website talked to.

As was (and is) best practice, the data in the system of record database was normalized. Each user, venue, checkin, tip, etc was its own record with a unique identifier. A checkin (user U is at place P at time T) would then refer to the associated venue and user by embedding their unique identifiers. Normalization is powerful because it means that updates (e.g. changing the name of a venue) only need to happen in one place: the canonical record.

The flip side to normalization is that most uses of data require joining the records together. The API to retrieve information about a checkin would bring in the user and venue records so that the app could render their names. Some API endpoints (like information about a single checkin) were simple enough that these joins could be done on the fly and still be fast enough that the app felt responsive to the user. Others required joining so much together that if we’d done it when the endpoint was called, it would take too long and the app would have felt unresponsive. This might be something like the API to get information about a venue, which pulled in tips about the venue, the users that wrote those tips, your friend graph relationship to the users that wrote those tips, and so on.

The opposite of normalization is denormalization. For example, though we didn’t do this, we could have embed a user and venue name in each checkin record next to the respective unique identifiers. Denormalized data is fast on read because there are fewer joins, but loses the update-in-one-place property of a fully normalized database. It also takes more space because data is stored more than once (this is usually a lesser transgression). Taken to the extreme, one could imagine many fully denormalized copies - one tailored for each API endpoint.

Performant application development often involves careful thinking about where your data will fall along this normalization/denormalization spectrum. When you’re lucky, there’s some obvious point that’s both fast enough and straightforward to keep updated. When you’re not… there’s some fairly unsatisfying options.

One option is to keep data normalized and then also keep denormalized versions of it in the same database. Then, when the normalized data changes, all denormalized copies of that data are updated in the same transaction. This pushes the burden for keeping them in agreement onto the application developer. It can work, but becomes increasingly burdensome and bug-prone as the app grows in complexity. (However, note that this is exactly what a database index is! Indexes are denormalizations that the database maintains for you. Most databases are limited in the shapes of the indexes that they can automatically keep updated, but as we’ll see below, some databases *cough* Materialize *cough* support much more generality in their “indexes”.)

Another option is to use software to maintain the denormalized copies, which is what Foursquare did. They had an engineering team, as part of infrastructure, that wrote bespoke microservices to follow changes as they happened in the database, and update whatever denormalizations are affected with the minimal necessary work. At the time, we slurped the database logs directly, though nowadays this would likely be done as part of a change-data-capture based streaming architecture.

These microservices worked well, but required a team with full-time staffing to maintain them. This involved performance work and bug fixes, but also a ton of work to spin up a new one when required for a feature launch. Inevitably, the denormalizations were all just different enough that they couldn’t be nicely generalized and each required a good bit of custom code.

Some databases have the concept of a VIEW, which can be thought of as one of these denormalizations written in SQL. An API endpoint could use one of these, but a bare VIEW executes its logic when queried, which doesn’t save any time. On the other hand, a MATERIALIZED VIEW fully computes the denormalization and is fast to query. This is exactly what we want!

Unfortunately, implementations of MATERIALIZED VIEW in existing databases are almost never recomputed incrementally as the underlying data changes. Periodically and/or at the user’s request, the system runs a big batch computation of the entire view and saves it, using it in future queries. Even if this recompilation is run continually in a loop, it introduces latency between when the normalized data changes and when the denormalized data catches up. This repeated “full refresh” recomputation is also resource intensive. As the amount of data increases, latency and CPU utilization go up. There are a few databases that can incrementally compute a MATERIALIZED VIEW but only for a fairly restrictive set of special cases.

So Why are you Telling me This? #

Enter Materialize, which maintains SQL MATERIALIZED VIEWs incrementally, doing as little work as possible in response to each change in the underlying data. It is also much more expressive in the kinds of SQL queries it can incrementally maintain, including many-way joins and complex aggregations. This is pretty obviously useful for things like analytics dashboards, but when I first heard about the timely and differential dataflow projects that power Materialize, my immediate thought was Foursquare’s denormalization microservices.

As I mentioned, I’ve decided to use my first few Skunkworks Fridays to prototype using Materialize as a replacement for what Foursquare was doing by hand. The basic idea, as hinted above, is that the data of record will be stored fully normalized, but in Materialize I’ll have a MATERIALIZED VIEW corresponding to each API endpoint of a consumer-facing app. A nice side-benefit is that this will give me experience using the product I’m now developing and the opportunity to see it as a user.

Freebase #

A long time ago (pre-Foursquare), I heard about the Freebase project. Freebase was a sort of “structured data” Wikipedia for storing facts. For example: the height of the Eiffel Tower, actor A played role R in movie M, and the hierarchy of administrative regions in the United States. These facts are stored as <subject, predicate, object> triples (more on this in the next post). The company behind Freebase was called “Metaweb” because the structure of this data was also expressed as these triples. In some sense, it’s the “ultimate normalization” of data, in which the schema and constraints (foreign keys/checks/etc) aren’t stored as part of database table structure, but as part of the data itself. (Notice that the MATERIALIZED VIEW per endpoint is then a parallel “ultimate denormalization” of data. Why do anything halfway amirite?)

Freebase was acquired by Google and the database has been internalized (RIP), but Google still hosts a copy of the last publicly available freebase dataset. So, my plan is to play with the idea of building an application on top of triples (seeded with the Freebase data) and using Materialize to maintain the denormalizations needed to keep it performant.

What’s Next #

Well that’s what I’m planning to do and a bit of my motivations. In the next post, I’ll download a dump of the Freebase data and extract a smaller, more manageable chunk to work with. In post 3, I’ll fire up Materialize and use it to render something useful. See you then!

Thanks to Arjun

 
2
Kudos
 
2
Kudos

Now read this

Implementing Column Families in CockroachDB

Originally published at www.cockroachlabs.com on September 29, 2016. CockroachDB is a scalable SQL database built on top of a transactional key value store. We don’t (yet) expose the kv layer but it’s general purpose enough that we’ve... Continue →