Freebase Meets Materialize 2: The Data
Last post, I introduced the idea of using Materialize to implement fast reads of highly normalized Freebase data for an API endpoint. Today, we start by downloading the data and doing a bit of preprocessing.
1.9 Billion Triples
The final public copy of the Freebase data can be downloaded at https://developers.google.com/freebase/. It’s a 22 GB gzip (250 GB uncompressed) of N-Triples, which is a text-based data format with a spec and everything. Each line is a <subject, predicate, object> triple and, according to this page, there are 1.9 billion of them.
In the interest of fast iteration, I’d like to start with something that comfortably fits in memory. Before we can trim down the data, we have to look at how it’s structured.
Structure of Freebase Data
This is all better explained by the since-removed API documentation (thank you, Internet Archive), but I’ll go over a bit of it.
Freebase data is a structured representation of things and relationships between those things. In this case, “things” includes concrete stuff like people, films, and music, but also more nebulous concepts like love (which can be the topic of a book), plus really anything with a page in Wikipedia. The things are called topics and each has a stable unique identifier. Most of these IDs look like http://rdf.freebase.com/ns/m.09r8m05 (what Freebase calls a MID). The interesting part is the last bit (m.09r8m05), which is a base-32 encoded integer. The m can also be a g for reasons. Some things in Freebase use a more human-readable ID that looks like http://rdf.freebase.com/ns/film.film. (I think the human-readable ones also have a corresponding MID, but it’s been a while and I’m not sure.)
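As an aside, that base-32 decoding is small enough to sketch in Rust. The alphabet below (digits, then consonants, then underscore) is the one commonly documented for MIDs; treat it as an assumption rather than gospel:

```rust
// Decode the base-32 suffix of a Freebase MID (e.g. the "09r8m05" in
// m.09r8m05) into an integer. ALPHABET is an assumption: the commonly
// documented MID alphabet of digits, consonants, and underscore.
const ALPHABET: &str = "0123456789bcdfghjklmnpqrstvwxyz_";

fn decode_mid(suffix: &str) -> Option<u64> {
    suffix.chars().try_fold(0u64, |acc, c| {
        // Reject any character outside the alphabet (e.g. vowels).
        ALPHABET.find(c).map(|idx| acc * 32 + idx as u64)
    })
}

fn main() {
    println!("{:?}", decode_mid("09r8m05"));
}
```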
Each line in the data represents a 3-tuple of subject, predicate, and object. I personally understand this best with some examples (IDs shortened for clarity):

```
<.../m.09r8m05> <.../type.object.name> "Erica Albright"@en
```

m.09r8m05 is the ID of the character Erica Albright in the 2010 film The Social Network. This tuple expresses the name of the character in English. The m.09r8m05 topic also has a type (in fact it has multiple), which tells you what sort of thing it is:
```
<.../m.09r8m05> <.../type.object.type> <.../common.topic>
<.../m.09r8m05> <.../type.object.type> <.../film.film_character>
```
film.film_character is mostly self-explanatory and common.topic is the most general topic type. It roughly corresponds to anything that could have an entry in Wikipedia (ignoring Wikipedia’s notability requirement).

When something (like m.09r8m05) has a type (like film.film_character), it receives the ability to have the sorts of relationships granted by that type. Said another way, getting typed as film.film_character opens up some new predicates for use with m.09r8m05. The interesting thing about the predicates is that they also get IDs, and information about them is also stored in Freebase, meaning that the schema of the data is stored in the data itself.
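Mechanically, pulling apart lines like these is easy because subjects and predicates never contain spaces. Here’s a naive sketch; the real N-Triples spec has escape sequences and literal forms that this ignores:

```rust
// Naive split of an N-Triples line into (subject, predicate, object).
// The spec has escapes and literal forms this doesn't handle; it's
// enough for the simple lines shown above.
fn parse_triple(line: &str) -> Option<(&str, &str, &str)> {
    let line = line.trim();
    // Full N-Triples statements end with " ."; tolerate its absence.
    let line = line.strip_suffix('.').map(str::trim_end).unwrap_or(line);
    let mut parts = line.splitn(3, ' ');
    Some((parts.next()?, parts.next()?, parts.next()?))
}

fn main() {
    let line = r#"<.../m.09r8m05> <.../type.object.name> "Erica Albright"@en ."#;
    println!("{:?}", parse_triple(line));
}
```

Splitting into at most three parts is what keeps the quoted literal (spaces and all) intact as the object.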
I looked for film.film_character, but grep-ing through 250 GB takes… a while, so here’s film.film.directed_by instead:
```
<.../film.film.directed_by> <.../type.property.reverse_property> <.../film.director.film>
<.../film.film.directed_by> <.../type.property.unique> "false"
<.../film.film.directed_by> <.../type.object.type> <.../type.property>
<.../film.film.directed_by> <.../type.property.expected_type> <.../film.director>
<.../film.film.directed_by> <.../type.object.name> "Directed by"@en
<.../film.film.directed_by> <.../type.property.schema> <.../film.film>
```
The type.object.type of type.property means that film.film.directed_by can be used as a predicate. An expected_type of film.director means that the object of triples with that predicate will be a film.director. And the type.property.schema of film.film means that the subject of that triple will be a film.film (I think).
Also note here that type.property.unique is false, meaning a film can have multiple directors. This is an instance of what I was talking about in the last post, where the constraints (foreign keys/checks/etc) are also part of the data.

The remaining triple here is type.property.reverse_property, which establishes a relationship between “film F was directed by director D” and “director D directed film F”. At first glance this seems to me to be completely redundant information, but who knows.
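Gathered into one place, my reading of those schema triples looks something like this struct (the field names are mine, not Freebase’s):

```rust
// My reading of the film.film.directed_by schema triples above, collected
// into one struct. Field names are mine, not Freebase's.
#[derive(Debug)]
struct PropertySchema {
    id: String,              // film.film.directed_by
    name: String,            // type.object.name: "Directed by"
    subject_type: String,    // type.property.schema: film.film
    expected_type: String,   // type.property.expected_type: film.director
    unique: bool,            // type.property.unique: "false"
    reverse: Option<String>, // type.property.reverse_property
}

fn main() {
    let directed_by = PropertySchema {
        id: "film.film.directed_by".into(),
        name: "Directed by".into(),
        subject_type: "film.film".into(),
        expected_type: "film.director".into(),
        unique: false,
        reverse: Some("film.director.film".into()),
    };
    println!("{:?}", directed_by);
}
```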
It’s clear from just grep-ing around this data that my intuition is correct and 250 GB is too much to play around with, so it’s time to cut it down.
Fewer than 1.9 Billion Triples
In something like film.film, the first “film” is something Freebase calls a domain, which is a grouping of related types. (In addition to domains like “film” and “people”, there is also a “user” domain in Freebase which let anybody explore making their own schemas. There are some real gems in there, but I’ll leave that for you to explore.)
I decided to roughly group everything by domain. So, for example, all the film schemas and data will be in their own N-Triples file. That feels separable enough that I could do some iterative prototyping. The immediate hiccup is that a topic can have multiple types (like film.director). This wouldn’t otherwise be an issue, except I’m certainly going to want to render the name, which is a common.topic.alias. Having the names of everything in the file for the “common” domain isn’t really going to cut a lot of data.
It may be true that “David Fincher” is a person.person, but it’s probably more interesting that he’s a film.director, so I think it’d be okay for all of the “David Fincher” data to end up with the film data. Luckily there’s a property called kg.object_profile.prominent_type that’s exactly what I want. This is an attempt to assign one most notable type to a topic. It’s not present for all topics, but it’s good enough.

This means I can use prominent_type to create an ID -> domain map that will be used to route each triple to its domain file. Sadly, the data in the dump isn’t ordered such that I can make this map on the fly without buffering, so I do a preprocessing step of grep-ing every kg.object_profile.prominent_type triple into one file.
```
$ time zfgrep "kg.object_profile.prominent_type" freebase-rdf-latest.gz | gzip -c > ids.nt.gz

real    33m36.427s
user    33m4.282s
sys     0m18.399s
```
Good thing we only have to run that once.
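Once that file exists, building the map is a small loop over it. A sketch, assuming the full ns-prefixed IRIs from the dump, where the domain is the text before the first “.” in the type id:

```rust
use std::collections::HashMap;

// Build the MID -> domain map from the pre-grepped prominent_type triples.
// Assumes full IRIs like <http://rdf.freebase.com/ns/m.09r8m05>; the domain
// is the text before the first '.' in the type id (e.g. "film").
fn strip_ns(term: &str) -> Option<&str> {
    term.strip_prefix("<http://rdf.freebase.com/ns/")?.strip_suffix('>')
}

fn domain_map<'a>(lines: impl Iterator<Item = &'a str>) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for line in lines {
        let mut parts = line.split_whitespace();
        if let (Some(s), Some(_p), Some(o)) = (parts.next(), parts.next(), parts.next()) {
            if let (Some(mid), Some(ty)) = (strip_ns(s), strip_ns(o)) {
                if let Some(domain) = ty.split('.').next() {
                    map.insert(mid.to_string(), domain.to_string());
                }
            }
        }
    }
    map
}

fn main() {
    let line = "<http://rdf.freebase.com/ns/m.09r8m05> \
                <http://rdf.freebase.com/ns/kg.object_profile.prominent_type> \
                <http://rdf.freebase.com/ns/film.film_character> .";
    println!("{:?}", domain_map(std::iter::once(line)));
}
```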
Then I write an over-engineered and under-documented rust program to partition the full data into one-per-domain files. It’s over-engineered because it sounded fun and that sort of thing is exactly what Skunkworks Fridays are for. I also suspect things like an N-Triples parser (tested against the spec’s golden file!) will be “useful” “later” for “something”. (This is not foreshadowing, I’m genuinely just here madly hand-waving over everything.)
To further cut down the data, I filter out stuff like non-English names. I also filter out the fun stuff in the user domain for now. :(
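The language filter amounts to peeking at a literal’s language tag. A minimal version (keeping IRIs and untagged literals, and making no claim this matches my actual program):

```rust
// Keep a triple's object only if it's an IRI, an untagged literal, or a
// literal tagged as English. Minimal sketch; a real N-Triples parser
// would do this properly.
fn keep_object(object: &str) -> bool {
    match object.rfind('@') {
        // A language-tagged literal looks like "..."@tag.
        Some(i) if object[..i].ends_with('"') => {
            let tag = &object[i + 1..];
            tag == "en" || tag.starts_with("en-")
        }
        _ => true, // IRIs and untagged literals pass through
    }
}

fn main() {
    println!("{}", keep_object(r#""Erica Albright"@en"#)); // true
    println!("{}", keep_object(r#""Le Réseau social"@fr"#)); // false
}
```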
```
$ time cargo run -p cmd --release -- ids.nt.gz freebase-rdf-latest.gz ./freebase/
...

real    114m1.647s
user    113m13.981s
sys     0m30.703s
```
Glad I only have to run that once, too.
Here are the resulting Totally Reasonable™ file sizes:
```
-rw-r--r-- 1 dan staff 2.7G Apr 24 12:10 music.nt.gz
-rw-r--r-- 1 dan staff 804M Apr 24 12:10 book.nt.gz
-rw-r--r-- 1 dan staff 384M Apr 24 12:10 film.nt.gz
-rw-r--r-- 1 dan staff 373M Apr 24 12:10 tv.nt.gz
-rw-r--r-- 1 dan staff 363M Apr 24 12:10 location.nt.gz
-rw-r--r-- 1 dan staff 275M Apr 24 12:10 business.nt.gz
-rw-r--r-- 1 dan staff 233M Apr 24 12:10 people.nt.gz
-rw-r--r-- 1 dan staff 112M Apr 24 12:10 biology.nt.gz
-rw-r--r-- 1 dan staff  66M Apr 24 12:10 education.nt.gz
-rw-r--r-- 1 dan staff  64M Apr 24 12:10 government.nt.gz
...
```
Yawns Pointedly
I know, I know. Next week, I (finally) fire up Materialize and do something with it.