Freebase Meets Materialize 2: The Data

Last post, I introduced the idea of using Materialize to implement fast reads of highly normalized Freebase data for an API endpoint. Today, we start by downloading the data and doing a bit of preprocessing.

1.9 Billion Triples #

The final public copy of the Freebase data can be downloaded at https://developers.google.com/freebase/. It’s a 22 GB gzip (250 GB uncompressed) of N-Triples, which is a text-based data format with a spec and everything. Each line is a <subject, predicate, object> triple and, according to that same page, there are 1.9 billion of them.
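A raw line spells out the full URLs that get abbreviated in the examples below. From memory (so treat the details, like the exact whitespace between fields and the trailing dot, as approximate), one looks something like this:

<http://rdf.freebase.com/ns/film.film.directed_by> <http://rdf.freebase.com/ns/type.object.name> "Directed by"@en .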

In the interest of fast iteration, I’d like to start with something that comfortably fits in memory. Before we can trim down the data, we have to look at how it’s structured.

Structure of Freebase Data #

This is all better explained by the since-removed API documentation (thank you, Internet Archive), but I’ll go over a bit of it.

Freebase data is a structured representation of things and the relationships between those things. In this case, “things” includes concrete stuff like people, films, and music, but also more nebulous concepts like love (which can be the topic of a book), plus really anything with a page in Wikipedia. The things are called topics, and each has a stable unique identifier. Most of these IDs look like http://rdf.freebase.com/ns/m.09r8m05 (what Freebase calls a MID). The interesting part is the last bit (m.09r8m05), which is a base-32 encoded integer. The m can also be a g for reasons. Some things in Freebase use a more human-readable ID that looks like http://rdf.freebase.com/ns/film.film. (I think the human-readable ones also have a corresponding MID, but it’s been a while and I’m not sure.)
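To make that encoding concrete, here’s a quick Rust sketch (Rust because that’s what I’ll reach for later anyway) of decoding a MID back to its integer. The vowel-free base-32 alphabet is the one commonly documented around the internet, and the digit order is my guess, so treat both as assumptions rather than gospel:

// Sketch: decode the integer behind a MID like "m.09r8m05". The
// vowel-free base-32 alphabet is an assumption (the one commonly
// documented for Freebase MIDs), not something read out of the dump.
const ALPHABET: &str = "0123456789bcdfghjklmnpqrstvwxyz_";

fn decode_mid(mid: &str) -> Option<u64> {
    // Works for both "m." and "g." IDs.
    let digits = mid.strip_prefix("m.").or_else(|| mid.strip_prefix("g."))?;
    digits.chars().try_fold(0u64, |acc, c| {
        let v = ALPHABET.find(c)? as u64;
        acc.checked_mul(32)?.checked_add(v)
    })
}

fn main() {
    // Prints the (assumed) integer behind our example topic.
    println!("{:?}", decode_mid("m.09r8m05"));
}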

Each line in the data represents a 3-tuple of subject, predicate, and object. I personally understand this best with some examples (IDs shortened for clarity):

<.../m.09r8m05> <.../type.object.name> "Erica Albright"@en

Here m.09r8m05 is the ID of the character Erica Albright in the 2010 film The Social Network. This tuple expresses the name of the character in English. This m.09r8m05 topic also has a type (in fact it has multiple), which tells you what sort of thing it is:

<.../m.09r8m05> <.../type.object.type> <.../common.topic>
<.../m.09r8m05> <.../type.object.type> <.../film.film_character>

Here film.film_character is mostly self-explanatory and common.topic is the most general topic type. It roughly corresponds to anything that could have an entry in Wikipedia (ignoring Wikipedia’s notability requirement).

When something (like m.09r8m05) has a type (like film.film_character), it gains the ability to have the sorts of relationships granted by that type. Said another way, getting typed as film.film_character opens up some new predicates for use with m.09r8m05. The interesting thing about the predicates is that they also get IDs, and information about them is also stored in Freebase, meaning that the schema of the data is stored in the data itself.

I looked for film.film_character but grep-ing through 250GB takes… a while, so here’s film.film.directed_by:

<.../film.film.directed_by> <.../type.property.reverse_property> <.../film.director.film>
<.../film.film.directed_by> <.../type.property.unique> "false"
<.../film.film.directed_by> <.../type.object.type> <.../type.property>
<.../film.film.directed_by> <.../type.property.expected_type> <.../film.director>
<.../film.film.directed_by> <.../type.object.name> "Directed by"@en
<.../film.film.directed_by> <.../type.property.schema> <.../film.film>

A type.object.type of type.property means that film.film.directed_by can be used as a predicate. An expected_type of film.director means that the object of triples with that predicate will have a type.object.type -> film.director. And type.property.schema means that the subject of such a triple will be a film.film (I think).

Also note here type.property.unique -> false, meaning a film can have multiple directors. This is an instance of what I was talking about in the last post: the constraints (foreign keys/checks/etc.) are also part of the data.

The remaining triple here is type.property.reverse_property, which establishes a relationship between “film F was directed by director D” and “director D directed film F”. At first glance this seems completely redundant to me, but who knows.

It’s clear from just grep-ing around this data that my intuition is correct and 250GB is too much to play around with, so it’s time to cut it down.

Fewer than 1.9 Billion Triples #

In something like film.film, the first “film” is what Freebase calls a domain, which is a grouping of related types. (In addition to domains like “film” and “people”, there is also a “user” domain in Freebase, which let anybody experiment with making their own schemas. There are some real gems in there, but I’ll leave those for you to explore.)

I decided to roughly group everything by domain. So, for example, all the film schemas and data will be in their own N-Triples file. That feels separable enough that I could do some iterative prototyping. The immediate hiccup is that a topic can have multiple types (person.person and film.director). This wouldn’t otherwise be an issue, except I’m certainly going to want to render names, and a name is a common.topic.alias. Having the names of everything end up in the file for the “common” domain isn’t really going to cut a lot of data.

It may be true that “David Fincher” is a person.person, but it’s probably more interesting that he’s a film.director, so I think it’d be okay for all of the “David Fincher” data to end up with the film data. Luckily there’s a property called kg.object_profile.prominent_type that’s exactly what I want. This is an attempt to assign one most notable type to a topic. It’s not present for all topics, but it’s good enough.
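For our running example, I’d expect the triple to look something like this (illustrative, I haven’t actually fished it out of the dump):

<.../m.09r8m05> <.../kg.object_profile.prominent_type> <.../film.film_character>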

This means I can use prominent_type to create an ID -> domain map that routes each triple to its domain file. Sadly, the data in the dump isn’t ordered such that I can build this map on the fly without buffering, so I do a preprocessing step of grep-ing every kg.object_profile.prominent_type triple into one file.

$ time zfgrep "kg.object_profile.prominent_type" freebase-rdf-latest.gz | gzip -c > ids.nt.gz

real    33m36.427s
user    33m4.282s
sys 0m18.399s

Good thing we only have to run that once.

Then I write an over-engineered and under-documented Rust program to partition the full data into one-per-domain files. It’s over-engineered because it sounded fun, and that sort of thing is exactly what Skunkworks Fridays are for. I also suspect things like an N-Triples parser (tested against the spec’s golden file!) will be “useful” “later” for “something”. (This is not foreshadowing, I’m genuinely just here madly hand-waving over everything.)

To further cut down the data, I filter out stuff like non-English names. I also filter out the fun stuff in the user domain for now. :(
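The real program is fancier (and gzips its output), but the core routing logic is roughly the following sketch. It leans on the flate2 crate for gzip decoding, swaps a hypothetical naive splitter in for the real N-Triples parser, and skips the filtering:

use std::collections::{hash_map::Entry, HashMap};
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

use flate2::read::MultiGzDecoder;

// Hypothetical, naive splitter; real N-Triples parsing is hairier than this.
fn split_triple(line: &str) -> Option<(&str, &str, &str)> {
    let mut parts = line.split_whitespace();
    Some((parts.next()?, parts.next()?, parts.next()?))
}

// "<http://rdf.freebase.com/ns/film.film_character>" -> "film"
fn domain_of(object: &str) -> Option<&str> {
    object
        .strip_prefix("<http://rdf.freebase.com/ns/")?
        .split('.')
        .next()
}

fn main() -> std::io::Result<()> {
    // Pass 1: build the ID -> domain map from the grepped prominent_type triples.
    let ids = BufReader::new(MultiGzDecoder::new(File::open("ids.nt.gz")?));
    let mut domains = HashMap::new();
    for line in ids.lines() {
        let line = line?;
        if let Some((subject, _, object)) = split_triple(&line) {
            if let Some(domain) = domain_of(object) {
                domains.insert(subject.to_string(), domain.to_string());
            }
        }
    }

    // Pass 2: stream the full dump, appending each triple to the file for its
    // subject's domain. (The real program also has to route schema subjects
    // like film.film.directed_by, which have no prominent_type; this sketch
    // just drops anything without a mapped subject.)
    let dump = BufReader::new(MultiGzDecoder::new(File::open("freebase-rdf-latest.gz")?));
    let mut files: HashMap<String, BufWriter<File>> = HashMap::new();
    for line in dump.lines() {
        let line = line?;
        let Some((subject, _, _)) = split_triple(&line) else { continue };
        let Some(domain) = domains.get(subject) else { continue };
        let out = match files.entry(domain.clone()) {
            Entry::Occupied(e) => e.into_mut(),
            Entry::Vacant(e) => {
                e.insert(BufWriter::new(File::create(format!("freebase/{domain}.nt"))?))
            }
        };
        writeln!(out, "{line}")?;
    }
    Ok(())
}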

$ time cargo run -p cmd --release -- ids.nt.gz freebase-rdf-latest.gz ./freebase/
...
real    114m1.647s
user    113m13.981s
sys 0m30.703s

Glad I only have to run that once, too.

Here are the resulting Totally Reasonable™ file sizes:

-rw-r--r--    1 dan  staff   2.7G Apr 24 12:10 music.nt.gz
-rw-r--r--    1 dan  staff   804M Apr 24 12:10 book.nt.gz
-rw-r--r--    1 dan  staff   384M Apr 24 12:10 film.nt.gz
-rw-r--r--    1 dan  staff   373M Apr 24 12:10 tv.nt.gz
-rw-r--r--    1 dan  staff   363M Apr 24 12:10 location.nt.gz
-rw-r--r--    1 dan  staff   275M Apr 24 12:10 business.nt.gz
-rw-r--r--    1 dan  staff   233M Apr 24 12:10 people.nt.gz
-rw-r--r--    1 dan  staff   112M Apr 24 12:10 biology.nt.gz
-rw-r--r--    1 dan  staff    66M Apr 24 12:10 education.nt.gz
-rw-r--r--    1 dan  staff    64M Apr 24 12:10 government.nt.gz
...

Yawns Pointedly #

I know, I know. Next week, I (finally) fire up Materialize and do something with it.
