<h1>Compile Times and Code Graphs</h1>
<p><em>Cross-posted on the <a href="https://materialize.com/blog/engineering/compile-times-and-code-graphs/">Materialize Blog</a>.</em></p>
<p>At <a href="https://materialize.com/">Materialize</a>, Rust compile times are a frequent complaint. On one hand, I’m forever anchored by the Scala compile times from my days at Foursquare; a clean build without cache hits took over an hour. On the other, Go at Cockroach Labs was great. Rust is in between, but much closer to Go than to Scala.</p>
<p>So far, I’ve mostly insulated myself from this by carving out an isolated corner where unit tests catch almost all the bugs, so iteration is fast. But recently, I’ve been pitching in on some cross-cutting projects, felt the pain that everyone else is feeling, and so was motivated to improve our compile times a bit. Here’s how I did it.</p>
<p>First, a note that there are lots of other ways to improve compile times<sup id="fnref1"><a href="#fn1">1</a></sup>, but today we’re going to talk about dependency graphs in code.</p>
<p>In general, the following will be talking about the smallest compilation unit that <em>doesn’t</em> allow cyclic dependencies. In Rust, <em>modules</em> allow cycles but <em>crates</em> don’t, so today we’re talking about crates. For simplicity, I’ll just use “crate” below, but go ahead and mentally substitute whatever the equivalent is in your language of choice.</p>
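<p>A quick, self-contained Rust illustration of the asymmetry (module and function names invented here): the two sibling modules below reference each other and compile fine, but expressing the same cycle between two crates is an error.</p>
<pre><code class="prettyprint lang-rust">// Within one crate, sibling modules may reference each other freely.
mod storage {
    pub fn name() -> &'static str {
        "storage"
    }
    pub fn peer() -> &'static str {
        crate::compute::name()
    }
}

mod compute {
    pub fn name() -> &'static str {
        "compute"
    }
    pub fn peer() -> &'static str {
        crate::storage::name()
    }
}

fn main() {
    // This cycle is fine for modules; cargo would reject the same
    // dependency cycle declared between two crates.
    println!("{} <-> {}", storage::peer(), compute::peer());
}
</code></pre>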
<h1 id="ideal-code-dependency-structure_1">Ideal Code Dependency Structure <a class="head_anchor" href="#ideal-code-dependency-structure_1">#</a>
</h1>
<p>This is going to sound obvious when written up, but bear with me.</p>
<p>Large software projects that involve lots of business logic will typically be broken up internally into crates (or crate equivalent). Day-to-day work will then involve typing up and iterating on some change until a good structure is worked out, the bugs are fixed, new tests are passing, old tests are passing, etc. In practice, the majority of these iterations of the edit-compile-run loop will only touch one crate (or a few). For this to be fast, you want as few crates as possible to depend on the one you’re changing, and for the dependents that do exist to be as small as possible.</p>
<p>Secondarily, when you pull in new code to your branch, or switch branches, you want your crate’s dependencies to be as small as possible. However, note that a dependency that doesn’t change often isn’t as bad because your compiler will get cache hits for it.</p>
<p>At some point, you’ll be happy with your change and will move on to integration testing, which requires compiling all binaries that transitively depend on it. This means you want your crate to only be in the binaries where it “belongs” (it’s surprisingly easy to end up with “incidental” dependencies if it’s not something you’re looking out for).</p>
<p>The logical conclusion of the above is a shape where a small number of infrequently changing foundational crates are at the “bottom” of the graph, then a lot of fanning out to business logic crates, which fan in to some number of binaries (production binaries, test binaries, etc) at the “top” of the graph. This shape is also particularly friendly for hermetic build systems (a la <a href="https://bazel.build/">bazel</a>, <a href="https://buck2.build/">buck2</a>, <a href="https://www.pantsbuild.org/">pants</a>) that can reuse compilation artifacts generated by other machines (e.g. CI).</p>
<p><a href="https://svbtleusercontent.com/rUCssSBVfrjrQubXpRPECT0xspap.jpeg"><img src="https://svbtleusercontent.com/rUCssSBVfrjrQubXpRPECT0xspap_small.jpeg" alt="IMG_5021.JPEG"></a></p>
<h1 id="a-pattern-emerges_1">A Pattern Emerges <a class="head_anchor" href="#a-pattern-emerges_1">#</a>
</h1>
<p>The above image describes an ideal, but what does that look like concretely? Both Foursquare and Materialize have ended up with a similar manifestation.</p>
<p>For each unit of business logic <code class="prettyprint">foo</code>, separate crates for (a sketch follows the list):</p>
<ul>
<li>
<em>Types</em>: for Plain Old Data, <a href="https://protobuf.dev/">protobuf</a>, traits that users of <code class="prettyprint">foo</code> implement, etc.</li>
<li>
<em>Interface</em>: for the public API without an implementation. 4sq called this <code class="prettyprint">FooService</code>, mz calls it <code class="prettyprint">foo-client</code>.</li>
<li>
<em>Implementation</em>: for the implementation of the public API. 4sq called this <code class="prettyprint">FooConcrete</code>, mz calls it <code class="prettyprint">foo</code>.</li>
<li>Note that not every <code class="prettyprint">foo</code> will have all three of these, and some will be more complicated, but I’ve found these three to be a reasonable default.</li>
</ul>
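<p>To make that concrete, here’s a hypothetical Rust sketch of the split, collapsed into a single listing with comments marking the crate boundaries (the <code class="prettyprint">Foo*</code> names are invented for illustration, not real Materialize crates):</p>
<pre><code class="prettyprint lang-rust">// --- crate foo-types: Plain Old Data and traits implemented by users ---
pub struct FetchRequest {
    pub key: String,
}
pub struct FetchResponse {
    pub value: Option<String>,
}

// --- crate foo-client: the public API, with no implementation ---
// Depends only on foo-types, so it rarely gets invalidated.
pub trait FooClient {
    fn fetch(&self, req: FetchRequest) -> FetchResponse;
}

// --- crate foo: the implementation of the public API ---
// Only the binaries that actually run foo need to (re)compile this.
pub struct FooImpl;

impl FooClient for FooImpl {
    fn fetch(&self, req: FetchRequest) -> FetchResponse {
        // Real logic lives here; stubbed for the sketch.
        let _ = req;
        FetchResponse { value: None }
    }
}
</code></pre>
<p>The payoff: iterating on the internals of <code class="prettyprint">foo</code> invalidates only that crate and the binaries that link it, while the many crates that merely consume the types or the trait keep their cached artifacts.</p>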
<p>Foursquare leaned heavily into microservices and, as a result, broke things up into lots of fine-grained business logic units. The cost of manually maintaining the transitive interface/implementation graph for each of these microservice binaries was high enough that they eventually ended up writing bespoke tooling to do it. It all felt a little silly, but the compile time benefits were absolutely worth it.</p>
<p>On the other end of the spectrum, as <a href="https://materialize.com/blog/next-generation/">Arjun and Frank</a> as well as <a href="https://materialize.com/blog/materialize-architecture/">Brennan</a> have described, Materialize has three high-level architectural concepts: <em>adaptor</em> (control plane), <em>storage</em> (data in and out), and <em>compute</em> (efficient incremental computation, the heart of mz). There is additionally a small handful of internal utilities, one of which (the stash) you’ll see below.</p>
<h1 id="case-study-materialize-storage_1">Case Study: Materialize Storage <a class="head_anchor" href="#case-study-materialize-storage_1">#</a>
</h1>
<p>I recently started doing a bit of work within the implementation of our “storage” layer and found myself surprised with some of the crates that got invalidated while I was iterating. This resulted in a PR <a href="https://github.com/MaterializeInc/materialize/pull/21554">to tease out some <code class="prettyprint">*-types</code> crates that had previously been in the <code class="prettyprint">*-client</code> ones</a>.</p>
<p>Interestingly, the time for building binaries (necessary to run integration tests) while iterating was essentially unchanged: 1m40s -> 1m39s. This is likely because our link times are high and tend to dominate. However, the time it took to check that I had no compile errors was cut in half: 45s -> 23s. This is largely because the heavyweight <code class="prettyprint">mz-sql</code> and <code class="prettyprint">mz-transform</code> crates no longer get invalidated (i.e. notice that they disappear from the graph below).</p>
<p>Deps above <code class="prettyprint">mz-storage-client</code> (before)<sup id="fnref2"><a href="#fn2">2</a></sup></p>
<p><a href="https://svbtleusercontent.com/oYbJ9kx1XEKTqnUBc3nSrn0xspap.png"><img src="https://svbtleusercontent.com/oYbJ9kx1XEKTqnUBc3nSrn0xspap_small.png" alt="storage-before.png"></a></p>
<p>Deps above <code class="prettyprint">mz-storage-client</code> (after)</p>
<p><a href="https://svbtleusercontent.com/hxbWQ33Hc6xCBUTpa5oi9x0xspap.png"><img src="https://svbtleusercontent.com/hxbWQ33Hc6xCBUTpa5oi9x0xspap_small.png" alt="storage-after.png"></a></p>
<h1 id="case-study-materialize-stash_1">Case Study: Materialize Stash <a class="head_anchor" href="#case-study-materialize-stash_1">#</a>
</h1>
<p>Shortly after, a co-worker mentioned in a weekly team sync that he was spending quite a bit of his time compiling while iterating on our internal <em>stash</em> utility. This was particularly interesting to me because each time he changed it, both our <code class="prettyprint">environmentd</code> and <code class="prettyprint">clusterd</code> binaries would be invalidated and recompiled. But conceptually, the stash is only used by the former and shouldn’t be in the dependency graph of the latter at all. The fix turned out (yet again) to be <a href="https://github.com/MaterializeInc/materialize/pull/22240">a new <code class="prettyprint">-types</code> crate</a>.</p>
<p>This result was more dramatic. The full-binary integration test iteration time went from 2m12s to 53s.</p>
<p>Deps above <code class="prettyprint">mz-stash</code> (before)</p>
<p><a href="https://svbtleusercontent.com/eeHvh7wvSfiXzGaWVqSv8v0xspap.png"><img src="https://svbtleusercontent.com/eeHvh7wvSfiXzGaWVqSv8v0xspap_small.png" alt="stash-before.png"></a></p>
<p>Deps above <code class="prettyprint">mz-stash</code> (after)</p>
<p><a href="https://svbtleusercontent.com/2Zq84uT9iwpuj9mnBCVW5G0xspap.png"><img src="https://svbtleusercontent.com/2Zq84uT9iwpuj9mnBCVW5G0xspap_small.png" alt="stash-after.png"></a></p>
<h1 id="difficulties_1">Difficulties <a class="head_anchor" href="#difficulties_1">#</a>
</h1>
<p>As always, things in software are never black and white, nor are they easy. Here is a non-exhaustive list of a few things I’ve seen come up when working on code dependencies:</p>
<ul>
<li>Dependency spaghetti! Foursquare started as a single compilation unit and everything depended on everything else. We had to gradually tease it apart over the course of years. Materialize has the dual benefits of starting with a CTO who understands the importance of internal dependency hygiene (ty Nikhil! <3) as well as a recent rework from local, single-binary deployment to cloud-only (the abstraction boundaries are still in good shape from this).</li>
<li>This sort of work often forces bits of code to be public when they’d rather not be public. The stash example above had a number of these tradeoffs involved. Just this morning I investigated another possible separation where the balance went the other way and I aborted.</li>
<li>Regressions. It’s easy to accidentally re-introduce a dependency that you’ve taken care to remove, even when you’re looking out for it. It’s even easier when co-workers are not yet sold on the benefits. I wrote a tool for Rust called <a href="https://crates.io/crates/cargo-deplint">cargo-deplint</a> that we run in CI to prevent backsliding.</li>
</ul>
<div class="footnotes">
<hr>
<ol>
<li id="fn1">
<p>For example, one of my co-workers has been using Rust’s excellent introspection tools on our codebase and had some results that point at monomorphization. This work is still ongoing. <a href="#fnref1">↩</a></p>
</li>
<li id="fn2">
<p>Generated with <a href="https://crates.io/crates/cargo-depgraph">cargo-depgraph</a> <a href="#fnref2">↩</a></p>
</li>
</ol>
</div>
<h1>Freebase Meets Materialize 3: First Impressions</h1>
<p>Previous posts talked about what I’m hoping to do and some background on the Freebase data. Today, we (finally) take Materialize out for a spin.</p>
<ul>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-1-introduction">Part 1: Introduction</a></li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-2-the-data">Part 2: The Data</a></li>
<li>Part 3: First Impressions (you’re here)</li>
</ul>
<p>First, a quick note: one of my motivations for doing this is to get a feel for Materialize as a user, so I’m going to take my developer hat off and put my user hat on. I’ve only been here a couple of weeks and the first things I’ve been working on have more to do with internals than UX, so I’m hoping this will mostly work.</p>
<p>Spoilers from the future: it turns out to have worked pretty well! What follows are my real, unabridged first interactions with Materialize’s docs and Materialize itself. I end up finding some papercuts as well as some great touch-points where we could have done more to help someone transitioning conceptually from traditional databases to streaming. This was exactly the sort of feedback I was hoping to gather.</p>
<h1 id="installation_1">Installation <a class="head_anchor" href="#installation_1">#</a>
</h1>
<p>There is a <a href="https://materialize.com/docs/cloud/">cloud version of Materialize</a> (currently in private beta), but I prefer to do my development locally, so I downloaded it. Following the <a href="https://materialize.com/docs/install/#homebrew">Install</a> instructions for Homebrew (which is my preference for this sort of thing):</p>
<pre><code class="prettyprint lang-text">$ brew install MaterializeInc/materialize/materialized
[...]
</code></pre>
<p>Hmm, it compiles instead of using a brew bottle? It’s probably because I’m on arm64. I’m going to assume that we (Materialize) have a bottle for x86 but not for arm64 yet. It also feels like brew could have done better here. I would have been okay with installing an x86 binary and running it with Rosetta 2, so I wish it would have asked me if I wanted that or to install a bunch of compile-time dependencies and do a slow compile.</p>
<p>After finishing the compile and installation, brew tells me I can start it with <code class="prettyprint">materialized --threads=1</code>.</p>
<pre><code class="prettyprint lang-text">$ materialized --threads=1
error: Found argument '--threads' which wasn't expected, or isn't valid in this context
USAGE:
materialized [OPTION]...
For more information try --help
</code></pre>
<p>The install page has a pointer to <a href="https://materialize.com/docs/get-started">Get Started</a> which informs me that the updated name for this flag is <code class="prettyprint">materialized -w 1</code>. So one issue here is the brew instructions are out of date. It also seems weird to have renamed this flag without supporting the old one. We’re pre-1.0, so I don’t think we need to commit to perfect backward compatibility, but at initial glance, this was a simple rename from <code class="prettyprint">--threads</code> to <code class="prettyprint">--workers</code>/<code class="prettyprint">-w</code>. It’s pretty easy to alias the old name to the new one (and hide it from docs).</p>
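<p>For what it’s worth, that kind of alias is usually a one-liner. Here’s a hedged sketch of the idea using the <code class="prettyprint">clap</code> crate (an assumption on my part; I don’t actually know how materialized parses its flags):</p>
<pre><code class="prettyprint lang-rust">// Sketch only: a hidden alias keeps the old flag name parsing, assuming
// clap 4.x. The "materialized" name here is just for illustration.
use clap::{Arg, Command};

fn main() {
    let matches = Command::new("materialized")
        .arg(
            Arg::new("workers")
                .short('w')
                .long("workers")
                .alias("threads"), // hidden alias: --threads still parses
        )
        .get_matches();
    println!("workers = {:?}", matches.get_one::<String>("workers"));
}
</code></pre>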
<p>Next up on the Get Started page is to connect to my running Materialize instance:</p>
<pre><code class="prettyprint lang-text">$ psql -U materialize -h localhost -p 6875 materialize
bash: psql: command not found
</code></pre>
<p>I happen to know from working at Cockroach Labs that “psql” comes from PostgreSQL. The Get Started page also notes this in a “Prerequisites” section that I only notice now. This is incredibly nit-picky on my part, but having Prerequisites here feels heavyweight. It makes a ton of sense for other pages in the docs, where it might list having Materialize itself set up as a prerequisite (CockroachDB used to have things like “set up a multi-node cluster” in Prerequisites for some of the docs examples), but for this page I would have had “Make sure you have psql installed” as a step instead. To each their own.</p>
<pre><code class="prettyprint lang-text">$ brew install postgresql
[...]
$ psql -U materialize -h localhost -p 6875 materialize
psql (13.2, server 9.5.0)
Type "help" for help.
materialize=>
</code></pre>
<p>Now we’re in business.</p>
<h1 id="loading-a-file_1">Loading a File <a class="head_anchor" href="#loading-a-file_1">#</a>
</h1>
<p>Just a bit further down the Get Started page, there’s an example of creating a file <code class="prettyprint">SOURCE</code>, which is exactly what I’m looking for. My file isn’t changing, so I’m not going to tail it (yet). I copy the example, remove the tail bit and swap the regex for the <a href="https://github.com/danhhz/scribbles/blob/e7b712f304ce59747e91127e2babbcd63e841e9c/cmd/src/ntriple.rs#L16-L42">monstrosity in my over-engineered tuple partitioning program</a> (I knew this would come in handy). A quick glance at the <a href="https://materialize.com/docs/sql/create-source/text-file/">CREATE SOURCE</a> page for local files shows that Materialize supports <code class="prettyprint">COMPRESSION GZIP</code>, so we’re ready to go!</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE SOURCE film
FROM FILE 'film.nt.gz' COMPRESSION GZIP
FORMAT REGEX '^[ \t]*(?:(?P<comment>#[ -~]*)|(?:<(?:http://rdf.freebase.com/(?P<sub_uri_fb>[ -~]+)|http://www.w3.org/(?P<sub_uri_w3>[ -~]+)|(?P<sub_uri>.+))>|_:(?P<sub_node>[A-Za-z][A-Za-z0-9]*))[ \t]+<(?:http://rdf.freebase.com/(?P<prd_uri_fb>[ -~]+)|http://www.w3.org/(?P<prd_uri_w3>[ -~]+)|(?P<prd_uri>.+))>[ \t]+(?:<(?:http://rdf.freebase.com/(?P<obj_uri_fb>[ -~]+)|http://www.w3.org/(?P<obj_uri_w3>[ -~]+)|(?P<obj_uri>.+))>|_:(?P<obj_node>[A-Za-z][A-Za-z0-9]*)|"(?P<obj_lang_lit>.*)"@(?P<obj_lang_type>[a-z]+(-[a-zA-Z0-9]+)*)|"(?P<obj_data_lit>.*)"\^\^<(?:http://rdf.freebase.com/(?P<obj_data_type_fb>[ -~]+)|http://www.w3.org/(?P<obj_data_type_w3>[ -~]+)|(?P<obj_data_type>.+))>|"(?P<obj_str_lit>.*)")[ \t]*\.[ \t]*|(?P<blank>))$';
CREATE SOURCE
</code></pre>
<pre><code class="prettyprint lang-text">materialize=> SHOW COLUMNS FROM film;
name | nullable | type
------------------+----------+--------
blank | t | text
column15 | t | text
comment | t | text
mz_line_no | f | bigint
obj_data_lit | t | text
obj_data_type | t | text
obj_data_type_fb | t | text
obj_data_type_w3 | t | text
obj_lang_lit | t | text
obj_lang_type | t | text
obj_node | t | text
obj_str_lit | t | text
obj_uri | t | text
obj_uri_fb | t | text
obj_uri_w3 | t | text
prd_uri | t | text
prd_uri_fb | t | text
prd_uri_w3 | t | text
sub_node | t | text
sub_uri | t | text
sub_uri_fb | t | text
sub_uri_w3 | t | text
(22 rows)
</code></pre>
<p>Sweet! Let’s make sure the regex is working correctly. (Also what’s column15?)</p>
<pre><code class="prettyprint lang-text">materialize=> SELECT * FROM film LIMIT 10;
ERROR: Unable to automatically determine a timestamp for your query; this can happen if your query depends on non-materialized sources.
For more details, see https://materialize.com/s/non-materialized-error
</code></pre>
<p>After reading the link, I get why this is the case, but it’s kind of a bummer. I just want to look at enough of the data to verify that the regex is working. The Get Started page is making materialized views and selecting from them, I’ll do that instead:</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE MATERIALIZED VIEW foo AS SELECT * FROM film LIMIT 10;
CREATE VIEW
materialize=> SELECT * FROM foo;
comment | sub_uri_fb | sub_uri_w3 | sub_uri | sub_node | prd_uri_fb | prd_uri_w3 | prd_uri | obj_uri_fb | obj_uri_w3 | obj_uri | obj_node | obj_lang_lit | obj_lang_type | column15 | obj_data_lit | obj_data_type_fb | obj_data_type_w3 | obj_data_type | obj_str_lit | blank | mz_line_no
---------+------------+------------+---------+----------+------------+------------+---------+------------+------------+---------+----------+--------------+---------------+----------+--------------+------------------+------------------+---------------+-------------+-------+------------
(0 rows)
</code></pre>
<p>Huh, okay let’s try putting the <code class="prettyprint">LIMIT 10</code> on the <code class="prettyprint">SELECT</code> instead of the materialized view definition.</p>
<pre><code class="prettyprint lang-text">materialize=> DROP VIEW foo;
DROP VIEW
materialize=> CREATE MATERIALIZED VIEW foo AS SELECT * FROM film;
CREATE VIEW
materialize=> SELECT * FROM foo LIMIT 10;
comment | sub_uri_fb | sub_uri_w3 | sub_uri | sub_node | prd_uri_fb | prd_uri_w3 | prd_uri | obj_uri_fb | obj_uri_w3 | obj_uri | obj_node | obj_lang_lit | obj_lang_type | column15 | obj_data_lit | obj_data_type_fb | obj_data_type_w3 | obj_data_type | obj_str_lit | blank | mz_line_no
---------+--------------+------------+---------+----------+------------+------------+---------+------------+------------+---------+----------+--------------+---------------+----------+--------------+------------------+------------------+---------------+-------------+-------+------------
| ns/m.01y67v | | | | key/en | | | | | | | | | | | | | | kbs | | 26591
| ns/m.0h34n | | | | key/en | | | | | | | | | | | | | | gimli | | 120215
| ns/m.0fns_b | | | | key/en | | | | | | | | | | | | | | nezu | | 99581
| ns/m.0xcy | | | | key/en | | | | | | | | | | | | | | atlantis | | 184125
| ns/m.03k9l5 | | | | key/en | | | | | | | | | | | | | | figwit | | 39070
| ns/m.04zl2r | | | | key/en | | | | | | | | | | | | | | ovidie | | 49112
| ns/m.06r6zc | | | | key/en | | | | | | | | | | | | | | kyknet | | 56779
| ns/m.0btr9d | | | | key/en | | | | | | | | | | | | | | libaas | | 81760
| ns/m.0kprc8 | | | | key/en | | | | | | | | | | | | | | pg_usa | | 141054
| ns/m.09rypdl | | | | key/en | | | | | | | | | | | | | | frode | | 67198
(10 rows)
</code></pre>
<p>Cool! It’s either processed that file really fast or is doing something lazily.</p>
<pre><code class="prettyprint lang-text">materialize=> SELECT COUNT(*) FROM foo;
count
---------
9346717
(1 row)
</code></pre>
<pre><code class="prettyprint lang-text">$ ls -alh film.nt.gz
-rw-r--r-- 1 dan staff 384M Apr 24 12:10 film.nt.gz
</code></pre>
<p>Yeah the count seems low. Nine million tuples aren’t gonna take up 384MB compressed.</p>
<pre><code class="prettyprint lang-text">materialize=> SELECT COUNT(*) FROM foo;
count
----------
19143660
(1 row)
</code></pre>
<p>It’s bigger now! This confirms that the select is returning while the file is still loading. That makes sense given the streaming focus. It’s nice to get results immediately, but my initial impression is that it’s odd for a non-tail file source, which I expect to be loaded atomically. I’m going to chalk this up to my background in OLTP databases and my ongoing adjustment to this new streaming world. (Note from future self: this will become a recurring theme.)</p>
<h1 id="is-my-file-loaded_1">Is my File Loaded? <a class="head_anchor" href="#is-my-file-loaded_1">#</a>
</h1>
<p>I want to know when my file is finished loading, so I poke around the docs sidebar and <a href="https://materialize.com/docs/ops/monitoring/">Monitoring</a> seems promising. It talks about prometheus inside docker, but I haven’t polluted this pristine new work computer with docker yet. A bunch of our testing infra uses it, so I will eventually, but let’s see what other options we have (there was mention of SQL at the top of the page).</p>
<p><em>Side Note: I happened to notice the <a href="https://materialize.com/docs/ops/monitoring/#on-macos-with-materialize-running-outside-of-docker">On macOS, with Materialize running outside of Docker</a> section, which is amazing. I know from experience that if I’d gone the docker route, this would have saved me a lot of time.</em></p>
<p><a href="https://materialize.com/docs/ops/monitoring/#system-catalog-sql-interface">System catalog SQL interface</a> sounds promising! There are links to <a href="https://materialize.com/docs/sql/system-catalog/">SQL documentation</a> and “walkthrough of useful <a href="https://materialize.com/docs/ops/diagnosing-using-sql/">diagnostic queries</a>”. I open them both.</p>
<p>Looking at the system catalog SQL docs, I see <a href="https://materialize.com/docs/sql/system-catalog/#mz_sources">mz_sources</a>, which doesn’t look like it will include loading progress, but I want to run it anyway.</p>
<pre><code class="prettyprint lang-text">materialize=> SELECT * FROM mz_sources;
id | oid | schema_id | name | volatility
-------+-------+-----------+-------------------------------------+------------
u1 | 20234 | 3 | film | unknown
s3022 | 20153 | 1 | mz_peek_active | volatile
s3026 | 20157 | 1 | mz_source_info | volatile
s3024 | 20155 | 1 | mz_peek_durations | volatile
...
(18 rows)
</code></pre>
<p>Cool! I assume <code class="prettyprint">u</code> is user and <code class="prettyprint">s</code> is system.</p>
<p>The second page of useful diagnostic queries has a section titled <a href="https://materialize.com/docs/ops/diagnosing-using-sql/#are-my-sources-loading-data-in-a-reasonable-fashion">Are my sources loading data in a reasonable fashion?</a> Exactly what I’m here for, bravo! <3</p>
<p>Oh the answer is to run <code class="prettyprint">SELECT count(*)</code>. I literally LOL’d.</p>
<p>It also mentions <code class="prettyprint">mz_materialization_frontiers</code>:</p>
<pre><code class="prettyprint lang-text">materialize=> select * from mz_materialization_frontiers;
global_id | time
-----------+---------------
s3001 | 1619797365000
s3003 | 1619797365000
s3005 | 1619797365000
...
(38 rows)
</code></pre>
<p>This is clearly the same id space as my <code class="prettyprint">mz_sources</code> query above but none of them match up. Oh right! It’s probably materialized views (which I can select from), not sources (which I can’t). Before I bother to reopen that first page, I bet <code class="prettyprint">mz_views</code> is a thing.</p>
<pre><code class="prettyprint lang-text">materialize=> SELECT * FROM mz_views;
id | oid | schema_id | name | volatility
-------+-------+-----------+-----------------------------------+------------
u4 | 20237 | 3 | foo | unknown
s5022 | 20229 | 2 | pg_proc | unknown
s5024 | 20231 | 2 | pg_enum | unknown
s5021 | 20228 | 2 | pg_type | volatile
...
(26 rows)
</code></pre>
<p>Yup but nope. Still none of them match up.</p>
<p>At this point, I’m going to give up for now and decide that <code class="prettyprint">wc -l</code> and waiting for that number in <code class="prettyprint">SELECT count(*)</code> is how I’d do it. I don’t see how this would work for more complex materialized views because I wouldn’t have a good way to reason about how many rows would be in them once they finished loading. I guess I could keep re-running the <code class="prettyprint">SELECT count(*)</code> until it stops changing? Dunno, maybe this is all just me still adjusting to streaming paradigms.</p>
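<p>If I wanted to automate that last idea, a loop like the following would do it. This is a sketch using the <code class="prettyprint">postgres</code> crate and the <code class="prettyprint">foo</code> view from above, and it’s emphatically a hack, not a real completeness signal:</p>
<pre><code class="prettyprint lang-rust">// Poll count(*) until it stops changing between intervals. A hack:
// a slow source could easily sit still for one interval and fool us.
use std::{thread, time::Duration};

use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    let mut client = Client::connect(
        "host=localhost port=6875 user=materialize dbname=materialize",
        NoTls,
    )?;
    let mut last: i64 = -1;
    loop {
        let count: i64 = client.query_one("SELECT count(*) FROM foo", &[])?.get(0);
        if count == last {
            break; // unchanged for one interval; probably loaded
        }
        last = count;
        thread::sleep(Duration::from_secs(5));
    }
    println!("loaded: {} rows", last);
    Ok(())
}
</code></pre>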
<p>Let’s drop this test view. It showed what it needed to show.</p>
<pre><code class="prettyprint lang-text">materialize=> DROP VIEW foo;
DROP VIEW
</code></pre>
<h1 id="debug-endpoint_1">Debug Endpoint <a class="head_anchor" href="#debug-endpoint_1">#</a>
</h1>
<p>So, what’s my first API endpoint going to be? Honestly, at this point, I just want to explore the data. Let’s start with a page that, given an id, shows the name of the thing, all the triples where it is the subject, and linkifies everything. That will let me easily poke around.</p>
<p>I’ve got a <code class="prettyprint">film</code> source, but it’ll be useful to have links to stuff in <code class="prettyprint">common.nt.gz</code> work as well, so let’s make a second source.</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE SOURCE common
FROM FILE 'common.nt.gz' COMPRESSION GZIP
FORMAT REGEX '^[ \t]*(?:(?P<comment>#[ -~]*)|(?:<(?:http://rdf.freebase.com/(?P<sub_uri_fb>[ -~]+)|http://www.w3.org/(?P<sub_uri_w3>[ -~]+)|(?P<sub_uri>.+))>|_:(?P<sub_node>[A-Za-z][A-Za-z0-9]*))[ \t]+<(?:http://rdf.freebase.com/(?P<prd_uri_fb>[ -~]+)|http://www.w3.org/(?P<prd_uri_w3>[ -~]+)|(?P<prd_uri>.+))>[ \t]+(?:<(?:http://rdf.freebase.com/(?P<obj_uri_fb>[ -~]+)|http://www.w3.org/(?P<obj_uri_w3>[ -~]+)|(?P<obj_uri>.+))>|_:(?P<obj_node>[A-Za-z][A-Za-z0-9]*)|"(?P<obj_lang_lit>.*)"@(?P<obj_lang_type>[a-z]+(-[a-zA-Z0-9]+)*)|"(?P<obj_data_lit>.*)"\^\^<(?:http://rdf.freebase.com/(?P<obj_data_type_fb>[ -~]+)|http://www.w3.org/(?P<obj_data_type_w3>[ -~]+)|(?P<obj_data_type>.+))>|"(?P<obj_str_lit>.*)")[ \t]*\.[ \t]*|(?P<blank>))$';
CREATE SOURCE
</code></pre>
<p>And union them together.</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE VIEW freebase AS SELECT * FROM common UNION ALL SELECT * FROM film;
CREATE VIEW
</code></pre>
<p>I plan to join everything to its user-facing name, so let’s make a view for that to make it easier later.</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE VIEW id_names AS SELECT sub_uri_fb AS id, obj_lang_lit AS name, obj_lang_type AS lang FROM freebase WHERE prd_uri_fb = 'ns/type.object.name';
CREATE VIEW
</code></pre>
<p>I’m the type of person that likes to see things work as I go and we’re about at that point, so I inspect <code class="prettyprint">id_names</code> using the (soon to be very common) throwaway materialized view plus select trick.</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE MATERIALIZED VIEW foo AS SELECT * FROM id_names;
CREATE VIEW
materialize=> SELECT * FROM foo LIMIT 10;
id | name | lang
--------------+------+------
ns/m.0dq6p | VHS | en
ns/m.05r9bx | Ed | en
ns/m.09nrrz | 15 | en
ns/m.0gxwk_ | RJ | en
ns/m.0kvd3l | Da | en
ns/m.03d019n | G | en
ns/m.03hj3r6 | K | en
ns/m.03w7wq_ | A | en
ns/m.03w7wtc | D | en
ns/m.03w854r | S | en
(10 rows)
materialize=> DROP VIEW foo;
DROP VIEW
</code></pre>
<p>That’s working nicely. Next up is a view that uses <code class="prettyprint">id_names</code> to name everything in each tuple.</p>
<p>This is where I note that 5 years of building a SQL database doesn’t make you a SQL expert. In fact, I’m very much still a SQL novice: probably 95%+ of the SQL I’ve written in my life is in CockroachDB unit tests and most of it is for stuff like <code class="prettyprint">BACKUP</code>, <code class="prettyprint">RESTORE</code>, <code class="prettyprint">CHANGEFEED</code>, and <code class="prettyprint">IMPORT</code>. I write down the first thing that comes to mind, which is probably a terrible way to do this:</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE VIEW named_tuples AS
SELECT
f.sub_uri_fb AS sub_id,
sub_n.name AS sub_name_en,
f.prd_uri_fb AS prd_id,
prd_n.name AS prd_name_en,
f.obj_uri_fb AS obj_id,
obj_n.name AS obj_name_en
FROM
(
SELECT
*
FROM
freebase
WHERE
sub_uri_fb IS NOT NULL
AND prd_uri_fb IS NOT NULL
AND obj_uri_fb IS NOT NULL
)
AS f
JOIN (SELECT * FROM id_names WHERE lang = 'en')
AS sub_n ON f.sub_uri_fb = sub_n.id
JOIN (SELECT * FROM id_names WHERE lang = 'en')
AS prd_n ON f.prd_uri_fb = prd_n.id
JOIN (SELECT * FROM id_names WHERE lang = 'en')
AS obj_n ON f.obj_uri_fb = obj_n.id;
CREATE VIEW
</code></pre>
<p><em>Side note: Thank deity (and <a href="https://twitter.com/mjibson">Matt Jibson</a>) for <a href="https://sqlfum.pt/">https://sqlfum.pt/</a>.</em></p>
<p>You know what’s coming next.</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE MATERIALIZED VIEW foo AS SELECT * FROM named_tuples;
CREATE VIEW
materialize=> SELECT * FROM foo LIMIT 10;
sub_id | sub_name_en | prd_id | prd_name_en | obj_id | obj_name_en
---------------+-----------------------------------+--------------------------------------+-------------+----------------+------------------------------
ns/m.01119bmt | 2009 QDoc | ns/film.film_festival_event.festival | Festival | ns/m.0lm919d | QDoc
ns/m.0r9mpb7 | 2010 KidFilm Festival | ns/film.film_festival_event.festival | Festival | ns/m.011dxlp0 | KidFilm Festival
ns/m.0zb4wdg | 2013 Palić Film Festival | ns/film.film_festival_event.festival | Festival | ns/m.02wxclb | Palić Film Festival
ns/m.0111b2xs | 2012 Fête de l'Animation | ns/film.film_festival_event.festival | Festival | ns/g.12214qrxp | Fête de l'Animation
ns/m.0111b2xs | 2012 Fête de l'Animation | ns/film.film_festival_event.festival | Festival | ns/g.12214qrxp | Fête de l'Animation
ns/m.0111b2xs | 2012 Fête de l'Animation | ns/film.film_festival_event.festival | Festival | ns/g.12214qrxp | Fête de l'Animation
ns/m.010h2sg3 | 2011 11mm Filmfestival Berlin | ns/film.film_festival_event.festival | Festival | ns/m.0bdxcb4 | 11mm Filmfestival Berlin
ns/m.0rh99r7 | 2003 Panorama of European Cinema | ns/film.film_festival_event.festival | Festival | ns/m.0107tj0d | Panorama of European Cinema
ns/m.01069sst | 2013 Neum Animated Film Festival | ns/film.film_festival_event.festival | Festival | ns/m.01069v41 | Neum Animated Film Festival
ns/m.010h61_1 | 2012 Portland Maine Film Festival | ns/film.film_festival_event.festival | Festival | ns/m.0100zwb5 | Portland Maine Film Festival
(10 rows)
materialize=> DROP VIEW foo;
DROP VIEW
</code></pre>
<p>Beautiful.</p>
<p>Here, I’d like to take a brief pause to mention that I had been vaguely planning to, at some point, redo all this on top of file sources <em>with</em> the <code class="prettyprint">TAIL</code> option to show off Materialize’s incremental computation. But it turns out I don’t need to, because it’s doing it for me. I think this is just another data point in favor of “I’m still wrapping my head around streaming paradigms”.</p>
<h1 id="a-materialized-view-of-my-very-own_1">A Materialized View of My Very Own <a class="head_anchor" href="#a-materialized-view-of-my-very-own_1">#</a>
</h1>
<p>On to our final view, this one materialized because, as described in <a href="http://blog.danhhz.com/freebase-meets-materialize-1-introduction">Part 1</a>, it’s what our API server will be a thin wrapper around. I went back and forth on how to structure it. The natural SQL way would be rows like:</p>
<pre><code class="prettyprint lang-text">id, name, prd_id1, prd_name_en1, obj_id1, obj_name_en1
id, name, prd_id2, prd_name_en2, obj_id2, obj_name_en2
</code></pre>
<p>This would repeat “id” and “name” for each tuple, which is wasteful and unsatisfying. Given that the API endpoint is going to return json anyway, why explode it just to unexplode it later? After some mulling, I ended up with a key-value structure of id -> jsonb endpoint response.</p>
<p>Without further ado… (I’m so so sorry.)</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE MATERIALIZED VIEW api_debug AS
SELECT
sub_n.id AS sub_id,
jsonb_build_object(
'sub_id',
sub_n.id,
'sub_name_en',
sub_n.name_en,
'sub_tuples',
jsonb_build_array(sub_t.named_tuples)
) AS json
FROM
(
SELECT
sub_id AS id, sub_name_en AS name_en
FROM
named_tuples
)
AS sub_n
JOIN LATERAL (
SELECT
sub_id AS id,
jsonb_agg(named_tuple) AS named_tuples
FROM
(
SELECT
sub_id,
jsonb_build_object(
'sub_id',
sub_id,
'sub_name_en',
sub_name_en,
'prd_id',
prd_id,
'prd_name_en',
prd_name_en,
'obj_id',
obj_id,
'obj_name_en',
obj_name_en
)
AS named_tuple
FROM
named_tuples
)
GROUP BY
sub_id
)
AS sub_t ON sub_n.id = sub_t.id;
CREATE VIEW
materialize=> SELECT * FROM api_debug LIMIT 10;
sub_id | json
---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ns/m.01119bmt | {"sub_id":"ns/m.01119bmt","sub_name_en":"2009 QDoc","sub_tuples":[[{"obj_id":"ns/m.0lm919d","obj_name_en":"QDoc","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.01119bmt","sub_name_en":"2009 QDoc"}]]}
ns/m.0r9mpb7 | {"sub_id":"ns/m.0r9mpb7","sub_name_en":"2010 KidFilm Festival","sub_tuples":[[{"obj_id":"ns/m.011dxlp0","obj_name_en":"KidFilm Festival","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0r9mpb7","sub_name_en":"2010 KidFilm Festival"}]]}
ns/m.0zb4wdg | {"sub_id":"ns/m.0zb4wdg","sub_name_en":"2013 Palić Film Festival","sub_tuples":[[{"obj_id":"ns/m.02wxclb","obj_name_en":"Palić Film Festival","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0zb4wdg","sub_name_en":"2013 Palić Film Festival"}]]}
ns/m.010h2sg3 | {"sub_id":"ns/m.010h2sg3","sub_name_en":"2011 11mm Filmfestival Berlin","sub_tuples":[[{"obj_id":"ns/m.0bdxcb4","obj_name_en":"11mm Filmfestival Berlin","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.010h2sg3","sub_name_en":"2011 11mm Filmfestival Berlin"}]]}
ns/m.0rh99r7 | {"sub_id":"ns/m.0rh99r7","sub_name_en":"2003 Panorama of European Cinema","sub_tuples":[[{"obj_id":"ns/m.0107tj0d","obj_name_en":"Panorama of European Cinema","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0rh99r7","sub_name_en":"2003 Panorama of European Cinema"}]]}
ns/m.01069sst | {"sub_id":"ns/m.01069sst","sub_name_en":"2013 Neum Animated Film Festival","sub_tuples":[[{"obj_id":"ns/m.01069v41","obj_name_en":"Neum Animated Film Festival","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.01069sst","sub_name_en":"2013 Neum Animated Film Festival"}]]}
ns/m.010h61_1 | {"sub_id":"ns/m.010h61_1","sub_name_en":"2012 Portland Maine Film Festival","sub_tuples":[[{"obj_id":"ns/m.0100zwb5","obj_name_en":"Portland Maine Film Festival","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.010h61_1","sub_name_en":"2012 Portland Maine Film Festival"}]]}
ns/m.0111b2xs | {"sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation","sub_tuples":[[{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"}]]}
ns/m.0111b2xs | {"sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation","sub_tuples":[[{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"}]]}
ns/m.0111b2xs | {"sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation","sub_tuples":[[{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"}]]}
(10 rows)
</code></pre>
<p>Alright, I’m really changing my tune on this whole “show stuff right away” bit. Getting this SQL working took a number of tries. If I’d been using anything but Materialize to prototype this, I would have had to wait quite a while to get the results of each attempt (or manually make an even smaller subset of the freebase data). With Materialize, creating the view, selecting from it, dropping it, and trying again were all basically instantaneous.</p>
<p><em>Brainstorming: Thinking about how I’d use this in production, I wish there was some way for my select to block until all the sources were “caught up enough” so I don’t serve incomplete results. For a non-tailed file, what this means is straightforward: when the whole file is loaded. On the other hand, what it means for tailed files, kafka, etc is less clear. Maybe when it’s caught up to within some time delta of where the source is at? I can see why we haven’t solved this yet, there are some meaty product and UX questions here.</em></p>
<p>I’m going to be using this view for simple id lookups, so I want to make sure there’s an index on <code class="prettyprint">id</code>. I didn’t see a place in the <code class="prettyprint">CREATE MATERIALIZED VIEW</code> grammar to specify an index, but the docs page mentions an <a href="https://materialize.com/docs/sql/create-materialized-view/#indexes">index is automatically made</a> for me.</p>
<pre><code class="prettyprint lang-text">materialize=> SHOW INDEXES FROM api_debug;
on_name | key_name | seq_in_index | column_name | expression | nullable
-----------+-----------------------+--------------+-------------+------------+----------
api_debug | api_debug_primary_idx | 1 | sub_id | | t
api_debug | api_debug_primary_idx | 2 | json | | t
(2 rows)
</code></pre>
<p>Yep, that should work. Let’s look for Erica Albright from part 2.</p>
<pre><code class="prettyprint lang-text">materialize=> SELECT * FROM api_debug WHERE sub_id = 'ns/m.09r8m05';
sub_id | api_debug
--------+--------------------
(0 rows)
</code></pre>
<p>Bummer, but that makes sense. It’s still loading things. Let’s wait for Materialize to catch up and try again.</p>
<p>Before that happens…</p>
<h1 id="oom_1">OOM <a class="head_anchor" href="#oom_1">#</a>
</h1>
<pre><code class="prettyprint lang-text">Killed: 9
</code></pre>
<p>After a brief investigation (aka asking in slack), it looks like this means Materialize is running out of memory (and swap?). I restarted it while watching Activity Monitor and confirmed. At some point, while browsing the docs today, I saw something about debugging and optimizing memory usage, but if possible I’d like to come back to that later.</p>
<p>Let’s see if something dumb and easy works to unblock us. I restart Materialize, quickly drop the view and recreate it with <code class="prettyprint">LIMIT 10</code> added to the <code class="prettyprint">(SELECT ... FROM named_tuples) AS sub_n</code> above.</p>
<p>Sorta? It’s using swap but not crashing anymore. Now to hook it up to a webpage.</p>
<h1 id="success_1">Success! <a class="head_anchor" href="#success_1">#</a>
</h1>
<p>I’m out of practice here, so I’d like something minimal to serve HTTP requests and query Materialize for the data. With a bit of looking, I found <a href="https://github.com/http-rs/tide">tide</a>, which seems to be associated with the official rust folks in some way. After a bit of iteration, I managed to get something working end-to-end!</p>
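<p>To give a flavor of how thin the wrapper is, here’s a rough sketch of the shape of the server (assuming tide 0.16 and async-std with the <code class="prettyprint">attributes</code> feature; the actual Materialize lookup, a <code class="prettyprint">SELECT json FROM api_debug WHERE sub_id = ...</code> over pgwire, is stubbed out here):</p>
<pre><code class="prettyprint lang-rust">use tide::Request;

// Stub standing in for the pgwire query against the api_debug view.
async fn lookup_json(id: &str) -> tide::Result<String> {
    Ok(format!("{{\"sub_id\": \"ns/{}\"}}", id))
}

async fn topic(req: Request<()>) -> tide::Result<String> {
    let id = req.param("id")?; // e.g. "m.09r8m05"
    lookup_json(id).await
}

#[async_std::main]
async fn main() -> tide::Result<()> {
    let mut app = tide::new();
    app.at("/topic/:id").get(topic);
    app.listen("127.0.0.1:8080").await?;
    Ok(())
}
</code></pre>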
<p><a href="https://svbtleusercontent.com/3QnvvGQs1j6PDKaWG3YPnE0xspap.png"><img src="https://svbtleusercontent.com/3QnvvGQs1j6PDKaWG3YPnE0xspap_small.png" alt="Screen Shot 2021-04-30 at 3.27.06 PM.png"></a></p>
<p>I’m running out of time for the day, so I’ll have to go into more detail in a later post. The code needs a bit of cleanup before I push it anywhere, so that will have to wait, too.</p>
<p>Funny enough, while iterating on the web frontend, I noticed that the set of 10 things being selected by my <code class="prettyprint">LIMIT 10</code> hack is changing over time, which means I have to keep finding another id to test with. Makes sense once I think about it.</p>
<h1 id="next-up-debugging-memory-usage_1">Next Up: Debugging Memory Usage <a class="head_anchor" href="#next-up-debugging-memory-usage_1">#</a>
</h1>
<p>I’d like to polish up the web frontend and get it deployed somewhere, but it’s pretty clear that my next task is to figure out how to write these views so they don’t knock over the materialized process. I’d only planned through here when I started all this, but now I know what Part 4 is going to be.</p>
<p>Encouragingly, it feels like the basic idea is workable. When our Chief Scientist <a href="https://github.com/frankmcsherry/">Frank</a> read part 1, he pointed me at <a href="https://github.com/comnik/declarative-dataflow">Declarative Dataflow</a>, which efficiently processes queries over <code class="prettyprint">(subject, predicate, object)</code> 3-tuples and is built on top of the same <a href="https://github.com/TimelyDataflow/differential-dataflow">Differential Dataflow</a> incremental computation framework that powers Materialize. So there’s no reason we shouldn’t be able to do it, too.</p>
<p>There were a few bumps along the way that we (Materialize) can polish up pretty easily, and I’ll file issues for those. I think there’s also a larger takeaway here around helping users that are new to streaming wrap their heads around its unfamiliar paradigms. These sorts of discoveries are exactly why dogfooding is so important and why I wanted to do it right when I started and had fresh eyes.</p>
<p>Stay tuned!</p>
<ul>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-1-introduction">Part 1: Introduction</a></li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-2-the-data">Part 2: The Data</a></li>
<li>Part 3: First Impressions (you’re here)</li>
</ul>
<h1>Freebase Meets Materialize 2: The Data</h1>
<p><a href="http://blog.danhhz.com/freebase-meets-materialize-1-introduction">Last post</a>, I introduced the idea of using <a href="https://materialize.com/">Materialize</a> to implement fast reads of highly normalized <a href="https://en.wikipedia.org/wiki/Freebase_(database)">Freebase</a> data for an API endpoint. Today, we start by downloading the data and doing a bit of preprocessing.</p>
<ul>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-1-introduction">Part 1: Introduction</a></li>
<li>Part 2: The Data (you’re here)</li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-3-first-impressions">Part 3: First Impressions</a></li>
</ul>
<h1 id="19-billion-triples_1">1.9 Billion Triples <a class="head_anchor" href="#19-billion-triples_1">#</a>
</h1>
<p>The final public copy of the Freebase data can be downloaded at <a href="https://developers.google.com/freebase/">https://developers.google.com/freebase/</a>. It’s a 22 GB gzip (250 GB uncompressed) of <a href="https://www.w3.org/TR/rdf-testcases/#ntriples">N-Triples</a>, which is a text-based data format with a spec and everything. Each line is a <code class="prettyprint"><subject, predicate, object></code> triple and according to this page, there are 1.9 billion of them.</p>
<p>In the interest of fast iteration, I’d like to start with something that comfortably fits in memory. Before we can trim down the data, we have to look at how it’s structured.</p>
<h1 id="structure-of-freebase-data_1">Structure of Freebase Data <a class="head_anchor" href="#structure-of-freebase-data_1">#</a>
</h1>
<p>This is all better explained by the since-removed API documentation (thank you, Internet Archive), but I’ll go over a bit of it.</p>
<p>Freebase data is a structured representation of things and relationships between those things. In this case, “things” includes concrete stuff like people, films, and music but also more nebulous concepts like love (which can be the topic of a book) plus really anything with a page in Wikipedia. The things are called <em>topics</em> and each has a stable unique identifier. Most of these IDs look like <code class="prettyprint">http://rdf.freebase.com/ns/m.09r8m05</code> (what Freebase calls a <em>MID</em>). The interesting part of this is the last bit (<code class="prettyprint">m.09r8m05</code>) which is a base-32 encoded integer. The <code class="prettyprint">m</code> can also be a <code class="prettyprint">g</code> for reasons. Some things in Freebase use a more human readable ID that looks like <code class="prettyprint">http://rdf.freebase.com/ns/film.film</code>. (I think the human readable ones also have a corresponding MID, but it’s been a while and I’m not sure.)</p>
<p>Each line in the data represents a 3-tuple of subject, predicate, and object. I personally understand this best with some examples (IDs shortened for clarity):</p>
<pre><code class="prettyprint lang-text"><.../m.09r8m05> <.../type.object.name> "Erica Albright"@en
</code></pre>
<p>Here <code class="prettyprint">m.09r8m05</code> is the ID of the character Erica Albright in the 2010 film The Social Network. This tuple expresses the name of the character in English. This <code class="prettyprint">m.09r8m05</code> topic also has a <em>type</em> (in fact it has multiple), which tells you what sort of thing it is:</p>
<pre><code class="prettyprint lang-text"><.../m.09r8m05> <.../type.object.type> <.../common.topic>
<.../m.09r8m05> <.../type.object.type> <.../film.film_character>
</code></pre>
<p>Here <code class="prettyprint">film.film_character</code> is mostly self-explanatory and <code class="prettyprint">common.topic</code> is the most general topic type. It roughly corresponds to <em>anything</em> that could have an entry in Wikipedia (ignoring Wikipedia’s notability requirement).</p>
<p>When something (like <code class="prettyprint">m.09r8m05</code>) has a type (like <code class="prettyprint">film.film_character</code>) it receives the ability to have the sorts of relationships granted by the type. Said another way, getting typed as <code class="prettyprint">film.film_character</code> opens up some new predicates for use with <code class="prettyprint">m.09r8m05</code>. The interesting thing about the predicates is that they also get IDs and information about them is also stored in Freebase, meaning that the schema of the data is stored in the data itself.</p>
<p>I looked for <code class="prettyprint">film.film_character</code> but grep-ing through 250GB takes… a while, so here’s <code class="prettyprint">film.film.directed_by</code>:</p>
<pre><code class="prettyprint lang-text"><.../film.film.directed_by> <.../type.property.reverse_property> <.../film.director.film>
<.../film.film.directed_by> <.../type.property.unique> "false"
<.../film.film.directed_by> <..../type.object.type> <.../type.property>
<.../film.film.directed_by> <.../type.property.expected_type> <.../film.director>
<.../ns/film.film.directed_by> <.../type.object.name> "Directed by"@en
<.../ns/film.film.directed_by> <.../type.property.schema> <.../film.film>
</code></pre>
<p>A <code class="prettyprint">type.object.type</code> of <code class="prettyprint">type.property</code> means that <code class="prettyprint">film.film.directed_by</code> can be used as a predicate. An <code class="prettyprint">expected_type</code> of <code class="prettyprint">film.director</code> means that the object of triples with that predicate will be of <code class="prettyprint">type.object.type</code> -> <code class="prettyprint">film.director</code>. And then <code class="prettyprint">type.property.schema</code> is what indicates that the subject of such a triple will be a <code class="prettyprint">film.film</code> (I think).</p>
<p>Also note here <code class="prettyprint">type.property.unique</code> -> <code class="prettyprint">false</code>, meaning a film can have multiple directors. This is an instance of what I was talking about in the last post where the constraints (foreign keys/checks/etc) are also part of the data.</p>
<p>The remaining triple here is <code class="prettyprint">type.property.reverse_property</code>, which establishes a relationship between “film F was directed by director D” and “director D directed film F”. At initial glance this seems to me to be completely redundant information, but who knows.</p>
<p>It’s clear from just grep-ing around this data that my intuition is correct and 250GB is too much to play around with, so it’s time to cut it down.</p>
<h1 id="fewer-than-19-billion-triples_1">Fewer than 1.9 Billion Triples <a class="head_anchor" href="#fewer-than-19-billion-triples_1">#</a>
</h1>
<p>In something like <code class="prettyprint">film.film</code>, the first “film” is something Freebase calls a <em>domain</em>, which is a grouping of related types. (In addition to things like “film” and “people”, there is also a “user” domain in Freebase, which let anybody explore making their own schemas. There are some real gems in there, but I’ll leave that for you to explore.)</p>
<p>I decided to roughly group everything by domain. So, for example, all the film schemas and data will be in their own N-Triples file. That feels separable enough that I could do some iterative prototyping. The immediate hiccup is that a topic can have multiple types (<code class="prettyprint">person.person</code> and <code class="prettyprint">film.director</code>). This wouldn’t otherwise be an issue except I’m certainly going to want to render the name, which is a <code class="prettyprint">common.topic.alias</code>. Having the names of everything in the file for the “common” domain isn’t really going to cut a lot of data.</p>
<p>It may be true that “David Fincher” is a <code class="prettyprint">person.person</code>, but it’s probably more interesting that he’s a <code class="prettyprint">film.director</code>, so I think it’d be okay for all of the “David Fincher” data to end up with the film data. Luckily there’s a property called <code class="prettyprint">kg.object_profile.prominent_type</code> that’s exactly what I want. This is an attempt to assign one most notable type to a topic. It’s not present for all topics, but it’s good enough.</p>
<p>This means I can use prominent_type to create an ID -> domain map that will be used to route each triple to its domain file. Sadly the data in the dump isn’t ordered such that I can make this map on the fly without buffering, so I do a preprocessing step of grep-ing every <code class="prettyprint">kg.object_profile.prominent_type</code> into one file.</p>
<pre><code class="prettyprint lang-text">$ time zfgrep "kg.object_profile.prominent_type" freebase-rdf-latest.gz | gzip -c > ids.nt.gz
real 33m36.427s
user 33m4.282s
sys 0m18.399s
</code></pre>
<p>Good thing we only have to run that once.</p>
<p>Then I write an <a href="https://github.com/danhhz/scribbles/blob/e7b712f304ce59747e91127e2babbcd63e841e9c/cmd/src/partition_triples.rs">over-engineered and under-documented rust program</a> to partition the full data into one-per-domain files. It’s over-engineered because it sounded fun and that sort of thing is exactly what Skunkworks Fridays are for. I also suspect things like an <a href="https://github.com/danhhz/scribbles/blob/e7b712f304ce59747e91127e2babbcd63e841e9c/cmd/src/ntriple.rs#L16-L42">N-Triples parser</a> (<a href="https://github.com/danhhz/scribbles/blob/e7b712f304ce59747e91127e2babbcd63e841e9c/cmd/src/ntriple.rs#L325-L331">tested</a> against the <a href="https://github.com/danhhz/scribbles/blob/master/cmd/src/w3_golden.nt">spec’s golden file</a>!) will be “useful” “later” for “something”. (This is not foreshadowing, I’m genuinely just here madly hand-waving over everything.)</p>
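<p>Stripped of the over-engineering, the program boils down to two passes over gzipped files. Here’s a simplified sketch of that shape (assuming the <code class="prettyprint">flate2</code> crate, naive whitespace splitting instead of the real N-Triples parser, and plain <code class="prettyprint">.nt</code> output instead of gzip):</p>
<pre><code class="prettyprint lang-rust">use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

use flate2::read::GzDecoder;

fn main() -> std::io::Result<()> {
    // Pass 1: MID -> domain, from the pre-grepped prominent_type triples.
    // "<.../ns/m.0abc> <.../prominent_type> <.../ns/film.film> ." => "film"
    let mut domains: HashMap<String, String> = HashMap::new();
    let ids = BufReader::new(GzDecoder::new(File::open("ids.nt.gz")?));
    for line in ids.lines() {
        let line = line?;
        let mut parts = line.split_whitespace();
        if let (Some(sub), Some(_prd), Some(obj)) = (parts.next(), parts.next(), parts.next()) {
            if let Some(domain) = obj.rsplit('/').next().and_then(|t| t.split('.').next()) {
                domains.insert(sub.to_string(), domain.to_string());
            }
        }
    }

    // Pass 2: stream the full dump, routing each triple by its subject's
    // domain to a per-domain file (the ./freebase/ dir must already exist).
    let dump = BufReader::new(GzDecoder::new(File::open("freebase-rdf-latest.gz")?));
    let mut files: HashMap<String, BufWriter<File>> = HashMap::new();
    for line in dump.lines() {
        let line = line?;
        if let Some(sub) = line.split_whitespace().next() {
            if let Some(domain) = domains.get(sub) {
                let out = files.entry(domain.clone()).or_insert_with(|| {
                    BufWriter::new(File::create(format!("freebase/{}.nt", domain)).unwrap())
                });
                writeln!(out, "{}", line)?;
            }
        }
    }
    Ok(())
}
</code></pre>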
<p>To further cut down the data, I filter out stuff like non-English names. I also filter out the fun stuff in the user domain for now. :(</p>
<pre><code class="prettyprint lang-text">$ time cargo run -p cmd --release -- ids.nt.gz freebase-rdf-latest.gz ./freebase/
...
real 114m1.647s
user 113m13.981s
sys 0m30.703s
</code></pre>
<p>Glad I only have to run that once, too.</p>
<p>Here are the resulting Totally Reasonable ™ file sizes:</p>
<pre><code class="prettyprint lang-text">-rw-r--r-- 1 dan staff 2.7G Apr 24 12:10 music.nt.gz
-rw-r--r-- 1 dan staff 804M Apr 24 12:10 book.nt.gz
-rw-r--r-- 1 dan staff 384M Apr 24 12:10 film.nt.gz
-rw-r--r-- 1 dan staff 373M Apr 24 12:10 tv.nt.gz
-rw-r--r-- 1 dan staff 363M Apr 24 12:10 location.nt.gz
-rw-r--r-- 1 dan staff 275M Apr 24 12:10 business.nt.gz
-rw-r--r-- 1 dan staff 233M Apr 24 12:10 people.nt.gz
-rw-r--r-- 1 dan staff 112M Apr 24 12:10 biology.nt.gz
-rw-r--r-- 1 dan staff 66M Apr 24 12:10 education.nt.gz
-rw-r--r-- 1 dan staff 64M Apr 24 12:10 government.nt.gz
...
</code></pre>
<h1 id="yawns-pointedly_1">Yawns Pointedly <a class="head_anchor" href="#yawns-pointedly_1">#</a>
</h1>
<p>I know, I know. Next week, I (finally) fire up Materialize and do something with it.</p>
<ul>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-1-introduction">Part 1: Introduction</a></li>
<li>Part 2: The Data (you’re here)</li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-3-first-impressions">Part 3: First Impressions</a></li>
</ul>
<h1>Freebase Meets Materialize 1: Introduction</h1>
<p>I recently started working at <a href="https://materialize.com/">Materialize</a>. Friday here is called “Skunkworks Friday” and is reserved for personal/professional development, moonshot projects, and other things that don’t get priority as part of the normal product+engineering planning cycle. I’ve decided to use my first few to prototype using Materialize as a generalized replacement for some hand-rolled infrastructure microservices that we had at a previous company.</p>
<ul>
<li>Part 1: Introduction (you’re here)</li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-2-the-data">Part 2: The Data</a></li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-3-first-impressions">Part 3: First Impressions</a></li>
</ul>
<h1 id="background_1">Background <a class="head_anchor" href="#background_1">#</a>
</h1>
<p>For several years, I worked at Foursquare, back when they were mostly a consumer tech company. I was on the monetization team, but most people worked on the user-facing app and website. Foursquare, like most apps at the time, kept data in a database but encapsulated this in a REST API. This API is what the mobile apps and the website talked to.</p>
<p>As was (and is) best practice, the data in the system of record database was <em>normalized</em>. Each user, venue, checkin, tip, etc. was its own record with a <em>unique identifier</em>. A checkin (user U is at place P at time T) would then refer to the associated venue and user by embedding their unique identifiers. Normalization is powerful because it means that updates (e.g. changing the name of a venue) only need to happen in one place: the canonical record.</p>
<p>The flip side to normalization is that most uses of data require joining the records together. The API to retrieve information about a checkin would bring in the user and venue records so that the app could render their names. Some API endpoints (like information about a single checkin) were simple enough that these joins could be done on the fly and still be fast enough that the app felt responsive to the user. Others required joining so much together that if we’d done it when the endpoint was called, it would take too long and the app would have felt unresponsive. This might be something like the API to get information about a venue, which pulled in tips about the venue, the users that wrote those tips, your friend graph relationship to the users that wrote those tips, and so on.</p>
<p>The opposite of normalization is <em>denormalization</em>. For example, though we didn’t do this, we could have embedded the user and venue names in each checkin record next to the respective unique identifiers. Denormalized data is fast on read because there are fewer joins, but loses the update-in-one-place property of a fully normalized database. It also takes more space because data is stored more than once (this is usually a lesser transgression). Taken to the extreme, one could imagine many fully denormalized copies: one tailored for each API endpoint.</p>
<p>Performant application development often involves careful thinking about where your data will fall along this normalization/denormalization spectrum. When you’re lucky, there’s some obvious point that’s both fast enough and straightforward to keep updated. When you’re not… there are some fairly unsatisfying options.</p>
<p>One option is to keep data normalized and then also keep denormalized versions of it in the same database. Then, when the normalized data changes, all denormalized copies of that data are updated in the same transaction. This pushes the burden for keeping them in agreement onto the application developer. It can work, but becomes increasingly burdensome and bug-prone as the app grows in complexity. (However, note that this is exactly what a database index is! <u>Indexes are denormalizations that the database maintains for you.</u> Most databases are limited in the shapes of the indexes that they can automatically keep updated, but as we’ll see below, some databases *cough* Materialize *cough* support much more generality in their “indexes”.)</p>
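<p>To make that concrete, here’s the index version in SQL, sketched against a hypothetical <code class="prettyprint">checkins</code> table:</p>
<pre><code class="prettyprint lang-text">-- A denormalized copy of checkins, re-sorted by venue, that the database
-- itself keeps in sync on every write. No application code required.
CREATE INDEX checkins_by_venue ON checkins (venue_id);
</code></pre>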
<p>Another option is to use software to maintain the denormalized copies, which is what Foursquare did. They had an engineering team, as part of infrastructure, that wrote bespoke microservices to follow changes as they happened in the database and update whatever denormalizations were affected with the minimal necessary work. At the time, we slurped the database logs directly, though nowadays this would likely be done as part of a change-data-capture based streaming architecture.</p>
<p>These microservices worked well, but required a team with full-time staffing to maintain them. This involved performance work and bug fixes, but also a ton of work to spin up a new one when required for a feature launch. Inevitably, the denormalizations were all <em>just</em> different enough that they couldn’t be nicely generalized and each required a good bit of custom code.</p>
<p>Some databases have the concept of a <code class="prettyprint">VIEW</code>, which can be thought of as one of these denormalizations written in SQL. An API endpoint could use one of these, but a bare <code class="prettyprint">VIEW</code> executes its logic when queried, which doesn’t save any time. On the other hand, a <code class="prettyprint">MATERIALIZED VIEW</code> fully computes the denormalization and is fast to query. This is exactly what we want!</p>
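<p>Sketched in SQL against hypothetical normalized tables, the denormalization for a “checkin details” endpoint might look something like this:</p>
<pre><code class="prettyprint lang-text">-- Precompute the joins the endpoint needs, so a read is a simple lookup.
CREATE MATERIALIZED VIEW checkin_details AS
SELECT c.id, c.created_at, u.name AS user_name, v.name AS venue_name
FROM checkins c
JOIN users u ON u.id = c.user_id
JOIN venues v ON v.id = c.venue_id;
</code></pre>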
<p>Unfortunately, existing databases almost never recompute a <code class="prettyprint">MATERIALIZED VIEW</code> incrementally as the underlying data changes. Periodically and/or at the user’s request, the system runs a big batch computation of the entire view and saves it, using it in future queries. Even if this recomputation is run continually in a loop, it introduces latency between when the normalized data changes and when the denormalized data catches up. This repeated “full refresh” recomputation is also resource intensive. As the amount of data increases, latency and CPU utilization go up. There are a few databases that can incrementally compute a <code class="prettyprint">MATERIALIZED VIEW</code>, but only for a <a href="https://docs.oracle.com/database/121/DWHSG/basicmv.htm#GUID-505C24CF-5D56-4820-88AA-2221410950E7">fairly restrictive set of special cases</a>.</p>
<h1 id="so-why-are-you-telling-me-this_1">So Why are you Telling me This? <a class="head_anchor" href="#so-why-are-you-telling-me-this_1">#</a>
</h1>
<p>Enter <a href="https://materialize.com/">Materialize</a>, which maintains <code class="prettyprint">SQL</code> <code class="prettyprint">MATERIALIZED VIEW</code>s incrementally, <a href="https://github.com/TimelyDataflow/differential-dataflow/blob/v0.12.0/differentialdataflow.pdf">doing as little work as possible</a> in response to each change in the underlying data. It is also much more expressive in the kinds of SQL queries it can incrementally maintain, including many-way joins and <a href="https://materialize.com/robust-reductions-in-materialize/">complex aggregations</a>. This is pretty obviously useful for things like analytics dashboards, but when I first heard about the <a href="https://github.com/TimelyDataflow/timely-dataflow">timely</a> and <a href="https://github.com/TimelyDataflow/differential-dataflow">differential dataflow</a> projects that power Materialize, my immediate thought was Foursquare’s denormalization microservices.</p>
<p>As I mentioned, I’ve decided to use my first few Skunkworks Fridays to prototype using Materialize as a replacement for what Foursquare was doing by hand. The basic idea, as hinted above, is that the data of record will be stored fully normalized, but in Materialize I’ll have a <code class="prettyprint">MATERIALIZED VIEW</code> corresponding to each API endpoint of a consumer-facing app. A nice side-benefit is that this will give me experience using the product I’m now developing and the opportunity to see it as a user.</p>
<h1 id="freebase_1">Freebase <a class="head_anchor" href="#freebase_1">#</a>
</h1>
<p>A long time ago (pre-Foursquare), I heard about the <a href="https://en.wikipedia.org/wiki/Freebase_(database)">Freebase</a> project. Freebase was a sort of “structured data” Wikipedia for storing facts. For example: the height of the Eiffel Tower, actor A played role R in movie M, and the hierarchy of administrative regions in the United States. These facts are stored as <code class="prettyprint"><subject, predicate, object></code> triples (more on this in the next post). The company behind Freebase was called “Metaweb” because the structure of this data was also expressed as these triples. In some sense, it’s the “ultimate normalization” of data, in which the schema and constraints (foreign keys/checks/etc) aren’t stored as part of database table structure, but as part of the data itself. (Notice that the <code class="prettyprint">MATERIALIZED VIEW</code> per endpoint is then a parallel “ultimate denormalization” of data. Why do anything halfway amirite?)</p>
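<p>To give a flavor of the shape, here are some made-up triples (not the dump’s actual N-Triples encoding, which the next post covers):</p>
<pre><code class="prettyprint lang-text"><eiffel_tower>    <height_meters>   330
<carrie_fisher>   <played_in>       <star_wars>
<new_york_city>   <contained_by>    <new_york_state>
</code></pre>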
<p>Freebase was acquired by Google and the database has been internalized (RIP), but Google still hosts a copy of the last publicly available <a href="https://developers.google.com/freebase/">freebase dataset</a>. So, my plan is to play with the idea of building an application on top of triples (seeded with the Freebase data) and using Materialize to maintain the denormalizations needed to keep it performant.</p>
<h1 id="whats-next_1">What’s Next <a class="head_anchor" href="#whats-next_1">#</a>
</h1>
<p>Well, that’s what I’m planning to do and a bit of my motivation. In the next post, I’ll download a dump of the Freebase data and extract a smaller, more manageable chunk to work with. In post 3, I’ll fire up Materialize and use it to render something useful. See you then!</p>
<ul>
<li>Part 1: Introduction (you’re here)</li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-2-the-data">Part 2: The Data</a></li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-3-first-impressions">Part 3: First Impressions</a></li>
</ul>
<p><em>Thanks to Arjun</em></p>
tag:blog.danhhz.com,2014:Post/simplenote-934cb2a5e50b2019-01-18T15:33:52-08:002019-01-18T15:33:52-08:00Simplenote<blockquote>
<p>“What note-taking app do you use? Do you like it? I currently use Evernote and kind of hate it.” — co-worker on our internal slack</p>
<p>“Twitter productivity gurus and tinkerers. What is the best cross-platform, light-weight note-taking app these days?” — <a href="https://twitter.com/noah_weiss/status/704335744714842112">Noah Weiss</a></p>
</blockquote>
<p><a href="https://svbtleusercontent.com/HvH2cumCKPUnihj2vVGJK0xspap.png"><img src="https://svbtleusercontent.com/HvH2cumCKPUnihj2vVGJK0xspap_small.png" alt="Simplenote"></a></p>
<p>Allow me to introduce you to my favorite piece of software, Simplenote. As the name suggests, it’s for notes and it’s intentionally simple. The notes are plain text, shareable, and they sync instantly and seamlessly. This means that I never need to think about where I typed a note; it’s available and editable on my computer, my phone, my partner’s phone, the web. There are a very small number of features built on top of this, but they’re carefully chosen. Omitted features, like inline images, may sound limiting, but there’s a reason people seem to always be looking for an Evernote replacement.</p>
<p>I’ve used Simplenote for nearly 10 years now and along the way it became the place where I put everything. Everything I type, besides code and emails/texts/chats, is typed into Simplenote. Sometimes the emails get drafted in Simplenote. In fact, I’m writing this post in Simplenote. I have thousands of notes, so I can’t list them all, but here’s a sample: every todo list I have, a grocery list shared with my partner, poetry I’ve liked, quotes, recipes, vacation planning, things I pack on every trip, a log of books I’ve read, books I want to read, frequent flyer account numbers, restaurants I want to go to, potential dog names, my sizes in various clothing brands, thoughts I want to remember but didn’t know what to do with, snippets of code, dinner party menus. The shared grocery list alone is worth it.</p>
<p>Simplenote is my go-to example of a great user experience. I have a theory that the largest part of successful UX is that every UI element does exactly what you expect (no surprises!) and when you want to do something, your first guess is correct. Simplenote nails this, and the simplicity is a big part of how. They’ve selected exactly the minimum feature set that people really need, which means those features can all be exposed in the most obvious way and yet it doesn’t end up cluttered.</p>
<p>The features, basically: tags, search, full revision history, read-only publishing to the web, sharing with other Simplenote accounts, markdown rendering, focus mode, and that’s about it. Just check out Noah’s list of requirements in the tweet I linked above; it matches almost perfectly.</p>
<p>My life is organized entirely in Simplenote. This has been the case since at least 2009, when I was working at Google and they gave out the first Android phones as holiday gifts. Like a good employee, I tried to switch, but at the time there was no native Android app and I had to switch back to iOS. Early Android had a lot of rough edges, but the one thing I couldn’t adapt to was the lack of Simplenote.</p>
<p>When syncing only works most of the time, it never completely fades from your headspace, but when it works every time, it stops being something that you worry about. Simplenote’s syncing works every time. In ~10 years, it’s glitched once and lost one set of edits to a single note. I remember this so clearly because for the rest of that time, I’ve never had to think about where I’ve typed something. It’s always in Simplenote.</p>
<p>My enthusiasm for Simplenote may sound like hyperbole, but it’s not. Official apps are available for Mac, iOS, Android, and the web. <a href="https://simplenote.com/">Try it out</a>!</p>
tag:blog.danhhz.com,2014:Post/easy-bread-18662b29cf332018-10-24T14:59:21-07:002018-10-24T14:59:21-07:00Easy Bread<p><em>My take on making Jim Lahey’s No-Knead Bread even simpler</em></p>
<p><a href="https://svbtleusercontent.com/6s6XMDZ3xkVus8YWUxcJ730xspap.jpeg"><img src="https://svbtleusercontent.com/6s6XMDZ3xkVus8YWUxcJ730xspap_small.jpeg" alt="The final product" title="The final product"></a></p>
<p>I love making bread. It does, however, lead to more bread than I can eat, which leads to me giving away loaves of bread at every opportunity. Occasionally this leads to someone asking me how I make bread.</p>
<p>These days I follow the recipes in Ken Forkish’s book Flour Water Salt Yeast as closely as possible. It’s an excellent book and I’ve had much better results with his recipes than any other source I’ve tried.</p>
<p>For about a year when I was getting started, I used <a href="https://cooking.nytimes.com/recipes/11376-no-knead-bread">Jim Lahey’s “No-Knead Bread” recipe</a>. It strikes the perfect balance of easy, beginner friendly, and tasty. There is a delightful tradeoff in bread between fast and easy; you can make good bread in 5 hours with a lot of work or you can make great bread in 18–24 hours with almost no work. Mr. Lahey’s recipe swings all the way toward easy, eliminating the kneading entirely. A friend of mine made a couple tweaks to simplify it and over time, I’ve made a couple tweaks of my own to make it more foolproof.</p>
<p>The only ingredients you need are all-purpose flour, water, salt, and instant yeast. The long rise means you only need a tiny bit of yeast, so one packet will make several loaves of bread. This recipe can also work with a sourdough starter instead of yeast, but it’s more complicated. I also find I end up with less consistent results. If you want sourdough, I recommend jumping right to the recipes in Flour Water Salt Yeast.</p>
<p>I use <a href="https://thewirecutter.com/reviews/best-kitchen-scale/">a kitchen scale</a> to measure ingredients because it gives much more consistent results. It’s also less cleanup! Just set your bowl on the scale and dump the ingredients into it, no measuring cup necessary. I really believe weighing ingredients is one of the easiest things you can do to improve your baking. If you’re not ready to commit to a scale, try the volumetric amounts in the No-Knead recipe I linked above, but your mileage may vary; I’ve never done it.</p>
<p>You’ll also need some parchment and something to bake in. Ideally, it will be enclosed to keep the moisture in. I use two cast iron pans that are sold as a set and fit together. A dutch oven works well but most of them have a handle on top that can’t stand the temperatures involved, so make sure to remove it beforehand. If all you have is a baking sheet, try it out and let me know how it goes. Yeasted bread is incredibly forgiving.</p>
<p>The timeline here is: a bit of work, 12–18 hours of waiting, a bit of work, 2–3 hours of waiting (with the oven pre-heating at the end), and an hour or so bake. There are peak times during the two windows of waiting, but at first, it’s fine to just use whatever is most convenient for your schedule.</p>
<p>That’s it. Let’s make bread.</p>
<hr>
<ul>
<li>500g all-purpose flour</li>
<li>400g water</li>
<li>11g salt</li>
<li>1/8 teaspoon instant yeast (most scales, including mine, are not accurate enough to weigh such a small amount)</li>
</ul>
<p>Step 1: In the evening on the day before you want bread, mix together all the ingredients in a bowl that’s big enough for it to double or triple in size. Cover in plastic wrap and leave it somewhere that’s as close to 70F as possible.</p>
<p><a href="https://svbtleusercontent.com/rWH5tG2ptVfckB4Uu9uN1H0xspap.jpeg"><img src="https://svbtleusercontent.com/rWH5tG2ptVfckB4Uu9uN1H0xspap_small.jpeg" alt="After the initial mix"></a></p>
<p>Step 2: Anywhere from 12 to 18 hours later, it’ll look like the photo below. This is called the bulk rise.</p>
<p><a href="https://svbtleusercontent.com/eLciuXFWr59WUQkJRQLRbq0xspap.jpeg"><img src="https://svbtleusercontent.com/eLciuXFWr59WUQkJRQLRbq0xspap_small.jpeg" alt="After the bulk rise"></a></p>
<p>Step 3: Next is shaping. Most bread recipes have you use as little flour as possible, but more flour makes it easier and the worst I’ve seen happen is a streak of flour in the middle of the finished bread. Which is fine. If anything it just makes it seem more homemade. A trick of my own: if you have a large enough cutting board to fit the bread, it makes cleanup much easier than a counter. Sprinkle a generous amount of flour and dump the dough onto it. Use a scraper or a spatula or anything flat to fold it into a (flat, loose) ball with the flour side out.</p>
<p><a href="https://svbtleusercontent.com/hTtjJj2gunComaaymNHz9u0xspap.gif"><img src="https://svbtleusercontent.com/hTtjJj2gunComaaymNHz9u0xspap_small.gif" alt="1Getting ready to knead"></a></p>
<p>I find that 10–15 (gentle!) kneads here make it easier to shape. This is pretty late in the process to be working it this much and will push out some bubbles. There’s a better way to accomplish the same thing (using the “folds” method described by Flour Water Salt Yeast), but the kneads here are a lot easier and you still end up with a good bread.</p>
<p><a href="https://svbtleusercontent.com/TA3J7MAcHf8fQExqy7kdH0xspap.gif"><img src="https://svbtleusercontent.com/TA3J7MAcHf8fQExqy7kdH0xspap_small.gif" alt="Not quite no-knead"></a></p>
<p>Before shaping, make sure the dough isn’t sticky on the outside. If it is, dust a good amount of flour on it (in the video below, I don’t have nearly enough flour and you can see my hands sticking). If your hands have dough on them, wash it off. Then cover them with as much flour as will stick to them. Dust a piece of parchment paper with some flour.</p>
<p>Pick the dough up with your fingertips pointed toward the middle of the bottom. Then create some tension by gently pushing the bottom up into the middle. Rotate and repeat a few times. Set the dough seam side down on the floured parchment. Dust the top with flour and put another piece of parchment on the top to keep it from drying out.</p>
<p><a href="https://svbtleusercontent.com/2y17qrGoCNsT8H92Q1XP8a0xspap.gif"><img src="https://svbtleusercontent.com/2y17qrGoCNsT8H92Q1XP8a0xspap_small.gif" alt="Shaping the dough"></a></p>
<p>Step 4: Let it sit for 2–3 hours before baking. This is called the final rise. It will take a while for the oven to preheat to 500F, so make sure to turn it on (with your baking container in it!) an hour or so before the end of the final rise.</p>
<p><a href="https://svbtleusercontent.com/38yReLE67Rxkh5v6VA1SRS0xspap.jpeg"><img src="https://svbtleusercontent.com/38yReLE67Rxkh5v6VA1SRS0xspap_small.jpeg" alt="After the final rise"></a></p>
<p>Step 5: When the final rise finishes and the oven is pre-heated, take the extremely hot container out and flip the dough into it so the seam side is up. The seam will allow gases to escape (good) and bake into a beautiful top. Put on your container’s top and put it in the oven. Lower the temp to 450F and let it bake for 30 minutes before uncovering. This uncovering is my favorite part! You get to see the final shape of your bread, which I find incredibly satisfying and a bit magical. Bake at least 15 minutes uncovered before starting to check for doneness.</p>
<p>It’s done when the outside is at least golden brown, but the darker it gets (short of burning), the more flavor it will have, so leave it in for as long as your nerve holds, up to 60 minutes total.</p>
<p>Once it’s ready, take it out and let it cool on a wire rack or leaned up against something, so the bottom can breathe. If you cut it before it has cooled 20–30 minutes, the rest of the loaf will collapse a bit. But warm bread is awesome, so maybe you don’t care.</p>
<p>Here’s my first bread, from ~1 year ago.</p>
<p><a href="https://svbtleusercontent.com/5mEcNUwYqqEgLbRjis7nvt0xspap.jpeg"><img src="https://svbtleusercontent.com/5mEcNUwYqqEgLbRjis7nvt0xspap_small.jpeg" alt="My first bread! ~1 year ago"></a></p>
<p>Enjoy and send me pictures!</p>
<p><em>Thanks to Kat and Arjun.</em></p>
tag:blog.danhhz.com,2014:Post/implementing-backup-5a75b77887442017-08-09T14:35:08-07:002017-08-09T14:35:08-07:00Implementing Backup<p><em>Originally published at <a href="http://www.cockroachlabs.com">www.cockroachlabs.com</a> on August 9, 2017.</em></p>
<p>Almost all widely used database systems include the ability to backup and restore a snapshot of their data. The replicated nature of CockroachDB’s distributed architecture means that the cluster survives the loss of disks or nodes, and yet many users still want to make regular backups. This led us to develop distributed backup and restore, the first feature available in our CockroachDB Enterprise offering.</p>
<p>When we <a href="https://www.cockroachlabs.com/blog/coming-soon-what-to-expect-in-cockroachdb-1-0/">set out</a> to work on this feature, the first thing we did was figure out why customers wanted it. The reasons we discovered included a general sense of security, “Oops I dropped a table”, finding a bug in new code only when it’s deployed, legally required data archiving, and the “extract” phase of an ETL pipeline. So as it turns out, even in a system that was built to never lose your data, backup is still a critical feature for many of our customers.</p>
<p>At the same time, we brainstormed whether CockroachDB’s unique architecture allowed any improvements to the status quo. In the end, we felt it was important that both backup and restore be consistent across nodes (just like our SQL), distributed (so it scales as your data scales), and incremental (to avoid wasting resources).</p>
<p>Additionally, we knew that backups need to keep only a single copy of each piece of data and should impact production traffic as little as possible. You can see the full list of goals and non-goals in the <a href="https://github.com/cockroachdb/cockroach/blob/v1.0/docs/RFCS/backup_restore.md#goals-and-non-goals">Backup & Restore RFC</a>.</p>
<p>In this post, we’ll focus on backup and how we made it work.</p>
<h1 id="step-0-why-we-reinvented-the-wheel_1">Step 0: Why We Reinvented the Wheel <a class="head_anchor" href="#step-0-why-we-reinvented-the-wheel_1">#</a>
</h1>
<p>One strategy for implementing backup is to take a snapshot of the database’s files, which is how a number of other systems work. CockroachDB uses RocksDB as its disk format and <a href="https://github.com/facebook/rocksdb/wiki/how-to-backup-rocksdb%3F">RocksDB already has a consistent backup feature</a>, which would let us do consistent backups without any particular filesystem support for snapshots of files. Unfortunately, because CockroachDB does such a good job of balancing and replicating your data evenly across all nodes, there’s not a good way to use RocksDB’s backup feature without saving multiple copies of every piece of data.</p>
<h1 id="step-1-make-it-consistent_1">Step 1: Make it Consistent <a class="head_anchor" href="#step-1-make-it-consistent_1">#</a>
</h1>
<p>Correctness is the foundation of everything we do here at Cockroach Labs. We believe that once you have correctness, then stability and performance will follow. With this in mind, when we began work on backup, we started with consistency.</p>
<p>Broadly speaking, CockroachDB is a SQL database built on top of a consistent, distributed key-value store. Each table is assigned a unique integer id, which is used in the mapping from table data to key-values. The table schema (which we call a <a href="https://github.com/cockroachdb/cockroach/blob/v1.0/pkg/sql/sqlbase/structured.proto#L355-L533">TableDescriptor</a>) is stored at key <code class="prettyprint">/DescriptorPrefix/<tableid></code>. Each row in the table is stored at key <code class="prettyprint">/<tableid>/<primarykey></code>. (This is a simplification; the real encoding is much more complicated and efficient than this. For full details see the <a href="https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mapping-table-data-to-key-value-storage/">Table Data blog post</a>).</p>
<p>I’m a big fan of pre-RFC exploratory prototypes, so the first version of backup used the existing <code class="prettyprint">Scan</code> primitive to fetch the table schema and to page through the table data (everything with a prefix of <code class="prettyprint">/<tableid></code>). This was easy, quick, and it worked!</p>
<p>It also meant the engineering work was now separable. The <a href="https://www.cockroachlabs.com/docs/stable/backup.html">SQL syntax for <code class="prettyprint">BACKUP</code></a>, the format of the backup files (described below), and <code class="prettyprint">RESTORE</code> could now be divvied up among the team members.</p>
<p>Unfortunately, the node sending all the <code class="prettyprint">Scan</code>s was also responsible for writing the entire backup to disk. This was sloooowwww (less than 1 MB/s), and it didn’t scale as the cluster scaled. We built a database to handle petabytes, but this could barely handle gigabytes.</p>
<p>With consistency in hand, the natural next step was to distribute the work.</p>
<h1 id="step-2-make-it-distributed_1">Step 2: Make it Distributed <a class="head_anchor" href="#step-2-make-it-distributed_1">#</a>
</h1>
<p>We decided early on that backups would output their files to the storage offered by cloud providers (Amazon, Google, Microsoft, private clouds, etc). So what we needed was a command that was like <code class="prettyprint">Scan</code>, except instead of returning the data, it would write it to cloud storage. And so we created <code class="prettyprint">Export</code>.</p>
<p><a href="https://github.com/cockroachdb/cockroach/blob/v1.0/pkg/roachpb/api.proto#L782-L804"><code class="prettyprint">Export</code> is a new transactionally-consistent command</a> that iterates over a range of data and writes it to cloud storage. Because we break up a large table and its secondary indexes into multiple pieces (called “ranges”), the request that is sent gets split up by the kv layer and sent to many nodes. The exported files use <a href="https://github.com/google/leveldb/blob/master/doc/table_format.md">LevelDB’s SSTable</a> as the format because it supports efficient seeking (in case we want to query the backup) and because it was already used elsewhere in CockroachDB.</p>
<p>Along with the exported data, a serialized <a href="https://github.com/cockroachdb/cockroach/blob/v1.0/pkg/ccl/sqlccl/backup.proto#L20-L58">backup descriptor</a> is written with metadata about the backup, a copy of the schema of each included SQL table, and the locations of the exported data files.</p>
<p>Once we had a backup system that could scale to clusters with many nodes and lots of data, we had to make it more efficient. It was particularly wasteful (of both CPU and storage) to export the full contents of tables that change infrequently. What we wanted was a way to write only what had changed since the last backup.</p>
<h1 id="step-3-make-it-incremental_1">Step 3: Make it Incremental <a class="head_anchor" href="#step-3-make-it-incremental_1">#</a>
</h1>
<p>CockroachDB uses <a href="https://www.cockroachlabs.com/blog/serializable-lockless-distributed-isolation-cockroachdb/">MVCC</a>. This means each of the keys I mentioned above actually has a timestamp suffix, something like <code class="prettyprint">/<tableid>/<primarykey>:<timestamp></code>. Mutations to a key don’t overwrite the current version, they write the same key with a higher timestamp. Then the old versions of each key are cleaned up after 25 hours.</p>
<p>To make an incremental version of our distributed backup, all we needed to do was leverage these MVCC versions. Each backup has an associated timestamp. An incremental backup simply saves any keys that have changed between its timestamp and the timestamp of the previous backup. We plumbed these time ranges to our new <code class="prettyprint">Export</code> command and voilà! Incremental backup.</p>
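<p>As an illustration (made-up key and timestamps), suppose a customer row was written at <code class="prettyprint">t1</code> and updated at <code class="prettyprint">t3</code>:</p>
<pre><code class="prettyprint lang-text">/<customers>/<4>:t3 -> "Carl"    (current version)
/<customers>/<4>:t1 -> "Carla"   (older MVCC version)
</code></pre>
<p>A full backup taken at <code class="prettyprint">t4</code> saves the <code class="prettyprint">t3</code> version. An incremental backup taken at <code class="prettyprint">t4</code> whose previous backup ran at <code class="prettyprint">t2</code> also saves just the <code class="prettyprint">t3</code> version, while one whose previous backup ran after <code class="prettyprint">t3</code> skips this key entirely.</p>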
<p>One small wrinkle: if a given key (say <code class="prettyprint">/<customers>/<4></code>) is deleted, then 25 hours later when the old MVCC versions are cleaned out of RocksDB, this deletion (called a tombstone) is also collected. This means incremental backup can’t tell the difference between a key that’s never existed and one that was deleted more than 25 hours ago. As a result, an incremental backup can only run if the most recent backup was fewer than 25 hours ago (though full backups can always be run). The 25-hour period is not right for every user, so it’s <a href="https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html">configurable using replication zones</a>.</p>
<h1 id="go-forth-and-backup_1">Go Forth and Backup <a class="head_anchor" href="#go-forth-and-backup_1">#</a>
</h1>
<p>Backup is run via a simple <a href="https://www.cockroachlabs.com/docs/stable/backup.html"><code class="prettyprint">BACKUP</code> SQL command</a>, and with our work to make it consistent first, then distributed and incremental, it turned out blazing fast. We’re getting about 30MB/s per node and there’s still lots of low-hanging performance fruit. It’s our first enterprise feature, so head on over to our <a href="https://www.cockroachlabs.com/pricing/">license page</a> to grab an evaluation license and try it out.</p>
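<p>For example, a weekly full backup followed by a daily incremental looks something like this (the bucket paths are made up; see the docs linked above for the full syntax):</p>
<pre><code class="prettyprint lang-text">BACKUP DATABASE bank TO 'gs://acme-co-backups/bank-weekly';
BACKUP DATABASE bank TO 'gs://acme-co-backups/bank-daily'
    INCREMENTAL FROM 'gs://acme-co-backups/bank-weekly';
</code></pre>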
<p>While CockroachDB was built to survive failures and prevent data loss, we want to make sure every team, regardless of size, has the ability to survive any type of disaster. Backup and restore were built for large clusters that absolutely need to minimize downtime, but for smaller clusters, a simpler tool will work just fine. For this, <a href="https://www.cockroachlabs.com/docs/stable/sql-dump.html">we’ve built <code class="prettyprint">cockroach dump</code></a>, which is available in CockroachDB Core.</p>
<h1 id="whats-next_1">What’s Next? <a class="head_anchor" href="#whats-next_1">#</a>
</h1>
<p>We have plans for a number of future projects to build on this foundation: <a href="https://github.com/cockroachdb/cockroach/pull/16838">Change Feeds</a> for point-in-time backup and restore, read-only SQL queries over backups, an admin ui page with progress and scheduling, pause/resume/cancel control of running backups, and more.</p>
<p>Plus, <code class="prettyprint">BACKUP</code> is worth far more with <code class="prettyprint">RESTORE</code> (which turned out to be much harder and more technically interesting) and there’s a lot more that didn’t fit in this blog post, so stay tuned.</p>
tag:blog.danhhz.com,2014:Post/implementing-column-families-in-cockroachdb-ea15b23b782e2016-09-28T15:44:38-07:002016-09-28T15:44:38-07:00Implementing Column Families in CockroachDB<p><em>Originally published at <a href="https://www.cockroachlabs.com/blog/sql-cockroachdb-column-families/">www.cockroachlabs.com</a> on September 29, 2016.</em></p>
<p>CockroachDB is a scalable SQL database built on top of a transactional key value store. We don’t (yet) expose the kv layer but it’s general purpose enough that we’ve used it to implement SQL without any special trickery.<br>
The particulars of how we represent data in a SQL table as well as the table metadata are internally called the “format version”. Our first format version was deliberately simple, causing some performance inefficiencies. We recently improved performance with a technique called column families, which pack multiple columns in one kv entry.</p>
<p>Once implemented, column families <a href="https://github.com/cockroachdb/cockroach/pull/7623">produced dramatic improvements in our benchmarks</a>. A table with more columns benefits more from this optimization, so we added a benchmark of INSERTs, UPDATEs, and DELETEs against a table with 20 INT columns and <a href="https://github.com/cockroachdb/cockroach/pull/7408">it ran 5 times faster</a>.</p>
<p>Press on, dear reader, and I’ll explain the details of how we did it and how they work.</p>
<h1 id="format-version-1-cockroachdb-before-column-fa_1">Format Version 1: CockroachDB Before Column Families <a class="head_anchor" href="#format-version-1-cockroachdb-before-column-fa_1">#</a>
</h1>
<p>CockroachDB requires every SQL table to have a primary index; one is generated if it was not provided by the user. Our first format version stored the table data as kv entries with keys prefixed by the columns in the primary index. The remaining columns were each encoded as the value in a kv entry. Additionally, a sentinel key with an empty value was always written and used to indicate the existence of a row. This resulted in N+1 entries for a table with N non-primary index columns. Secondary indexes work a little differently, but we don’t need them for today.</p>
<p>This all results in something like:</p>
<pre><code class="prettyprint lang-text">/<tableID>/<indexID>/<primaryKeyColumns...>/<columnID> -> <4 byte CRC><encoded value>
</code></pre>
<p>And more concretely:</p>
<pre><code class="prettyprint lang-text">CREATE TABLE users (id INT PRIMARY KEY, name STRING, email STRING);
INSERT INTO users VALUES (11, 'Hal', 'hal@cockroachlabs.com');
INSERT INTO users VALUES (13, 'Orin', 'orin@cockroachlabs.com');
/<tableid>/0/11/0 -> <empty>
/<tableid>/0/11/1 -> "Hal"
/<tableid>/0/11/2 -> "hal@cockroachlabs.com"
/<tableid>/0/13/0 -> <empty>
/<tableid>/0/13/1 -> "Orin"
/<tableid>/0/13/2 -> "orin@cockroachlabs.com"
</code></pre>
<p>Note that columns never use ID 0 because it’s reserved for use as the sentinel. This is all described in much more detail in the original <a href="https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mapping-table-data-to-key-value-storage/">SQL in CockroachDB: Mapping Table Data to Key-Value Storage</a> blog post. If you haven’t read it, I highly recommend you do.</p>
<h1 id="the-trouble-with-format-version-1_1">The Trouble with Format Version 1 <a class="head_anchor" href="#the-trouble-with-format-version-1_1">#</a>
</h1>
<p>Everything has to start somewhere, and while our first format version worked, it was a little inefficient. The encoded primary index data in the key was repetitive, and each entry carried an <a href="https://en.wikipedia.org/wiki/Multiversion_concurrency_control">MVCC timestamp</a> and checksum, collectively wasting disk space and network bandwidth.</p>
<p>Perhaps worse was that there was per-key overhead at the transaction level. Every key written within a transaction has a “<a href="https://www.cockroachlabs.com/blog/how-cockroachdb-distributes-atomic-transactions/">write intent</a>” associated with it. These intents need to be resolved when the transaction is committed, taxing performance.</p>
<p>While our disk format avoids the key repetition with an incremental prefix encoding, the timestamp and the checksum still create ~12 bytes of overhead per key, not to mention the intents.</p>
<p>Since the problem was using one kv entry per column in the table, the natural solution was to group multiple columns into one value. Several NoSQL databases use a similar technique and call each group a “column family”.</p>
<p>When we set out to implement column families, the first wrinkle was deciding whether to support get and set on individual columns in a family or to load and store an entire family to change one column. The former would allow us to make every table’s primary data one key value entry. Unfortunately, it would also require the kv layer to understand the encoding that packs multiple columns in one value. If we later decided to change the encoding, it would be much more difficult to migrate if it were baked into the key value layer. Plus, the tidy separation the SQL and kv layers have enjoyed so far has been a big help to testability and moving quickly. We felt this wasn’t a worthwhile tradeoff.</p>
<p>As a result, we support multiple column families per table, so that setting a small field doesn’t necessitate roundtripping any large fields in the same table.</p>
<p><em>Side note: A common question we get is whether we support use of the key value layer directly. We don’t right now, but by using one entry instead of two, we’ve gotten much closer to eliminating the overhead of using the CockroachDB key value store via a two-column (key, value) SQL table.</em></p>
<h1 id="how-do-column-families-in-cockroachdb-work_1">How Do Column Families in CockroachDB Work? <a class="head_anchor" href="#how-do-column-families-in-cockroachdb-work_1">#</a>
</h1>
<p>Before column families, the value of an encoded table column was structured as:</p>
<pre><code class="prettyprint lang-text"><crc><typetag><encodedvalue>
</code></pre>
<p>With column families, this is now:</p>
<pre><code class="prettyprint lang-text"><crc><columnid0><typetag0><encodedvalue0>...<columnidN><typetagN><encodedvalueN>
</code></pre>
<p>or for our example above</p>
<pre><code class="prettyprint lang-text">/<tableid>/0/11/0 -> <crc>/1/string/"Hal"/1/string/"hal@cockroachlabs.com"
/<tableid>/0/13/0 -> <crc>/1/string/"Orin"/1/string/"orin@cockroachlabs.com"
</code></pre>
<p>Notably, the column IDs in the keys have been replaced by family IDs. The first family ID is 0, doubling as the sentinel, and is always present. We use a variable length encoding for integers, including column IDs. This encoding is shorter for smaller numbers, so instead of storing each column ID directly, we store its difference from the previous one to keep the numbers small; that’s why the example above shows <code class="prettyprint">1</code> for both the name column (ID 1) and the email column (ID 2). NULLs are omitted to save space.</p>
<p>A couple of the existing data encodings (DECIMAL and BYTES) didn’t self-delimit their length. It’s desirable if we can extract the data for some of the columns without decoding them all, so we added variants of these two encodings that are length prefixed.</p>
<p>A constant concern of working in any system that persists data is how to read old data with new code. We made column families backward compatible by special casing a family that’s only ever had one column; it’s encoded exactly as it was before (with no column ID). This also happens to have the side benefit of being a nice space optimization.</p>
<p>All this and more is detailed in the <a href="https://github.com/cockroachdb/cockroach/blob/71a023f9212cf1285a4475ed85d5ab6eefa9bdf7/docs/RFCS/20151214_sql_column_families.md">Column Families RFC</a> if you’re interested.</p>
<h1 id="using-column-families_1">Using Column Families <a class="head_anchor" href="#using-column-families_1">#</a>
</h1>
<p>When a table is created, some <a href="https://github.com/cockroachdb/cockroach/blob/71a023f9212cf1285a4475ed85d5ab6eefa9bdf7/docs/RFCS/20151214_sql_column_families.md#heuristics-for-fitting-columns-into-families">simple heuristics</a> are used to determine which columns get grouped together. You can see these assignments in the output of the <code class="prettyprint">SHOW CREATE TABLE</code> command.</p>
<p>CockroachDB can’t know the query patterns of a table when it’s created, but the way a table is queried has a big impact on the optimal column family mapping. So, we decided to allow a user to manually tune these assignments when necessary. A small extension (<code class="prettyprint">FAMILY</code>) was added to our SQL dialect to allow for user tuning of the assignments. The various tradeoffs are detailed in our <a href="https://www.cockroachlabs.com/docs/stable/column-families.html">column families documentation</a>.</p>
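<p>For example, a frequently updated column can be split into its own family so that writing it doesn’t rewrite the rest of the row (hypothetical table, sketching the syntax):</p>
<pre><code class="prettyprint lang-text">CREATE TABLE users (
    id INT PRIMARY KEY,
    name STRING,
    email STRING,
    last_seen TIMESTAMP,
    FAMILY profile (id, name, email),
    FAMILY activity (last_seen)
);
</code></pre>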
<p>Building a SQL database after the rise of NoSQL means that CockroachDB gets to pick the best parts of both. In this case, we were able to use column families, an optimization commonly found in NoSQL databases, to speed up our SQL implementation. The resulting performance improvement moves us one step closer to our 1.0 release.</p>