<h1>Compile Times and Code Graphs</h1>
<p><em>Cross-posted on the <a href="https://materialize.com/blog/engineering/compile-times-and-code-graphs/">Materialize Blog</a>.</em></p>
<p>At <a href="https://materialize.com/">Materialize</a>, Rust compile times are a frequent complaint. On one hand, I’m forever anchored by the Scala compile times from my days at Foursquare; a clean build without cache hits took over an hour. On the other, Go at Cockroach Labs was great. Rust is in between, but much closer to Go than to Scala.</p>
<p>So far, I’ve mostly insulated myself from this by carving out an isolated corner where unit tests catch almost all the bugs, so iteration is fast. But recently, I’ve been pitching in on some cross-cutting projects, felt the pain that everyone else is feeling, and so was motivated to improve our compile times a bit. Here’s how I did it.</p>
<p>First, a note that there are lots of other ways to improve compile times<sup id="fnref1"><a href="#fn1">1</a></sup>, but today we’re going to talk about dependency graphs in code.</p>
<p>In general, the following will be talking about the smallest compilation unit that <em>doesn’t</em> allow cyclic dependencies. In Rust, <em>modules</em> allow cycles but <em>crates</em> don’t, so today we’re talking about crates. For simplicity, I’ll just use “crate” below, but go ahead and mentally substitute whatever the equivalent is in your language of choice.</p>
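<p>A quick, self-contained Rust illustration of the asymmetry (module and function names invented here): the two sibling modules below reference each other and compile fine, but expressing the same cycle between two crates is an error.</p>
<pre><code class="prettyprint lang-rust">// Within one crate, sibling modules may reference each other freely.
mod storage {
    pub fn name() -> &'static str {
        "storage"
    }
    pub fn peer() -> &'static str {
        crate::compute::name()
    }
}

mod compute {
    pub fn name() -> &'static str {
        "compute"
    }
    pub fn peer() -> &'static str {
        crate::storage::name()
    }
}

fn main() {
    // This cycle is fine for modules; cargo would reject the same
    // dependency cycle declared between two crates.
    println!("{} <-> {}", storage::peer(), compute::peer());
}
</code></pre>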
<h1 id="ideal-code-dependency-structure_1">Ideal Code Dependency Structure <a class="head_anchor" href="#ideal-code-dependency-structure_1">#</a>
</h1>
<p>This is going to sound obvious when written up, but bear with me.</p>
<p>Large software projects that involve lots of business logic will typically be broken up internally into crates (or crate equivalent). Day-to-day work will then involve typing up and iterating on some change until a good structure is worked out, the bugs are fixed, new tests are passing, old tests are passing, etc. In practice, the majority of these iterations of the edit-compile-run loop will only touch one crate (or a few). For this to be fast, you want as few crates as possible to depend on the one you’re changing, and for the dependents that do exist to be as small as possible.</p>
<p>Secondarily, when you pull in new code to your branch, or switch branches, you want your crate’s dependencies to be as small as possible. However, note that a dependency that doesn’t change often isn’t as bad because your compiler will get cache hits for it.</p>
<p>At some point, you’ll be happy with your change and will move on to integration testing, which requires compiling all binaries that transitively depend on it. This means you want your crate to only be in the binaries where it “belongs” (it’s surprisingly easy to end up with “incidental” dependencies if it’s not something you’re looking out for).</p>
<p>The logical conclusion of the above is a shape where a small number of infrequently changing foundational crates are at the “bottom” of the graph, then a lot of fanning out to business logic crates, which fan in to some number of binaries (production binaries, test binaries, etc) at the “top” of the graph. This shape is also particularly friendly for hermetic build systems (a la <a href="https://bazel.build/">bazel</a>, <a href="https://buck2.build/">buck2</a>, <a href="https://www.pantsbuild.org/">pants</a>) that can reuse compilation artifacts generated by other machines (e.g. CI).</p>
<p><a href="https://svbtleusercontent.com/rUCssSBVfrjrQubXpRPECT0xspap.jpeg"><img src="https://svbtleusercontent.com/rUCssSBVfrjrQubXpRPECT0xspap_small.jpeg" alt="IMG_5021.JPEG"></a></p>
<h1 id="a-pattern-emerges_1">A Pattern Emerges <a class="head_anchor" href="#a-pattern-emerges_1">#</a>
</h1>
<p>The above image describes an ideal, but what does that look like concretely? Both Foursquare and Materialize have ended up with a similar manifestation.</p>
<p>For each unit of business logic <code class="prettyprint">foo</code>, separate crates for (a sketch follows the list):</p>
<ul>
<li>
<em>Types</em>: for Plain Old Data, <a href="https://protobuf.dev/">protobuf</a>, traits that users of <code class="prettyprint">foo</code> implement, etc.</li>
<li>
<em>Interface</em>: for the public API without an implementation. 4sq called this <code class="prettyprint">FooService</code>, mz calls it <code class="prettyprint">foo-client</code>.</li>
<li>
<em>Implementation</em>: for the implementation of the public API. 4sq called this <code class="prettyprint">FooConcrete</code>, mz calls it <code class="prettyprint">foo</code>.</li>
<li>Note that not every <code class="prettyprint">foo</code> will have all three of these, and some will be more complicated, but I’ve found these three to be a reasonable default.</li>
</ul>
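<p>To make that concrete, here’s a hypothetical Rust sketch of the split, collapsed into a single listing with comments marking the crate boundaries (the <code class="prettyprint">Foo*</code> names are invented for illustration, not real Materialize crates):</p>
<pre><code class="prettyprint lang-rust">// --- crate foo-types: Plain Old Data and traits implemented by users ---
pub struct FetchRequest {
    pub key: String,
}
pub struct FetchResponse {
    pub value: Option<String>,
}

// --- crate foo-client: the public API, with no implementation ---
// Depends only on foo-types, so it rarely gets invalidated.
pub trait FooClient {
    fn fetch(&self, req: FetchRequest) -> FetchResponse;
}

// --- crate foo: the implementation of the public API ---
// Only the binaries that actually run foo need to (re)compile this.
pub struct FooImpl;

impl FooClient for FooImpl {
    fn fetch(&self, req: FetchRequest) -> FetchResponse {
        // Real logic lives here; stubbed for the sketch.
        let _ = req;
        FetchResponse { value: None }
    }
}
</code></pre>
<p>The payoff: iterating on the internals of <code class="prettyprint">foo</code> invalidates only that crate and the binaries that link it, while the many crates that merely consume the types or the trait keep their cached artifacts.</p>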
<p>Foursquare leaned heavily into microservices and, as a result, broke things up into lots of fine-grained business logic units. The cost of manually maintaining the transitive interface/implementation graph for each of these microservice binaries was high enough that they eventually ended up writing bespoke tooling to do it. It all felt a little silly, but the compile time benefits were absolutely worth it.</p>
<p>On the other end of the spectrum, as <a href="https://materialize.com/blog/next-generation/">Arjun and Frank</a> as well as <a href="https://materialize.com/blog/materialize-architecture/">Brennan</a> have described, Materialize has three high-level architectural concepts: <em>adaptor</em> (control plane), <em>storage</em> (data in and out), and <em>compute</em> (efficient incremental computation, the heart of mz). There is additionally a small handful of internal utilities, one of which (the stash) you’ll see below.</p>
<h1 id="case-study-materialize-storage_1">Case Study: Materialize Storage <a class="head_anchor" href="#case-study-materialize-storage_1">#</a>
</h1>
<p>I recently started doing a bit of work within the implementation of our “storage” layer and found myself surprised with some of the crates that got invalidated while I was iterating. This resulted in a PR <a href="https://github.com/MaterializeInc/materialize/pull/21554">to tease out some <code class="prettyprint">*-types</code> crates that had previously been in the <code class="prettyprint">*-client</code> ones</a>.</p>
<p>Interestingly, the time for building binaries (necessary to run integration tests) while iterating was essentially unchanged: 1m40s -> 1m39s. This is likely because our link times are high and tend to dominate. However, the time it took to check that I had no compile errors was cut in half: 45s -> 23s. This is largely because the heavyweight <code class="prettyprint">mz-sql</code> and <code class="prettyprint">mz-transform</code> crates no longer get invalidated (i.e. notice that they disappear from the graph below).</p>
<p>Deps above <code class="prettyprint">mz-storage-client</code> (before)<sup id="fnref2"><a href="#fn2">2</a></sup></p>
<p><a href="https://svbtleusercontent.com/oYbJ9kx1XEKTqnUBc3nSrn0xspap.png"><img src="https://svbtleusercontent.com/oYbJ9kx1XEKTqnUBc3nSrn0xspap_small.png" alt="storage-before.png"></a></p>
<p>Deps above <code class="prettyprint">mz-storage-client</code> (after)</p>
<p><a href="https://svbtleusercontent.com/hxbWQ33Hc6xCBUTpa5oi9x0xspap.png"><img src="https://svbtleusercontent.com/hxbWQ33Hc6xCBUTpa5oi9x0xspap_small.png" alt="storage-after.png"></a></p>
<h1 id="case-study-materialize-stash_1">Case Study: Materialize Stash <a class="head_anchor" href="#case-study-materialize-stash_1">#</a>
</h1>
<p>Shortly after, a co-worker mentioned in a weekly team sync that he was spending quite a bit of his time compiling while iterating on our internal <em>stash</em> utility. This was particularly interesting to me because each time he changed it, both our <code class="prettyprint">environmentd</code> and <code class="prettyprint">clusterd</code> binaries would be invalidated and recompiled. But conceptually, the stash is only used by the former and shouldn’t be in the dependency graph of the latter at all. The fix turned out (yet again) to be <a href="https://github.com/MaterializeInc/materialize/pull/22240">a new <code class="prettyprint">-types</code> crate</a>.</p>
<p>This result was more dramatic. The full-binary integration test iteration time went from 2m12s to 53s.</p>
<p>Deps above <code class="prettyprint">mz-stash</code> (before)</p>
<p><a href="https://svbtleusercontent.com/eeHvh7wvSfiXzGaWVqSv8v0xspap.png"><img src="https://svbtleusercontent.com/eeHvh7wvSfiXzGaWVqSv8v0xspap_small.png" alt="stash-before.png"></a></p>
<p>Deps above <code class="prettyprint">mz-stash</code> (after)</p>
<p><a href="https://svbtleusercontent.com/2Zq84uT9iwpuj9mnBCVW5G0xspap.png"><img src="https://svbtleusercontent.com/2Zq84uT9iwpuj9mnBCVW5G0xspap_small.png" alt="stash-after.png"></a></p>
<h1 id="difficulties_1">Difficulties <a class="head_anchor" href="#difficulties_1">#</a>
</h1>
<p>As always, things in software are never black and white, nor are they easy. Here is a non-exhaustive list of a few things I’ve seen come up when working on code dependencies:</p>
<ul>
<li>Dependency spaghetti! Foursquare started as a single compilation unit and everything depended on everything else. We had to gradually tease it apart over the course of years. Materialize has the dual benefits of starting with a CTO who understands the importance of internal dependency hygiene (ty Nikhil! <3) as well as a recent rework from local, single-binary deployment to cloud-only (the abstraction boundaries are still in good shape from this).</li>
<li>This sort of work often forces bits of code to be public when they’d rather not be public. The stash example above had a number of these tradeoffs involved. Just this morning I investigated another possible separation where the balance went the other way and I aborted.</li>
<li>Regressions. It’s easy to accidentally re-introduce a dependency that you’ve taken care to remove, even when you’re looking out for it. It’s even easier when co-workers are not yet sold on the benefits. I wrote a tool for Rust called <a href="https://crates.io/crates/cargo-deplint">cargo-deplint</a> that we run in CI to prevent backsliding.</li>
</ul>
<div class="footnotes">
<hr>
<ol>
<li id="fn1">
<p>For example, one of my co-workers has been using Rust’s excellent introspection tools on our codebase and had some results that point at monomorphization. This work is still ongoing. <a href="#fnref1">↩</a></p>
</li>
<li id="fn2">
<p>Generated with <a href="https://crates.io/crates/cargo-depgraph">cargo-depgraph</a> <a href="#fnref2">↩</a></p>
</li>
</ol>
</div>
<h1>Freebase Meets Materialize 3: First Impressions</h1>
<p>Previous posts talked about what I’m hoping to do and some background on the Freebase data. Today, we (finally) take Materialize out for a spin.</p>
<ul>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-1-introduction">Part 1: Introduction</a></li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-2-the-data">Part 2: The Data</a></li>
<li>Part 3: First Impressions (you’re here)</li>
</ul>
<p>First, a quick note: one of my motivations for doing this is to get a feel for Materialize as a user, so I’m going to take my developer hat off and put my user hat on. I’ve only been here a couple of weeks and the first things I’ve been working on have more to do with internals than UX, so I’m hoping this will mostly work.</p>
<p>Spoilers from the future: it turns out to have worked pretty well! What follows are my real, unabridged first interactions with Materialize’s docs and Materialize itself. I end up finding some papercuts as well as some great touch-points where we could have done more to help someone transitioning conceptually from traditional databases to streaming. This was exactly the sort of feedback I was hoping to gather.</p>
<h1 id="installation_1">Installation <a class="head_anchor" href="#installation_1">#</a>
</h1>
<p>There is a <a href="https://materialize.com/docs/cloud/">cloud version of Materialize</a> (currently in private beta), but I prefer to do my development locally, so I downloaded it. Following the <a href="https://materialize.com/docs/install/#homebrew">Install</a> instructions for Homebrew (which is my preference for this sort of thing):</p>
<pre><code class="prettyprint lang-text">$ brew install MaterializeInc/materialize/materialized
[...]
</code></pre>
<p>Hmm, it compiles instead of using a brew bottle? It’s probably because I’m on arm64. I’m going to assume that we (Materialize) have a bottle for x86 but not for arm64 yet. It also feels like brew could have done better here. I would have been okay with installing an x86 binary and running it with Rosetta 2, so I wish it would have asked me if I wanted that or to install a bunch of compile-time dependencies and do a slow compile.</p>
<p>After finishing the compile and installation, brew tells me I can start it with <code class="prettyprint">materialized --threads=1</code>.</p>
<pre><code class="prettyprint lang-text">$ materialized --threads=1
error: Found argument '--threads' which wasn't expected, or isn't valid in this context
USAGE:
materialized [OPTION]...
For more information try --help
</code></pre>
<p>The install page has a pointer to <a href="https://materialize.com/docs/get-started">Get Started</a> which informs me that the updated name for this flag is <code class="prettyprint">materialized -w 1</code>. So one issue here is the brew instructions are out of date. It also seems weird to have renamed this flag without supporting the old one. We’re pre-1.0, so I don’t think we need to commit to perfect backward compatibility, but at initial glance, this was a simple rename from <code class="prettyprint">--threads</code> to <code class="prettyprint">--workers</code>/<code class="prettyprint">-w</code>. It’s pretty easy to alias the old name to the new one (and hide it from docs).</p>
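<p>For what it’s worth, that kind of alias is usually a one-liner. Here’s a hedged sketch of the idea using the <code class="prettyprint">clap</code> crate (an assumption on my part; I don’t actually know how materialized parses its flags):</p>
<pre><code class="prettyprint lang-rust">// Sketch only: a hidden alias keeps the old flag name parsing, assuming
// clap 4.x. The "materialized" name here is just for illustration.
use clap::{Arg, Command};

fn main() {
    let matches = Command::new("materialized")
        .arg(
            Arg::new("workers")
                .short('w')
                .long("workers")
                .alias("threads"), // hidden alias: --threads still parses
        )
        .get_matches();
    println!("workers = {:?}", matches.get_one::<String>("workers"));
}
</code></pre>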
<p>Next up on the Get Started page is to connect to my running Materialize instance:</p>
<pre><code class="prettyprint lang-text">$ psql -U materialize -h localhost -p 6875 materialize
bash: psql: command not found
</code></pre>
<p>I happen to know from working at Cockroach Labs that “psql” comes from PostgreSQL. The Get Started page also notes this in a “Prerequisites” section that I only notice now. This is incredibly nit-picky on my part, but having Prerequisites here feels heavyweight. It makes a ton of sense for other pages in the docs, where it might list having Materialize itself set up as a prerequisite (CockroachDB used to have things like “set up a multi-node cluster” in Prerequisites for some of the docs examples), but for this page I would have had “Make sure you have psql installed” as a step instead. To each their own.</p>
<pre><code class="prettyprint lang-text">$ brew install postgresql
[...]
$ psql -U materialize -h localhost -p 6875 materialize
psql (13.2, server 9.5.0)
Type "help" for help.
materialize=>
</code></pre>
<p>Now we’re in business.</p>
<h1 id="loading-a-file_1">Loading a File <a class="head_anchor" href="#loading-a-file_1">#</a>
</h1>
<p>Just a bit further down the Get Started page, there’s an example of creating a file <code class="prettyprint">SOURCE</code>, which is exactly what I’m looking for. My file isn’t changing, so I’m not going to tail it (yet). I copy the example, remove the tail bit and swap the regex for the <a href="https://github.com/danhhz/scribbles/blob/e7b712f304ce59747e91127e2babbcd63e841e9c/cmd/src/ntriple.rs#L16-L42">monstrosity in my over-engineered tuple partitioning program</a> (I knew this would come in handy). A quick glance at the <a href="https://materialize.com/docs/sql/create-source/text-file/">CREATE SOURCE</a> page for local files shows that Materialize supports <code class="prettyprint">COMPRESSION GZIP</code>, so we’re ready to go!</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE SOURCE film
FROM FILE 'film.nt.gz' COMPRESSION GZIP
FORMAT REGEX '^[ \t]*(?:(?P<comment>#[ -~]*)|(?:<(?:http://rdf.freebase.com/(?P<sub_uri_fb>[ -~]+)|http://www.w3.org/(?P<sub_uri_w3>[ -~]+)|(?P<sub_uri>.+))>|_:(?P<sub_node>[A-Za-z][A-Za-z0-9]*))[ \t]+<(?:http://rdf.freebase.com/(?P<prd_uri_fb>[ -~]+)|http://www.w3.org/(?P<prd_uri_w3>[ -~]+)|(?P<prd_uri>.+))>[ \t]+(?:<(?:http://rdf.freebase.com/(?P<obj_uri_fb>[ -~]+)|http://www.w3.org/(?P<obj_uri_w3>[ -~]+)|(?P<obj_uri>.+))>|_:(?P<obj_node>[A-Za-z][A-Za-z0-9]*)|"(?P<obj_lang_lit>.*)"@(?P<obj_lang_type>[a-z]+(-[a-zA-Z0-9]+)*)|"(?P<obj_data_lit>.*)"\^\^<(?:http://rdf.freebase.com/(?P<obj_data_type_fb>[ -~]+)|http://www.w3.org/(?P<obj_data_type_w3>[ -~]+)|(?P<obj_data_type>.+))>|"(?P<obj_str_lit>.*)")[ \t]*\.[ \t]*|(?P<blank>))$';
CREATE SOURCE
</code></pre>
<pre><code class="prettyprint lang-text">materialize=> SHOW COLUMNS FROM film;
name | nullable | type
------------------+----------+--------
blank | t | text
column15 | t | text
comment | t | text
mz_line_no | f | bigint
obj_data_lit | t | text
obj_data_type | t | text
obj_data_type_fb | t | text
obj_data_type_w3 | t | text
obj_lang_lit | t | text
obj_lang_type | t | text
obj_node | t | text
obj_str_lit | t | text
obj_uri | t | text
obj_uri_fb | t | text
obj_uri_w3 | t | text
prd_uri | t | text
prd_uri_fb | t | text
prd_uri_w3 | t | text
sub_node | t | text
sub_uri | t | text
sub_uri_fb | t | text
sub_uri_w3 | t | text
(22 rows)
</code></pre>
<p>Sweet! Let’s make sure the regex is working correctly. (Also what’s column15?)</p>
<pre><code class="prettyprint lang-text">materialize=> SELECT * FROM film LIMIT 10;
ERROR: Unable to automatically determine a timestamp for your query; this can happen if your query depends on non-materialized sources.
For more details, see https://materialize.com/s/non-materialized-error
</code></pre>
<p>After reading the link, I get why this is the case, but it’s kind of a bummer. I just want to look at enough of the data to verify that the regex is working. The Get Started page is making materialized views and selecting from them, I’ll do that instead:</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE MATERIALIZED VIEW foo AS SELECT * FROM film LIMIT 10;
CREATE VIEW
materialize=> SELECT * FROM foo;
comment | sub_uri_fb | sub_uri_w3 | sub_uri | sub_node | prd_uri_fb | prd_uri_w3 | prd_uri | obj_uri_fb | obj_uri_w3 | obj_uri | obj_node | obj_lang_lit | obj_lang_type | column15 | obj_data_lit | obj_data_type_fb | obj_data_type_w3 | obj_data_type | obj_str_lit | blank | mz_line_no
---------+------------+------------+---------+----------+------------+------------+---------+------------+------------+---------+----------+--------------+---------------+----------+--------------+------------------+------------------+---------------+-------------+-------+------------
(0 rows)
</code></pre>
<p>Huh, okay let’s try putting the <code class="prettyprint">LIMIT 10</code> on the <code class="prettyprint">SELECT</code> instead of the materialized view definition.</p>
<pre><code class="prettyprint lang-text">materialize=> DROP VIEW foo;
DROP VIEW
materialize=> CREATE MATERIALIZED VIEW foo AS SELECT * FROM film;
CREATE VIEW
materialize=> SELECT * FROM foo LIMIT 10;
comment | sub_uri_fb | sub_uri_w3 | sub_uri | sub_node | prd_uri_fb | prd_uri_w3 | prd_uri | obj_uri_fb | obj_uri_w3 | obj_uri | obj_node | obj_lang_lit | obj_lang_type | column15 | obj_data_lit | obj_data_type_fb | obj_data_type_w3 | obj_data_type | obj_str_lit | blank | mz_line_no
---------+--------------+------------+---------+----------+------------+------------+---------+------------+------------+---------+----------+--------------+---------------+----------+--------------+------------------+------------------+---------------+-------------+-------+------------
| ns/m.01y67v | | | | key/en | | | | | | | | | | | | | | kbs | | 26591
| ns/m.0h34n | | | | key/en | | | | | | | | | | | | | | gimli | | 120215
| ns/m.0fns_b | | | | key/en | | | | | | | | | | | | | | nezu | | 99581
| ns/m.0xcy | | | | key/en | | | | | | | | | | | | | | atlantis | | 184125
| ns/m.03k9l5 | | | | key/en | | | | | | | | | | | | | | figwit | | 39070
| ns/m.04zl2r | | | | key/en | | | | | | | | | | | | | | ovidie | | 49112
| ns/m.06r6zc | | | | key/en | | | | | | | | | | | | | | kyknet | | 56779
| ns/m.0btr9d | | | | key/en | | | | | | | | | | | | | | libaas | | 81760
| ns/m.0kprc8 | | | | key/en | | | | | | | | | | | | | | pg_usa | | 141054
| ns/m.09rypdl | | | | key/en | | | | | | | | | | | | | | frode | | 67198
(10 rows)
</code></pre>
<p>Cool! It’s either processed that file really fast or is doing something lazily.</p>
<pre><code class="prettyprint lang-text">materialize=> SELECT COUNT(*) FROM foo;
count
---------
9346717
(1 row)
</code></pre>
<pre><code class="prettyprint lang-text">$ ls -alh film.nt.gz
-rw-r--r-- 1 dan staff 384M Apr 24 12:10 film.nt.gz
</code></pre>
<p>Yeah the count seems low. Nine million tuples aren’t gonna take up 384MB compressed.</p>
<pre><code class="prettyprint lang-text">materialize=> SELECT COUNT(*) FROM foo;
count
----------
19143660
(1 row)
</code></pre>
<p>It’s bigger now! This confirms that the select is returning while the file is still loading. That makes sense given the streaming focus. It’s nice to get results immediately, but my initial impression is that it’s odd for a non-tail file source, which I expect to be loaded atomically. I’m going to chalk this up to my background in OLTP databases and my ongoing adjustment to this new streaming world. (Note from future self: this will become a recurring theme.)</p>
<h1 id="is-my-file-loaded_1">Is my File Loaded? <a class="head_anchor" href="#is-my-file-loaded_1">#</a>
</h1>
<p>I want to know when my file is finished loading, so I poke around the docs sidebar and <a href="https://materialize.com/docs/ops/monitoring/">Monitoring</a> seems promising. It talks about prometheus inside docker, but I haven’t polluted this pristine new work computer with docker yet. A bunch of our testing infra uses it, so I will eventually, but let’s see what other options we have (there was mention of SQL at the top of the page).</p>
<p><em>Side Note: I happened to notice the <a href="https://materialize.com/docs/ops/monitoring/#on-macos-with-materialize-running-outside-of-docker">On macOS, with Materialize running outside of Docker</a> section, which is amazing. I know from experience that if I’d gone the docker route, this would have saved me a lot of time.</em></p>
<p><a href="https://materialize.com/docs/ops/monitoring/#system-catalog-sql-interface">System catalog SQL interface</a> sounds promising! There are links to <a href="https://materialize.com/docs/sql/system-catalog/">SQL documentation</a> and “walkthrough of useful <a href="https://materialize.com/docs/ops/diagnosing-using-sql/">diagnostic queries</a>”. I open them both.</p>
<p>Looking at the system catalog SQL docs, I see <a href="https://materialize.com/docs/sql/system-catalog/#mz_sources">mz_sources</a>, which doesn’t look like it will include loading progress, but I want to run it anyway.</p>
<pre><code class="prettyprint lang-text">materialize=> SELECT * FROM mz_sources;
id | oid | schema_id | name | volatility
-------+-------+-----------+-------------------------------------+------------
u1 | 20234 | 3 | film | unknown
s3022 | 20153 | 1 | mz_peek_active | volatile
s3026 | 20157 | 1 | mz_source_info | volatile
s3024 | 20155 | 1 | mz_peek_durations | volatile
...
(18 rows)
</code></pre>
<p>Cool! I assume <code class="prettyprint">u</code> is user and <code class="prettyprint">s</code> is system.</p>
<p>The second page of useful diagnostic queries has a section titled <a href="https://materialize.com/docs/ops/diagnosing-using-sql/#are-my-sources-loading-data-in-a-reasonable-fashion">Are my sources loading data in a reasonable fashion?</a> Exactly what I’m here for, bravo! <3</p>
<p>Oh the answer is to run <code class="prettyprint">SELECT count(*)</code>. I literally LOL’d.</p>
<p>It also mentions <code class="prettyprint">mz_materialization_frontiers</code>:</p>
<pre><code class="prettyprint lang-text">materialize=> select * from mz_materialization_frontiers;
global_id | time
-----------+---------------
s3001 | 1619797365000
s3003 | 1619797365000
s3005 | 1619797365000
...
(38 rows)
</code></pre>
<p>This is clearly the same id space as my <code class="prettyprint">mz_sources</code> query above but none of them match up. Oh right! It’s probably materialized views (which I can select from), not sources (which I can’t). Before I bother to reopen that first page, I bet <code class="prettyprint">mz_views</code> is a thing.</p>
<pre><code class="prettyprint lang-text">materialize=> SELECT * FROM mz_views;
id | oid | schema_id | name | volatility
-------+-------+-----------+-----------------------------------+------------
u4 | 20237 | 3 | foo | unknown
s5022 | 20229 | 2 | pg_proc | unknown
s5024 | 20231 | 2 | pg_enum | unknown
s5021 | 20228 | 2 | pg_type | volatile
...
(26 rows)
</code></pre>
<p>Yup but nope. Still none of them match up.</p>
<p>At this point, I’m going to give up for now and decide that <code class="prettyprint">wc -l</code> and waiting for that number in <code class="prettyprint">SELECT count(*)</code> is how I’d do it. I don’t see how this would work for more complex materialized views because I wouldn’t have a good way to reason about how many rows would be in them once they finished loading. I guess I could keep re-running the <code class="prettyprint">SELECT count(*)</code> until it stops changing? Dunno, maybe this is all just me still adjusting to streaming paradigms.</p>
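<p>If I wanted to automate that last idea, a loop like the following would do it. This is a sketch using the <code class="prettyprint">postgres</code> crate and the <code class="prettyprint">foo</code> view from above, and it’s emphatically a hack, not a real completeness signal:</p>
<pre><code class="prettyprint lang-rust">// Poll count(*) until it stops changing between intervals. A hack:
// a slow source could easily sit still for one interval and fool us.
use std::{thread, time::Duration};

use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    let mut client = Client::connect(
        "host=localhost port=6875 user=materialize dbname=materialize",
        NoTls,
    )?;
    let mut last: i64 = -1;
    loop {
        let count: i64 = client.query_one("SELECT count(*) FROM foo", &[])?.get(0);
        if count == last {
            break; // unchanged for one interval; probably loaded
        }
        last = count;
        thread::sleep(Duration::from_secs(5));
    }
    println!("loaded: {} rows", last);
    Ok(())
}
</code></pre>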
<p>Let’s drop this test view. It showed what it needed to show.</p>
<pre><code class="prettyprint lang-text">materialize=> DROP VIEW foo;
DROP VIEW
</code></pre>
<h1 id="debug-endpoint_1">Debug Endpoint <a class="head_anchor" href="#debug-endpoint_1">#</a>
</h1>
<p>So, what’s my first API endpoint going to be? Honestly, at this point, I just want to explore the data. Let’s start with a page that, given an id, shows the name of the thing, all the triples where it is the subject, and linkifies everything. That will let me easily poke around.</p>
<p>I’ve got a <code class="prettyprint">film</code> source, but it’ll be useful to have links to stuff in <code class="prettyprint">common.nt.gz</code> work as well, so let’s make a second source.</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE SOURCE common
FROM FILE 'common.nt.gz' COMPRESSION GZIP
FORMAT REGEX '^[ \t]*(?:(?P<comment>#[ -~]*)|(?:<(?:http://rdf.freebase.com/(?P<sub_uri_fb>[ -~]+)|http://www.w3.org/(?P<sub_uri_w3>[ -~]+)|(?P<sub_uri>.+))>|_:(?P<sub_node>[A-Za-z][A-Za-z0-9]*))[ \t]+<(?:http://rdf.freebase.com/(?P<prd_uri_fb>[ -~]+)|http://www.w3.org/(?P<prd_uri_w3>[ -~]+)|(?P<prd_uri>.+))>[ \t]+(?:<(?:http://rdf.freebase.com/(?P<obj_uri_fb>[ -~]+)|http://www.w3.org/(?P<obj_uri_w3>[ -~]+)|(?P<obj_uri>.+))>|_:(?P<obj_node>[A-Za-z][A-Za-z0-9]*)|"(?P<obj_lang_lit>.*)"@(?P<obj_lang_type>[a-z]+(-[a-zA-Z0-9]+)*)|"(?P<obj_data_lit>.*)"\^\^<(?:http://rdf.freebase.com/(?P<obj_data_type_fb>[ -~]+)|http://www.w3.org/(?P<obj_data_type_w3>[ -~]+)|(?P<obj_data_type>.+))>|"(?P<obj_str_lit>.*)")[ \t]*\.[ \t]*|(?P<blank>))$';
CREATE SOURCE
</code></pre>
<p>And union them together.</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE VIEW freebase AS SELECT * FROM common UNION ALL SELECT * FROM film;
CREATE VIEW
</code></pre>
<p>I plan to join everything to its user-facing name, so let’s make a view for that to make it easier later.</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE VIEW id_names AS SELECT sub_uri_fb AS id, obj_lang_lit AS name, obj_lang_type AS lang FROM freebase WHERE prd_uri_fb = 'ns/type.object.name';
CREATE VIEW
</code></pre>
<p>I’m the type of person that likes to see things work as I go and we’re about at that point, so I inspect <code class="prettyprint">id_names</code> using the (soon to be very common) throwaway materialized view plus select trick.</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE MATERIALIZED VIEW foo AS SELECT * FROM id_names;
CREATE VIEW
materialize=> SELECT * FROM foo LIMIT 10;
id | name | lang
--------------+------+------
ns/m.0dq6p | VHS | en
ns/m.05r9bx | Ed | en
ns/m.09nrrz | 15 | en
ns/m.0gxwk_ | RJ | en
ns/m.0kvd3l | Da | en
ns/m.03d019n | G | en
ns/m.03hj3r6 | K | en
ns/m.03w7wq_ | A | en
ns/m.03w7wtc | D | en
ns/m.03w854r | S | en
(10 rows)
materialize=> DROP VIEW foo;
DROP VIEW
</code></pre>
<p>That’s working nicely. Next up is a view that uses <code class="prettyprint">id_names</code> to name everything in each tuple.</p>
<p>This is where I note that 5 years of building a SQL database doesn’t make you a SQL expert. In fact, I’m very much still a SQL novice: probably 95%+ of the SQL I’ve written in my life is in CockroachDB unit tests and most of it is for stuff like <code class="prettyprint">BACKUP</code>, <code class="prettyprint">RESTORE</code>, <code class="prettyprint">CHANGEFEED</code>, and <code class="prettyprint">IMPORT</code>. I write down the first thing that comes to mind, which is probably a terrible way to do this:</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE VIEW named_tuples AS
SELECT
f.sub_uri_fb AS sub_id,
sub_n.name AS sub_name_en,
f.prd_uri_fb AS prd_id,
prd_n.name AS prd_name_en,
f.obj_uri_fb AS obj_id,
obj_n.name AS obj_name_en
FROM
(
SELECT
*
FROM
freebase
WHERE
sub_uri_fb IS NOT NULL
AND prd_uri_fb IS NOT NULL
AND obj_uri_fb IS NOT NULL
)
AS f
JOIN (SELECT * FROM id_names WHERE lang = 'en')
AS sub_n ON f.sub_uri_fb = sub_n.id
JOIN (SELECT * FROM id_names WHERE lang = 'en')
AS prd_n ON f.prd_uri_fb = prd_n.id
JOIN (SELECT * FROM id_names WHERE lang = 'en')
AS obj_n ON f.obj_uri_fb = obj_n.id;
CREATE VIEW
</code></pre>
<p><em>Side note: Thank deity (and <a href="https://twitter.com/mjibson">Matt Jibson</a>) for <a href="https://sqlfum.pt/">https://sqlfum.pt/</a>.</em></p>
<p>You know what’s coming next.</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE MATERIALIZED VIEW foo AS SELECT * FROM named_tuples;
CREATE VIEW
materialize=> SELECT * FROM foo LIMIT 10;
sub_id | sub_name_en | prd_id | prd_name_en | obj_id | obj_name_en
---------------+-----------------------------------+--------------------------------------+-------------+----------------+------------------------------
ns/m.01119bmt | 2009 QDoc | ns/film.film_festival_event.festival | Festival | ns/m.0lm919d | QDoc
ns/m.0r9mpb7 | 2010 KidFilm Festival | ns/film.film_festival_event.festival | Festival | ns/m.011dxlp0 | KidFilm Festival
ns/m.0zb4wdg | 2013 Palić Film Festival | ns/film.film_festival_event.festival | Festival | ns/m.02wxclb | Palić Film Festival
ns/m.0111b2xs | 2012 Fête de l'Animation | ns/film.film_festival_event.festival | Festival | ns/g.12214qrxp | Fête de l'Animation
ns/m.0111b2xs | 2012 Fête de l'Animation | ns/film.film_festival_event.festival | Festival | ns/g.12214qrxp | Fête de l'Animation
ns/m.0111b2xs | 2012 Fête de l'Animation | ns/film.film_festival_event.festival | Festival | ns/g.12214qrxp | Fête de l'Animation
ns/m.010h2sg3 | 2011 11mm Filmfestival Berlin | ns/film.film_festival_event.festival | Festival | ns/m.0bdxcb4 | 11mm Filmfestival Berlin
ns/m.0rh99r7 | 2003 Panorama of European Cinema | ns/film.film_festival_event.festival | Festival | ns/m.0107tj0d | Panorama of European Cinema
ns/m.01069sst | 2013 Neum Animated Film Festival | ns/film.film_festival_event.festival | Festival | ns/m.01069v41 | Neum Animated Film Festival
ns/m.010h61_1 | 2012 Portland Maine Film Festival | ns/film.film_festival_event.festival | Festival | ns/m.0100zwb5 | Portland Maine Film Festival
(10 rows)
materialize=> DROP VIEW foo;
DROP VIEW
</code></pre>
<p>Beautiful.</p>
<p>Here, I’d like to take a brief pause to mention that I had been vaguely planning to, at some point, redo all this on top of file sources <em>with</em> the <code class="prettyprint">TAIL</code> option to show off Materialize’s incremental computation. But it turns out I don’t need to, because it’s doing it for me. I think this is just another data point in favor of “I’m still wrapping my head around streaming paradigms”.</p>
<h1 id="a-materialized-view-of-my-very-own_1">A Materialized View of My Very Own <a class="head_anchor" href="#a-materialized-view-of-my-very-own_1">#</a>
</h1>
<p>On to our final view, this one materialized because, as described in <a href="http://blog.danhhz.com/freebase-meets-materialize-1-introduction">Part 1</a>, it’s what our API server will be a thin wrapper around. I went back and forth on how to structure it. The natural SQL way would be rows like:</p>
<pre><code class="prettyprint lang-text">id, name, prd_id1, prd_name_en1, obj_id1, obj_name_en1
id, name, prd_id2, prd_name_en2, obj_id2, obj_name_en2
</code></pre>
<p>This would repeat “id” and “name” for each tuple, which is wasteful and unsatisfying. Given that the API endpoint is going to return json anyway, why explode it just to unexplode it later? After some mulling, I ended up with a key-value structure of id -> jsonb endpoint response.</p>
<p>Without further ado… (I’m so so sorry.)</p>
<pre><code class="prettyprint lang-text">materialize=> CREATE MATERIALIZED VIEW api_debug AS
SELECT
sub_n.id AS sub_id,
jsonb_build_object(
'sub_id',
sub_n.id,
'sub_name_en',
sub_n.name_en,
'sub_tuples',
jsonb_build_array(sub_t.named_tuples)
) AS json
FROM
(
SELECT
sub_id AS id, sub_name_en AS name_en
FROM
named_tuples
)
AS sub_n
JOIN LATERAL (
SELECT
sub_id AS id,
jsonb_agg(named_tuple) AS named_tuples
FROM
(
SELECT
sub_id,
jsonb_build_object(
'sub_id',
sub_id,
'sub_name_en',
sub_name_en,
'prd_id',
prd_id,
'prd_name_en',
prd_name_en,
'obj_id',
obj_id,
'obj_name_en',
obj_name_en
)
AS named_tuple
FROM
named_tuples
)
GROUP BY
sub_id
)
AS sub_t ON sub_n.id = sub_t.id;
CREATE VIEW
materialize=> SELECT * FROM api_debug LIMIT 10;
sub_id | json
---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ns/m.01119bmt | {"sub_id":"ns/m.01119bmt","sub_name_en":"2009 QDoc","sub_tuples":[[{"obj_id":"ns/m.0lm919d","obj_name_en":"QDoc","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.01119bmt","sub_name_en":"2009 QDoc"}]]}
ns/m.0r9mpb7 | {"sub_id":"ns/m.0r9mpb7","sub_name_en":"2010 KidFilm Festival","sub_tuples":[[{"obj_id":"ns/m.011dxlp0","obj_name_en":"KidFilm Festival","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0r9mpb7","sub_name_en":"2010 KidFilm Festival"}]]}
ns/m.0zb4wdg | {"sub_id":"ns/m.0zb4wdg","sub_name_en":"2013 Palić Film Festival","sub_tuples":[[{"obj_id":"ns/m.02wxclb","obj_name_en":"Palić Film Festival","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0zb4wdg","sub_name_en":"2013 Palić Film Festival"}]]}
ns/m.010h2sg3 | {"sub_id":"ns/m.010h2sg3","sub_name_en":"2011 11mm Filmfestival Berlin","sub_tuples":[[{"obj_id":"ns/m.0bdxcb4","obj_name_en":"11mm Filmfestival Berlin","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.010h2sg3","sub_name_en":"2011 11mm Filmfestival Berlin"}]]}
ns/m.0rh99r7 | {"sub_id":"ns/m.0rh99r7","sub_name_en":"2003 Panorama of European Cinema","sub_tuples":[[{"obj_id":"ns/m.0107tj0d","obj_name_en":"Panorama of European Cinema","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0rh99r7","sub_name_en":"2003 Panorama of European Cinema"}]]}
ns/m.01069sst | {"sub_id":"ns/m.01069sst","sub_name_en":"2013 Neum Animated Film Festival","sub_tuples":[[{"obj_id":"ns/m.01069v41","obj_name_en":"Neum Animated Film Festival","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.01069sst","sub_name_en":"2013 Neum Animated Film Festival"}]]}
ns/m.010h61_1 | {"sub_id":"ns/m.010h61_1","sub_name_en":"2012 Portland Maine Film Festival","sub_tuples":[[{"obj_id":"ns/m.0100zwb5","obj_name_en":"Portland Maine Film Festival","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.010h61_1","sub_name_en":"2012 Portland Maine Film Festival"}]]}
ns/m.0111b2xs | {"sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation","sub_tuples":[[{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"}]]}
ns/m.0111b2xs | {"sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation","sub_tuples":[[{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"}]]}
ns/m.0111b2xs | {"sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation","sub_tuples":[[{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"}]]}
(10 rows)
</code></pre>
<p>Alright, I’m really changing my tune on this whole “show stuff right away” bit. Getting this SQL working took a number of tries. If I’d been using anything but Materialize to prototype this, I would have had to wait quite a while to get the results of each attempt (or manually make an even smaller subset of the freebase data). With Materialize, creating the view, selecting from it, dropping it, and trying again were all basically instantaneous.</p>
<p><em>Brainstorming: Thinking about how I’d use this in production, I wish there was some way for my select to block until all the sources were “caught up enough” so I don’t serve incomplete results. For a non-tailed file, what this means is straightforward: when the whole file is loaded. On the other hand, what it means for tailed files, kafka, etc is less clear. Maybe when it’s caught up to within some time delta of where the source is at? I can see why we haven’t solved this yet, there are some meaty product and UX questions here.</em></p>
<p>I’m going to be using this view for simple id lookups, so I want to make sure there’s an index on <code class="prettyprint">id</code>. I didn’t see a place in the <code class="prettyprint">CREATE MATERIALIZED VIEW</code> grammar to specify an index, but the docs page mentions an <a href="https://materialize.com/docs/sql/create-materialized-view/#indexes">index is automatically made</a> for me.</p>
<pre><code class="prettyprint lang-text">materialize=> SHOW INDEXES FROM api_debug;
on_name | key_name | seq_in_index | column_name | expression | nullable
-----------+-----------------------+--------------+-------------+------------+----------
api_debug | api_debug_primary_idx | 1 | sub_id | | t
api_debug | api_debug_primary_idx | 2 | json | | t
(2 rows)
</code></pre>
<p>Yep, that should work. Let’s look for Erica Albright from part 2.</p>
<pre><code class="prettyprint lang-text">materialize=> SELECT * FROM api_debug WHERE sub_id = 'ns/m.09r8m05';
sub_id | api_debug
--------+--------------------
(0 rows)
</code></pre>
<p>Bummer, but that makes sense. It’s still loading things. Let’s wait for Materialize to catch up and try again.</p>
<p>Before that happens…</p>
<h1 id="oom_1">OOM <a class="head_anchor" href="#oom_1">#</a>
</h1>
<pre><code class="prettyprint lang-text">Killed: 9
</code></pre>
<p>After a brief investigation (aka asking in slack), it looks like this means Materialize is running out of memory (and swap?). I restarted it while watching Activity Monitor and confirmed. At some point, while browsing the docs today, I saw something about debugging and optimizing memory usage, but if possible I’d like to come back to that later.</p>
<p>Let’s see if something dumb and easy works to unblock us. I restart Materialize, quickly drop the view and recreate it with <code class="prettyprint">LIMIT 10</code> added to the <code class="prettyprint">(SELECT ... FROM named_tuples) AS sub_n</code> above.</p>
<p>Sorta? It’s using swap but not crashing anymore. Now to hook it up to a webpage.</p>
<h1 id="success_1">Success! <a class="head_anchor" href="#success_1">#</a>
</h1>
<p>I’m out of practice here, so I’d like something minimal to serve HTTP requests and query Materialize for the data. With a bit of looking, I found <a href="https://github.com/http-rs/tide">tide</a>, which seems to be associated with the official rust folks in some way. After a bit of iteration, I managed to get something working end-to-end!</p>
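<p>To give a flavor of how thin the wrapper is, here’s a rough sketch of the shape of the server (assuming tide 0.16 and async-std with the <code class="prettyprint">attributes</code> feature; the actual Materialize lookup, a <code class="prettyprint">SELECT json FROM api_debug WHERE sub_id = ...</code> over pgwire, is stubbed out here):</p>
<pre><code class="prettyprint lang-rust">use tide::Request;

// Stub standing in for the pgwire query against the api_debug view.
async fn lookup_json(id: &str) -> tide::Result<String> {
    Ok(format!("{{\"sub_id\": \"ns/{}\"}}", id))
}

async fn topic(req: Request<()>) -> tide::Result<String> {
    let id = req.param("id")?; // e.g. "m.09r8m05"
    lookup_json(id).await
}

#[async_std::main]
async fn main() -> tide::Result<()> {
    let mut app = tide::new();
    app.at("/topic/:id").get(topic);
    app.listen("127.0.0.1:8080").await?;
    Ok(())
}
</code></pre>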
<p><a href="https://svbtleusercontent.com/3QnvvGQs1j6PDKaWG3YPnE0xspap.png"><img src="https://svbtleusercontent.com/3QnvvGQs1j6PDKaWG3YPnE0xspap_small.png" alt="Screen Shot 2021-04-30 at 3.27.06 PM.png"></a></p>
<p>I’m running out of time for the day, so I’ll have to go into more detail in a later post. The code needs a bit of cleanup before I push it anywhere, so that will have to wait, too.</p>
<p>Funny enough, while iterating on the web frontend, I noticed that the set of 10 things being selected by my <code class="prettyprint">LIMIT 10</code> hack is changing over time, which means I have to keep finding another id to test with. Makes sense once I think about it.</p>
<h1 id="next-up-debugging-memory-usage_1">Next Up: Debugging Memory Usage <a class="head_anchor" href="#next-up-debugging-memory-usage_1">#</a>
</h1>
<p>I’d like to polish up the web frontend and get it deployed somewhere, but it’s pretty clear that my next task is to figure out how to write these views so they don’t knock over the materialized process. I’d only planned through here when I started all this, but now I know what Part 4 is going to be.</p>
<p>Encouragingly, it feels like the basic idea is workable. When our Chief Scientist <a href="https://github.com/frankmcsherry/">Frank</a> read part 1, he pointed me at <a href="https://github.com/comnik/declarative-dataflow">Declarative Dataflow</a>, which efficiently processes queries over <code class="prettyprint">(subject, predicate, object)</code> 3-tuples and is built on top of the same <a href="https://github.com/TimelyDataflow/differential-dataflow">Differential Dataflow</a> incremental computation framework that powers Materialize. So there’s no reason we shouldn’t be able to do it, too.</p>
<p>There were a few bumps along the way that we (Materialize) can polish up pretty easily, and I’ll file issues for those. I think there’s also a larger takeaway here around helping users that are new to streaming wrap their heads around its unfamiliar paradigms. These sorts of discoveries are exactly why dogfooding is so important and why I wanted to do it right when I started and had fresh eyes.</p>
<p>Stay tuned!</p>
<ul>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-1-introduction">Part 1: Introduction</a></li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-2-the-data">Part 2: The Data</a></li>
<li>Part 3: First Impressions (you’re here)</li>
</ul>
<h1>Freebase Meets Materialize 2: The Data</h1>
<p><a href="http://blog.danhhz.com/freebase-meets-materialize-1-introduction">Last post</a>, I introduced the idea of using <a href="https://materialize.com/">Materialize</a> to implement fast reads of highly normalized <a href="https://en.wikipedia.org/wiki/Freebase_(database)">Freebase</a> data for an API endpoint. Today, we start by downloading the data and doing a bit of preprocessing.</p>
<ul>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-1-introduction">Part 1: Introduction</a></li>
<li>Part 2: The Data (you’re here)</li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-3-first-impressions">Part 3: First Impressions</a></li>
</ul>
<h1 id="19-billion-triples_1">1.9 Billion Triples <a class="head_anchor" href="#19-billion-triples_1">#</a>
</h1>
<p>The final public copy of the Freebase data can be downloaded at <a href="https://developers.google.com/freebase/">https://developers.google.com/freebase/</a>. It’s a 22 GB gzip (250 GB uncompressed) of <a href="https://www.w3.org/TR/rdf-testcases/#ntriples">N-Triples</a>, which is a text-based data format with a spec and everything. Each line is a <code class="prettyprint"><subject, predicate, object></code> triple and according to this page, there are 1.9 billion of them.</p>
<p>In the interest of fast iteration, I’d like to start with something that comfortably fits in memory. Before we can trim down the data, we have to look at how it’s structured.</p>
<h1 id="structure-of-freebase-data_1">Structure of Freebase Data <a class="head_anchor" href="#structure-of-freebase-data_1">#</a>
</h1>
<p>This is all better explained by the since-removed API documentation (thank you, Internet Archive), but I’ll go over a bit of it.</p>
<p>Freebase data is a structured representation of things and relationships between those things. In this case, “things” includes concrete stuff like people, films, and music but also more nebulous concepts like love (which can be the topic of a book) plus really anything with a page in Wikipedia. The things are called <em>topics</em> and each has a stable unique identifier. Most of these IDs look like <code class="prettyprint">http://rdf.freebase.com/ns/m.09r8m05</code> (what Freebase calls a <em>MID</em>). The interesting part of this is the last bit (<code class="prettyprint">m.09r8m05</code>) which is a base-32 encoded integer. The <code class="prettyprint">m</code> can also be a <code class="prettyprint">g</code> for reasons. Some things in Freebase use a more human readable ID that looks like <code class="prettyprint">http://rdf.freebase.com/ns/film.film</code>. (I think the human readable ones also have a corresponding MID, but it’s been a while and I’m not sure.)</p>
<p>Each line in the data represents a 3-tuple of subject, predicate, and object. I personally understand this best with some examples (IDs shortened for clarity):</p>
<pre><code class="prettyprint lang-text"><.../m.09r8m05> <.../type.object.name> "Erica Albright"@en
</code></pre>
<p>Here <code class="prettyprint">m.09r8m05</code> is the ID of the character Erica Albright in the 2010 film The Social Network. This tuple expresses the name of the character in English. This <code class="prettyprint">m.09r8m05</code> topic also has a <em>type</em> (in fact it has multiple), which tells you what sort of thing it is:</p>
<pre><code class="prettyprint lang-text"><.../m.09r8m05> <.../type.object.type> <.../common.topic>
<.../m.09r8m05> <.../type.object.type> <.../film.film_character>
</code></pre>
<p>Here <code class="prettyprint">film.film_character</code> is mostly self-explanatory and <code class="prettyprint">common.topic</code> is the most general topic type. It roughly corresponds to <em>anything</em> that could have an entry in Wikipedia (ignoring Wikipedia’s notability requirement).</p>
<p>When something (like <code class="prettyprint">m.09r8m05</code>) has a type (like <code class="prettyprint">film.film_character</code>) it receives the ability to have the sorts of relationships granted by the type. Said another way, getting typed as <code class="prettyprint">film.film_character</code> opens up some new predicates for use with <code class="prettyprint">m.09r8m05</code>. The interesting thing about the predicates is that they also get IDs and information about them is also stored in Freebase, meaning that the schema of the data is stored in the data itself.</p>
<p>I looked for <code class="prettyprint">film.film_character</code> but grep-ing through 250GB takes… a while, so here’s <code class="prettyprint">film.film.directed_by</code>:</p>
<pre><code class="prettyprint lang-text"><.../film.film.directed_by> <.../type.property.reverse_property> <.../film.director.film>
<.../film.film.directed_by> <.../type.property.unique> "false"
<.../film.film.directed_by> <..../type.object.type> <.../type.property>
<.../film.film.directed_by> <.../type.property.expected_type> <.../film.director>
<.../ns/film.film.directed_by> <.../type.object.name> "Directed by"@en
<.../ns/film.film.directed_by> <.../type.property.schema> <.../film.film>
</code></pre>
<p>A <code class="prettyprint">type.object.type</code> of <code class="prettyprint">type.property</code> means that <code class="prettyprint">film.film.directed_by</code> can be used as a predicate. An <code class="prettyprint">expected_type</code> of <code class="prettyprint">film.director</code> means that the object of triples with that predicate will be of <code class="prettyprint">type.object.type</code> -> <code class="prettyprint">film.director</code>. And then <code class="prettyprint">type.property.schema</code> is what indicates that the subject of such a triple will be a <code class="prettyprint">film.film</code> (I think).</p>
<p>Also note here <code class="prettyprint">type.property.unique</code> -> <code class="prettyprint">false</code>, meaning a film can have multiple directors. This is an instance of what I was talking about in the last post where the constraints (foreign keys/checks/etc) are also part of the data.</p>
<p>The remaining triple here is <code class="prettyprint">type.property.reverse_property</code>, which establishes a relationship between “film F was directed by director D” and “director D directed film F”. At initial glance this seems to me to be completely redundant information, but who knows.</p>
<p>It’s clear from just grep-ing around this data that my intuition is correct and 250GB is too much to play around with, so it’s time to cut it down.</p>
<h1 id="fewer-than-19-billion-triples_1">Fewer than 1.9 Billion Triples <a class="head_anchor" href="#fewer-than-19-billion-triples_1">#</a>
</h1>
<p>In something like <code class="prettyprint">film.film</code>, the first “film” is something Freebase calls a <em>domain</em>, which is a grouping of related types. (In addition to things like “film” and “people”, there is also a “user” domain in Freebase, which let anybody explore making their own schemas. There are some real gems in there, but I’ll leave that for you to explore.)</p>
<p>I decided to roughly group everything by domain. So, for example, all the film schemas and data will be in their own N-Triples file. That feels separable enough that I could do some iterative prototyping. The immediate hiccup is that a topic can have multiple types (<code class="prettyprint">person.person</code> and <code class="prettyprint">film.director</code>). This wouldn’t otherwise be an issue except I’m certainly going to want to render the name, which is a <code class="prettyprint">common.topic.alias</code>. Having the names of everything in the file for the “common” domain isn’t really going to cut a lot of data.</p>
<p>It may be true that “David Fincher” is a <code class="prettyprint">person.person</code>, but it’s probably more interesting that he’s a <code class="prettyprint">film.director</code>, so I think it’d be okay for all of the “David Fincher” data to end up with the film data. Luckily there’s a property called <code class="prettyprint">kg.object_profile.prominent_type</code> that’s exactly what I want. This is an attempt to assign one most notable type to a topic. It’s not present for all topics, but it’s good enough.</p>
<p>This means I can use prominent_type to create an ID -> domain map that will be used to route each triple to its domain file. Sadly the data in the dump isn’t ordered such that I can make this map on the fly without buffering, so I do a preprocessing step of grep-ing every <code class="prettyprint">kg.object_profile.prominent_type</code> into one file.</p>
<pre><code class="prettyprint lang-text">$ time zfgrep "kg.object_profile.prominent_type" freebase-rdf-latest.gz | gzip -c > ids.nt.gz
real 33m36.427s
user 33m4.282s
sys 0m18.399s
</code></pre>
<p>Good thing we only have to run that once.</p>
<p>Then I write an <a href="https://github.com/danhhz/scribbles/blob/e7b712f304ce59747e91127e2babbcd63e841e9c/cmd/src/partition_triples.rs">over-engineered and under-documented rust program</a> to partition the full data into one-per-domain files. It’s over-engineered because it sounded fun and that sort of thing is exactly what Skunkworks Fridays are for. I also suspect things like an <a href="https://github.com/danhhz/scribbles/blob/e7b712f304ce59747e91127e2babbcd63e841e9c/cmd/src/ntriple.rs#L16-L42">N-Triples parser</a> (<a href="https://github.com/danhhz/scribbles/blob/e7b712f304ce59747e91127e2babbcd63e841e9c/cmd/src/ntriple.rs#L325-L331">tested</a> against the <a href="https://github.com/danhhz/scribbles/blob/master/cmd/src/w3_golden.nt">spec’s golden file</a>!) will be “useful” “later” for “something”. (This is not foreshadowing, I’m genuinely just here madly hand-waving over everything.)</p>
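<p>Stripped of the over-engineering, the program boils down to two passes over gzipped files. Here’s a simplified sketch of that shape (assuming the <code class="prettyprint">flate2</code> crate, naive whitespace splitting instead of the real N-Triples parser, and plain <code class="prettyprint">.nt</code> output instead of gzip):</p>
<pre><code class="prettyprint lang-rust">use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

use flate2::read::GzDecoder;

fn main() -> std::io::Result<()> {
    // Pass 1: MID -> domain, from the pre-grepped prominent_type triples.
    // "<.../ns/m.0abc> <.../prominent_type> <.../ns/film.film> ." => "film"
    let mut domains: HashMap<String, String> = HashMap::new();
    let ids = BufReader::new(GzDecoder::new(File::open("ids.nt.gz")?));
    for line in ids.lines() {
        let line = line?;
        let mut parts = line.split_whitespace();
        if let (Some(sub), Some(_prd), Some(obj)) = (parts.next(), parts.next(), parts.next()) {
            if let Some(domain) = obj.rsplit('/').next().and_then(|t| t.split('.').next()) {
                domains.insert(sub.to_string(), domain.to_string());
            }
        }
    }

    // Pass 2: stream the full dump, routing each triple by its subject's
    // domain to a per-domain file (the ./freebase/ dir must already exist).
    let dump = BufReader::new(GzDecoder::new(File::open("freebase-rdf-latest.gz")?));
    let mut files: HashMap<String, BufWriter<File>> = HashMap::new();
    for line in dump.lines() {
        let line = line?;
        if let Some(sub) = line.split_whitespace().next() {
            if let Some(domain) = domains.get(sub) {
                let out = files.entry(domain.clone()).or_insert_with(|| {
                    BufWriter::new(File::create(format!("freebase/{}.nt", domain)).unwrap())
                });
                writeln!(out, "{}", line)?;
            }
        }
    }
    Ok(())
}
</code></pre>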
<p>To further cut down the data, I filter out stuff like non-English names. I also filter out the fun stuff in the user domain for now. :(</p>
<pre><code class="prettyprint lang-text">$ time cargo run -p cmd --release -- ids.nt.gz freebase-rdf-latest.gz ./freebase/
...
real 114m1.647s
user 113m13.981s
sys 0m30.703s
</code></pre>
<p>Glad I only have to run that once, too.</p>
<p>Here are the resulting Totally Reasonable ™ file sizes:</p>
<pre><code class="prettyprint lang-text">-rw-r--r-- 1 dan staff 2.7G Apr 24 12:10 music.nt.gz
-rw-r--r-- 1 dan staff 804M Apr 24 12:10 book.nt.gz
-rw-r--r-- 1 dan staff 384M Apr 24 12:10 film.nt.gz
-rw-r--r-- 1 dan staff 373M Apr 24 12:10 tv.nt.gz
-rw-r--r-- 1 dan staff 363M Apr 24 12:10 location.nt.gz
-rw-r--r-- 1 dan staff 275M Apr 24 12:10 business.nt.gz
-rw-r--r-- 1 dan staff 233M Apr 24 12:10 people.nt.gz
-rw-r--r-- 1 dan staff 112M Apr 24 12:10 biology.nt.gz
-rw-r--r-- 1 dan staff 66M Apr 24 12:10 education.nt.gz
-rw-r--r-- 1 dan staff 64M Apr 24 12:10 government.nt.gz
...
</code></pre>
<h1 id="yawns-pointedly_1">Yawns Pointedly <a class="head_anchor" href="#yawns-pointedly_1">#</a>
</h1>
<p>I know, I know. Next week, I (finally) fire up Materialize and do something with it.</p>
<ul>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-1-introduction">Part 1: Introduction</a></li>
<li>Part 2: The Data (you’re here)</li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-3-first-impressions">Part 3: First Impressions</a></li>
</ul>
<h1>Freebase Meets Materialize 1: Introduction</h1>
<p>I recently started working at <a href="https://materialize.com/">Materialize</a>. Friday here is called “Skunkworks Friday” and is reserved for personal/professional development, moonshot projects, and other things that don’t get priority as part of the normal product+engineering planning cycle. I’ve decided to use my first few to prototype using Materialize as a generalized replacement for some hand-rolled infrastructure microservices that we had at a previous company.</p>
<ul>
<li>Part 1: Introduction (you’re here)</li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-2-the-data">Part 2: The Data</a></li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-3-first-impressions">Part 3: First Impressions</a></li>
</ul>
<h1 id="background_1">Background <a class="head_anchor" href="#background_1">#</a>
</h1>
<p>For several years, I worked at Foursquare, back when they were mostly a consumer tech company. I was on the monetization team, but most people worked on the user-facing app and website. Foursquare, like most apps at the time, kept data in a database but encapsulated this in a REST API. This API is what the mobile apps and the website talked to.</p>
<p>As was (and is) best practice, the data in the system of record database was <em>normalized</em>. Each user, venue, checkin, tip, etc. was its own record with a <em>unique identifier</em>. A checkin (user U is at place P at time T) would then refer to the associated venue and user by embedding their unique identifiers. Normalization is powerful because it means that updates (e.g. changing the name of a venue) only need to happen in one place: the canonical record.</p>
<p>The flip side to normalization is that most uses of data require joining the records together. The API to retrieve information about a checkin would bring in the user and venue records so that the app could render their names. Some API endpoints (like information about a single checkin) were simple enough that these joins could be done on the fly and still be fast enough that the app felt responsive to the user. Others required joining so much together that if we’d done it when the endpoint was called, it would take too long and the app would have felt unresponsive. This might be something like the API to get information about a venue, which pulled in tips about the venue, the users that wrote those tips, your friend graph relationship to the users that wrote those tips, and so on.</p>
<p>The opposite of normalization is <em>denormalization</em>. For example, though we didn’t do this, we could have embedded the user and venue names in each checkin record next to the respective unique identifiers. Denormalized data is fast on read because there are fewer joins, but loses the update-in-one-place property of a fully normalized database. It also takes more space because data is stored more than once (this is usually a lesser transgression). Taken to the extreme, one could imagine many fully denormalized copies: one tailored for each API endpoint.</p>
<p>Performant application development often involves careful thinking about where your data will fall along this normalization/denormalization spectrum. When you’re lucky, there’s some obvious point that’s both fast enough and straightforward to keep updated. When you’re not… there are some fairly unsatisfying options.</p>
<p>One option is to keep data normalized and then also keep denormalized versions of it in the same database. Then, when the normalized data changes, all denormalized copies of that data are updated in the same transaction. This pushes the burden for keeping them in agreement onto the application developer. It can work, but becomes increasingly burdensome and bug-prone as the app grows in complexity. (However, note that this is exactly what a database index is! <u>Indexes are denormalizations that the database maintains for you.</u> Most databases are limited in the shapes of the indexes that they can automatically keep updated, but as we’ll see below, some databases *cough* Materialize *cough* support much more generality in their “indexes”.)</p>
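<p>To make that concrete, here’s the index version in SQL, sketched against a hypothetical <code class="prettyprint">checkins</code> table:</p>
<pre><code class="prettyprint lang-text">-- A denormalized copy of checkins, re-sorted by venue, that the database
-- itself keeps in sync on every write. No application code required.
CREATE INDEX checkins_by_venue ON checkins (venue_id);
</code></pre>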
<p>Another option is to use software to maintain the denormalized copies, which is what Foursquare did. They had an engineering team, as part of infrastructure, that wrote bespoke microservices to follow changes as they happened in the database and update whatever denormalizations were affected with the minimal necessary work. At the time, we slurped the database logs directly, though nowadays this would likely be done as part of a change-data-capture based streaming architecture.</p>
<p>These microservices worked well, but required a team with full-time staffing to maintain them. This involved performance work and bug fixes, but also a ton of work to spin up a new one when required for a feature launch. Inevitably, the denormalizations were all <em>just</em> different enough that they couldn’t be nicely generalized and each required a good bit of custom code.</p>
<p>Some databases have the concept of a <code class="prettyprint">VIEW</code>, which can be thought of as one of these denormalizations written in SQL. An API endpoint could use one of these, but a bare <code class="prettyprint">VIEW</code> executes its logic when queried, which doesn’t save any time. On the other hand, a <code class="prettyprint">MATERIALIZED VIEW</code> fully computes the denormalization and is fast to query. This is exactly what we want!</p>
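<p>Sketched in SQL against hypothetical normalized tables, the denormalization for a “checkin details” endpoint might look something like this:</p>
<pre><code class="prettyprint lang-text">-- Precompute the joins the endpoint needs, so a read is a simple lookup.
CREATE MATERIALIZED VIEW checkin_details AS
SELECT c.id, c.created_at, u.name AS user_name, v.name AS venue_name
FROM checkins c
JOIN users u ON u.id = c.user_id
JOIN venues v ON v.id = c.venue_id;
</code></pre>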
<p>Unfortunately, existing databases almost never recompute a <code class="prettyprint">MATERIALIZED VIEW</code> incrementally as the underlying data changes. Periodically and/or at the user’s request, the system runs a big batch computation of the entire view and saves it, using it in future queries. Even if this recomputation is run continually in a loop, it introduces latency between when the normalized data changes and when the denormalized data catches up. This repeated “full refresh” recomputation is also resource intensive. As the amount of data increases, latency and CPU utilization go up. There are a few databases that can incrementally compute a <code class="prettyprint">MATERIALIZED VIEW</code>, but only for a <a href="https://docs.oracle.com/database/121/DWHSG/basicmv.htm#GUID-505C24CF-5D56-4820-88AA-2221410950E7">fairly restrictive set of special cases</a>.</p>
<h1 id="so-why-are-you-telling-me-this_1">So Why are you Telling me This? <a class="head_anchor" href="#so-why-are-you-telling-me-this_1">#</a>
</h1>
<p>Enter <a href="https://materialize.com/">Materialize</a>, which maintains <code class="prettyprint">SQL</code> <code class="prettyprint">MATERIALIZED VIEW</code>s incrementally, <a href="https://github.com/TimelyDataflow/differential-dataflow/blob/v0.12.0/differentialdataflow.pdf">doing as little work as possible</a> in response to each change in the underlying data. It is also much more expressive in the kinds of SQL queries it can incrementally maintain, including many-way joins and <a href="https://materialize.com/robust-reductions-in-materialize/">complex aggregations</a>. This is pretty obviously useful for things like analytics dashboards, but when I first heard about the <a href="https://github.com/TimelyDataflow/timely-dataflow">timely</a> and <a href="https://github.com/TimelyDataflow/differential-dataflow">differential dataflow</a> projects that power Materialize, my immediate thought was Foursquare’s denormalization microservices.</p>
<p>As I mentioned, I’ve decided to use my first few Skunkworks Fridays to prototype using Materialize as a replacement for what Foursquare was doing by hand. The basic idea, as hinted above, is that the data of record will be stored fully normalized, but in Materialize I’ll have a <code class="prettyprint">MATERIALIZED VIEW</code> corresponding to each API endpoint of a consumer-facing app. A nice side-benefit is that this will give me experience using the product I’m now developing and the opportunity to see it as a user.</p>
<h1 id="freebase_1">Freebase <a class="head_anchor" href="#freebase_1">#</a>
</h1>
<p>A long time ago (pre-Foursquare), I heard about the <a href="https://en.wikipedia.org/wiki/Freebase_(database)">Freebase</a> project. Freebase was a sort of “structured data” Wikipedia for storing facts. For example: the height of the Eiffel Tower, actor A played role R in movie M, and the hierarchy of administrative regions in the United States. These facts are stored as <code class="prettyprint"><subject, predicate, object></code> triples (more on this in the next post). The company behind Freebase was called “Metaweb” because the structure of this data was also expressed as these triples. In some sense, it’s the “ultimate normalization” of data, in which the schema and constraints (foreign keys/checks/etc) aren’t stored as part of database table structure, but as part of the data itself. (Notice that the <code class="prettyprint">MATERIALIZED VIEW</code> per endpoint is then a parallel “ultimate denormalization” of data. Why do anything halfway amirite?)</p>
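<p>To give a flavor of the shape, here are some made-up triples (not the dump’s actual N-Triples encoding, which the next post covers):</p>
<pre><code class="prettyprint lang-text"><eiffel_tower>    <height_meters>   330
<carrie_fisher>   <played_in>       <star_wars>
<new_york_city>   <contained_by>    <new_york_state>
</code></pre>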
<p>Freebase was acquired by Google and the database has been internalized (RIP), but Google still hosts a copy of the last publicly available <a href="https://developers.google.com/freebase/">freebase dataset</a>. So, my plan is to play with the idea of building an application on top of triples (seeded with the Freebase data) and using Materialize to maintain the denormalizations needed to keep it performant.</p>
<h1 id="whats-next_1">What’s Next <a class="head_anchor" href="#whats-next_1">#</a>
</h1>
<p>Well, that’s what I’m planning to do and a bit of my motivation. In the next post, I’ll download a dump of the Freebase data and extract a smaller, more manageable chunk to work with. In post 3, I’ll fire up Materialize and use it to render something useful. See you then!</p>
<ul>
<li>Part 1: Introduction (you’re here)</li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-2-the-data">Part 2: The Data</a></li>
<li><a href="http://blog.danhhz.com/freebase-meets-materialize-3-first-impressions">Part 3: First Impressions</a></li>
</ul>
<p><em>Thanks to Arjun</em></p>
tag:blog.danhhz.com,2014:Post/simplenote-934cb2a5e50b2019-01-18T15:33:52-08:002019-01-18T15:33:52-08:00Simplenote<blockquote>
<p>“What note-taking app do you use? Do you like it? I currently use Evernote and kind of hate it.” — co-worker on our internal slack</p>
<p>“Twitter productivity gurus and tinkerers. What is the best cross-platform, light-weight note-taking app these days?” — <a href="https://twitter.com/noah_weiss/status/704335744714842112">Noah Weiss</a></p>
</blockquote>
<p><a href="https://svbtleusercontent.com/HvH2cumCKPUnihj2vVGJK0xspap.png"><img src="https://svbtleusercontent.com/HvH2cumCKPUnihj2vVGJK0xspap_small.png" alt="Simplenote"></a></p>
<p>Allow me to introduce you to my favorite piece of software, Simplenote. As the name suggests, it’s for notes and it’s intentionally simple. The notes are plain text, shareable, and they sync instantly and seamlessly. This means that I never need to think about where I typed a note; it’s available and editable on my computer, my phone, my partner’s phone, the web. There are a very small number of features built on top of this, but they’re carefully chosen. Omitted features, like inline images, may sound limiting, but there’s a reason people seem to always be looking for an Evernote replacement.</p>
<p>I’ve used Simplenote for nearly 10 years now and along the way it became the place where I put everything. Everything I type, besides code and emails/texts/chats, is typed into Simplenote. Sometimes the emails get drafted in Simplenote. In fact, I’m writing this post in Simplenote. I have thousands of notes, so I can’t list them all, but here’s a sample: every todo list I have, a grocery list shared with my partner, poetry I’ve liked, quotes, recipes, vacation planning, things I pack on every trip, a log of books I’ve read, books I want to read, frequent flyer account numbers, restaurants I want to go to, potential dog names, my sizes in various clothing brands, thoughts I want to remember but didn’t know what to do with, snippets of code, dinner party menus. The shared grocery list alone is worth it.</p>
<p>Simplenote is my go-to example of a great user experience. I have a theory that the largest part of successful UX is that every UI element does exactly what you expect (no surprises!) and when you want to do something, your first guess is correct. Simplenote nails this, and the simplicity is a big part of how. They’ve selected exactly the minimum feature set that people really need, which means those features can all be exposed in the most obvious way and yet it doesn’t end up cluttered.</p>
<p>The features, basically: tags, search, full revision history, read-only publishing to the web, sharing with other Simplenote accounts, markdown rendering, focus mode, and that’s about it. Just check out Noah’s list of requirements in the tweet I linked above; it matches almost perfectly.</p>
<p>My life is organized entirely in Simplenote. This has been the case since at least 2009, when I was working at Google and they gave out the first Android phones as holiday gifts. Like a good employee, I tried to switch, but at the time there was no native Android app and I had to switch back to iOS. Early Android had a lot of rough edges, but the one thing I couldn’t adapt to was the lack of Simplenote.</p>
<p>When syncing only works most of the time, it never completely fades from your headspace, but when it works every time, it stops being something that you worry about. Simplenote’s syncing works every time. In ~10 years, it’s glitched once and lost one set of edits to a single note. I remember this so clearly because for the rest of that time, I’ve never had to think about where I’ve typed something. It’s always in Simplenote.</p>
<p>My enthusiasm for Simplenote may sound like hyperbole, but it’s not. Official apps are available for Mac, iOS, Android, and the web. <a href="https://simplenote.com/">Try it out</a>!</p>
tag:blog.danhhz.com,2014:Post/easy-bread-18662b29cf332018-10-24T14:59:21-07:002018-10-24T14:59:21-07:00Easy Bread<p><em>My take on making Jim Lahey’s No-Knead Bread even simpler</em></p>
<p><a href="https://svbtleusercontent.com/6s6XMDZ3xkVus8YWUxcJ730xspap.jpeg"><img src="https://svbtleusercontent.com/6s6XMDZ3xkVus8YWUxcJ730xspap_small.jpeg" alt="The final product" title="The final product"></a></p>
<p>I love making bread. It does, however, lead to more bread than I can eat, which leads to me giving away loaves of bread at every opportunity. Occasionally this leads to someone asking me how I make bread.</p>
<p>These days I follow the recipes in Ken Forkish’s book Flour Water Salt Yeast as closely as possible. It’s an excellent book and I’ve had much better results with his recipes than any other source I’ve tried.</p>
<p>For about a year when I was getting started, I used <a href="https://cooking.nytimes.com/recipes/11376-no-knead-bread">Jim Lahey’s “No-Knead Bread” recipe</a>. It strikes the perfect balance of easy, beginner friendly, and tasty. There is a delightful tradeoff in bread between fast and easy; you can make good bread in 5 hours with a lot of work or you can make great bread in 18–24 hours with almost no work. Mr. Lahey’s recipe swings all the way toward easy, eliminating the kneading entirely. A friend of mine made a couple tweaks to simplify it and over time, I’ve made a couple tweaks of my own to make it more foolproof.</p>
<p>The only ingredients you need are all-purpose flour, water, salt, and instant yeast. The long rise means you only need a tiny bit of yeast, so one packet will make several loaves of bread. This recipe can also work with a sourdough starter instead of yeast, but it’s more complicated. I also find I end up with less consistent results. If you want sourdough, I recommend jumping right to the recipes in Flour Water Salt Yeast.</p>
<p>I use <a href="https://thewirecutter.com/reviews/best-kitchen-scale/">a kitchen scale</a> to measure ingredients because it gives much more consistent results. It’s also less cleanup! Just set your bowl on the scale and dump the ingredients into it, no measuring cup necessary. I really believe weighing ingredients is one of the easiest things you can do to improve your baking. If you’re not ready to commit to a scale, try the volumetric amounts in the No-Knead recipe I linked above, but your mileage may vary; I’ve never done it.</p>
<p>You’ll also need some parchment and something to bake in. Ideally, it will be enclosed to keep the moisture in. I use two cast iron pans that are sold as a set and fit together. A dutch oven works well but most of them have a handle on top that can’t stand the temperatures involved, so make sure to remove it beforehand. If all you have is a baking sheet, try it out and let me know how it goes. Yeasted bread is incredibly forgiving.</p>
<p>The timeline here is: a bit of work, 12–18 hours of waiting, a bit of work, 2–3 hours of waiting (with the oven pre-heating at the end), and an hour or so bake. There are peak times during the two windows of waiting, but at first, it’s fine to just use whatever is most convenient for your schedule.</p>
<p>That’s it. Let’s make bread.</p>
<hr>
<ul>
<li>500g all-purpose flour</li>
<li>400g water</li>
<li>11g salt</li>
<li>1/8 teaspoon instant yeast (most scales, including mine, are not accurate enough to weigh such a small amount)</li>
</ul>
<p>Step 1: In the evening on the day before you want bread, mix together all the ingredients in a bowl that’s big enough for it to double or triple in size. Cover in plastic wrap and leave it somewhere that’s as close to 70F as possible.</p>
<p><a href="https://svbtleusercontent.com/rWH5tG2ptVfckB4Uu9uN1H0xspap.jpeg"><img src="https://svbtleusercontent.com/rWH5tG2ptVfckB4Uu9uN1H0xspap_small.jpeg" alt="After the initial mix"></a></p>
<p>Step 2: Anywhere from 12 to 18 hours later, it’ll look like the photo below. This is called the bulk rise.</p>
<p><a href="https://svbtleusercontent.com/eLciuXFWr59WUQkJRQLRbq0xspap.jpeg"><img src="https://svbtleusercontent.com/eLciuXFWr59WUQkJRQLRbq0xspap_small.jpeg" alt="After the bulk rise"></a></p>
<p>Step 3: Next is shaping. Most bread recipes have you use as little flour as possible, but more flour makes it easier and the worst I’ve seen happen is a streak of flour in the middle of the finished bread. Which is fine. If anything it just makes it seem more homemade. A trick of my own: if you have a large enough cutting board to fit the bread, it makes cleanup much easier than a counter. Sprinkle a generous amount of flour and dump the dough onto it. Use a scraper or a spatula or anything flat to fold it into a (flat, loose) ball with the flour side out.</p>
<p><a href="https://svbtleusercontent.com/hTtjJj2gunComaaymNHz9u0xspap.gif"><img src="https://svbtleusercontent.com/hTtjJj2gunComaaymNHz9u0xspap_small.gif" alt="1Getting ready to knead"></a></p>
<p>I find that 10–15 (gentle!) kneads here make it easier to shape. This is pretty late in the process to be working it this much and will push out some bubbles. There’s a better way to accomplish the same thing (using the “folds” method described by Flour Water Salt Yeast), but the kneads here are a lot easier and you still end up with a good bread.</p>
<p><a href="https://svbtleusercontent.com/TA3J7MAcHf8fQExqy7kdH0xspap.gif"><img src="https://svbtleusercontent.com/TA3J7MAcHf8fQExqy7kdH0xspap_small.gif" alt="Not quite no-knead"></a></p>
<p>Before shaping, make sure the dough isn’t sticky on the outside. If it is, dust a good amount of flour on it (in the video below, I don’t have nearly enough flour and you can see my hands sticking). If your hands have dough on them, wash it off. Then cover them with as much flour as will stick to them. Dust a piece of parchment paper with some flour.</p>
<p>Pick the dough up with your fingertips pointed toward the middle of the bottom. Then create some tension by gently pushing the bottom up into the middle. Rotate and repeat a few times. Set the dough seam side down on the floured parchment. Dust the top with flour and put another piece of parchment on the top to keep it from drying out.</p>
<p><a href="https://svbtleusercontent.com/2y17qrGoCNsT8H92Q1XP8a0xspap.gif"><img src="https://svbtleusercontent.com/2y17qrGoCNsT8H92Q1XP8a0xspap_small.gif" alt="Shaping the dough"></a></p>
<p>Step 4: Let it sit for 2–3 hours before baking. This is called the final rise. It will take a while for the oven to preheat to 500F, so make sure to turn it on (with your baking container in it!) an hour or so before the end of the final rise.</p>
<p><a href="https://svbtleusercontent.com/38yReLE67Rxkh5v6VA1SRS0xspap.jpeg"><img src="https://svbtleusercontent.com/38yReLE67Rxkh5v6VA1SRS0xspap_small.jpeg" alt="After the final rise"></a></p>
<p>Step 5: When the final rise finishes and the oven is pre-heated, take the extremely hot container out and flip the dough into it so the seam side is up. The seam will allow gases to escape (good) and bake into a beautiful top. Put on your container’s top and put it in the oven. Lower the temp to 450F and let it bake for 30 minutes before uncovering. This uncovering is my favorite part! You get to see the final shape of your bread, which I find incredibly satisfying and a bit magical. Bake at least 15 minutes uncovered before starting to check for doneness.</p>
<p>It’s done when the outside is at least golden brown, but the darker it gets (short of burning), the more flavor it will have, so leave it in for as long as your nerve holds, up to 60 minutes total.</p>
<p>Once it’s ready, take it out and let it cool on a wire rack or leaned up against something, so the bottom can breathe. If you cut it before it has cooled 20–30 minutes, the rest of the loaf will collapse a bit. But warm bread is awesome, so maybe you don’t care.</p>
<p>Here’s my first bread, from ~1 year ago.</p>
<p><a href="https://svbtleusercontent.com/5mEcNUwYqqEgLbRjis7nvt0xspap.jpeg"><img src="https://svbtleusercontent.com/5mEcNUwYqqEgLbRjis7nvt0xspap_small.jpeg" alt="My first bread! ~1 year ago"></a></p>
<p>Enjoy and send me pictures!</p>
<p><em>Thanks to Kat and Arjun.</em></p>
tag:blog.danhhz.com,2014:Post/implementing-backup-5a75b77887442017-08-09T14:35:08-07:002017-08-09T14:35:08-07:00Implementing Backup<p><em>Originally published at <a href="http://www.cockroachlabs.com">www.cockroachlabs.com</a> on August 9, 2017.</em></p>
<p>Almost all widely used database systems include the ability to backup and restore a snapshot of their data. The replicated nature of CockroachDB’s distributed architecture means that the cluster survives the loss of disks or nodes, and yet many users still want to make regular backups. This led us to develop distributed backup and restore, the first feature available in our CockroachDB Enterprise offering.</p>
<p>When we <a href="https://www.cockroachlabs.com/blog/coming-soon-what-to-expect-in-cockroachdb-1-0/">set out</a> to work on this feature, the first thing we did was figure out why customers wanted it. The reasons we discovered included a general sense of security, “Oops I dropped a table”, finding a bug in new code only when it’s deployed, legally required data archiving, and the “extract” phase of an ETL pipeline. So as it turns out, even in a system that was built to never lose your data, backup is still a critical feature for many of our customers.</p>
<p>At the same time, we brainstormed whether CockroachDB’s unique architecture allowed any improvements to the status quo. In the end, we felt it was important that both backup and restore be consistent across nodes (just like our SQL), distributed (so it scales as your data scales), and incremental (to avoid wasting resources).</p>
<p>Additionally, we knew that backups need to keep only a single copy of each piece of data and should impact production traffic as little as possible. You can see the full list of goals and non-goals in the <a href="https://github.com/cockroachdb/cockroach/blob/v1.0/docs/RFCS/backup_restore.md#goals-and-non-goals">Backup & Restore RFC</a>.</p>
<p>In this post, we’ll focus on backup and how we made it work.</p>
<h1 id="step-0-why-we-reinvented-the-wheel_1">Step 0: Why We Reinvented the Wheel <a class="head_anchor" href="#step-0-why-we-reinvented-the-wheel_1">#</a>
</h1>
<p>One strategy for implementing backup is to take a snapshot of the database’s files, which is how a number of other systems work. CockroachDB uses RocksDB as its disk format and <a href="https://github.com/facebook/rocksdb/wiki/how-to-backup-rocksdb%3F">RocksDB already has a consistent backup feature</a>, which would let us do consistent backups without any particular filesystem support for snapshots of files. Unfortunately, because CockroachDB does such a good job of balancing and replicating your data evenly across all nodes, there’s not a good way to use RocksDB’s backup feature without saving multiple copies of every piece of data.</p>
<h1 id="step-1-make-it-consistent_1">Step 1: Make it Consistent <a class="head_anchor" href="#step-1-make-it-consistent_1">#</a>
</h1>
<p>Correctness is the foundation of everything we do here at Cockroach Labs. We believe that once you have correctness, then stability and performance will follow. With this in mind, when we began work on backup, we started with consistency.</p>
<p>Broadly speaking, CockroachDB is a SQL database built on top of a consistent, distributed key-value store. Each table is assigned a unique integer id, which is used in the mapping from table data to key-values. The table schema (which we call a <a href="https://github.com/cockroachdb/cockroach/blob/v1.0/pkg/sql/sqlbase/structured.proto#L355-L533">TableDescriptor</a>) is stored at key <code class="prettyprint">/DescriptorPrefix/<tableid></code>. Each row in the table is stored at key <code class="prettyprint">/<tableid>/<primarykey></code>. (This is a simplification; the real encoding is much more complicated and efficient than this. For full details see the <a href="https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mapping-table-data-to-key-value-storage/">Table Data blog post</a>).</p>
<p>I’m a big fan of pre-RFC exploratory prototypes, so the first version of backup used the existing <code class="prettyprint">Scan</code> primitive to fetch the table schema and to page through the table data (everything with a prefix of <code class="prettyprint">/<tableid></code>). This was easy, quick, and it worked!</p>
<p>It also meant the engineering work was now separable. The <a href="https://www.cockroachlabs.com/docs/stable/backup.html">SQL syntax for <code class="prettyprint">BACKUP</code></a>, the format of the backup files (described below), and <code class="prettyprint">RESTORE</code> could now be divvied up among the team members.</p>
<p>Unfortunately, the node sending all the <code class="prettyprint">Scan</code>s was also responsible for writing the entire backup to disk. This was sloooowwww (less than 1 MB/s), and it didn’t scale as the cluster scaled. We built a database to handle petabytes, but this could barely handle gigabytes.</p>
<p>With consistency in hand, the natural next step was to distribute the work.</p>
<h1 id="step-2-make-it-distributed_1">Step 2: Make it Distributed <a class="head_anchor" href="#step-2-make-it-distributed_1">#</a>
</h1>
<p>We decided early on that backups would output their files to the storage offered by cloud providers (Amazon, Google, Microsoft, private clouds, etc). So what we needed was a command that was like <code class="prettyprint">Scan</code>, except instead of returning the data, it would write it to cloud storage. And so we created <code class="prettyprint">Export</code>.</p>
<p><a href="https://github.com/cockroachdb/cockroach/blob/v1.0/pkg/roachpb/api.proto#L782-L804"><code class="prettyprint">Export</code> is a new transactionally-consistent command</a> that iterates over a range of data and writes it to cloud storage. Because we break up a large table and its secondary indexes into multiple pieces (called “ranges”), the request that is sent gets split up by the kv layer and sent to many nodes. The exported files use <a href="https://github.com/google/leveldb/blob/master/doc/table_format.md">LevelDB’s SSTable</a> as the format because it supports efficient seeking (in case we want to query the backup) and because it was already used elsewhere in CockroachDB.</p>
<p>Along with the exported data, a serialized <a href="https://github.com/cockroachdb/cockroach/blob/v1.0/pkg/ccl/sqlccl/backup.proto#L20-L58">backup descriptor</a> is written with metadata about the backup, a copy of the schema of each included SQL table, and the locations of the exported data files.</p>
<p>Once we had a backup system that could scale to clusters with many nodes and lots of data, we had to make it more efficient. It was particularly wasteful (of both CPU and storage) to export the full contents of tables that change infrequently. What we wanted was a way to write only what had changed since the last backup.</p>
<h1 id="step-3-make-it-incremental_1">Step 3: Make it Incremental <a class="head_anchor" href="#step-3-make-it-incremental_1">#</a>
</h1>
<p>CockroachDB uses <a href="https://www.cockroachlabs.com/blog/serializable-lockless-distributed-isolation-cockroachdb/">MVCC</a>. This means each of the keys I mentioned above actually has a timestamp suffix, something like <code class="prettyprint">/<tableid>/<primarykey>:<timestamp></code>. Mutations to a key don’t overwrite the current version, they write the same key with a higher timestamp. Then the old versions of each key are cleaned up after 25 hours.</p>
<p>To make an incremental version of our distributed backup, all we needed to do was leverage these MVCC versions. Each backup has an associated timestamp. An incremental backup simply saves any keys that have changed between its timestamp and the timestamp of the previous backup. We plumbed these time ranges to our new <code class="prettyprint">Export</code> command and voilà! Incremental backup.</p>
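<p>As an illustration (made-up key and timestamps), suppose a customer row was written at <code class="prettyprint">t1</code> and updated at <code class="prettyprint">t3</code>:</p>
<pre><code class="prettyprint lang-text">/<customers>/<4>:t3 -> "Carl"    (current version)
/<customers>/<4>:t1 -> "Carla"   (older MVCC version)
</code></pre>
<p>A full backup taken at <code class="prettyprint">t4</code> saves the <code class="prettyprint">t3</code> version. An incremental backup taken at <code class="prettyprint">t4</code> whose previous backup ran at <code class="prettyprint">t2</code> also saves just the <code class="prettyprint">t3</code> version, while one whose previous backup ran after <code class="prettyprint">t3</code> skips this key entirely.</p>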
<p>One small wrinkle: if a given key (say <code class="prettyprint">/<customers>/<4></code>) is deleted, then 25 hours later when the old MVCC versions are cleaned out of RocksDB, this deletion (called a tombstone) is also collected. This means incremental backup can’t tell the difference between a key that’s never existed and one that was deleted more than 25 hours ago. As a result, an incremental backup can only run if the most recent backup was fewer than 25 hours ago (though full backups can always be run). The 25-hour period is not right for every user, so it’s <a href="https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html">configurable using replication zones</a>.</p>
<h1 id="go-forth-and-backup_1">Go Forth and Backup <a class="head_anchor" href="#go-forth-and-backup_1">#</a>
</h1>
<p>Backup is run via a simple <a href="https://www.cockroachlabs.com/docs/stable/backup.html"><code class="prettyprint">BACKUP</code> SQL command</a>, and with our work to make it consistent first, then distributed and incremental, it turned out blazing fast. We’re getting about 30MB/s per node and there’s still lots of low-hanging performance fruit. It’s our first enterprise feature, so head on over to our <a href="https://www.cockroachlabs.com/pricing/">license page</a> to grab an evaluation license and try it out.</p>
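<p>For example, a weekly full backup followed by a daily incremental looks something like this (the bucket paths are made up; see the docs linked above for the full syntax):</p>
<pre><code class="prettyprint lang-text">BACKUP DATABASE bank TO 'gs://acme-co-backups/bank-weekly';
BACKUP DATABASE bank TO 'gs://acme-co-backups/bank-daily'
    INCREMENTAL FROM 'gs://acme-co-backups/bank-weekly';
</code></pre>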
<p>While CockroachDB was built to survive failures and prevent data loss, we want to make sure every team, regardless of size, has the ability to survive any type of disaster. Backup and restore were built for large clusters that absolutely need to minimize downtime, but for smaller clusters, a simpler tool will work just fine. For this, <a href="https://www.cockroachlabs.com/docs/stable/sql-dump.html">we’ve built <code class="prettyprint">cockroach dump</code></a>, which is available in CockroachDB Core.</p>
<h1 id="whats-next_1">What’s Next? <a class="head_anchor" href="#whats-next_1">#</a>
</h1>
<p>We have plans for a number of future projects to build on this foundation: <a href="https://github.com/cockroachdb/cockroach/pull/16838">Change Feeds</a> for point-in-time backup and restore, read-only SQL queries over backups, an admin ui page with progress and scheduling, pause/resume/cancel control of running backups, and more.</p>
<p>Plus, <code class="prettyprint">BACKUP</code> is worth far more with <code class="prettyprint">RESTORE</code> (which turned out to be much harder and more technically interesting) and there’s a lot more that didn’t fit in this blog post, so stay tuned.</p>
tag:blog.danhhz.com,2014:Post/implementing-column-families-in-cockroachdb-ea15b23b782e2016-09-28T15:44:38-07:002016-09-28T15:44:38-07:00Implementing Column Families in CockroachDB<p><em>Originally published at <a href="https://www.cockroachlabs.com/blog/sql-cockroachdb-column-families/">www.cockroachlabs.com</a> on September 29, 2016.</em></p>
<p>CockroachDB is a scalable SQL database built on top of a transactional key value store. We don’t (yet) expose the kv layer but it’s general purpose enough that we’ve used it to implement SQL without any special trickery.<br>
The particulars of how we represent data in a SQL table as well as the table metadata are internally called the “format version”. Our first format version was deliberately simple, causing some performance inefficiencies. We recently improved performance with a technique called column families, which pack multiple columns in one kv entry.</p>
<p>Once implemented, column families <a href="https://github.com/cockroachdb/cockroach/pull/7623">produced dramatic improvements in our benchmarks</a>. A table with more columns benefits more from this optimization, so we added a benchmark of INSERTs, UPDATEs, and DELETEs against a table with 20 INT columns and <a href="https://github.com/cockroachdb/cockroach/pull/7408">it ran 5 times faster</a>.</p>
<p>Press on, dear reader, and I’ll explain the details of how we did it and how they work.</p>
<h1 id="format-version-1-cockroachdb-before-column-fa_1">Format Version 1: CockroachDB Before Column Families <a class="head_anchor" href="#format-version-1-cockroachdb-before-column-fa_1">#</a>
</h1>
<p>CockroachDB requires every SQL table to have a primary index; one is generated if it was not provided by the user. Our first format version stored the table data as kv entries with keys prefixed by the columns in the primary index. The remaining columns were each encoded as the value in a kv entry. Additionally, a sentinel key with an empty value was always written and used to indicate the existence of a row. This resulted in N+1 entries for a table with N non-primary index columns. Secondary indexes work a little differently, but we don’t need them for today.</p>
<p>This all results in something like:</p>
<pre><code class="prettyprint lang-text">/<tableID>/<indexID>/<primaryKeyColumns...>/<columnID> -> <4 byte CRC><encoded value>
</code></pre>
<p>And more concretely:</p>
<pre><code class="prettyprint lang-text">CREATE TABLE users (id INT PRIMARY KEY, name STRING, email STRING);
INSERT INTO users VALUES (11, 'Hal', 'hal@cockroachlabs.com');
INSERT INTO users VALUES (13, 'Orin', 'orin@cockroachlabs.com');
/<tableid>/0/11/0 -> <empty>
/<tableid>/0/11/1 -> "Hal"
/<tableid>/0/11/2 -> "hal@cockroachlabs.com"
/<tableid>/0/13/0 -> <empty>
/<tableid>/0/13/1 -> "Orin"
/<tableid>/0/13/2 -> "orin@cockroachlabs.com"
</code></pre>
<p>Note that columns never use ID 0 because it’s reserved for use as the sentinel. This is all described in much more detail in the original <a href="https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mapping-table-data-to-key-value-storage/">SQL in CockroachDB: Mapping Table Data to Key-Value Storage</a> blog post. If you haven’t read it, I highly recommend you do.</p>
<h1 id="the-trouble-with-format-version-1_1">The Trouble with Format Version 1 <a class="head_anchor" href="#the-trouble-with-format-version-1_1">#</a>
</h1>
<p>Everything has to start somewhere, and while our first format version worked, it was a little inefficient. The encoded primary index data in the key was repetitive, and each entry carried an <a href="https://en.wikipedia.org/wiki/Multiversion_concurrency_control">MVCC timestamp</a> and checksum, collectively wasting disk space and network bandwidth.</p>
<p>Perhaps worse was that there was per-key overhead at the transaction level. Every key written within a transaction has a “<a href="https://www.cockroachlabs.com/blog/how-cockroachdb-distributes-atomic-transactions/">write intent</a>” associated with it. These intents need to be resolved when the transaction is committed, taxing performance.</p>
<p>While our disk format avoids the key repetition with an incremental prefix encoding, the timestamp and the checksum still create ~12 bytes of overhead per key, not to mention the intents.</p>
<p>Since the problem was using one kv entry per column in the table, the natural solution was to group multiple columns into one value. Several NoSQL databases use a similar technique and call each group a “column family”.</p>
<p>When we set out to implement column families, the first wrinkle was deciding whether to support get and set on individual columns in a family or to load and store an entire family to change one column. The former would allow us to make every table’s primary data one key value entry. Unfortunately, it would also require the kv layer to understand the encoding that packs multiple columns in one value. If we later decided to change the encoding, it would be much more difficult to migrate if it were baked into the key value layer. Plus, the tidy separation the SQL and kv layers have enjoyed so far has been a big help to testability and moving quickly. We felt this wasn’t a worthwhile tradeoff.</p>
<p>As a result, we support multiple column families per table, so that setting a small field doesn’t necessitate roundtripping any large fields in the same table.</p>
<p><em>Side note: A common question we get is whether we support use of the key value layer directly. We don’t right now, but by using one entry instead of two, we’ve gotten much closer to eliminating the overhead of using the CockroachDB key value store via a two-column (key, value) SQL table.</em></p>
<h1 id="how-do-column-families-in-cockroachdb-work_1">How Do Column Families in CockroachDB Work? <a class="head_anchor" href="#how-do-column-families-in-cockroachdb-work_1">#</a>
</h1>
<p>Before column families, the value of an encoded table column was structured as:</p>
<pre><code class="prettyprint lang-text"><crc><typetag><encodedvalue>
</code></pre>
<p>With column families, this is now:</p>
<pre><code class="prettyprint lang-text"><crc><columnid0><typetag0><encodedvalue0>...<columnidN><typetagN><encodedvalueN>
</code></pre>
<p>or for our example above</p>
<pre><code class="prettyprint lang-text">/<tableid>/0/11/0 -> <crc>/1/string/"Hal"/1/string/"hal@cockroachlabs.com"
/<tableid>/0/13/0 -> <crc>/1/string/"Orin"/1/string/"orin@cockroachlabs.com"
</code></pre>
<p>Notably, the column IDs in the keys have been replaced by family IDs. The first family ID is 0, doubling as the sentinel, and is always present. We use a variable length encoding for integers, including column IDs. This encoding is shorter for smaller numbers, so instead of storing each column ID directly, we store its difference from the previous one to keep the numbers small; that’s why the example above shows <code class="prettyprint">1</code> for both the name column (ID 1) and the email column (ID 2). NULLs are omitted to save space.</p>
<p>A couple of the existing data encodings (DECIMAL and BYTES) didn’t self-delimit their length. It’s desirable if we can extract the data for some of the columns without decoding them all, so we added variants of these two encodings that are length prefixed.</p>
<p>A constant concern of working in any system that persists data is how to read old data with new code. We made column families backward compatible by special casing a family that’s only ever had one column; it’s encoded exactly as it was before (with no column ID). This also happens to have the side benefit of being a nice space optimization.</p>
<p>All this and more is detailed in the <a href="https://github.com/cockroachdb/cockroach/blob/71a023f9212cf1285a4475ed85d5ab6eefa9bdf7/docs/RFCS/20151214_sql_column_families.md">Column Families RFC</a> if you’re interested.</p>
<h1 id="using-column-families_1">Using Column Families <a class="head_anchor" href="#using-column-families_1">#</a>
</h1>
<p>When a table is created, some <a href="https://github.com/cockroachdb/cockroach/blob/71a023f9212cf1285a4475ed85d5ab6eefa9bdf7/docs/RFCS/20151214_sql_column_families.md#heuristics-for-fitting-columns-into-families">simple heuristics</a> are used to determine which columns get grouped together. You can see these assignments in the output of the <code class="prettyprint">SHOW CREATE TABLE</code> command.</p>
<p>CockroachDB can’t know the query patterns of a table when it’s created, but the way a table is queried has a big impact on the optimal column family mapping. So, we decided to allow a user to manually tune these assignments when necessary. A small extension (<code class="prettyprint">FAMILY</code>) was added to our SQL dialect to allow for user tuning of the assignments. The various tradeoffs are detailed in our <a href="https://www.cockroachlabs.com/docs/stable/column-families.html">column families documentation</a>.</p>
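<p>For example, a frequently updated column can be split into its own family so that writing it doesn’t rewrite the rest of the row (hypothetical table, sketching the syntax):</p>
<pre><code class="prettyprint lang-text">CREATE TABLE users (
    id INT PRIMARY KEY,
    name STRING,
    email STRING,
    last_seen TIMESTAMP,
    FAMILY profile (id, name, email),
    FAMILY activity (last_seen)
);
</code></pre>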
<p>Building a SQL database after the rise of NoSQL means that CockroachDB gets to pick the best parts of both. In this case, we were able to use column families, an optimization commonly found in NoSQL databases, to speed up our SQL implementation. The resulting performance improvement moves us one step closer to our 1.0 release.</p>