Freebase Meets Materialize 3 - First Impressions

Previous posts talked about what I’m hoping to do and some background on the Freebase data. Today, we (finally) take Materialize out for a spin.

First, a quick note: one of my motivations for doing this is to get a feel for Materialize as a user, so I’m going to take my developer hat off and put my user hat on. I’ve only been here a couple weeks and the first things I’ve been working on have to do more with internals than UX, so I’m hoping this will mostly work.

Spoilers from the future: it turns out to have worked pretty well! What follows are my real, unabridged first interactions with Materialize’s docs and Materialize itself. I end up finding some papercuts, as well as some great touch-points where we could do more to conceptually help someone transitioning from traditional databases to streaming. This was exactly the sort of feedback I was hoping to gather.

Installation #

There is a cloud version of Materialize (currently in private beta), but I prefer to do my development locally, so I downloaded it. Following the Install instructions for Homebrew (which is my preference for this sort of thing):

$ brew install MaterializeInc/materialize/materialized
[...]

Hmm, it compiles instead of using a brew bottle? It’s probably because I’m on arm64. I’m going to assume that we (Materialize) have a bottle for x86 but not for arm64 yet. It also feels like brew could have done better here. I would have been okay with installing an x86 binary and running it with Rosetta 2, so I wish it would have asked me if I wanted that or to install a bunch of compile-time dependencies and do a slow compile.

After finishing the compile and installation, brew tells me I can start it with materialized --threads=1.

$ materialized --threads=1
error: Found argument '--threads' which wasn't expected, or isn't valid in this context

USAGE:
    materialized [OPTION]...

For more information try --help

The install page has a pointer to [Get Started], which informs me that the updated name for this flag is materialized -w 1. So one issue here is that the brew instructions are out of date. It also seems weird to have renamed this flag without supporting the old one. We’re pre-1.0, so I don’t think we need to commit to perfect backward compatibility, but at first glance this was a simple rename from --threads to --workers/-w. It’s pretty easy to alias the old name to the new one (and hide it from docs).

[get started]: https://materialize.com/docs/get-started

Next up on the Get Started page is to connect to my running Materialize instance:

$ psql -U materialize -h localhost -p 6875 materialize
bash: psql: command not found

I happen to know from working at Cockroach Labs that “psql” comes from PostgreSQL. The Get Started page also notes this in a “Prerequisites” section that I only notice now. This is incredibly nit-picky on my part, but having Prerequisites here feels heavyweight. It makes a ton of sense for other pages in the docs, where it might list having Materialize itself set up as a prerequisite (CockroachDB used to have things like “set up a multi-node cluster” in Prerequisites for some of the docs examples), but for this page I would have had “Make sure you have psql installed” as a step instead. To each their own.

$ brew install postgresql
[...]
$ psql -U materialize -h localhost -p 6875 materialize
psql (13.2, server 9.5.0)
Type "help" for help.

materialize=>

Now we’re in business.

Loading a File #

Just a bit further down the Get Started page, there’s an example of creating a file SOURCE, which is exactly what I’m looking for. My file isn’t changing, so I’m not going to tail it (yet). I copy the example, remove the tail bit, and swap the regex for the [monstrosity in my over-engineered tuple partitioning program][regex monstrosity] (I knew this would come in handy). A quick glance at the [CREATE SOURCE] page for local files shows that Materialize supports `COMPRESSION GZIP`, so we’re ready to go!

[regex monstrosity]: https://github.com/danhhz/scribbles/blob/e7b712f304ce59747e91127e2babbcd63e841e9c/cmd/src/ntriple.rs#L16-L42
[CREATE SOURCE]: https://materialize.com/docs/sql/create-source/text-file/

materialize=> CREATE SOURCE film
FROM FILE 'film.nt.gz' COMPRESSION GZIP
FORMAT REGEX '^[ \t]*(?:(?P<comment>#[ -~]*)|(?:<(?:http://rdf.freebase.com/(?P<sub_uri_fb>[ -~]+)|http://www.w3.org/(?P<sub_uri_w3>[ -~]+)|(?P<sub_uri>.+))>|_:(?P<sub_node>[A-Za-z][A-Za-z0-9]*))[ \t]+<(?:http://rdf.freebase.com/(?P<prd_uri_fb>[ -~]+)|http://www.w3.org/(?P<prd_uri_w3>[ -~]+)|(?P<prd_uri>.+))>[ \t]+(?:<(?:http://rdf.freebase.com/(?P<obj_uri_fb>[ -~]+)|http://www.w3.org/(?P<obj_uri_w3>[ -~]+)|(?P<obj_uri>.+))>|_:(?P<obj_node>[A-Za-z][A-Za-z0-9]*)|"(?P<obj_lang_lit>.*)"@(?P<obj_lang_type>[a-z]+(-[a-zA-Z0-9]+)*)|"(?P<obj_data_lit>.*)"\^\^<(?:http://rdf.freebase.com/(?P<obj_data_type_fb>[ -~]+)|http://www.w3.org/(?P<obj_data_type_w3>[ -~]+)|(?P<obj_data_type>.+))>|"(?P<obj_str_lit>.*)")[ \t]*\.[ \t]*|(?P<blank>))$';
CREATE SOURCE
materialize=> SHOW COLUMNS FROM film;
       name       | nullable |  type
------------------+----------+--------
 blank            | t        | text
 column15         | t        | text
 comment          | t        | text
 mz_line_no       | f        | bigint
 obj_data_lit     | t        | text
 obj_data_type    | t        | text
 obj_data_type_fb | t        | text
 obj_data_type_w3 | t        | text
 obj_lang_lit     | t        | text
 obj_lang_type    | t        | text
 obj_node         | t        | text
 obj_str_lit      | t        | text
 obj_uri          | t        | text
 obj_uri_fb       | t        | text
 obj_uri_w3       | t        | text
 prd_uri          | t        | text
 prd_uri_fb       | t        | text
 prd_uri_w3       | t        | text
 sub_node         | t        | text
 sub_uri          | t        | text
 sub_uri_fb       | t        | text
 sub_uri_w3       | t        | text
(22 rows)

Sweet! Let’s make sure the regex is working correctly. (Also, what’s column15? My guess: it’s the regex’s one unnamed capture group, the `(-[a-zA-Z0-9]+)*` inside obj_lang_type, getting an auto-generated column name.)

materialize=> SELECT * FROM film LIMIT 10;
ERROR:  Unable to automatically determine a timestamp for your query; this can happen if your query depends on non-materialized sources.
For more details, see https://materialize.com/s/non-materialized-error

After reading the link, I get why this is the case, but it’s kind of a bummer. I just want to look at enough of the data to verify that the regex is working. The Get Started page makes materialized views and selects from them, so I’ll do that instead:

materialize=> CREATE MATERIALIZED VIEW foo AS SELECT * FROM film LIMIT 10;
CREATE VIEW
materialize=> SELECT * FROM foo;
 comment | sub_uri_fb | sub_uri_w3 | sub_uri | sub_node | prd_uri_fb | prd_uri_w3 | prd_uri | obj_uri_fb | obj_uri_w3 | obj_uri | obj_node | obj_lang_lit | obj_lang_type | column15 | obj_data_lit | obj_data_type_fb | obj_data_type_w3 | obj_data_type | obj_str_lit | blank | mz_line_no
---------+------------+------------+---------+----------+------------+------------+---------+------------+------------+---------+----------+--------------+---------------+----------+--------------+------------------+------------------+---------------+-------------+-------+------------
(0 rows)

Huh, okay let’s try putting the LIMIT 10 on the SELECT instead of the materialized view definition.

materialize=> DROP VIEW foo;
DROP VIEW
materialize=> CREATE MATERIALIZED VIEW foo AS SELECT * FROM film;
CREATE VIEW
materialize=> SELECT * FROM foo LIMIT 10;
 comment |  sub_uri_fb  | sub_uri_w3 | sub_uri | sub_node | prd_uri_fb | prd_uri_w3 | prd_uri | obj_uri_fb | obj_uri_w3 | obj_uri | obj_node | obj_lang_lit | obj_lang_type | column15 | obj_data_lit | obj_data_type_fb | obj_data_type_w3 | obj_data_type | obj_str_lit | blank | mz_line_no
---------+--------------+------------+---------+----------+------------+------------+---------+------------+------------+---------+----------+--------------+---------------+----------+--------------+------------------+------------------+---------------+-------------+-------+------------
         | ns/m.01y67v  |            |         |          | key/en     |            |         |            |            |         |          |              |               |          |              |                  |                  |               | kbs         |       |      26591
         | ns/m.0h34n   |            |         |          | key/en     |            |         |            |            |         |          |              |               |          |              |                  |                  |               | gimli       |       |     120215
         | ns/m.0fns_b  |            |         |          | key/en     |            |         |            |            |         |          |              |               |          |              |                  |                  |               | nezu        |       |      99581
         | ns/m.0xcy    |            |         |          | key/en     |            |         |            |            |         |          |              |               |          |              |                  |                  |               | atlantis    |       |     184125
         | ns/m.03k9l5  |            |         |          | key/en     |            |         |            |            |         |          |              |               |          |              |                  |                  |               | figwit      |       |      39070
         | ns/m.04zl2r  |            |         |          | key/en     |            |         |            |            |         |          |              |               |          |              |                  |                  |               | ovidie      |       |      49112
         | ns/m.06r6zc  |            |         |          | key/en     |            |         |            |            |         |          |              |               |          |              |                  |                  |               | kyknet      |       |      56779
         | ns/m.0btr9d  |            |         |          | key/en     |            |         |            |            |         |          |              |               |          |              |                  |                  |               | libaas      |       |      81760
         | ns/m.0kprc8  |            |         |          | key/en     |            |         |            |            |         |          |              |               |          |              |                  |                  |               | pg_usa      |       |     141054
         | ns/m.09rypdl |            |         |          | key/en     |            |         |            |            |         |          |              |               |          |              |                  |                  |               | frode       |       |      67198
(10 rows)

Cool! It’s either processed that file really fast or is doing something lazily.

materialize=> SELECT COUNT(*) FROM foo;
  count
---------
 9346717
(1 row)

$ ls -alh film.nt.gz
-rw-r--r--  1 dan  staff   384M Apr 24 12:10 film.nt.gz

Yeah, the count seems low. Nine million tuples aren’t gonna take up 384MB compressed.

materialize=> SELECT COUNT(*) FROM foo;
  count
----------
 19143660
(1 row)

It’s bigger now! This confirms that the select is returning while the file is still loading. That makes sense given the streaming focus. It’s nice to get results immediately, but my initial impression is that it’s odd for a non-tail file source, which I expect to be loaded atomically. I’m going to chalk this up to my background in OLTP databases and my ongoing adjustment to this new streaming world. (Note from future self: this will become a recurring theme.)

Is My File Loaded? #

I want to know when my file is finished loading, so I poke around the docs sidebar and [Monitoring] seems promising. It talks about Prometheus inside Docker, but I haven’t polluted this pristine new work computer with Docker yet. A bunch of our testing infra uses it, so I will eventually, but let’s see what other options we have (there was mention of SQL at the top of the page).

[monitoring]: https://materialize.com/docs/ops/monitoring/

Side Note: I happened to notice the [On macOS, with Materialize running outside of Docker][macosdocker] section, which is amazing. I know from experience that if I’d gone the docker route, this would have saved me a lot of time.

[macosdocker]: https://materialize.com/docs/ops/monitoring/#on-macos-with-materialize-running-outside-of-docker

[System catalog SQL interface][mzcatalog mon] sounds promising! There are links to [SQL documentation][mzcatalog sql] and a “walkthrough of useful [diagnostic queries]”. I open them both.

[mzcatalog mon]: https://materialize.com/docs/ops/monitoring/#system-catalog-sql-interface
[mzcatalog sql]: https://materialize.com/docs/sql/system-catalog/
[diagnostic queries]: https://materialize.com/docs/ops/diagnosing-using-sql/

Looking at the system catalog SQL docs, I see [mz_sources], which doesn’t look like it will include loading progress, but I want to run it anyway.

[mz_sources]: https://materialize.com/docs/sql/system-catalog/#mz_sources

materialize=> SELECT * FROM mz_sources;
  id   |  oid  | schema_id |                name                 | volatility
-------+-------+-----------+-------------------------------------+------------
 u1    | 20234 |         3 | film                                | unknown
 s3022 | 20153 |         1 | mz_peek_active                      | volatile
 s3026 | 20157 |         1 | mz_source_info                      | volatile
 s3024 | 20155 |         1 | mz_peek_durations                   | volatile
...
(18 rows)

Cool! I assume u is user and s is system.
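
If that guess is right, filtering the catalog down to just my own objects should be easy. A sketch, assuming the u-prefix convention holds:

materialize=> SELECT id, name FROM mz_sources WHERE id LIKE 'u%';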

The second page of useful diagnostic queries has a section titled [Are my sources loading data in a reasonable fashion?][loading]. Exactly what I’m here for, bravo! <3

[loading]: https://materialize.com/docs/ops/diagnosing-using-sql/#are-my-sources-loading-data-in-a-reasonable-fashion

Oh the answer is to run SELECT count(*). I literally LOL’d.

It also mentions mz_materialization_frontiers:

materialize=> select * from mz_materialization_frontiers;
 global_id |     time
-----------+---------------
 s3001     | 1619797365000
 s3003     | 1619797365000
 s3005     | 1619797365000
...
(38 rows)

This is clearly the same id space as my mz_sources query above, but none of them match up. Oh right! It’s probably materialized views (which I can select from), not sources (which I can’t). Before I bother to reopen that first page, I bet mz_views is a thing.

materialize=> SELECT * FROM mz_views;
  id   |  oid  | schema_id |               name                | volatility
-------+-------+-----------+-----------------------------------+------------
 u4    | 20237 |         3 | foo                               | unknown
 s5022 | 20229 |         2 | pg_proc                           | unknown
 s5024 | 20231 |         2 | pg_enum                           | unknown
 s5021 | 20228 |         2 | pg_type                           | volatile
...
(26 rows)

Yup but nope. Still none of them match up.

At this point, I’m going to give up for now and decide that wc -l and waiting for that number in SELECT count(*) is how I’d do it. I don’t see how this would work for more complex materialized views, because I wouldn’t have a good way to reason about how many rows would be in them once they finished loading. I guess I could keep re-running the SELECT count(*) until it stops changing? Dunno, maybe this is all just me still adjusting to streaming paradigms.
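
For this particular file, a sketch of how I’d do that (gzcat is stock on macOS; \watch is the psql meta-command that re-runs the last query every N seconds):

$ gzcat film.nt.gz | wc -l

materialize=> SELECT count(*) FROM foo;
materialize=> \watch 5

When the count stops changing and matches the wc -l number, the file is in.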

Let’s drop this test view. It showed what it needed to show.

materialize=> DROP VIEW foo;
DROP VIEW

Debug Endpoint #

So, what’s my first API endpoint going to be? Honestly, at this point, I just want to explore the data. Let’s start with a page that, given an id, shows the name of the thing, all the triples where it is the subject, and linkifies everything. That will let me easily poke around.

I’ve got a film source, but it’ll be useful to have links to stuff in common.nt.gz work as well, so let’s make a second source.

materialize=> CREATE SOURCE common
FROM FILE 'common.nt.gz' COMPRESSION GZIP
FORMAT REGEX '^[ \t]*(?:(?P<comment>#[ -~]*)|(?:<(?:http://rdf.freebase.com/(?P<sub_uri_fb>[ -~]+)|http://www.w3.org/(?P<sub_uri_w3>[ -~]+)|(?P<sub_uri>.+))>|_:(?P<sub_node>[A-Za-z][A-Za-z0-9]*))[ \t]+<(?:http://rdf.freebase.com/(?P<prd_uri_fb>[ -~]+)|http://www.w3.org/(?P<prd_uri_w3>[ -~]+)|(?P<prd_uri>.+))>[ \t]+(?:<(?:http://rdf.freebase.com/(?P<obj_uri_fb>[ -~]+)|http://www.w3.org/(?P<obj_uri_w3>[ -~]+)|(?P<obj_uri>.+))>|_:(?P<obj_node>[A-Za-z][A-Za-z0-9]*)|"(?P<obj_lang_lit>.*)"@(?P<obj_lang_type>[a-z]+(-[a-zA-Z0-9]+)*)|"(?P<obj_data_lit>.*)"\^\^<(?:http://rdf.freebase.com/(?P<obj_data_type_fb>[ -~]+)|http://www.w3.org/(?P<obj_data_type_w3>[ -~]+)|(?P<obj_data_type>.+))>|"(?P<obj_str_lit>.*)")[ \t]*\.[ \t]*|(?P<blank>))$';
CREATE SOURCE

And union them together.

materialize=> CREATE VIEW freebase AS SELECT * FROM common UNION ALL SELECT * FROM film;
CREATE VIEW

I plan to join everything to its user-facing name, so let’s make a view for that to make it easier later.

materialize=> CREATE VIEW id_names AS SELECT sub_uri_fb AS id, obj_lang_lit AS name, obj_lang_type AS lang FROM freebase WHERE prd_uri_fb = 'ns/type.object.name';
CREATE VIEW

I’m the type of person who likes to see things work as I go and we’re about at that point, so I inspect id_names using the (soon to be very common) throwaway materialized view plus select trick.

materialize=> CREATE MATERIALIZED VIEW foo AS SELECT * FROM id_names;
CREATE VIEW
materialize=> SELECT * FROM foo LIMIT 10;
      id      | name | lang
--------------+------+------
 ns/m.0dq6p   | VHS  | en
 ns/m.05r9bx  | Ed   | en
 ns/m.09nrrz  | 15   | en
 ns/m.0gxwk_  | RJ   | en
 ns/m.0kvd3l  | Da   | en
 ns/m.03d019n | G    | en
 ns/m.03hj3r6 | K    | en
 ns/m.03w7wq_ | A    | en
 ns/m.03w7wtc | D    | en
 ns/m.03w854r | S    | en
(10 rows)

materialize=> DROP VIEW foo;
DROP VIEW

That’s working nicely. Next up is a view that uses id_names to name everything in each tuple.

This is where I note that 5 years of building a SQL database doesn’t make you a SQL expert. In fact, I’m very much still a SQL novice: probably 95%+ of the SQL I’ve written in my life is in CockroachDB unit tests and most of it is for stuff like BACKUP, RESTORE, CHANGEFEED, and IMPORT. I write down the first thing that comes to mind, which is probably a terrible way to do this:

materialize=> CREATE VIEW named_tuples AS
    SELECT
        f.sub_uri_fb AS sub_id,
        sub_n.name AS sub_name_en,
        f.prd_uri_fb AS prd_id,
        prd_n.name AS prd_name_en,
        f.obj_uri_fb AS obj_id,
        obj_n.name AS obj_name_en
    FROM
        (
            SELECT
                *
            FROM
                freebase
            WHERE
                sub_uri_fb IS NOT NULL
                AND prd_uri_fb IS NOT NULL
                AND obj_uri_fb IS NOT NULL
        )
            AS f
        JOIN (SELECT * FROM id_names WHERE lang = 'en')
                AS sub_n ON f.sub_uri_fb = sub_n.id
        JOIN (SELECT * FROM id_names WHERE lang = 'en')
                AS prd_n ON f.prd_uri_fb = prd_n.id
        JOIN (SELECT * FROM id_names WHERE lang = 'en')
                AS obj_n ON f.obj_uri_fb = obj_n.id;
CREATE VIEW

Side note: Thank deity (and [Matt Jibson]) for [https://sqlfum.pt/].

[Matt Jibson]: https://twitter.com/mjibson
[https://sqlfum.pt/]: https://sqlfum.pt/

You know what’s coming next.

materialize=> CREATE MATERIALIZED VIEW foo AS SELECT * FROM named_tuples;
CREATE VIEW
materialize=> SELECT * FROM foo LIMIT 10;
    sub_id     |            sub_name_en            |                prd_id                | prd_name_en |     obj_id     |         obj_name_en
---------------+-----------------------------------+--------------------------------------+-------------+----------------+------------------------------
 ns/m.01119bmt | 2009 QDoc                         | ns/film.film_festival_event.festival | Festival    | ns/m.0lm919d   | QDoc
 ns/m.0r9mpb7  | 2010 KidFilm Festival             | ns/film.film_festival_event.festival | Festival    | ns/m.011dxlp0  | KidFilm Festival
 ns/m.0zb4wdg  | 2013 Palić Film Festival          | ns/film.film_festival_event.festival | Festival    | ns/m.02wxclb   | Palić Film Festival
 ns/m.0111b2xs | 2012 Fête de l'Animation          | ns/film.film_festival_event.festival | Festival    | ns/g.12214qrxp | Fête de l'Animation
 ns/m.0111b2xs | 2012 Fête de l'Animation          | ns/film.film_festival_event.festival | Festival    | ns/g.12214qrxp | Fête de l'Animation
 ns/m.0111b2xs | 2012 Fête de l'Animation          | ns/film.film_festival_event.festival | Festival    | ns/g.12214qrxp | Fête de l'Animation
 ns/m.010h2sg3 | 2011 11mm Filmfestival Berlin     | ns/film.film_festival_event.festival | Festival    | ns/m.0bdxcb4   | 11mm Filmfestival Berlin
 ns/m.0rh99r7  | 2003 Panorama of European Cinema  | ns/film.film_festival_event.festival | Festival    | ns/m.0107tj0d  | Panorama of European Cinema
 ns/m.01069sst | 2013 Neum Animated Film Festival  | ns/film.film_festival_event.festival | Festival    | ns/m.01069v41  | Neum Animated Film Festival
 ns/m.010h61_1 | 2012 Portland Maine Film Festival | ns/film.film_festival_event.festival | Festival    | ns/m.0100zwb5  | Portland Maine Film Festival
(10 rows)
materialize=> DROP VIEW foo;
DROP VIEW

Beautiful.

Here, I’d like to take a brief pause to mention that I had been vaguely planning to, at some point, redo all this on top of file sources with the TAIL option to show off Materialize’s incremental computation. But it turns out I don’t need to, because it’s doing it for me. I think this is just another data point in favor of “I’m still wrapping my head around streaming paradigms”.

A Materialized View of My Very Own #

On to our final view, this one materialized because, as described in Part 1, it’s what our API server will be a thin wrapper around. I went back and forth on how to structure it. The natural SQL way would be rows like:

id, name, prd_id1, prd_name_en1, obj_id1, obj_name_en1
id, name, prd_id2, prd_name_en2, obj_id2, obj_name_en2

This would repeat “id” and “name” for each tuple, which is wasteful and unsatisfying. Given that the API endpoint is going to return json anyway, why explode it just to unexplode it later? After some mulling, I ended up with a key-value structure of id -> jsonb endpoint response.

Without further ado… (I’m so so sorry.)

materialize=> CREATE MATERIALIZED VIEW api_debug AS
        SELECT
                sub_n.id AS sub_id,
                jsonb_build_object(
                        'sub_id',
                        sub_n.id,
                        'sub_name_en',
                        sub_n.name_en,
                        'sub_tuples',
                        jsonb_build_array(sub_t.named_tuples)
                ) AS json
        FROM
                (
                        SELECT
                                sub_id AS id, sub_name_en AS name_en
                        FROM
                                named_tuples
                )
                        AS sub_n
                JOIN LATERAL (
                                SELECT
                                        sub_id AS id,
                                        jsonb_agg(named_tuple) AS named_tuples
                                FROM
                                        (
                                                SELECT
                                                        sub_id,
                                                        jsonb_build_object(
                                                                'sub_id',
                                                                sub_id,
                                                                'sub_name_en',
                                                                sub_name_en,
                                                                'prd_id',
                                                                prd_id,
                                                                'prd_name_en',
                                                                prd_name_en,
                                                                'obj_id',
                                                                obj_id,
                                                                'obj_name_en',
                                                                obj_name_en
                                                        )
                                                                AS named_tuple
                                                FROM
                                                        named_tuples
                                        )
                                GROUP BY
                                        sub_id
                        )
                                AS sub_t ON sub_n.id = sub_t.id;
CREATE VIEW
materialize=> SELECT * FROM api_debug LIMIT 10;
    sub_id     |                                                                                                                                                                                                                                                                                                                                                         json
---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 ns/m.01119bmt | {"sub_id":"ns/m.01119bmt","sub_name_en":"2009 QDoc","sub_tuples":[[{"obj_id":"ns/m.0lm919d","obj_name_en":"QDoc","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.01119bmt","sub_name_en":"2009 QDoc"}]]}
 ns/m.0r9mpb7  | {"sub_id":"ns/m.0r9mpb7","sub_name_en":"2010 KidFilm Festival","sub_tuples":[[{"obj_id":"ns/m.011dxlp0","obj_name_en":"KidFilm Festival","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0r9mpb7","sub_name_en":"2010 KidFilm Festival"}]]}
 ns/m.0zb4wdg  | {"sub_id":"ns/m.0zb4wdg","sub_name_en":"2013 Palić Film Festival","sub_tuples":[[{"obj_id":"ns/m.02wxclb","obj_name_en":"Palić Film Festival","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0zb4wdg","sub_name_en":"2013 Palić Film Festival"}]]}
 ns/m.010h2sg3 | {"sub_id":"ns/m.010h2sg3","sub_name_en":"2011 11mm Filmfestival Berlin","sub_tuples":[[{"obj_id":"ns/m.0bdxcb4","obj_name_en":"11mm Filmfestival Berlin","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.010h2sg3","sub_name_en":"2011 11mm Filmfestival Berlin"}]]}
 ns/m.0rh99r7  | {"sub_id":"ns/m.0rh99r7","sub_name_en":"2003 Panorama of European Cinema","sub_tuples":[[{"obj_id":"ns/m.0107tj0d","obj_name_en":"Panorama of European Cinema","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0rh99r7","sub_name_en":"2003 Panorama of European Cinema"}]]}
 ns/m.01069sst | {"sub_id":"ns/m.01069sst","sub_name_en":"2013 Neum Animated Film Festival","sub_tuples":[[{"obj_id":"ns/m.01069v41","obj_name_en":"Neum Animated Film Festival","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.01069sst","sub_name_en":"2013 Neum Animated Film Festival"}]]}
 ns/m.010h61_1 | {"sub_id":"ns/m.010h61_1","sub_name_en":"2012 Portland Maine Film Festival","sub_tuples":[[{"obj_id":"ns/m.0100zwb5","obj_name_en":"Portland Maine Film Festival","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.010h61_1","sub_name_en":"2012 Portland Maine Film Festival"}]]}
 ns/m.0111b2xs | {"sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation","sub_tuples":[[{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"}]]}
 ns/m.0111b2xs | {"sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation","sub_tuples":[[{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"}]]}
 ns/m.0111b2xs | {"sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation","sub_tuples":[[{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"},{"obj_id":"ns/g.12214qrxp","obj_name_en":"Fête de l'Animation","prd_id":"ns/film.film_festival_event.festival","prd_name_en":"Festival","sub_id":"ns/m.0111b2xs","sub_name_en":"2012 Fête de l'Animation"}]]}
(10 rows)
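
One wart I notice in the output, which I’ll file away for the cleanup pass: sub_tuples comes out doubly nested as [[...]]. I believe that’s because jsonb_agg already returns a jsonb array, which I then wrap in jsonb_build_array a second time. The fix should be as simple as using the aggregate directly (a sketch of the changed piece, not the full view):

        jsonb_build_object(
                'sub_id', sub_n.id,
                'sub_name_en', sub_n.name_en,
                'sub_tuples', sub_t.named_tuples
        ) AS json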

Alright, I’m really changing my tune on this whole “show stuff right away” bit. Getting this SQL working took a number of tries. If I’d been using anything but Materialize to prototype this, I would have had to wait quite a while to get the results of each attempt (or manually make an even smaller subset of the freebase data). With Materialize, creating the view, selecting from it, dropping it, and trying again were all basically instantaneous.

Brainstorming: Thinking about how I’d use this in production, I wish there was some way for my select to block until all the sources were “caught up enough” so I don’t serve incomplete results. For a non-tailed file, what this means is straightforward: when the whole file is loaded. On the other hand, what it means for tailed files, kafka, etc is less clear. Maybe when it’s caught up to within some time delta of where the source is at? I can see why we haven’t solved this yet; there are some meaty product and UX questions here.

I’m going to be using this view for simple id lookups, so I want to make sure there’s an index on id. I didn’t see a place in the CREATE MATERIALIZED VIEW grammar to specify an index, but the docs page mentions an [index is automatically made][materialized view index] for me.

[materialized view index]: https://materialize.com/docs/sql/create-materialized-view/#indexes
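
As an aside: if the automatic index hadn’t covered my lookup column, my understanding is that Materialize also supports creating one explicitly, along the lines of (the index name here is my own invention):

materialize=> CREATE INDEX api_debug_sub_id_idx ON api_debug (sub_id);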

materialize=> SHOW INDEXES FROM api_debug;
  on_name  |       key_name        | seq_in_index | column_name | expression | nullable
-----------+-----------------------+--------------+-------------+------------+----------
 api_debug | api_debug_primary_idx |            1 | sub_id      |            | t
 api_debug | api_debug_primary_idx |            2 | json        |            | t
(2 rows)

Yep, that should work. Let’s look for Erica Albright from Part 2.

materialize=> SELECT * FROM api_debug WHERE sub_id = 'ns/m.09r8m05';
 sub_id | api_debug
--------+--------------------
(0 rows)

Bummer, but that makes sense. It’s still loading things. Let’s wait for Materialize to catch up and try again.

Before that happens…

OOM #

Killed: 9

After a brief investigation (aka asking in Slack), it looks like this means Materialize is running out of memory (and swap?). I restarted it while watching Activity Monitor and confirmed. At some point while browsing the docs today, I saw something about debugging and optimizing memory usage, but if possible I’d like to come back to that later.

Let’s see if something dumb and easy works to unblock us. I restart Materialize, quickly drop the view and recreate it with LIMIT 10 added to the (SELECT ... FROM named_tuples) AS sub_n above.
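
Abbreviated, the only change is to that driving subquery (a sketch, not the full view definition):

        (
                SELECT
                        sub_id AS id, sub_name_en AS name_en
                FROM
                        named_tuples
                LIMIT 10
        )
                AS sub_n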

Sorta? It’s using swap but not crashing anymore. Now to hook it up to a webpage.

Success! #

I’m out of practice here, so I’d like something minimal to serve HTTP requests and query Materialize for the data. With a bit of looking, I found [tide], which seems to be associated with the official Rust folks in some way. After a bit of iteration, I managed to get something working end-to-end!

[tide]: https://github.com/http-rs/tide
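
At its heart, the handler is a thin wrapper: pull the id out of the request path, bind it as a parameter, and run the indexed point lookup from above. Something like:

SELECT json FROM api_debug WHERE sub_id = $1;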

[Screenshot, 2021-04-30: the debug page working end-to-end in the browser.]

I’m running out of time for the day, so I’ll have to go into more detail in a later post. The code needs a bit of cleanup before I push it anywhere, so that will have to wait, too.

Funny enough, while iterating on the web frontend, I noticed that the set of 10 things being selected by my LIMIT 10 hack is changing over time, which means I have to keep finding another id to test with. Makes sense once I think about it.

Next Up: Debugging Memory Usage #

I’d like to polish up the web frontend and get it deployed somewhere, but it’s pretty clear that my next task is to figure out how to write these views so they don’t knock over the materialized process. I’d only planned through here when I started all this, but now I know what Part 4 is going to be.

Encouragingly, it feels like the basic idea is workable. When our Chief Scientist [Frank] read part 1, he pointed me at [Declarative Dataflow], which efficiently processes queries over (subject, predicate, object) 3-tuples and is built on top of the same [Differential Dataflow] incremental computation framework that powers Materialize. So there’s no reason we shouldn’t be able to do it, too.

[frank]: https://github.com/frankmcsherry/
[declarative dataflow]: https://github.com/comnik/declarative-dataflow
[differential dataflow]: https://github.com/TimelyDataflow/differential-dataflow

There were a few bumps along the way that we (Materialize) can polish up pretty easily, and I’ll file issues for those. I think there’s also a larger takeaway here around helping users that are new to streaming wrap their heads around its unfamiliar paradigms. These sorts of discoveries are exactly why dogfooding is so important and why I wanted to do it right when I started and had fresh eyes.

Stay tuned!
