Not the previous poster, but my use case is storing site analytics (i.e. "Google Analytics"-like stuff) and allowing people to retain pageviews only for n amount of days.
It sounds like you might be able to do that using range partitioning instead, similar to the example in the blog post. Then just delete entire partitions when they are older than x days.
Site 1 may choose to keep things for 30 days, site 2 for 50 days, and site 3 may choose to keep things infinitely.
The solution I use now is to just run a background job every day with "delete from hits where site_id = ? and created_at < now() - interval ?". It's pretty simple, but it works.
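Concretely, that cleanup job boils down to something like this (a minimal sketch against SQLite with a made-up schema; the real tables and column names will differ, and the per-site retention lookup is my own illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table sites (site_id integer primary key, retention_days integer);
    create table hits  (site_id integer, path text, created_at text);
""")

# Hypothetical data: site 1 keeps 30 days, site 2 keeps things forever (NULL).
con.execute("insert into sites values (1, 30), (2, null)")
con.execute("insert into hits values (1, '/', datetime('now', '-60 days'))")
con.execute("insert into hits values (1, '/', datetime('now'))")
con.execute("insert into hits values (2, '/', datetime('now', '-400 days'))")

def purge_old_hits(con):
    # One delete per site with a retention policy; sites with NULL
    # retention are skipped and keep everything.
    for site_id, days in con.execute(
            "select site_id, retention_days from sites"
            " where retention_days is not null"):
        con.execute(
            "delete from hits where site_id = ?"
            " and created_at < datetime('now', ?)",
            (site_id, f"-{days} days"))
    con.commit()

purge_old_hits(con)
print(con.execute("select count(*) from hits").fetchone()[0])  # → 2
```

The PostgreSQL variant is the same delete with `now() - interval` instead of SQLite's `datetime('now', ...)` modifier, which is about the extent of the branching needed.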
There are probably some clever things I can do: partition by site ID and then date range, or something like that. But this has a bunch of downsides too:
- I like to keep things as simple as possible to make it easier for people to self-host this, which is a big reason I use PostgreSQL in the first place. If it was just some SaaS I was running then I wouldn't have a lot of problems turning up the complexity a bit if it gave me serious benefits, but when you distribute software you expect other people to be able to run then it's a bit more of a trade-off.
- The entire thing also runs on SQLite, and I'd like to minimize the number of special PostgreSQL/SQLite branches when possible.
Ideally, what I'd like is that people can optionally plug in something like Citus if they need it (and many don't! Even SQLite actually works fine for many) and have standard SQL "just work" without sweeping architectural changes. They can switch it on/off as they please. I don't mind adding a few "if using_citus then ..." exceptions here or there, but there's a limit, especially since many people just don't need it (but would still be "stuck" with a much more complex table structure because of it).
This, for me anyway, is the appeal of things like Citus or TimescaleDB vs. more specialized solutions like Clickhouse or whatnot. I don't need to get the "fastest possible solution", and these solutions strike a good middle ground between complexity/ease of setup vs. speed.
There is also a second use case for UPDATE (the above is mostly for DELETE): right now I pre-compute some data and run most queries on that, rather than on the main append-only table with events, because this is rather faster in various cases. Because the data is aggregated by hour or day, it needs to update those rows when new events come in. The way I do that now is with a unique key and "insert [..] on conflict [..] do update [..]"; there are a bunch of other ways to do this, but this is probably the easiest.
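The nice thing about the upsert approach is that the same statement works in both PostgreSQL and SQLite (3.24+). A sketch with a made-up hourly aggregate table (the column names are illustrative, not the real schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    create table hit_counts (
        site_id integer,
        hour    text,    -- event timestamp truncated to the hour
        count   integer,
        unique (site_id, hour)
    )
""")

def record_hit(con, site_id, created_at):
    # Truncate the timestamp to the hour and bump the pre-aggregated
    # counter; the unique key makes "on conflict" fire for repeat hours.
    con.execute("""
        insert into hit_counts (site_id, hour, count)
        values (?, strftime('%Y-%m-%d %H:00', ?), 1)
        on conflict (site_id, hour) do update set count = count + 1
    """, (site_id, created_at))

record_hit(con, 1, "2024-05-01 10:15:00")
record_hit(con, 1, "2024-05-01 10:45:00")
record_hit(con, 1, "2024-05-01 11:05:00")
print(con.execute("select hour, count from hit_counts order by hour").fetchall())
# → [('2024-05-01 10:00', 2), ('2024-05-01 11:00', 1)]
```

In PostgreSQL you'd use `date_trunc('hour', ...)` instead of `strftime`, but the `on conflict ... do update` part is identical, which is exactly the kind of cross-engine compatibility I'm after.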
In principle those tables should become obsolete with a column store, but when I tried cstore_fdw last year it wasn't really fast enough (although this new version may be; I haven't tried it yet). Even if it were fast enough, it would still mean I'd have to write all queries twice: once for the "normal" PostgreSQL/SQLite use case, and once for the Citus use case, which isn't especially appealing.
So, tl;dr, I want Citus to be an optional thing people can use, and maintain compatibility with mainstream PostgreSQL and SQLite (and, possibly, other engines in the future too).
Perhaps this is a rare use case? I don't know. But it's my use case.