PgSync: Sync Postgres data between databases (github.com/ankane)
166 points by aeontech on March 24, 2020 | 35 comments




what makes it better?


A few things I really like about pgloader (which to be clear, uses the COPY protocol):

- Handles MySQL -> pg, MSSQL -> pg, and pg -> pg copies

- Very fast (lisp/"real threading")

- Syncs schemas effectively (this says data only)

- Has its own DSL for batch jobs: specify tables to include/exclude, rename them on the fly in the destination, and cast data types between source and destination if needed, etc. (sketch below)
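A minimal sketch of that kind of command file, modeled on the examples in the pgloader docs (the connection strings, table pattern, and cast rule here are made up for illustration; check the reference for your version):

    LOAD DATABASE
         FROM mysql://app:secret@mysql-host/legacy_db
         INTO postgresql://app:secret@pg-host/new_db

     WITH include drop, create tables, create indexes, reset sequences

     CAST type datetime to timestamptz drop default drop not null

    EXCLUDING TABLE NAMES MATCHING 'audit_log';

Save it as migrate.load and run it with "pgloader migrate.load".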


I tried it once and could not make it work.

I tried migrating a smallish MySQL database (~10 GB) to Postgres and it always crashed with a weird runtime memory error. Reducing the number of threads or doing it table by table didn't help.


I found compiling with Clozure CL instead of SBCL leads to way better memory allocation/usage.
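If memory serves, switching the Lisp is just a Makefile variable when building from source (treat the exact variable name as an assumption and check the pgloader README):

    # build the pgloader binary with Clozure CL instead of the default SBCL
    make CL=ccl pgloader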

It's worth getting it going; it's one of the greatest tools I've ever used for migrating things into and out of pgsql. If you have pgsql in your pipeline, get it working.

I used this and debezium to really make my ETL pipeline absolutely bulletproof.


> I used this and debezium

This looks very cool, was not aware of it:

"Debezium is a set of distributed services that capture row-level changes in your databases so that your applications can see and respond to those changes."

https://debezium.io/documentation/faq/#what_is_debezium


It has a lot of required components; however, I ran most of them in a single-instance setup (e.g. 1 ZooKeeper server and 1 Kafka server, using only their provided containers) and got extremely far (think 1 billion rows a day ingested, then transferred to pgsql and S3), with some extra overhead still.
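For anyone curious, that single-instance setup is basically the Debezium tutorial layout (image names, ports, and topic names here follow the tutorial; treat it as a sketch, not a production config):

    # one ZooKeeper, one Kafka broker, one Kafka Connect worker with the Debezium connectors
    docker run -d --name zookeeper -p 2181:2181 debezium/zookeeper
    docker run -d --name kafka --link zookeeper:zookeeper -p 9092:9092 debezium/kafka
    docker run -d --name connect --link kafka:kafka -p 8083:8083 \
      -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=connect_configs \
      -e OFFSET_STORAGE_TOPIC=connect_offsets -e STATUS_STORAGE_TOPIC=connect_statuses \
      debezium/connect
    # source connectors (MySQL, Postgres, ...) are then registered via the Connect REST API on :8083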


Just offering an anecdotal experience to counter, not saying you or your experience is wrong...

I've used pgloader (I'm a huge Postgres evangelist) for that exact use case multiple times without issue. Honestly, it's a favorite tool in my toolbox.


> weird runtime memory error.

If it's "heap exhausted, game over", there's a solution for that - you need to tell pgloader to allocate more memory for itself.



That was noted in the first line of my comment FWIW.


It wasn't clear you were saying it uses COPY just like pgsync uses COPY; I read it as you drawing a distinction between the two.


Looks good, am going to give it a test drive. Does it use foreign data wrappers to create connections between servers?


Postgres 10+ now has logical replication features built in, which makes that a better bet for near-zero-downtime replication. Especially if you're trying to move a database geographically or to a different machine or Postgres version, without the downtime that a full dump and restore requires.

https://www.postgresql.org/docs/10/logical-replication.html

This isn't rsync-style two-way sync, though; it's more for one-way streaming / ongoing replication, although you could technically set up master-master or a replication chain if conflicts are dealt with correctly.
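The moving parts are small. A minimal one-way sketch (hypothetical host and database names; the target schema has to exist first, since logical replication only ships row changes):

    # on the source (publisher)
    psql "host=old-host dbname=app" -c "CREATE PUBLICATION app_pub FOR ALL TABLES;"
    # on the target (subscriber): does the initial table copy, then streams changes
    psql "host=new-host dbname=app" -c \
      "CREATE SUBSCRIPTION app_sub CONNECTION 'host=old-host dbname=app user=repl' PUBLICATION app_pub;"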


Last time I tried to set up pg, I had a notion of a read-mostly website and I intended to feed it with a read-only replica and squirrel away the master someplace secure with only admin access. But it turns out that pg slaves contact the master, and not the other way around.

I never did find a solution I could live with. I'm trying to start that project back up right now.


How does it work in the event of real time changes happening on the database(s)?

It appears not to use logical decoding/replication. If so, how does it sync the data? It sounds like a hard problem, and not very efficient, to do it without logical decoding/replication.

I didn't see documentation about it. Unless.. it is intended to be used only with offline databases.


It's batch sync, not streaming or replication. By default, appears to dedupe based on PK.


Is this two-way or one-way sync? It sounds like it's one-way. I don't see anything about how clobbers are dealt with, for instance if the same data is changed in two different databases at the same time.

And "batch one-way sync" is better described as "copy".

So I think this is a "postgres database copy" tool. Also known as clone, or backup. And as such, it's competing with the existing postgres database cloning tools, like pg_dump. So how is this different than that?


One-way, yes. Postgres's COPY command can't do upserts; this can.

The primary use case for this appears to be ETL. pg_dump is more backup-oriented and optimized for larger operations, whereas most ETL processes need something a bit more fine-grained.
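For anyone wondering how COPY and upserts fit together, the usual pattern is to COPY into a staging table and merge from there (this is the generic approach, not necessarily what pgsync does internally; the table, columns, CSV file, and $TARGET_URL are hypothetical):

    psql "$TARGET_URL" <<'SQL'
    -- stage the incoming rows, then merge them into the real table
    CREATE TEMP TABLE users_stage (LIKE users INCLUDING DEFAULTS);
    \copy users_stage FROM 'users.csv' WITH (FORMAT csv)
    INSERT INTO users SELECT * FROM users_stage
    ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name, email = EXCLUDED.email;
    SQL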


Kind of like MSSQL SSIS, but with only data export-import functionality.


pg_comparator, albeit barely known, is a monster. For sync purposes, think of it as rsync for PG.

It needs temp tables.

https://www.cri.ensmp.fr/people/coelho/pg_comparator/


Can confirm pg_comparator is a very solid piece of software.


Embulk has been around for a long time, and supports considerably more options than just Postgres (especially since pgsync looks like it's just doing statement-based data copying).

https://github.com/embulk/embulk


A little late to the party, but this could possibly help someone.

I spent a few hours trying all of the options in this thread and couldn't find anything that also transferred functions and triggers, which was vital for my project.

Here is a little bash file that I whipped together to completely wipe the DB and import it in.

https://gist.github.com/nickreese/8eeb01e2f73fec2bf2dbfecf5d...

Just have the variables in your .env file and it will wipe your local DB and copy from production.
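The gist itself is linked above; the general shape of that kind of script is roughly the following (the variable names here are hypothetical, not necessarily what the gist uses):

    #!/usr/bin/env bash
    set -euo pipefail
    source .env   # expects e.g. PROD_URL and LOCAL_DB

    dropdb --if-exists "$LOCAL_DB"
    createdb "$LOCAL_DB"
    # schema + data, including functions and triggers, straight from production
    pg_dump --no-owner --no-acl "$PROD_URL" | psql "$LOCAL_DB"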


I'd really like some Postgres sync tool that would allow me to use something like "copy-on-write", i.e. do an upsert on each row unless some condition evaluates to true, using both target and source data (e.g. target modified_at > source modified_at). It would allow for some common data to be easily shared between databases (with better performance characteristics than e.g. a foreign table/union view).
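Plain Postgres can already express that condition once the source rows are staged locally (e.g. via COPY or a foreign table); a sketch with hypothetical table and column names:

    psql "$TARGET_URL" <<'SQL'
    -- overwrite a row only if the incoming copy is newer
    INSERT INTO items AS t (id, payload, modified_at)
    SELECT id, payload, modified_at FROM items_incoming
    ON CONFLICT (id) DO UPDATE
      SET payload = EXCLUDED.payload, modified_at = EXCLUDED.modified_at
      WHERE t.modified_at < EXCLUDED.modified_at;
    SQL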


How does this compare to dumping and loading for full databases? I have a number of cloning processes that do that, should I look into this instead?


You can't use the local DB while restoring; this should help with that (I hope).


This makes getting production data down to your local machine super quick.

Need user where id = 64418362 and associated orders, order_items, etc

Done in like 5 seconds
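For reference, per the pgsync README the basic form is a table name plus a where clause; related tables are handled by defining a group in .pgsync.yml (exact group syntax is in the README):

    # pull just the rows you need from the source database into the local one
    pgsync users "where id = 64418362"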


I have a Citus cluster in production and then some manual scripts to turn it back into a single-tenant database for in-house reporting. Basically pg_dump and restore. This has to happen overnight since we can't run reports while it is restoring.

I am excited to try this and see if it might be a better way to pull in what's needed.


Is there something specific in Postgres (perhaps in the WAL, its equivalent of the binlog) that makes it easier to do this than MySQL? Seems like there are more tools for this kind of data loading than I've seen for MySQL.

BTW: Awesome work ankane! Cool dude too :)


This tool uses COPY, which is highly optimized and supports stdout and stdin. The combination makes it relatively easy to copy tables around. I don't believe MySQL's INFILE/OUTFILE is as flexible.
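That combination is what makes the quick one-liners possible, e.g. streaming a table between servers without an intermediate file (hypothetical connection strings and table name):

    # COPY ... TO STDOUT on the source piped straight into COPY ... FROM STDIN on the target
    psql "host=src dbname=app" -c "\copy orders TO STDOUT" \
      | psql "host=dst dbname=app" -c "\copy orders FROM STDIN"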


I did the ghetto version of this yesterday: pg_dump -t tablename sourcedatabase | psql targetdatabase


Related note to this library: Andrew Kane (@ankane) has released some great tools that I've found useful, notably Blazer (for our use cases) -- good to see this is another nice piece of work from him. Thanks @ankane!


Purely in Ruby as well... nice.


Ruby is just for plumbing, i.e. preparing queries and whatnot; it's still pg_dump under the hood.



