PgSync: Sync Postgres data between databases (github.com/ankane)
166 points by aeontech on March 24, 2020 | 35 comments




what makes it better?


A few things I really like about pgloader (which to be clear, uses the COPY protocol):

- Handles MySQL -> pg, MSSQL -> pg, and pg -> pg copies

- Very fast (lisp/"real threading")

- Syncs schemas effectively (this says data only)

- Has its own DSL for batch jobs: specify tables to include/exclude, rename them on the fly in the destination, and cast data types between source and destination if needed, etc. (sketch below)
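A minimal sketch of that kind of command file, modeled on the examples in the pgloader docs (the connection strings, table pattern, and cast rule here are made up for illustration; check the reference for your version):

    LOAD DATABASE
         FROM mysql://app:secret@mysql-host/legacy_db
         INTO postgresql://app:secret@pg-host/new_db

     WITH include drop, create tables, create indexes, reset sequences

     CAST type datetime to timestamptz drop default drop not null

    EXCLUDING TABLE NAMES MATCHING 'audit_log';

Save it as migrate.load and run it with "pgloader migrate.load".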


I tried it once and could not make it work.

I tried migrating a smallish MySQL database (~10 GB) to Postgres and it always crashed with a weird runtime memory error. Reducing the number of threads or doing it table by table didn't help.


I found compiling with Clozure CL instead of SBCL leads to way better memory allocation/usage.
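If memory serves, switching the Lisp is just a Makefile variable when building from source (treat the exact variable name as an assumption and check the pgloader README):

    # build the pgloader binary with Clozure CL instead of the default SBCL
    make CL=ccl pgloader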

It's worth getting it going; it's one of the greatest tools I've ever used for migrating things into and out of pgsql. If you have pgsql in your pipeline, get it working.

I used this and debezium to really make my ETL pipeline absolutely bulletproof.


> I used this and debezium

This looks very cool, was not aware of it:

"Debezium is a set of distributed services that capture row-level changes in your databases so that your applications can see and respond to those changes."

https://debezium.io/documentation/faq/#what_is_debezium


It has a lot of required components; however, I ran most of them in a single-instance setup (e.g. 1 ZooKeeper server and 1 Kafka server, using only their provided containers) and got extremely far (think 1 billion rows a day ingested, then transferred to pgsql and S3), with some extra overhead still.
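For anyone curious, that single-instance setup is basically the Debezium tutorial layout (image names, ports, and topic names here follow the tutorial; treat it as a sketch, not a production config):

    # one ZooKeeper, one Kafka broker, one Kafka Connect worker with the Debezium connectors
    docker run -d --name zookeeper -p 2181:2181 debezium/zookeeper
    docker run -d --name kafka --link zookeeper:zookeeper -p 9092:9092 debezium/kafka
    docker run -d --name connect --link kafka:kafka -p 8083:8083 \
      -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=connect_configs \
      -e OFFSET_STORAGE_TOPIC=connect_offsets -e STATUS_STORAGE_TOPIC=connect_statuses \
      debezium/connect
    # source connectors (MySQL, Postgres, ...) are then registered via the Connect REST API on :8083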


Just offering an anecdotal experience to counter, not saying you or your experience is wrong...

I've used pgloader (I'm a huge Postgres evangelist) for that exact use case multiple times without issue. Honestly, it's a favorite tool in my toolbox.


> weird runtime memory error.

If it's "heap exhausted, game over", there's a solution for that - you need to tell pgloader to allocate more memory for itself.



That was noted in the first line of my comment FWIW.


It wasn't clear you were saying it uses COPY just like pgsync uses COPY; I read it as you drawing a distinction between the two.


Looks good, am going to give it a test drive. Does it use foreign data wrappers to create connections between servers?


Postgres 10+ now has logical replication features built in, which makes that a better bet for near-zero-downtime replication. Especially if you're trying to move a database geographically or to a different machine or Postgres version, without the downtime that a full dump and restore requires.

https://www.postgresql.org/docs/10/logical-replication.html

This isn't rsync-style two-way sync, though; it's more for one-way streaming / ongoing replication, although you could technically set up master-master or a replication chain if conflicts are dealt with correctly.
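The moving parts are small. A minimal one-way sketch (hypothetical host and database names; the target schema has to exist first, since logical replication only ships row changes):

    # on the source (publisher)
    psql "host=old-host dbname=app" -c "CREATE PUBLICATION app_pub FOR ALL TABLES;"
    # on the target (subscriber): does the initial table copy, then streams changes
    psql "host=new-host dbname=app" -c \
      "CREATE SUBSCRIPTION app_sub CONNECTION 'host=old-host dbname=app user=repl' PUBLICATION app_pub;"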


Last time I tried to set up pg, I had a notion of a read-mostly website and I intended to feed it with a read-only replica and squirrel away the master someplace secure with only admin access. But it turns out that pg slaves contact the master, and not the other way around.

I never did find a solution I could live with. I'm trying to start that project back up right now.


How does it work in the event of real time changes happening on the database(s)?

It appears not to use logical decoding/replication. If so, how does it sync the data? It sounds like a hard problem, and not very efficient, to do it without logical decoding/replication.

I didn't see documentation about it. Unless.. it is intended to be used only with offline databases.


It's batch sync, not streaming or replication. By default, appears to dedupe based on PK.


Is this two-way or one-way sync? It sounds like it's one-way. I don't see anything about how clobbers are dealt with, for instance if the same data is changed in two different databases at the same time.

And "batch one-way sync" is better described as "copy".

So I think this is a "postgres database copy" tool. Also known as clone, or backup. And as such, it's competing with the existing postgres database cloning tools, like pg_dump. So how is this different than that?


One-way, yes. Postgres's COPY command can't do upserts; this can.

The primary use case for this appears to be ETL. pg_dump is more backup-oriented and optimized for larger operations, whereas most ETL processes need something a bit more fine-grained.
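For anyone wondering how COPY and upserts fit together, the usual pattern is to COPY into a staging table and merge from there (this is the generic approach, not necessarily what pgsync does internally; the table, columns, CSV file, and $TARGET_URL are hypothetical):

    psql "$TARGET_URL" <<'SQL'
    -- stage the incoming rows, then merge them into the real table
    CREATE TEMP TABLE users_stage (LIKE users INCLUDING DEFAULTS);
    \copy users_stage FROM 'users.csv' WITH (FORMAT csv)
    INSERT INTO users SELECT * FROM users_stage
    ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name, email = EXCLUDED.email;
    SQL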


Kind of like MSSQL SSIS, but with only data export-import functionality.


pg_comparator, albeit barely known, is a monster. For sync purposes, think of it as rsync for PG.

It needs temp tables.

https://www.cri.ensmp.fr/people/coelho/pg_comparator/


Can confirm pg_comparator is a very solid piece of software.


Embulk has been around for a long time, and supports considerably more options than just Postgres (especially since pgsync looks like it's just doing statement-based data copying).

https://github.com/embulk/embulk


A little late to the party, but this could possibly help someone.

I spent a few hours trying all of the options in this thread and couldn't find anything that also transferred functions and triggers, which was vital for my project.

Here is a little bash file that I whipped together to completely wipe the DB and import it in.

https://gist.github.com/nickreese/8eeb01e2f73fec2bf2dbfecf5d...

Just have the variables in your .env file and it will wipe your local DB and copy from production.
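The gist itself is linked above; the general shape of that kind of script is roughly the following (the variable names here are hypothetical, not necessarily what the gist uses):

    #!/usr/bin/env bash
    set -euo pipefail
    source .env   # expects e.g. PROD_URL and LOCAL_DB

    dropdb --if-exists "$LOCAL_DB"
    createdb "$LOCAL_DB"
    # schema + data, including functions and triggers, straight from production
    pg_dump --no-owner --no-acl "$PROD_URL" | psql "$LOCAL_DB"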


I'd really like some Postgres sync tool that would allow me to use something like "copy-on-write", i.e. do an upsert on each row unless some condition evaluates to true, using both target and source data (e.g. target modified_at > source modified_at). It would allow for some common data to be easily shared between databases (with better performance characteristics than e.g. a foreign table/union view).
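Plain Postgres can already express that condition once the source rows are staged locally (e.g. via COPY or a foreign table); a sketch with hypothetical table and column names:

    psql "$TARGET_URL" <<'SQL'
    -- overwrite a row only if the incoming copy is newer
    INSERT INTO items AS t (id, payload, modified_at)
    SELECT id, payload, modified_at FROM items_incoming
    ON CONFLICT (id) DO UPDATE
      SET payload = EXCLUDED.payload, modified_at = EXCLUDED.modified_at
      WHERE t.modified_at < EXCLUDED.modified_at;
    SQL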


How does this compare to dumping and loading for full databases? I have a number of cloning processes that do that, should I look into this instead?


You can't use the local DB while restoring; this should help with that (I hope).


This makes getting production data down to your local machine super quick.

Need user where id = 64418362 and associated orders, order_items, etc

Done in like 5 seconds
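For reference, per the pgsync README the basic form is a table name plus a where clause; related tables are handled by defining a group in .pgsync.yml (exact group syntax is in the README):

    # pull just the rows you need from the source database into the local one
    pgsync users "where id = 64418362"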


I have a Citus cluster in production and then some manual scripts to turn it back into a single-tenant database for in-house reporting. Basically pg_dump and restore. This has to happen overnight since we can't run reports while it is restoring.

I am excited to try this and see if it might be a better way to pull in what's needed.


Is there something specific in Postgres (perhaps in the WAL, its equivalent of the binlog) that makes it easier to do this than MySQL? Seems like there are more tools for this kind of data loading than I've seen for MySQL.

BTW: Awesome work ankane! Cool dude too :)


This tool uses COPY, which is highly optimized and supports stdout and stdin. The combination makes it relatively easy to copy tables around. I don't believe MySQL's INFILE/OUTFILE is as flexible.
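That combination is what makes the quick one-liners possible, e.g. streaming a table between servers without an intermediate file (hypothetical connection strings and table name):

    # COPY ... TO STDOUT on the source piped straight into COPY ... FROM STDIN on the target
    psql "host=src dbname=app" -c "\copy orders TO STDOUT" \
      | psql "host=dst dbname=app" -c "\copy orders FROM STDIN"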


I did the ghetto version of this yesterday: pg_dump -t tablename sourcedatabase | psql targetdatabase


Related note to this library: Andrew Kane (@ankane) has released some great tools that I've found useful, notably Blazer (for our use cases) -- good to see this is another nice piece of work from him. Thanks @ankane!


Purely in Ruby as well... nice.


Ruby is just for plumbing, i.e. preparing queries and whatnot; it's still pg_dump under the hood.



