A few things I really like about pgloader (which, to be clear, uses the COPY protocol):
- Handles MySQL -> pg, MSSQL -> pg, and pg -> pg copies
- Very fast (lisp/"real threading")
- Syncs schemas effectively (this tool says data only)
- Has its own DSL for batch jobs: specify tables to include/exclude, rename them on the fly in the destination, cast data types between source and destination if needed, etc. (see the sketch below)
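To make that concrete, here's a rough, untested sketch of what a load command file can look like, written out and invoked from Python. The connection strings, table names, and casts are all placeholders, and the exact clause syntax should be double-checked against the pgloader docs.

    # Sketch only: hosts, credentials, and table names below are invented, and
    # the DSL clauses should be verified against the pgloader documentation.
    import subprocess
    import tempfile

    LOAD_FILE = """
    LOAD DATABASE
         FROM mysql://app:secret@mysql-host/shop
         INTO postgresql://app:secret@pg-host/shop

     CAST type datetime to timestamptz drop default drop not null,
          type tinyint to boolean using tinyint-to-boolean

     INCLUDING ONLY TABLE NAMES MATCHING 'orders', 'customers';
    """

    with tempfile.NamedTemporaryFile("w", suffix=".load", delete=False) as f:
        f.write(LOAD_FILE)
        command_file = f.name

    # pgloader parses the command file and streams the data over COPY.
    subprocess.run(["pgloader", command_file], check=True)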
I tried migrating a smallish MySQL database (~10 GB) to Postgres and it always crashed with a weird runtime memory error. Reducing the number of threads or doing it table by table didn't help.
I found compiling with Clozure CL instead of SBCL leads to way better memory allocation/usage.
It's worth getting it going; it's one of the best tools I've ever used for migrating things into and out of pgsql. If you have pgsql in your pipeline, get it working.
I used this and Debezium to make my ETL pipeline absolutely bulletproof.
"Debezium is a set of distributed services that capture row-level changes in your databases so that your applications can see and respond to those changes."
It has a lot of required components; however, I ran most of them as single instances (e.g. one ZooKeeper server and one Kafka server, using only their provided containers) and got extremely far (think a billion rows a day ingested, then transferred to pgsql and S3), with some extra overhead still.
Just offering an anecdotal experience to counter, not saying you or your experience is wrong...
I've used pgloader for that exact use case multiple times without issue, because I'm a huge Postgres evangelist. Honestly, it's a favorite tool in my toolbox.
Postgres 10+ now has logical replication features built in, which makes that a better bet for near-zero-downtime replication. Especially if you're trying to move a database geographically or to a different machine or Postgres version, without the downtime that a full dump and restore requires.
This isn't rsync-style two-way sync, though; it's more for one-way streaming / ongoing replication, although you could technically set up a master-master or replication chain if the conflicts are dealt with correctly.
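For anyone who hasn't used it: the built-in flow is basically two statements, CREATE PUBLICATION on the source and CREATE SUBSCRIPTION on the destination. A minimal sketch with psycopg2 follows, using placeholder hosts and tables; both clusters need wal_level = logical and the destination needs matching table definitions.

    import psycopg2

    # Source cluster: publish the tables you want streamed out.
    src = psycopg2.connect("host=old-primary dbname=app user=admin")
    src.autocommit = True
    src.cursor().execute("CREATE PUBLICATION app_pub FOR TABLE orders, customers;")
    src.close()

    # Destination cluster: the initial table contents are copied, then changes stream in.
    # CREATE SUBSCRIPTION cannot run inside a transaction block, hence autocommit.
    dst = psycopg2.connect("host=new-primary dbname=app user=admin")
    dst.autocommit = True
    dst.cursor().execute(
        "CREATE SUBSCRIPTION app_sub "
        "CONNECTION 'host=old-primary dbname=app user=admin' "
        "PUBLICATION app_pub;"
    )
    dst.close()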
Last time I tried to set up pg, I had a notion of a read-mostly website and I intended to feed it with a read-only replica and squirrel away the master someplace secure with only admin access. But it turns out that pg slaves contact the master, and not the other way around.
I never did find a solution I could live with. I'm trying to start that project back up right now.
How does it work in the event of real time changes happening on the database(s)?
It appears not to use logical decoding/replication. If so, how does it sync the data? It sounds like a hard problem, and not very efficient, to do it without logical decoding/replication.
I didn't see documentation about it. Unless... it is intended to be used only with offline databases?
Is this two-way or one-way sync? It's sounding like it's one-way. I don't see anything about how to deal with clobbers, for instance, if the same data is changed in two different databases at the same time.
And "batch one-way sync" is better described as "copy".
So I think this is a "postgres database copy" tool. Also known as clone, or backup. And as such, it's competing with the existing postgres database cloning tools, like pg_dump. So how is this different than that?
One-way, yes. pg's COPY command can't do upserts; this can.
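The usual way to get COPY speed plus upsert semantics is to COPY into a staging table and then merge with INSERT ... ON CONFLICT. Here's a sketch of that general pattern (not necessarily pgsync's internals; the users table and CSV file are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=target user=app")
    cur = conn.cursor()

    # Staging table with the same shape as the destination table.
    cur.execute("CREATE TEMP TABLE users_stage (LIKE users INCLUDING DEFAULTS);")

    # Bulk-load the incoming rows with the COPY protocol.
    with open("users.csv") as f:
        cur.copy_expert("COPY users_stage (id, email, name) FROM STDIN WITH CSV", f)

    # Merge: insert new rows, update existing ones by primary key.
    cur.execute("""
        INSERT INTO users (id, email, name)
        SELECT id, email, name FROM users_stage
        ON CONFLICT (id) DO UPDATE
            SET email = EXCLUDED.email,
                name  = EXCLUDED.name;
    """)
    conn.commit()
    conn.close()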
The primary use case for this appears to be ETL. pg_dump is more backup-oriented and optimized for larger operations, whereas most ETL processes are a bit more fine-grained.
Embulk has been around for a long time, and supports considerably more options than just Postgres (especially since pgsync looks like it's just doing statement-based data copying).
A little late to the party, but this could possibly help someone.
I spent a few hours trying all of the options in this thread and I couldn't find anything that also transferred functions and triggers, which was vital for my project.
Here is a little bash file that I whipped together to completely wipe the DB and import it in.
I'd really like some Postgres sync tool that would allow me to do something like "copy-on-write", i.e. do an upsert on each row unless some condition evaluates to true, using both target and source data (e.g. target modified_at > source modified_at). That would allow some common data to be easily shared between databases (with better performance characteristics than e.g. a foreign table/union view).
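Plain SQL gets part of the way there already: ON CONFLICT ... DO UPDATE takes a WHERE clause that can compare the existing target row against the incoming source row. A hand-rolled sketch (the documents table and its columns are invented):

    import psycopg2

    conn = psycopg2.connect("dbname=target user=app")
    cur = conn.cursor()
    cur.execute("""
        INSERT INTO documents AS t (id, body, modified_at)
        VALUES (%s, %s, %s)
        ON CONFLICT (id) DO UPDATE
            SET body        = EXCLUDED.body,
                modified_at = EXCLUDED.modified_at
            -- only overwrite when the incoming row is newer than the target row
            WHERE t.modified_at < EXCLUDED.modified_at;
    """, (42, "hello", "2023-01-01T00:00:00Z"))
    conn.commit()
    conn.close()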
I have a Citus cluster in production and then some manual scripts to turn it back into a single-tenant database for in-house reporting. Basically pg_dump and restore. This has to happen overnight since we can't run reports while it is restoring.
I am excited to try this and see if it might be a better way to pull in what's needed.
Is there something specific in Postgres (perhaps in the WAL, its equivalent of the binlog) that makes it easier to do this than in MySQL? It seems like there are more tools for this kind of data loading than I've seen for MySQL.
This tool uses COPY, which is highly optimized and supports stdout and stdin. The combination makes it relatively easy to copy tables around; I don't believe MySQL's INFILE/OUTFILE is as flexible.
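Concretely, COPY ... TO STDOUT on one connection can be fed into COPY ... FROM STDIN on another, so a table moves between servers without touching disk. A minimal psycopg2 sketch with placeholder DSNs and table name (buffered in memory for brevity; a real pipeline would stream through a pipe):

    import io
    import psycopg2

    src = psycopg2.connect("host=source-db dbname=app")
    dst = psycopg2.connect("host=dest-db dbname=app")

    # Dump the table from the source over the COPY protocol.
    buf = io.BytesIO()
    src.cursor().copy_expert("COPY events TO STDOUT", buf)

    # Replay the same stream into the destination.
    buf.seek(0)
    dst.cursor().copy_expert("COPY events FROM STDIN", buf)
    dst.commit()

    src.close()
    dst.close()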
Related note to this library: Andrew Kane (@ankane) has released some great tools that I've found useful, notably Blazer (for our use cases) -- good to see this is another nice piece of work from him. Thanks @ankane!