Python – Writing large ZIP archives without memory inflation (github.com/buzonio)
107 points by sandes on May 15, 2020 | 45 comments



It'd be better if they didn't require knowing the paths a priori. One of the fundamental strengths of ZIP is that the file list is at the end of the archive rather than the beginning, letting you dynamically discover and send contents in a streaming fashion. Forcing the list of files to be known a priori works against that.
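To sketch the alternative (made-up names, not this library's API): on Python 3.5+ the stdlib zipfile writes data descriptors when the output object isn't seekable, so entries can be discovered and emitted on the fly:

    import os
    import zipfile

    # Rough sketch, not this library's API. Relies on zipfile (3.5+) using
    # data descriptors when the sink has no tell()/seek(), i.e. is unseekable.
    class StreamSink:
        def __init__(self):
            self._chunks = []

        def write(self, data):
            self._chunks.append(bytes(data))
            return len(data)

        def flush(self):
            pass

        def drain(self):
            chunks, self._chunks = self._chunks, []
            return chunks

    def stream_zip(root):
        sink = StreamSink()
        with zipfile.ZipFile(sink, "w", zipfile.ZIP_DEFLATED) as zf:
            for dirpath, _, filenames in os.walk(root):   # discovery on the fly
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    zf.write(path, os.path.relpath(path, root))
                    yield from sink.drain()               # emit what we have
        yield from sink.drain()                           # central directory

Memory here is bounded by the largest single entry; draining inside a ZipFile.open(..., 'w') copy loop (3.6+) would make it per-chunk.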


I have this same gripe with libzip in C.


Please test your library with the c10-archive test suite https://www.ee.oulu.fi/research/ouspg/PROTOS_Test-Suite_c10-...


How would one test a library for generating compressed archives with a test designed for testing decompressors?


Sorry. My bad.


What is the c10-archive test suite?


Follow the link.


I did something similar to stream large amounts of data off of an embedded device via a zip file, Falcon, and a WSGI server, with no external dependencies for the actual zip stream.

Proof of concept: https://gist.github.com/1e22bbf31b7e5ae84bbdfa32c68e03a9


I remember needing this 5+ years ago; I used a branch of a fork of python-zipstream https://github.com/longaccess/python-zipstream/tree/streamin...


Yeah, I looked into this a while ago for generating photo gallery zip files on the fly. Currently I'm using lazygal to generate static HTML galleries, and if I want to give users the option to easily download the full gallery, it basically doubles the size I need to store, as lazygal simply creates a zip file and puts it next to the photos and HTML files in the static gallery folder it generates.

So I looked into creating the zip files on the fly when requested & streaming them to the client on the other end, without having to create temporary files and/or a lot of memory consumption on the server. I got it working in a prototype and even found some articles from others who got it working - it is not that hard, really.


It was apparently hard enough that you had to build a prototype and consult prior write-ups.

When you really need something done, it's sometimes very helpful to be able to quickly get it done.


I built exactly this once. Here it is as a microservice you can deploy to Heroku in one click:

https://github.com/scosman/zipstreamer


It's very easy to zip on the fly with C#, Java, or JS. No idea why Python has this blind spot.


The zip file format itself, and the compression algorithms like deflate, all seem set up to encourage low memory usage. The directory, for example, is at the end of the file.

Also, most fields in a zipfile that you might not be able to write in a streaming fashion are predictable in size/position, so you can throw in a placeholder and seek() back to it later.
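For instance, a stored (uncompressed) entry can be written with zeroed CRC/size fields that get patched afterwards. A rough sketch, with a made-up helper name and the header fields hard-coded:

    import struct
    import zlib

    # Sketch: write a stored entry's local file header with placeholder
    # CRC/size fields, stream the data, then seek back and patch them in.
    def write_stored_entry(fp, name, chunks):
        header_pos = fp.tell()
        raw_name = name.encode('utf-8')
        # signature, version needed, flags, method (0 = stored), mod time,
        # mod date, CRC-32, compressed size, uncompressed size (all zeroed),
        # name length, extra length
        fp.write(struct.pack('<4s5H3L2H', b'PK\x03\x04', 20, 0, 0, 0, 0,
                             0, 0, 0, len(raw_name), 0))
        fp.write(raw_name)
        crc, size = 0, 0
        for chunk in chunks:
            crc = zlib.crc32(chunk, crc)
            size += len(chunk)
            fp.write(chunk)
        end = fp.tell()
        fp.seek(header_pos + 14)   # CRC field starts 14 bytes into the header
        fp.write(struct.pack('<3L', crc, size, size))
        fp.seek(end)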

Kind of a shame someone has to specifically implement a low memory usage library. That implies other implementations went the lazy route.


Certain fields in the ZIP local file header require random access at write time. Although deflate is self-delimiting, for non-compressed items you must either know the size of the item upfront, seek back after learning the size, or force all readers to fully buffer the ZIP before decompressing any entry, in order to access the size stored in the central directory at the end of the ZIP.

The CRC field of the local file header is similar, but its optionality is less painful than the length field.

In other words, there isn't a single 'good' ZIP implementation; it all depends on what you're aiming for. AFAIK the built-in Python zipfile module always fully populates the local file header.


> Certain fields in the ZIP local file header require random access at write time.

Isn't that what bit 3 in the general purpose bit flags is for? When that's set, the CRC and compressed/uncompressed file lengths are written in a Data Descriptor block after the member file data.
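With bit 3 set, the writer just appends the descriptor after the data. In the non-ZIP64 form it's something like this (a sketch):

    import struct

    # Sketch: data descriptor emitted after an entry's data when general
    # purpose bit 3 is set (non-ZIP64 form; the PK\x07\x08 signature is
    # technically optional, but most writers include it).
    def write_data_descriptor(fp, crc, compressed_size, uncompressed_size):
        fp.write(struct.pack('<4s3L', b'PK\x07\x08', crc,
                             compressed_size, uncompressed_size))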


It's been a few years, but I seem to remember there simply was no way to detect the end of non-compressed content lacking a length header without first reading the TOC, except substring search. This would quickly get very messy when the non-compressed content might contain another embedded ZIP, etc.


That's during extraction, though. There's no way to reliably decompress a ZIP without reading the Central Directory first. While scanning for the Local File Headers works for most files, it's not guaranteed to be correct since it may not match what's in the Central Directory. Also, as you point out, it doesn't work when the file length is not in the LFH.

Streaming zip creation is well supported by the format though.


"Certain fields in the ZIP local file header require random access at write time...The CRC field..."

I covered that in the sentence that described these as things you could fill with a placeholder and seek() back to later.


> Kind of a shame someone has to specifically implement a low memory usage library. That implies other implementations went the lazy route.

Streaming and seeking are antithetical; an implementation gets to pick one, and I imagine for most users streaming would be the edge case. It's debatable whether a library that ensures a fully spec-conforming output is generated, at the expense of memory or IO, is better or worse.

(FWIW this is written from the perspective of someone who for perverse reasons once wrote a streaming ZIP reader. In order to build such a thing, the ZIP writer had to be buffering/seeking. You can't have both without giving up the ability to store uncompressed assets without needless overhead (e.g. JPEG-in-deflate).)


"Streaming and seeking are antithetical"

I disagree, for a library creating zip files. It's the logical way to do it. Otherwise, you're slurping potentially big files into memory, or making unneeded copies. I've also written zip library code.


Well, we're not discussing creating zip files where seeking is an option; we're discussing streaming in the context of the attached link, which uses it concretely to mean writing to a network.


That explains our disagreement. I read "large zip archives" and assumed there was at least an intermediate file created.


If you are streaming, you should be using deflate directly instead.
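e.g. with the stdlib, a minimal sketch (negative wbits gives a raw RFC 1951 stream with no zlib/gzip framing):

    import zlib

    # Sketch: stream raw deflate with no container format at all.
    def deflate_stream(chunks):
        co = zlib.compressobj(wbits=-zlib.MAX_WBITS)  # raw deflate
        for chunk in chunks:
            out = co.compress(chunk)
            if out:
                yield out
        yield co.flush()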


Well, they were created in the 8086 or 80286 days, when you had 640K of memory (or less) and possibly floppies.


I wrote a servlet that streamed gigabytes of zipped data to HTTP clients in 2005. Something like this was baked into the JDK a while ago. Why is it an achievement for Python, though?


I think you can get the size of the zip before creating it. https://github.com/BuzonIO/zipfly/blob/master/zipfly/zipfly....
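For ZIP_STORED entries without ZIP64, the final size is fully determined by the names and file sizes. A back-of-envelope sketch (not ZipFly's actual code; the constants are the fixed record sizes from the ZIP spec):

    # entries: iterable of (arcname, file_size) pairs
    def predicted_zip_size(entries):
        LOCAL_HEADER = 30      # fixed part of each local file header
        DATA_DESCRIPTOR = 16   # signature + CRC + sizes, when streaming (bit 3)
        CENTRAL_ENTRY = 46     # fixed part of each central directory record
        EOCD = 22              # end-of-central-directory record
        total = EOCD
        for arcname, size in entries:
            name_len = len(arcname.encode('utf-8'))
            total += LOCAL_HEADER + name_len + size + DATA_DESCRIPTOR
            total += CENTRAL_ENTRY + name_len
        return total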


Smart! But always using ZIP_STORED?


I wish Windows supported more than just zip natively. I would love something like tar. So much easier to generate on the fly than compressed formats.


If all you want is to generate zip files and not read them, you can use uncompressed blocks, barely more complicated than tar and a lot faster and less memory hungry than many compression libraries: https://tools.ietf.org/html/rfc1951#page-11
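In Python terms the stored-block framing looks roughly like this (a sketch; the bookmarklet itself is JavaScript, and the function name is made up):

    import struct

    # Sketch: wrap raw bytes in RFC 1951 stored (BTYPE=00) blocks, producing a
    # valid deflate stream with zero compression work. Verify with
    # zlib.decompress(result, -15).
    def stored_deflate_blocks(data, block_size=0xFFFF):
        if not data:
            return b'\x01\x00\x00\xff\xff'  # one empty final block
        out = bytearray()
        for i in range(0, len(data), block_size):
            chunk = data[i:i + block_size]
            final = 1 if i + block_size >= len(data) else 0
            out.append(final)  # BFINAL in bit 0, BTYPE=00 in bits 1-2
            out += struct.pack('<HH', len(chunk), len(chunk) ^ 0xFFFF)  # LEN, NLEN
            out += chunk
        return bytes(out)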

I've been using this for a JavaScript bookmarklet (< 2000 characters) which automatically downloads all images from a web page on click.


And if one wants to list the content of the archives, zip is better and orders of magnitude faster, as it also has a "central directory" so one doesn't have to read the whole archive to get the list.

I've used that "central directory" approach even over the internet with great success (downloading just that segment instead of the whole archive to get a list of what is inside).
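In Python, the trick looks roughly like this (a sketch, not the exact code I used; it assumes no ZIP64, an archive comment shorter than the fetched tail, and a server that honors Range):

    import struct
    import urllib.request

    # Sketch: list a remote archive by fetching only its tail with an HTTP
    # Range request, then walking the central directory records found in it.
    # Assumes the central directory sits right before the EOCD and fits in
    # the fetched tail.
    def list_remote_zip(url, tail_bytes=65536):
        req = urllib.request.Request(url, headers={'Range': f'bytes=-{tail_bytes}'})
        tail = urllib.request.urlopen(req).read()
        eocd = tail.rfind(b'PK\x05\x06')              # end-of-central-directory
        cd_size = struct.unpack_from('<I', tail, eocd + 12)[0]
        names, pos = [], eocd - cd_size
        while pos < eocd:
            name_len, extra_len, comment_len = struct.unpack_from('<3H', tail, pos + 28)
            names.append(tail[pos + 46:pos + 46 + name_len].decode('utf-8', 'replace'))
            pos += 46 + name_len + extra_len + comment_len  # fixed part is 46 bytes
        return names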

But how did you make a bookmarklet to produce a zip archive, even non-compressed?


I assume that the author concatenated the image blobs into one buffer, then manually generated a file table. I believe there is a way to offer a “Download” button for buffer objects in JS.


> But how did you make a bookmarklet to produce a zip archive, even non-compressed?

Not quite sure I understand the question. Bookmarklets are bookmarks which contain JavaScript code, for example:

javascript:alert('Hello, World!');

You can put that in a link

<a href="javascript:alert('Hello, World!');">Bookmarklet</a>

and then you can right-click the link to bookmark it. Now, when you want to run that JavaScript code on some website, visit the website and click on the bookmark.


I know the basics of bookmarklets. What I wanted to know is which API calls you used inside the < 2000 character bookmarklet to achieve the functionality you described. I also believed that bookmarklets are limited to the security context of the web page, and I don't understand how you did all that. If I understood you correctly, you are generating a zip out of the images from the web page using only the bookmarklet?

Do you, or are you doing something much simpler?


> I know the basics of bookmarklets. What I wanted to know is which API calls you used inside the < 2000 character bookmarklet to achieve the functionality you described.

To create the archive data, I used a Uint8Array that I wrote the bytes into.

To download the images, I used XMLHttpRequest.

> I also believed that bookmarklets are limited to the security context of the web page

That seems to be true, unfortunately. However, besides the images on the same domain, it should also be possible to download some more images from other domains by allowing cross-origin requests if the remote server cooperates and sets the respective header, but I have not looked into that yet.

The bookmarklet is here if you have more questions: https://github.com/983/FileDownloader

It is a bit larger than 2500 characters now, since previously the code was optimized for size. Now it is slightly more readable.


That's fantastic, and with HTTP range requests you could get individual files from inside the zip, effectively creating a just-in-time zip browser.

Being able to download individual parts seems good: you could have separate sha256 info for files, for instance.


This is such a great idea! You don't need compression because images on the web are generally compressed already. Can you share any of your code that does this?



Can you post your bookmarklet? I tried making a similar thing (using TamperMonkey) and ran into browser security restrictions preventing downloading of the files.


I tried TamperMonkey for a different project, but it kept deleting my scripts, so I haven't pursued it any further.

Bookmarklets are executed in the context of the currently opened website, so downloading images from the same origin is usually not a problem. What can be an issue are remote images. I tried setting `Access-Control-Allow-Origin: *`, but then other downloads would fail, so I left it as it was since it worked for the websites I was interested in.

Here is the code: https://github.com/983/FileDownloader


Is this an issue in new versions of vanilla Python? I mostly bang on crusty old Java stuff, and zip streaming has been built into the JDK for decades.


Zip support in the CPython stdlib has been there forever (I think it might even be from 2.0 times, early '00s), but it's never been very user-friendly or particularly advanced. It's one of those things like timezones and HTTP, where the implemented API didn't really come out "right" and there has been some pain ever since. It's a bit like SSL stuff in the JDK: something that should instinctively be very easy just... isn't.


The library this uses is an old enough bit of Python that when I started 12 years ago, I looked at it and thought it looked a bit crusty and old.


It’s a valid question.

I’m a long time python user but haven’t ever had to work with zip files and didn’t know the answer.


Yeah!





