There’s also no package caching across venvs(?). Every time I have to install something AI-related, it’s the same 2 GB torch plus a few hundred megabytes of libs plus a CUDA update, downloaded again and again. On top of that, Python projects love to start downloading large resources without asking, and if you ctrl-c them, the setup process is usually not restartable. You’re just left with a broken folder that doesn’t run anymore. Feels like everything about Python’s “install” is an afterthought at best.
Unless you've configured `pip` in a special way, I don't think this is true? All invocations of `pip` across different environments should share the global HTTP cache, which you can confirm with `pip cache dir`.
(What `pip` can't cache is the actual extraction of distributions between venvs, because distribution installation sometimes includes arbitrary code. So you might be hitting that, but you should never need to re-download the same wheels or sdists while the cache is hot.)
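If you want to sanity-check that in your own setup, here's a rough stdlib sketch (just an illustration, not anything pip ships) that lists the largest entries in whatever cache `pip cache dir` reports:

```python
# Illustrative only: ask pip where its cache lives, then list the biggest files
# in it. Assumes `pip` on PATH belongs to the interpreter/venv you care about.
import subprocess
from pathlib import Path

cache_dir = Path(subprocess.run(
    ["pip", "cache", "dir"], capture_output=True, text=True, check=True
).stdout.strip())

entries = sorted((f.stat().st_size, f) for f in cache_dir.rglob("*") if f.is_file())
for size, path in entries[-5:]:  # five largest cached files
    print(f"{size / 1e6:10.1f} MB  {path}")
```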
But I usually leave things as they are, and I’ve never touched pip’s settings. If my intervention is required, that’s different from “works out of the box”. In my AI cases I’ve occasionally seen one of the pips pick up from where it stopped (the rare cases where you can re-run setup at all), but that doesn’t carry across projects.
The fact that I’m on Windows and have four different Python installations may add to this issue, but I’m pretty aware of which `which python` I’m using when. AI projects tend to download their own minicondas, but it’s unclear why they don’t share a cache (if that’s even the issue). Anyway, Windows or not, the out-of-the-box user experience is barely acceptable.
All my installations, including all "condas" known to me, return
c:\users\user\appdata\local\pip\cache
I just tried to unpack and install the same `text-generation-webui-main` that I installed yesterday, and it started downloading a 2.5 GB torch at 12 MB/s, ETA 0:03:07. I made screenshots for those who will claim it's not true. There was also a complete, ready-to-use 2.5 GB file already sitting in pip\cache\http\...\...\...\{hashfilename}, which is very fun to navigate to, btw. I interrupted the download and activated this new environment:
[prompt] cd text-generation-webui-main
[prompt] installer_files\conda\condabin\conda activate installer_files\env
[prompt] pip cache dir
c:\users\user\appdata\local\pip\cache
Honestly, I don't know what's happening there. My only other guess is that the wheel you're downloading is so big that it somehow busts `pip`'s cache, meaning that you end up always re-downloading it. But I'm not sure there's even a cache limit like that.
It sounds like this might be a bug, either in `pip` or in the caching middleware it uses. You should consider filing a bug upstream with your observed behavior.
Edit: 2.5GB is coincidentally just over the maximum i32 value, so it's possible this is even something silly like a `seek(3)` or other file offset overflowing and resulting in a bad cache.
Edit 2: `pip` is using `CacheControl`, which I maintain. I suspect what's happening is that `CacheControl`'s underlying use of `msgpack` is failing above stores of `2^32 - 1` bytes in individual binary objects, since that's the `msgpack` limit.
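For reference, the arithmetic behind those two guesses (nothing more than back-of-the-envelope numbers):

```python
# Back-of-the-envelope check of the overflow theory above.
i32_max = 2**31 - 1          # 2_147_483_647
u32_max = 2**32 - 1          # 4_294_967_295 (msgpack's per-object binary limit)
torch_wheel = 2_500_000_000  # the ~2.5 GB download from the report above

print(torch_wheel > i32_max)  # True  -- too big for a signed 32-bit length/offset
print(torch_wheel > u32_max)  # False -- still fits in an unsigned 32-bit field
```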
It would be nice if all packages (and each version) were cached somewhere global and then just referred to by a local environment, to deduplicate. I think Maven does this.
The downloaded packages themselves are shared in pip's cache. Once installed though they cannot be shared, for a number of reasons:
- Compiled Python bytecode is (or can be) generated and stored alongside the installed source files; sharing an installed package between envs with different Python versions would lead to weird bugs
- Installed packages are free to modify their own files (especially when they embed data files)
- Virtualenvs are essentially non-relocatable, and installed package manifests use absolute paths
Really that shouldn't be a problem though. Downloaded packages are cached by pip, as well as locally built wheels of source packages, meaning once you've installed a bunch of packages already, installing new ones in new environments is a mere file extraction.
The real issue IMHO causing install slowness is that package maintainers haven't developed the habit of being mindful of the size of their packages. If you install multiple data science packages, for instance, you can quickly get to multi-GB virtualenvs, for the sole reason that maintainers didn't bother to remove (or separate as "extras") their test datasets, language translation files, static assets, etc.
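If you want to see where the bulk actually comes from in your own environments, here's a rough stdlib-only sketch (the site-packages path is an assumption; adjust it to your venv):

```python
# Rank installed top-level packages in a venv by total size on disk.
from collections import Counter
from pathlib import Path

site_packages = Path(".venv/lib/python3.12/site-packages")  # hypothetical path

sizes = Counter()
for f in site_packages.rglob("*"):
    if f.is_file():
        top = f.relative_to(site_packages).parts[0]  # attribute file to its package dir
        sizes[top] += f.stat().st_size

for pkg, size in sizes.most_common(10):
    print(f"{size / 1e6:10.1f} MB  {pkg}")
```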
It could be worse... you could be like the NuGet team and fail to identify "prevent cache from growing without bound" as a necessary feature. ;)
Last I checked, it was going on 7+ years of adding/updating last-used metadata on cached packages (because MS disabled last-access timestamps by default at the Windows filesystem level in Vista+ for performance reasons), to support automated pruning. https://github.com/NuGet/Home/issues/4980
(Admittedly, most of that spent waiting for the glacially slow default versions of things to ship through the dotnet ecosystem)
They can be, pip developers just have to care about this. Nothing you described precludes file-based deduplication, which is what pnpm does for JS projects: it stores all library files in a global content-addressable directory, and creates {sym,hard,ref}links depending on what your filesystem supports.
Being able to mutate files requires reflinks, but they're supported by e.g. XFS, you don't have to go to COW filesystems that have their own disadvantages.
You can do something like that manually for virtualenvs too, by running rmlint on all of them in one go. It will pick an "original" for each duplicate file, and will replace duplicates with reflinks to it (or other types of links if needed). The downside is obvious: it has to be repeated as files change, but I've saved a lot of space this way.
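For what it's worth, the core of that approach is small enough to sketch. Below is a toy illustration of content-based dedup with hard links (not what pip, pnpm, or rmlint actually ship, and hard links bring back the mutation caveat mentioned above):

```python
# Toy sketch: replace byte-identical files across venvs with hard links to a
# single "original". Real tools use a content-addressable store and reflinks.
import hashlib
import os
from pathlib import Path

seen: dict[str, Path] = {}  # content hash -> first copy encountered

def dedup(roots):
    for root in roots:
        for f in Path(root).rglob("*"):
            if not f.is_file() or f.is_symlink():
                continue
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            original = seen.setdefault(digest, f)
            if original is not f and not f.samefile(original):
                f.unlink()
                os.link(original, f)  # duplicate becomes a hard link to the original

dedup(["venv-a", "venv-b"])  # hypothetical venv directories
```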
Or just use a filesystem that supports deduplication natively like btrfs/zfs.
This is unreasonably dismissive: the `pip` authors care immensely about maintaining a Python package installer that successfully installs billions of distributions across disparate OSes and architectures each day. Adopting deduplication techniques that only work on some platforms, some of the time means a more complicated codebase and harder-to-reproduce user-surfaced bugs.
It can be worth it, but it's not a matter of "care": it's a matter of bandwidth and relative priorities.
"Having other priorities" uses different words to say exactly the same thing. I'm guessing you did not look at pnpm. It works on all major operating systems; deduplication works everywhere too, which shows that it can be solved if needed. As far as I know, it has been developed by one guy in Ukraine.
Are there package name and version disclosure considerations when sharing packages between envs with hardlinks, and does that matter for this application?
Absolutely. Yes, there are many package managers, but they can always cache downloads in home/.cache/pip or appdata/local/python/pip-cache-common. Packages don’t change, so there’s no need for separate caches; just dump them all in one folder and add basic locking on it.
I wonder how much traffic pressure pip’s remotes could shed just by getting caching right.
Having a way to install all packages and all versions of Python somewhere central, and referring to a combination of them from a virtual environment, would be fantastic. It's nice that an ecosystem of tools has grown to fill this gap, but it does make good Python practice less approachable, since there are so many options.
You're lucky that you're not installing on an nVidia Jetson system that is pinned to Python 3.8 which forces you to use old versions for almost everything else, unless you want to give up on CUDA.
Not that lucky, since I’m using 3.12 (IIRC) for my opencv-related projects, 3.10 for… can’t remember what, probably kohya_ss, then 3.9 for a few tools like rembg, plus a few left over from a Node.js setup via choco, not sure if it still needs them. Not counting minicondas here, because it’s hard to even guess.
Poetry-style solutions are still growing, I think, but there's more than Poetry competing in this space. uv, too, will soon join the group competing in this area.
Poetry is challenged by not being as compatible with the Python PEPs (it predated some of the "standards"), as well as by performance (resolve, lock, and install).
Poetry isn't really a package manager so much as packaging management (as in packaging and publishing your specific package) and dependency management.
uv is essentially an improved version of pip, virtualenv and pip-tools, but from the outside it's pretty much the same thing. The CLI imitates most of pip's options too.
It's not like conda, which is a completely different ecosystem, with dodgy integration with pip.
UV isn't "improved" yet, at least not until it reaches some semblance of feature parity. However it's a great experiment and I expect that it will end up being a net benefit to the Python ecosystem.
It aims at replacing them all, including venv, pipx, and more.
> Think: a single binary that bootstraps your Python installation and gives you everything you need to be productive with Python, bundling not only pip, pip-tools, and virtualenv, but also pipx, tox, poetry, pyenv, ruff, and more.
I meant to say what uv is _right now_, which AFAIK is a replacement for virtualenv, some of pip, and pip-tools. I understand they have a vision for the future, but I'll believe it when I see it. As it stands, it's already a very useful tool.
Conda is a lot more than a Python package manager specifically, it's more like an entire userspace (akin to Homebrew or Pkgsrc), that happens to be focused on Python because that's where it originated.
This is (in part) a knock-on consequence of Python packaging's dynamic metadata: installing a package sometimes means running arbitrary Python code to determine the package's dependencies, meaning that the resolver not only has to backtrack but also has to run arbitrary code in a fixed state (i.e., with references to other dependencies that should not change in parallel streams of execution).
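For anyone who hasn't run into it: a hypothetical `setup.py` like the one below is what "dynamic metadata" means in practice. The dependency list literally doesn't exist until this code runs on the installing machine, so a resolver can't know it in advance:

```python
# Hypothetical setup.py with dynamic dependencies -- they are computed at
# install time, so the resolver must execute this code to learn them.
import sys
from setuptools import setup

deps = ["numpy"]
if sys.platform == "win32":
    deps.append("pywin32")        # only needed on Windows
if sys.version_info < (3, 11):
    deps.append("tomli")          # stdlib tomllib exists from 3.11 onward

setup(name="example-pkg", version="0.1.0", install_requires=deps)
```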
I shouldn’t have this issue then. But a couple of days ago I tried new versions of kohya, sdwebui and tgwebui, and none of their convoluted setup batch files shared downloads. Btw, sdwebui broke itself irreparably within half an hour by failing to install an extension that works in a sibling folder. Kohya broke right away without my intervention. Only tgwebui survived. I’m not doing anything wrong, since there’s not much to do apart from waiting for hours.
I’m not updating in-folder because of deep scars from the last time I did that.
Maybe these issues are not inherent to Python, but for some reason AI projects tend to go out of their way to build their own easy installers. That must mean something, at least.
Note to readers who enjoy complaining about Python: the title sounds worse than the actual content is.
It's analogous to how IDNs and IP addresses have no single canonical form, but 2 or more equivalent forms that can be parsed. To properly compare two IP addresses, you must parse them and compare the results. The same thing is happening here.
It's an interesting problem and maybe an oversight in the design of the wheel name format, but it's not really an instance of "Python bad" the way you might assume based on the title.
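To make the analogy concrete, here's a small sketch with the `packaging` library (the filenames are made up): two wheel filenames that differ as strings can describe the same thing once parsed, so any comparison has to happen on the parsed form.

```python
# Compare wheel filenames by parsing them, not by comparing strings.
from packaging.utils import parse_wheel_filename

a = parse_wheel_filename("Example_Pkg-1.0-py3-none-any.whl")
b = parse_wheel_filename("example_pkg-1.0-py3-none-any.whl")

print(a == b)  # True: the project name is canonicalized during parsing
```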
It's not helped by all the boundaries you have to cross to deploy a Python lib or app. PyPA won't necessarily agree with PyPI and/or Docker image versioning. It's amazing how fast naming hell travels.
> To allow for compact filenames of bdists that work with more than one compatibility tag triple, each tag in a filename can instead be a ‘.’-separated, sorted, set of tags.
Is his second point about compressed tag sets right? I would interpret "sorted, set" to disallow unsorted and duplicate tags.
1. "Sorted, set" is a contradiction in terms: `packaging` and other tooling have interpreted it to mean just "set," which implies unsorted.
2. The quoted language there isn't from the wheel filename definition itself (and isn't linked to it), so it's unclear whether it's intended to be normative. Similarly, "can" is weaker than "should."
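For the curious, the compressed-tag behavior is easy to poke at with `packaging.tags` (example tags chosen by me, not taken from the article):

```python
# A '.'-separated compressed tag set expands into a *set* of concrete tags,
# so ordering of the compressed components carries no meaning.
from packaging.tags import parse_tag

a = parse_tag("cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64")
b = parse_tag("cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64")

print(len(a))  # 2 concrete tags from one compressed triple
print(a == b)  # True: same set, different spelling order
```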
In addition to the standardization headaches, the wheel issue discussed here is also hindered by using file names as the transport for an at-least-somewhat-complex metadata structure. Sure, you can do that, but it's a pain in the ass even if you do a good job. A file name is just one string, after all.
This is a somewhat hot take, but I wish there were more widespread use of package repository tools that don't use the dpkg model of "the registry's metadata is baked into a simple file hierarchy". The simplicity and accessibility (I can just browse packages with HTTP/FTP) of that approach is appealing, but it breaks down a lot in cases like this.
I don't think it's too opaque or overcomplicated to wish for repository servers that have a pair of access APIs: one to ask the server about metadata/packages matching a set of constraints, and another to download blobs/tarballs/whatever. Docker takes that approach, and it seems to work well. I just wish that instances of this pattern weren't per-language/platform and we had something a little more standardized.
Also, when installing a wheel you can get a brief error message "Not a supported wheel on this platform" without an explanation of why this wheel is not compatible with your system, which can be infuriating.
There's an open issue about it which you can express support for, and some marginally-hacky code that you can paste into a "debug_wheel.py"-ish script to get unblocked in the short term, here: https://github.com/pypa/pip/issues/10793
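If you just want a quick look at the mismatch yourself, something along these lines (not the exact script from that issue, and the wheel filename is a placeholder) compares the wheel's tags against what your interpreter supports:

```python
# Show the wheel's tags and whether any of them match the current interpreter.
from packaging.tags import sys_tags
from packaging.utils import parse_wheel_filename

wheel = "torch-2.2.0-cp311-cp311-manylinux_2_17_x86_64.whl"  # placeholder filename
_, _, _, wheel_tags = parse_wheel_filename(wheel)

supported = set(sys_tags())
print("wheel tags:", ", ".join(map(str, wheel_tags)))
print("compatible with this interpreter:", bool(wheel_tags & supported))
```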
Why is Python packaging such a mess? That's a serious question. Has someone written up some recap or history piece on how it came to be this way? Any pointers?
The Python packaging ecosystem is 25+ years old, and was largely informally specified until ~12 years ago. Most of the complexity in Python packaging can be traced back to (1) a large degree of pre-existing permissive behavior that needed to be encoded in early standards to get standardization off the ground, and (2) the fact that Python packaging is still largely federated, with multiple independent groups performing tool development (and interpreting the standards).
(2) has historically also been a strength of Python packaging, and has been a big source of modernization efforts over the years that (IMO) have made packaging much better: things like pyproject.toml, build backends, etc.
(I think there have been a few recap talks at PyCon over the years on packaging's history. There are definitely people way more qualified than me to provide a more detailed summary :-)).
One example: Semantic Versioning versus odd/even versioning, where 3.1 is considered unstable, 3.2 stable, 3.3 unstable, etc. This was standardized only in 2010, so we can't blame Python developers for not foreseeing it in 2000.
The odd vs even versions rule you describe, while followed by a few projects, is not part of Semantic Versioning (aka SemVer) - there's no mention of it whatsoever in the spec at https://semver.org/.
Indeed, it is basically incompatible with SemVer, given that SemVer has its own way of marking unstable versions (or, equivalently, pre-release versions; SemVer conflates the two concepts). Unstable/pre-release versions under SemVer are indicated by a hyphenated suffix after the patch version, e.g. 1.2.3-alpha.
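Side note: Python's own scheme (PEP 440) takes the same suffix-based approach rather than odd/even minors, which you can see with `packaging`:

```python
# PEP 440 marks instability with pre-release suffixes, not odd/even minors.
from packaging.version import Version

print(Version("3.2.0a1") < Version("3.2.0rc1") < Version("3.2.0"))  # True
```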
Right -- one of Python (packaging's) small ironies is that people sometimes forget how old the Python ecosystem is, in part because the ecosystem and language have managed to keep up with the times :-)
(If we use similarly-aged interpreted languages as a benchmark, I would say that Python packaging has modernized admirably in some aspects: compare CPAN, for example.)
- Minimal standardization of ways to cache pre-built artifacts for modules with binary/XS components (no widely adopted wheel equivalent).
- Extremely poor support in cpan-the-tool for multiple local::lib environments.
- Abject lack of standardization on how hermetic/isolated (or not) builds involving linking to native libraries/compiling code should be. Talking super simple stuff here: what path-related envs to inherit, when module build scripts should hardcode vs. use system-path-discoverable versions of common C toolchain components, stuff like that.
- Standardization on build tools ("backends" in Python parlance) is poor. Module::Build is "the future", but lots of stuff seems not to use it. ExtUtils::MakeMaker is ... not good, and alternatives seem to be all over the place in terms of supportedness (I've heard Dist::Zilla is good, never tried it; spent plenty of time with SWIG hassles though). This is a low bar, and Perl trips over it. To be fair, so does Python.
- No uninstallation support (because there's no standardized manifest format). Sure, cpanminus exists (mirroring the annoyance that is e.g. conda competing with pip in Python), and even then uninstallation is dicey.
- Uninstallation troubles hint at the larger issue: CPAN packages are, if you zoom out far enough, shell/perl scripts that perform arbitrary system modifications. As jsiracusa used to say when we worked together, "they're file sprayers, not packages". If you want anything more predictable (or less dangerous) than that, you're out of luck unless you follow a very narrow path which many packages and deployment environments may not be compatible with.
- TMTOWTDI doesn't help here; even simple pure-perl modules have piss poor consistency re: how they install, so answering questions like "what will installation of Some::Module make available" is harder than it needs to be because maintainers are by turns over-clever and using dated methodologies for packaging their code.
- For reproducibility, cpan-the-tool's stance on lockfiles and hashes seems to be "what the fuck is a lockfile?". Given poor CPAN-wide compliance with Semver or equivalent conventions, that makes builds that need to happen in different environments (e.g. local, docker, CI, production) needlessly complicated.
- MacOS support for many common libraries with XS/native components is very poor even by Python's (nightmarishly bad until c. 2019, when tools matured even though the advent of ARM macs made it feel like they didn't) standard.
I sincerely hope that stuff has improved in the Perl community since I last tangled with CPAN issues on the regular (around 5.18/5.20). But given the increasing number of lonely tumbleweeds drifting around CPAN and the Perl community's tendencies in general, I won't hold my breath.
Because the core team never took it seriously. You can let packaging solutions grow organically if your core team doesn't want to do it, but then you need to find those who start a project, help guide them, point people toward them, help it grow and improve. You can't just throw your hands up and hope somebody else solves it.
Perl has the best packaging community I've seen. Ironically, I don't think they even have binary packages, but they do have a system that tests both core Perl and most of the CPAN archive on 20+ platforms. The packaging fits the community, it's robust, and it sets conventions that the rest of the community follows, which improves the entire ecosystem. It has definitely evolved over time. But I don't know of anyone who has ever said "Man, Perl packaging is a huge pain in the butt, I wish someone could solve this." If there's a problem, usually there's just one solution that everyone adopts, builds on, and improves (which is ironic considering Perl's motto).
Whereas with Python, if there's a problem, someone tries to solve it, but it's not perfect, so somebody makes a second solution, then a third, fourth, fifth, etc. Python never established the convention of reusing and improving existing things. Python has a forking/independent culture. So there will always be 20 different ways to do something and none of them all that great (which is ironic considering Python's principles)
Hey this is a nice write up and I'm glad to be able to follow along, I've been writing Python since around 2012. Poetry is a step in the right direction, but I would love to see it baked into the language itself somehow.
I don't think we'll ever see proper packaging integrated into the language. Our best hope is for the ecosystem to coalesce around a particular tool, or at least have standards that all the major players follow. The latter is kind of happening slowly over time.
Poetry is adopted by less than 0.5% of Python repos on GitHub. Because you can’t install PyTorch with it in a way that works for more than 1 platform, and sometimes in general, its adoption among new and sophisticated projects is trending towards zero.
I have no beef with Poetry, but having suffered huge brain damage dealing with Python packaging as a USP for end users, I think the maintainers have decided not to participate in the most exciting things in the ecosystem, to the detriment of an otherwise excellent and thoughtfully designed project. Ultimately, anyone offering a packaging alternative must deliver a "universal install wizard" solution that includes Windows, which few people are interested in working on.
Python was first released in the early 90s. If you compare Python packaging to other languages of its era, it's not so crazy. For comparison, C++ was released in '85.
The early days of setuptools/easy_install (around 2004) were interesting. The way I remember it, the main developer had a clear "don't ask for permission, just build what you think is best and put it out there" approach.
That avoids the problems you can get where everything gets discussed to death and nothing actually gets written, but the downside was that the project spent a long time in an uneasy situation where it didn't have complete official buy-in but was popular enough that it would have been hard to push an alternative.
There were a couple of questionable early design decisions that didn't help.
This is an incorrect summary: if you read beyond that, name canonicalization is not a significant hurdle. The larger hurdle is the fact that wheel filenames support compressed tags, which themselves lack canonicalization.