This is a really well-written partial set of release notes. I was curious and looked at the full release notes [1], and I think those are pretty well written as well. I'm very impressed, especially given that Git has such a large set of contributors.
It deals with wrapping an integer index around after incrementing it: the old code just used `++index` and a bitmask, while the new code uses `+ 1` and modulo.
I have trouble understanding this right now; in my world, `++index` for an int really shouldn't trigger overflow when counting to at most 4 in any (semi-)realistic environment?
The counter never resets; it just keeps going up and we only look at the low bits. So eventually it will need to wrap. It's doubtful that ever happened in practice, even on a 32-bit system (you'd need to print 2 billion SHA-1s in a single process, and even the largest repos have on the order of millions of objects).
So the key difference between the old and the new is that the counter resets to zero every fourth call.
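To make this concrete, here's a rough sketch of the two approaches (the names and buffer count are illustrative, not copied from git's actual `sha1_to_hex`):

```c
static char hexbuffer[4][41];
static int bufno;

char *next_buffer_old(void)
{
	/* Old: the counter grows forever and the bitmask keeps the low
	 * two bits. Once bufno reaches INT_MAX, ++bufno is signed
	 * overflow, which is undefined behavior in C (however unlikely
	 * that is in practice). */
	return hexbuffer[3 & ++bufno];
}

char *next_buffer_new(void)
{
	/* New: the counter itself resets to zero every fourth call,
	 * so it can never overflow. */
	bufno = (bufno + 1) % 4;
	return hexbuffer[bufno];
}
```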
They were off by a factor of 10 on the likelihood of being struck and killed by lightning, according to the NWS website.
To clarify: the likelihood of being merely struck by lightning is ~1/1,000,000 per year. The likelihood of being struck and killed is 1/10,000,000, or about 1/2^23.25.
Given this, you would only have to be struck and killed by lightning 6.8 years in a row to equal the probability of a random SHA-1 hash collision.
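For the curious, the arithmetic behind the 6.8 figure, treating each year as an independent 2^-23.25 event:

```latex
\left(2^{-23.25}\right)^n = 2^{-160}
\quad\Longrightarrow\quad
n = \frac{160}{23.25} \approx 6.88
```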
More importantly, the comparison is useless. The odds of running into issues with SHA-1 collisions in Git are a very different question from just the odds of two random SHA-1 hashes colliding.
The situational comparison originally used by Linus for a SHA-1 collision was all the members of your development team being killed and eaten by wolves. I'm not sure if he gave a time frame in which that would need to occur.
That is a nice writeup. One of the interesting things for me was to see which topics you decided to cover and which to omit. For instance, I noted `clone --reference --recurse-submodules` as a potential topic of interest, but I am afraid to point anybody to the `--reference` option due to its hidden dangers.
I'm also curious how you came up with 19,290 for a birthday paradox on a 7-hex hash. I think it's 16,384, but probability can sometimes be tricky. :)
I came up with 19,290 using the generalized birthday formula[0] (actually after double-checking it's slightly closer to 19,291).
16,384 is the value you get using the square approximation method[1], which I believe is a bit less accurate in terms of probability, but faster to calculate. I think Git's using the square approximation under the hood -- which is probably a good thing, since I think it'll always yield a more conservative result.
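For reference, with a 7-hex-digit (28-bit) space of N = 2^28 values and a target collision probability p = 0.5, the two estimates work out to:

```latex
n \approx \sqrt{2N \ln\tfrac{1}{1-p}} = \sqrt{2^{29} \ln 2} \approx 19{,}291
\qquad \text{vs.} \qquad
n \approx \sqrt{N} = 2^{14} = 16{,}384
```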
I'm curious, though not motivated enough to really search for it: for ambiguous hash abbreviations, why not select the oldest, since presumably it was unique at the time it was created?
edit: I guess that information must not exist, or I assume they'd be doing it.
As others have noted, there's not always an unambiguous date for some object types (the best you can do for blobs is to find the first commit in which they appeared, and use its date).
However, there's a more complicated issue with timestamps, which is that you care about what was in the repository of the person who generated the SHA-1, at the time of generation. So you could merge in history that includes older commits, and invalidate your SHA-1s with "older" objects.
So the timestamp of interest is not the one in the objects themselves, but when they entered some particular repository (and not even some well-known repository; the local clone of whoever happened to generate the sha1). That being said, those two things correlate a lot in practice, and auto-picking the oldest commit might be a useful heuristic.
It would be a fun project to implement as an option for `core.disambiguate`.
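The core of the heuristic would be tiny. A toy sketch, with a made-up struct rather than git's actual object API:

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical candidate record for one object matching the
 * ambiguous prefix; these names are illustrative, not git's. */
struct candidate {
	const char *sha1_hex;
	time_t commit_date;  /* committer date, if the object is a commit */
};

/* Prefer the candidate with the oldest commit date. */
const struct candidate *pick_oldest(const struct candidate *c, size_t n)
{
	const struct candidate *best = NULL;
	size_t i;

	for (i = 0; i < n; i++)
		if (!best || c[i].commit_date < best->commit_date)
			best = &c[i];
	return best;
}
```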
That heuristic would work most of the time. But it rests on the assumption that "if commit A has an older timestamp than commit B, then any user who saw commit B must have also seen commit A", which is not reliable in a distributed version control system.
It seems safer to just explicitly tell the user when they're trying to work with an ambiguous hash.
Yeah, I guess dates are only available for commits but not blobs or trees.
This is suggested by the disambiguation listing; if there were dates, I'd hope they would be displayed.
I think approximate dates might be inferred, but since that could be misleading and costlier to determine, it makes sense to leave them out - at least in this version of Git.
The issue is a different one; I believe you're considering one specific situation while there are others to ponder. What would happen if someone copied & pasted part of the hash, or had some tool that always reduced the output to the first few digits, or other situations like these? How would you be able to tell that the user was actually after the oldest commit?
It seems much easier to indicate there's a problem, a conflict, and let the user solve it.
(I'm the author.) That is an excellent suggestion. Now that you've mentioned it, it seems like a glaring omission. I'll try to fix that up once I get home.
In case the other parts of the README weren't clear, the concept was to use any Unicode character. I was even thinking of (eventually) getting it to encode data with combining accents. Note that it was intended to optimize the string for screen display space (pixels), not byte count.
I'm not sure it'd be a good fit for git hashes, simply because sometimes you need to type or speak a git hash, and the output from baseunicode was definitely not intended to be pronounceable. (Especially since I was thinking of using CJK characters, but trying to weight them down for their wider screen area; imagine trying to describe that to a co-worker who might only speak English.)
I wrote it mostly for fun, after I had a couple of files that were difficult to transfer between machines in the cloud. I find myself ssh'd into weird places, and scp'ing is sometimes trying. (I do machine-to-machine, so I almost always need -3, and I don't know why that isn't the default; scp doesn't deal well with the file being only accessible by root, not your user; scp has the weirdest arg syntax if you have ill-advised characters in your filenames, like spaces…) So I was cat'ing files, copying them from one window, and pasting into another window: base64 for binary data, tar/gzip for making it smaller. But for the copy/paste, scrolling is a pain, and heaven forbid if you're in screen/tmux.
(Also, if you really find yourself with a file that you can't scp, you can "re-implement" scp with `ssh $hosta sudo tar -cz <stuff> | ssh $hostb sudo tar -xz`; see also the -C flag, and don't forget you can also `ssh $host "sudo bash -c 'cd /where && tar -cz <stuff>'"`.)
A nice side effect of hex is that you can pick it out of a text commit message with higher accuracy (e.g., to turn it into a hyperlink). Tools like `gitk` and sites like GitHub do this using a regex.
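As a sketch, such a matcher might look like this with POSIX regexes (the exact patterns `gitk` and GitHub use aren't reproduced here; 7 is git's default minimum abbreviation length and 40 a full SHA-1):

```c
#include <regex.h>

/* Does a token look like an abbreviated or full SHA-1?
 * POSIX extended regex: 7 to 40 lowercase hex digits. */
int looks_like_sha1(const char *token)
{
	regex_t re;
	int match;

	if (regcomp(&re, "^[0-9a-f]{7,40}$", REG_EXTENDED | REG_NOSUB))
		return 0;
	match = (regexec(&re, token, 0, NULL, 0) == 0);
	regfree(&re);
	return match;
}
```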
My name is Angela and I do research for Bitbucket. I’m kicking off a round of discussions with people who use Git tools. Ideally, I’d like to talk to people who are on a team of 3 or more. If this is you, I would love to talk to you about your experience using Git tools, or just some of the pain points that are keeping you up at night when doing your jobs.
We’ll just need 30 mins of your time, and as a token of my thanks to those that participate, I’d like to offer a US$50 Amazon gift voucher.
If you’re interested, just shoot me an email with your availability over the next few weeks and we can set up a time to chat for 30 minutes. Please also include your timezone so we can schedule a suitable time (as I’m located in San Francisco). Hope to talk to you soon!
I'm curious if anyone knows whether the optimizations Twitter made to improve fetch performance for large, active repos have made it upstream yet? I don't work there anymore, and neither do any of the people who were originally doing that work, but it was a pretty impressive speedup (I could `git pull` thousands of commits and be done in under a second on a 3 GB repo with no large objects). I know the watchman support made it in, which was the other half of what made large repos perform well, but I haven't seen mention of the log-structured patch-queue stuff that helped the server by eliminating most of the work of calculating what to send on a fetch. Anyone know?
Case-insensitivity is important for some people to be able to reliably remember a string. I won't easily retain the difference between 'b4dQbFs31' and 'b4DqBfs31'.
Same thing when speaking it out loud. 'B four D capital Q B capital F s thirty-one' is way more convoluted and error-prone than 'B four D Q B F S thirty-one'.
The best thing I've found that fits this criterion is Crockford's Base 32 [1]: basically an extension of the hex digits that removes the letters I, L, O, and U.
But Base 32 (and, by proxy, case-insensitivity) constrains us to 5 bits per character, which is only a 20% reduction over the 4 bits per character of base 16. So instead of the 20 bits `1ab2f`, we could express them with something like `1qm3`.
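A minimal sketch of that encoding, assuming Crockford's published alphabet (the digits and letters minus I, L, O, and U; canonical output is uppercase, but decoding is case-insensitive):

```c
#include <stdio.h>

/* Crockford Base 32 alphabet: 0-9 plus letters, excluding I, L, O, U. */
static const char B32[] = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

/* Encode a 20-bit value as 4 Base 32 digits. */
void encode20(unsigned value, char out[5])
{
	int i;
	for (i = 3; i >= 0; i--) {
		out[i] = B32[value & 0x1f];  /* low 5 bits per digit */
		value >>= 5;
	}
	out[4] = '\0';
}

int main(void)
{
	char buf[5];
	encode20(0x1ab2f, buf);  /* the 20 bits from the example above */
	printf("%s\n", buf);     /* prints "3ASF" with this alphabet */
	return 0;
}
```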
Regarding Base 32, I love the justification used for removing U. I, L, and O all have potential confusion with digits, but U was removed because of "Accidental obscenity".
I'm surprised it wasn't just "and all vowels" with the same reasoning, or at least 'a' (because I can more readily think of examples than for, say, 'e').
I suppose, though, there's an attraction in using b32 rather than b29... (Though I notice mid-word apostrophes are double-tap-selectable at least on macOS, so perhaps swapping 'a' for ''' would be advantageous, if more complicated to explain.)
I find it interesting that you're willing to say "twat" but not "bastard". I admit that I didn't think of either of those words, though I think they're still much tamer than the ones with "u" in them. Really what I was thinking of was "crap" and "damn".
I once had a moment of panic after coming up with and shipping a custom scheme along these lines, when I realized there was a decent probability of generating accidental profanity. Afterwards I came up with a filter, and we spent a fun day filling it with every nasty word we could think of.
The best word-based scheme I've seen (e.g., for use over the phone, and also for memorizing) is Mnemonicode, originally created by Oren Tirosh, who has since abandoned it; there are multiple compatible versions all around, see e.g. [0].
Unfortunately, most of the references around the web link to Tirosh's original work on the Wayback Machine, which used to be hosted on "tothink.com", but the new owners put up a "robots.txt" that makes even the old version inaccessible on archive.org.
Still better is grouped alphanumeric without potentially ambiguous characters (i.e. generate number 1s but not letter Is, and number 0s but not letter Os). I wrote a disambiguation library based upon the checksums present in IBAN @ https://github.com/globalcitizen/php-iban ... it's surprising how accurate the mistranscription suggestions are.
In general, pure-number systems are better if possible, but where you have trouble squashing enough data into a suitably compact form, transitioning to alphanumeric is better.
You can also consider the use of prefix-based systems, either utilizing temporal epochs or node-specific prefixes, both of which can utilize readable aliases.
Finally, for anything expecting human transcription, checksum systems are awesome!
It's interesting that Git 2.11 shortened the delta chains on aggressive repacks, when Mercurial happily creates chains of > 1000 deltas (AFAIK, it doesn't have a hard limit; it stops using deltas when the size of the required deltas is larger than the full text).
Although it's worth noting Mercurial and Git use different delta formats.
Master Coder: Hmm. We refer to changes by a long, totally non-human-parseable string of characters that nobody can memorize, and when we abbreviate it, it doesn't work 100% of the time. What can we do about it?
Novice Apprentice: Well... how about we stop using a long, totally non-human-parseable string of characters that no human can memorize just to briefly refer to specific changes in human-readable output?
MC: What?! HERESY. Making a human use a cryptographic hash to reference a single random logical reference point in a mass of logical binary objects among millions of others is clearly the best way to go. We just need some quick fixes.
NA: But... you can't reference it via speech, it doesn't work reliably via text when abbreviated, and it gives absolutely no context whatsoever as to what it is. What's the point of using a cumbersome, inhuman reference for something you only need to talk about briefly through a computer interface?
MC: SILENCE, FOOL! ME DESIGN GOOD. YOU MAKE CODE MASTER ANGRY.
NA: Err... but what if we just let the program rename the references temporarily to human-parseable short strings, and resolve what they are in between logs and commits?
MC: I SAID SILENCE!! Just for that, I'm going to make you explain to a new user why we force people to regularly clean out their repositories after doing complicated things with them, like merging.
> Git is full of some of the worst design decisions in modern software history.
It's also full of some of the best software design decisions. The internals of Git are simple and elegant, and they work like a charm. There have been very few changes to the internal workings since the first commit of Git.
I agree that the user interface is inconsistent, ugly and hard to grasp. But if you have a solid understanding of the internals, with the help of the git manpages it's pretty easy to achieve what you want.
There's software that's intended to work like a black box, just poke around the user interface and you can get stuff done. Git is not one of those. You need to understand the internal model, accept that the UI sucks, embrace the manpages, quit whining and get shit done.
> I agree that the user interface is inconsistent, ugly and hard to grasp. But if you have a solid understanding of the internals,
You should not have to understand the design of the modern combustion engine to operate a car. You shouldn't even have to understand bicycle geometry to ride a bike. It's a friggin tool!
Why on earth would you want to have to become a master of the design of a tool just to use it? The whole point of making a complicated tool is to make your life easier! No other revision control system is this complicated and annoying. And I'm not going to quit whining; it sucks and it's stupid and it doesn't have to be, and people keep worshipping it like it's this amazing invention, like nobody's heard of a Merkle tree before. It treats chunks of changed text like blobs, wow, nobody's done that before. Oh, a collection of packed objects, how novel.
By the way, the internals aren't that great either. You have to constantly "clean" your repository as it collects useless crap, merges become such a headache you're asked to destroy your merges to make it somewhat sane to maintain, handling "large" objects is a mystery to us, and the entire design is intended to interface with others and yet it's designed as if your personal repo were the only repo in the universe. You have to use a dozen filters and options and processes to do what a single script could do if it asked you what you wanted to get done, but we have to literally sacrifice a goat on the mountain of Unix Philosophy in order to get something done and go back to doing real work.
Why does it force us to send these anonymous patches and not allow merges to happen intelligently among a group using locks? Why does it force a maintainer to do all the work of managing patches? Why does it waste local storage when we don't need 99% of the repository most of the time? Why can't log messages and status be rendered in a quasi-usable way, or, heaven forbid, we have a Curses frontend for the myriad of random chunks of text and commands we have to memorize to accomplish one small simple operation? Why does the repository fall apart into a completely unusable mess if you don't weed it once a day? Why do we have to shuffle around a bunch of commands to perform a single simple task that any reasonable program 15 years ago would have done for you?
Answer: Because the design is crap. If people would at least just admit the design is crap, I would stop whining. But at this point I feel like I'm the only sane human in a world full of people doing the work for the robots and smiling about how much easier their lives are now.
I disagree with you; I think the hash is a very good auto-generated unique identifier for every commit. What do you suggest using instead? You do have the option to tag commits with human-friendly names.
Imagine if you merge a branch of old commits from another repo or something, which introduce short hash collisions. Then you copy/paste a short hash, and Git doesn't know when that reference is from or which branch it might refer to.
It would defeat the purpose of a content-addressable storage system. The fact that the same file and the same tree will always have the same hash is important for speeding up diffs, merges, and other operations.
[1]: https://github.com/git/git/blob/v2.11.0/Documentation/RelNot...