Git 2.11 has been released (github.com/blog)
281 points by stablemap on Nov 29, 2016 | 65 comments



This is a really well-written partial set of release notes. I was curious and looked at the full release notes [1], and I think those are pretty well written as well. I'm very impressed, especially given that git has such a large set of contributors.

[1]: https://github.com/git/git/blob/v2.11.0/Documentation/RelNot...


Another nice writeup from Atlassian:

https://news.ycombinator.com/item?id=13066516


I liked this gem in L547:

> The code that we have used for the past 10+ years to cycle 4-element ring buffers turns out to be not quite portable in theoretical world.


That piqued my curiosity, and I had to dig up the relevant commit: https://github.com/git/git/commit/bb84735c80.

It deals with wrapping an integer index around after incrementing it: the old code used ++index and a bitmask, while the new code uses + 1 and modulo.

I have problems understanding this right now; in my world, ++index for an int really shouldn't trigger overflow when counting to at most 4, in any (semi-)realistic environment?

Feeling extra dense, must have more coffee.


The counter never resets, it just keeps going up and we only look at the low bits. So eventually it will need to wrap. It's doubtful that ever happened in practice, even on a 32-bit system (you'd need to print 2 billion SHA-1s in a single process, and even the largest repos have on the order of millions).

So the key difference between the old and the new is that the counter resets to zero every fourth call.
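To make the difference concrete, here's a rough sketch of the two cycling schemes (a Python stand-in, not the actual C code; in C, the hazard is that an unbounded `++bufno` on a signed int is undefined behavior once it passes INT_MAX):

```python
# Sketch of the two ring-buffer cycling schemes (4 slots, as in Git's
# static hex buffer). Illustrative only.

def cycle_old(bufno):
    # Old scheme: the counter grows without bound; the low two bits
    # pick the slot. In C, ++bufno on a signed int is undefined
    # behavior once it overflows.
    bufno += 1
    return bufno, bufno & 3

def cycle_new(bufno):
    # New scheme: the counter itself wraps every fourth call, so it
    # can never overflow.
    bufno = (bufno + 1) % 4
    return bufno, bufno

old = new = 0
for _ in range(8):
    old, old_slot = cycle_old(old)
    new, new_slot = cycle_new(new)
    assert old_slot == new_slot  # same slot sequence: 1, 2, 3, 0, 1, ...
```

Both select the same slots; only the new scheme keeps the counter bounded.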


Ah, right, that was the thing I failed to see. Obvious now, of course. :) Thanks.


In Git Rev News edition 20 (https://git.github.io/rev_news/2016/10/19/edition-20/) there are also articles about some changes in v2.11:

- Changing the default for “core.abbrev”?

- Prepare the sequencer for the upcoming rebase -i patches

(I am a Git Rev News editor.)


They were off by a factor of 10 on the likelihood of being struck and killed by lightning, according to the NWS website.

To clarify: the likelihood of being merely struck by lightning is ~1/1,000,000 per year. The likelihood of being struck and killed is 1/10,000,000, or about 1/2^23.25.

Given this, you would only have to be struck and killed by lightning 6.8 years in a row to equal a sha1 hash collision probability.
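A quick sanity check on the arithmetic (assuming the target is the collision probability of one specific pair of SHA-1 hashes, 2^-160):

```python
import math

# Odds of being struck and killed by lightning in a year, per the
# figures above: 1 in 10,000,000.
bits_per_year = math.log2(10_000_000)  # ~23.25 bits

# One specific pair of SHA-1 hashes collides with probability 2^-160,
# so the number of consecutive years needed is:
years = 160 / bits_per_year
print(round(years, 2))  # → 6.88
```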


    > Given this, you would only have to be struck and killed
    > by lightning 6.8 years in a row to equal a sha1 hash
    > collision probability.
Oh, well that changes everything! ;)


Oh, you're right. I misread the chart (and just fixed the blog post).

There are a lot of other caveats, too. Such as the idea of each year being an independent probability.


More importantly, the comparison is useless. The odds of running into issues with SHA-1 collisions in Git is a very different question from just the odds of two random SHA-1 hashes colliding.


Doesn't the birthday paradox make it much more likely to eventually occur?


I'm not sure those events are independent... ;-)


I think the probability of being killed by lightning even two years in a row is already zero.


The situational comparison originally used by Linus for a sha1 collision was all the members of your development team being killed and eaten by wolves. I'm not sure if he gave a time-frame that would need to occur in.


Great write-up! I love the focus on performance in this release.

I've put together another write-up of the Git 2.11 release that discusses some of the other new features (and goes into a little more detail on some of the 'sundries'): https://medium.com/@kannonboy/whats-new-in-git-2-11-64860aea...


That is a nice writeup. One of the interesting things for me was to see which topics you decided to cover and which to omit. For instance, I noted `clone --reference --recurse-submodules` as a potential topic of interest, but I am afraid to point anybody to the `--reference` option due to its hidden dangers.

I'm also curious how you came up with 19,290 for a birthday paradox on a 7-hex hash. I think it's 16,384, but probability can sometimes be tricky. :)


Thanks Peff, congrats on the great release!

I came up with 19,290 using the generalized birthday formula[0] (actually after double-checking it's slightly closer to 19,291).

16,384 is the value you get using the square approximation method[1] which I believe is a bit less accurate in terms of probability, but faster to calculate. I think Git's using square approximation under the hood -- which is probably a good thing since I think it'll always yield a more conservative result.

[0]: https://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_col...

[1]: https://en.wikipedia.org/wiki/Birthday_problem#Square_approx...
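For the curious, both figures can be reproduced from the formulas at those two links (a quick sketch; d is the number of distinct 7-hex-digit abbreviations):

```python
import math

d = 16 ** 7  # distinct 7-hex-digit abbreviations

# Generalized birthday formula: n(p; d) ≈ sqrt(2 d ln(1 / (1 - p))),
# here with p = 0.5.
n_general = math.sqrt(2 * d * math.log(2))

# Square approximation: p ≈ n^2 / (2 d), solved for n at p = 0.5.
n_square = math.sqrt(d)

print(round(n_general), round(n_square))  # → 19291 16384
```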


Right, I am so used to the square approximation being used for hash collisions that I forgot it was an approximation. Thanks for setting me straight.


Great writeup! Btw what tool did you use to make the merge diagrams?


Thanks! Just Keynote & GIMP on macOS


I'm curious, though not motivated enough to really search for it: for ambiguous hash abbreviations, why not select the oldest object, since presumably it was unique at the time it was created?

edit: I guess that information must not exist, or I assume they'd already be doing it.


As others have noted, there's not always an unambiguous date for some object types (the best you can do for blobs is to find the first commit in which they appeared, and use its date).

However, there's a more complicated issue with timestamps, which is that you care about what was in the repository of the person who generated the sha1, at the time of generation. So you could merge in history that includes older commits, and invalidate your sha1s with "older" objects.

So the timestamp of interest is not the one in the objects themselves, but when they entered some particular repository (and not even some well-known repository; the local clone of whoever happened to generate the sha1). That being said, those two things correlate a lot in practice, and auto-picking the oldest commit might be a useful heuristic.

It would be a fun project to implement as an option for `core.disambiguate`.


That heuristic would work most of the time. But it rests on the assumption that "if commit A has an older timestamp than commit B, then any user who saw commit B must have also seen commit A", which is not reliable in a distributed version control system.

It seems safer to just explicitly tell the user when they're trying to work with an ambiguous hash.


Yeah, I guess dates are only available for commits but not blobs or trees.

This is suggested by the disambiguation listing; if there were dates they would be displayed I hope.

I think approximate dates might be inferred, but since that might be misleading and costlier to determine, it makes sense to leave it out - at least in this version of Git.


Dates are available for some objects, see http://stackoverflow.com/a/39930978/1832154 for more details about what is/isn't available.

The issue is a different one; I believe you're considering one specific situation while there are others to ponder. What would happen if someone copy/pasted part of the hash, or had some tool that always reduced the output to the first few digits, or other situations like these? How would you be able to tell that the user was actually after the oldest commit? It seems much easier to indicate there's a problem, a conflict, and let the user solve it.


Converting to and from base 36 (or 32) would probably do more to help the problem than any heuristics. Compare:

  66c22ba6fbe0724ecce3d82611ff0ec5c2b0255f
to:

  c04bo5604v5qsp6asgasjp9y4paxu8v
That's approx a 25% gain in compactness.
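A plain base-36 conversion is easy to sketch (the digit alphabet here is an assumption; the example string above may have been produced with a different alphabet or ordering, but the length works out the same):

```python
def to_base36(n):
    """Encode a non-negative integer in lowercase base 36."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out or "0"

sha1_hex = "66c22ba6fbe0724ecce3d82611ff0ec5c2b0255f"
b36 = to_base36(int(sha1_hex, 16))

# 160 bits need 40 hex characters but only 31 base-36 characters.
print(len(sha1_hex), len(b36))  # → 40 31
```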



Interesting, I was hoping for an example at the end of the README.


(I'm the author.) That is an excellent suggestion. Now that you've mentioned it, it seems like a glaring omission. I'll try to fix that up once I get home.

In case the other parts of the README weren't clear, the concept was to use any Unicode character. I was even thinking of (eventually) getting it to encode data with combining accents. Note that it was intended to optimize the string for screen display space (pixels), not byte size.

I'm not sure it'd be a good fit for git hashes, simply b/c sometimes you need to type or speak a git hash, and the output from baseunicode was definitely not intended to be pronounceable. (Esp. since I was thinking of using CJK characters, but trying to weight them down for their wider screen area; imagine trying to describe that to a co-worker who might only speak English.)

I wrote it mostly for fun, after I had a couple of files that were difficult to transfer between machines in the cloud. I find myself ssh'd into weird places, and scp'ing is sometimes trying. (I do machine-to-machine, so I almost always need -3, and I don't know why that isn't the default; scp doesn't deal well with the file being only accessible by root, not your user; scp has the weirdest arg syntax if you have ill-advised characters in your filenames, like spaces…) So I was cat'ing files, copying them from one window, and pasting into another window. base64 for binary data, tar/gzip for making it smaller. But for the copy/paste, scrolling is a pain, and heaven forbid if you're in screen/tmux.

(Also, if you find yourself really without a file that you can't scp, you can "re-implement" scp with `ssh $hosta sudo tar -cz <stuff> | ssh $hostb sudo tar -xz`; see also the -C flag, and don't forget you can also `ssh $host "sudo bash -c 'cd /where && tar -cz <stuff>'"`)


That's an interesting thought, and I don't feel there's any advantage in the text being in hex.

The only problem is that by now probably too much code expects hex, so I'm not sure the gain is big enough to be worth the pain of the switch.


A nice side effect of hex is that you can pick it out of a text commit message with higher accuracy (e.g., to turn it into a hyperlink). Tools like `gitk` and sites like GitHub do this using a regex.


Participate in Atlassian Research

My name is Angela and I do research for Bitbucket. I'm kicking off a round of discussions with people who use Git tools. Ideally, I'd like to talk to people that sit on a team of 3 or more. If this is you, I would love to talk to you about your experience using Git tools, or just some of the pain points that are keeping you up at night when doing your jobs.

We'll just need 30 mins of your time, and as a token of my thanks to those that participate, I'd like to offer a US$50 Amazon gift voucher.

If you're interested, just shoot me an email with your availability over the next few weeks and we can set up a time to chat for 30 minutes. Please also include your timezone so we can schedule a suitable time (as I'm located in San Francisco). Hope to talk to you soon!

Cheers,
Angela Guo aguo@atlassian.com


It's been a while since I looked into what Git was up to in the latest version.

The release notes mention protocol improvements for filter drivers (the new long-running filter process) that can dramatically speed up Git LFS (the large file storage plugin).

Does anyone know if there are any plans to make Git LFS part of core Git instead of an add-on?


There is work going on about external object database support that could help in the long run:

https://github.com/git-lfs/git-lfs/issues/1702

(I am working on this for GitLab.)


I'd assume there's licensing issues before anything else: git-lfs is MIT licensed, git GPLv2.


It's a compatible license, so it's not an issue.


I'm curious if anyone knows if the optimizations Twitter made to improve fetch performance for large, active repos have made it upstream yet? I don't work there anymore and neither do any of the people who were originally doing that work, but it was a pretty impressive speed up (I could git pull thousands of commits and be done in under a second on a 3GB repo with no large objects). I know the watchman support made it in, which was the other half of what made large repos perform well, but I haven't seen mention of the log-structured patch queue stuff that helped the server by eliminating most of the work to calculate what to send on a fetch. Anyone know?


Hexadecimal dumps of binary data are the worst of all worlds if used as keys/references. Hard to memorize, hard to type, look ugly, aren't compact.

Better alternatives:

Base64 without padding: compact.

Grouped decimals: slightly less compact than hexadecimal, but extremely easy to type and pronounce. E.g. 577-467-341-467
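Roughly, for a 160-bit digest (base64 via the stdlib; the grouped-decimal value is just the example from above):

```python
import base64

digest = bytes.fromhex("66c22ba6fbe0724ecce3d82611ff0ec5c2b0255f")

# Base64 without padding: 160 bits fit in 27 characters, vs 40 in hex.
b64 = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
print(len(b64))  # → 27

# Grouped decimals: longer, but easy to read aloud and to type.
grouped = "{:,}".format(577467341467).replace(",", "-")
print(grouped)  # → 577-467-341-467
```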


Case-insensitivity is important for being able to reliably remember a string. I won't easily retain the difference between 'b4dQbFs31' and 'b4DqBfs31'.

Same thing when speaking it out loud. 'B four D capital Q B capital F s thirty-one' is way more convoluted and error-prone than 'B four D Q B F S thirty-one'.

The best thing I've found that fits this criterion is Crockford's Base 32 [1], basically an extension of the hex digits, removing the letters I, L, O, and U.

But Base 32 (which is what buys us case-insensitivity) only gets us to 5 bits per character, a mere 20% reduction in length over the 4 bits of base 16. So instead of the 20 bits `1ab2f` we could express them with something like `1qm3`.

Or we could be using words...

[1]: http://www.crockford.com/wrmg/base32.html
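A minimal sketch of encoding with Crockford's alphabet (uppercase alphabet as in the spec; the `1qm3` above was just illustrative, so the actual digits differ):

```python
# Crockford's Base 32 alphabet: 0-9 plus A-Z minus I, L, O, U.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def crockford32(n):
    """Encode a non-negative integer in Crockford Base 32."""
    out = ""
    while n:
        n, r = divmod(n, 32)
        out = ALPHABET[r] + out
    return out or "0"

# The 20 bits of hex `1ab2f` (5 chars) fit into 4 base-32 chars.
print(crockford32(0x1ab2f))  # → 3ASF
```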


Regarding Base 32, I love the justification used for removing U. I, L, and O all have potential confusion with digits, but U was removed because of "Accidental obscenity".


I'm surprised it wasn't just "and all vowels" with the same reasoning, or at least 'a' (because I can more readily think of examples than for, say, 'e').

I suppose, though, there's an attraction in using b32 rather than b29... (Though I notice mid-word apostrophes are double-tap-selectable at least on macOS, so perhaps swapping 'a' for ''' would be advantageous, if more complicated to explain.)


The two "worst" swearwords that pop into mind both have "u" in them. Meanwhile, the swears with "a" in them seem to be the tamest of the lot.


"Tameness" varies extraordinarily by region - an infamous example being 'twat' which throughout the UK ranges from friendly to vulgar.

Regardless of variance, I don't think I'd regard either the above or the name of a fat character in Austin Powers as "tamest of the lot".


I find it interesting that you're willing to say "twat" but not "bastard". I admit that I didn't think of either of those words, though I think they're still much tamer than the ones with "u" in them. Really what I was thinking of was "crap" and "damn".


I once had a moment of panic after coming up with and shipping a custom scheme along these lines, when I realized there was a decent probability of generating accidental profanity. Afterwards I came up with a filter, and we spent a fun day filling it with every nasty word we could think of.


That's actually pretty brilliant


The best way to use words that I've seen (e.g., over the phone, and also to memorize) is Mnemonicode, originally created by Oren Tirosh. He has since abandoned it, but there are multiple compatible versions all around; see e.g. [0].

Unfortunately, most of the references around the web link to Tirosh's original work on the Wayback Machine, which used to be hosted on "tothink.com", but the new owners put up a "robots.txt" which makes even the old version inaccessible on archive.org.

[0] https://github.com/singpolyma/mnemonicode


Being able to easily copy/paste the string is important. The string should be a word so double clicking selects the full string. Maybe base62?


Still better is grouped alphanumeric without potentially ambiguous characters (ie. generate number-1's but not letter-I's, generate number-0's but not letter-O's). I wrote a disambiguation library based upon the checksums present in IBAN @ https://github.com/globalcitizen/php-iban .. it's surprising how accurate the mistranscription suggestions are.

In general, pure-number systems are better if possible, but where you can't squeeze enough data into a suitably compact form, transitioning to alphanumeric is better.

You can also consider the use of prefix-based systems, either utilizing temporal epochs or node-specific prefixes, both of which can utilize readable aliases.

Finally, for anything expecting human transcription, checksum systems are awesome!


> Still better is grouped alphanumeric without potentially ambiguous characters (ie. generate number-1's but not letter-I's, generate number-0's but not letter-O's)

git license-plate?


AFAIK the Damm algorithm is the best for check digits: https://en.wikipedia.org/wiki/Damm_algorithm


It's interesting that Git 2.11 shortened the delta chains on aggressive repacks, when Mercurial happily creates chains of > 1000 deltas (AFAIK it doesn't have a hard limit; it stops using deltas when the size of the required deltas is larger than the full text).

Although it's worth noting Mercurial and Git use different delta formats.

Edit: This is apparently what chooses to store a delta or not in mercurial: https://www.mercurial-scm.org/repo/hg/file/9e29d4e4e08b/merc... self._maxchainlen is not set by default.


Jesus Christ, Git's interface design is horrible.

  Master Coder: Hmm. We refer to changes by a long, totally non-human-parseable string of characters that nobody can memorize,
  and when we abbreviate it, it doesn't work 100% of the time. What can we do about it?
  
  Novice Apprentice: Well... how about we stop using a long totally non-human-parseable string of characters that no
  human can memorize just to briefly refer to specific changes in human-readable output?
  
  MC: What?! HERESY. Making a human use a cryptographic hash to reference a single random logical reference point in a mass of
  logical binary objects among millions of others is clearly the best way to go. We just need some quick fixes.
  
  NA: But... you can't reference it via speech, it doesn't work reliably via text when abbreviated, and it gives absolutely no
  context whatsoever as to what it is. What's the point of using a cumbersome, inhuman reference for something you
  only need to talk about briefly through a computer interface?
  
  MC: SILENCE FOOL! ME DESIGN GOOD. YOU MAKE CODE MASTER ANGRY.
  
  NA: Err... but what if we just let the program rename the references temporarily to human-parseable short strings, and resolve
  what they are in between logs and commits?
  
  MC: I SAID SILENCE!! Just for that, I'm going to make you explain to a new user why we force people to regularly clean
  out their repositories after doing complicated things with them, like merging.


> Git is full of some of the worst design decisions in modern software history.

It's also full of some of the best software design decisions. The internals of Git are simple and elegant and they work like a charm. There have been very few changes to the internal workings since the first commit of Git.

I agree that the user interface is inconsistent, ugly and hard to grasp. But if you have a solid understanding of the internals, with the help of the git manpages it's pretty easy to achieve what you want.

There's software that's intended to work like a black box, just poke around the user interface and you can get stuff done. Git is not one of those. You need to understand the internal model, accept that the UI sucks, embrace the manpages, quit whining and get shit done.


> I agree that the user interface is inconsistent, ugly and hard to grasp. But if you have a solid understanding of the internals,

You should not have to understand the design of the modern combustion engine to operate a car. You shouldn't even have to understand bicycle geometry to ride a bike. It's a friggin tool!

Why on earth would you want to have to become a master of the design of a tool to use it? The whole point of making a complicated tool is to make your life easier! No other revision control system is this complicated and annoying. And I'm not going to quit whining; it sucks and it's stupid and it doesn't have to be, and people keep worshipping it like it's this amazing invention, like nobody's heard of a Merkle tree before. It treats chunks of changed text like blobs, wow, nobody's done that before. Oh, a collection of packed objects, how novel.

By the way, the internals aren't that great either. You have to constantly "clean" your repository as it collects useless crap, merges become such a headache you're asked to destroy your merges to make it somewhat sane to maintain, handling "large" objects is a mystery to us, and the entire design is intended to interface with others and yet it's designed as if your personal repo were the only repo in the universe. You have to use a dozen filters and options and processes to do what a single script could do if it asked you what you wanted to get done, but we have to literally sacrifice a goat on the mountain of Unix Philosophy in order to get something done and go back to doing real work.

Why does it force us to send these anonymous patches and not allow merges to happen intelligently among a group using locks? Why does it force a maintainer to do all the work of managing patches? Why does it waste local storage when we don't need 99% of the repository most of the time? Why can't log messages and status be rendered in a quasi-usable way, or, heaven forbid, why don't we have a curses frontend for the myriad of random chunks of text and commands we have to memorize to accomplish one small simple operation? Why does the repository fall apart into a completely unusable mess if you don't weed it once a day? Why do we have to shuffle around a bunch of commands to perform a single simple task that any reasonable program 15 years ago would have done for you?

Answer: Because the design is crap. If people would at least just admit the design is crap, I would stop whining. But at this point I feel like I'm the only sane human in a world full of people doing the work for the robots and smiling about how much easier their lives are now.


I disagree with you; I think the hash is a very good auto-generated unique identifier for every commit. What do you suggest using instead? You do have the option to tag commits with human-friendly names.


Big fan of the negative parent selector for merge commits. Also enjoyed the writeup of the algorithm improvements for the various caches.


What a great writeup.


When a non-ambiguous short-hash _becomes_ ambiguous, can't it be disambiguated by simply disregarding those not in existence at time of reference?


Imagine if you merge a branch of old commits from another repo or something, which introduce short hash collisions. Then you copy/paste a short hash, and Git doesn't know when that reference is from or which branch it might refer to.


Why doesn't git, upon commit, ensure the sha-1 is unique by including some nonce in it?

I guess if you only work with rebases instead of merges, it should be possible, right?


It would defeat the purpose of a content-addressable storage system. The fact that the same file and the same tree will always have the same hash is important for speeding up diffs, merges, and other operations.
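As a sketch of what that means in practice: a blob's object id is just the SHA-1 of a small header plus the raw content, so identical content always maps to the identical id, with no nonce anywhere.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    # Git hashes a blob as: SHA-1("blob <size>\0" + content)
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Identical content yields the identical id, no matter where or when
# it was committed -- which is what lets diffs and merges skip
# unchanged files without comparing their bytes.
assert git_blob_id(b"hello\n") == git_blob_id(b"hello\n")
print(git_blob_id(b"hello\n"))
```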


I think the answer is "it doesn't need to", because the probability of a collision is too low.

The issue is with collisions in the truncated hashes which are often used to refer to objects in emails and such. Not the full SHA.


Git is the worst



