Re: git: a1097094c4c5 - main - newvers: Set explicit git revision length

From: Brooks Davis <brooks_at_freebsd.org>
Date: Thu, 19 Dec 2024 18:45:41 UTC
On Thu, Dec 19, 2024 at 10:03:05AM -0500, John Baldwin wrote:
> On 12/18/24 12:12, Gleb Smirnoff wrote:
> > On Wed, Dec 18, 2024 at 10:22:24AM -0500, Ed Maste wrote:
> > E> That said, it doesn't matter what Git's algorithm chooses as the short
> > E> hash length; specifying --short bypasses that algorithm. `git
> > E> rev-parse --verify --short=12 HEAD` will give us a 12-character short
> > E> hash as long as that hash is unique. The reproducibility concern is
> > E> thus: what is the probability that the 12-character short hash is
> > E> unique at the time and in a repo from which an image is built, but is
> > E> not unique for the attempt to reproduce it, or vice-versa. This
> > E> probability is rather small.
> > E>
> > E> If you look at arbitrary commits 6 or 7 characters are usually
> > E> sufficient for a unique hash today. For instance, some latest -pX from
> > E> recent releng/ branches:
> > E>
> > E> 13.3: 72aa3d
> > E> 13.4: 3f40d5
> > E> 14.0: f10e32
> > E> 14.1: 74b6c98
> > E> 14.2: c8918d6
> > E>
> > E> The status quo of --short=12 should be fine for quite some time.
> > 
> > AFAIU John's concern is that you can't guarantee a reproducible build from a
> > "dirty" repository.  A repository that has more branches than just the official
> > ones.  I just made a quick check on the Netflix repo, which has both the
> > current FreeBSD history and the before-the-official-git history together,
> > as well as split ports subdirectories and of course our own stuff.  For
> > short hashes there are roughly 2x more ambiguities than for a "clean" repo.
> > Apparently the chance of collision on a long hash is also doubled.
> > 
> > We can of course say that we don't provide reproducible builds from a "dirty"
> > repo.  But that would be a real limitation.  That would cancel a legitimate
> > scenario:
> > 
> > git subtree add FreeBSD && cd FreeBSD && make a reproducible build
> 
> In particular, the dirty repository scenario I imagine is FreeBSD's official
> repository at some point in the future.  A question though is how far in the
> future would it have to be to matter.  If we would need 100+ years at our
> current commit rate to matter, then this is probably moot.  The other question,
> I guess, is how many other user git settings can affect the build.  Should
> we not require an empty global git config as a prereq for someone who wants a
> reproducible build (and use the same setup for our official builds) and say
> that if you adjust your user config to impact the build that's kind of your
> problem?

I'm not super concerned about rollover here.  If it becomes an issue,
and someone wants to reproduce the build in the future (e.g., a decade
from now) they can always produce a custom repo with future history
removed to avoid having git add extra digits.  IMO that's going to be
the least of their problems given they will need to bootstrap the
correct LLVM in order to make sure binaries are the same.

For FreeBSD itself, I think we're a very long way away.  FreeBSD main
from about a week ago has 296268 commits per `git rev-list --count HEAD`
and CheriBSD has more than twice as many at 662027[0] (more than LLVM's
521761).  All default to 12 digits for short.  If we wanted to add some
margin, going to 13 should last until SHA1 is completely untenable as a
hash.
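
As a rough sanity check on those numbers, here is a back-of-the-envelope
sketch in Python.  It is not anything newvers does; the function names are
made up for illustration.  It assumes SHA1 prefixes behave as uniformly
random values, and it uses the commit counts above as a stand-in for the
object count.  Git checks abbreviations against every object in the repo
(trees, blobs and tags as well as commits), so the real figures will be
somewhat higher.

```python
import math

def p_specific_prefix_ambiguous(n_objects: int, hex_digits: int) -> float:
    """Chance that one given object's prefix is shared by at least one of
    the other n_objects - 1 objects (assumes uniformly random hashes)."""
    space = 16 ** hex_digits
    # 1 - (1 - 1/space)^(n - 1), computed stably for very small values.
    return -math.expm1((n_objects - 1) * math.log1p(-1.0 / space))

def p_any_prefix_collision(n_objects: int, hex_digits: int) -> float:
    """Birthday-bound approximation of the chance that any two of
    n_objects share a prefix of the given length."""
    space = 16 ** hex_digits
    return -math.expm1(-n_objects * (n_objects - 1) / (2.0 * space))

if __name__ == "__main__":
    # Commit counts quoted above; used here as a proxy for object counts.
    repos = (("FreeBSD main", 296268), ("CheriBSD", 662027), ("LLVM", 521761))
    for name, n in repos:
        for digits in (12, 13):
            print(f"{name:13s} {digits} digits: "
                  f"this commit ambiguous {p_specific_prefix_ambiguous(n, digits):.3e}, "
                  f"any pair collides {p_any_prefix_collision(n, digits):.3e}")
```

The first figure is the one that matters for reproducing a particular
build: it is the chance that the hash being stamped into the image is no
longer unique at 12 digits.  Under these assumptions it scales linearly
with repo size, and each extra digit divides it by 16.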

-- Brooks

[0] For those following along, this has two causes: 1) we have both the
current history and uqs's git export history in our history, and 2) we
merge each upstream commit individually, so we've added a merge commit for
each first-parent commit to src/main since 2015.