GitHub Hash Search

A few weeks ago my co-worker issued a challenge to my team as a response to a random conversation:

The first one to find one of these as a legitimate commit ID wins:

We speculated a little bit about how you might accomplish it - mostly some varation on "clone some large git repo and grep through the commits", but quickly shrugged it off. I found myself thinking about it later that day and couldn't help but really dive into the problem.

Estimating the scale

First I thought a little bit about the scale of the problem - how many git commits do I need to search before I can reasonably expect one of them to begin with a magic string?

git uses 160 bit SHA1 hashes for commit IDs so there are \(2^{160}\) possible hashes. I want to find any commit that has a magic string at the beginning of it's hash. The typical length of the strings is about 7 (hex) characters = 3.5 bytes = 28 bits so for each magic string, there are \(2^{160-28}\) possible commit hashes that start with that string.

The probability of any given hash starting with a particular magic string is \(\frac{2^{160-28}}{2^{160}} = 2^{-28}\approx\frac{1}{250\text{ million}}\). Since I actually have close to 40 different strings that are approximately 7 characters each (of course some are shorter, but I'm only trying to get a rough estimate here), the probability that a commit hash starts with some magic string is about \(\frac{40}{250\text{ million}}\) or about 1 in 6 million. So if I can somehow collect on the order of 10 million git commit hashes I should find one of these strings.

I later wrote a script to compute these odds more precisely for a given set of magic strings, and found that 1 in 7,642,290 hashes should start with one of the magic strings.

Collecting the hashes

Alright so now where am I going to find around 10 million git commit hashes? I thought about cloning a few large git repos, like the linux kernel, but that would require downloading a ton of actual diff data that I don't care about, and even then the linux kernel only has 722,647 commits at the time of writing, so I'd need to find 10 or so different repositories of a similar scale.

Before attempting that I poked around in the man pages a bit and found git ls-remote. It takes a git remote url and returns a list of references available on the remote along with their commit IDs. Here's the output for this website's git repo:

$ git ls-remote
fb6bac0e6efff6f8f74a35a726a2e411ec00d1ad    HEAD
47e11c77a29862b8875e60aea8c9a755c84c31ee    refs/heads/custom-theme
fe6f73099c2c7a40997c32396eab60c65a5ef06c    refs/heads/git-stash
fb6bac0e6efff6f8f74a35a726a2e411ec00d1ad    refs/heads/master

Of course the catch is that ls-remote only returns commit IDs for references like tags and branches, and most git repos don't have that many of them, even the linux kernel only has about 2000:

$ git ls-remote | wc -l

but if I can somehow find a ton of git remotes this might work!

Enter GitHub

An obvious source for a large number of git remotes is GitHub. GitHub hosts about 25 million public repositories1, and it has a public API for accessing them. I wrote a python script to hit the search API:

$ python pie

Then, used my machine's built-in dictionary to run through every dictionary word:

$ cat /usr/share/dict/american-english | grep -v "'" | xargs -I{} python {}
Searching for A...

Then I ran git ls-remote on every remote and collected all the references (with their commit hashes) in a file. chains all these steps together and generates a hashes file with all the results. The search API only allows 30 requests per minute which seriously rate limits the entire process, but at least its fully automated. I let the script run for several days, and eventually collected 57.8M unique commit hashes.

Once I'd collected all these results in a file its a simple matter to grep the results for magic strings... right?

Searching the Results

I figured searching the results would be relatively easy, and it was at first, but once I had gigabytes of GitHub commit hashes it started to take a significant amount of time to search the file.

Initially, I ran my searches with the command:

cat hashes_to_find.txt | xargs -P4 -I{} grep ^{} hashes

This uses xargs to spawn an independent grep process for each commit hash, and run up to 4 different searches in parallel. I eventually realized that this is a pretty silly approach - each process scans through the entire file separately, when a single process could easily scan through the file once searching for every string.

I wrote a second command that does just that. It uses a short python script to generate a regex of the form ^facade|^1badb002|^8badf00d|^abababab..., which is then passed to grep -E.

$ grep -E `cat hashes_to_find.txt | python -c "import sys; print('|'.join('^' + line.strip() for line in sys.stdin))`"

I also investigated using grep -F, a version of grep that uses the Aho-Corasick Algorithm2. Aho-Corasick is supposed to be faster than regex when searching for a set of fixed length strings, which sounds perfect. Unfortunately, grep -F searches for matches anywhere in the line, and I'm only interested in matches at the beginning.

I benchmarked the different approaches and found that searching with a single grep -E command is about 10x faster on my machine.

Enough! Tell me what you found!

Ok enough explanation - here's a rundown of the commits that I found. After running a GitHub API search for every word in my system's dictionary I collected 57.8M unique commit hashes. Out of those, I found 12 that begin with a magic string:

The astute reader will notice that according to my probability estimate of 1 in ~7.5M I should have found ~7.7 magic strings in this dataset. Obviously the standard disclaimer applies - we're talking about probabilities here, just because you saw 10 heads in a row doesn't mean the coin is broken, it means you saw one strange but equally likely outcome. Also my numbers were thrown off a bit by bradfitz, who wrote a go program that brute-forces a desired commit hash. He used it to generate a commit with the hash deadbeef, which I found in my search.

Needless to say, I won my co-worker's challenge. Still not sure what the prize is.

  1. I found this in the State of the Octoverse
  2. Wikipedia has a really useful write-up on the variations of grep
  3. For whatever reason I can't link directly do this commit on GitHub. but if you run git ls-remote on the remote the commit is there