I needed to get some historical data out of Github that was no longer available and stumbled over a really neat project called GHTorrent that records and augments the Github public event stream. I thought it would be neat to see who’s using Github using this information.

Note: If you only need to do a query or two, there’s a web-based SQL tool here.

We first need to download and then restore the database to get access to the latest dump of users. For this you have two choices;

  1. A huge (14.1+GB) MySQL dump containing everything that GHTorrent has ever seen. By far the most up to date option.
  2. Chunked MongoDB backups for specific collections (users in this case) which is much, much smaller. However, this is very out of date.

We’re going to go with the large MySQL dump for this, as it has over a million more users than the MongoDB backups. So go ahead and start downloading the latest dump from here.

Now we’re left with a single utterly massive Gzip’d SQL dump. We don’t want or need to expand this entire thing (which would be over 100GB) and all we care about is the users table so we don’t want to restore it all anyways.

So, let us pipe it and cut out just the users table. This way we don’t waste any RAM or disk space.

gzip -d -c mysql-2014-11-10.sql.gz |\
    sed -n -e '/CREATE TABLE `users`/,/UNLOCK TABLES;/p' |\
    sed -e '16d' > users.sql

At this point you might run into an error:

sed: RE error: illegal byte sequence

This is caused by the eternal pain of Unicode, so lets cheat and tell sed to treat everything as a simple byte string and run it again.

export LC_CTYPE=C
export LANG=c
gzip -d -c mysql-2014-11-10.sql.gz |\
    sed -n -e '/CREATE TABLE `users`/,/UNLOCK TABLES;/p' |\
    sed -e '16d' > users.sql

Now we have a file users.sql which contains a dump of only the users table at a measly ~524MB.

Assuming you’ve already created a MySQL/MariaDB database called “github”, lets go ahead and restore it now.

mysql -u root -p github < users.sql

Woohoo! We’ve got a local copy of over 4.7 million users now. Let’s slice the domain from each email address and see who our winners are:

USE github;
SELECT
    SUBSTRING(email, INSTR(email ,'@') + 1) as email_domain,
    COUNT(email) as email_count
FROM users
WHERE LENGTH(email) > 3
GROUP BY email_domain
ORDER BY email_count DESC;
email_domain email_count
gmail.com 1120807
hotmail.com 89317
yahoo.com 48821
users.noreply.github.com 41457
qq.com 33443
163.com 25998
outlook.com 16523
me.com 11916
live.com 11516
googlemail.com 11483
mail.ru 11093
126.com 10325
yandex.ru 8043
localhost.localdomain 7364
gmx.de 7319

Huh, well I guess that was predictable! The top 15 are dominated by the major email providers. #4 is a fake placeholder, and #14 is a default host on Linux and some BSDs.

Lets try something more interesting and see which USA schools are into Github.

USE github;
SELECT * FROM (
	SELECT
		SUBSTRING(email, INSTR(email ,'@') + 1) as email_domain,
		SUBSTRING(email, -3, LENGTH(email)) as email_suffix,
		COUNT(email) as email_count
	FROM users
	WHERE LENGTH(email) > 3
	GROUP BY email_domain
	ORDER BY email_count DESC
) as t
WHERE email_suffix = 'edu'
LIMIT 15;
email_domain email_suffix email_count
mit.edu edu 2089
berkeley.edu edu 2000
umich.edu edu 1709
stanford.edu edu 1702
uw.edu edu 1231
cornell.edu edu 1196
gatech.edu edu 1059
andrew.cmu.edu edu 1059
usc.edu edu 1039
nyu.edu edu 953
fullsail.edu edu 920
utexas.edu edu 917
vt.edu edu 888
illinois.edu edu 850
virginia.edu edu 849

Neat! It would be interesting to mix this with the size of each school’s CS/CE departments and see what the adoption rate is. Of course, it’s only an aproximate since someone might be using their personal email address.

Something else that might be interesting, let’s try the 10 largest software companies (as listed by Forbes).

USE github;
SELECT * FROM (
	SELECT
		SUBSTRING(email, INSTR(email ,'@') + 1) as email_domain,
		COUNT(email) as email_count
	FROM users
	WHERE LENGTH(email) > 3
	GROUP BY email_domain
	ORDER BY email_count DESC
) as t
WHERE email_domain IN (
	'microsoft.com',
	'oracle.com',
	'sap.com',
	'symantec.com',
	'vmware.com',
	'fiserv.com',
	'ca.com',
	'intuit.com',
	'salesforce.com',
	'amadeus.com'
)
email_domain email_count
microsoft.com 1550
oracle.com 907
sap.com 853
vmware.com 690
salesforce.com 546
intuit.com 287
amadeus.com 154
symantec.com 77
ca.com 69
fiserv.com 4