Analysing Git Logs. Of Linux Kernel Authors, Fridays and Batman

By Liisa Tallinn

Playing around with Git logs can give you fascinating insights into a project. Whether it's analysing your own work or checking how someone else's project is doing, Git logs have it all. Sometimes all you need is a quick general overview, e.g. counting contributions per author. The next minute, you really need to dig into a specific time period (Fridays!), look at a keyword or email domain. Zooming out and digging deep into Git repos is a walk in the park with the right tool - follow our lead.

As an example, we’re going to play around with the Linux kernel source tree, mostly because of its size (~700MB) and the impressive number of commits. Parsing these logs creates a dataset of more than 800 000 rows, so there are plenty of records to analyse. The repo was downloaded and the logs generated on 1 March 2019. The data stretches back almost 14 years. When the records are sorted by commit timestamp, the first commit by Torvalds dates back to Saturday, April 16, 2005. As that commit adds more than 6.7M lines, it's probably not the birth date of Linux but rather its first date with Git.

Get the Data and the Analyser

To get started and play along 1) pick a Git repository you’re interested in and clone it to your machine (if it's not there yet) 2) navigate to the folder and extract the logs in a pretty format to a file.
git log --pretty=fuller --shortstat > logfile.log
With SpectX, you can quickly analyse any unstructured data in its raw form, and the glorious multiline gibberish produced by that command is no exception. There are plenty of options for generating a compact one-line Git log, but because parsing multiple lines with SpectX is relatively easy, we were OK with the fuller option. SpectX is commercial software, but you can grab the full-functionality 30-day free trial to copy-paste and run these queries against your own repo. If needed, SpectX allows you to stay offline: download and install it on your machine and run queries on the raw log file, with no need to ingest or import anything into the cloud. See more on installing and getting started in the SpectX docs.

First, let's take a quick look at the raw log file to get an idea of the data structure. Looks like multiline records containing hashes, timestamps, strings, email addresses. 
commit f6163d67cc31b8f2a946c4df82be3c6dd918412d
Merge: 2137397c92ae 0358affb5cd8
Author:   Linus Torvalds <torvalds@linux-foundation.org>
AuthorDate: Wed Feb 20 14:14:31 2019 -0800
Commit:   Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Wed Feb 20 14:14:31 2019 -0800

    Merge tag 'docs-5.0-fix' of git://git.lwn.net/linux
   
    Pull documentation fix from Jonathan Corbet:
    "A single patch from Arnd bringing some top-level docs into the 5.0
      era"
   
    * tag 'docs-5.0-fix' of git://git.lwn.net/linux:
      Documentation: change linux-4.x references to 5.x

Parse the Log

To aggregate and play around with this multiline data, we need to parse these lines into clean records. The following is the SpectX pattern for getting started with this pretty Git log. See the comments ('//') for details.
//a fixed string 'commit ' followed by exactly 40 characters in the range of a-z and 0-9. We'll name the field 'commit'. EOL is end of line.
'commit ' [a-z0-9]{40}:commit EOL

//similarly, fixed strings 'Merge: ' and 'Author: ', then 'LD' matching everything else on that line. The asterisk lets a field match an empty value. The Merge line is wrapped in ( )? because only merge commits have it.
('Merge: ' LD:merge EOL)?
'Author: ' LD*:authorName ' <' LD*:authorEmail '>' EOL

//the date. Parsing out timestamps as well as leaving timestamps as strings to play with author's local time later.
'AuthorDate: ' (TIMESTAMP('EEE MMM d H:mm:ss YYYY Z'):authorTime):auhtorTimeStr EOL

//a fixed string 'Commit: '. Everything between it and ' <' is matched by 'LD*'. We'll name this field commitName. The same goes for commitEmail and commitTime.
'Commit: ' LD*:commitName ' <' LD*:commitEmail '>' EOL
'CommitDate: ' (TIMESTAMP('EEE MMM d H:mm:ss YYYY Z'):commitTime):commitTimeStr EOL
EOL

//finally, up to 500k bytes of data - commitInfo
DATA{0,500000}:commitInfo ((EOL >>('commit ' [a-z0-9]{40}:commit EOL)) | EOF)
It's possible you need to tune the pattern for your own Git logs, e.g. if your timestamps are formatted differently (see the SpectX docs on parsing timestamps).  When done with the pattern, run the first query - parse the log file with the pattern and select the fields you're interested in. Copy the full pattern + query here. 
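
If you'd like to sanity-check the record structure outside SpectX, the same idea can be sketched in a few lines of Python. This is only an illustration, not the SpectX parser: the function name parse_git_log is made up for this example, the field names loosely mirror the pattern above, and it assumes Git's default date format as shown in the sample log.

```python
import re
from datetime import datetime

# Header layout of `git log --pretty=fuller`; the Merge line is optional.
HEADER = re.compile(
    r"^commit (?P<commit>[0-9a-f]{40})\n"
    r"(?:Merge: (?P<merge>[^\n]*)\n)?"
    r"Author:[ \t]+(?P<authorName>[^\n]*?) <(?P<authorEmail>[^>\n]*)>\n"
    r"AuthorDate:[ \t]+(?P<authorTimeStr>[^\n]*)\n"
    r"Commit:[ \t]+(?P<commitName>[^\n]*?) <(?P<commitEmail>[^>\n]*)>\n"
    r"CommitDate:[ \t]+(?P<commitTimeStr>[^\n]*)\n",
    re.MULTILINE,
)

# Assumed date layout, e.g. "Wed Feb 20 14:14:31 2019 -0800".
GIT_DATE = "%a %b %d %H:%M:%S %Y %z"


def parse_git_log(text):
    """Split a fuller-format log into one dict per commit (hypothetical helper)."""
    matches = list(HEADER.finditer(text))
    records = []
    for i, m in enumerate(matches):
        rec = m.groupdict()  # 'merge' is None for non-merge commits
        # Everything up to the next commit header is the message (plus shortstat).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        rec["commitInfo"] = text[m.end():end].strip()
        # Timezone-aware timestamps; the raw strings are kept as well.
        rec["authorTime"] = datetime.strptime(rec["authorTimeStr"], GIT_DATE)
        rec["commitTime"] = datetime.strptime(rec["commitTimeStr"], GIT_DATE)
        records.append(rec)
    return records
```

Calling parse_git_log(open("logfile.log").read()) would give a list of dicts to aggregate. For a log the size of the kernel's, reading the whole file into memory like this is a simplification.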

As a side note, if the pattern didn't match all the bytes, you can take a closer look at the bytes the parser didn't like by selecting the unmatched bytes and keeping only the records where they are not null:
@gitlog
.select(_unmatched, *) //add the _unmatched column to your results
.filter(_unmatched is not NULL) //keep only the records that contain unmatched bytes
When happy with the parser and the initial query, let's run some detailed queries to dig into the essence of the project.

Question 1: Who are the Top 10 Authors of this Repo?

@gitlog
.filter(merge is NULL) //let's look at only non-merge commits
.select(authorName, count(*)) //select the author-field and count the results
.group(authorName) //aggregate authors
.sort(count desc) //sort the results based on count in a descending order
.limit(10) //limit the result to 10 rows
Copy the full pattern + query here. The reason we're filtering out merges is to look at the "true" authors of the code: a merge commit lists the person performing the merge as its author.
It takes a second to parse and query this 700MB log file and get an aggregated query result which gives us 19 767 unique authors. These are the top 10 authors based on their number of commits (excluding merges). Go, Al Viro!
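
For comparison, here is roughly the same aggregation as a small Python sketch. It assumes the commits are already available as dicts with authorName and merge keys (merge being None for non-merge commits); the function name top_authors is ours, for illustration only.

```python
from collections import Counter


def top_authors(records, n=10):
    """Count non-merge commits per author and return the n most frequent."""
    counts = Counter(
        r["authorName"]
        for r in records
        if r.get("merge") is None  # skip merge commits, as in the query above
    )
    return counts.most_common(n)
```

The result is a list of (author, commit count) pairs in descending order of count.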

Question 2: What are the Commit Dynamics?

Has the Top10 always been the same or have the dynamics changed over the years? To see this, let's chop the time into annual intervals and get the commit count for the top 5 authors for each of those periods. Copy the full pattern + query here.

@gitlog
.select(authorName, year(authorTime), *) //select authors and time in annual intervals
.select(authorTime //count the occurrence of top 5 authors (from the previous query)
  ,Viro:count(authorName = 'Al Viro')
  ,Sweeten:count(authorName = 'H Hartley Sweeten')
  ,Chehab:count(authorName = 'Mauro Carvalho Chehab')
  ,Iwai:count(authorName = 'Takashi Iwai')
  ,Hellwig:count(authorName = 'Christoph Hellwig')
)
.group(year) //aggregate time
The result - this is how they've been rolling. The number of commits per author per year.  Quite a sprint there from H Hartley Sweeten back in 2012-2014.
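
The same pivot (one row per year, one counter per selected author) can be sketched in Python as follows. It assumes each record carries an authorName string and an authorTime datetime; commits_per_year is a hypothetical helper name.

```python
from collections import defaultdict


def commits_per_year(records, authors):
    """Return {year: {author: commit count}} for the selected authors."""
    table = defaultdict(lambda: dict.fromkeys(authors, 0))
    for r in records:
        # Touching the row creates it, so every year shows up even with zero hits.
        row = table[r["authorTime"].year]
        if r["authorName"] in row:
            row[r["authorName"]] += 1
    return dict(sorted(table.items()))
```

Passing the top 5 names from the previous question as authors gives a table with the same shape as the query result.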

Question 3: How Many Lines Have The Authors Added/Deleted?

The number of commits doesn't take into account the size of the commit - lines added/deleted. Let's take a look at these numbers to see who is really changing the code. Copy the full pattern + query here.
@gitlog
.select(authorName, insertSum:INT(sum(insertions)), deleteSum:INT(sum(deletions))) //select authors, lines added and deleted cast into integers
.group(authorName) //aggregate unique authors
.sort(insertSum desc) //sort based on number of inserts
// .sort(deleteSum desc) //sort based on number of deletions
.limit(10) //limit the result to 10 rows
These are the results - authors lined up based on the number of lines they've inserted since 2005. Torvalds, unsurprisingly, has contributed most of the lines - more than 6.7 million since 2005 (basically, he already beat everyone with the number of lines in the very first commit of this repo).

When it comes to deleting lines, Greg Kroah-Hartman is the man. Authors lined up based on the number of lines they've deleted since 2005.
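
The insertion and deletion counts come from the summary line that --shortstat appends to each commit, e.g. " 3 files changed, 10 insertions(+), 2 deletions(-)". Git leaves out the part that is zero and uses the singular for a count of one, which is easy to trip over. A minimal Python sketch of extracting and summing these numbers, assuming the stat line is the last non-empty line of each commit's text (stored here under a commitInfo key) and using made-up helper names:

```python
import re
from collections import defaultdict

# Either the insertions or the deletions part may be missing.
SHORTSTAT = re.compile(
    r"(\d+) files? changed"
    r"(?:, (\d+) insertions?\(\+\))?"
    r"(?:, (\d+) deletions?\(-\))?"
)


def parse_shortstat(commit_info):
    """Return (insertions, deletions); (0, 0) if there is no stat line."""
    lines = [ln.strip() for ln in commit_info.splitlines() if ln.strip()]
    m = SHORTSTAT.fullmatch(lines[-1]) if lines else None
    if m is None:
        return 0, 0
    return int(m.group(2) or 0), int(m.group(3) or 0)


def lines_per_author(records):
    """Sum insertions and deletions per author, sorted by insertions."""
    totals = defaultdict(lambda: [0, 0])
    for r in records:
        inserted, deleted = parse_shortstat(r["commitInfo"])
        totals[r["authorName"]][0] += inserted
        totals[r["authorName"]][1] += deleted
    rows = [(author, i, d) for author, (i, d) in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Sorting on row[2] instead gives the deletion ranking.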

Question 4: Which Companies/Organisations are the Top Contributors to the Kernel?

Assuming people committing for work use their employer's email domain to post commits, let's take a look at top domains of the email addresses found in the log. For this, we need some additional parsing for the email to extract the domain (see the red text). Copy the full pattern + query here.
@gitlog

//parse the authorEmail field. Skip everything until the '@' sign and name the rest 'domain'. EOS stands for end of string, EOF for end of file.
.select(PARSE("LD '@' LD:domain EOS", authorEmail),*)
.filter(domain not like '%gmail%') //skip gmail addresses
.select(domain, count:count(*)) //select domain and count everything
.group(@1) //group everything based on the first field (i.e. domain)
.sort(count DESC) //sort count in descending order
.limit(10) //limit the result to 10 rows
The result - Intel and Red Hat are working hard. Especially Intel, if you take a closer look and add up the two Intel domains that made it to this chart: intel.com and linux.intel.com.
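
The domain extraction itself is a simple string split. As a rough Python equivalent (the helper name top_domains and its skip parameter are our own), assuming a plain list of author email addresses:

```python
from collections import Counter


def top_domains(emails, n=10, skip=("gmail",)):
    """Count email domains, ignoring any domain that contains a skip word."""
    counts = Counter()
    for email in emails:
        _, at, domain = email.rpartition("@")
        if not at:
            continue  # no '@' at all, so nothing to count
        if any(word in domain for word in skip):
            continue  # same effect as: domain not like '%gmail%'
        counts[domain] += 1
    return counts.most_common(n)
```

Note that this treats intel.com and linux.intel.com as separate domains, just like the query does. Merging subdomains would need an extra normalisation step.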

Question 5: What are the Most Popular Commit Messages?

Aggregating commit messages is maybe not the most insightful thing to do. It's rather a "because we can" type of query, but it serves well as an example of looking for and playing around with specific keywords in logs. The main value of SpectX when working with commit messages is the ability to scroll around endlessly and discover surprising details, e.g. that batman on row 5. Turns out it's the real thing, not just a goofy commit message: "Batman advanced is a new approach to wireless networking which does no longer operate on the IP basis."

Copy the full pattern + query here.
@gitlog
.select(commitInfo, count(*) as cnt)
.group(commitInfo)
.sort(cnt desc)
The result arrives in 3 seconds. Top 10 of identical commit messages to the Linux kernel repo.

Question 6: What Happens on Fridays?

Don’t deploy on a Friday, they say. Don’t commit on a Friday, they say. But what would Linus do? A hard-working legend with more than 26 000 commits for the Linux kernel (if you include merges). As we need to parse out the author's local timestamp to calculate the weekday for that particular day, the answer to this question calls for parsing the timestamp-to-string field and converting it into actionable fields (see the red text). Copy the full pattern + query here. 
@gitlog
 .select(author_time:parse("LD:day_of_week ' ' LD:month ' ' INT:day ' ' INT:hour ':' INT:minute ':' INT:second ' ' INT:year ' ' LD:timezone EOF", auhtorTimeStr), *)
.filter(authorName like 'Linus Torvalds')
.select(author_time[day_of_week], count(*) as count)
.group(day_of_week)
.sort(count desc);
The result - Torvalds' commit count split into weekdays. Conclusion: if it's Friday and you really need an excuse to commit, remember that Linus has done most of his commits for the Linux kernel on a Friday since 2005.
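
The important detail here is that the weekday is taken in the author's local time, not in UTC: a commit made late on a Friday evening in California is already a Saturday in UTC. A small Python sketch of the same weekday count, assuming date strings in Git's default format as seen in the sample log (commits_by_weekday is a made-up name):

```python
from collections import Counter
from datetime import datetime

# Assumed date layout, e.g. "Wed Feb 20 14:14:31 2019 -0800".
GIT_DATE = "%a %b %d %H:%M:%S %Y %z"


def commits_by_weekday(date_strings):
    """Count commits per weekday in the author's own timezone."""
    counts = Counter()
    for s in date_strings:
        # The parsed datetime keeps the original UTC offset, so the weekday
        # below is the author's local one rather than the UTC one.
        local = datetime.strptime(s, GIT_DATE)
        counts[local.strftime("%A")] += 1
    return counts.most_common()
```

Feeding it only the date strings of non-merge commits by one author gives that author's weekday profile.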

When excluding merges, the picture changes but the result is still surprising. Sundays beat every other weekday as commit day for Torvalds:

Conclusion

Playing around with Git logs is fun. We could go on forever - see the dynamics of added lines in time, zoom into the activities of a particular author or commit messages containing xyz. SpectX allows you to get from an idea to result in seconds, so asking any question important in the context of a particular repo is a piece of cake. Do try this at home - installing SpectX to your desktop and pointing it to your data source to parse and analyse any unstructured data is easy and only takes a couple of minutes. See the docs for instructions.
