Git LFS

Git has become the de facto version control system for software developers around the world. This open-source, distributed version control system is fast, and it makes branching and merging code easy. However, it performs poorly with large binary files. Git Large File Storage (LFS) was developed to address this issue.

The Large File Problem in Git

Traditionally, certain companies and institutions have stayed away from Git because it handles large binary files inefficiently. Video game developers and media companies have to deal with complex textures, full-motion videos, and high-quality audio files. Research institutes have to keep track of datasets that can be gigabytes or terabytes in size. Git has difficulty maintaining these large files.

To understand the problem, we need to take a look at how Git keeps track of files. Whenever we commit, Git creates a commit object that points to its parent commit, or to several parents in the case of a merge. The resulting data model is a directed acyclic graph (DAG): following parent pointers can never lead back to the commit we started from.

We can inspect the inner workings of the DAG model. Here is an example of three commits in a repository:

$ git log --oneline
2beb263 Commit C: added image1.jpeg
866178e Commit B: add b.txt
d48dd8b Commit A: add a.txt

In Commits A and B, we added the text files a.txt and b.txt. Then in Commit C, we added an image file called image1.jpeg. We can visualize the DAG as follows:

Commit C     Commit B     Commit A
2beb263 -->  866178e -->  d48dd8b
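The parent pointers in the diagram above can be modeled in a few lines of Python. This is only a toy sketch: the dictionary layout and the `walk_history` helper are assumptions made for illustration, not Git's actual storage format or API. The commit IDs and messages are the ones from the log output.

```python
# Toy model of the three commits above: each commit records its parent IDs.
# The dictionary layout is an illustrative assumption, not Git's on-disk format.
commits = {
    "d48dd8b": {"message": "Commit A: add a.txt", "parents": []},
    "866178e": {"message": "Commit B: add b.txt", "parents": ["d48dd8b"]},
    "2beb263": {"message": "Commit C: added image1.jpeg", "parents": ["866178e"]},
}


def walk_history(commits, start):
    """Return the commit IDs reachable from start by following parent pointers."""
    seen = set()
    order = []
    stack = [start]
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue  # with merges, a commit can be reachable along two paths
        seen.add(commit_id)
        order.append(commit_id)
        stack.extend(commits[commit_id]["parents"])
    return order


history = walk_history(commits, "2beb263")
for commit_id in history:
    print(commit_id, commits[commit_id]["message"])
```

Because every pointer leads from a commit to an older one, the walk always terminates at the root commit, which has no parents.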

We can inspect the last commit with the following command:

$ git cat-file -p 2beb263
tree 7cc17ba5b041fb227b9ab5534d81bd836183a4e3
parent 866178e37df64d9f19fa77c00d5ba9d3d4fc68f5
author Zak H 1513259427 -0800
committer Zak H 1513259427 -0800
Commit C: added image1.jpeg

We can see that Commit C (2beb263) has Commit B (866178e) as the parent. Now if we inspect the tree object of Commit C (7cc17ba), we can see the blobs (binary large objects):

$ git cat-file -p 7cc17ba
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    a.txt
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    b.txt
100644 blob a44a66f9e06a8faf324d3ff3e11c9fa6966bfb56    image1.jpeg
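Notice that a.txt and b.txt share the same blob ID. Git names a blob after a hash of its content, so two files with identical content map to a single object. The sketch below reproduces that ID, assuming the repository uses the default SHA-1 object format and that both text files are empty (which the ID e69de29... implies). The `git_blob_id` helper is a name chosen here for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 ID Git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_id(b"")
print(empty_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Any change to the content, even a single byte, produces a different ID and therefore a new blob.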

We can check the size of the image blob:

$ git cat-file -s a44a66f9e
871680

Git keeps track of changes through this tree structure. Let's modify image1.jpeg and check the history:

$ git log --oneline
2e257db Commit D: modified image1.jpeg
2beb263 Commit C: added image1.jpeg
866178e Commit B: add b.txt
d48dd8b Commit A: add a.txt

Let's check the Commit D object (2e257db):

$ git cat-file -p 2e257db
tree 2405fad67610acf0f57b87af36f535c1f4f9ed0d
parent 2beb263523725e1e8f9d96083140a4a5cd30b651
author Zak H 1513272250 -0800
committer Zak H 1513272250 -0800
Commit D: modified image1.jpeg

And here is the tree (2405fad) it points to:

$ git cat-file -p 2405fad
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    a.txt
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    b.txt
100644 blob cb4a0b67280a92412a81c60df36a15150e713095    image1.jpeg

Notice that the SHA-1 hash for image1.jpeg has changed. This means Git has created a new blob for image1.jpeg. We can check the size of the new blob:

$ git cat-file -s cb4a0b6
1063696

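The DAG now has four commits. Commits C and D each point to their own tree, and each tree points to a separate full-size blob for image1.jpeg:

Commit D   -->  Commit C   -->  Commit B  -->  Commit A
2e257db         2beb263         866178e        d48dd8b
   |               |
tree 2405fad    tree 7cc17ba
   |               |
blob cb4a0b6    blob a44a66f
1063696 bytes   871680 bytes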

Each commit object points to its own tree, and the tree lists the blobs. Git saves space by compressing objects and, when it packs a repository, by storing similar objects as differences (deltas) from one another. This works well for text files, but binary files rarely benefit: image, video, and audio files are already compressed, so even a small edit tends to change most of the bytes. As a result, each modification of a binary file typically adds another large blob to the repository.

Let's think of an example where we make multiple changes to a 100 MB image file.

Commit C  -->  Commit B  -->  Commit A
   |              |              |
 Tree3          Tree2          Tree1
   |              |              |
 Blob3          Blob2          Blob1
(100 MB)       (100 MB)       (100 MB)

Repository size after each commit:
 300 MB         200 MB         100 MB

Every time we change the file, Git has to create a new 100 MB blob. After only three commits, the repository holds 300 MB for a single file, so its size can blow up quickly. Because Git is a distributed version control system, you download the whole repository, history included, to your local machine, and you work with branches a lot. The large blobs therefore become a performance bottleneck.
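The growth described above can be sketched with a toy content-addressed store. The example is scaled down (100,000 bytes stand in for 100 MB) and deliberately ignores compression and deltas, on the assumption that they help little for already-compressed media. The `store_sizes` helper and the stand-in contents are hypothetical.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Content-derived ID, computed the way Git names blobs (SHA-1 format)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def store_sizes(versions):
    """Add each version to the store; return the total size after each commit."""
    store = {}
    sizes = []
    for content in versions:
        store[blob_id(content)] = content  # identical content is stored only once
        sizes.append(sum(len(blob) for blob in store.values()))
    return sizes


# Three different versions of the "image"; 100,000 bytes stand in for 100 MB.
versions = [bytes([n]) * 100_000 for n in (1, 2, 3)]
sizes = store_sizes(versions)
print(sizes)  # [100000, 200000, 300000]

# Committing an unchanged file again adds nothing, because the blob already exists.
sizes_with_repeat = store_sizes(versions + [versions[2]])
print(sizes_with_repeat[-1])  # 300000
```

The store grows by the full file size on every modification, but not when the content is unchanged.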

Git LFS solves the problem by replacing the blobs with lightweight pointer files (PF) and providing a mechanism to store the blobs elsewhere.

Locally, Git LFS keeps the blobs in its own cache; remotely, it stores them in the Git LFS store of a host such as GitHub or Bitbucket.

PF1 --> Blob1
PF2 --> Blob2
PF3 --> Blob3

Now, when you work with the Git repository, the lightweight pointer files are used for routine operations, and the blobs are retrieved only when necessary. For example, if you check out Commit C, Git LFS looks up the PF3 pointer and downloads Blob3. The working repository stays leaner and performs better. You don't have to worry about the pointer files; Git LFS manages them behind the scenes.
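To give a feel for how small a pointer file is, here is a minimal sketch that builds one in the text format defined by the Git LFS pointer specification (a version line, a SHA-256 object ID, and the size in bytes). The `lfs_pointer` helper and the stand-in image bytes are assumptions for illustration; in practice Git LFS generates these files itself.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer file for the given content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# Stand-in bytes with the same size as the first image1.jpeg blob (871680 bytes).
# This is not a real JPEG.
image = b"\xff\xd8" + b"\x00" * 871678
pointer = lfs_pointer(image)
print(pointer)
print(len(pointer), "bytes in the pointer vs", len(image), "bytes in the file")
```

The repository stores only these few lines per version, while the object ID lets Git LFS find the real content in its cache or on the server.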

Installing and Running Git LFS

There have been previous attempts to solve the Git large file problem, but Git LFS has succeeded because it is easy to use. You just have to install LFS and tell it which files to track.

On Ubuntu or Debian, you can install Git LFS using the following commands:

$ sudo apt-get install software-properties-common
$ curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
$ sudo apt-get install git-lfs
$ git lfs install

Once you have installed Git LFS, you can track the files you want:

$ git lfs track "*.jpeg"
Tracking "*.jpeg"

The output shows that Git LFS is tracking the JPEG files. When you start tracking with LFS, you will find a .gitattributes file with an entry for each tracked pattern. The .gitattributes file uses pattern notation similar to that of a .gitignore file. Here is how the content of .gitattributes looks:

$ cat .gitattributes
*.jpeg filter=lfs diff=lfs merge=lfs -text
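The entry above pairs a pattern with attributes, and `filter=lfs` is the attribute that routes matching files through Git LFS. The sketch below shows roughly how such an entry selects files. It is an approximation under stated assumptions: Python's `fnmatch` only mimics Git's pattern rules for simple cases such as `*.jpeg`, quoted patterns containing spaces are not handled, and both helper names are hypothetical.

```python
from fnmatch import fnmatch


def lfs_patterns(gitattributes_text: str):
    """Return the patterns whose attributes include filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attributes = line.split()
        if "filter=lfs" in attributes:
            patterns.append(pattern)
    return patterns


def is_lfs_tracked(filename: str, patterns) -> bool:
    # fnmatch approximates Git's matching; adequate for basename globs.
    return any(fnmatch(filename, pattern) for pattern in patterns)


text = "*.jpeg filter=lfs diff=lfs merge=lfs -text\n"
patterns = lfs_patterns(text)
print(patterns)                                  # ['*.jpeg']
print(is_lfs_tracked("image1.jpeg", patterns))   # True
print(is_lfs_tracked("a.txt", patterns))         # False
```

With this entry in place, image1.jpeg would be handled by Git LFS, while a.txt would remain an ordinary Git blob.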

You can also list the tracked patterns using the following command:

$ git lfs track
Listing tracked patterns
*.jpeg (.gitattributes)

If you want to stop tracking a pattern, you can use the following command:

$ git lfs untrack "*.jpeg"
Untracking "*.jpeg"

For general Git operations, you don't have to worry about LFS. It will take care of all the backend tasks automatically. Once you have set up Git LFS, you can work on the repository like any other project.

Further Study

For more advanced topics, look into the following resources:

Moving Git LFS repository between hosts
Deleting Local Git LFS files
Removing remote Git LFS files from the server
Git LFS Website
Git LFS Documentation
