Getting started with git-annex

Author Lee Hinman (leehinman@fastmail.com)
Date 2016-08-10 09:50:50

Introduction

In this article, I am hoping to show how to set up distributed file backups using git-annex. When I started using git-annex I found that while the documentation is plentiful and quite helpful, it's hard to know what exactly to look for unless you are already familiar with git-annex. I also had a hard time wrapping my head around some of the concepts associated with it. I'm hoping this will be useful to anyone interested in using git-annex.

Git-annex is a tool that adds an abstraction on top of the git distributed version control system (VCS). It allows you to use git to manage file locations and directory history without actually committing large files into the git repository itself (because git does not handle large binary files particularly well).

Why git-annex?

So why git-annex? Why not manage this myself using rsync, or cp, or <insert pay service>? Well, there are really three main reasons: backups, location tracking, and protection.

Backups

Backups. Ugh. It seems like almost everyone with any non-trivial amount of data has a different way of handling this. When I started looking at backups, I was really distressed by some of the solutions. For instance, things like Apple's "Time Machine" only work with HFS+ which makes it totally useless for multiple operating systems. I also wanted something where accessing information was possible without having to install a particular piece of software, and was designed in the manner of "maybe in the future, not EVERYONE will be using X". I think git-annex has a great design page on this called "future proofing".

Location tracking

This also kind of ties into the backups. I have 3-4 large external hard drives onto which I end up backing up most of my important data, and I can never remember which drive contains which content. git-annex has a nice way of tracking this through the "whereis" command.

Protection

Data is only as good as its resiliency, and not being able to tell if data is corrupted is almost as bad as having corrupted data. git-annex can perform "fsck" on repository contents to detect corruption. It can also be configured to ensure that you never remove the last copy of the data, or that you always keep a certain number of copies of data. Each repository "remote" can be configured with a variable amount of trust as well.

Okay, enough about the why of git-annex, let's talk about the how.

Setting up your first git-annex repository

In this example I'm going to create a new repository of files, in this case, some images I have lying around. Any files will work though!

Initializing and adding the first files

A git-annex repository starts out as a regular git repository, meaning a directory that has had git init run in it, so let's do exactly that:

hinmanm@Xanadu 0 ~()% mkdir ~/myrepo

hinmanm@Xanadu 0 ~()% cd myrepo

hinmanm@Xanadu 0 ~/myrepo()% git init
Initialized empty Git repository in /Users/hinmanm/myrepo/.git/

hinmanm@Xanadu 0 ~/myrepo(master)%

Great, that was easy.

In addition, we need to initialize this as a git-annex repository and give the repository a name. Since I'm running this on my laptop, I'm going to initialize git-annex with "laptop" as the name of this repository.

hinmanm@Xanadu 0 ~/myrepo(master)% git annex init "laptop"
init laptop ok
(Recording state in git...)

hinmanm@Xanadu 0 ~/myrepo(master)%

Awesome, everything is set up. Now all we need to do is add some files. I have a great video of Grace Hopper explaining a nanosecond that I'd like to back up, so I'll copy it into this folder and add it to the repository using git annex add:

hinmanm@Xanadu 0 ~/myrepo(master)% cp -v ~/Downloads/Grace-Hopper-Nanoseconds.mp4 .
/Users/hinmanm/Downloads/Grace-Hopper-Nanoseconds.mp4 -> ./Grace-Hopper-Nanoseconds.mp4

hinmanm@Xanadu 0 ~/myrepo(master*)% git annex add Grace-Hopper-Nanoseconds.mp4
add Grace-Hopper-Nanoseconds.mp4 ok
(Recording state in git...)

hinmanm@Xanadu 0 ~/myrepo(master*)%

Don't forget to commit the file!

hinmanm@Xanadu 0 ~/myrepo(master*)% git commit -m "Added video of Grace Hopper explaining the nanosecond"
[master (root-commit) ea6e382] Added video of Grace Hopper explaining the nanosecond
 1 file changed, 1 insertion(+)
 create mode 120000 Grace-Hopper-Nanoseconds.mp4

hinmanm@Xanadu 0 ~/myrepo(master)%

Understanding what git-annex does with files

It turns out that when we add files, git-annex does not actually add their content to the git repo. You can see this using ls -lh, which shows that "Grace-Hopper-Nanoseconds.mp4" is now a symlink:

hinmanm@Xanadu 0 ~/myrepo(master)% ls -lh
total 8
lrwxr-xr-x 1 hinmanm staff 198B Dec 20 12:04 Grace-Hopper-Nanoseconds.mp4@ ->
.git/annex/objects/5X/Gv/SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4/SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4

hinmanm@Xanadu 0 ~/myrepo(master)%

git-annex computes the checksum of the file and uses that for the actual file name inside of the .git/annex/objects folder. You can still access the original file by opening the symlink.
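
To make the idea concrete, here is a rough sketch in plain shell of the same trick: hash a file, move it under a name built from its size and SHA-256 checksum, and leave a symlink behind. It only imitates the SHA256E-s<size>--<checksum>.<ext> naming visible in the listing above. The objects/ directory, the file name, and the file content are made up for the illustration, and git-annex's real layout has extra hash directories (the 5X/Gv part) that are not reproduced here.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# A small stand-in for a large binary file (hypothetical name and content).
printf 'hello annex\n' > video.mp4

# Size in bytes; tr strips the padding some wc implementations add.
size=$(wc -c < video.mp4 | tr -d ' ')

# SHA-256 checksum; fall back to shasum where sha256sum is missing (macOS).
if command -v sha256sum >/dev/null 2>&1; then
    sum=$(sha256sum video.mp4 | cut -d ' ' -f 1)
else
    sum=$(shasum -a 256 video.mp4 | cut -d ' ' -f 1)
fi

# Build a key shaped like the ones in the listing above.
key="SHA256E-s${size}--${sum}.mp4"

# Move the content under its key and leave a symlink in its place.
mkdir -p "objects/$key"
mv video.mp4 "objects/$key/$key"
ln -s "objects/$key/$key" video.mp4

# The original name still works; it just points into the object store.
ls -l video.mp4
cat video.mp4
```

Because the name is derived from the content, two identical files would end up with the same key, which is one reason a checksum makes a convenient storage name.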

So far, we haven't really done much other than adding a file, so let's use git-annex to maintain a second location for this repository.

Adding a second location for data (USB drive)

I'm going to use an external USB drive as the second location, because that's usually where I perform backups. However, git-annex supports anything that could be a git repository, such as GitHub or another computer, and it also supports a lot of special remotes, like backing up to S3.

I am using a USB drive mounted at /Volumes/MINIDRIVE1/, so adjust the path as needed to wherever your device is mounted. You can also use another folder if you don't have a USB drive handy.

First, let's make a folder on the USB drive and initialize it as both a regular git repo and a git-annex repository, but this time, named "usbdrive"

hinmanm@Xanadu 0 ~/myrepo(master)% mkdir /Volumes/MINIDRIVE1/myrepo

hinmanm@Xanadu 0 ~/myrepo(master)% cd /Volumes/MINIDRIVE1/myrepo

hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo()% git init
Initialized empty Git repository in /Volumes/MINIDRIVE1/myrepo/.git/

hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% git annex init "usbdrive"
init usbdrive 
  Detected a filesystem without fifo support.

  Disabling ssh connection caching.
ok
(Recording state in git...)

hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)%

Next we need to let each of the two locations ~/myrepo and /Volumes/MINIDRIVE1/myrepo know about each other. This is accomplished by adding each location as a remote, using the git remote command.

First we'll let the original repo know about the USB drive

hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% cd ~/myrepo
hinmanm@Xanadu 0 ~/myrepo(master)% git remote add usbdrive /Volumes/MINIDRIVE1/myrepo
hinmanm@Xanadu 0 ~/myrepo(master)%

And then let the usb drive know about the laptop

hinmanm@Xanadu 0 ~/myrepo(master)% cd /Volumes/MINIDRIVE1/myrepo/
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% git remote add laptop ~/myrepo
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% cd ~/myrepo
hinmanm@Xanadu 0 ~/myrepo(master)%

Now that each repository knows about the other, you can use the git annex sync command to synchronize the git state between them. If we run it in each folder, we can see that the folder on the USB drive has been updated with the name of the file in the repository:

hinmanm@Xanadu 0 ~/myrepo(master)% git annex sync
commit  ok
pull usbdrive 
warning: no common commits
remote: Counting objects: 5, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 5 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (5/5), done.
From /Volumes/MINIDRIVE1/myrepo
 * [new branch]      git-annex  -> usbdrive/git-annex
ok
(merging usbdrive/git-annex into git-annex...)
(Recording state in git...)
push usbdrive 
Counting objects: 17, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (13/13), done.
Writing objects: 100% (17/17), 1.60 KiB | 0 bytes/s, done.
Total 17 (delta 1), reused 0 (delta 0)
To /Volumes/MINIDRIVE1/myrepo
 * [new branch]      git-annex -> synced/git-annex
 * [new branch]      master -> synced/master
ok

hinmanm@Xanadu 0 ~/myrepo(master)% cd /Volumes/MINIDRIVE1/myrepo

hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% git annex sync
(merging synced/git-annex into git-annex...)
commit  ok
pull laptop 
From /Users/hinmanm/myrepo
 * [new branch]      git-annex  -> laptop/git-annex
 * [new branch]      master     -> laptop/master
 * [new branch]      synced/master -> laptop/synced/master


Already up-to-date.
ok
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% ls -lh
total 64
lrwxrwxrwx 1 hinmanm staff 198B Dec 20 13:09 Grace-Hopper-Nanoseconds.mp4@ ->
.git/annex/objects/5X/Gv/SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4/SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4

hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)%

However, the content hasn't actually been copied over, only the metadata (the symlink). You can verify this with the file command:

hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% file Grace-Hopper-Nanoseconds.mp4
Grace-Hopper-Nanoseconds.mp4: broken symbolic link to
.git/annex/objects/5X/Gv/SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4/SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4
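
The same check is easy to script. The sketch below is plain shell with no git-annex calls: content_present is a hypothetical helper name, and the temporary files merely imitate one symlink whose target exists and one whose target does not.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: succeeds if the path is a symlink whose target
# exists, fails if the symlink is dangling (content not present here).
content_present() {
    [ -L "$1" ] && [ -e "$1" ]
}

# Demo in a throwaway directory with one good and one broken symlink.
cd "$(mktemp -d)"
printf 'data\n' > real-content
ln -s real-content present.mp4
ln -s no-such-object missing.mp4

for f in present.mp4 missing.mp4; do
    if content_present "$f"; then
        echo "$f: content is here"
    else
        echo "$f: only the symlink is here"
    fi
done
```

The test relies on -e following the symlink, so it is false for a dangling link even though -L is true.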

Copying data between locations

Okay, this would be pretty crummy if all it did was keep track of metadata, but git-annex can also shuffle data between different locations, with a variety of methods.

The easiest way to ensure the data is in all places is to use the sync command with the --content flag. If we run this in either repository, the content will be copied over:

hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% git annex sync --content
commit  ok
pull laptop 
ok
get Grace-Hopper-Nanoseconds.mp4 (from laptop...) 
SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4
     5769029 100%   60.12MB/s    0:00:00 (xfer#1, to-check=0/1)

sent 5769900 bytes  received 42 bytes  3846628.00 bytes/sec
total size is 5769029  speedup is 1.00
ok
pull laptop 
ok
(Recording state in git...)
push laptop 
Counting objects: 5, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (5/5), 454 bytes | 0 bytes/s, done.
Total 5 (delta 1), reused 0 (delta 0)
To /Users/hinmanm/myrepo
 * [new branch]      git-annex -> synced/git-annex
ok

hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% file Grace-Hopper-Nanoseconds.mp4
Grace-Hopper-Nanoseconds.mp4: ISO Media, MPEG v4 system, version 2

hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)%

If you don't want the blanket copy that git annex sync --content does, you can move data in a more fine-grained way by using the move or copy commands, for instance:

hinmanm@Xanadu 0 ~/myrepo(master)% git annex move . --to usbdrive
(merging synced/git-annex into git-annex...)
move Grace-Hopper-Nanoseconds.mp4 ok
(Recording state in git...)

Here the move is done by copying the file, but git-annex can also transfer files in other ways, depending on the type of remote. Describing them all is out of scope for this article.

Finding out where your files are

At any time, you can see where the data is using the whereis command. I'm using "." here to signify the whole directory, but you can also use "*", or "*.mp4" and all the other things you would expect when using a shell.

hinmanm@Xanadu 0 ~/myrepo(master)% git annex whereis .
whereis Grace-Hopper-Nanoseconds.mp4 (1 copy) 
    d6a2baed-8785-4f2a-8b7c-c99190b67647 -- [usbdrive]
ok

hinmanm@Xanadu 0 ~/myrepo(master)% git annex copy . --from usbdrive
copy Grace-Hopper-Nanoseconds.mp4 (from usbdrive...) 
SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4
     5769029 100%    2.40MB/s    0:00:02 (xfer#1, to-check=0/1)

sent 5769900 bytes  received 42 bytes  2307976.80 bytes/sec
total size is 5769029  speedup is 1.00
ok
(Recording state in git...)

hinmanm@Xanadu 0 ~/myrepo(master)% git annex whereis .
whereis Grace-Hopper-Nanoseconds.mp4 (2 copies) 
    69f0a35e-91c7-43c5-9ce0-89af0056b523 -- laptop [here]
    d6a2baed-8785-4f2a-8b7c-c99190b67647 -- [usbdrive]
ok

Notice that at first the only copy of the data was on "usbdrive". I then used the copy command to copy it back from the USB drive, after which git-annex shows two copies: "here", which is the laptop, and the "usbdrive" remote.
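
If you want to act on this information in a script, the text output is regular enough to parse. The following is a rough sketch that reads whereis output in the format shown above and prints each file with its copy count. The sample text is pasted from this article rather than produced by a live command, and copy_counts is a name made up for the example. The whereis command also accepts --json, which is likely a sturdier choice for real scripts.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: read `git annex whereis` text output on stdin and
# print "<copies> <file>" for every "whereis ..." header line.
copy_counts() {
    sed -n -E 's/^whereis (.*) \(([0-9]+) cop(y|ies)\) *$/\2 \1/p'
}

# Sample output copied from the article (not generated here).
sample='whereis Grace-Hopper-Nanoseconds.mp4 (2 copies)
    69f0a35e-91c7-43c5-9ce0-89af0056b523 -- laptop [here]
    d6a2baed-8785-4f2a-8b7c-c99190b67647 -- [usbdrive]
ok'

printf '%s\n' "$sample" | copy_counts

# In a real repository the pipeline would be:
#   git annex whereis . | copy_counts
```

Piping the result through something like sort -n would put the files with the fewest copies at the top, which is a quick way to spot what still needs backing up.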

Removing unneeded data

When you don't want to store the file locally, use the drop command, which will remove the file locally, but only after ensuring that at least 1 copy of the data is still available somewhere (this number is configurable too).

hinmanm@Xanadu 0 ~/myrepo(master)% git annex drop .
drop Grace-Hopper-Nanoseconds.mp4 ok
(Recording state in git...)

hinmanm@Xanadu 0 ~/myrepo(master)% cd /Volumes/MINIDRIVE1/myrepo

hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% git annex drop .
drop Grace-Hopper-Nanoseconds.mp4 (unsafe) 
  Could only verify the existence of 0 out of 1 necessary copies

  Rather than dropping this file, try using: git annex move

  (Use --force to override this check, or adjust numcopies.)
failed
git-annex: drop: 1 failed

hinmanm@Xanadu 1 /Volumes/MINIDRIVE1/myrepo(master)%

Keep in mind, however, that a repository needs to be available in order for git-annex to verify that the data still resides there. You won't be able to drop the file from the laptop if the USB drive is disconnected, because git-annex can't tell whether the file is still actually on the disk.
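
The number of copies that must be verified before a drop is allowed is set with git annex numcopies. Here is a minimal sketch, assuming git-annex is installed (the script exits early otherwise) and using a throwaway repository named "scratch" that is not part of the walkthrough:

```shell
#!/bin/sh
set -eu

# Skip quietly when git-annex is not installed.
if ! command -v git-annex >/dev/null 2>&1; then
    echo "git-annex not found; nothing to demonstrate"
    exit 0
fi

# Throwaway repository (hypothetical name "scratch").
cd "$(mktemp -d)"
git init -q
git annex init "scratch" >/dev/null

# Require two verified copies elsewhere before a drop is allowed.
git annex numcopies 2

# Show the current setting.
git annex numcopies
```

With this setting, the drop on the laptop shown earlier would be refused as well, since only one other copy (the USB drive) could be verified.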

Even if a file's content is removed, any folders as well as the symlink will remain (it will be a broken symlink) so you know which files are part of the repository. Additionally, you can create branches and revert commits just like you would expect, because all of the metadata is stored in the regular git repository.

This is where I really get a lot of value out of git-annex. I have about 40 gigabytes of photos that will definitely not fit on my laptop's SSD, so I can keep the folder hierarchy for every photo I've ever taken without having to keep track of which of my 3 external USB drives the actual photos exist on.

Adding a third location (a remote computer using SSH)

This time, I'll add another repository on a server machine that I have, so that I can keep data offsite (or perhaps on a home NAS). I have a machine called "corinth", so the first thing I will do is ssh in and initialize a place for the data

hinmanm@Xanadu 0 ~()% ssh corinth
Last login: Sat Dec 20 13:06:52 2014 from dhcp-077-251-118-122.chello.nl
+++Reading .zshenv
+++Reading .zshrc (for interactive use).
+++Loaded files in 0.5055 seconds

hinmanm@corinth 0 ~()% mkdir myrepo

hinmanm@corinth 0 ~()% cd myrepo

hinmanm@corinth 0 ~/myrepo()% git init
Initialized empty Git repository in /home/hinmanm/myrepo/.git/

hinmanm@corinth 0 ~/myrepo(master)% git annex init "corinth"
init corinth ok
(Recording state in git...)

And then back on my local machine add the "corinth" repo and synchronize

hinmanm@Xanadu 0 ~/myrepo(master)% git remote add corinth ssh://corinth/home/hinmanm/myrepo

hinmanm@Xanadu 0 ~/myrepo(master)% git annex sync
commit  ok
pull usbdrive 
From /Volumes/MINIDRIVE1/myrepo
   0abc238..9a23674  git-annex  -> usbdrive/git-annex
 * [new branch]      master     -> usbdrive/master
ok
pull corinth 
warning: no common commits
remote: Counting objects: 5, done.
remote: Compressing objects: 100% (3/3), done.
Unpacking objects: 100% (5/5), done.
remote: Total 5 (delta 0), reused 0 (delta 0)
From ssh://corinth/home/hinmanm/myrepo
 * [new branch]      git-annex  -> corinth/git-annex
ok
(merging corinth/git-annex into git-annex...)
(Recording state in git...)
push usbdrive 
Counting objects: 23, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (18/18), done.
Writing objects: 100% (23/23), 1.96 KiB | 0 bytes/s, done.
Total 23 (delta 6), reused 0 (delta 0)
To /Volumes/MINIDRIVE1/myrepo
   171c1fe..2ea2746  git-annex -> synced/git-annex
ok
push corinth 
Counting objects: 44, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (35/35), done.
Writing objects: 100% (44/44), 3.66 KiB | 0 bytes/s, done.
Total 44 (delta 10), reused 0 (delta 0)
To ssh://corinth/home/hinmanm/myrepo
 * [new branch]      git-annex -> synced/git-annex
 * [new branch]      master -> synced/master
ok
hinmanm@Xanadu 0 ~/myrepo(master)%

And now you can move/copy data just like you would expect

hinmanm@Xanadu 0 ~/myrepo(master)% git annex copy . --from usbdrive
copy Grace-Hopper-Nanoseconds.mp4 (from usbdrive...) 
SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4
     5769029 100%   13.06MB/s    0:00:00 (xfer#1, to-check=0/1)

sent 5769900 bytes  received 42 bytes  3846628.00 bytes/sec
total size is 5769029  speedup is 1.00
ok
(Recording state in git...)

hinmanm@Xanadu 0 ~/myrepo(master)% git annex copy . --to corinth
copy Grace-Hopper-Nanoseconds.mp4 (checking corinth...) (to corinth...) 
SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4
     5769029 100%  624.58kB/s    0:00:08 (xfer#1, to-check=0/1)

sent 5769900 bytes  received 42 bytes  501734.09 bytes/sec
total size is 5769029  speedup is 1.00
ok
(Recording state in git...)

And data is now stored in 3 places, one of them offsite

hinmanm@Xanadu 0 ~/myrepo(master)% git annex whereis .
whereis Grace-Hopper-Nanoseconds.mp4 (3 copies) 
    197ead75-fa13-4277-ac38-576e5279693d -- [corinth]
    69f0a35e-91c7-43c5-9ce0-89af0056b523 -- laptop [here]
    d6a2baed-8785-4f2a-8b7c-c99190b67647 -- [usbdrive]
ok

This is just scratching the surface. git-annex is a very flexible and full-featured application with a lot of documentation, so check it out if this article was interesting!

Other things to check out

  • The main git-annex webpage has lots of good information
  • The walkthrough on the git-annex page - a good introduction to git-annex, similar to the one here, that goes through many more features. It's definitely worth a look if you found this article interesting.
  • The full-blown git-annex assistant - you can do all of this with an easier web-based application that git-annex ships with called git-annex assistant, if you so desire.
  • Add files from URLs using git-annex's "addurl"
  • Keep track of podcasts using "importfeed"
  • Metadata driven views, where files can be tagged and the directory can be changed to navigate files based on those tags
  • Git-annex's list of special remotes
