Getting started with git-annex
Author: Lee Hinman (leehinman@fastmail.com)
Date: 2016-08-10 09:50:50
Introduction
In this article, I am hoping to show how to set up distributed file backups using git-annex. When I started using git-annex I found that while the documentation is plentiful and quite helpful, it's hard to know what exactly to look for unless you are already familiar with git-annex. I also had a hard time wrapping my head around some of the concepts associated with it. I'm hoping this will be useful to anyone interested in using git-annex.
Git-annex is a tool that adds an abstraction on top of the git distributed version control system (VCS). It allows you to use git to manage file locations and directory history without actually committing large files into the git repository itself (because git does not handle large binary files particularly well).
Why git-annex?
So why git-annex? Why not manage this myself using rsync, or cp, or <insert pay service>? Well, there are really 3 main reasons: backups, location tracking, and protection.
Backups
Backups. Ugh. It seems like almost everyone with any non-trivial amount of data has a different way of handling this. When I started looking at backups, I was really distressed by some of the solutions. For instance, things like Apple's "Time Machine" only work with HFS+ which makes it totally useless for multiple operating systems. I also wanted something where accessing information was possible without having to install a particular piece of software, and was designed in the manner of "maybe in the future, not EVERYONE will be using X". I think git-annex has a great design page on this called "future proofing".
Location tracking
This also kind of ties into the backups. I have 3-4 large external hard drives onto which I end up backing up most of my important data, and I can never remember which drive contains which content. git-annex has a nice way of tracking this through the "whereis" command.
Protection
Data is only as good as its resiliency, and not being able to tell if data is corrupted is almost as bad as having corrupted data. git-annex can perform "fsck" on repository contents to detect corruption. It can also be configured to ensure that you never remove the last copy of the data, or that you always keep a certain number of copies of data. Each repository "remote" can be configured with a variable amount of trust as well.
Okay, enough about the why of git-annex, let's talk about the how.
Setting up your first git-annex repository
In this example I'm going to create a new repository of files, in this case, some images I have lying around. Any files will work though!
Initializing and adding the first files
A git-annex repository is just like a git repository, meaning it's a directory that has had git init run in it, so let's do exactly that:
hinmanm@Xanadu 0 ~()% mkdir ~/myrepo
hinmanm@Xanadu 0 ~()% cd myrepo
hinmanm@Xanadu 0 ~/myrepo()% git init
Initialized empty Git repository in /Users/hinmanm/myrepo/.git/
hinmanm@Xanadu 0 ~/myrepo(master)%
Great, that was easy.
In addition to this, we need to initialize the directory as a git-annex repository and give the repository a name. Since I'm running this on my laptop, I'm going to initialize git-annex with "laptop" as the name of this repository.
hinmanm@Xanadu 0 ~/myrepo(master)% git annex init "laptop"
init laptop ok
(Recording state in git...)
hinmanm@Xanadu 0 ~/myrepo(master)%
Awesome, everything is set up. Now all we need to do is add some files. I have a great video of Grace Hopper explaining a nanosecond that I'd like to back up, so I'll copy it into this folder and add it to the repository using git annex add:
hinmanm@Xanadu 0 ~/myrepo(master)% cp -v ~/Downloads/Grace-Hopper-Nanoseconds.mp4 .
/Users/hinmanm/Downloads/Grace-Hopper-Nanoseconds.mp4 -> ./Grace-Hopper-Nanoseconds.mp4
hinmanm@Xanadu 0 ~/myrepo(master*)% git annex add Grace-Hopper-Nanoseconds.mp4
add Grace-Hopper-Nanoseconds.mp4 ok
(Recording state in git...)
hinmanm@Xanadu 0 ~/myrepo(master*)%
Don't forget to commit the file!
hinmanm@Xanadu 0 ~/myrepo(master*)% git commit -m "Added video of Grace Hopper explaining the nanosecond"
[master (root-commit) ea6e382] Added video of Grace Hopper explaining the nanosecond
 1 file changed, 1 insertion(+)
 create mode 120000 Grace-Hopper-Nanoseconds.mp4
hinmanm@Xanadu 0 ~/myrepo(master)%
Understanding what git-annex does with files
It turns out that when we add the files, git-annex is not actually adding the content of the files to the git repo. You can see this using ls -lh, which will show that "Grace-Hopper-Nanoseconds.mp4" is now a symlink:
hinmanm@Xanadu 0 ~/myrepo(master)% ls -lh
total 8
lrwxr-xr-x 1 hinmanm staff 198B Dec 20 12:04 Grace-Hopper-Nanoseconds.mp4@ -> .git/annex/objects/5X/Gv/SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4/SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4
hinmanm@Xanadu 0 ~/myrepo(master)%
git-annex computes the checksum of the file and uses that for the actual file name inside of the .git/annex/objects folder. You can still access the original file by opening the symlink.
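To make the idea concrete, here is a rough sketch of the same trick done by hand, without git-annex: hash a file, move it to a name derived from the hash, and leave a symlink behind. The objects folder, the stand-in file, and the flat layout are my own simplifications (git-annex nests its keys in hashed subdirectories under .git/annex/objects), and it assumes sha256sum from GNU coreutils is available.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so nothing real is touched.
work=$(mktemp -d)
cd "$work"

# A tiny stand-in for a large media file (made up for this sketch).
printf 'pretend this is a large video\n' > video.mp4

# Build a key in the same shape as the one shown above:
# SHA256E-s<size in bytes>--<sha256 of content>.<extension>
size=$(wc -c < video.mp4 | tr -d ' ')
sum=$(sha256sum video.mp4 | cut -d ' ' -f 1)
key="SHA256E-s${size}--${sum}.mp4"

# Move the content to its checksum-derived name and symlink the old name to it.
mkdir -p objects
mv video.mp4 "objects/$key"
ln -s "objects/$key" video.mp4

ls -l video.mp4   # shows the symlink and its target
cat video.mp4     # the content is still readable through the symlink

# Re-hash the stored content and compare it with the hash in its name.
check=$(sha256sum "objects/$key" | cut -d ' ' -f 1)
if [ "$check" = "$sum" ]; then
  echo "content matches its key"
fi
```

Because the name is derived from the content, re-hashing a stored file and comparing the result with its name is a cheap way to notice corruption, which is roughly the idea behind checking annexed content later on.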
So far, we haven't really done much other than adding a file, so let's use git-annex to maintain a second location for this repository.
Adding a second location for data (USB drive)
I'm going to use an external USB drive as the second location, because that's usually where I perform backups. However, git-annex supports anything that could be a git repository, meaning GitHub or another computer. It even supports a lot of special remotes, like backing up to S3.
I am using a USB drive mounted at /Volumes/MINIDRIVE1/, so adjust as needed to a different location on disk, wherever your device is mounted. You can also use another folder if you don't have a USB drive handy.
First, let's make a folder on the USB drive and initialize it as both a regular git repo and a git-annex repository, but this time named "usbdrive":
hinmanm@Xanadu 0 ~/myrepo(master)% mkdir /Volumes/MINIDRIVE1/myrepo
hinmanm@Xanadu 0 ~/myrepo(master)% cd /Volumes/MINIDRIVE1/myrepo
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo()% git init
Initialized empty Git repository in /Volumes/MINIDRIVE1/myrepo/.git/
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% git annex init "usbdrive"
init usbdrive
  Detected a filesystem without fifo support.
  Disabling ssh connection caching.
ok
(Recording state in git...)
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)%
Next we need to let each of the two locations, ~/myrepo and /Volumes/MINIDRIVE1/myrepo, know about each other. This is accomplished by adding each location as a remote, using the git remote command.
First we'll let the original repo know about the USB drive:
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% cd ~/myrepo
hinmanm@Xanadu 0 ~/myrepo(master)% git remote add usbdrive /Volumes/MINIDRIVE1/myrepo
hinmanm@Xanadu 0 ~/myrepo(master)%
And then let the USB drive know about the laptop:
hinmanm@Xanadu 0 ~/myrepo(master)% cd /Volumes/MINIDRIVE1/myrepo/
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% git remote add laptop ~/myrepo
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% cd ~/myrepo
hinmanm@Xanadu 0 ~/myrepo(master)%
Now that each repository knows about the other one, you can use the git annex sync command to synchronize the git state between them. If we run it in each folder, we can see that the folder on the USB drive has been updated with the name of the file in the repository:
hinmanm@Xanadu 0 ~/myrepo(master)% git annex sync
commit ok
pull usbdrive
warning: no common commits
remote: Counting objects: 5, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 5 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (5/5), done.
From /Volumes/MINIDRIVE1/myrepo
 * [new branch] git-annex -> usbdrive/git-annex
ok
(merging usbdrive/git-annex into git-annex...)
(Recording state in git...)
push usbdrive
Counting objects: 17, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (13/13), done.
Writing objects: 100% (17/17), 1.60 KiB | 0 bytes/s, done.
Total 17 (delta 1), reused 0 (delta 0)
To /Volumes/MINIDRIVE1/myrepo
 * [new branch] git-annex -> synced/git-annex
 * [new branch] master -> synced/master
ok
hinmanm@Xanadu 0 ~/myrepo(master)% cd /Volumes/MINIDRIVE1/myrepo
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% git annex sync
(merging synced/git-annex into git-annex...)
commit ok
pull laptop
From /Users/hinmanm/myrepo
 * [new branch] git-annex -> laptop/git-annex
 * [new branch] master -> laptop/master
 * [new branch] synced/master -> laptop/synced/master
Already up-to-date.
ok
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% ls -lh
total 64
lrwxrwxrwx 1 hinmanm staff 198B Dec 20 13:09 Grace-Hopper-Nanoseconds.mp4@ -> .git/annex/objects/5X/Gv/SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4/SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)%
However, the content hasn't actually been copied over, only the metadata (the symlink). You can verify this with the file command:
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% file Grace-Hopper-Nanoseconds.mp4
Grace-Hopper-Nanoseconds.mp4: broken symbolic link to
.git/annex/objects/5X/Gv/SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4/SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4
Copying data between locations
Okay, this would be pretty crummy if all it did was keep track of metadata, but git-annex can also shuffle data between different locations, with a variety of methods.
The easiest way to ensure the data is in all places is to use the sync command with the --content flag. If we run this in either repository, the content will be copied over:
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% git annex sync --content
commit ok
pull laptop
ok
get Grace-Hopper-Nanoseconds.mp4 (from laptop...)
SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4
5769029 100% 60.12MB/s 0:00:00 (xfer#1, to-check=0/1)
sent 5769900 bytes received 42 bytes 3846628.00 bytes/sec
total size is 5769029 speedup is 1.00
ok
pull laptop
ok
(Recording state in git...)
push laptop
Counting objects: 5, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (5/5), 454 bytes | 0 bytes/s, done.
Total 5 (delta 1), reused 0 (delta 0)
To /Users/hinmanm/myrepo
 * [new branch] git-annex -> synced/git-annex
ok
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% file Grace-Hopper-Nanoseconds.mp4
Grace-Hopper-Nanoseconds.mp4: ISO Media, MPEG v4 system, version 2
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)%
If you don't want the blanket copy that git annex sync --content does, you can move data in a much more fine-grained way by using the move or copy commands, for instance:
hinmanm@Xanadu 0 ~/myrepo(master)% git annex move . --to usbdrive
(merging synced/git-annex into git-annex...)
move Grace-Hopper-Nanoseconds.mp4 ok
(Recording state in git...)
This is moving by copying the file, but git-annex can also transfer the file in other ways, depending on the type of remote it is. Describing them all is out of scope for this article.
Finding out where your files are
At any time, you can see where the data is using the whereis command. I'm using "." here to signify the whole directory, but you can also use "*", or "*.mp4" and all the other things you would expect when using a shell.
hinmanm@Xanadu 0 ~/myrepo(master)% git annex whereis .
whereis Grace-Hopper-Nanoseconds.mp4 (1 copy)
  d6a2baed-8785-4f2a-8b7c-c99190b67647 -- [usbdrive]
ok
hinmanm@Xanadu 0 ~/myrepo(master)% git annex copy . --from usbdrive
copy Grace-Hopper-Nanoseconds.mp4 (from usbdrive...)
SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4
5769029 100% 2.40MB/s 0:00:02 (xfer#1, to-check=0/1)
sent 5769900 bytes received 42 bytes 2307976.80 bytes/sec
total size is 5769029 speedup is 1.00
ok
(Recording state in git...)
hinmanm@Xanadu 0 ~/myrepo(master)% git annex whereis .
whereis Grace-Hopper-Nanoseconds.mp4 (2 copies)
  69f0a35e-91c7-43c5-9ce0-89af0056b523 -- laptop [here]
  d6a2baed-8785-4f2a-8b7c-c99190b67647 -- [usbdrive]
ok
Notice that at first the only copy of the data is on "usbdrive". I then used the copy command to copy it from the USB drive, after which git-annex shows that there are two copies: "here", which is the laptop, and on the "usbdrive" remote.
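If you want to try the whole flow without a USB drive, the steps so far can be condensed into a throwaway script that uses two folders in a temporary directory. This is only a sketch: the folder names, the tiny stand-in file, the example identity, and the init_repo helper are all made up for illustration, and newer git-annex versions may print different output than the sessions above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Skip quietly when git-annex is not installed.
if ! git annex version >/dev/null 2>&1; then
  echo "git-annex not found, nothing to do"
  exit 0
fi

work=$(mktemp -d)
laptop="$work/laptop"
usb="$work/usbdrive"   # stands in for /Volumes/MINIDRIVE1/myrepo
mkdir "$laptop" "$usb"

# Hypothetical helper: git init followed by git annex init with a name.
init_repo() {
  git -C "$1" init -q
  # A local identity so commits work on machines without a global git config.
  git -C "$1" config user.name "Example User"
  git -C "$1" config user.email "user@example.com"
  git -C "$1" annex init "$2"
}

init_repo "$laptop" "laptop"
init_repo "$usb" "usbdrive"

# Add a small stand-in file on the "laptop" and commit it.
printf 'pretend this is a large video\n' > "$laptop/video.mp4"
git -C "$laptop" annex add video.mp4
git -C "$laptop" commit -q -m "Added a stand-in video"

# Let each repository know about the other one.
git -C "$laptop" remote add usbdrive "$usb"
git -C "$usb" remote add laptop "$laptop"

# Synchronize the git state in both directions.
git -C "$laptop" annex sync
git -C "$usb" annex sync

# Copy the content itself, then ask where the copies live.
git -C "$laptop" annex copy . --to usbdrive
git -C "$laptop" annex whereis .
```

After it runs, both folders should contain a video.mp4 symlink that resolves to the same content, and whereis should list both repositories.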
Removing unneeded data
When you don't want to store the file locally, use the drop command, which will remove the file locally, but only after ensuring that at least 1 copy of the data is still available somewhere (this number is configurable too).
hinmanm@Xanadu 0 ~/myrepo(master)% git annex drop .
drop Grace-Hopper-Nanoseconds.mp4 ok
(Recording state in git...)
hinmanm@Xanadu 0 ~/myrepo(master)% cd /Volumes/MINIDRIVE1/myrepo
hinmanm@Xanadu 0 /Volumes/MINIDRIVE1/myrepo(master)% git annex drop .
drop Grace-Hopper-Nanoseconds.mp4 (unsafe)
  Could only verify the existence of 0 out of 1 necessary copies
  Rather than dropping this file, try using: git annex move
  (Use --force to override this check, or adjust numcopies.)
failed
git-annex: drop: 1 failed
hinmanm@Xanadu 1 /Volumes/MINIDRIVE1/myrepo(master)%
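Keep in mind, however, that a repository needs to be available in order for git-annex to verify that the data still resides there. You won't be able to drop the file from the laptop if the USB drive is disconnected, because git-annex can't tell whether the file is still actually on the disk. The second drop in the session above fails for a related reason: the laptop's copy is already gone, so dropping the file from the USB drive would remove the last copy.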
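The safety check is easy to reproduce in a scratch repository. The sketch below raises the required number of copies with git annex numcopies and then tries to drop the only copy of a file, which git-annex should refuse. The temporary folder, the file name, and the example identity are invented for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Skip quietly when git-annex is not installed.
if ! git annex version >/dev/null 2>&1; then
  echo "git-annex not found, nothing to do"
  exit 0
fi

repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Example User"
git -C "$repo" config user.email "user@example.com"
git -C "$repo" annex init "scratch"

# One annexed file whose only copy lives in this repository.
printf 'the only copy\n' > "$repo/notes.txt"
git -C "$repo" annex add notes.txt
git -C "$repo" commit -q -m "Added notes"

# Ask git-annex to keep at least two copies of every file.
git -C "$repo" annex numcopies 2

# Dropping should be refused: no other repository is known to hold the content.
if git -C "$repo" annex drop notes.txt; then
  drop_result="dropped"
else
  drop_result="refused"
fi
echo "drop was $drop_result"
```

The refusal here is the same check that failed in the session above; the content stays in place until enough other copies can be verified.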
Even if a file's content is removed, any folders as well as the symlink will remain (it will be a broken symlink) so you know which files are part of the repository. Additionally, you can create branches and revert commits just like you would expect, because all of the metadata is stored in the regular git repository.
This is where I really get a lot of value out of git-annex. I have about 40 gigabytes of photos that will definitely not fit on my laptop's SSD, so I can keep the folder hierarchy for every photo I've ever taken without having to keep track of which of my 3 external USB drives the actual photos exist on.
Adding a third location (a remote computer using SSH)
This time, I'll add another repository on a server machine that I have, so that I can keep data offsite (or perhaps on a home NAS). I have a machine called "corinth", so the first thing I will do is ssh in and initialize a place for the data:
hinmanm@Xanadu 0 ~()% ssh corinth
Last login: Sat Dec 20 13:06:52 2014 from dhcp-077-251-118-122.chello.nl
+++Reading .zshenv
+++Reading .zshrc (for interactive use).
+++Loaded files in 0.5055 seconds
hinmanm@corinth 0 ~()% mkdir myrepo
hinmanm@corinth 0 ~()% cd myrepo
hinmanm@corinth 0 ~/myrepo()% git init
Initialized empty Git repository in /home/hinmanm/myrepo/.git/
hinmanm@corinth 0 ~/myrepo(master)% git annex init "corinth"
init corinth ok
(Recording state in git...)
And then, back on my local machine, add the "corinth" repo as a remote and synchronize:
hinmanm@Xanadu 0 ~/myrepo(master)% git remote add corinth ssh://corinth/home/hinmanm/myrepo
hinmanm@Xanadu 0 ~/myrepo(master)% git annex sync
commit ok
pull usbdrive
From /Volumes/MINIDRIVE1/myrepo
 0abc238..9a23674 git-annex -> usbdrive/git-annex
 * [new branch] master -> usbdrive/master
ok
pull corinth
warning: no common commits
remote: Counting objects: 5, done.
remote: Compressing objects: 100% (3/3), done.
Unpacking objects: 100% (5/5), done.
remote: Total 5 (delta 0), reused 0 (delta 0)
From ssh://corinth/home/hinmanm/myrepo
 * [new branch] git-annex -> corinth/git-annex
ok
(merging corinth/git-annex into git-annex...)
(Recording state in git...)
push usbdrive
Counting objects: 23, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (18/18), done.
Writing objects: 100% (23/23), 1.96 KiB | 0 bytes/s, done.
Total 23 (delta 6), reused 0 (delta 0)
To /Volumes/MINIDRIVE1/myrepo
 171c1fe..2ea2746 git-annex -> synced/git-annex
ok
push corinth
Counting objects: 44, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (35/35), done.
Writing objects: 100% (44/44), 3.66 KiB | 0 bytes/s, done.
Total 44 (delta 10), reused 0 (delta 0)
To ssh://corinth/home/hinmanm/myrepo
 * [new branch] git-annex -> synced/git-annex
 * [new branch] master -> synced/master
ok
hinmanm@Xanadu 0 ~/myrepo(master)%
And now you can move or copy data just like you would expect:
hinmanm@Xanadu 0 ~/myrepo(master)% git annex copy . --from usbdrive
copy Grace-Hopper-Nanoseconds.mp4 (from usbdrive...)
SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4
5769029 100% 13.06MB/s 0:00:00 (xfer#1, to-check=0/1)
sent 5769900 bytes received 42 bytes 3846628.00 bytes/sec
total size is 5769029 speedup is 1.00
ok
(Recording state in git...)
hinmanm@Xanadu 0 ~/myrepo(master)% git annex copy . --to corinth
copy Grace-Hopper-Nanoseconds.mp4 (checking corinth...) (to corinth...)
SHA256E-s5769029--dc383e922caca7a52eb8c44e0d79fdce22e597e667ba29732966e462ac483cf7.mp4
5769029 100% 624.58kB/s 0:00:08 (xfer#1, to-check=0/1)
sent 5769900 bytes received 42 bytes 501734.09 bytes/sec
total size is 5769029 speedup is 1.00
ok
(Recording state in git...)
And the data is now stored in 3 places, one of them offsite:
hinmanm@Xanadu 0 ~/myrepo(master)% git annex whereis .
whereis Grace-Hopper-Nanoseconds.mp4 (3 copies)
  197ead75-fa13-4277-ac38-576e5279693d -- [corinth]
  69f0a35e-91c7-43c5-9ce0-89af0056b523 -- laptop [here]
  d6a2baed-8785-4f2a-8b7c-c99190b67647 -- [usbdrive]
ok
This is just scratching the surface. git-annex is a very flexible and full-featured application with a lot of documentation, so check it out if this article was interesting!
Other things to check out
- The main git-annex webpage has lots of good information
- The walkthrough on the git-annex page - a good introduction to git-annex, similar to the one here, that goes through many more features. It's definitely worth a look if you found this article interesting.
- The full-blown git-annex assistant - you can do all of this with an easier web-based application that git-annex ships with called git-annex assistant, if you so desire.
- Add files from URLs using git-annex's "addurl"
- Keep track of podcasts using "importfeed"
- Metadata driven views, where files can be tagged and the directory can be changed to navigate files based on those tags
- Git-annex's list of special remotes