Back up your databases in Git

Storing backups of your database is important for any service on the internet. Git can be the right tool to back up databases.

Like other version control systems, Git tracks changes, and will only push the changes in files to the remote. So if one line in a 1-million-line database dump is changed, we don’t need to transfer the whole dump to our backup server. This economization is done by Git’s delta compression mechanism.1
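If you are curious to see this at work, you can inspect a pack file in an existing repository with git verify-pack (the pack file name below is a wildcard; yours will differ):

git verify-pack -v .git/objects/pack/pack-*.idx | head -n 20

Deltified objects show up with a delta chain depth and a size in the pack that is much smaller than the full blob.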

Configuring Git

Generating SSH keys:

$ ssh-keygen -t rsa -C "your_email@example.com"
Generating public/private rsa key pair.
Enter file in which to save the key (/Users/you/.ssh/id_rsa): [Press enter]
Enter passphrase (empty for no passphrase): [Type a passphrase]
Enter same passphrase again: [Type passphrase again]
Your identification has been saved in /Users/you/.ssh/id_rsa.
Your public key has been saved in /Users/you/.ssh/id_rsa.pub.
The key fingerprint is:
01:0f:f4:3b:ca:85:d6:17:a1:7d:f0:68:9d:f0:a2:db your_email@example.com

If you want to execute the backup script automatically, leave the passphrase blank. This way SSH won’t ask for it. Note that this might be insecure!
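If you want to script this step as well, ssh-keygen can be run non-interactively with an empty passphrase (the key path here is just an example):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa -C "your_email@example.com"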

Now create a remote Git repository, and add the public key to this Git service, e.g. GitHub, Gogs, …

Initialize a new local repo, add the SSH remote address, and push an initial commit.
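Something along these lines should do; the remote URL is a placeholder for the SSH address of the repository you just created, and the default branch name may differ on your setup:

cd /home/mathias/backup/mysql_git
git init
git remote add origin git@your-git-server:mathias/mysql_git.git
echo "MySQL backups" > README.md
git add README.md
git commit -m "Initial commit"
git push -u origin master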

Backup script

#! /bin/sh

TIMESTAMP=$(date +"%F")
BACKUP_DIR="/home/mathias/backup/mysql_git"
MYSQL_USER="Mathias"
MYSQL=/usr/bin/mysql
MYSQL_PASSWORD="yourpassword"
MYSQLDUMP=/usr/bin/mysqldump

# Abort when the backup directory cannot be entered
cd "$BACKUP_DIR" || exit 1

echo "Backupping databases"

# List all databases, skipping the header line and MySQL's internal schemas
databases=$($MYSQL --user="$MYSQL_USER" -p"$MYSQL_PASSWORD" -e "SHOW DATABASES;" | grep -Ev "(Database|information_schema|performance_schema)")

for db in $databases; do

        echo "  > Dumping $db to disk"
        $MYSQLDUMP --force --opt --user="$MYSQL_USER" -p"$MYSQL_PASSWORD" --skip-extended-insert --databases "$db" > "$BACKUP_DIR/$db.sql"

        echo "  > Saving $db changes on Git server"
        git add "$BACKUP_DIR/$db.sql"
        git commit -m "$db $(date +"%m-%d-%Y")"
        git push

done

echo "Done"

The script loops through all MySQL databases and dumps each of them to a .sql file named after the database. After a database is dumped, the file is added to the local Git repo and committed.

After each commit, the changes are pushed to the remote repo. This avoids very big pushes when working with large databases. If you want to push only once, just place the push at the end of the script.
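That variant would look roughly like this (only the loop and the final push are shown; the rest of the script stays the same):

for db in $databases; do

        echo "  > Dumping $db to disk"
        $MYSQLDUMP --force --opt --user="$MYSQL_USER" -p"$MYSQL_PASSWORD" --skip-extended-insert --databases "$db" > "$BACKUP_DIR/$db.sql"

        echo "  > Committing $db locally"
        git add "$BACKUP_DIR/$db.sql"
        git commit -m "$db $(date +"%m-%d-%Y")"

done

# Push all commits in one go
echo "  > Saving all changes on Git server"
git push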

Running the backup script

Running this script manually isn’t the best solution. Turning it into an automated backup service is straightforward: just create a cronjob that executes the script every day (or at any interval you want).

Type crontab -e in the console; this will open your personal cron configuration in your favorite editor. Now add the cronjob to the crontab:

30 2 * * * /home/mathias/backup/mysql_git/backup.sh >> /home/mathias/backup_git_cron.log

This particular example runs the backup script every day at 02:30, and appends the output of the script to a backup_git_cron.log file in my home directory. (Of course you are free to create any cronjob schedule that runs the backup script at your desired moment.)
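Don’t forget to make the script executable, otherwise cron won’t be able to run it. You can also list the installed crontab afterwards to verify the entry:

chmod +x /home/mathias/backup/mysql_git/backup.sh
crontab -l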

Big data and low-end hardware

Git works very well for small programming source files and small text files. Those database dumps, however, aren’t always that small. On my VPS I have 200MB of database dumps, each of which has to be compressed and packed for every commit. This takes a lot of time on a machine with 512MB of RAM, and sometimes even crashes on the largest files. While pushing I’ve seen this error far too often: error: pack-objects died of signal 9.

Some other Git users working with large files have lowered the packing limits, which resulted in fewer problems packing those files:

git config --global pack.windowMemory "100m"
git config --global pack.packSizeLimit "100m"
git config --global pack.threads "1"

On my server that didn’t really seem to work (apparently I still don’t have enough free RAM): I noticed fewer problems while compressing the objects, but the large dumps still took quite some time (and sometimes crashed).
The solution for me was to turn off delta compression.2

echo '*.sql -delta' > .gitattributes

The above command writes the setting to the .gitattributes file. If you commit this file, delta compression will be turned off for .sql files in any clone of the repo.
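You can verify that the attribute is picked up for a dump file with git check-attr (the file name is just an example); it should report delta as unset:

git check-attr delta -- mydatabase.sql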

Another solution would be to migrate from Git to Mercurial. From what I’ve read, Mercurial stores diffs instead of object packs.

There’s one huge difference between Git and Mercurial: the way they represent each commit. Git represents commits as snapshots, while Mercurial represents them as diffs.

The advantage of changesets (diffs) is that they take up less space. Git recovers the space used for commits by using compression, but this requires an occasional explicit recompress step (git gc).

When the history of the repository becomes too large, it is useful to do a shallow clone of the remote: git clone --depth <depth> <remote-url>. This way you don’t keep the large history locally, but let the remote keep it.
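For example, to keep only the most recent commit locally (the remote URL is again a placeholder for your own backup repository):

git clone --depth 1 git@your-git-server:mathias/mysql_git.git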

Conclusion

Git might not be the perfect system for backups, but used with care (and on good hardware) it can provide a decent backup system.

And it’s always better than having no backup at all!

 


1 Later in this blogpost I point out that disabling delta compression is better on low-memory machines, since it uses less memory. Note that without delta compression Git needs to send a lot more data to the remote.2

2 Note that disabling delta compression means that Git has to push the full packs of large files, since it can no longer rely on deltas. So if you change one line of a 500MB file that is packed into a 100MB pack, you will always need to send that 100MB pack to the remote (instead of just 15KB) when delta compression is turned off. Without delta compression Git also needs to store all the objects of the files in full. After 65 commits (of some large databases), I had a repo of almost 1GB. Running git gc shrank that repo to less than 100MB. (Unfortunately, running git gc on my 512MB RAM server results in those well-known issues: error: pack-objects died of signal 9, warning: suboptimal pack - out of memory and fatal: inflateInit: out of memory)