-------------------------------------------------------------------------------
Rsync Backups and Snapshotting
-------------------------------------------------------------------------------
Making incremental backups (snapshots) with rsync...

It's actually very easy to extend rsync to incorporate access to old versions as well. All you have to do is create a hard linked copy of the backup tree before you run rsync. Rsync will unlink and recreate any files that have changed, so the old version still exists in the copy of the backup tree.

This only uses up extra disk space for a file when it changes, so you end up with a complete backup from every day (or even every hour) with very efficient disk usage.

The following is a simple script to do this. You may want to change some of the options, especially on the rsync command.

    # hard link copy of the backup tree, named with yesterday's date
    cp -al today "$(/bin/date -I -d yesterday)"
    # update the backup; changed files are recreated, not overwritten
    rsync -a --delete fileserver:/home today

Also look at the option
    --compare-dest=DIR
though I think the hard link option may be better.

Rsync backup snapshot scheme as originally pieced together on
  http://www.mikerubel.org/computers/rsync_snapshots/
Better documentation of the technique
  http://www.sanitarium.net/golug/rsync_backups_2010.html
And taken to a higher level with
  http://rsnapshot.org/

-------------------------------------------------------------------------------
My Backup Scheme "home_backup" (vs "rsnapshot")

My rsync backup scripts...
  http://www.ict.griffith.edu.au/anthony/software/#home_backup

My version stores things slightly differently to rsnapshot, mostly because of a completely separate development cycle. However it has features that make it very useful and that are not provided by rsnapshot.

The actual backup is just that. A complete backup of my home to a separate directory on a separate machine...
    user@remote:backup/current

The "backup" directory is where things are stored, and "current" is the most recent 'copy' of my primary home (regardless of whether it was made daily or in the last hour).
This is a push backup from my primary home, and can make backups to any number of remote accounts or mounted USB hard drives (two or more backup locations is a good idea as a further disaster recovery technique).

When a backup to "current" is complete, a separate script is run to 'roll' the "current" backup into various snapshot cycles. At this time there are four such cycles: hourly, daily, weekly, monthly. The script also runs a 're-link' of "current" against all the previous hourly and daily snapshots, so as to handle the possibility of a directory/file renaming, and thus improve the hardlinked disk space saving (see below).

However I designed it so that each 'cycle' is rolled and a new snapshot is made when any of three conditions exist...

  1/ This cycle does not exist yet -- start new cycle
  2/ It is a specific time or day -- a preferred snapshot cycle time
  3/ Last snapshot is older than the preferred cycle time -- if the preferred snapshot did not happen at the right time.

The first rule is obvious, create it. The second means that it will prefer to make snapshots at a specific time: 6am for daily, Monday morning for weekly, or the 1st day of the month for monthly, etc. The third means that if that 'preferred' time passes and the snapshot did not happen, one is simply made when the script is next run.

This means that when the backup is performed the cycles are updated only if needed, but the backup program can run haphazardly, due to computers or the network being down, and the cycle rolling will simply happen.

This means scheduling does not actually need to be continuous, or regular. An hourly flagged backup for example can easily be scheduled to be every two hours, or four hours, or only during work-time hours (8am to 6pm). I could even do the current backup every 15 minutes, and the hourly cycle will still remain an hourly cycle, with a preferred on-the-hour timing.
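The three roll conditions above can be sketched as a small decision function. This is only an illustration in Python, not the actual 'roll' script. The name "should_roll", its parameters, and the half-period guard (which stops a cycle being rolled twice within one preferred window when backups run very often) are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def should_roll(last, now, period, in_preferred_window):
    """Return True if a snapshot cycle should be rolled (hypothetical helper).

    last                -- datetime of the newest snapshot in the cycle, or None
    now                 -- datetime of the backup that has just completed
    period              -- timedelta between snapshots (1 day for daily)
    in_preferred_window -- True if 'now' is in the cycle's preferred time
    """
    if last is None:
        return True                  # 1/ this cycle does not exist yet
    age = now - last
    # 2/ it is the preferred time.  The half-period guard is an assumption,
    #    so that frequent backups do not roll twice in the same window.
    if in_preferred_window and age >= period / 2:
        return True
    # 3/ the preferred time was missed and the last snapshot is too old
    return age > period

day = timedelta(days=1)
last_daily = datetime(2012, 6, 12, 6, 6)

# A second backup 15 minutes later, still inside the 6am window
print(should_roll(last_daily, datetime(2012, 6, 12, 6, 21), day, True))
# The 6am run was missed, the next backup only happens at 09:10
print(should_roll(last_daily, datetime(2012, 6, 13, 9, 10), day, False))
```

With these inputs the first call declines to roll and the second one rolls, which is the kind of late snapshot that shows up as a gap in the timestamps of a cycle.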
The other aspect (when compared to rsnapshot) is that rather than naming the cycle snapshots by an incrementing number,
    hourly.0 hourly.1
I wanted to be able to look at the backup and know the backup dates directly, and have it sort correctly. As such I format the snapshot directory names in a very specific way...

Here is a listing of my backup directory...

  Backup_Summery            day_14_2012-06-08_0606/   mnth_3_2012-04-01_0605/
  current/                  day_15_2012-06-07_0606/   mnth_4_2012-03-01_0605/
  current_2012-06-21_1106@  home_backup_prep*         mnth_5_2012-02-01_0605/
  day_01_2012-06-21_0606/   home_backup_roll*         mnth_6_2012-01-01_0605/
  day_02_2012-06-20_0606/   home_backup_summ*         mnth_7_2011-12-01_0609/
  day_03_2012-06-19_0606/   hour_1_2012-06-21_1106/   mnth_8_2011-11-01_0609/
  day_04_2012-06-18_0608/   hour_2_2012-06-21_1006/   mnth_9_2011-10-01_0608/
  day_05_2012-06-17_0606/   hour_3_2012-06-21_0907/   week_1_2012-06-18_0608/
  day_06_2012-06-16_0606/   hour_4_2012-06-21_0806/   week_2_2012-06-11_0608/
  day_07_2012-06-15_0606/   hour_5_2012-06-20_1806/   week_3_2012-06-04_0608/
  day_08_2012-06-14_0606/   hour_6_2012-06-20_1706/   week_4_2012-05-28_0608/
  day_09_2012-06-13_0910/   hour_7_2012-06-20_1606/   week_5_2012-05-21_0608/
  day_10_2012-06-12_0606/   hour_8_2012-06-20_1506/   week_6_2012-05-14_0608/
  day_11_2012-06-11_0608/   hour_9_2012-06-20_1406/   week_7_2012-05-07_0608/
  day_12_2012-06-10_0606/   mnth_1_2012-06-01_0606/   week_8_2012-04-30_0607/
  day_13_2012-06-09_0606/   mnth_2_2012-05-01_0606/   week_9_2012-04-23_0607/

As you can see each backup has a date and time, and it is easy to see how old each backup is and if some problem has occurred. For example the listing will show any gaps or duplication that may have happened, which allows me to track down and fix the problem. For example the day_09 roll happened late, but that seems to be the only quirk shown at this point in the snapshot sequence.

Note "current_*" is just a symbolic link to "current" so I can see when the last actual backup was made.
In the above, that backup was also hardlink copied as the last hourly snapshot.

The 'roll' script (which is also backed up from my primary home) is designed to make it easy to configure new cycles. Thus it can be quite easy to change weekly to fortnightly, or bi-weekly, or add a yearly snapshot, if I wanted it that way.

Specific features...

  * The backups are 'pushed' to remote accounts, or USB mounts. That is, you can have multiple backup 'storage' locations.
  * Re-linking is performed to handle file/directory renaming.
  * The backup and the snapshot cycles are treated as separate tasks.
  * No need to specify that this backup is for a 'specific snapshot cycle'.
  * There is a 'preferred time' for each cycle (monthly at first of the month).
  * But it recovers if the preferred time is missed (previous snapshot is old).
  * No chance of two backups working in parallel (for different cycles).
  * Better naming of the snapshots so you can see at a glance any problems or gaps in the backup.
  * Snapshot rolling is designed to be flexible and easily modified in the 'roll' script.

These features make great additions to any backup scheme.

-------------------------------------------------------------------------------
Renaming directories breaks hardlinks!

While using an rsync backup utility (like rsnapshot) I did come across a particular situation that I found very annoying.

If you rename a directory in your home, the backups will fail to hardlink what are essentially the same unchanged files within that directory.

For example, say I have a sub-directory called 'my_photos' which has been backed up by an rsync backup program. Then one day I rename that directory to 'photos_2009'. Later when the rsync backup makes a new 'snapshot' it sees the old directory deleted, and a new directory created. It does not see it as a 'renaming'. The result is that the hardlinks between the old and new snapshots for all the files are now broken.
As a whole directory of photos (possibly containing many sub-directories of photos) can be very large, suddenly your backup uses a large amount of disk space.

If the backups continue for some time like this, new snapshots will be created with hardlinks to the files in their new location. That means you can end up with two copies of those files, each with a collection of many hardlinks between snapshots.

That has happened to me, eventually causing me to run out of disk space on the backup server, even though I really did not have any new data!

The problem here is that deleting one file and relinking the file to the other hardlink 'group' does not actually free up any disk space. That will only happen if ALL the files in one 'hardlink group' get re-linked to the other hardlink group. A very complex business, and not one you would want to do by hand, some time after the fact.

One solution to this has been a C program that supposedly re-links similar files, called 'hardlink'
  http://helmut.hullen.de/filebox/Linux/slackware/ap/hardlink-1.2-i486-1hln.tgz

I downloaded and looked at this, and the program has no documentation, no comments, and is very difficult to decode to figure out how exactly it does the job. However it does not seem to deal with the situation of there being two 'hardlink groups' of files, and properly sorting out that mess.

Other solutions I have seen
  http://linux.die.net/man/1/hardlink
Again, this program makes no mention of how it would handle 'hardlink groups'.

Eventually I wrote a perl script called "linkdups" that actually understands and re-links hardlink groups. It is available at
  http://www.ict.griffith.edu.au/anthony/software/#linkdups

It also has many options to provide better control of the re-linking process. By default however it does NOTHING, it just finds files to re-link. Add a -f option to actually fix the links.

The script only compares files that actually have the same length, making it very fast in comparing large numbers of files.
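The idea of comparing only files of the same length, and of treating files that are already hardlinked together as one item, can be sketched as below. This is not the code of "linkdups" (which is a perl script), just a minimal Python illustration. The function name "find_relink_candidates", skipping empty files, and the use of filecmp for the content comparison are assumptions of the sketch. Like the default mode described above, it only reports and never changes anything.

```python
import filecmp
import os
import stat
from collections import defaultdict

def find_relink_candidates(root):
    """List groups of identical files under root that are not yet hardlinked.

    Dry run only, nothing is modified.  Each group holds one path per
    distinct inode, so an existing hardlink group is counted once.
    """
    # (device, size) -> {inode: first path seen for that inode}
    by_size = defaultdict(dict)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # regular files only; empty files are skipped (an assumption)
            if stat.S_ISREG(st.st_mode) and st.st_size > 0:
                by_size[(st.st_dev, st.st_size)].setdefault(st.st_ino, path)

    groups = []
    for inodes in by_size.values():
        remaining = list(inodes.values())
        while len(remaining) > 1:    # only same-size files are ever read
            first, rest = remaining[0], remaining[1:]
            same = [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if same:
                groups.append([first] + same)
            remaining = [p for p in rest if p not in same]
    return groups
```

Calling find_relink_candidates("backup") on a backup directory would return lists of paths whose content matches but which live on different inodes. Those are the files that could share one copy on disk if they were re-linked.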
The script also remembers what hardlinked file groups it has already come across, so that if a larger group is found it can merge them on the spot. In other words it is much more hardlink savvy than most other similar programs.

Finally it tries to make intelligent choices about how permissions and date stamps should be merged, as hardlinked files share such information.

Comments and suggestions are welcome.

WARNING WARNING WARNING

Re-linking an actual home directory (compared to a backup) is dangerous and can have unforeseen consequences. See the CAVEAT section of the script's built-in documentation.

-------------------------------------------------------------------------------
Handling very large "image" files.

These files contain whole file systems, can be large, and are possibly 'holey'. That is, they have many 'zero blocks' that do not actually need disk space to store.

The problem is that the file will change, but only in small parts. Because of this, while normal rsync will only copy the changed parts, it will still break the hardlink for this changed large file, forcing rsync snapshots to create multiple copies of the whole file. That is, no disk space is saved.

The same happens for 'log' type files, such as the email "mbox" file format, where changes are only appended to the 'end' of the file.

A perl rsync program "StoreBackup"
  http://www.nongnu.org/storebackup/
has some method of backing up these files in pieces, so that only the changed pieces will be updated and break the hard link. See its explanation at
  http://www.nongnu.org/storebackup/node36.html

Research needed

-------------------------------------------------------------------------------
Notes on snapshotting to USB sticks....

Rsync only replaces files on the destination (breaking any hardlinked copies) if the file's data changes, which is why you can create large numbers of 'snapshots' (even once an hour) using very little disk space.
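The effect can be seen in a small experiment. This is a minimal sketch in Python that does not run rsync itself. The helper names "hardlink_copy" and "replace_file" are made up for the example. The replace step imitates rsync's default behaviour of writing a new file and moving it into place, rather than overwriting the existing one.

```python
import os
import tempfile

def hardlink_copy(src, dst):
    """Recreate the directory tree of src under dst, hard linking every file
    (roughly what 'cp -al src dst' does; symlinks are not handled here)."""
    for dirpath, _dirs, names in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        target = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target, exist_ok=True)
        for name in names:
            os.link(os.path.join(dirpath, name), os.path.join(target, name))

def replace_file(path, data):
    """Write data to a new file, then rename it over path.  The old inode is
    left alone, so any other hard link to it keeps the old content."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(data)
    os.replace(tmp, path)

def inode(directory, name):
    return os.stat(os.path.join(directory, name)).st_ino

with tempfile.TemporaryDirectory() as top:
    today = os.path.join(top, "today")
    snap = os.path.join(top, "snapshot")
    os.mkdir(today)
    for name, text in (("changed.txt", "old"), ("same.txt", "keep")):
        with open(os.path.join(today, name), "w") as f:
            f.write(text)

    hardlink_copy(today, snap)                               # take the snapshot
    replace_file(os.path.join(today, "changed.txt"), "new")  # the 'rsync' step

    # unchanged file: still one inode shared by both trees
    shared = inode(today, "same.txt") == inode(snap, "same.txt")
    # changed file: the link is broken and the snapshot keeps the old data
    split = inode(today, "changed.txt") != inode(snap, "changed.txt")
    with open(os.path.join(snap, "changed.txt")) as f:
        old_kept = f.read() == "old"
    print(shared, split, old_kept)
```

Only the changed file costs extra disk space. The unchanged file exists once on disk, no matter how many snapshots link to it.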
Such rsync backups are not compressed, which allows each snapshot to look almost exactly like a simple full working copy of the directories that were backed up. That is, it is easy to search, and access any file in any snapshot.

You do not have to search multiple incremental compressed backup files just to recover a specific bit of data, perhaps without knowing the exact filename that data is in. Just search for it directly as you normally would, across all the snapshots.

It is the hard linking of unchanged files that gives an rsync multi-snapshot backup method such good compression. However hardlinks only work within the same disk storage mount, so each USB would have to have at least one full copy of the files being backed up.

Also hardlinked snapshotting will require... hard links... which requires a UNIX style filesystem. USB sticks typically only use a low level VFAT filesystem (no hardlinks, and DOS file attributes) for maximum compatibility.

As such USB sticks may need a different filesystem for it to work well. And larger USB drives with, say, an EXT4 filesystem tend to work better. That allows more hardlinked snapshots from the initial full copy (or last snapshot depending on how you look at it), and thus higher disk space savings (hardlink compression) per snapshot.

-------------------------------------------------------------------------------
Notes on snapshots and Cloud Filesystems (like dropbox)

The use of a cloud based filesystem (like dropbox) also precludes the use of hardlinks. As such snapshotting to such a filesystem does not compress well, as you do not get hardlink sharing of files across individual snapshots.

However making snapshot backups on a local machine, of a (possibly encrypted) cloud based 'working' filesystem that can be shared across devices, should work very well.
That is, one local machine keeps 'snapshot backups' (perhaps working automatically in the background), while the cloud allows access to the actual working directory from multiple locations.

If something happens to the cloud, or your working directory gets corrupted for some reason, you have your highly-hardlinked snapshots to recover from. It will be straightforward then to copy the last good snapshot to a new replacement cloud provider.

-------------------------------------------------------------------------------