-------------------------------------------------------------------------------
Rsync Backups, and Snapshotting
-------------------------------------------------------------------------------
Making incremental backups (snapshots) with rsync...

It is actually very easy to use rsync to create multiple snapshots of a
backup.  All you have to do is create a hard-linked copy of the backup tree
before (or after) you run rsync.

Each snapshot will have a hardlinked copy of the files, so each snapshot
uses very little extra disk space.  Then when a file changes, rsync will
unlink and recreate that file, while leaving the old version of the changed
file in older hardlinked snapshots untouched.

This means you can have many 'snapshots' of your backup (hours, days, etc),
each with a copy of the files as they existed at the time the snapshot was
made, with only the changed files using extra disk space.  A very efficient
use of disk space for a snapshot system.

However be warned that file permissions and ownership are shared via
hardlinks, so if a later update changes some file's owner or permissions,
then ALL the hardlinked copies of that file will also receive the same
change.  This is the only caveat with a hardlinked snapshot system.

The following is a simple script to make a snapshot of an existing copy
(today, calling it yesterday) before updating the current backup directory.
You may want to change some of the options, especially on the rsync command.

   cp -al today `/bin/date -I -d yesterday`
   rsync -a --delete fileserver:/home today

Note that "cp -l" alone will not copy a directory tree; the "-a" adds the
recursion (and attribute preservation) needed to hard-link the whole tree.

Also look at the option  --compare-dest=DIR
Though I think the hard link option may be better.

Rsync backup snapshot scheme as originally pieced together by Mike Rubel
  http://www.mikerubel.org/computers/rsync_snapshots/
Better documentation of the technique
  http://www.sanitarium.net/golug/rsync_backups_2010.html
And taken to a higher level with
  http://rsnapshot.org/

------------------
NOTE: The above makes a snapshot BEFORE doing the synchronization.
As the directories are all hardlinks, you can make multiple snapshot
'cycles'.  For this I found it is better to do it immediately after you have
synced the primary 'current' backup.  That is, do the backup, then create
all the snapshot cycles such as 'daily', 'weekly' or 'monthly' snapshots of
the 'current' backup, as appropriate.

For example...  This creates a single snapshot immediately after the
'current' primary backup directory has been updated (sync'ed)

  rsync -a --delete $HOME backup_server:current
  ssh backup_server cp -al current snap_`date +%F_%H%M`

The 'cp' command can be a more complex script that performs multiple backup
cycles to preserve hourly/daily/monthly/yearly backup cycles, removing
older ones as appropriate.

-------------------------------------------------------------------------------
WARNING: Renaming directories breaks rsync hardlinks!

If you rename a directory in your home, the backups will break all the
hardlinks between what are essentially the same unchanged files within
those directories.  This means an rsync backup will then result in a
massive increase in the disk space needed for the backup.

For example...  Say I have a sub-directory called 'my_photos' which has
been backed up by an rsync backup program.  Then I rename that directory to
'photos_2009'.  Later when the rsync backup makes a new 'snapshot' it sees
the old directory deleted, and a new directory created.  It does not see it
as a 'renaming', and so breaks all the hardlinks.

The result is that the hardlinks between the old and new snapshots for all
the files are now broken.  As a whole directory of photos (possibly
containing many sub-directories of photos) can be very very large, suddenly
your backup uses vast amounts of disk space.  This happened to me,
eventually causing me to run out of disk space on the backup server, even
though I did not have any new data!

If the backups continue for some time like this, new snapshots will be
created with hardlinks to the files' new location.
That means you can end up with two copies of the files, each with its own
collection of multiple hardlinks between snapshots.  This can make fixing
the hardlinking between files really difficult.  The problem is that simply
deleting one file and relinking the file to the other hardlink 'group' does
not actually free up any disk space.  That will only happen if ALL the
files in the 'hardlink group' get re-linked to the other hardlink group.  A
very complex business, and not one you would want to do by hand.

----
One solution to this has been a C program that re-links similar files,
called 'hardlink'
  http://helmut.hullen.de/filebox/Linux/slackware/ap/hardlink-1.2-i486-1hln.tgz
I downloaded and looked at this, and the program has no documentation, no
comments, and is very difficult to decode to figure out how exactly it does
the job.  However it does not deal with the situation of there being two
'hardlink groups' of files, and properly sorting out that mess.

Other solutions I have seen
  http://linux.die.net/man/1/hardlink
Again, this program makes no mention of how it would handle 'hardlink
groups'.

Eventually I wrote a perl script called "linkdups" that will actually
understand and re-link hardlink groups.  It keeps track of all the files in
a group, and when it finds the same file in two different groups it fixes
ALL the hardlinks to form one single hardlink group.  It is available at
  https://antofthy.gitlab.io/software/#linkdups

It also has many options to provide better control of the re-linking
process, such as how to handle file permissions.  The script only does a
full comparison of files that actually have the same length, making it very
fast in comparing large numbers of files.  In other words it is much more
hardlink savvy than most other similar programs.  Finally it tries to make
intelligent choices about how permissions and date stamps should be merged
between the two hardlinked groups being merged.

Comments and suggestions are welcome.
ASIDE: Unison (a bidirectional rsync) can see renamed/moved files!
But it does not appear suitable for creating hardlinked backups.

WARNING WARNING WARNING

Re-linking an actual home directory (compared to relinking backup
snapshots) is dangerous and can have unforeseen consequences, in terms of
configuration files, and SVN/GIT repository copies.  Do not relink the
files of your working home.  See the CAVEAT section of the script's
built-in documentation.

-------------------------------------------------------------------------------
My Backup Scheme "home_backup"  (vs "rsnapshot")

A copy of my rsync backup scripts...
  https://antofthy.gitlab.io/software/#home_backup

My version stores things slightly differently to rsnapshot, mostly in that
it handles the snapshot cycles in a completely separate step to the backup
process.  However it has features that make it very useful, and which are
not provided by rsnapshot.

BACKUP...

The actual backup is just that.  A complete backup of my home to a separate
directory on a separate machine...
   user@remote:backup/current
That part is all it does, nothing else.

The "backup" directory is where things are stored, and "current" is the
most recent 'copy' of my primary home, regardless of when the last backup
was made, be it yesterday, or in the last hour.

This is a "push" backup from my primary home, and as such you can make
additional backups to any number of remote accounts or mounted USB hard
drives (two or more backup locations is a good idea as a further disaster
recovery technique).  It could be made to be a "pull" backup, but I prefer
a "push", so that I can control it from my main home.

RELINK...

A 're-link' of the "current" backup against previous hourly and daily
snapshots is then performed, so as to handle the possibility of a
directory/file being renamed, moved, or restored, and thus improve the
hardlinked disk space saving.  This uses my "linkdups" script, see above.
An occasional relink over ALL snapshots is also recommended, but not vital.
This will relink files that changed but were then restored or put back to
normal, something that a normal relink may not find.

SNAPSHOT CYCLES...

Once the current backup is complete, a separate step creates the hardlinked
'snapshots'.  That is, I create multiple 'snapshot' cycles, not just daily,
but also weekly, monthly and yearly!  This is done by 'rolling' the
"current" directory into various cycles (removing the oldest snapshot if
needed).

I found it is a lot easier to do this as a separate step AFTER the backup
has been completed.  Most other rsync snapshot systems do this before the
rsync backup, creating a single 'daily' backup cycle.  To create multiple
cycles they create separate backups for each cycle, doubling or tripling
disk space usage.  At this time I have five such cycles: hourly, daily,
weekly, monthly, yearly.

While this 'rolling' is launched after the rsync backup is complete, it can
actually be done at any time, and in many different ways.  In my case I
designed it so that each 'cycle' is rolled and a new snapshot is made, only
when any of three conditions exist...

  1/ This cycle does not exist yet.           -- start a new cycle immediately
  2/ It is a specific time or day.            -- a preferred snapshot cycle time
  3/ Last snapshot is older than cycle time.  -- it missed a snapshot roll

The first rule is obvious: create it.  The second means that it will prefer
to make snapshots at a specific time: 6am for daily, Monday morning for
weekly, or the 1st day of the month for monthly, etc.  While the third rule
means that if that 'preferred' time passes and the snapshot did not happen
(machine down or network issues), it just makes one when next run.

This means that when the backup is performed the cycles are updated only as
needed, but the snapshot cycles will still work even when the backup
program runs haphazardly.  That is, a cycle will be rolled if the preferred
time/day was missed for some reason.
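The three roll conditions above can be sketched as a small shell test.  This
is an illustration only: the function name, its arguments, and the glob
patterns are all invented here; the real logic lives in the author's 'roll'
script.

```shell
#!/bin/sh
# Usage: should_roll <snapshot_glob> <preferred_hour> <max_age_seconds>
# Returns success (0) when a new snapshot should be made for this cycle.
should_roll() {
    pref_hour=$2 max_age=$3
    set -- $1                                     # expand the snapshot glob
    # $1 is now the first (lowest numbered, i.e. newest) matching snapshot.
    [ ! -e "$1" ] && return 0                     # 1/ cycle does not exist yet
    [ "$(date +%H)" = "$pref_hour" ] && return 0  # 2/ the preferred time
    age=$(( $(date +%s) - $(stat -c '%Y' "$1") ))
    [ "$age" -gt "$max_age" ] && return 0         # 3/ last snapshot too old
    return 1                                      # otherwise leave it alone
}

# Example: a daily cycle, preferred at 06:00, rolled anyway after 24 hours.
should_roll 'day_01_*' 06 86400 && echo "roll the daily cycle"
```

Each cycle (hourly, daily, weekly, ...) would call such a test with its own
glob, preferred time, and maximum age before the actual 'roll' is done.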
This means scheduling does not actually need to be continuous, or regular.
An hourly flagged backup for example can easily be scheduled to be every
two hours, or four hours, or only during work-time hours (8am to 6pm -
which is what I do).  I could even do a current backup every 15 minutes, or
perhaps even continuously, and the hourly cycle will still remain an hourly
cycle, with the specified period, and at the preferred times.

Note that the preferred time rules probably should be aborted if a snapshot
roll was made recently.  For example, don't do a daily snapshot at the
preferred time if the last daily snapshot was just a couple of hours ago,
due to a missed update.

This way if a snapshot cycle run was not performed for some time,
appropriate snapshots will be made immediately, but will migrate with
slightly longer cycle periods (3rd rule), until the preferred time is
reached.  This avoids getting two snapshots in a cycle very close together.
It also means that if say an hourly snapshot took longer than 30 minutes
the next hourly snapshot will be aborted!  As such snapshot intervals will
probably need some tweaking to get them 'just right' (not too soon, and not
too late).

SNAPSHOT NAMING...

Simply using an incrementing number (such as in rsnapshot) is not very
useful, especially in debugging a backup program that is going wrong in
some way.  For example, this is difficult to debug!

  hourly.0
  hourly.1

I wanted to be able to look at the backup and know the backup dates
directly, while still having the directory names sort and cycle correctly.
That is, to know which is more recent, and which is the next to be removed.

Here is a listing of my backup directory...
  Backup_Summery             day_14_2012-06-08_0606/   mnth_3_2012-04-01_0605/
  current/                   day_15_2012-06-07_0606/   mnth_4_2012-03-01_0605/
  current_2012-06-21_1106@   home_backup_prep*         mnth_5_2012-02-01_0605/
  day_01_2012-06-21_0606/    home_backup_roll*         mnth_6_2012-01-01_0605/
  day_02_2012-06-20_0606/    home_backup_summ*         mnth_7_2011-12-01_0609/
  day_03_2012-06-19_0606/    hour_1_2012-06-21_1106/   mnth_8_2011-11-01_0609/
  day_04_2012-06-18_0608/    hour_2_2012-06-21_1006/   mnth_9_2011-10-01_0608/
  day_05_2012-06-17_0606/    hour_3_2012-06-21_0907/   week_1_2012-06-18_0608/
  day_06_2012-06-16_0606/    hour_4_2012-06-21_0806/   week_2_2012-06-11_0608/
  day_07_2012-06-15_0606/    hour_5_2012-06-20_1806/   week_3_2012-06-04_0608/
  day_08_2012-06-14_0606/    hour_6_2012-06-20_1706/   week_4_2012-05-28_0608/
  day_09_2012-06-13_0910/    hour_7_2012-06-20_1606/   week_5_2012-05-21_0608/
  day_10_2012-06-12_0606/    hour_8_2012-06-20_1506/   week_6_2012-05-14_0608/
  day_11_2012-06-11_0608/    hour_9_2012-06-20_1406/   week_7_2012-05-07_0608/
  day_12_2012-06-10_0606/    mnth_1_2012-06-01_0606/   week_8_2012-04-30_0607/
  day_13_2012-06-09_0606/    mnth_2_2012-05-01_0606/   week_9_2012-04-23_0607/

As you can see each backup not only has an increment number, but also the
date and time.  It is easy to see how old each backup is, and if some
problem has occurred.

For example, the "day_09" snapshot roll happened late on that day.  In this
case my main computer had been turned off until I came in that day.  As
such the daily snapshot was made from the first hourly backup/snapshot that
was performed, rather than at its normal preferred time.

Technically you don't need the incrementing number, as the filenames will
sort correctly alphabetically (just in reverse order), and you can still
remove older files based on file counts, or how old they are.  But I find
the incrementing number useful for referring to specific backups, or for
comparing and restoration.  For example, restoring from the snapshot 2 days
ago is easier when you can just jump to the 'day_02_*' directory.
As such I still prefer to have them in the naming scheme.

Future Possibility:  At this time my rolling script does not gracefully
handle situations where the incrementing number has gaps or duplicates.  It
would be nice if when rolling backups the scheme ignored that number, and
automatically renumbered the filenames based on the date component, so as
to 'close the gap'.

Note the "current_*" file is just a symbolic link to "current" so I can see
what time the last backup was made.  In the above, that backup is also
normally hardlink copied as the last hourly snapshot.  This has minimal
cost in terms of disk space.

The 'roll' script is designed to make it easy to modify and configure new
cycles, or naming schemes.  Thus it can be quite easy to change, to make
weekly, fortnightly, or bi-weekly snapshots, or to add yearly, half-yearly,
or quarterly snapshots, if desired.

SUMMARY OF FEATURES...
  * The backups are 'pushed' to remote accounts, or even a local USB mount.
  * You can have multiple, separate backup 'storage' locations.
  * Re-linking is performed to handle large scale directory renaming.
  * The backup and the snapshot cycles are treated as separate tasks.
  * No need to figure out what 'cycle' the current backup is for (just do it).
  * There is a 'preferred time' to make a snapshot for each cycle.
  * But it recovers if the preferred time is missed (previous snapshot is old).
  * No need to perform a separate backup for different cycles.
  * Better naming of the snapshots, so you can see problems or gaps.
  * You can easily specify a snapshot, such as say 2 days ago.
  * Snapshot rolling is designed to be easily modified in the 'roll' script.

These features make great additions to any backup scheme.
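The 'rolling' of a single cycle described above can be sketched as follows.
This is an illustration only, not the author's 'roll' script: the cycle
length, names, and layout are all invented for the example.

```shell
#!/bin/sh
# Minimal sketch: roll a "day" cycle by bumping each snapshot's increment
# number, then hardlink-copying "current" as the new day_01 snapshot.
set -e
cd "$(mktemp -d)"

# Fake a current backup and two existing daily snapshots.
mkdir current; echo data > current/file
mkdir day_01_2012-06-20_0606 day_02_2012-06-19_0606

max=3    # keep at most this many daily snapshots

# Remove any snapshot that would fall off the end of the cycle.
for d in day_$(printf '%02d' "$max")_*; do
    [ -e "$d" ] || continue
    rm -rf "$d"
done

# Renumber from the highest down, so names never collide.
i=$((max - 1))
while [ "$i" -ge 1 ]; do
    for d in $(printf 'day_%02d_*' "$i"); do
        [ -d "$d" ] || continue
        mv "$d" "$(printf 'day_%02d_%s' $((i + 1)) "${d#day_??_}")"
    done
    i=$((i - 1))
done

# Hardlink copy of 'current' becomes the newest snapshot.
cp -al current "day_01_$(date +%F_%H%M)"
ls -d day_*
```

The date component of each name is untouched by the renumbering, which is
what makes the "day_09 rolled late" kind of problem visible in a listing.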
-------------------------------------------------------------------------------
Comparing snapshot directories

Directory Disk Usage changes (hardlinked directories only)

  cd ~/backup
  du current {day,week,mnth,year}_* | sed '/\tcurrent/d' | xdiskusage - &

-------
Directory Listing Compare

  compare_ls mnth_{4,2}_*/archive
OR
  ls_dir() {  # get listing with size, date and full path
    find "$1" -type f -ls |
      sed 's/^.\{27\}\(.\{4\}\).\{20\}\(.\{21\}\)[^/]*\//\1\2/;' |
      sort -k6   # sort result
  }
  diffuse -w <(ls_dir mnth_4_*) <(ls_dir mnth_2_*)

-------
Dedicated perl script, based on Kevin Korb's script "diff_backup.pl" from
  http://www.sanitarium.net/unix_stuff/Kevin%27s%20Rsync%20Backups/
Produces almost identical output!

  home_backup_changes  day_{06,05}_*

But it only shows moved/renamed files as being deleted and re-added, when
in reality there is no disk usage change, just re-organization.

cmpdir  from rsync-time-machine
  https://github.com/infinet/rsync-time-machine/
another, very simplistic, rsync backup program.  The program is compiled,
making it difficult to use across systems.

-------
Using an rsync dry run, ignoring directory changes, but still very verbose

  rsync -aHin day_{06,05}_*/ 2>&1 | grep -v '^\.d'

It also reports when a file was moved/renamed (and its hardlink re-linked).
Example...

  rsync -aHin day_{06,05}_*/ 2>&1 | grep -v '^\.d'
  ...
  >f..t...... File_Changed
  >f+++++++++ File_Removed
  ...

You will need to reverse the order of the arguments to get files that were
added.

-------
But how much disk space did these changes involve?

Forward DU - Files Removed

  du -s `ls -d day_{05,06}_*/store/minecraft/world_saves/Flatland/`
  16728   day_05_2016-05-22_0612/store/minecraft/world_saves/Flatland/
  7140    day_06_2016-05-21_0612/store/minecraft/world_saves/Flatland/

The first line is the size of the newer (05) snapshot.  The second
indicates the total disk space of files that were removed.  (Note "ls"
sorts the names, so the newer "05" snapshot is listed first, and "du" then
attributes the shared hardlinked files to that first argument.)
Reverse DU - Files Added

  du -s `ls -dr day_{06,05}_*/store/minecraft/world_saves/Flatland/`
  16728   day_06_2016-05-21_0612/store/minecraft/world_saves/Flatland/
  7140    day_05_2016-05-22_0612/store/minecraft/world_saves/Flatland/

The first line is the size of the older (06) snapshot.  The second
indicates the total disk space of files that were added.

From this we see the size of the directory did not change (17Mb).  But
about 7 Mbytes of files were modified, breaking the hardlinks between them.

-------------------------------------------------------------------------------
Large "disk image" files, or Database Data Files

These files can be large and possibly 'holey'.  That is, they can contain
'zero blocks' that do not actually need disk space to store.  Also when
such files change, often only small isolated parts of the file change,
while the overall length does not.

Because of this, while normal rsync will only copy the changed parts, it
will still break the hardlink for this changed large file, forcing rsync
snapshots to create multiple copies -- of the completely filled file.  The
same happens for 'log' type files, such as the email "mbox" file format,
where changes are only appended to the 'end' of the file.

Run this over 20 to 30 snapshots (typical in an rsync backup) and you end
up with a huge backup repository, with little to no disk savings.  As such
'rsync snapshotting' is NOT appropriate for databases.

--
A perl rsync program "StoreBackup"
  http://www.nongnu.org/storebackup/
has some method of backing up these files in pieces, so that only the
changed pieces will be updated and break the hard link.  See its
explanation at
  http://www.nongnu.org/storebackup/node36.html

Research needed

-------------------------------------------------------------------------------
Snapshotting to USB sticks....
Rsync only replaces files on the destination (breaking any hardlinked
copies) if a file's data changes, which is why you can create large numbers
of 'snapshots' (even once an hour) using very little disk space.

Such rsync backups are not compressed, which allows each snapshot to look
almost exactly like a simple full working copy of the directories that were
backed up.  That is, it is easy to search, and access any file in any
snapshot.  You do not have to search multiple incremental compressed backup
files just to recover a specific bit of data, perhaps without knowing the
exact filename that data is in.  Just search for it directly as you
normally would, across all the snapshots.

It is the hard linking of unchanged files that gives an rsync
multi-snapshot backup method such good 'compression'.  However hardlinks
only work on the same disk storage mount, so each USB would have to have at
least one full copy of the files being backed up.  As such it should be at
least twice the size of the data being backed up.

WARNING: As rsync snapshots require hardlinks you cannot use the default
VFAT or other windows filesystems on the USB; you will need to replace it
with ext4, xfs, or some other UNIX/Linux filesystem.  As such the snapshots
are not directly available to windows without extra software.

-------------------------------------------------------------------------------
Snapshots and Cloud Filesystems (like dropbox)

The use of a cloud based filesystem (like dropbox) also precludes the use
of hardlinking unchanged files.  As such, snapshotting to such a cloud
filesystem does not compress well, as you do not get hardlink sharing of
files across individual snapshots.

Some of these file systems can share identical files internally, but then
they will need to be able to read the files, creating a privacy issue.

-------------------------------------------------------------------------------
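Relating to the USB filesystem warning above: before pointing a snapshot
backup at a new mount, it is worth probing whether its filesystem supports
hardlinks at all.  A hedged sketch (the function name is invented, and here
it just tests the current directory rather than a real USB mount point):

```shell
#!/bin/sh
# Probe whether a directory's filesystem supports hardlinks, by actually
# trying to make one on a throwaway probe file.
supports_hardlinks() {
    probe="$1/.hardlink_probe.$$"
    echo test > "$probe" || return 1
    if ln "$probe" "$probe.link" 2>/dev/null; then
        rm -f "$probe" "$probe.link"
        return 0
    fi
    rm -f "$probe"
    return 1
}

if supports_hardlinks .; then
    echo "hardlinks OK - usable for rsync snapshots"
else
    echo "no hardlinks (VFAT?) - reformat as ext4/xfs first"
fi
```

An actual probe (rather than checking the filesystem type name) also covers
network and fuse mounts whose hardlink support is hard to predict.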