-------------------------------------------------------------------------------
Rsync Backups, and Snapshotting
-------------------------------------------------------------------------------
Making incremental backups (snapshots) with rsync...

It is actually very easy to use rsync to create multiple snapshots of a
backup.  All you have to do is create a hard-linked copy of the backup tree
before (or after) you run rsync.

Each snapshot will have a hardlinked copy of the files, so each snapshot
uses very little extra disk space.  Then when a file changes, rsync will
unlink and recreate that file, while leaving the old version of the changed
file in older hardlinked snapshots untouched.

This means you can have many 'snapshots' of your backup (hours, days, etc),
each with a copy of the files as they existed at the time the snapshot was
made, with only the changed files using extra disk space.  A very efficient
use of disk space for a snapshot system.

However be warned that file permissions and ownership are shared via
hardlinks, so if a later update changes some file's owner or permissions,
then ALL the hardlinked copies of that file will also receive the same
change.  This is the only caveat with a hardlinked snapshot system.

The following is a simple script to make a snapshot of an existing copy
(today, calling it yesterday) before updating the current backup directory.
You may want to change some of the options, especially on the rsync command.

   cp -al today `/bin/date -I -d yesterday`
   rsync -a --delete fileserver:/home today

Note that "cp -l" alone will not copy a directory tree; the "-a" adds the
recursion (and attribute preservation) needed to hard-link the whole tree.

Also look at the option  --compare-dest=DIR
Though I think the hard link option may be better.

Rsync backup snapshot scheme as originally pieced together by Mike Rubel
  http://www.mikerubel.org/computers/rsync_snapshots/
Better documentation of the technique
  http://www.sanitarium.net/golug/rsync_backups_2010.html
And taken to a higher level with
  http://rsnapshot.org/

------------------
NOTE: The above makes a snapshot BEFORE doing the synchronization.
As the directories are all hardlinks, you can make multiple snapshot
'cycles'.  For this I found it is better to do it immediately after you have
synced the primary 'current' backup.  That is, do the backup, then create
all the snapshot cycles such as 'daily', 'weekly' or 'monthly' snapshots of
the 'current' backup, as appropriate.

For example...  This creates a single snapshot immediately after the
'current' primary backup directory has been updated (sync'ed)

  rsync -a --delete $HOME backup_server:current
  ssh backup_server cp -al current snap_`date +%F_%H%M`

The 'cp' command can be a more complex script that performs multiple backup
cycles to preserve hourly/daily/monthly/yearly backup cycles, removing
older ones as appropriate.

-------------------------------------------------------------------------------
WARNING: Renaming directories breaks rsync hardlinks!

If you rename a directory in your home, the backups will break all the
hardlinks between what are essentially the same unchanged files within
those directories.  This means an rsync backup will then result in a
massive increase in the disk space needed for the backup.

For example...  Say I have a sub-directory called 'my_photos' which has
been backed up by an rsync backup program.  Then I rename that directory to
'photos_2009'.  Later when the rsync backup makes a new 'snapshot' it sees
the old directory deleted, and a new directory created.  It does not see it
as a 'renaming', and so breaks all the hardlinks.

The result is that the hardlinks between the old and new snapshots for all
the files are now broken.  As a whole directory of photos (possibly
containing many sub-directories of photos) can be very very large, suddenly
your backup uses vast amounts of disk space.  This happened to me,
eventually causing me to run out of disk space on the backup server, even
though I did not have any new data!

If the backups continue for some time like this, new snapshots will be
created with hardlinks to the files' new location.
That means you can end up with two copies of the files, each with its own
collection of multiple hardlinks between snapshots.  This can make fixing
the hardlinking between files really difficult.  The problem is that simply
deleting one file and relinking the file to the other hardlink 'group' does
not actually free up any disk space.  That will only happen if ALL the
files in the 'hardlink group' get re-linked to the other hardlink group.  A
very complex business, and not one you would want to do by hand.

----
One solution to this has been a C program that re-links similar files,
called 'hardlink'
  http://helmut.hullen.de/filebox/Linux/slackware/ap/hardlink-1.2-i486-1hln.tgz
I downloaded and looked at this, and the program has no documentation, no
comments, and is very difficult to decode to figure out how exactly it does
the job.  However it does not deal with the situation of there being two
'hardlink groups' of files, and properly sorting out that mess.

Other solutions I have seen
  http://linux.die.net/man/1/hardlink
Again, this program makes no mention of how it would handle 'hardlink
groups'.

Eventually I wrote a perl script called "linkdups" that will actually
understand and re-link hardlink groups.  It keeps track of all the files in
a group, and when it finds the same file in two different groups it fixes
ALL the hardlinks to form one single hardlink group.  It is available at
  https://antofthy.gitlab.io/software/#linkdups

It also has many options to provide better control of the re-linking
process, such as how to handle file permissions.  The script only does a
full comparison of files that actually have the same length, making it very
fast in comparing large numbers of files.  In other words it is much more
hardlink savvy than most other similar programs.  Finally it tries to make
intelligent choices about how permissions and date stamps should be merged
between the two hardlinked groups being merged.

Comments and suggestions are welcome.
ASIDE: Unison (a bidirectional rsync) can see renamed/moved files!
But it does not appear suitable for creating hardlinked backups.

WARNING WARNING WARNING

Re-linking an actual home directory (compared to relinking backup
snapshots) is dangerous and can have unforeseen consequences, in terms of
configuration files, and SVN/GIT repository copies.  Do not relink the
files of your working home.  See the CAVEAT section of the script's
built-in documentation.

-------------------------------------------------------------------------------
My Backup Scheme "home_backup"  (vs "rsnapshot")

A copy of my rsync backup scripts...
  https://antofthy.gitlab.io/software/#home_backup

My version stores things slightly differently to rsnapshot, mostly in that
it handles the snapshot cycles in a completely separate step to the backup
process.  However it has features that make it very useful, and which are
not provided by rsnapshot.

BACKUP...

The actual backup is just that.  A complete backup of my home to a separate
directory on a separate machine...
   user@remote:backup/current
That part is all it does, nothing else.

The "backup" directory is where things are stored, and "current" is the
most recent 'copy' of my primary home, regardless of when the last backup
was made, be it yesterday, or in the last hour.

This is a "push" backup from my primary home, and as such you can make
additional backups to any number of remote accounts or mounted USB hard
drives (two or more backup locations is a good idea as a further disaster
recovery technique).  It could be made to be a "pull" backup, but I prefer
a "push", so that I can control it from my main home.

RELINK...

A 're-link' of the "current" backup against previous hourly and daily
snapshots is then performed, so as to handle the possibility of a
directory/file being renamed, moved, or restored, and thus improve the
hardlinked disk space saving.  This uses my "linkdups" script, see above.
An occasional relink over ALL snapshots is also recommended, but not vital.
This will relink files that changed but were then restored or put back to
normal, something that a normal relink may not find.

SNAPSHOT CYCLES...

Once the current backup is complete, a separate step creates the hardlinked
'snapshots'.  That is, I create multiple 'snapshot' cycles, not just daily,
but also weekly, monthly and yearly!  This is done by 'rolling' the
"current" directory into various cycles (removing the oldest snapshot if
needed).

I found it is a lot easier to do this as a separate step AFTER the backup
has been completed.  Most other rsync snapshot systems do this before the
rsync backup, creating a single 'daily' backup cycle.  To create multiple
cycles they create separate backups for each cycle, doubling or tripling
disk space usage.  At this time I have five such cycles: hourly, daily,
weekly, monthly, yearly.

While this 'rolling' is launched after the rsync backup is complete, it can
actually be done at any time, and in many different ways.  In my case I
designed it so that each 'cycle' is rolled and a new snapshot is made, only
when any of three conditions exist...

  1/ This cycle does not exist yet.           -- start a new cycle immediately
  2/ It is a specific time or day.            -- a preferred snapshot cycle time
  3/ Last snapshot is older than cycle time.  -- it missed a snapshot roll

The first rule is obvious: create it.  The second means that it will prefer
to make snapshots at a specific time: 6am for daily, Monday morning for
weekly, or the 1st day of the month for monthly, etc.  While the third rule
means that if that 'preferred' time passes and the snapshot did not happen
(machine down or network issues), it just makes one when next run.

This means that when the backup is performed the cycles are updated only as
needed, but the snapshot cycles will still work even when the backup
program runs haphazardly.  That is, a cycle will be rolled if the preferred
time/day was missed for some reason.
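The three roll conditions above can be sketched as a small shell test.  This
is an illustration only: the function name, its arguments, and the glob
patterns are all invented here; the real logic lives in the author's 'roll'
script.

```shell
#!/bin/sh
# Usage: should_roll <snapshot_glob> <preferred_hour> <max_age_seconds>
# Returns success (0) when a new snapshot should be made for this cycle.
should_roll() {
    pref_hour=$2 max_age=$3
    set -- $1                                     # expand the snapshot glob
    # $1 is now the first (lowest numbered, i.e. newest) matching snapshot.
    [ ! -e "$1" ] && return 0                     # 1/ cycle does not exist yet
    [ "$(date +%H)" = "$pref_hour" ] && return 0  # 2/ the preferred time
    age=$(( $(date +%s) - $(stat -c '%Y' "$1") ))
    [ "$age" -gt "$max_age" ] && return 0         # 3/ last snapshot too old
    return 1                                      # otherwise leave it alone
}

# Example: a daily cycle, preferred at 06:00, rolled anyway after 24 hours.
should_roll 'day_01_*' 06 86400 && echo "roll the daily cycle"
```

Each cycle (hourly, daily, weekly, ...) would call such a test with its own
glob, preferred time, and maximum age before the actual 'roll' is done.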
This means scheduling does not actually need to be continuous, or regular.
An hourly flagged backup for example can easily be scheduled to be every
two hours, or four hours, or only during work-time hours (8am to 6pm -
which is what I do).  I could even do a current backup every 15 minutes, or
perhaps even continuously, and the hourly cycle will still remain an hourly
cycle, with the specified period, and at the preferred times.

Note that the preferred time rules probably should be aborted if a snapshot
roll was made recently.  For example, don't do a daily snapshot at the
preferred time if the last daily snapshot was just a couple of hours ago,
due to a missed update.

This way if a snapshot cycle run was not performed for some time,
appropriate snapshots will be made immediately, but will migrate with
slightly longer cycle periods (3rd rule), until the preferred time is
reached.  This avoids getting two snapshots in a cycle very close together.
It also means that if say an hourly snapshot took longer than 30 minutes
the next hourly snapshot will be aborted!  As such snapshot intervals will
probably need some tweaking to get them 'just right' (not too soon, and not
too late).

SNAPSHOT NAMING...

Simply using an incrementing number (such as in rsnapshot) is not very
useful, especially in debugging a backup program that is going wrong in
some way.  For example, this is difficult to debug!

  hourly.0
  hourly.1

I wanted to be able to look at the backup and know the backup dates
directly, while still having the directory names sort and cycle correctly.
That is, to know which is more recent, and which is the next to be removed.

Here is a listing of my backup directory...
  Backup_Summery             day_14_2012-06-08_0606/   mnth_3_2012-04-01_0605/
  current/                   day_15_2012-06-07_0606/   mnth_4_2012-03-01_0605/
  current_2012-06-21_1106@   home_backup_prep*         mnth_5_2012-02-01_0605/
  day_01_2012-06-21_0606/    home_backup_roll*         mnth_6_2012-01-01_0605/
  day_02_2012-06-20_0606/    home_backup_summ*         mnth_7_2011-12-01_0609/
  day_03_2012-06-19_0606/    hour_1_2012-06-21_1106/   mnth_8_2011-11-01_0609/
  day_04_2012-06-18_0608/    hour_2_2012-06-21_1006/   mnth_9_2011-10-01_0608/
  day_05_2012-06-17_0606/    hour_3_2012-06-21_0907/   week_1_2012-06-18_0608/
  day_06_2012-06-16_0606/    hour_4_2012-06-21_0806/   week_2_2012-06-11_0608/
  day_07_2012-06-15_0606/    hour_5_2012-06-20_1806/   week_3_2012-06-04_0608/
  day_08_2012-06-14_0606/    hour_6_2012-06-20_1706/   week_4_2012-05-28_0608/
  day_09_2012-06-13_0910/    hour_7_2012-06-20_1606/   week_5_2012-05-21_0608/
  day_10_2012-06-12_0606/    hour_8_2012-06-20_1506/   week_6_2012-05-14_0608/
  day_11_2012-06-11_0608/    hour_9_2012-06-20_1406/   week_7_2012-05-07_0608/
  day_12_2012-06-10_0606/    mnth_1_2012-06-01_0606/   week_8_2012-04-30_0607/
  day_13_2012-06-09_0606/    mnth_2_2012-05-01_0606/   week_9_2012-04-23_0607/

As you can see each backup not only has an increment number, but also the
date and time.  It is easy to see how old each backup is, and if some
problem has occurred.

For example, the "day_09" snapshot roll happened late on that day.  In this
case my main computer had been turned off until I came in that day.  As
such the daily snapshot was made from the first hourly backup/snapshot that
was performed, rather than at its normal preferred time.

Technically you don't need the incrementing number, as the filenames will
sort correctly alphabetically (just in reverse order), and you can still
remove older files based on file counts, or how old they are.  But I find
the incrementing number useful for referring to specific backups, or for
comparing and restoration.  For example, restoring from the snapshot 2 days
ago is easier when you can just jump to the 'day_02_*' directory.
As such I still prefer to have them in the naming scheme.

Future Possibility:  At this time my rolling script does not gracefully
handle situations where the incrementing number has gaps or duplicates.  It
would be nice if when rolling backups the scheme ignored that number, and
automatically renumbered the filenames based on the date component, so as
to 'close the gap'.

Note the "current_*" file is just a symbolic link to "current" so I can see
what time the last backup was made.  In the above, that backup is also
normally hardlink copied as the last hourly snapshot.  This has minimal
cost in terms of disk space.

The 'roll' script is designed to make it easy to modify and configure new
cycles, or naming schemes.  Thus it can be quite easy to change, to make
weekly, fortnightly, or bi-weekly snapshots, or to add yearly, half-yearly,
or quarterly snapshots, if desired.

SUMMARY OF FEATURES...
  * The backups are 'pushed' to remote accounts, or even a local USB mount.
  * You can have multiple, separate backup 'storage' locations.
  * Re-linking is performed to handle large scale directory renaming.
  * The backup and the snapshot cycles are treated as separate tasks.
  * No need to figure out what 'cycle' the current backup is for (just do it).
  * There is a 'preferred time' to make a snapshot for each cycle.
  * But it recovers if the preferred time is missed (previous snapshot is old).
  * No need to perform a separate backup for different cycles.
  * Better naming of the snapshots, so you can see problems or gaps.
  * You can easily specify a snapshot, such as say 2 days ago.
  * Snapshot rolling is designed to be easily modified in the 'roll' script.

These features make great additions to any backup scheme.
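The 'rolling' of a single cycle described above can be sketched as follows.
This is an illustration only, not the author's 'roll' script: the cycle
length, names, and layout are all invented for the example.

```shell
#!/bin/sh
# Minimal sketch: roll a "day" cycle by bumping each snapshot's increment
# number, then hardlink-copying "current" as the new day_01 snapshot.
set -e
cd "$(mktemp -d)"

# Fake a current backup and two existing daily snapshots.
mkdir current; echo data > current/file
mkdir day_01_2012-06-20_0606 day_02_2012-06-19_0606

max=3    # keep at most this many daily snapshots

# Remove any snapshot that would fall off the end of the cycle.
for d in day_$(printf '%02d' "$max")_*; do
    [ -e "$d" ] || continue
    rm -rf "$d"
done

# Renumber from the highest down, so names never collide.
i=$((max - 1))
while [ "$i" -ge 1 ]; do
    for d in $(printf 'day_%02d_*' "$i"); do
        [ -d "$d" ] || continue
        mv "$d" "$(printf 'day_%02d_%s' $((i + 1)) "${d#day_??_}")"
    done
    i=$((i - 1))
done

# Hardlink copy of 'current' becomes the newest snapshot.
cp -al current "day_01_$(date +%F_%H%M)"
ls -d day_*
```

The date component of each name is untouched by the renumbering, which is
what makes the "day_09 rolled late" kind of problem visible in a listing.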
-------------------------------------------------------------------------------
Comparing snapshot directories

Directory Disk Usage changes (hardlinked directories only)

  cd ~/backup
  du current {day,week,mnth,year}_* | sed '/\tcurrent/d' | xdiskusage - &

-------
Directory Listing Compare

  compare_ls mnth_{4,2}_*/archive
OR
  ls_dir() {  # get listing with size, date and full path
    find "$1" -type f -ls |
      sed 's/^.\{27\}\(.\{4\}\).\{20\}\(.\{21\}\)[^/]*\//\1\2/;' |
      sort -k6   # sort result
  }
  diffuse -w <(ls_dir mnth_4_*) <(ls_dir mnth_2_*)

-------
Dedicated perl script, based on Kevin Korb's script "diff_backup.pl" from
  http://www.sanitarium.net/unix_stuff/Kevin%27s%20Rsync%20Backups/
Produces almost identical output!

  home_backup_changes  day_{06,05}_*

But it only shows moved/renamed files as being deleted and re-added, when
in reality there is no disk usage change, just re-organization.

cmpdir  from rsync-time-machine
  https://github.com/infinet/rsync-time-machine/
another, very simplistic, rsync backup program.  The program is compiled,
making it difficult to use across systems.

-------
Using an rsync dry run, ignoring directory changes, but still very verbose

  rsync -aHin day_{06,05}_*/ 2>&1 | grep -v '^\.d'

It also reports when a file was moved/renamed (and its hardlink re-linked).
Example...

  rsync -aHin day_{06,05}_*/ 2>&1 | grep -v '^\.d'
  ...
  >f..t...... File_Changed
  >f+++++++++ File_Removed
  ...

You will need to reverse the order of the arguments to get files that were
added.

-------
But how much disk space did these changes involve?

Forward DU - Files Removed

  du -s `ls -d day_{05,06}_*/store/minecraft/world_saves/Flatland/`
  16728   day_05_2016-05-22_0612/store/minecraft/world_saves/Flatland/
  7140    day_06_2016-05-21_0612/store/minecraft/world_saves/Flatland/

The first line is the size of the newer (05) snapshot.  The second
indicates the total disk space of files that were removed.  (Note "ls"
sorts the names, so the newer "05" snapshot is listed first, and "du" then
attributes the shared hardlinked files to that first argument.)
Reverse DU - Files Added

  du -s `ls -dr day_{06,05}_*/store/minecraft/world_saves/Flatland/`
  16728   day_06_2016-05-21_0612/store/minecraft/world_saves/Flatland/
  7140    day_05_2016-05-22_0612/store/minecraft/world_saves/Flatland/

The first line is the size of the older (06) snapshot.  The second
indicates the total disk space of files that were added.

From this we see the size of the directory did not change (17Mb).  But
about 7 Mbytes of files were modified, breaking the hardlinks between them.

-------------------------------------------------------------------------------
Large "disk image" files, or Database Data Files

These files can be large and possibly 'holey'.  That is, they can contain
'zero blocks' that do not actually need disk space to store.  Also when
such files change, often only small isolated parts of the file change,
while the overall length does not.

Because of this, while normal rsync will only copy the changed parts, it
will still break the hardlink for this changed large file, forcing rsync
snapshots to create multiple copies -- of the completely filled file.  The
same happens for 'log' type files, such as the email "mbox" file format,
where changes are only appended to the 'end' of the file.

Run this over 20 to 30 snapshots (typical in an rsync backup) and you end
up with a huge backup repository, with little to no disk savings.  As such
'rsync snapshotting' is NOT appropriate for databases.

--
A perl rsync program "StoreBackup"
  http://www.nongnu.org/storebackup/
has some method of backing up these files in pieces, so that only the
changed pieces will be updated and break the hard link.  See its
explanation at
  http://www.nongnu.org/storebackup/node36.html

Research needed

-------------------------------------------------------------------------------
Snapshotting to USB sticks....
Rsync only replaces files on the destination (breaking any hardlinked
copies) if a file's data changes, which is why you can create large numbers
of 'snapshots' (even once an hour) using very little disk space.

Such rsync backups are not compressed, which allows each snapshot to look
almost exactly like a simple full working copy of the directories that were
backed up.  That is, it is easy to search, and access any file in any
snapshot.  You do not have to search multiple incremental compressed backup
files just to recover a specific bit of data, perhaps without knowing the
exact filename that data is in.  Just search for it directly as you
normally would, across all the snapshots.

It is the hard linking of unchanged files that gives an rsync
multi-snapshot backup method such good 'compression'.  However hardlinks
only work on the same disk storage mount, so each USB would have to have at
least one full copy of the files being backed up.  As such it should be at
least twice the size of the data being backed up.

WARNING: As rsync snapshots require hardlinks you cannot use the default
VFAT or other windows filesystems on the USB; you will need to replace it
with ext4, xfs, or some other UNIX/Linux filesystem.  As such the snapshots
are not directly available to windows without extra software.

-------------------------------------------------------------------------------
Snapshots and Cloud Filesystems (like dropbox)

The use of a cloud based filesystem (like dropbox) also precludes the use
of hardlinking unchanged files.  As such, snapshotting to such a cloud
filesystem does not compress well, as you do not get hardlink sharing of
files across individual snapshots.

Some of these file systems can share identical files internally, but then
they will need to be able to read the files, creating a privacy issue.

-------------------------------------------------------------------------------
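Relating to the USB filesystem warning above: before pointing a snapshot
backup at a new mount, it is worth probing whether its filesystem supports
hardlinks at all.  A hedged sketch (the function name is invented, and here
it just tests the current directory rather than a real USB mount point):

```shell
#!/bin/sh
# Probe whether a directory's filesystem supports hardlinks, by actually
# trying to make one on a throwaway probe file.
supports_hardlinks() {
    probe="$1/.hardlink_probe.$$"
    echo test > "$probe" || return 1
    if ln "$probe" "$probe.link" 2>/dev/null; then
        rm -f "$probe" "$probe.link"
        return 0
    fi
    rm -f "$probe"
    return 1
}

if supports_hardlinks .; then
    echo "hardlinks OK - usable for rsync snapshots"
else
    echo "no hardlinks (VFAT?) - reformat as ext4/xfs first"
fi
```

An actual probe (rather than checking the filesystem type name) also covers
network and fuse mounts whose hardlink support is hard to predict.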