-------------------------------------------------------------------------------
Methods of reading and handling multi-line records

EG: multi-line data with some type of record markers.

Most APIs read whole files of data into memory (e.g. CSV, JSON, etc).  But
this is not good for large scale pipelining of data, where you want to deal
with one or two records at a time.

This also applies to data being received from network pipelines, where you
need to process the data as it arrives, so you can respond to it.

Most algorithms are for perl, but apply equally well for many other languages.

-------------------------------------------------------------------------------
General UNIX multi-line record handlers.

Many UNIX commands can deal with record separators other than newlines.

    # Example input records...
    echo $'A\nB\nC\n===\nD\nE\nF\n===\nX\nY\nZ\n==='

  tac   (reverse records).

    echo $'A\nB\nC\n===\nD\nE\nF\n===\nX\nY\nZ\n===' |
      tac -s $'===\n'

  sort
    # sort will work with NUL separated records
    # So add NULs between records, sort, remove NULs
    # This for example also reverses the order of records
    #
    echo $'A\nB\nC\n===\nD\nE\nF\n===\nX\nY\nZ\n===' |
      perl -pe '$_ .= "\0" if /^===$/' |
        sort -r -z |
         tr -d '\0'

    # To place the separator before/after the record (start/end marker)
    # For example with an end marker...
    echo $'A\nB\nC\n===\nD\nE\nF\n===\nX\nY\nZ\n===' |
      perl -pe '$_ = "$_\0" if /^===$/' |
        sort -r -z |
         tr -d '\0'

    # WARNING sort -u (uniq) does not work with -z (nul separated)

-------------------------------------------------------------------------------
Record Reading and separation (perl)

Reading records as Lines, Paragraphs, or whole Files are relatively easy,
in just about any language.

But using a string as a record separator or marker is harder and comes in three
styles.


END of record marker...
    Perl handles this by default, for its handling of end-of-line, and blocks.
    You do not even need to re-add the separator, as it is automatically
    retained as part of the record, unless specifically removed.

    $/ = 'end-of-record-marker'

Inter-record marker...

    For example, commas between various records.  In this case the separator
    is, in general, not part of the record itself, and is to be completely
    ignored.  But it does need to be re-added between records on output.

    This is relatively easy to do, just treat the separator as an
    END of record separator, and delete it from incoming records
    if found.  Remember the last record will not have a marker.

    However when writing you need to output the separator, WITH CARE.
    Otherwise you could add an empty record at the start or end of the output.
    You can either add it to the start of every record except the first, or
    the end of every record except the last.  Skipping 'on first' is easier.

    See the next example, which is almost identical.


BEGIN record marker...
    This is tricky.  First, if you are not careful you may get a 'blank' record
    at the start of the file (perhaps file comments?) which needs to be
    ignored, or saved as a separate object.

    In this case you use the marker as an end marker, and clean up as
    appropriate.

    $/ = 'start-of-record-marker'

    In the example below the difference between the first and later records
    is solved by including a 'newline' with the "$BEGIN" pattern, and using
    it as an inter-record separator, then cleaning this up afterwards.

    For example...

      # Read Records, the first line starting with "Window "
      # Note there is no newline after the marker
      # but there may be a newline before it on second and later records!
      # This makes it all very tricky to deal with!

      xlsclients -l | perl -e '
        $BEGIN = "Window ";       # set the start-of-record marker
        local $/ = "\n$BEGIN";    # convert it to an inter-record separator
        while(<>) {
           s/$BEGIN\Z//;               # remove marker from end of record
           s/\A/$BEGIN/ unless $.==1;  # re-add start of record (except first)
           # ... deal with record ...
           print "#--- $. ---\n$_";    # output records with a count header
        }
        print "--- Total $. Windows ---\n";
        '

    If you cannot convert the start-of-record to an inter-record separator
    then you will need to use a flag to prevent reading a 'blank' record
    at the start.

    Remember that the separator also needs to be re-added to the start and
    removed from the end (except last).  And you may need an extra newline
    on second and later records.

Multiple record separators

    This gets worse if the separator could be one of a number of different
    strings (start record types?).  In that case you may need to keep a copy of
    the separator, so you can prepend it to the next record.

    Practical Example?

-------------------------------------------------------------------------------
Record/Line continuation.

In this case the record separator is a 'newline', but there are special cases,
where records may be continued over multiple lines.

Two common styles of line continuation...

-----
A '\' at the end of a line. (A not-end-of-record-marker)
(AKA the shell scripting command line continuation method)

  Examples:  /etc/hosts.allow

  Relatively easy: read in lines until the end of record is indicated.
  But the last record may not be marked correctly.

  BASH makes this easy by using read WITHOUT the -r flag.
  I use a sub-shell to disable bash's use of $PS2 as a continuation prompt.
    ( while read -p '' line; do
        # ... process lines ....
        echo 'LINE:' "$line"
      done
    )

  # Perl Script...
  # Append continuation lines (merging spaces with a single space).
  # It will also ignore whole line comments between continuation lines
  # and ignore comments after a '\' on the continuation line,
  # and continuation lines can be terminated by a blank line.
  #
  # Comments are NOT preserved.
  #
  # WARNING: It does not handle a line continuation at EOF!
  # To handle that you will need a subroutine call, so you can process
  # the incomplete (still continued) line again when EOF is reached.
  #
  my ($next, $line);
  while( $next = <DATA> ) {     # get next line (with continuations)
    next if $next =~ /^\s*#/;   # ignore pure comment lines completely
    $next =~ s/#.*$//;          # ignore comments at end of lines
    $next =~ s/^\s+//;          # remove start of line spaces
    $next =~ s/\s+$//;          # remove end of line spaces

    # join up multi line records
    $line = length $line ? "$line $next" : $next;
    next if $line =~ s/\s*\\$//;
    $_ = $line; $line = '';     # pass joined line, prepare for next one.

    next if /^$/;               # skip blank lines

    # ...process line...
    print "$_\n";
    next;

  }

  # Using a sub-routine:  do_line()
  perl -ne 'BEGIN {
              sub do_line{
                ...process record here...;
                print $l; $l="";
              }
            }
            $l.=$_; next if $l=~/\\$/; &do_line;

            END { &do_line }
           '  File


-----
White Space at start of next line, means previous line is continued
(AKA EMail Header line continuation)

  This is harder as you will need to 'pre-read' the next record, and hold it
  for the next loop.  Also at the end you need to still handle that record.

  Typical examples of this line continuation are
    * mail headers,
    * solaris "ifconfig -a"
    * multi-line "df" output
    * multi-line "quota" output

  The simple way is to pre-process each record to a single line.
  Though you may also need to pre-handle the mail header/body separation.
  Some of these do not seem to work well!

    cat mail | sed '/^$/q; /^[a-zA-Z]/{x;p;x;}'

    ifconfig -a | sed '/^[a-z]/{x;p;x;}'

    df | sed '/^[^ ]/{x;p;x;}'

    yum list installed | sed -e :a -e '$!N;s/\n\s\+/ /;ta' -e 'P;D'


  A more direct, but less simple, way is to use a function call for each line

    # Read in header collecting multi-line headers together into one variable
    while (<>) {
      if ( /^$/ ) {
        # end of headers - process last
        do_header($line) if defined $line;

        # process mail body
        while(<>) {   # just gobble up the rest of the input
           do_body($_);
        }
        exit 0;
      }
      if ( /^\s/ ) {
        # continued multi-line - concatenate
        $line .= $_;
        next;
      }
      # start of a new line -- process the old collected line
      do_header($line) if defined $line;

      # start collecting a new line
      $line = $_;
    }
    # input ended while still in the headers - process the last one
    do_header($line) if defined $line;

    sub do_header { my ($header) = @_; ... }   # process a mail header line
    sub do_body   { my ($body)   = @_; ... }   # process a body line

-------------------------------------------------------------------------------
Sub-Record in a Record.

The 'leading' record separator is used not only to separate individual
records, but also to separate sub-records within records.

An example: 'JSON' bookmarks files (firefox)
   {N:"dir",[{N:f1},{N:subdir,[{N:f2}]},{N:f3}]}
Where
   {...} is a record, and [...] is a sub-directory of more records

And we want to output
   DIR:dir
     Item:f1
     DIR:subdir
       Item:f2
     END
     Item:f3
   END

This was implemented in my script "hotlist_firefox_json" with the following
programming structure...

  $/='{';
  $level=0;
  while( <> ) {
    # a list assignment is only true if its pattern matched
    ( $name ) = /N:"(.*?[^\\])",/  or  ( $name ) = /N:(.*?),/;
    if ( /\[/ ) {
      # output directory type
      print "  "x$level, "DIR:$name\n";
      $level++;
      next;
    }

    # output normal record
    print "  "x$level, "Item:$name\n";

    # exit sub-directories
    if ( my (@count) = /]}/g ) {
      for ( @count ) {
        $level--;
        print "  "x$level, "END\n";
      }
    }
  }

NOTE the above assumes the characters "{", "[" and "]}" are unique,
and not found in any strings in the data.

It also assumes sub-records are always at the end of the directory record,
with no other fields following.

In other words it is a hack, but it does not require the directory record to be
held in memory until it is complete. -- A Simplification

This worked for firefox bookmarks, but not for google-chrome bookmarks.

-------------------------------------------------------------------------------