-------------------------------------------------------------------------------
Methods of reading and handling multi-line records

EG: multi-line data with some type of record markers.

Most APIs (EG: for CSV, JSON, etc) will read whole files of data into memory.
But this is not good for large scale pipelining of data, where you want to
deal with one or two records at a time.

This also applies to data being received from network pipelines, where you
need to process the data as it arrives, so you can respond to it.

Most algorithms are for perl, but apply equally well to many other languages.

-------------------------------------------------------------------------------
General UNIX multi-line record handlers.

Many UNIX commands can deal with a separator other than just newlines.

  # Example input records...
  echo $'A\nB\nC\n===\nD\nE\nF\n===\nX\nY\nZ\n==='

tac  (reverse records).

  echo $'A\nB\nC\n===\nD\nE\nF\n===\nX\nY\nZ\n===' |
    tac -s $'===\n'

sort

  # sort will work with NUL separated records
  # So add NULs between records, sort, remove NULs
  # This for example also reverses the order of records
  #
  echo $'A\nB\nC\n===\nD\nE\nF\n===\nX\nY\nZ\n===' |
    perl -pe '$_ .= "\0" if /^===$/' |
      sort -r -z | tr -d '\0'

  # To place the separator before/after the record (start/end marker)
  # For example with an end marker...
  # (for a start marker, prepend the NUL instead:  $_ = "\0$_" )
  echo $'A\nB\nC\n===\nD\nE\nF\n===\nX\nY\nZ\n===' |
    perl -pe '$_ = "$_\0" if /^===$/' |
      sort -r -z | tr -d '\0'

  # WARNING sort -u (uniq) does not work with -z (nul separated)

-------------------------------------------------------------------------------
Record Reading and separation (perl)

Reading records as Lines, Paragraphs, or whole Files is relatively easy, in
just about any language.

But using a string as a record separator or marker is harder, and comes in
three styles.

END of record marker...

  Perl handles this by default, for its handling of end-of-line, and blocks.
  You do not even need to re-add the separator, as it is automatically
  retained as part of the record, unless specifically removed.

      $/ = 'end-of-record-marker'

Inter-record marker...

  For example, commas between various records.

  In this case the separator is, in general, not part of the record itself,
  and is to be completely ignored.  But it does need to be re-added between
  records.

  This is relatively easy to do: just treat the separator as an END of record
  separator, and delete it from incoming records if found.  Remember the last
  record will not have a marker.

  However when writing you need to output the separator WITH CARE.
  Otherwise you could add an empty record at the start or end of the output.

  You can either add it to the start of every record except the first, or to
  the end of every record except the last.  Skipping 'on first' is easier.

  See the next example, which is almost identical.

BEGIN record marker...

  This is tricky.

  First, if you are not careful you may get a 'blank' record at the start of
  the file (perhaps file comments?) which needs to be ignored, or saved as a
  separate object.

  In this case you use the marker as an end marker, and clean up as
  appropriate.

      $/ = 'start-of-record-marker'

  In the example below the difference between the first and later records is
  solved by including 'newline' with the "$BEGIN" pattern, and using it as an
  inter-record separator, then cleaning this up afterwards.

  For example...

    # Read Records, the first line starting with "Window "
    # Note there is no newline after the marker
    # but there may be a newline before it on second and later records!
    # This makes it all very tricky to deal with!
    xlsclients -l |
      perl -e '
        $BEGIN = "Window ";           # set the start-of-record marker
        local $/ = "\n$BEGIN";        # convert it to an inter-record separator
        while(<>) {
          s/$BEGIN\Z//;               # remove marker from end of record
          s/\A/$BEGIN/ unless $.==1;  # re-add start of record (except first)
          # ... deal with record ...
          print "#--- $. ---\n$_";    # output records with a count header
        }
        print "--- Total $. Windows ---\n";
      '

  If you cannot convert the start-of-record to an inter-record separator then
  you will need to use a flag to prevent reading a 'blank' record at the
  start.

  Remember that the separator also needs to be re-added to the start and
  removed from the end (except last).  And you may need an extra newline on
  second and later records.

Multiple record separators

  This gets worse if the separator could be one of a number of different
  strings (start record types?).  In that case you may need to keep a copy of
  the separator, so you can prepend it to the next record.

  Practical Example?

-------------------------------------------------------------------------------
Record/Line continuation.

In this case the record separator is a 'newline', but there are special
cases, where records may be continued over multiple lines.

Two common styles of line continuation...

-----
A '\' at the end of a line.  (A not-end-of-record-marker)
(AKA the shell scripting command line continuation method)

Examples:  /etc/hosts.allow

Relatively easy: read in lines until the end of record is indicated.
But the last record may not be marked correctly.

BASH makes this easy by using read WITHOUT the -r flag.
I use a sub-shell to disable bash use of $PS2 as a continuation prompt.

  ( while read -p '' line; do
      # ... process lines ....
      echo 'LINE:' "$line"
    done
  )

# Perl Script...
  # Append continuation lines (merging spaces with a single space).
  # It will also ignore whole line comments between continuation lines
  # and ignore comments after a '\' on the continuation line,
  # and continuation lines can be terminated by a blank line.
  #
  # Comments are NOT preserved.
  #
  # WARNING: It does not handle a line continuation at EOF!
  # To handle that you will need a subroutine call, so you can process
  # the line again when EOF arrives with a continuation still incomplete.
  #
  my ($next, $line) = ('', '');
  while( $next = <> ) {           # get next line (with continuations)
    next if $next =~ /^\s*#/;     # ignore pure comment lines completely
    $next =~ s/#.*$//;            # ignore comments at end of lines
    $next =~ s/^\s+//;            # remove start of line spaces
    $next =~ s/\s+$//;            # remove end of line spaces (and newline)

    # join up multi line records
    $line = length $line ? "$line $next" : $next;
    next if $line =~ s/\s*\\$//;  # continued: strip the '\' and read more

    $_ = $line; $line = '';       # pass joined line, prepare for next one.
    next if /^$/;                 # skip blank lines

    # ...process line...
    print "$_\n";
  }

# Using a sub-routine: do_line()
  # The END block handles a record still being collected at EOF.
  perl -ne 'BEGIN { sub do_line {
                # ...process record here...
                print $l;  $l = "";
            } }
            $l .= $_;  next if $l =~ /\\$/;
            do_line();
            END { do_line() } ' File

-----
White Space at start of next line, means previous line is continued
(AKA EMail Header line continuation)

This is harder as you will need to 'pre-read' the next record, and hold it
for the next loop.  Also at the end you still need to handle that record.

Typical examples of this line continuation are
  * mail headers,
  * solaris "ifconfig -a"
  * multi-line "df" output
  * multi-line "quota" output

The simple way is to pre-process to a single line.
Though you may also need to pre-handle a mail header/body separation.

Some of these do not seem to work well!
  cat mail | sed '/^$/q; /^[a-zA-Z]/{x;p;x;}'
  ifconfig -a | sed '/^[a-z]/{x;p;x;}'
  df | sed '/^[^ ]/{x;p;x;}'

  yum list installed | sed -e :a -e '$!N;s/\n\s\+/ /;ta' -e 'P;D'

A more direct, but less simple, way is to use a function call for each line

  # Read in header collecting multi-line headers together into one variable
  while (<>) {
    if ( /^$/ ) {
      # end of headers - process last
      do_header($line)  if defined $line;
      # process mail body
      while(<>) {   # just gobble up the rest of the input
        do_body($_);
      }
      exit 0;
    }
    if ( /^\s/ ) {
      # continued multi-line - concatenate
      $line .= $_;
      next;
    }
    # start of a new line -- process the old collected line
    do_header($line)  if defined $line;
    # start collecting a new line
    $line = $_;
  }
  # input ended without a body -- still handle the held header
  do_header($line)  if defined $line;

  sub do_header { my ($header) = @_;  # ..process a mail header line..
  }
  sub do_body   { my ($text) = @_;    # ..process a body line..
  }

-------------------------------------------------------------------------------
Sub-Record in a Record.

The 'leading' record separator is not only used to separate individual
records, but also sub-records within records.

An example: 'JSON' bookmarks files (firefox)

   {N:"dir",[{N:f1},{N:subdir,[{N:f2}]},{N:f3}]}

Where {...} is a record and [...] a sub-directory of more records

And we want to output

  DIR:dir
    Item:f1
    DIR:subdir
      Item:f2
    END
    Item:f3
  END

This was implemented in my script "hotlist_firefox_json"
with the following programming structure...

  $/ = '{';  $level = 0;
  while( <> ) {
    # quoted name first, else a bare name.  Note the use of 'or' between
    # two list assignments: '||' would force the match into scalar context.
    ( $name ) = /N:"(.*?[^\\])"/  or
    ( $name ) = /N:([^,}\]]+)/    or  next;   # no name (EG: first record)

    if ( /\[/ ) {
      # output directory type
      print "  "x$level, "DIR:$name\n";
      $level++;
      next;
    }
    # output normal record
    print "  "x$level, "Item:$name\n";

    # exit sub-directories
    if ( my (@count) = /\]\}/g ) {
      for ( @count ) {
        $level--;
        print "  "x$level, "END\n";
      }
    }
  }

NOTE the above assumes the characters "{", "[" and "]}" are unique, and not
found in any strings in the data.  Also sub-records are always at the end of
the directory record, with no other fields following.
In other words it is a hack, but it does not require the directory record to
be held in memory until it is complete.  -- A Simplification

This worked for firefox bookmarks, but not for google-chrome bookmarks.

-------------------------------------------------------------------------------