-------------------------------------------------------------------------------
Why are there many data file representations?

The problem is that many people decide that the XYZZY data format is too
complicated, so they make a replacement, only to find out why data files are
so complicated.

Specifically...
  * A "magic number" at the top of the file to indicate format
  * Character encoding (ASCII, Unicode, etc)
  * Extensibility, different data types
  * Escaping rules
  * Accommodating multi-line data
  * Parsing and syntax rules (schema)

The best idea is for the file to be self-describing instead of looking like
line noise.  But in the end you end up with something as complex as XML and
JSON all over again.

-------------------------------------------------------------------------------
Types of data files

Key Value (with groups)
    The data forms specific key and value pairs of simple data types
       Typical Config Files

Array or Table based
    The data forms distinct rows (records) of fixed columns of data
       CSV
       FLIRT

Hierarchical Tree
    You can embed complex freeform data structures within data structures.
    Some elements may be missing or as yet undefined.
       XML
       JSON

-------------------------------------------------------------------------------
Parsing freeform records...

For techniques for reading/handling ASCII records in perl...
See "Record Reading and separation" in  ../perl/general.hints

-------------------------------------------------------------------------------
General UNIX text file conventions

As per "The Art of UNIX Programming" by Eric S. Raymond

  * One record per line (if possible)
  * Less than 80 characters per line (if possible)
  * Use # as an introducer for comments
  * Support the backslash convention
  * Use colon (passwd) or a run of white space as field separators
  * Do not make distinctions between tab and space
  * Favor Hex over Octal
  * Do not make compression or binary encoding part of the format.

'Stanza' Format.
  * Multi-line records
  * Use of % or %% on their own as record separators
    The %% can also act as a comment by ignoring any following text.
    This means of course that comments can only appear between records.
    Empty records are ignored.
  * One field per line, using a colon between key and value (as per email)
  * Support some form of line continuation
    Either:  \ at end of line
    OR white space at start of the next line (treated as one space)
  * Ignore trailing white space
  * Include a version number or self-describing chunks (future proof it)

Finally beware of floating-point round-off problems,
especially when converting between number strings and binary forms.

-------------------------------------------------------------------------------
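Example: reading 'Stanza' format

The stanza rules above can be sketched as a small reader.  This is only a
minimal illustration in Python, not a reference implementation.  The function
name parse_stanzas is hypothetical, and a few behaviours are assumptions the
notes above do not specify: lines without a colon are skipped, a repeated key
overwrites the earlier value, and a backslash continuation joins the next
line as-is.

```python
def parse_stanzas(text):
    """Parse stanza-format text into a list of {key: value} dicts."""
    records = []
    current = {}
    last_key = None

    def finish():
        # Close the current record; empty records are ignored.
        nonlocal current, last_key
        if current:
            records.append(current)
        current, last_key = {}, None

    pending = ""  # text of a line that ended with a backslash
    # The extra "" flushes a backslash continuation on the very last line.
    for raw in text.splitlines() + [""]:
        line = pending + raw.rstrip()      # ignore trailing white space
        pending = ""
        if line.endswith("\\"):            # continuation: \ at end of line
            pending = line[:-1]
            continue
        if line == "%" or line.startswith("%%"):
            finish()                       # separator; text after %% is comment
            continue
        if not line:
            continue
        if line[0] in " \t":               # continuation: leading white space
            if last_key is not None:       # joined to the value as one space
                current[last_key] += " " + line.strip()
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue                       # assumption: skip lines with no colon
        last_key = key.strip()
        current[last_key] = value.strip()
    finish()
    return records
```

Each record comes back as a plain dictionary, so a file with a leading
"%% version 1" line and records separated by "%%" yields one dictionary per
non-empty record.  Values are kept as strings, which sidesteps the
floating-point round-off problem until a caller actually converts them.

-------------------------------------------------------------------------------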