-------------------------------------------------------------------------------
Why are there so many data file representations?

The problem is that many people decide that the XYZZY data format is too
complicated, so they make a replacement, only to find out why data files are
so complicated.

Specifically, a data file usually needs...
  * A "magic number" at the top of the file to indicate format
  * Character encoding (ASCII, Unicode, etc)
  * Extensibility, for different data types
  * Escaping rules
  * A way to accommodate multi-line data
  * Parsing and syntax rules (schema)

The best idea is for the file to be self-describing instead of looking like
line noise. But in the end you end up with something as complex as XML and
JSON all over again.

-------------------------------------------------------------------------------
Types of data files by complexity

Key-Value pairs
  The data forms specific key and value pairs of simple data types.

  This covers the many different configuration file formats.
  For example using '=', ':', or spaces, between key and value.

  As complexity grows you can add...
    * Grouping with '[section]' lines
    * Arrays

  Simple forms are basically python, perl, or bash scripts defining
  variables. But as these are executed, such data formats can become
  dangerous if the file, or the data used in the file, comes from external
  sources.

Array or Table based
  The data forms distinct rows (records) of fixed columns of data.

  These MUST be interpreted, usually by some library, and can become very
  application dependent.

  Ex:  CSV  FLIRT

Hierarchical Tree
  You can embed complex freeform data structures within data structures.
  Some elements may be missing or as yet undefined.

  These formats are well known, with standard libraries to read the data.
  But the data being defined can itself become complex.

  Ex:  XML  JSON

-------------------------------------------------------------------------------
Parsing freeform records...

  For techniques for reading and handling ASCII records (mostly perl)...
  See "Record Reading and separation" in "multiline_records.txt"

-------------------------------------------------------------------------------
General UNIX 'text' file conventions

As per "The Art of UNIX Programming", Eric S. Raymond

  * One record per line (if possible, easier to edit)
  * Less than 80 characters per line (if possible, easier to edit)
  * Use # as an introducer for comments (full line or end of line)
  * Support a backslash convention (for escapes and continued lines)
  * Use colon (passwd), or a run of white space, as field separators
  * Do not make distinctions between tab and other white space!!!
  * Favor hexadecimal over octal
  * Do not make compression or binary encoding part of the format.
    (it just makes it hard to read and edit)

'Stanza' Format  (more formal UNIX text file format)

  * Multi-line records
  * Use % or %% on a line of their own as record separators
    The %% can also act as a comment by ignoring any following text.
    This means of course that comments can only appear between records,
    not inside one.  Empty records are ignored.
  * One field per line using colon between key and value (as per email)
  * Support some form of line continuation
    Either:   \ at end of line
    OR        white space at start of line (as one space)
  * Ignore trailing white space
  * Include a version number or self-describing chunks (future proof it)

Finally, beware of floating-point round-off problems, especially when
converting between a string representation and its numerical binary form.

-------------------------------------------------------------------------------
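Example: reading the 'Stanza' format (python sketch)

The 'Stanza' conventions listed above are simple enough to read with a few
lines of code. What follows is only a minimal sketch in python, not a
reference implementation. The function name parse_stanzas() and the sample
data are made up for illustration. It assumes the "white space at start of
line" style of continuation only (no backslash handling), and it simply skips
blank lines.

```python
def parse_stanzas(text):
    """Parse stanza-format text into a list of {key: value} dicts.

    Assumptions of this sketch (not part of any standard):
      - a line that is exactly "%" or that starts with "%%" ends a record
      - anything after "%%" on that line is a comment and is dropped
      - a line starting with white space continues the previous field,
        joined to it with a single space
      - blank lines are skipped
    """
    records = []
    current = {}
    last_key = None

    for raw in text.splitlines():
        line = raw.rstrip()                  # ignore trailing white space

        if line == "%" or line.startswith("%%"):
            if current:                      # empty records are ignored
                records.append(current)
            current, last_key = {}, None
            continue

        if not line:
            continue                         # skip blank lines

        if line[0] in " \t":                 # continuation line
            if last_key is None:
                raise ValueError("continuation with no field: %r" % raw)
            current[last_key] += " " + line.strip()
            continue

        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError("missing ':' in line: %r" % raw)
        last_key = key.strip()
        current[last_key] = value.strip()    # values stay as strings

    if current:                              # final record needs no separator
        records.append(current)
    return records


# Hypothetical sample data, built from explicit strings so that the
# trailing white space on the continuation line is visible.
sample = (
    "%% address book, format version 1\n"
    "name: Alice Example\n"
    "note: first line\n"
    "  second line   \n"
    "%\n"
    "%\n"
    "name: Bob Example\n"
    "url: http://example.com/bob\n"
)

records = parse_stanzas(sample)
```

Values are deliberately left as strings. Converting a field to a float and
writing it back out is exactly where the round-off problem mentioned above
can creep in, so that conversion is better left to the caller.

-------------------------------------------------------------------------------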