-------------------------------------------------------------------------------
CSV - Comma Separated Values

CSV seems like a simple format, and for most data it is.  But it can become
much more complex when specific characters (commas, quotes, newlines) appear
in the data.

Basically simple forms are easy to parse, other forms are much, much harder!
If the data starts to get hard, you may be better off using XML or JSON.

---

Simple CSV (no spaces around commas!)

    A,B,C,D

Quoted CSV (not all fields need to be quoted!)

    A,B,"This is a ""quoted"" string, with a comma and quotes.",D

Quoted strings could include linebreaks!

    A,B,"This is a ""quoted"" string,
    with a comma, quotes and a newline.",D

Additional notes...

The number of fields per record should be constant.

The first line typically holds the field names, but there is no way to
record this fact in the CSV file itself.  Human identification is needed.

Typically fields with leading or trailing spaces are quoted to avoid
confusion.  But that is not always the case.

Also you typically quote all the fields or none of them to make parsing
easier.  It is the mixed scheme (some fields quoted, others not) that makes
the format very hard to parse.

With an always-quote scheme, records start with '^"', end with '"$', and
fields are separated by '","'

Unicode can make it harder.  UTF-16 encodes even a plain comma or quote as
two bytes, which breaks byte-oriented parsers, and Unicode text may contain
look-alike commas and quotes (typographic quotes for example) that are not
delimiters at all!

An overview, including variations, is at
  http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm

-------------------------------------------------------------------------------
Variations...

  embedded quotes are     ""  or  \"
  embedded commas are     \,  in unquoted fields
  embedded linefeeds are  "\n"

Field delimiters used...    ,   ;   :   \t   |

TAB is particularly useful as it was often used for database tables.
However TABs may sometimes be accidentally converted to spaces.

ASCII has specific control characters for record structures.
  * FS - File Separator    (0x1C)
  * GS - Group Separator   (0x1D)
  * RS - Record Separator  (0x1E)
  * US - Unit Separator    (0x1F)
However these are often not easily editable, and never really caught on.
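
As a rough illustration of the ASCII separators above, here is a minimal bash
sketch that writes records with US between fields and RS after each record,
then reads them back.  The function names write_record and read_records are
made up for this example.  It assumes bash (for $'...' and 'read -d'), and
note that 'read -a' drops a trailing empty field.

```shell
#!/usr/bin/env bash
# Sketch only: US between fields, RS after each record.
US=$'\037'   # Unit Separator,   octal 037 = 0x1F
RS=$'\036'   # Record Separator, octal 036 = 0x1E

# write_record FIELD... : print the fields joined by US, terminated by RS
# (hypothetical helper name)
write_record() {
  local IFS="$US"            # "$*" joins arguments with the first IFS char
  printf '%s%s' "$*" "$RS"
}

# read_records : read US/RS data from stdin, print each field in brackets
# (hypothetical helper name)
read_records() {
  local -a fields
  while IFS="$US" read -r -d "$RS" -a fields; do
    printf '[%s]' "${fields[@]}"
    printf '\n'
  done
}

# Commas and quotes need no escaping, as they are no longer special
{
  write_record 'A' 'B, with a comma' 'a "quoted" word'
  write_record 'D' 'E' 'F'
} | read_records
```

The only characters that still need care are US and RS themselves, which
rarely turn up in ordinary text.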

A suggested variation is multi-table and column specifications using a
special @@ prefix...

    @@Table: Customers
    @@Columns: custID, title, locationID, balance, notes
    512,"IBM",36,2406.34,""
    883,"Johson Movers",572,0,"47A compliant"
    Etc....

    @@Table: Invoices
    @@Columns: custRef, invDate, Total, Tax
    423,"12/14/2003",3500,854.17
    822,"03/01/2002",1476.34,322.05
    Etc....

With added data types

    @@Table: Invoices
    @@Columns: custRef(int), invDate(date), total(decimal:10.2)
    423,"12/14/2003",3500
    822,"03/01/2002",1476.34

If null values are not allowed:   "custRef(int)+"
for multiple integers

-------------------------------------------------------------------------------
CSV and perl modules

  perl-Class-CSV
    A good basic CSV reader/writer (not a serial processor)

  perl-Text-CSV
    More complex CSV handler, looks harder to use.
    But it can process serially.

  perl-Text-CSV-Separator
    Figures out the field separator of an arbitrary CSV file

-------------------------------------------------------------------------------
Shell parsing CSV

    ab1pp1,ab1,pp1
    bb1oo1,bb1,oo1
    cc1qq1,cc1,qq1

Can be parsed using (known number of fields, no quoting)

    # Set IFS (Internal Field Separator) for the 'read' command only, so
    # the shell's own IFS does not need to be backed up and restored.
    # Each input line is read directly into 3 variables a,b,c
    while IFS="," read -r a b c
    do
      echo "LOGIN:$a LASTNAME:$b FIRSTNAME:$c"
    done < fileA.csv

But if all fields are quoted

    "ab1,pp1","ab1","pp1"
    "bb1,oo1","bb1","oo1"
    "cc1,qq1","cc1","qq1"

then you are better off 'cutting' on the quotes.

    # Reads from standard input.  Splitting on '"' puts the field values
    # in the even numbered pieces (2, 4, 6), the commas in the odd ones.
    # This assumes no field contains an embedded quote.
    while read -r line; do
      fields=()
      for i in {2..6..2}; do
        fields+=( "$(echo "$line" | cut -d '"' -f"${i}")" )
      done
      echo "${fields[0]} - ${fields[1]} - ${fields[2]}"
    done

But generally you need a full tokenizer to parse CSV

-------------------------------------------------------------------------------
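
To give an idea of what a "full tokenizer" involves, here is a minimal bash
sketch of a character-by-character state machine.  It is not a complete CSV
implementation.  The names csv_split, csv_read_record, CSV_FIELDS and
CSV_RECORD are invented for this example.  It assumes a comma delimiter,
"" for embedded quotes, and that malformed input need not be diagnosed.

```shell
#!/usr/bin/env bash
# csv_split RECORD : split one CSV record into the global array CSV_FIELDS.
# Handles quoted fields, "" as an embedded quote, and commas or newlines
# inside quotes.  (hypothetical helper name)
csv_split() {
  local record=$1 field='' ch next
  local -i i=0 len=${#1} in_quotes=0
  CSV_FIELDS=()
  while (( i < len )); do
    ch=${record:i:1}
    if (( in_quotes )); then
      if [[ $ch == '"' ]]; then
        next=${record:i+1:1}
        if [[ $next == '"' ]]; then
          field+='"'        # doubled quote -> one literal quote
          (( i += 1 ))      # skip the second quote of the pair
        else
          in_quotes=0       # closing quote
        fi
      else
        field+=$ch          # anything else inside quotes is data
      fi
    elif [[ $ch == '"' ]]; then
      in_quotes=1           # opening quote
    elif [[ $ch == ',' ]]; then
      CSV_FIELDS+=( "$field" )   # unquoted comma ends the field
      field=''
    else
      field+=$ch
    fi
    (( i += 1 ))
  done
  CSV_FIELDS+=( "$field" )  # last field has no trailing comma
}

# csv_read_record : read one logical record from stdin into CSV_RECORD,
# joining physical lines while the count of quote characters is odd.
# (hypothetical helper name)
csv_read_record() {
  local line quotes q='"'
  CSV_RECORD=''
  while IFS= read -r line || [[ -n $line ]]; do
    CSV_RECORD+=$line
    quotes=${CSV_RECORD//[^$q]/}     # keep only the quote characters
    (( ${#quotes} % 2 == 0 )) && return 0
    CSV_RECORD+=$'\n'                # still inside a quoted field
  done
  [[ -n $CSV_RECORD ]]
}

# Usage: print the fields of each record, one bracketed field at a time
while csv_read_record; do
  csv_split "$CSV_RECORD"
  printf '%d fields:' "${#CSV_FIELDS[@]}"
  printf ' [%s]' "${CSV_FIELDS[@]}"
  printf '\n'
done <<'EOF'
A,B,"This is a ""quoted"" string, with a comma and quotes.",D
EOF
```

Walking the string one character at a time is slow in shell, so for anything
beyond small files one of the perl modules above is probably the better tool.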