When you start querying/analyzing/and wanting to export data from MySQL (especially anything free-form like SEO/SEM Keyword Data Extracts) you’ll quickly find out you can’t simply concatenate your columns together and join them with a delimiter to be read later.
Sampling some of our keywords illustrates a number of problems:
Quotes: ! 1/4″ inlet muffler
Newlines: xtra vision
headlight h4703xv hours
Control Characters: ^HToyota+starlet+coolant+temp+sensor, 242°/250° duration @ .050″ 601 lift
Commas/Tabs/Semi Colans or other typical delimiters: “+bosch, +part no., +67647″, 2500hd
chevy cat back exhaust
Unicode: ايسوزو 20011
ב.מ.וו 545i, 宝马m3冷却系统
The Original MySQL attempt to export this was:
SELECT `medium`, `keyword`, `landingpage`, SUM(`entrances`) as entrances INTO OUTFILE 'keyword_data.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\r\n' FROM `daily_medium_keyword_landingpage_visits` WHERE medium = 'organic' GROUP BY `medium`, `keyword`, `landingpage`;
This produced a really great file format, any crazy characters were escaped properly and it looked great on paper:
"organic","! 1/4\" inlet muffler","/cache.aspx?q=\"1+4\"+\"inlet+muffler\"&d=76309488474199&mkt=en-US&setlang=en-US&w=6d792683,38b10c05",1 "organic","!((^ toyota avalon mechanical lifter repairs","/auto-parts/toyota/engine-mechanical-valve-lifter",1 "organic","!(gl-508 rm) motorcraft","/auto-part/149373-motorcraft-gl508rm-remanufactured-alternator",1 "organic","!947 dodge power wagon","/auto-part/983439-mallory-ignition-wire-set-m11947",1 "organic","!967 ford mustang parts","/auto-parts/ford-mustang",1 "organic","!971 142 volvo specifications","/auto-parts/volvo-142/engine-electrical-distributor",1 "organic","!972 buick skylark heater core","/cars/1972-buick-skylark",1
I loaded up FasterCSV to start parsing the file and quickly figured out how naive I was about the CSV file formats. MySQL’s default use of \ as an escape character isn’t “standard.” It turns out there isn’t an official CSV standard but I did find something close: http://tools.ietf.org/html/rfc4180#section-2
Hmm so effectively none of the control characters should be escaped, but fields should be quoted if they have a delimeter in them, and a quote in the field should be replaced with “” to be a CSV Standard.
The other wrinkle is the standard calls for \r\n to be the line delimeter, but what happens if there are other \r\n in the columns?
First Step – Remove Escape Characters:
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY ''
By adding the ESCAPED BY ”, none of the characters will be escaped in the output
Step Two – Replace \r\n with \n:
REPLACE(`keyword`, "\r\n", "\n")
Step Three – Replace ” with “”
REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""')
Step Four – Replace NULL-Byte with a blank
REPLACE(REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""'), ""), char(0), "")
Step Five – Replace NULL field with a blank
IFNULL(REPLACE(REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""'), ""), char(0), ""), "")
Putting it all together:
SELECT IFNULL(REPLACE(REPLACE(REPLACE(`medium`, "\r\n", "\n"), '"', '""'), char(0), ""), "") as medium, IFNULL(REPLACE(REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""'), char(0), ""), "") as keyword, IFNULL(REPLACE(REPLACE(REPLACE(`landingpage`, "\r\n", "\n"), '"', '""'), char(0), ""), "") as landingpage, SUM(IFNULL(`entrances`, 0)) as entrances INTO OUTFILE 'keyword_data_new.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '' LINES TERMINATED BY '\r\n' FROM `daily_medium_keyword_landingpage_visits` WHERE medium = 'organic' GROUP BY IFNULL(REPLACE(REPLACE(REPLACE(`medium`, "\r\n", "\n"), '"', '""'), char(0), ""), ""), IFNULL(REPLACE(REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""'), char(0), ""), ""), IFNULL(REPLACE(REPLACE(REPLACE(`landingpage`, "\r\n", "\n"), '"', '""'), char(0), ""), "");
Leave a Reply