Howto create “standardized” CSV exports from MySQL

When you start querying/analyzing/and wanting to export data from MySQL (especially anything free-form like SEO/SEM Keyword Data Extracts) you’ll quickly find out you can’t simply concatenate your columns together and join them with a delimiter to be read later.

Sampling some of our keywords illustrates a number of problems:

Quotes: ! 1/4″ inlet muffler

Newlines: xtra vision

headlight h4703xv hours

Control Characters: ^HToyota+starlet+coolant+temp+sensor, 242°/250° duration @ .050″ 601 lift

Commas/Tabs/Semi Colans or other typical delimiters: “+bosch, +part no., +67647″, 2500hd

chevy cat back exhaust

Unicode: ايسوزو 20011

ב.מ.וו 545i, 宝马m3冷却系统

The Original MySQL attempt to export this was:

SELECT
	`medium`,
	`keyword`,
	`landingpage`,
	SUM(`entrances`) as entrances
INTO OUTFILE 'keyword_data.csv'
	FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
	LINES TERMINATED BY '\r\n'
FROM
	`daily_medium_keyword_landingpage_visits`
WHERE
	medium = 'organic'
GROUP BY
	`medium`,
	`keyword`,
	`landingpage`;

This produced a really great file format, any crazy characters were escaped properly and it looked great on paper:

"organic","! 1/4\" inlet muffler","/cache.aspx?q=\"1+4\"+\"inlet+muffler\"&d=76309488474199&mkt=en-US&setlang=en-US&w=6d792683,38b10c05",1
"organic","!((^ toyota avalon mechanical lifter repairs","/auto-parts/toyota/engine-mechanical-valve-lifter",1
"organic","!(gl-508 rm) motorcraft","/auto-part/149373-motorcraft-gl508rm-remanufactured-alternator",1
"organic","!947 dodge power wagon","/auto-part/983439-mallory-ignition-wire-set-m11947",1
"organic","!967 ford mustang parts","/auto-parts/ford-mustang",1
"organic","!971 142 volvo specifications","/auto-parts/volvo-142/engine-electrical-distributor",1
"organic","!972 buick skylark heater core","/cars/1972-buick-skylark",1

I loaded up FasterCSV to start parsing the file and quickly figured out how naive I was about the CSV file formats. MySQL’s default use of \ as an escape character isn’t “standard.” It turns out there isn’t an official CSV standard but I did find something close: http://tools.ietf.org/html/rfc4180#section-2

Hmm so effectively none of the control characters should be escaped, but fields should be quoted if they have a delimeter in them, and a quote in the field should be replaced with “” to be a CSV Standard.

The other wrinkle is the standard calls for \r\n to be the line delimeter, but what happens if there are other \r\n in the columns?

First Step – Remove Escape Characters:

	FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY ''

By adding the ESCAPED BY ”, none of the characters will be escaped in the output

Step Two – Replace \r\n with \n:

	REPLACE(`keyword`, "\r\n", "\n")

Step Three – Replace ” with “”

	REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""')

Step Four – Replace NULL-Byte with a blank

	REPLACE(REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""'), ""), char(0), "")

Step Five – Replace NULL field with a blank

	IFNULL(REPLACE(REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""'), ""), char(0), ""), "")

Putting it all together:


SELECT
 IFNULL(REPLACE(REPLACE(REPLACE(`medium`, "\r\n", "\n"), '"', '""'), char(0), ""), "") as medium,
 IFNULL(REPLACE(REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""'), char(0), ""), "") as keyword,
 IFNULL(REPLACE(REPLACE(REPLACE(`landingpage`, "\r\n", "\n"), '"', '""'), char(0), ""), "") as landingpage,
 SUM(IFNULL(`entrances`, 0)) as entrances
INTO OUTFILE 'keyword_data_new.csv'
 FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY ''
 LINES TERMINATED BY '\r\n'
FROM `daily_medium_keyword_landingpage_visits`
WHERE medium = 'organic'
GROUP BY
 IFNULL(REPLACE(REPLACE(REPLACE(`medium`, "\r\n", "\n"), '"', '""'), char(0), ""), ""),
 IFNULL(REPLACE(REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""'), char(0), ""), ""),
 IFNULL(REPLACE(REPLACE(REPLACE(`landingpage`, "\r\n", "\n"), '"', '""'), char(0), ""), "");
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s