Parsing CSV while staying within the “total number of executed script statements” governor limit

I have some sample data included as static resources in a Force.com application and a setup page that gives the user the option of loading this data into some custom objects. The files are Comma Separated Value (CSV) files exported by the Apex Data Loader.

(Incidentally this code supports automatic relationship id fixup and also self-referencing objects such as Contact that has a ReportsTo field that is a lookup to another Contact. So a graph of related objects can be loaded.)

Parsing a CSV file is easy if things like line feeds, quotes and escape characters embedded in values are ignored:

String[] fields = line.split(',', -1);

Note that the -1 argument stops trailing empty fields from being discarded.
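For example, a minimal illustration of the difference the second argument makes (split behaves like Java's String.split here):

// 'a,b,,' ends in two empty values
String line = 'a,b,,';
System.assertEquals(2, line.split(',').size());     // trailing empty fields discarded
System.assertEquals(4, line.split(',', -1).size()); // all four values kept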

But the logic to handle the more complicated cases is messy and not easy to get right, so I ported the opencsv code from Java to Apex and it worked well for a while. Then recently the amount of sample data increased and the code hit the “Total number of executed script statements” governor limit of 200,000. (Note that in Winter ’14 this limit is being replaced by a CPU time limit that should in effect allow more statements to execute, but I have not yet seen any quantitative data about how many more…) Given that compute capacity already appears to be throttled as part of the multi-tenant architecture, there is a natural limit imposed by how long your users are prepared to wait. So a brick-wall limit that means at best you have to optimize your code and at worst you have to throw it away is really annoying. It reminds me of the bad old days, when platform limitations often got in the way of implementing functionality of value to the business.

To optimize the code, I first used some unit test cases and Limits.getScriptStatements calls to discover that the CSV parsing was costing about 100 statements per value, because the algorithm was working character by character. My first optimization attempts only got me about a 20% improvement in return for some pretty ugly code changes – not the way to go. The core problem was processing characters in Apex rather than whole values or whole lines. So before processing a line the code now checks for embedded line feeds, quotes and escape characters. If any are found (the rare case for my data) the expensive opencsv parsing is used; if not (the common case for my data) the cheap String.split shown above is used. The result is a 90% improvement (ten times fewer script statements executed), which gives me plenty of headroom for the near future.
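In outline, the check looks something like this. This is a minimal sketch only: OpenCsvParser and its parseLine method stand in for my port of opencsv and are placeholder names, not a published API, and the set of characters checked is simplified.

// Fast path: no embedded quotes or escape characters, so a plain split is safe.
// Slow path: fall back to the expensive character-by-character parser.
private static final String QUOTE = '"';
private static final String ESCAPE = '\\';

public static List<String> parseLine(String line) {
    if (!line.contains(QUOTE) && !line.contains(ESCAPE)) {
        // Common case for this data: one split call instead of
        // roughly 100 statements per value.
        return line.split(',', -1);
    }
    // Rare case: embedded quotes or escapes handled by the opencsv port (placeholder name).
    return OpenCsvParser.parseLine(line);
}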


4 thoughts on “Parsing CSV while staying within the “total number of executed script statements” governor limit”

  1. Any plans in opensourcing the CSV parser?

    I already have my custom implementation, but it's specific to a certain CSV format. Other developers could use your implementation since you’ve already optimized it for Force.com.

    Thanks!

  2. I don’t think the solution is good enough to share. First, it is a literal port of opencsv that really ought to be refined, and it certainly needs a much bigger set of test cases before I would be confident sharing it. And the test that switches to String.split works with my data but could be fooled by some combinations of embedded quotes.

    This post was just intended to forewarn anyone else implementing their own parser of the script statement issue. What is really needed is a platform API…
