I have some sample data included as static resources in a Force.com application and a setup page that gives the user the option of loading this data into some custom objects. The files are Comma Separated Value (CSV) files exported by the Apex Data Loader.
(Incidentally, this code supports automatic relationship id fixup and also self-referencing objects such as Contact, whose ReportsTo field is a lookup to another Contact. So a graph of related objects can be loaded.)
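The fixup idea can be sketched as a two-pass load: insert the records first, remember the mapping from exported id to newly assigned id, then patch the self-references in a second pass. This is a minimal Java illustration of that pattern, not the original Apex; the class and field names are invented and the id assignment is simulated:

```java
import java.util.HashMap;
import java.util.Map;

public class IdFixup {

    // Hypothetical stand-in for a loaded CSV row of a self-referencing object.
    public static class Record {
        public String oldId;           // id from the CSV export
        public String newId;           // id assigned on insert (simulated below)
        public String reportsToOldId;  // self-reference by exported id, may be null
        public String reportsToNewId;  // filled in by the fixup pass
    }

    private static int nextId = 1;

    public static void load(Record[] records) {
        Map<String, String> idMap = new HashMap<>();
        // Pass 1: "insert" each record and capture its new id.
        for (Record r : records) {
            r.newId = "N" + (nextId++);
            idMap.put(r.oldId, r.newId);
        }
        // Pass 2: rewrite self-references using the old-id -> new-id mapping.
        for (Record r : records) {
            if (r.reportsToOldId != null) {
                r.reportsToNewId = idMap.get(r.reportsToOldId);
            }
        }
    }
}
```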
Parsing a CSV file is easy if things like line feeds, quotes and escape characters embedded in values are ignored:
String[] fields = line.split(',', -1);
Note that the -1 argument stops trailing empty fields from being discarded.
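The same split semantics hold in Java, from which the Apex method is borrowed; a quick illustration of what the -1 limit changes for a line ending in empty fields:

```java
public class SplitDemo {
    public static void main(String[] args) {
        String line = "a,b,,";
        // Without a limit, trailing empty strings are silently discarded.
        String[] dropped = line.split(",");     // ["a", "b"]
        // With limit -1, trailing empty fields are preserved.
        String[] kept = line.split(",", -1);    // ["a", "b", "", ""]
        System.out.println(dropped.length);     // 2
        System.out.println(kept.length);        // 4
    }
}
```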
But the logic to handle the more complicated cases is pretty messy and not easy to get right. So I ported the opencsv code from Java to Apex, and this code worked well for a while. Then recently the amount of sample data increased and the code hit the “Total number of executed script statements” governor limit of 200,000. (Note that in Winter ’14 this limit is being replaced by a CPU timeout limit that should result in a higher number of statements being allowed, but I have not yet seen any quantitative data about how much more…) Given that the compute capacity already appears to be throttled as part of the multi-tenant architecture, there is a natural limitation imposed by how long your users are prepared to wait. So this brick-wall limit, which at best forces you to optimize your code and at worst forces you to throw it away, is really annoying. It reminds me of the bad old days when platform limitations often got in the way of implementing functionality of value to the business.
To optimize the code, I first used some unit test cases and Limits.getScriptStatements calls to discover that the CSV parsing was costing about 100 statements per value because the algorithm was working character by character. My first optimization attempts produced only about a 20% improvement in return for some pretty ugly code changes – not the way to go. The core problem was processing individual characters in Apex rather than whole values or whole lines. So before processing a line, the code now checks for embedded line feeds, quotes and escape characters: if any are found (the rare case for my data) the expensive opencsv parsing is used, and if none are found (the common case for my data) the cheap String.split shown above is used. The result is a 90% improvement (a tenth as many script statements executed), which gives me plenty of headroom for the near future.
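The fast-path/slow-path dispatch can be sketched as follows. This is a Java sketch with invented names, not the original Apex, and the full parser is left as a placeholder standing in for the opencsv port:

```java
import java.util.Arrays;
import java.util.List;

public class CsvLineParser {

    public static List<String> parseLine(String line) {
        // Common case: no quotes, escapes or embedded line breaks,
        // so a plain split on commas is correct and cheap.
        if (line.indexOf('"') == -1 && line.indexOf('\\') == -1
                && line.indexOf('\n') == -1 && line.indexOf('\r') == -1) {
            return Arrays.asList(line.split(",", -1));
        }
        // Rare case: fall back to the expensive character-by-character parser.
        return parseWithFullCsvParser(line);
    }

    // Placeholder for the ported opencsv logic (omitted here).
    private static List<String> parseWithFullCsvParser(String line) {
        throw new UnsupportedOperationException("full parser not shown");
    }
}
```

The win comes from the check itself being a handful of statements per line, while the full parser costs on the order of 100 statements per value.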