History | Log In     View a printable version of the current page.  
Issue Details (XML | Word | Printable)

Key: GEOT-1231
Type: Task Task
Status: Open Open
Priority: Major Major
Assignee: Jesse Eichar
Reporter: Andrea Aime
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
GeoTools

Dbase numeric parsing is really inefficient

Created: 05/Apr/07 09:29 AM   Updated: 04/Sep/08 08:04 PM
Component/s: data shapefile
Affects Version/s: 2.3.1
Fix Version/s: 2.4.6

File Attachments: 1. Java Source File DbaseNumberParser.java (8 kb)
2. Java Source File DbaseNumberParserTest.java (10 kb)
3. Text File integer_parsing.patch (2 kb)



 Description  « Hide
extractNumberString shows quite prominently in a profile made against a shapefile with a hundred of integer fields.
This is because three reallocations of memory are performed in this tiny method.
At least for integers and number, it would be a lot better to have some custom parsing that works directly against
the charbuffer. The code would not be complex (not nearly as complex as double/float parsing).

 All   Comments   Change History      Sort Order: Ascending order - Click to sort in descending order
Sasa Ivetic - 05/May/08 11:18 PM
I would like to offer a patch to address the issue of integer parsing inefficiency. As the task states, the inefficiency comes from creating 3 intermediate objects (subsequence of the charbuffer, String representation of it, trimming of the string) before a final Integer/Long/Double object can be created. I've played around with, and added some code that instead works by extracting a char array from the character buffer, then manually parses the integer/long from the array (removing any leading/trailing whitespaces in the process). While my code is considerably longer than the original code, in my testing I am seeing about 3x the improvement in processing speeds over old code.

My code measure consisted of inserting 1 million random integers into a CharBuffer, then one by one extracting and parsing them using the two methods. On average, 1 million integers takes 155ms to parse with my proposed methods, while the same amount of records takes 440ms to parse using the existing method.

If this code is acceptable, I can write a similar method to handle extracting Doubles. If there is anything wrong with the code, I am more than happy to improve it.

Andrea Aime - 06/May/08 02:25 AM - edited
I had a quick look at the patch, it looks good. I'd suggest you roll out the parsing methods in a separate utility class (with static public parsing methods, just like java.lang.Math if you know what I mean) and then heavily test it with all corner cases, so that there is actual proof it's not only working fast, but also fine.
Heavily test -> JUnit tests that can be included in our build

Sasa Ivetic - 06/May/08 08:13 AM
Great, will do that.

Sasa Ivetic - 07/May/08 11:15 PM
Andrea,

I've updated the code (attached a new patch, the new class with static methods, and a test class).

I've done some time testing with this code, and these are the results:

Time for the entire test to run (that includes writing 1mil records to a DBF, then reading (and consequently parsing) each of those 1mil records)

My Code:
testcase time="15.547" name="testParseFromDBF2"

Old Code:
testcase time="17.079" name="testParseFromDBF2"


Timing just the code that reads the records produces these results (timing was done using System.nanoTime(), then divided by 1mil to get time in ms):

New Code:
Time Took: 1631ms

Old Code:
Time Took: 2735ms

The timing test case is left in (but commented out) to show how I arrived at above times.

As mentioned in IRC, all the code has been written by me, so there should be no copyright issues.