At last night meeting someone asked about ways to transpose the rows and
columns in a CSV file that was too large to fit into memory and
potentially had thousands of columns and hundreds of thousands of rows.
If your data fits the criteria of having reasonably uniform cell widths,
say for example no cells holding more than 50 bytes, you could use a
sparse file with a fixed width record structure. You would make a single
pass through your source data file, parsing the CSV only once, and make
random (as in random access) seeks into the target file, following a
simple formula to determine the cell's offset.
(If a sparse file isn't doable, but you have lots of disk space, you can
instead create the target file by padding it with enough nulls to hold
the number of cells you are expecting. Using dd to copy blocks from
/dev/zero would do the trick.)
Conceptually this isn't all that different from using a database, as
someone suggested, but should have much less overhead.
Of course if the desired end result is to have another CSV file, you'll
have to write another chunk of code to convert the fixed record file
back to CSV.
Whether millions of random seeks and a post conversion step really
provide enough of a performance gain over making multiple passes through
the source file is debatable. If this is a one-time conversion effort,
the multi pass approach might be the way to go.
A variation on this idea, which would eliminate copying the data to an
intermediary file, and would be able to tolerate widely varying cell
widths, is to make one pass through the source file and generate an
index for it - say the byte offset of each row, and then the relative
offset and length of each cell - which, if you're lucky, might fit into
memory, otherwise you'd again use a sparse file. Then you'd walk through
the index, make random seeks into the source file, and write the cells
to your target CSV file in the desired order. Not quite a single pass on
the source file, but at least you'd have the overhead of performing the
CSV parsing only once.
-Tom
|
Try Searching:
servers, voip, java, networking, microsoft ...
|
|
|
|