Re: CSVSplitter - Splittable DoFn


Generally, reading CSV by doing a TextIO read followed by a DoFn that splits
each line at comma boundaries should be more straightforward than writing a
custom SDF. There are some complications, however.

1) How to deal with headers (and/or footers), especially if you can't
distinguish the header line from "real" data based on its contents, and/or
want to use the header line to parse the data into a structured format.
2) Some CSV specs allow newlines to appear in cells as long as they're
quoted, e.g.

value1,value2,"value3
with a newline",value4

This makes it impossible to reconstruct the full record from the two pieces
that a line-based read produces, and in general impossible to parse
correctly without starting at the very beginning of the file.
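
To make the second complication concrete, here is a minimal sketch in plain Python (standard library only, not Beam code). It contrasts a line-based split with a quote-aware parse of the record above. The variable names are illustrative assumptions.

```python
import csv
import io

# The two-line record from the example above.
data = 'value1,value2,"value3\nwith a newline",value4\n'

# Line-based view: split on newlines first, then on commas.
# This mirrors reading lines and parsing each line independently.
lines = data.splitlines()
naive = [line.split(",") for line in lines]

# Quote-aware view: csv.reader consumes the stream from the start and
# tracks whether it is inside a quoted cell when it meets a newline.
records = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive))    # number of fragments seen by the line-based view
print(len(records))  # number of records seen by the quote-aware parser
```

The line-based view yields two fragments, neither of which is a valid record on its own, while the quote-aware parser yields a single four-cell record. The parser only gets this right because it starts from a known record boundary.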

On Tue, Apr 24, 2018 at 2:08 PM Eugene Kirpichov <kirpichov@xxxxxxxxxx>
wrote:

> Hi Peter,

> Thanks for experimenting with SDF! However, in this particular case: any
> reason why you can't just use TextIO.read() and parse each line as CSV?
> Seems like that would require considerably less code.

> A few comments on this code per se:
> - The ProcessElement implementation immediately claims the entire range,
> which means that there can be no dynamic splitting and the code behaves
> equivalently to a regular DoFn.
> - The code for initial splitting of the restriction seems very complex.
> Can you just split it blindly into a bunch of byte ranges of about equal
> size? Looking at the actual data while splitting should never be
> necessary: you should be able to just look at the file size (say, 100MB)
> and split it into a bunch of ranges, say, [0, 10MB), [10MB, 20MB), etc.
> - It seems that the splitting code tries to align splits with record
> boundaries. This is not useful: it does not matter whether the split
> boundaries fall onto record boundaries or not; instead, the reading code
> should be able to read an arbitrary range of bytes in a meaningful way.
> That typically means that reading [a, b) means "start at the first record
> boundary located at or after 'a', end at the first record boundary located
> at or after 'b'".
> - Fine-tuning the evenness of initial splitting is also not useful:
> dynamic splitting will even things out anyway; moreover, even if you are
> able to achieve an equal amount of data read by different restrictions, it
> does not translate into equal time to process the data with the ParDos
> fused into the same bundle (and that time is unpredictable).
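
The blind splitting and the [a, b) reading rule described above can be sketched in plain Python (standard library only, not the Beam SDF API). The helper names split_ranges and read_range are hypothetical, and the sketch assumes one record per line with no quoted newlines.

```python
import io

def split_ranges(size, chunk):
    """Blindly split [0, size) into ranges of about `chunk` bytes.

    Only the total size is used; the data itself is never inspected.
    """
    return [(start, min(start + chunk, size)) for start in range(0, size, chunk)]

def read_range(f, a, b):
    """Return the records whose starting offset lies in [a, b).

    Assumption: one record per line, no quoted newlines.
    """
    if a == 0:
        f.seek(0)
    else:
        # Step back one byte and read through the next newline. This lands
        # on the first record boundary at or after `a`.
        f.seek(a - 1)
        f.readline()
    records = []
    # A record that starts before `b` is read in full, even if it ends
    # after `b`; a record that starts at `b` belongs to the next range.
    while f.tell() < b:
        line = f.readline()
        if not line:
            break
        records.append(line.rstrip(b"\r\n"))
    return records

data = b"1,alpha\n2,beta\n3,gamma\n4,delta\n"
ranges = split_ranges(len(data), 10)
recovered = [
    rec
    for a, b in ranges
    for rec in read_range(io.BytesIO(data), a, b)
]
print(ranges)
print(recovered)
```

Because each range owns exactly the records that start inside it, concatenating the output of all ranges yields every record once, regardless of where the split points fall.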


> On Tue, Apr 24, 2018 at 1:24 PM Peter Brumblay <pbrumblay@xxxxxxxxxxxxxx>
> wrote:

>> Hi All,

>> I noticed that there is no support for CSV file reading (e.g. RFC 4180)
>> in Apache Beam, at least no native transform. There's an issue to add this
>> support: https://issues.apache.org/jira/browse/BEAM-51.

>> I've seen examples which use the Apache Commons CSV parser. I took a
>> shot at implementing a SplittableDoFn transform. I have the full code and
>> some questions in a gist here:
>> https://gist.github.com/pbrumblay/9474dcc6cd238c3f1d26d869a20e863d.

>> I suspect it could be improved quite a bit. If anyone has time to
>> provide feedback I would really appreciate it.

>> Regards,

>> Peter Brumblay
>> Fearless Technology Group, Inc.