
Re: Reading CSV from google cloud storage to Data Flow

The same holds true in Python: Read the files with TextIO and follow with a Map operation that splits the lines into records. 
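A minimal sketch of that per-line split, using the standard library's csv module so quoted commas are handled correctly (the bucket path in the comment is illustrative, not from the thread):

```python
import csv

def parse_line(line):
    """Split one CSV line into a record; csv.reader handles quoted commas."""
    return next(csv.reader([line]))

# In a Beam pipeline this would follow the read, e.g. (illustrative):
#   p | beam.io.ReadFromText('gs://my-bucket/my.csv') | beam.Map(parse_line)
print(parse_line('a,"b,c",d'))  # -> ['a', 'b,c', 'd']
```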

This, of course, only works if you don't have newlines within your records. In that case, you may need a DoFn that takes each filename as input and reads the entire file (e.g. using the standard library's csv parser), emitting the records (possibly followed by a Reshuffle), e.g.

(p
 | beam.Create([list of filenames])
 | beam.FlatMap(lambda path: csv.reader(open(path)))
 | beam.Reshuffle()
 | ...)
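A slightly fuller version of that whole-file read (a sketch; the Beam wiring in the comment is an assumption, not from the thread). Parsing the entire file with csv.reader preserves newlines embedded in quoted fields, which per-line splitting would break apart:

```python
import csv

def read_csv_records(path):
    """Yield records from one whole CSV file.

    Feeding the full file object to csv.reader keeps newlines that occur
    inside quoted fields as part of the field value.
    """
    with open(path, newline='') as f:
        for row in csv.reader(f):
            yield row

# Illustrative Beam wiring, matching the snippet above (assumed):
#   p | beam.Create(filenames) | beam.FlatMap(read_csv_records) | beam.Reshuffle()
```

Note that a plain open() only works for local paths; for gs:// paths, Beam's apache_beam.io.filesystems.FileSystems.open could be substituted, though it returns a binary file object, so the bytes would need decoding before being handed to csv.reader.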

If your files are too big to read in a single mapper *and* contain newlines within records, you may have to implement something like the distributed CSV parsing approach described at https://blog.etleap.com/2016/11/27/distributed-csv-parsing/

On Sun, Nov 25, 2018 at 2:29 PM Unais T <tpunais@xxxxxxxxx> wrote:

On Sun, Nov 25, 2018 at 4:54 PM Jean-Baptiste Onofré <jb@xxxxxxxxxxxx> wrote:
Hi Unais,

What SDK do you plan to use ? Java or Python ?

Regarding Java, I would use directly TextIO.


On 25/11/2018 13:09, Unais T wrote:
> Hey guys,
> One doubt 
> I want to read a csv file from google cloud storage to Data Flow
> which is the best method?
> 1.   Read csv and sync to BQ and then use BigQuerySource method
> 2.   Read from cloud storage directly to Data Flow (Is there any source
> method for csv from cloud storage to CSV - like `ReadFromText` )
> What's the best way to read CSV from Cloud Storage to Data Flow?

Jean-Baptiste Onofré
Talend - http://www.talend.com