osdir.com


Re: Returning dataframe from parDo and printing its value - advice?


Hi,

Is anyone listening on the user@ mailing list, or should I use a different mailing list?

I have made some progress:
- The ParDo now returns a list.
- I added a header to WriteToText.

The pipeline looks like this:
ExploreData = (p | "Extract the rows from dataframe" >> beam.io.Read(beam.io.BigQuerySource('archs4.Debug_annotation'))
                | "create more columns" >> beam.ParDo(CreateColForSampleFn(colListSubset,outputPath)))

(ExploreData | 'writing to CSV files' >> beam.io.WriteToText('gs://dataExploration.txt',file_name_suffix='.csv',num_shards=1,append_trailing_newlines=True,header=colListStr))


The remaining issue is that the output has a newline after each value:
None
None
None
None
None
 30
 Primary Tissue
None
None
None
Please let me know how I can get rid of these newlines. I hope to be able to open the output file with Google Sheets.
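
One likely explanation, offered as a guess from the output above: a ParDo treats whatever the DoFn returns as an iterable of output elements, so a flat list of values becomes one element per value, and WriteToText writes each element on its own line. A minimal sketch of one way around it is to join each row's values into a single CSV string and emit only that string, for example with `yield` inside `process`. The helper name `values_to_csv_line` is hypothetical, and stripping whitespace and writing None as an empty field are assumptions, not part of the original pipeline.

```python
import csv
import io


def values_to_csv_line(values):
    """Join one row's values into a single CSV-formatted line (no trailing newline)."""
    buffer = io.StringIO()
    # lineterminator="" because WriteToText appends the newline itself.
    writer = csv.writer(buffer, lineterminator="")
    # Assumption: None becomes an empty field and stray whitespace is stripped.
    writer.writerow(["" if v is None else str(v).strip() for v in values])
    return buffer.getvalue()


# Sample values taken from the output shown above.
row = [None, None, " 30", " Primary Tissue", None]
line = values_to_csv_line(row)
print(line)  # ,,30,Primary Tissue,
```

Using the csv module rather than a plain `",".join(...)` keeps fields that contain commas or quotes intact, which should matter when the file is opened in a spreadsheet.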

Thanks,
Eila


On Fri, Jun 15, 2018 at 2:45 PM, OrielResearch Eila Arich-Landkof <eila@xxxxxxxxxxxxxxxxx> wrote:
Hi all,

I am running a pipeline in which a table from BQ is processed row by row using a ParDo.
CreateColForSampleFn generates a DataFrame with headers and values (shape: 1x164) that I want to pass to WriteToText.
See the following:

ExploreData = (p | "Extract the rows from dataframe" >> beam.io.Read(beam.io.BigQuerySource('archs4.Debug_annotation'))
                | "create more columns" >> beam.ParDo(CreateColForSampleFn(colListSubset,outputPath)))

(ExploreData | 'writing to CSV files' >> beam.io.WriteToText('gs://dataExploration.txt',num_shards=1))
 
My questions are about the returned DF and WriteToText:
1. When I pass the DF from CreateColForSampleFn to WriteToText, I get only the headers:
Sample_contact_phone
Sample_extract_protocol_ch1
Sample_platform_id
Sick
Sample_title
index
Sample_last_update_date
Sample_contact_country
Sample_channel_count
Sample_library_source
Sample_taxid_ch1

2. When I return the DF in a list, [df], I get the following text for each row (including the dimensions):
 Sample_contact_phone                        Sample_extract_protocol_ch1 Sample_platform_id  Sick
0                       Library construction protocol: Four µg of tota...           GPL11154  None
[1 rows x 168 columns]


I want to generate a text file that includes:
- One header (if needed, I will add it after the pipeline has completed)
- All the values from each row that was processed into a DF
- Full cell values, without ... in the middle

What am I missing? Any advice?
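
Both symptoms are consistent with how pandas behaves: iterating over a DataFrame yields its column labels, which would explain the headers-only output, and converting a DataFrame to text with `str()` gives the truncated display form, including the "..." and the dimensions line. A sketch of one possible approach, assuming pandas is available to the DoFn, is to serialize each row with `to_csv` and emit the resulting strings instead of the DataFrame. The helper name `dataframe_to_csv_lines` and the sample data are hypothetical.

```python
import pandas as pd


def dataframe_to_csv_lines(df):
    """Return one CSV-formatted string per DataFrame row, without header or index."""
    lines = []
    for i in range(len(df)):
        # to_csv writes full cell values (no "..." truncation); missing values
        # such as None become empty fields by default.
        text = df.iloc[[i]].to_csv(header=False, index=False)
        lines.append(text.rstrip("\r\n"))
    return lines


# Hypothetical one-row frame with a long cell value.
df = pd.DataFrame({
    "Sample_platform_id": ["GPL11154"],
    "Sick": [None],
    "long_text": ["a" * 200],
})
lines = dataframe_to_csv_lines(df)
header_line = ",".join(df.columns)  # could be passed as the WriteToText header
```

Returning `lines` from the DoFn would then hand WriteToText one complete CSV line per processed row.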
 
Thanks,