Thanks. I have created a PCollection from the dataset in the H5 file, which is provided as a numpy array.
It is very challenging for my use case to describe the schema: the original dimensions of the dataset are 70K x 30K. Any suggestions on how to work around that?
I think it was mentioned at the summit that there will be a way to write to BQ without a schema. Is something like that on the roadmap?
(moving dev to bcc)
I was able to make it work by creating the PCollection from the numpy array. However, writing to BQ was impossible because it required a schema.
(p | "create all" >> beam.Create(expression[1:5, 1:5])
   | "write all text" >> beam.io.WriteToText('gs://archs4/output/', file_name_suffix='.txt'))
Is there a workaround for having to provide a schema to beam.io.BigQuerySink?
Regarding your earlier question: you do need at least one element in the PCollection to trigger the ParDo to do any work (this can be a Create with a single element that you ignore).
Not sure if I fully understood the BigQuery question. You have to specify a schema when writing to a new BigQuery table. See the following example: