Reading from and writing to various sources/sinks is very easy in Spark; it would be great if Beam could provide something similar.
If that is not possible, it would also help to eliminate the Avro schema, either by using a Beam schema instead or by simply accepting a type name (and creating POJOs instead of GenericRecords). It would be very good to avoid the
intermediate step of dealing with Avro when reading from or writing to Parquet.
From: Łukasz Gajowy <lgajowy@xxxxxxxxxx>
Sent: Tuesday, July 17, 2018 2:29:22 PM
Subject: Re: Schema class in 2.5 ?
I think what you're asking for should be doable, but it requires modifications to the ParquetIO code, which uses the Avro schema in two places:
- read: to call setCoder on the PCollection. As long as there is already a way to set the coder that does not require the Avro schema, we're good to go there (at the time ParquetIO was developed, I don't think there was). From the doc mentioned above, I suspect that SchemaCoder may be the best fit for that.
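A minimal sketch of that read-side idea, assuming the elements are Beam Rows: SchemaCoder.of(beamSchema) yields a Row coder built purely from a Beam schema, which could replace the Avro-based coder that ParquetIO.read() sets today. The field names and the surrounding helper are illustrative assumptions, not ParquetIO code.

```java
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.SchemaCoder;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class SchemaCoderSketch {
    // Hypothetical helper: attach a coder derived from a Beam schema,
    // with no Avro schema involved on the read path.
    static PCollection<Row> withBeamSchemaCoder(PCollection<Row> records) {
        // A Beam schema describing the Parquet records (example fields).
        Schema beamSchema = Schema.builder()
                .addStringField("name")
                .addInt64Field("id")
                .build();

        // SchemaCoder.of(schema) produces a Coder<Row> for that schema.
        Coder<Row> coder = SchemaCoder.of(beamSchema);
        return records.setCoder(coder);
    }
}
```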
- write: the Avro schema is used by AvroParquetWriter.builder(), which explicitly requires it. I think we could accept Beam's schema as long as there's a way to transform it into an Avro schema. That should be doable: for example, we could render Beam's schema as JSON and then pass it to Avro's new Schema.Parser().parse() method to get the Avro schema for the builder.
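The write-side idea above can be sketched as follows. The JSON string here is hand-written for illustration; the actual change would generate it from the Beam schema's fields. Avro's Schema.Parser is the standard way to obtain an org.apache.avro.Schema from its JSON form, which is what AvroParquetWriter.builder(...).withSchema(...) expects.

```java
import org.apache.avro.Schema;

public class BeamToAvroSchemaSketch {
    // Parse an Avro schema from its JSON representation.
    static Schema toAvroSchema(String avroJson) {
        return new Schema.Parser().parse(avroJson);
    }

    public static void main(String[] args) {
        // Illustrative JSON; a real implementation would build this
        // string from the Beam schema's field names and types.
        String avroJson =
            "{\"type\":\"record\",\"name\":\"MyRecord\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"id\",\"type\":\"long\"}]}";

        Schema avroSchema = toAvroSchema(avroJson);
        // avroSchema can now be handed to AvroParquetWriter's builder.
        System.out.println(avroSchema.getName());
    }
}
```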
On Tue, 17 Jul 2018 at 09:52, Akanksha Sharma B <akanksha.b.sharma@xxxxxxxxxxxx> wrote: