I am writing some jobs intended to run on the DataStream API with a Kafka source. However, we also have a lot of data in Avro archives (of the same Kafka source). I would like to be able to run the processing code over parts of the archive so I can generate some "example output".
I've written the transformations needed to read the data from the archives and process the data, but now I'm trying to figure out the best way to write the results of this to some storage.
At the moment I can easily write to JSON or CSV using the bucketing sink (although I'm curious about using the watermark time rather than system time to name the buckets), but I'd really like to store the output in something more compact, like Avro.
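To make the bucket-naming idea concrete, here is a minimal sketch of what I have in mind. It is plain Java with no Flink dependency: the class `EventTimeBuckets`, the method `bucketPath`, the base path `/data/out`, and the hourly `yyyy-MM-dd--HH` layout are all hypothetical names I made up for illustration. It also uses a record's own event timestamp rather than the watermark itself, which is a simplification. I assume logic like this would have to live in a custom bucketer plugged into the bucketing sink, but I have not verified that.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: derives a bucket directory from a record's event time
// instead of the wall-clock time at which the record happens to be written.
class EventTimeBuckets {

    // Hourly buckets in UTC; the pattern is an assumption, not a Flink default I have checked.
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd--HH").withZone(ZoneOffset.UTC);

    // eventTimeMillis is the record's timestamp in epoch milliseconds.
    static String bucketPath(String basePath, long eventTimeMillis) {
        return basePath + "/" + HOURLY.format(Instant.ofEpochMilli(eventTimeMillis));
    }

    public static void main(String[] args) {
        // A record from the archive keeps its original hour, no matter when the job runs.
        System.out.println(bucketPath("/data/out", 1500000000000L)); // /data/out/2017-07-14--02
        System.out.println(bucketPath("/data/out", 0L));             // /data/out/1970-01-01--00
    }
}
```

With something like this, replaying an old slice of the archive would land in the buckets for the hours the data was originally produced, rather than all ending up in a bucket named after the time of the replay.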
However, I'm not sure this makes sense. Writing to a compressed file format in this way from a streaming job doesn't sound intuitively right. What would make the most sense? I could write to some temporary database and then pipe that into an archive, but this seems like a lot of trouble. Is there a way to pipe the output directly into the batch API of Flink?