Re: Batch writing from Flink streaming job
If you want to write in batches from a streaming source, you always need some state, i.e. a state database (a NoSQL database such as a key-value store makes sense here). You can then grab the data at certain points in time and convert it to Avro. Make sure the state is logically consistent (e.g. everything from the last day), so that events arriving later than expected do not end up missing from the files.
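To illustrate the idea, here is a small language-agnostic sketch (plain Python, not Flink API) of buffering events per logical day and only flushing a day's bucket once the watermark has passed its end plus an allowed lateness, so late-but-not-too-late events still land in the right file. The names `DailyBuffer` and `advance_watermark` are made up for this sketch.

```python
from collections import defaultdict

DAY_MS = 24 * 60 * 60 * 1000

class DailyBuffer:
    """Buffers events keyed by event-time day; flushes only complete days."""

    def __init__(self, allowed_lateness_ms=0):
        self.allowed_lateness_ms = allowed_lateness_ms
        self.buckets = defaultdict(list)  # day_start_ms -> list of events

    def add(self, event_time_ms, payload):
        # assign the event to its day by event time, not arrival time
        day = (event_time_ms // DAY_MS) * DAY_MS
        self.buckets[day].append((event_time_ms, payload))

    def advance_watermark(self, watermark_ms):
        # return the buckets that are now complete and drop them from state
        ready = {}
        for day in sorted(self.buckets):
            if watermark_ms >= day + DAY_MS + self.allowed_lateness_ms:
                ready[day] = self.buckets.pop(day)
        return ready

buf = DailyBuffer(allowed_lateness_ms=60_000)
buf.add(100, "a")          # belongs to day 0
buf.add(DAY_MS + 5, "b")   # belongs to day 1
flushed = buf.advance_watermark(DAY_MS + 60_000)
print(sorted(flushed))     # -> [0]  (only day 0 is complete)
```

In a real job the flushed buckets would be the point where you convert the day's records to Avro; the key point is that the flush is driven by the watermark, not the wall clock.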
You can write your own sink, but it would still require some state database from which the data is afterwards written as a batch.
Maybe this could be a generic Flink component, i.e. writing to a state database and later writing a logically consistent state (what "consistent" means is defined by the application) into other sinks (CSV sink, Avro sink, etc.).
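The "generic component" shape could look roughly like this: a consistent snapshot of records handed to interchangeable batch sinks. This is a hypothetical sketch in plain Python, not an existing Flink API; `CsvSink`, `JsonSink`, and `export_snapshot` are invented names, and a real Avro sink would slot in the same way.

```python
import csv
import io
import json

class CsvSink:
    """Renders a snapshot of records as CSV text."""
    def write(self, records):
        out = io.StringIO()
        writer = csv.writer(out)
        for rec in records:
            writer.writerow(rec)
        return out.getvalue()

class JsonSink:
    """Renders a snapshot of records as newline-delimited JSON."""
    def write(self, records):
        return "\n".join(json.dumps(rec) for rec in records)

def export_snapshot(records, sink):
    # the application decides what makes a snapshot logically consistent
    # (e.g. "all events of the previous day"); the component only fans
    # the finished snapshot out to the chosen sink
    return sink.write(records)

records = [["2018-05-13", "click", 3], ["2018-05-13", "view", 7]]
print(export_snapshot(records, CsvSink()))
print(export_snapshot(records, JsonSink()))
```

The sink interface stays the same regardless of the output format, which is exactly what makes the component generic.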
> On 13. May 2018, at 14:02, Padarn Wilson <padarn@xxxxxxxxx> wrote:
> Hi all,
> I am writing some jobs intended to run using the DataStream API with a Kafka source. However, we also have a lot of data in Avro archives (of the same Kafka source). I would like to be able to run the processing code over parts of the archive so I can generate some "example output".
> I've written the transformations needed to read the data from the archives and process the data, but now I'm trying to figure out the best way to write the results of this to some storage.
> At the moment I can easily write to Json or CSV using the bucketing sink (although I'm curious about using the watermark time rather than system time to name the buckets), but I'd really like to store to something smaller like Avro.
> However, I'm not sure this makes sense. Writing to a compressed file format in this way from a streaming job doesn't sound intuitively right. What would make the most sense? I could write to some temporary database and then pipe that into an archive, but this seems like a lot of trouble. Is there a way to pipe the output directly into the batch API of Flink?