Thank you for expressing interest in taking this on! Let me give you a few pointers to get started, and I'll be happy to help at every step along the way.
Basically we want BigQueryIO.write() to return something (e.g. a PCollection) that can be used as input to Wait.on().
Currently it returns a WriteResult, which only contains a PCollection<TableRow> of failed inserts. That cannot be used directly for this purpose; instead, we should add another component to WriteResult that represents the result of successfully writing the data.
Given that BQIO supports dynamic destination writes, I think it makes sense for that to be a PCollection<KV<DestinationT, ???>>, so that in theory we could sequence different destinations independently (Wait.on() does not currently provide such a feature, but it could). This will require changing WriteResult to WriteResult<DestinationT>. As for what the "???" might be: it is something that represents the result of successfully writing a window of data. I think it can even be Void, or "?" (a wildcard type) for now, until we figure out something better.
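To make the goal concrete, here is a rough sketch of what a user pipeline could look like once this exists. The accessor name getSuccessfulWrites() and the surrounding names (queries, ReadWhatWasWrittenFn, etc.) are hypothetical, purely for illustration:

```java
// Hypothetical: WriteResult made generic, with a new accessor for the
// "successfully written" signal collection.
WriteResult<TableDestination> result = rows.apply(
    BigQueryIO.writeTableRows()
        .to(tableSpec)
        .withSchema(schema));

// Wait.on() holds back each window of `queries` until the corresponding
// window of the signal collection is complete, i.e. until that window of
// data has been fully written to BigQuery.
queries
    .apply(Wait.on(result.getSuccessfulWrites()))
    .apply(ParDo.of(new ReadWhatWasWrittenFn()));
```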
Implementing this would require roughly the following work:
- Add this PCollection<KV<DestinationT, ?>> to WriteResult
- Modify the BatchLoads transform to provide it on both codepaths: expandTriggered() and expandUntriggered()
...- expandTriggered() itself writes via 2 codepaths: single-partition and multi-partition. Both need to be handled: we need to get a PCollection<KV<DestinationT, ?>> from each of them, and Flatten these two PCollections together to get the final result. The single-partition codepath (writeSinglePartition) already uses WriteTables under the hood, which returns a KV<DestinationT, ...>, so it's directly usable. The multi-partition codepath ends in WriteRenameTriggered; unfortunately, this codepath drops DestinationT along the way and will need to be refactored a bit to keep it until the end.
...- expandUntriggered() should be treated the same way.
- Modify the StreamingWriteTables transform to provide it
...- Here also, the challenge is to propagate the DestinationT type all the way to the end of StreamingWriteTables, which will require some refactoring. After that, returning a KV<DestinationT, ...> should be easy.
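Putting the BatchLoads pieces together, the end state might look roughly like the sketch below. All identifiers here (fromSinglePartition, fromMultiPartition, the WriteResult constructor shape) are hypothetical placeholders, not the actual names in the codebase:

```java
// fromSinglePartition: comes out of WriteTables, already keyed by DestinationT.
// fromMultiPartition: comes out of the WriteRename codepath, after refactoring
// it to carry DestinationT through to the end.
PCollection<KV<DestinationT, Void>> successfulWrites =
    PCollectionList.of(fromSinglePartition)
        .and(fromMultiPartition)
        .apply(Flatten.pCollections());

// WriteResult (now generic in DestinationT) would expose this alongside the
// existing failed-inserts output.
return new WriteResult<>(input.getPipeline(), failedInserts, successfulWrites);
```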
Another challenge with all of this is backwards compatibility in terms of API and pipeline update.
Pipeline update is much less of a concern for the BatchLoads codepath, because it's typically used in batch-mode pipelines that don't get updated. I would recommend starting there, perhaps even with only the untriggered codepath (it is much more commonly used); that will pave the way for the rest of the work.
Hope this helps, please ask more if something is unclear!