Best way to repeatedly update a side input


Hi all!!

I want to create a job that reads data from Pub/Sub and joins it with another collection, a side input read from a BigQuery table (actually, the table schemas). What I'm not 100% sure about is the best way to refresh that side input, since the schemas may change at any time. I thought of using GenerateSequence + FixedWindows, but it sounded "hacky" to me, so I was wondering whether it is really the best approach.

The idea would be something like this:
    // Emit one tick per minute, window the ticks, and reload the schemas on each tick.
    val schemas = inputPipeline.context
      .customInput("Schemas", GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1)))
      .withFixedWindows(Duration.standardMinutes(1))
      .flatMap { _ =>
        log.info("Retrieving schemas...")
        // getBQSchema and table are assumed to be defined elsewhere
        Seq("Table1", "Table2").map(n => (n, getBQSchema(table, n)))
      }
      .asMapSideInput
If this makes sense, I have another question: how would it behave with many workers? Will each worker retrieve the schemas itself, or will one worker run the retrieval and then "broadcast" the result?

Thanks!