[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Implementing a “join” between a DataStream and a “set of rules”



What is the best practice recommendation for the following use case? We need to match a stream against a set of “rules”, which are essentially a Flink DataSet concept. Updates to this “rules set" are possible but not frequent. Each stream event must be checked against all the records in “rules set”, and each match produces one or more events into a sink. Number of records in a rule set are in the 6 digit range.


Currently we're simply loading rules into a local List of rules and using flatMap over an incoming DataStream. Inside flatMap, we're just iterating over a list comparing each event to each rule.


To speed up the iteration, we can also split the list into several batches, essentially creating a list of lists, and creating a separate thread to iterate over each sub-list (using Futures in either Java or Scala).



1.            Is there a better way to do this kind of a join?

2.            If not, is it safe to add additional parallelism by creating new threads inside each flatMap operation, on top of what Flink is already doing?


Thanks in advance!