Re: Implementing a “join” between a DataStream and a “set of rules”

Hi Aljoscha,


Thank you, this seems like a good match for this use case. Am I understanding correctly that, since only MemoryStateBackend is available for broadcast state, the maximum state size would be 5 MB?


If I use Flink's state mechanism for storing rules, I will still need to iterate through all the rules inside a flatMap, and there's no higher-level join mechanism that I can employ, right? Is there any downside to parallelizing that iteration inside my user flatMap operation?

From: Aljoscha Krettek <aljoscha@xxxxxxxxxx>
Date: Tuesday, June 5, 2018 at 12:05 PM
To: Amit Jain <aj2011it@xxxxxxxxx>
Cc: "Sandybayev, Turar (CAI - Atlanta)" <Turar.Sandybayev@xxxxxxxxxxxxxx>, "user@xxxxxxxxxxxxxxxx" <user@xxxxxxxxxxxxxxxx>
Subject: Re: Implementing a “join” between a DataStream and a “set of rules”

On 5. Jun 2018, at 14:46, Amit Jain <aj2011it@xxxxxxxxx> wrote:


Hi Sandybayev,

In its current state, Flink does not provide a solution for the use case
you describe. However, there is an open FLIP [1] [2] that was created to
address it.

As far as I can see, your current approach does not let you update the
rule set data. I think you could update it by building a DataStream
around changelogs that are stored in a message queue or a distributed
file system.
You could also store the rule set data in an external system that you
query for incoming keys from Flink.

[1]: https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
[2]: https://issues.apache.org/jira/browse/FLINK-6131

On Tue, Jun 5, 2018 at 5:52 PM, Sandybayev, Turar (CAI - Atlanta)
<Turar.Sandybayev@xxxxxxxxxxxxxx> wrote:


What is the best-practice recommendation for the following use case? We need
to match a stream against a set of “rules”, which is essentially a Flink
DataSet concept. Updates to this “rule set” are possible but not frequent.
Each stream event must be checked against all the records in the “rule set”,
and each match produces one or more events into a sink. The number of records
in the rule set is in the six-digit range.

Currently we're simply loading rules into a local List of rules and using
flatMap over an incoming DataStream. Inside flatMap, we're just iterating
over a list comparing each event to each rule.

To speed up the iteration, we can also split the list into several batches,
essentially creating a list of lists, and creating a separate thread to
iterate over each sub-list (using Futures in either Java or Scala).
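
To make the batching idea above concrete, here is a minimal plain-Java
sketch. It is not Flink API: BatchedRuleMatcher, split and matchAll are
hypothetical names, and the matching logic is passed in as a predicate
because the real rule format is not described here. Each call blocks until
every batch has been checked, so the calling thread is the only one that
touches the results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiPredicate;

// Hypothetical helper (not part of Flink): checks one event against a rule
// list that has been split into batches, one asynchronous task per batch.
class BatchedRuleMatcher<E, R> {
    private final List<List<R>> batches;
    private final BiPredicate<E, R> matches;
    private final ExecutorService pool;

    // batchCount must be at least 1; it also sizes the thread pool.
    BatchedRuleMatcher(List<R> rules, int batchCount, BiPredicate<E, R> matches) {
        this.batches = split(rules, batchCount);
        this.matches = matches;
        this.pool = Executors.newFixedThreadPool(batchCount);
    }

    // Splits items into at most batchCount consecutive sub-lists.
    static <T> List<List<T>> split(List<T> items, int batchCount) {
        List<List<T>> out = new ArrayList<>();
        int size = Math.max(1, (items.size() + batchCount - 1) / batchCount);
        for (int i = 0; i < items.size(); i += size) {
            out.add(new ArrayList<>(items.subList(i, Math.min(items.size(), i + size))));
        }
        return out;
    }

    // Returns all rules matching the event, in original rule order.
    // Blocks until every batch has finished.
    List<R> matchAll(E event) {
        List<CompletableFuture<List<R>>> futures = new ArrayList<>();
        for (List<R> batch : batches) {
            futures.add(CompletableFuture.supplyAsync(() -> {
                List<R> hits = new ArrayList<>();
                for (R rule : batch) {
                    if (matches.test(event, rule)) {
                        hits.add(rule);
                    }
                }
                return hits;
            }, pool));
        }
        List<R> result = new ArrayList<>();
        for (CompletableFuture<List<R>> future : futures) {
            result.addAll(future.join()); // wait for this batch
        }
        return result;
    }

    // Stops the worker threads; call once when the matcher is no longer needed.
    void close() {
        pool.shutdown();
    }
}
```

Inside a Flink job, something like this would presumably be created in the
open() method of a RichFlatMapFunction and shut down in close(), with
flatMap calling matchAll and emitting the hits from the operator's own
thread. Whether the extra threads actually help depends on how many cores
remain free once Flink's own parallelism is accounted for.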


1. Is there a better way to do this kind of a join?

2. If not, is it safe to add additional parallelism by creating
new threads inside each flatMap operation, on top of what Flink is already

Thanks in advance!