OSDir

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Implementing a “join” between a DataStream and a “set of rules”


Thanks Garvit for your suggestion!

 

From: Garvit Sharma <garvits45@xxxxxxxxx>
Date: Tuesday, June 5, 2018 at 8:44 AM
To: "Sandybayev, Turar (CAI - Atlanta)" <Turar.Sandybayev@xxxxxxxxxxxxxx>
Cc: "user@xxxxxxxxxxxxxxxx" <user@xxxxxxxxxxxxxxxx>
Subject: Re: Implementing a “join” between a DataStream and a “set of rules”

 

Hi,

 

For the above use case, you should do the following :

 

1. Convert your DataStream into KeyedDataStream by defining a key which would be used to get validated against your rules.

2. Same as 1 for rules stream.

3. Join the two keyedStreams using Flink's connect operator.

4. Store the rules into Flink's internal state i,e. Flink's managed keyed state.

5. Validate the data coming in the dataStream against the managed keyed state.

 

Refer to [1] [2] for more details.

 

 

 

 

On Tue, Jun 5, 2018 at 5:52 PM, Sandybayev, Turar (CAI - Atlanta) <Turar.Sandybayev@xxxxxxxxxxxxxx> wrote:

Hi,

 

What is the best practice recommendation for the following use case? We need to match a stream against a set of “rules”, which are essentially a Flink DataSet concept. Updates to this “rules set" are possible but not frequent. Each stream event must be checked against all the records in “rules set”, and each match produces one or more events into a sink. Number of records in a rule set are in the 6 digit range.

 

Currently we're simply loading rules into a local List of rules and using flatMap over an incoming DataStream. Inside flatMap, we're just iterating over a list comparing each event to each rule.

 

To speed up the iteration, we can also split the list into several batches, essentially creating a list of lists, and creating a separate thread to iterate over each sub-list (using Futures in either Java or Scala).

 

Questions:

1.            Is there a better way to do this kind of a join?

2.            If not, is it safe to add additional parallelism by creating new threads inside each flatMap operation, on top of what Flink is already doing?

 

Thanks in advance!

Turar

 



 

--


Garvit Sharma
github.com/garvitlnmiit/

No Body is a Scholar by birth, its only hard work and strong determination that makes him master.