osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

your advice please regarding state


Hi ,
I am very new to flink so please be gentle :)

The challenge:
I have a road sensor that should scan billons of cars per day. for starter I want to recognise if each car that passes by is new or not. new cars (never been seen before by that sensor ) will be placed on a different topic on kafka than the other (total of two topics for new and old) .
 under the assumption that the state will contain billions of unique car ids.

Suggested Solutions
My question is it which approach is better. 
Both approaches using RocksDB 

1. use the ValueState and to split the steam like 
  val domainsSrc = env
    .addSource(consumer)
    .keyBy(car => car.id)
    .map(...)
and checking if the state value is null to recognise new cars. if new than I will update the state
how will the persistent data will be shard among the nodes in the cluster (let's say that I have 10 nodes) ?

2. use MapState and to partition the stream to groups by some arbitrary factor e.g
val domainsSrc = env
    .addSource(consumer)
    .keyBy{ car =>
        val h car.id.hashCode % partitionFactor
        math.abs(h)
    } .map(...)
and to check mapState.keys.contains(car.id) if not - add it to the state 

which approach is better ?

Thanks in advance 
Avi