[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Running Flink on Yarn

I think the data buffered for join will be distributed among threads by order_id (a1 and a2 will be internally keyed). 
Each thread will have non-shared window state (for 2 hours) per certain order_id's.
Slots will share some common JVM resources mentioned in docs, also access to state DB but not the majority of storage occupied by state.

cc Timo, Piotr

On Mon, Dec 24, 2018 at 7:46 PM Anil <anilsingh.jsr@xxxxxxxxx> wrote:
I am using  time-windowed join only. Here's a sample query -

SELECT a1.order_id, a2.order.restaurant_id FROM awz_s3_stream1 a1 INNER JOIN
awz_s3_stream2 a2 ON CAST(a1.order_id AS VARCHAR) = a2.order_id AND
a1.to_state = 'PLACED' AND a1.proctime BETWEEN a2.proctime - INTERVAL '2'
HOUR AND a2.proctime + INTERVAL '2' HOUR GROUP BY HOP(a2.proctime, INTERVAL
'2' MINUTE, INTERVAL '1' HOUR), a2.`order`.restaurant_id

Just to simplify my question -

Suppose I have a TM with 4 slots and I deploy a flink job with parallelism=4
with 2 container - 1 JM and 1 TM. Each parallel instance will be deployed in
one task slot each in the TM (the entire job pipeline running per slot ).My
jobs does a join(SQL time-windowed join on non-keyed stream) and they buffer
last few hours of data. My question is will these threads running in
different task slot share this data buffered for join. What all data is
shared across these threads.

Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/