It is good to see this discussed!

I think there needs to be a good balance between the SDK harness capabilities/complexity and responsibilities. Additionally, the user will need to be able to adjust the runner behavior, since the type of workload executed in the harness is also a factor. Elsewhere we already discussed that the current assumption of a single SDK harness instance per Flink task manager brings problems with it and that there needs to be more than one way for the runner to spin up SDK harnesses.

There was the concern that instantiation of multiple SDK harnesses per TM host is expensive (resource usage, initialization time, etc.). That may hold true for a specific scenario, such as batch workloads and the use of Docker containers. But it may look totally different for a streaming topology or when the SDK harness is just a process on the same host.

Thanks,
Thomas

On Fri, Aug 17, 2018 at 8:36 AM Lukasz Cwik <lcwik@xxxxxxxxxx> wrote:

SDK harnesses were always responsible for executing all work given to them concurrently. Runners have been responsible for choosing how much work to give to the SDK harness in such a way that best utilizes the SDK harness.

I understand that multithreading in Python is inefficient due to the global interpreter lock. It would be up to the runner in this case to make sure that the amount of work it gives to each SDK harness best utilizes it while spinning up an appropriate number of SDK harnesses.

On Fri, Aug 17, 2018 at 7:32 AM Maximilian Michels <mxm@xxxxxxxxxx> wrote:

Hi Ankur,
Thanks for looking into this problem. The cause seems to be Flink's
pipelined execution mode. It runs multiple tasks in one task slot and
produces a deadlock when the pipelined operators schedule the SDK
harness DoFns in non-topological order.
The problem would be resolved if we scheduled the tasks in topological
order. Doing that is not easy because they run in separate Flink
operators and the SDK Harness would have to have insight into the
execution graph (which is not desirable).
The easiest method, which you proposed in 1), is to ensure that the
number of threads in the SDK harness matches the number of
ExecutableStage DoFn operators.
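
The deadlock and the thread-count fix can be reproduced outside of Beam and Flink with a plain thread pool. The following is only a minimal sketch: `run`, `map_a`, `map_b` and the queue standing in for the GroupBy are hypothetical names for illustration, not SDK harness code. The downstream stage is submitted first, as happens when operators are scheduled in non-topological order.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run(num_threads, wait_seconds=1.0):
    """Submit MapB before MapA; return MapB's output, or None if it is stuck."""
    channel = queue.Queue()  # hypothetical stand-in for the data path between stages

    def map_a():
        channel.put("element")  # upstream stage produces one element

    def map_b():
        return channel.get().upper()  # downstream stage blocks until input arrives

    pool = ThreadPoolExecutor(max_workers=num_threads)
    future_b = pool.submit(map_b)  # downstream stage reaches the harness first
    pool.submit(map_a)             # upstream stage is queued behind it
    try:
        return future_b.result(timeout=wait_seconds)
    except FutureTimeout:
        # MapB holds the only thread, so MapA never ran. Unblock MapB
        # by hand so that this demo can shut down cleanly.
        channel.put("unblock")
        return None
    finally:
        pool.shutdown(wait=True)


if __name__ == "__main__":
    print("1 thread :", run(1))
    print("2 threads:", run(2))
```

With one thread the call times out because `map_b` never releases the thread that `map_a` needs. With one thread per stage, both make progress.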
The approach in 2) is what Flink does as well. It glues together
horizontal parts of the execution graph, also in multiple threads. So I
agree with your proposed solution.
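
As a rough illustration of approach 2), a pool can start a new thread whenever no idle one is available, keep finished threads around for reuse, and warn once it reaches a high upper bound. This is a sketch under assumptions, not the actual SDK harness implementation: the class name `GrowingThreadPool`, the `max_threads` parameter and its default are made up for the example.

```python
import collections
import logging
import threading


class GrowingThreadPool:
    """Hypothetical pool: grows on demand, caches idle threads, warns at the cap."""

    def __init__(self, max_threads=64):
        self._max_threads = max_threads
        self._cond = threading.Condition()
        self._tasks = collections.deque()
        self._idle = 0          # workers currently waiting for a task
        self.thread_count = 0   # workers created so far

    def submit(self, fn, *args):
        with self._cond:
            self._tasks.append((fn, args))
            if len(self._tasks) <= self._idle:
                self._cond.notify()  # a cached idle worker picks the task up
            elif self.thread_count < self._max_threads:
                self.thread_count += 1
                threading.Thread(target=self._work, daemon=True).start()
            else:
                logging.warning(
                    "Thread upper bound %d reached; task must wait for a free thread",
                    self._max_threads)

    def _work(self):
        while True:
            with self._cond:
                self._idle += 1
                while not self._tasks:
                    self._cond.wait()
                self._idle -= 1
                fn, args = self._tasks.popleft()
            try:
                fn(*args)
            except Exception:
                logging.exception("Task failed")
```

Because a blocked task never prevents a later task from getting its own thread (below the cap), the MapB-before-MapA ordering no longer stalls, and only as many threads exist as were actually needed at the same time.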
On 17.08.18 03:10, Ankur Goenka wrote:
> tl;dr Deadlock in task execution caused by limited task parallelism on
> * Job type: /*Beam Portable Python Batch*/ Job on Flink standalone
> * Only a single job is scheduled on the cluster.
> * Everything is running on a single machine with single Flink task
> * Flink Task Manager Slots is 1.
> * Flink Parallelism is 1.
> * Python SDKHarness has 1 thread.
> *Example pipeline:*
> Read -> MapA -> GroupBy -> MapB -> WriteToSink
> With a multi-stage job, Flink schedules the dependent subtasks
> concurrently on the Flink worker as long as it can get slots. Each map
> task is then executed on the SDKHarness.
> It's possible that MapB gets to the SDKHarness before MapA and hence
> gets into the execution queue before MapA. Because we only have 1
> execution thread on the SDKHarness, MapA will never get a chance to
> execute, as MapB will never release the execution thread. MapB will wait
> for input from MapA. This gets us to a deadlock in a simple pipeline.
> Set worker_count in the pipeline options to more than the expected
> number of subtasks in the pipeline.
> 1. We can get the maximum concurrency from the runner and make sure
> that we have more threads than the max concurrency. This approach
> assumes that Beam has insight into the runner's execution plan and
> can make decisions based on it.
> 2. We dynamically create threads and cache them, with a high upper
> bound, in the SDKHarness. We can warn if we are hitting the upper
> bound of threads. This approach assumes that the runner does a good
> job of scheduling and will distribute tasks more or less evenly.
> We expect good scheduling from runners, so I prefer approach 2. It is
> simpler to implement and the implementation is not runner specific. This
> approach utilizes resources better, as it creates only as many threads
> as needed instead of the peak thread requirement.
> Last but not least, it gives the runner control over managing truly
> active tasks.
> Please let me know if I am missing something and your thoughts on the