In 1.5 the latency metric was changed to be reported on the job level,
which is why you now see it under /jobs/.../metrics but not in 1.4.
In 1.4 you would see something similar under
/jobs/.../vertices/.../metrics, for each vertex.
Additionally, it is now a proper histogram, which significantly increases
the number of accesses to the ConcurrentHashMaps that store metrics for
the UI. It could be that this code is just too slow for the amount of
metrics being reported.
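If the volume of latency metrics does turn out to be the bottleneck, one
possible mitigation is to reduce or disable latency tracking on the job
itself. A minimal sketch of the plain Flink Java API (the class name and
pipeline are made up for illustration; for a Beam pipeline the interval
would be set through the Flink runner's pipeline options instead):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DisableLatencyTracking {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // A non-positive interval disables the latency markers, so no
        // per-source latency histograms are registered for the UI
        // (the exact semantics of the value can differ between versions).
        env.getConfig().setLatencyTrackingInterval(0L);

        // Build and execute the job as usual.
        env.fromElements(1, 2, 3).print();
        env.execute("latency-tracking-disabled");
    }
}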
On 23.08.2018 19:06, Jozef Vilcek wrote:
Parallelism is 100. I tried clusters with 1 and 2 slots per TM, yielding
100 or 50 TMs in the cluster.
I did notice that the URL http://jobmanager:port/jobs/job_id/metrics now
returns a huge list of "latency.source_id. ...." IDs. A heap dump shows that
the hash map takes 1.6GB for me. I am guessing that is the one the dispatcher
threads keep updating. Not sure what those entries are. In 1.4.0 that URL
returned something else, a very short list.
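A quick way to see how many of those latency IDs the endpoint reports is
to fetch the list and count. A rough sketch (host, port, and job id are
placeholders, and the string matching is deliberately crude rather than
proper JSON parsing):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class CountLatencyMetricIds {
    public static void main(String[] args) throws Exception {
        // Substitute the real JobManager host/port and job id.
        URL url = new URL("http://jobmanager:8081/jobs/job_id/metrics");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }
        // The endpoint returns a JSON array of {"id":"..."} entries;
        // count the ones whose id starts with "latency.".
        int count = body.toString().split("\"id\":\"latency\\.", -1).length - 1;
        System.out.println("latency metric ids: " + count);
    }
}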
On Thu, Aug 23, 2018 at 6:44 PM Piotr Nowojski <piotr@xxxxxxxxxxxxxxxxx> wrote:
How many task slots do you have in the cluster and per machine, and what
parallelism are you using?
On 23 Aug 2018, at 16:21, Jozef Vilcek <jozo.vilcek@xxxxxxxxx> wrote:
Yes, with smaller data and therefore smaller resources and parallelism,
the exact same job runs fine.
On Thu, Aug 23, 2018, 16:11 Aljoscha Krettek <aljoscha@xxxxxxxxxx> wrote:
So with Flink 1.5.3 but a smaller parallelism the job works fine?
On 23. Aug 2018, at 15:25, Jozef Vilcek <jozo.vilcek@xxxxxxxxx> wrote:
I am trying to get my Beam application running on a newer version of Flink
(1.5.3), but I am having trouble with that. When I submit the application,
it works fine, but after a few mins (as soon as 2 minutes after job
submission) the cluster just goes bad. Logs are full of timeouts for
heartbeats, lost leadership, TaskExecutor timed out, etc.
At that time, the WebUI is also not usable. Looking into the job manager, I
noticed that all of the "flink-akka.actor.default-dispatcher" threads are
busy or blocked. Most blocks are on metrics:
java.lang.Thread.State: BLOCKED (on object monitor)
- waiting to lock <0x000000053df75510> (a
I tried to increase memory, as MetricStore seems to hold quite a lot of
data, but it is not helping. On 1.4.0 the job manager was running with 4GB;
here, this behaviour also occurs with 10GB.
P.S.: The executed Beam app hits this problem in a setup with 100 parallelism,
slots, 2100 running tasks, streaming mode. A smaller job runs without problems.
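As a side note on the heartbeat timeouts mentioned above: with the 1.5
runtime, heartbeats between the JobManager and TaskExecutors are governed
by two flink-conf.yaml keys. Raising the timeout does not fix a blocked
dispatcher, but it can help distinguish a congested JobManager from
genuinely dead workers. A sketch with illustrative values, not
recommendations:

# flink-conf.yaml (defaults: 10000 ms interval, 50000 ms timeout)
heartbeat.interval: 10000
heartbeat.timeout: 120000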