Re: Small checkpoint data takes too much time


Hi Zhijiang,
Thanks for your response.
I added the checkpointAlignmentTime metric. The data shows that the checkpointDuration is about 150s, while the checkpointAlignmentTime is only about 4s. There is a big gap between them, so alignment alone does not explain the long checkpoint.
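To make the gap concrete, here is a minimal sketch of the arithmetic using the two figures above. The function name and the returned keys are hypothetical; they are not Flink APIs, and the numbers are the approximate values from my metrics.

```python
def checkpoint_breakdown(duration_s, alignment_s):
    """Split an end-to-end checkpoint duration into alignment time and the rest.

    Hypothetical helper for illustration only; not part of Flink.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if alignment_s < 0 or alignment_s > duration_s:
        raise ValueError("alignment must lie between 0 and the duration")
    return {
        "alignment_s": alignment_s,
        # Time that barrier alignment does not account for.
        "other_s": duration_s - alignment_s,
        # Fraction of the checkpoint spent aligning barriers.
        "alignment_share": alignment_s / duration_s,
    }


# Approximate values reported in this message.
breakdown = checkpoint_breakdown(150.0, 4.0)
print(breakdown)
```

Under these numbers, roughly 146s of the checkpoint is spent outside barrier alignment.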

Best
Henry

On Oct 10, 2018, at 1:26 PM, Zhijiang(wangzhijiang999) <wangzhijiang999@xxxxxxxxxx> wrote:

The checkpoint duration includes both barrier alignment and the state snapshot. Every task has to receive the barriers from all of its input channels before it triggers the state snapshot.
I guess the barrier alignment may take a long time in your case; this is especially critical under backpressure. You can check the "checkpointAlignmentTime" metric to confirm.
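As a rough illustration of why backpressure matters here, the following is a simplified model, not Flink code: it assumes that a task's alignment time is the gap between the first and the last barrier arriving on its input channels. The function name and the arrival times are made up for the example.

```python
def alignment_time_ms(barrier_arrivals_ms):
    """Alignment time for one task under a simplified model.

    barrier_arrivals_ms holds, per input channel, the time (in ms) at which
    that channel's barrier arrived. The task can only snapshot once the
    slowest channel has delivered its barrier.
    """
    if not barrier_arrivals_ms:
        raise ValueError("a task needs at least one input channel")
    return max(barrier_arrivals_ms) - min(barrier_arrivals_ms)


# Hypothetical arrival times: all channels deliver at about the same time.
healthy = [100, 120, 110]
# Hypothetical arrival times: one backpressured channel delivers 4s late.
backpressured = [100, 120, 4100]

print(alignment_time_ms(healthy))
print(alignment_time_ms(backpressured))
```

In this model a single slow channel determines the alignment time of the whole task, which is why the metric is worth checking first.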

Best,
Zhijiang
------------------------------------------------------------------
From: 徐涛 <happydexutao@xxxxxxxxx>
Sent: Wednesday, Oct 10, 2018, 13:13
To: user <user@xxxxxxxxxxxxxxxx>
Subject: Small checkpoint data takes too much time

Hi,
 I recently encountered a problem in production: checkpoints take too much time, although this doesn't affect the job execution.
 I am using FsStateBackend, writing the data to an HDFS checkpointDataUri, with asynchronousSnapshots enabled. I printed the metrics “lastCheckpointDuration” and “lastCheckpointSize”. They show that “lastCheckpointSize” is about 80KB, but “lastCheckpointDuration” is about 160s! Because the checkpoint data is small, I think it should not take that long. I do not know why this happens or which conditions may influence the checkpoint time. Has anyone encountered such a problem?
 Thanks a lot.

Best
Henry