


Re: Flink leaves a lot of RocksDB sst files in tmp directory


Hi,

Could you show us what is inside one of those directory instances? Furthermore, your TM logs show multiple OutOfMemoryErrors, so that might also be part of the problem. Also, how was the job moved? If a TM is killed, it obviously cannot clean up after itself. That is why the data goes to the tmp dir, so that the OS can eventually take care of it; in container environments this dir should always be cleaned anyway.
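To see what is actually left behind, something like the following lists the candidate directories without deleting anything. This is only a sketch: `/tmp` is Flink's default temp location and `flink-io-*` is the prefix seen in the logs below; adjust both if your setup differs.

```shell
#!/bin/sh
# List leftover Flink RocksDB instance directories under /tmp.
# 'flink-io-*' matches the directory names shown in the TM log below;
# -maxdepth 1 keeps the search at the top level of /tmp.
find /tmp -maxdepth 1 -type d -name 'flink-io-*'
```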

Best,
Stefan

On 11. Oct 2018, at 10:15, Sayat Satybaldiyev <sayatez@xxxxxxxxx> wrote:

Thank you Piotr for the reply! We didn't run this job on a previous version of Flink. Unfortunately, I don't have a log file from the JM, only TM logs.


On Wed, Oct 10, 2018 at 10:08 AM Piotr Nowojski <piotr@xxxxxxxxxxxxxxxxx> wrote:
Hi,

Was this happening in an older Flink version? Could you post the circumstances in which the job was moved to a new TM (full job manager and task manager logs would be helpful)? I’m suspecting that those leftover files might have something to do with local recovery.

Piotrek 
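If local recovery does turn out to be the cause, it can be switched off explicitly, or its working directory can be moved off /tmp onto a dedicated volume. A minimal flink-conf.yaml sketch (both keys exist in Flink 1.6; the path is an example, not a recommendation):

```yaml
# Disable task-local recovery so no secondary state copies are kept on the TM
state.backend.local-recovery: false

# Or, if local recovery is wanted, keep its files on a dedicated volume
# instead of /tmp (example path; adjust to your deployment)
taskmanager.state.local.root-dirs: /data/flink/local-recovery
```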

On 9 Oct 2018, at 15:28, Sayat Satybaldiyev <sayatez@xxxxxxxxx> wrote:

After digging more into the log, I think it's more likely a bug. I've grepped the log by job id and found that under normal circumstances the TM is supposed to delete the flink-io files. For some reason, it doesn't delete the files that were listed above.

2018-10-08 22:10:25,865 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_92266bd138cd7d51ac7a63beeb86d5f5__1_1__uuid_bf69685b-78d3-431c-88be-b3f26db05566.
2018-10-08 22:10:25,867 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_14630a50145935222dbee3f1bcfdc2a6__1_1__uuid_47cd6e95-144a-4c52-a905-52966a5e9381.
2018-10-08 22:10:25,874 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_7185aa35d035b12c70cf490077378540__1_1__uuid_7c539a96-a247-4299-b1a0-01df713c3c34.
2018-10-08 22:17:38,680 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Close JobManager connection for job a5b223c7aee89845f9aed24012e46b7e.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
2018-10-08 22:17:38,686 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_7185aa35d035b12c70cf490077378540__1_1__uuid_2e88c56a-2fc2-41f2-a1b9-3b0594f660fb.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
2018-10-08 22:17:38,691 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_92266bd138cd7d51ac7a63beeb86d5f5__1_1__uuid_b44aecb7-ba16-4aa4-b709-31dae7f58de9.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
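Grepping the TM log by job id, as done above, can be sketched like this. The log path is an assumption (set via the TM_LOG variable); point it at the actual TaskManager log file of your deployment.

```shell
#!/bin/sh
# Extract every TaskManager log line mentioning a single job id.
# TM_LOG defaults to an illustrative path; override it for a real log file.
JOB_ID=a5b223c7aee89845f9aed24012e46b7e
TM_LOG=${TM_LOG:-/tmp/taskmanager.log}
grep "$JOB_ID" "$TM_LOG"
```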


On Tue, Oct 9, 2018 at 2:33 PM Sayat Satybaldiyev <sayatez@xxxxxxxxx> wrote:
Dear all,

While running Flink 1.6.1 with RocksDB as a backend and HDFS as the checkpoint FS, I've noticed that after a job has moved to a different host, it leaves quite a lot of state in the temp folder (1.2TB in total). The files are not used, as the TM is no longer running the job on that host.
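The total size of the leftovers can be measured with a quick du over the instance directories. This is a sketch assuming the default /tmp location and the flink-io-* prefix:

```shell
#!/bin/sh
# Sum up the disk usage of leftover Flink instance directories.
# -s: per-directory totals, -c: grand total, -h: human-readable sizes;
# tail -n 1 keeps only the "total" line.
du -sch /tmp/flink-io-*/ 2>/dev/null | tail -n 1
```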

The job a5b223c7aee89845f9aed24012e46b7e had been running on the host, but then it was moved to a different TM. I'm wondering whether this is intended behavior or a possible bug?

I've attached a screenshot of the files that are left behind and not used by any job.