[jira] [Created] (FLINK-10286) Flink Persist Invalid Job Graph in Zookeeper


Sayat Satybaldiyev created FLINK-10286:
------------------------------------------

             Summary: Flink Persist Invalid Job Graph in Zookeeper
                 Key: FLINK-10286
                 URL: https://issues.apache.org/jira/browse/FLINK-10286
             Project: Flink
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.6.0
            Reporter: Sayat Satybaldiyev


In HA mode on Flink 1.6, Flink persists the job graph in ZooKeeper even if the job was not accepted by the JobManager (JM). This is particularly bad because if the JM later dies and restarts, it tries to recover the job, fails, and dies completely.
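
To make the failure mode concrete, here is a deliberately simplified model in plain Java. It is not Flink's actual code: the class `JobGraphStoreModel`, its methods, and the in-memory map standing in for the ZooKeeper `jobgraphs` node are all hypothetical, and it assumes the graph is written before JobManager setup is attempted. It only illustrates why a graph left behind after a rejected submission makes a later recovery fail, and how removing the graph on failure would avoid that.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of the reported behavior; NOT Flink's code.
final class JobGraphStoreModel {
    // In-memory stand-in for the ZooKeeper jobgraphs node (jobId -> state backend URI).
    final Map<String, String> persistedGraphs = new HashMap<>();

    // Stand-in for JobManager setup: fails for the unsupported scheme from the report.
    static void setUpJobManager(String stateBackendUri) {
        if (stateBackendUri.startsWith("hddd:")) {
            throw new IllegalStateException(
                "Could not find a file system implementation for scheme 'hddd'");
        }
    }

    // Models the reported behavior: the graph stays persisted even if setup fails.
    boolean submitLeavingGraphBehind(String jobId, String uri) {
        persistedGraphs.put(jobId, uri);
        try {
            setUpJobManager(uri);
            return true;
        } catch (IllegalStateException e) {
            return false; // job rejected, but the entry is still in the store
        }
    }

    // One possible remedy (an assumption, not the actual fix): remove the graph on failure.
    boolean submitWithCleanup(String jobId, String uri) {
        persistedGraphs.put(jobId, uri);
        try {
            setUpJobManager(uri);
            return true;
        } catch (IllegalStateException e) {
            persistedGraphs.remove(jobId);
            return false;
        }
    }

    // Models recovery after a restart: every persisted graph is set up again.
    boolean recoverAll() {
        for (String uri : persistedGraphs.values()) {
            try {
                setUpJobManager(uri);
            } catch (IllegalStateException e) {
                return false; // recovery fails on the graph that was never valid
            }
        }
        return true;
    }
}
```

In this model, `submitLeavingGraphBehind` followed by `recoverAll` fails in the same way the restarted JM does, while `submitWithCleanup` leaves nothing behind to recover.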

 

How to reproduce:

1. Have an HA Flink 1.6 cluster

2. Submit an invalid job. In my case, I set an invalid file system scheme for the RocksDB state backend:

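```
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
env.enableCheckpointing(5000);
// "hddd" is deliberately not a valid file system scheme; this is what makes the job invalid
RocksDBStateBackend backend = new RocksDBStateBackend("hddd:///tmp/flink/rocksdb");
backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);
env.setStateBackend(backend);
```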

Client returns:

```

The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: Could not submit job (JobID: 9680f02ae2f3806c3b4da25bfacd0749)

```

The JM does not accept the job. This is a truncated error log from the JM:

```

Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job.
... 24 more
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager

 

Caused by: java.lang.RuntimeException: Failed to start checkpoint ID counter: Could not find a file system implementation for scheme 'hddd'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded.

 

```
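
The failure in this log comes from the unrecognized URI scheme `hddd`. As a minimal sketch of how a client could catch such a typo before submitting, the plain-Java check below parses the checkpoint URI with `java.net.URI` and compares its scheme against an allow-list. The class `CheckpointUriGuard`, the method `hasKnownScheme`, and the list of schemes are assumptions made for illustration; they are not part of Flink, and the schemes actually available depend on the file systems installed in the cluster.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical client-side guard; not part of Flink's API.
final class CheckpointUriGuard {
    // Assumed allow-list for illustration only.
    private static final Set<String> ASSUMED_SCHEMES =
        new HashSet<>(Arrays.asList("file", "hdfs", "s3"));

    // Returns true if the URI has a scheme that is on the assumed allow-list.
    static boolean hasKnownScheme(String checkpointUri) {
        String scheme = URI.create(checkpointUri).getScheme();
        return scheme != null
            && ASSUMED_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(hasKnownScheme("hddd:///tmp/flink/rocksdb")); // false
        System.out.println(hasKnownScheme("hdfs:///tmp/flink/rocksdb")); // true
    }
}
```

A check like this only guards against typos on the client side; it does not change what the JM persists when a submission fails for any other reason.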

3. Go to ZK and observe that the JM has saved the job graph to ZK:

```
ls /flink/flink_ns/jobgraphs/9680f02ae2f3806c3b4da25bfacd0749
[7f392fd9-cedc-4978-9186-1f54b98eeeb7]
```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)