I just stumbled on this same problem without any associated ZK issues. We had a Kafka broker fail that caused this issue:
2018-07-18 02:48:13,497 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Sink: Produce: <output_topic_name> (2/4) (7e7d61b286d90c51bbd20a15796633f2) switched from RUNNING to FAILED.
java.lang.Exception: Failed to send data to Kafka: The server disconnected before a response was received.
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.checkErroneous(FlinkKafkaProducerBase.java:373)
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010.invoke(FlinkKafkaProducer010.java:288)
    at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
    at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:207)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
This is the kind of error we should be robust to - the Kafka cluster will (reasonably quickly) recover and elect a new leader for the affected partition (in this case, partition #2). Maybe retries should be the default configuration? I believe the client just uses the Kafka producer defaults (which include retries=0), but we typically run with acks=1 (or all) and retries=Integer.MAX_VALUE. Do I need to do anything more than that to get a more robust producer?
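For reference, this is roughly how we build the producer configuration described above - a sketch, not a definitive fix. The property keys (`acks`, `retries`, `max.in.flight.requests.per.connection`) are standard Kafka producer configs; the helper class name and broker address are made up for illustration, and I'm assuming these `Properties` are then handed to the `FlinkKafkaProducer010` constructor:

```java
import java.util.Properties;

public class ResilientProducerConfig {

    // Builds Kafka producer Properties that tolerate a transient
    // broker disconnect instead of failing the Flink sink.
    public static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Wait for the partition leader to acknowledge the write
        // (use "all" to wait for the full in-sync replica set).
        props.setProperty("acks", "1");
        // Retry transient send failures (e.g. NetworkException on
        // broker failover) rather than surfacing them immediately.
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        // With retries enabled, cap in-flight requests at 1 so that
        // a retried batch cannot reorder records within a partition.
        props.setProperty("max.in.flight.requests.per.connection", "1");
        return props;
    }
}
```

These props would then be passed to the sink, e.g. `new FlinkKafkaProducer010<>(topic, schema, ResilientProducerConfig.build("broker:9092"))` - but whether retries alone are sufficient (vs. also needing checkpoint-based flushing) is exactly my question.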