Thanks for the detailed report.
- In Flink 1.4, jobs are cancelled if the JM loses its connection to ZK and are recovered once the connection is re-established (and one JM becomes leader again).
- Regarding the KafkaProducer: I can't tell from the log message whether Flink closes the KafkaProducer because the job is cancelled or because there is a connectivity issue with the Kafka cluster. I'm including Piotr (cc), who has worked on the KafkaProducer in the past. If it is a connectivity issue, that might also explain why you lost the connection to ZK.
Glad to hear that everything is back to normal. Keep us updated if something unexpected happens again.