Hi Flink developers,
We're running some new DataStream jobs on Flink 1.7.0 using
the shaded Hadoop S3 file system, and we're running into
frequent errors when saving checkpoints and savepoints to S3.
I'm not sure what the underlying cause of the errors is, but
we often fail with the following stack trace, which appears to
be due to a missing javax.xml.bind.DatatypeConverterImpl class
in an error-handling path of AmazonS3Client:
    java.lang.NoClassDefFoundError: Could not initialize class
    javax.xml.bind.DatatypeConverterImpl
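(In case it helps reproduce this outside Flink: the snippet
below is our own guess at a minimal trigger, not code taken
from the SDK. javax.xml.bind.DatatypeConverter lazily
instantiates DatatypeConverterImpl on first use, which we
believe is what the AmazonS3Client error path runs into.)

    import javax.xml.bind.DatatypeConverter;

    public class JaxbProbe {
        public static void main(String[] args) {
            // First use of DatatypeConverter lazily instantiates a
            // javax.xml.bind.DatatypeConverterImpl; if that class
            // can't be initialized, this should throw the same
            // NoClassDefFoundError.
            System.out.println(
                DatatypeConverter.printBase64Binary("hello".getBytes()));
        }
    }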
For reference, we're running Flink from the "Apache 1.7.0
Flink only Scala 2.11" binary tgz, we've copied
flink-s3-fs-hadoop-1.7.0.jar from opt/ to lib/, we're not
defining HADOOP_CLASSPATH, and we're running Java 8 (openjdk
version "1.8.0_191") on Ubuntu 18.04 x86_64.
Presumably there are two issues here: 1) some intermittent
error talking to S3, and 2) some classpath / class-loading
issue with javax.xml.bind.DatatypeConverterImpl that's
preventing the original error from being displayed. I'm more
curious about the latter issue.
This is super puzzling since
javax/xml/bind/DatatypeConverterImpl.class is included in our
rt.jar, and lsof confirms we're reading that rt.jar, so I
suspect it's something tricky with custom class loaders or the
way the shaded S3 jar works. Note that this class is not
included in flink-s3-fs-hadoop-1.7.0.jar (which we are using),
but it is included in flink-shaded-hadoop2-uber-1.7.0.jar
(which we are not using).
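If anyone wants to check class visibility in their own setup,
this is roughly the probe we'd run (a sketch; run it with the
same classpath as the task manager, or from inside a job so it
goes through the user-code class loader):

    public class LoaderProbe {
        public static void main(String[] args) throws Exception {
            ClassLoader cl = Thread.currentThread().getContextClassLoader();
            // Where (if anywhere) this class loader sees the class file:
            System.out.println(
                cl.getResource("javax/xml/bind/DatatypeConverterImpl.class"));
            Class<?> c = Class.forName("javax.xml.bind.DatatypeConverterImpl");
            // A null CodeSource means the class came from the
            // bootstrap class path (i.e. rt.jar on Java 8).
            System.out.println(c.getProtectionDomain().getCodeSource());
        }
    }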
Another thing that jumped out at us: Flink 1.7 can now be
built with JDK 9, but Java 9 deprecates the javax.xml.bind
classes and moves them into the java.xml.bind module, which is
no longer resolved by default and has to be enabled explicitly
(e.g., with --add-modules java.xml.bind). We also saw that
direct references to javax.xml.bind were removed from
flink-core for 1.7.
Some things we tried, without success:
- Building Flink from source on a machine with Java 8
installed. We still got the NoClassDefFoundError.
- Using the binary version of Flink on machines with Java
9 installed. We get a NullPointerException in
- Downloading the jaxb-api jar, which has
javax/xml/bind/DatatypeConverterImpl.class (a quick way to
double-check which jars ship the class is sketched after this
list), and setting HADOOP_CLASSPATH to include that jar. We
still got the same NoClassDefFoundError.
- Using iptables to completely block S3 traffic, hoping
this would make the failure easier to reproduce. Those
connection errors are displayed properly, so they must go
down a different error-handling path.
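For reference, the jar check mentioned above is nothing
fancier than this (a sketch; the jar path is whichever jar you
want to inspect):

    import java.util.jar.JarFile;

    public class JarCheck {
        public static void main(String[] args) throws Exception {
            // args[0] = path to a jar, e.g. the downloaded jaxb-api
            // jar or flink-s3-fs-hadoop-1.7.0.jar
            try (JarFile jar = new JarFile(args[0])) {
                boolean found = jar.getEntry(
                    "javax/xml/bind/DatatypeConverterImpl.class") != null;
                System.out.println(found ? "class present" : "class absent");
            }
        }
    }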
Would love to hear any ideas about what might be happening,
or further things we can try.