Re: Streaming to Parquet Files in HDFS
Thanks for the info. That error was caused by building your code against an
out-of-date baseline. After rebasing my branch and rebuilding, the error is gone.
I've been testing since then and have some questions/issues, as below:
1. I'm not able to write to S3 with the URI format *s3*://<path>, and had to
use *s3a*://<path> instead. Is this behaviour expected? (I am running Flink on
AWS EMR, and I thought EMR provided an HDFS-style wrapper over S3 called
EMRFS.)
2. Occasionally/randomly I get the message below ( parquet_error1.log
). I'm using the ParquetAvroWriters.forReflectRecord() method to write Scala
case classes. Re-running the job doesn't hit the error at the same point in
the data, so I don't think there's an issue with the data itself.
*java.lang.ArrayIndexOutOfBoundsException: <some random number>* /at
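For reference, my sink setup looks roughly like the sketch below (the case
class and bucket name are illustrative placeholders, not my real job):

```scala
import org.apache.flink.core.fs.Path
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink

// Hypothetical stand-in for my real Scala case class
case class Event(id: String, ts: Long)

// Bulk Parquet sink via the reflective Avro writer; the s3:// scheme
// fails for me on EMR, while s3a:// works
val sink: StreamingFileSink[Event] = StreamingFileSink
  .forBulkFormat(
    new Path("s3a://my-bucket/output"),
    ParquetAvroWriters.forReflectRecord(classOf[Event]))
  .build()
```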
3. Sometimes I get the error message below when I use a parallelism of 8 for
the sink ( parquet_error2.log
). Reducing it to 2 solves the issue, but is it possible to increase the pool
size instead? I could not find any place where I can change it.
/java.io.InterruptedIOException: initiate MultiPartUpload on
to execute HTTP request: Timeout waiting for connection from pool/
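In case it helps: the only related knob I've found so far is Hadoop's S3A
connection-pool setting below. I'm assuming it would go in core-site.xml on
EMR, but I haven't confirmed that Flink's S3A filesystem actually picks it up:

```xml
<!-- Assumption on my side: fs.s3a.connection.maximum caps the S3A HTTP
     connection pool that "Timeout waiting for connection from pool" refers
     to; the value 64 is just an illustrative guess, not a recommendation. -->
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>64</value>
</property>
```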
4. Where is the temporary folder in which you store the Parquet files before
uploading them to S3?
Thanks a lot for your help.