[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Streaming to Parquet Files in HDFS

Hi Kostas,

Thanks for the info. That error caused by I built your code along with not
up-to-date baseline. I rebased my branch build, and there's no more such
I've been testing, and until now have some questions/issues as below:

1. I'm not able to write to S3 with the following URI format: *s3*://<path>,
and had to use *s3a*://<path>. Is this behaviour expected? (I am running
Flink on AWS EMR, and I thought that EMR provides a wrapper for HDFS over S3
with something called EMRFS).

2. Occasionally/randomly I got the below message ( parquet_error1.log
). I'm using ParquetAvroWriters.forReflectRecord() method to write Scala
case classes. Re-running the job doesn't get that error at the same data
location, so I don't think that there's issue with data.
 *java.lang.ArrayIndexOutOfBoundsException: <some random number>* /at

3. Sometimes I got this error message when I use parallelism of 8 for the
sink ( parquet_error2.log
Reducing to 2 solves the issue. But is it possible to increase the pool
size? I could not find any place that I can change the
/fs.s3.maxconnections/ parameter.
/java.io.InterruptedIOException: initiate MultiPartUpload on
org.apache.flink.fs.s3base.shaded.com.amazonaws.SdkClientException: Unable
to execute HTTP request: Timeout waiting for connection from pool/

4. Where is the temporary folder that you store the parquet file before
uploading to S3?

Thanks a lot for your help.

Best regards,

Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/