[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: S3 for state backend in Flink 1.4.0


We are using Object Storage for checkpointing. I'd like to point out that we were seeing performance problems using the S3 protocol. Btw, we had quite a few problems using the flink-s3-fs-hadoop jar with Object Storage and had to do some ugly hacking to get it working all over. We recently discovered an alternative connector developed by IBM Research called stocator. It's a streaming writer and performs better than using the S3 protocol.

Here is a link to the library - https://github.com/SparkTC/stocator, and a blog explaining about it - http://www.spark.tc/stocator-the-fast-lane-connecting-object-stores-to-spark/

Good luck!!

-----Original Message-----
From: Edward Rojas [mailto:edward.rojascl@xxxxxxxxx] 
Sent: Wednesday, January 31, 2018 3:02 PM
To: user@xxxxxxxxxxxxxxxx
Subject: RE: S3 for state backend in Flink 1.4.0


We are having a similar problem when trying to use Flink 1.4.0 with IBM Object Storage for reading and writing data. 

We followed
and the suggestion on https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D851&d=DwICAg&c=j-EkbjBYwkAB4f8ZbVn1Fw&r=g-5xYRH8L3aCnCNTROw5LrsB5gbTayWjXSm6Nil9x0c&m=jSLW4Ugl1FvEta-R0h_thQVZ6tQ2LsUX10cRoIWNNkk&s=bDXNhnIV4KFTK9Byg5w2R_8UlWiXH05uAp9rkWJm_jo&e=.

We put the flink-s3-fs-hadoop jar from the opt/ folder to the lib/ folder and we added the configuration on the flink-config.yaml:

s3.access-key: <ACCESS_KEY>
s3.secret-key: <SECRET_KEY>
s3.endpoint: s3.us-south.objectstorage.softlayer.net 

With this we can read from IBM Object Storage without any problem when using env.readTextFile("s3://flink-test/flink-test.txt");

But we are having problems when trying to write. 
We are using a kafka consumer to read from the bus, we're making some processing and after saving  some data on Object Storage.

When using stream.writeAsText("s3://flink-test/data.txt").setParallelism(1);
The file is created but only when the job finish (or we stop it). But we need to save the data without stopping the job, so we are trying to use a Sink.

But when using a BucketingSink, we get the error: 
java.io.IOException: No FileSystem for scheme: s3 at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2798)

Do you have any idea how could we make it work using Sink?



Sent from: https://urldefense.proofpoint.com/v2/url?u=http-3A__apache-2Dflink-2Duser-2Dmailing-2Dlist-2Darchive.2336050.n4.nabble.com_&d=DwICAg&c=j-EkbjBYwkAB4f8ZbVn1Fw&r=g-5xYRH8L3aCnCNTROw5LrsB5gbTayWjXSm6Nil9x0c&m=jSLW4Ugl1FvEta-R0h_thQVZ6tQ2LsUX10cRoIWNNkk&s=vN9sFldnlnzHZPgOBi42Rwfq1Hbq79gUPUNLgi0zmSM&e=