Re: Support Hadoop 2.6 for StreamingFileSink
Till is correct that getting rid of the “valid-length” file was a design decision
for the new StreamingFileSink from the beginning. The motivation was that
users reported that it was very cumbersome to use.
In general, when the BucketingSink gets deprecated, I could see a benefit in having a
legacy recoverable stream just in case you are obliged to use an older HDFS version.
But, at least for now, this would be useful only for row-wise encoders, and NOT for
bulk-encoders like Parquet.
The reason is that, for now, bulk encoders roll on every checkpoint, so each part
file is finalized at a checkpoint boundary. This implies that you need neither
truncate nor the valid-length file. Given this, you may only need to write a
recoverable stream that simply does not truncate.
Would you like to try it out and see if it works for your use case?
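To make the two cases concrete, below is a minimal sketch that uses plain java.nio on a local file as a stand-in for HDFS. It is not Flink or Hadoop code: the class NoTruncateRecovery and both method names are hypothetical. verifyLength shows the kind of check that could suffice when files roll on every checkpoint, and truncateByCopy shows the copy-and-replace emulation of truncate discussed further down in this thread for the row-wise case.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper for illustration only; not part of Flink or Hadoop. */
final class NoTruncateRecovery {

    private NoTruncateRecovery() {}

    /**
     * Recovery check that never truncates: the file must already hold
     * exactly the number of bytes recorded at the checkpoint.
     */
    static void verifyLength(Path file, long checkpointedLength) throws IOException {
        long actual = Files.size(file);
        if (actual != checkpointedLength) {
            throw new IOException("Expected " + checkpointedLength
                    + " bytes but found " + actual + " in " + file);
        }
    }

    /**
     * Emulates truncate: copies the first validLength bytes into a sibling
     * temp file, then replaces the original file with that copy.
     */
    static void truncateByCopy(Path file, long validLength) throws IOException {
        if (validLength < 0 || validLength > Files.size(file)) {
            throw new IOException("Invalid length " + validLength + " for " + file);
        }
        Path tmp = file.resolveSibling(file.getFileName() + ".truncated");
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(tmp,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.TRUNCATE_EXISTING,
                     StandardOpenOption.WRITE)) {
            long pos = 0;
            while (pos < validLength) {
                // transferTo may copy fewer bytes than requested, so loop.
                long n = in.transferTo(pos, validLength - pos, out);
                if (n <= 0) {
                    throw new IOException("Unexpected end of file in " + file);
                }
                pos += n;
            }
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

On a real Hadoop 2.6 cluster the same steps would go through the Hadoop FileSystem API instead. This sketch does not address how the replace step behaves on HDFS.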
> On Aug 21, 2018, at 1:58 PM, Artsem Semianenka <artfulonline@xxxxxxxxx> wrote:
> Thanks for the reply, Till!
> By the way, if Flink is going to support compatibility with Hadoop 2.6, I don't see another way to achieve it.
> As I mentioned before, one of the popular distributions, Cloudera, is still based on Hadoop 2.6, and it would be very sad if Flink did not support it.
> I really want to help the Flink community support this legacy version. But currently I see only one way to achieve it: emulate the 'truncate' logic by creating a new file with the needed length and replacing the old one.
> On Tue, 21 Aug 2018 at 14:41, Till Rohrmann <trohrmann@xxxxxxxxxx> wrote:
> Hi Artsem,
> if I recall correctly, we explicitly decided not to support the
> valid-length files with the new StreamingFileSink because they are really
> hard for users to handle. I've pulled Klou into this conversation; he is
> more knowledgeable and can give you a bit more advice.
> On Mon, Aug 20, 2018 at 2:53 PM Artsem Semianenka <artfulonline@xxxxxxxxx> wrote:
> > I have an idea: create a new version of the HadoopRecoverableFsDataOutputStream
> > class (for example, named LegacyHadoopRecoverableFsDataOutputStream :) )
> > that works with valid-length files without invoking truncate, and
> > modify the check in HadoopRecoverableWriter to use
> > LegacyHadoopRecoverableFsDataOutputStream when the Hadoop version is
> > lower than 2.7. I will try to provide a PR soon if there are no objections.
> > I hope I am on the right track.
> > On Mon, 20 Aug 2018 at 14:40, Artsem Semianenka <artfulonline@xxxxxxxxx> wrote:
> > > Hi guys!
> > > I have a question regarding the new StreamingFileSink (introduced in
> > > version 1.6). We use this sink to write data in Parquet format, but I ran
> > > into an issue when trying to run the job on a YARN cluster and save the
> > > result to HDFS. In our case we use the latest Cloudera distribution
> > > (CDH 5.15), which contains HDFS 2.6.0. This version does not support the
> > > truncate method. I would like to create a pull request, but I want to ask
> > > your advice on how best to design this fix and which ideas are behind this
> > > decision. I saw a similar PR for BucketingSink:
> > > https://github.com/apache/flink/pull/6108 . Maybe I could
> > > also add support for valid-length files for older Hadoop versions?
> > >
> > > P.S. Unfortunately, CDH 5.15 (with Hadoop 2.6) is the latest version of the
> > > Cloudera distribution, and we can't upgrade Hadoop to 2.7.
> > >
> > > Best regards,
> > > Artsem
> > >
> > --
> > Best regards,
> > Artsem Semianenka
> Best regards,
> Artsem Semianenka