
Re: Small-files source - partitioning based on prefix of file

Hi Averell,

One comment regarding what you said:

> As my files are small, I think there would not be much benefit in checkpointing file offset state.

Checkpointing is not about efficiency but about consistency.
If the position in a split is not checkpointed, your application won't operate with exactly-once state consistency unless each split produces exactly one record.
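To make this concrete, here is a minimal sketch in plain Java (not Flink code) of what happens to a downstream counter after a failure. It assumes a simplified model: a checkpoint is taken after `checkpointAt` records of one split, the job fails right after that checkpoint, and completed splits are never re-read. The class and method names are hypothetical and only serve the illustration.

```java
import java.util.List;

final class SplitReplaySketch {

    /**
     * Returns the final record count for one split when a checkpoint is taken
     * after `checkpointAt` records and the job fails immediately afterwards.
     * Simplifying assumption: a split that was fully read before the checkpoint
     * is remembered as finished and is not read again.
     */
    static int countAfterRecovery(List<String> split, int checkpointAt, boolean offsetInCheckpoint) {
        // The counter value at checkpoint time is part of the snapshot.
        int checkpointedCount = checkpointAt;

        // Without a checkpointed offset, an unfinished split restarts from record 0.
        boolean splitFinished = checkpointAt == split.size();
        int resumeFrom = (offsetInCheckpoint || splitFinished) ? checkpointAt : 0;

        // Recovery: restore the counter, then continue reading from resumeFrom.
        int count = checkpointedCount;
        for (int i = resumeFrom; i < split.size(); i++) {
            count++;
        }
        return count;
    }
}
```

In this model, a three-record split with a checkpoint after two records ends at a count of 3 when the offset is checkpointed and at 5 when it is not, because the first two records are counted twice. A one-record split ends at 1 either way, which is the exception mentioned above.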

Best, Fabian

2018-08-10 9:10 GMT+02:00 Jörn Franke <jornfranke@xxxxxxxxx>:
Or you could write a custom file system for Flink (for the tar part).
Unfortunately, gz files can only be decompressed single-threaded (there are some multi-threaded implementations, but they don't bring a big gain).

On 10. Aug 2018, at 07:07, vino yang <yanghua1127@xxxxxxxxx> wrote:

Hi Averell,

In this case, I think you may need to extend one of Flink's existing sources.
First, read your large tar.gz file; once it has been decompressed, use multiple threads to read the records in the source, and then parse the data format (map / flatMap might be suitable operators; you can chain them with the source because these two operators don't require a data shuffle).

Note that Flink doesn't encourage creating extra threads in UDFs, but I don't know if there is a better way for this scenario.
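As a rough illustration of the "one reader, several workers" idea, here is a sketch in plain Java using only the standard library. It is not Flink code and does not parse tar: the map of entry names to gzip-compressed bytes is a hypothetical stand-in for the entries a single thread would read sequentially from the archive. All class and method names are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class ParallelGunzipSketch {

    /** Helper for building test data: gzip-compresses a UTF-8 string. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decompresses one small gz payload; each call is inherently sequential. */
    static String gunzip(byte[] compressed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * The calling thread walks the entries in order (the single reader) and
     * hands each payload to a fixed-size pool (the parallel workers).
     * Results are returned in the original entry order.
     */
    static Map<String, String> decompressAll(Map<String, byte[]> entries, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, byte[]> entry : entries.entrySet()) {
                byte[] payload = entry.getValue();
                Callable<String> task = () -> gunzip(payload);
                pending.put(entry.getKey(), pool.submit(task));
            }
            Map<String, String> result = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> p : pending.entrySet()) {
                result.put(p.getKey(), p.getValue().get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Inside a Flink job the parallel part would more naturally be a downstream operator with parallelism greater than one rather than a hand-made thread pool, given the caveat about extra threads in UDFs; the sketch only shows the division of work.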

Thanks, vino.

On Fri, Aug 10, 2018 at 12:05 PM, Averell <lvhuyen@xxxxxxxxx> wrote:
Hi Fabian, Vino,

I have one more question. I initially planned to start a new thread for it,
but now I think it is better to ask here:
I need to process one big tar.gz file which contains multiple small gz
files. What is the best way to do this? I am thinking of having one
single-threaded process that reads the TarArchiveStream (which has been
decompressed from that tar.gz by Flink automatically) and then distributes the
TarArchiveEntry entries to a multi-threaded operator which would process the
small files in parallel. If this is feasible, which elements of Flink can I
reuse?

Thanks a lot.

Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/