Or you could write a custom file system for Flink (for the tar part). Unfortunately, a gz file can only be decompressed single-threaded (there are some multi-threaded implementations, but they don't bring a big gain).
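To illustrate the point about gz and threads: a single gz stream has to be read sequentially, but once the small gz entries have been pulled out of the tar stream one by one, each entry can be decompressed independently. The following is only a minimal sketch in plain Java, not a Flink source. It assumes the entries are already available as byte arrays (in practice a single reader, such as the tar stream mentioned in the question, would produce them), and the class and method names (ParallelGzEntries, gunzipAll) are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
public class ParallelGzEntries {

    // Compress a string to gzip bytes; used here only to fabricate sample entries.
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    // Decompress one gzip payload. Within a single payload the work is sequential.
    public static String gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    // Decompress all entries on a fixed thread pool; results keep the input order.
    public static List<String> gunzipAll(List<byte[]> entries, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (byte[] entry : entries) {
                Callable<String> task = () -> gunzip(entry);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that entry is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for the small gz files that a single reader took out of the tar.
        List<byte[]> entries = Arrays.asList(gzip("first file"), gzip("second file"));
        System.out.println(gunzipAll(entries, 2));
    }
}
```

Inside Flink, the thread pool would more likely be replaced by a downstream operator with parallelism greater than one, so that Flink distributes the entries itself instead of user code creating threads.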
On 10. Aug 2018, at 07:07, vino yang <yanghua1127@xxxxxxxxx> wrote:

Hi Averell,

In this case, I think you may need to extend Flink's existing source. First, read your large tar.gz file; once it has been decompressed, use multiple threads to read the records in the source, and then parse the data format (map / flatmap might be suitable operators; you can chain them with the source because these two operators don't require a data shuffle).

Note that Flink doesn't encourage creating extra threads in UDFs, but I don't know if there is a better way for this scenario.

Thanks, vino.

Averell <lvhuyen@xxxxxxxxx> wrote on Fri, Aug 10, 2018 at 12:05 PM:

Hi Fabian, Vino,
I have one more question, for which I initially planned to create a new thread,
but now I think it is better to ask here:
I need to process one big tar.gz file which contains multiple small gz
files. What is the best way to do this? I am thinking of having a
single-threaded process that reads the TarArchiveStream (which has been
decompressed from that tar.gz by Flink automatically), and then distributes the
TarArchiveEntry entries to a multi-threaded operator which would process the
small files in parallel. If this is feasible, which elements from Flink I
Thanks a lot.
Sent from: http://apache-flink-user-