
Re: Not all files are processed? Stream source with ContinuousFileMonitoringFunction


Which file system are you reading from? If you are reading from S3, this might be caused by S3's eventual consistency property.
Have a look at FLINK-9940 [1] for a more detailed discussion.
There is also an open PR [2] that you could use to patch the source operator.
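
To make the failure mode concrete, here is a minimal sketch. It assumes the monitor only forwards files whose modification time is newer than the largest one it has seen so far. It is a simplified model, not Flink's actual code, and the class and method names (`TimestampMonitorSketch`, `scan`) are made up for illustration. A file that only becomes visible in a later listing but carries an older timestamp is never forwarded.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a timestamp-based directory monitor; NOT Flink's actual code. */
class TimestampMonitorSketch {
    // Largest modification time forwarded so far (assumed mechanism).
    private long maxSeenModificationTime = Long.MIN_VALUE;

    /** One listing pass: paths[i] has modification time modTimes[i]. Returns the forwarded paths. */
    List<String> scan(String[] paths, long[] modTimes) {
        List<String> forwarded = new ArrayList<>();
        long newMax = maxSeenModificationTime;
        for (int i = 0; i < paths.length; i++) {
            // Only files strictly newer than the previous maximum are forwarded.
            if (modTimes[i] > maxSeenModificationTime) {
                forwarded.add(paths[i]);
                newMax = Math.max(newMax, modTimes[i]);
            }
        }
        maxSeenModificationTime = newMax;
        return forwarded;
    }

    public static void main(String[] args) {
        TimestampMonitorSketch monitor = new TimestampMonitorSketch();
        // First listing: only "b" (written at t=200) is visible yet.
        System.out.println(monitor.scan(new String[] {"b"}, new long[] {200L}));
        // Second listing: "a" (written at t=100) shows up late and is skipped.
        System.out.println(monitor.scan(new String[] {"a", "b"}, new long[] {100L, 200L}));
    }
}
```

In this model the second listing forwards nothing, because the late file "a" is older than the maximum already recorded.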

Best, Fabian

On Fri, Oct 12, 2018 at 20:41, Juan Miguel Cejuela <juanmi@xxxxxxxxxx> wrote:
Dear flinksters,

I'm using the class `ContinuousFileMonitoringFunction` as a source to monitor a folder for new incoming files. My problem is that not all of the files sent to the folder get processed / triggered by the function. Specifically, I send up to 1k files per minute, and I consume the stream as an `AsyncDataStream`.

I raised an unrelated issue with the `ContinuousFileMonitoringFunction` class some time ago (https://issues.apache.org/jira/browse/FLINK-8046): if two or more files shared the very same timestamp, only the first one (non-deterministically chosen) would be processed. I patched the file myself to fix that problem, using a LinkedHashMap to remember which files have already been processed. My patch is working fine as far as I can tell.
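
As a rough illustration of that idea, here is a minimal sketch. It is not the actual patch and not Flink code: the class `SeenFilesTracker`, its method names, and the eviction limit are all hypothetical. Files are remembered by path in a bounded LinkedHashMap, so two files with the same timestamp are both accepted, while a file that was already seen is not processed twice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, NOT part of Flink: remembers files by path instead of by timestamp alone. */
class SeenFilesTracker {
    private final Map<String, Long> seen;

    SeenFilesTracker(final int maxEntries) {
        // Insertion-ordered map that evicts its oldest entry once it grows past maxEntries,
        // so the memory used for bookkeeping stays bounded (the limit is an assumption).
        this.seen = new LinkedHashMap<String, Long>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns true if the path is new, or if it was modified after it was last recorded. */
    boolean shouldProcess(String path, long modificationTime) {
        Long previous = seen.get(path);
        if (previous != null && previous >= modificationTime) {
            return false; // already handled this version of the file
        }
        seen.put(path, modificationTime);
        return true;
    }

    public static void main(String[] args) {
        SeenFilesTracker tracker = new SeenFilesTracker(10_000);
        // Two different files with an identical timestamp are both accepted.
        System.out.println(tracker.shouldProcess("in/a.txt", 1000L));
        System.out.println(tracker.shouldProcess("in/b.txt", 1000L));
        // Seeing the same file again with the same timestamp is rejected.
        System.out.println(tracker.shouldProcess("in/a.txt", 1000L));
    }
}
```

One trade-off of bounding the map: once an entry has been evicted, a file that is still present in the folder would be treated as new again, so the limit has to be larger than the number of files that can be listed at once.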

The problem seems to be rather that some files (when many are sent at once to the same folder) do not even get triggered/activated/registered by the class.

Am I properly explaining my problem?

Any hints to solve this challenge would be greatly appreciated! ❤ THANK YOU

Juanmi, CEO and co-founder @ 🍃tagtog.net

    Follow tagtog updates on 🐦 Twitter: @tagtog_net