
Re: Need help regarding memory leak issue

On Fri, Nov 16, 2018 at 3:36 PM Udi Meiri <ehudm@xxxxxxxxxx> wrote:
which uses guppy for heap profiling.
This is a really useful flag. Unfortunately, we are using Beam + Flink. It would be really useful to have a similar flag for other streaming engines.

On Fri, Nov 16, 2018 at 3:08 PM Ruoyun Huang <ruoyun@xxxxxxxxxx> wrote:
Even though the algorithm works on your batch system, did you verify anything that can rule out the possibility that the underlying ML package is causing the memory leak?

If not, maybe replace your prediction with a dummy function that does not load any model at all and always gives the same prediction. Then do the same plotting and let us see what it looks like. As a plus, try a version two: still a dummy prediction, but with the model loaded. Given that we don't have much of a clue at this stage, this can at least give us more confidence about whether the issue comes from the underlying ML package or from the Beam SDK. Just my 2 cents.
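A minimal sketch of the two dummy variants described above, written as plain Python so it can be tried outside the pipeline first. The names `dummy_predict`, `DummyPredictWithModel`, `measure_growth`, and the `load_model` callable are hypothetical and not part of the Beam SDK. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by native code inside an ML library may not show up here.

```python
import gc
import tracemalloc


def dummy_predict(features):
    """Version one: no model is loaded; always return the same prediction."""
    return 0.0


class DummyPredictWithModel:
    """Version two: load the model, but still return a constant prediction."""

    def __init__(self, load_model):
        # load_model is a hypothetical zero-argument callable that returns
        # the real model object; the model is held but never invoked.
        self._model = load_model()

    def __call__(self, features):
        return 0.0


def measure_growth(predict, records, rounds=5):
    """Return the traced Python heap size (bytes) after each round."""
    tracemalloc.start()
    sizes = []
    for _ in range(rounds):
        for record in records:
            predict(record)
        gc.collect()  # drop collectable garbage before measuring
        current, _peak = tracemalloc.get_traced_memory()
        sizes.append(current)
    tracemalloc.stop()
    return sizes


sizes = measure_growth(dummy_predict, [{"x": 1}] * 100, rounds=3)
```

If the sizes stay flat with both dummy variants but climb with the real prediction, that points toward the ML package rather than the SDK; the same functions could then be called from inside the DoFn to repeat the comparison in the pipeline.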

On Thu, Nov 15, 2018 at 4:54 PM Rakesh Kumar <rakeshkumar@xxxxxxxx> wrote:
Thanks for responding Ruoyun,

We are not sure yet what is causing the leak, but once we run out of memory, the SDK worker crashes and the pipeline is forced to restart. Check the memory usage patterns in the attached image. Each line in that graph represents one task manager host.
You are right, we are running the models for predictions.

Here are a few observations:

1. Memory usage climbs over time on all the task managers, but on some of them it climbs really fast because they are running the ML models. These models definitely use memory-intensive data structures (pandas data frames etc.), hence their memory usage climbs really fast.
2. We had almost the same code running on a different (non-streaming) infrastructure, where it doesn't cause any memory issue.
3. Even after the pipeline has restarted, the memory is not fully released. It is still hogged by something. You can see in the attached image that the pipeline restarted around 13:30. At that time it definitely released some portion of the memory, but not all of it. Notice that when the pipeline was originally started, it started at 30% memory usage, but when it was restarted by the job manager it started at 60%.

On Thu, Nov 15, 2018 at 3:31 PM Ruoyun Huang <ruoyun@xxxxxxxxxx> wrote:
Trying to understand the situation you are having.

By saying 'kills the application', do you mean a leak in the application itself, or are the workers the root cause? Also, are you running ML models inside Python SDK DoFns? Then I suppose it is running predictions rather than model training?

On Thu, Nov 15, 2018 at 1:08 PM Rakesh Kumar <rakeshkumar@xxxxxxxx> wrote:
I am using the Beam Python SDK to run my app in production. The app is running machine learning models. I am noticing a memory leak which eventually kills the application. I am not sure of the source of the memory leak. Currently, I am using object graph to dump the memory stats. I hope I will get some useful information out of this. I have also looked into the Guppy library, and they are almost the same.

Do you guys have any recommendations for debugging this issue? Do we have any tooling in the SDK that can help to debug it?
Please feel free to share your experience if you have debugged similar issues in the past.
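As one illustration of dumping memory stats, here is a minimal sketch that uses the standard library's `tracemalloc` instead of object graph or Guppy. It compares two heap snapshots and reports the source lines whose allocations grew the most. The helper `top_growth` and the `leaky` list are hypothetical stand-ins; in a real pipeline the snapshots would be taken some time apart inside the worker process.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return up to `limit` per-line statistics with the largest change."""
    # compare_to returns StatisticDiff objects, largest absolute change first.
    return after.compare_to(before, "lineno")[:limit]


tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for code that retains memory between snapshots (about 1 MB).
leaky = [bytearray(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
growth = top_growth(before, after)
tracemalloc.stop()

for stat in growth:
    print(stat)  # file, line number, size delta, and count delta
```

This only tracks memory allocated through Python's allocator, so growth inside native extensions would need a different tool.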

Thank you,

Ruoyun  Huang
