
Re: Need help regarding memory leak issue

On Fri, Nov 16, 2018 at 3:08 PM Ruoyun Huang <ruoyun@xxxxxxxxxx> wrote:
Even though the algorithm works on your batch system, did you verify anything that can rule out the possibility that the underlying ML package is causing the memory leak?
It is possible that the ML packages are causing the memory leak, but when we called the model function in a for loop for 1,000 iterations, we didn't notice any memory increase. Since the leak is small and grows over several hours, I am thinking of running the model 10,000 times and observing the reference counts and memory profile after every thousand runs. I hope this will give some hint.
We have also tried to explicitly delete the model object, input object, and output object once their usage is done in the operator method. After deleting these objects we also called `gc.collect()`. But we still notice that the memory usage is increasing over time. We also tried to log the most common object counts, but the number of references is almost constant for each iteration. We are going to log the memory size of these objects to get more information.
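As a concrete example, the per-iteration type-count logging described above can be sketched with just the standard library (mimicking what objgraph's most-common-types output reports; the function name here is ours):

```python
import gc
from collections import Counter

def top_object_counts():
    """Snapshot live-object counts by type name.

    Comparing snapshots taken after each batch of iterations
    highlights types whose counts only ever grow -- a typical
    sign of a Python-level leak."""
    gc.collect()  # drop collectable garbage so counts reflect real survivors
    return Counter(type(o).__name__ for o in gc.get_objects())

# Diff two snapshots around a simulated leak to see which types grew.
before = top_object_counts()
leaked = [[i] for i in range(100)]  # stand-in for objects a model might retain
after = top_object_counts()
growth = {name: after[name] - before[name]
          for name in after if after[name] > before[name]}
print(growth.get("list", 0))
```

If the same type names keep appearing in `growth` run after run, those are the objects to chase with objgraph's back-reference tools.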

If not, maybe replace your prediction with a dummy function which does not load any model at all and always gives the same prediction. Then do the same plotting and let us see what it looks like. As a plus, a version two: still a dummy prediction, but with the model loaded. Given we don't have much of a clue at this stage, this at least can give us more confidence about whether the issue comes from the underlying ML package or from the Beam SDK. Just my 2 cents.
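A minimal, self-contained sketch of that experiment using the standard library's tracemalloc (here `dummy_predict` stands in for the suggested no-model version; the real variant would call the loaded model instead):

```python
import tracemalloc

def dummy_predict(record):
    # Version one of the suggested experiment: no model loaded,
    # always the same answer.
    return 0.0

def net_allocation_growth(predict, iterations=1000):
    """Drive `predict` in a loop and report net traced allocation
    growth in bytes, mimicking the per-element loop of the job."""
    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()
    for i in range(iterations):
        predict({"feature": i})
    diff = tracemalloc.take_snapshot().compare_to(baseline, "lineno")
    tracemalloc.stop()
    return sum(stat.size_diff for stat in diff)

# A near-zero result for the dummy exonerates the harness; a steadily
# positive result for the real model would point at the ML package.
flat = net_allocation_growth(dummy_predict)
print(flat)
```

Running the same harness for the dummy, the dummy-with-model-loaded, and the real prediction gives the three curves the suggestion asks for.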

We have tried one version that allocates a huge block of memory in an operator method. We observe that the memory usage oscillates but does not increase over time. We will try your suggested ideas and report back here. When the Beam and model methods are run in isolation they don't show an increase in memory consumption, so we suspect it is the interaction between them that is causing the issue.

On Thu, Nov 15, 2018 at 4:54 PM Rakesh Kumar <rakeshkumar@xxxxxxxx> wrote:
Thanks for responding Ruoyun,

We are not sure yet what is causing the leak, but once we run out of memory the SDK worker crashes and the pipeline is forced to restart. Check the memory usage patterns in the attached image. Each line in that graph represents one task manager host.
You are right; we are running the models for prediction.

Here are a few observations:

1. All the task managers' memory usage climbs over time, but some climb really fast because they are running the ML models. These models definitely use memory-intensive data structures (pandas DataFrames, etc.), hence their memory usage climbs quickly.
2. We have almost the same code running on different (non-streaming) infrastructure, and it doesn't cause any memory issue.
3. Even after the pipeline restarted, the memory was not fully released; it is still hogged by something. You can see in the attached image that the pipeline restarted around 13:30. At that time some portion of the memory was released, but not all of it. Notice that when the pipeline originally started it used about 30% of the memory, but when it was restarted by the job manager it started at 60%.
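For context, the per-host curves in a graph like the attached one can be sampled from inside the worker with the standard library alone (Unix-only `resource` module; the helper name is ours, and `ru_maxrss` is reported in kilobytes on Linux, bytes on macOS):

```python
import resource
import time

def log_peak_rss(tag=""):
    """Print and return the process's peak resident set size.
    On Linux, ru_maxrss is reported in kilobytes."""
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{time.strftime('%H:%M:%S')} {tag} peak_rss={peak_kb / 1024:.1f} MiB")
    return peak_kb

# Sampling at the start of every bundle and plotting over several
# hours yields a climb like the one in the attached image.
first = log_peak_rss("before")
ballast = bytearray(50 * 1024 * 1024)  # simulate a memory-hungry model load
second = log_peak_rss("after")
```

Because `ru_maxrss` is a high-water mark, it never decreases within one process; comparing it against current usage from `/proc/self/status` can separate retained memory from transient spikes.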

On Thu, Nov 15, 2018 at 3:31 PM Ruoyun Huang <ruoyun@xxxxxxxxxx> wrote:
Trying to understand the situation you are having.

By saying 'kills the application', is the leak in the application itself, or are the workers the root cause? Also, are you running ML models inside Python SDK DoFns? Then I suppose you are running predictions rather than model training?

On Thu, Nov 15, 2018 at 1:08 PM Rakesh Kumar <rakeshkumar@xxxxxxxx> wrote:
I am using the Beam Python SDK to run my app in production. The app runs machine learning models. I am noticing a memory leak which eventually kills the application, and I am not sure of its source. Currently, I am using objgraph to dump the memory stats; I hope I will get some useful information out of this. I have also looked into the Guppy library, and the two are almost the same.

Do you have any recommendations for debugging this issue? Is there any tooling in the SDK that can help debug it?
Please feel free to share your experience if you have debugged similar issues in the past.

Thank you,

Ruoyun Huang