
Re: Out of memory exception

I hate trying to troubleshoot memory usage; it's nearly impossible to do
well, especially after the fact. It's very hard to tell exactly what causes
the increase in memory.

When measuring memory, it's important to understand that top measures the
process's total resident memory, which 1) includes memory that's not on the
heap (thread stacks, metaspace, direct buffers, the JVM's own code), and 2)
doesn't differentiate between heap memory that's allocated and heap memory
that's free. In general, it's a pretty awful tool for evaluating memory
issues, and you'll be much better off with a JMX-based tool such as
JConsole, JVisualVM, Mission Control, etc.
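To see the used-vs-committed distinction concretely, the JVM's own
MemoryMXBean exposes both numbers. This standalone snippet is just a sketch
of the gap between what top reports and what's actually live on the heap;
it's not broker-specific:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapSnapshot {
    public static void main(String[] args) {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        // "committed" is roughly the heap portion of what top shows as RSS;
        // "used" is the data (live plus not-yet-collected garbage) actually
        // occupying the heap. The two can differ by gigabytes.
        System.out.println("used      = " + heap.getUsed());
        System.out.println("committed = " + heap.getCommitted());
        System.out.println("max       = " + heap.getMax());
    }
}
```

Run the same queries remotely over JMX against the broker and you get the
numbers JConsole/JVisualVM graph for you.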

One thing that comes to mind is that if you're using non-persistent
messages, they're written to the memory store, which keeps its messages in
(surprise, surprise) memory. If you haven't configured your limits to keep
that store significantly under your total heap size (I'd say 50% max), it's
possible that you just had a bunch of unconsumed messages pile up until they
filled it. Based on your description of a sudden, fast growth, that seems
like the most likely scenario, but I'm sure there are other possibilities
as well.
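Those limits live in the systemUsage element of activemq.xml. Something
along these lines (the numbers are placeholders for illustration; size them
against your actual heap and disk):

```xml
<!-- activemq.xml: illustrative values only; size these against your heap -->
<systemUsage>
  <systemUsage sendFailIfNoSpace="true">
    <memoryUsage>
      <!-- cap the in-memory store well below the heap, e.g. ~50% of 4GB -->
      <memoryUsage limit="2 gb"/>
    </memoryUsage>
    <storeUsage>
      <storeUsage limit="50 gb"/>
    </storeUsage>
    <tempUsage>
      <tempUsage limit="10 gb"/>
    </tempUsage>
  </systemUsage>
</systemUsage>
```

With sendFailIfNoSpace="true" producers get an exception instead of
blocking when a limit is hit; whether that's the behavior you want depends
on your clients.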

The pattern of memory usage you're describing in your final paragraph
doesn't sound like garbage collection. GC activity follows a saw-tooth
pattern, where it goes up and to the right for a while and then
instantaneously drops significantly before slowly growing again. Usually
this occurs pretty frequently, over a period of seconds or minutes, not
days. But again, you'd only see that via a JMX-based tool, not in top. What
you described might be the result of the JVM growing and shrinking the heap
over time (staying in the range between -Xms and -Xmx); there's clearly a
tie-in to the garbage collector there, in that if you're not freeing up
significant amounts of memory the JVM will never consider shrinking the
heap, but it's not a direct result of GC activity per se.
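If that grow/shrink behavior is what's confusing your external monitoring,
pinning the initial heap to the maximum removes it. Sketch only; the
variable name and 4g value are examples, so match them to wherever your
broker's JVM options are actually set:

```shell
# In bin/env (or wherever the broker's JVM options are set), making -Xms
# equal to -Xmx stops the JVM from resizing the heap: the RSS that top
# reports stays flat, and only the JMX-visible used-heap number moves.
ACTIVEMQ_OPTS_MEMORY="-Xms4g -Xmx4g"
```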


On Sat, May 5, 2018 at 6:29 PM, Lionel van den Berg <lionelv@xxxxxxxxx>
wrote:

> Sorry about the non-response; I was monitoring, and since we increased the
> memory to 4GB we hadn't seen an issue until two days ago. The memory use I
> found was from top; the lab result we got where it "hung" hasn't happened
> again, and I'm not sure what else was at play there.
> Two days ago we had an instance at site where it ran out of memory again at
> 4GB. The memory use climbed rapidly, in under two hours, from 2GB to 4GB,
> and then it threw the exception. We have a monitoring script which
> restarted it; unfortunately we must have a bug in our client code, so they
> didn't automatically reconnect :(. Also unfortunately, because it was at
> site and we didn't have experts immediately available, we failed to get the
> logs.
> So I'm sure we must have a bug in our system somewhere. My suspicion is on
> our client side, where the majority of subscribers are, and in particular
> where the subscriptions are made to what I call dynamic topic names (topic
> names are made at run time by altering one field that represents a device
> id), so these subscriptions should come and go as the devices enter the
> system and then are removed. I'm not sure if some of these could be leaking
> so that consumers are not being removed, but if that were the case we
> wouldn't expect more than 50 new ones per day, so it doesn't explain such a
> rapid increase. I wonder if our consumers could be getting slow due to
> processing load, and whether that can have any impact even though the
> topics are not durable?
> Any further pointers on likely causes? And what kind of config could I look
> at in AMQ that would possibly protect against such scenarios? I realise
> it's almost certainly an issue in our system code.
> Also at site we have been monitoring memory usage through an NMS, and we
> found that the minimum is around 800MB, which is fine, but it will grow
> steadily to say 2GB and then suddenly drop, over an hour or two, back to
> 1GB or so. In the above case it was already at 2GB of course. Is this
> likely to be garbage collection? The drop in memory use was quite
> infrequent, sometimes 2 days apart, which is odd to me.
> On 6 April 2018 at 23:13, Tim Bain <tbain@xxxxxxxxxxxxxxx> wrote:
> > 1GB sounds a little small for that volume, especially if there is any
> > danger of some consumers of durable topics being offline for a while, or
> of
> > all consumers on a given queue being offline. Either way, you've proven
> > that 1GB isn't enough, by hitting an OOM. The fact that you haven't hit
> it
> > till now probably means you could get away with using 2GB, but if your
> host
> > has the memory available, I'm never going to argue against using it.
> >
> > In your test environment, I'm confused about how you can limit the JVM to
> > 4GB of heap, and then have it take 5GB. Unless the 5GB number is total
> > memory as measured by something like top? If so, that just means that the
> > JVM made the heap 4GB, but it doesn't mean that there's actually 4GB of
> > data in it. Top can't tell you that, so you'd want to use JConsole or
> > JVisualVM to get an understanding of how much heap is actually used and
> how
> > much time is being spent GCing.
> >
> > Also, can you more clearly describe what you mean by "unresponsive"?
> >
> > Tim
> >
> > On Fri, Apr 6, 2018, 12:22 AM Lionel van den Berg <lionelv@xxxxxxxxx>
> > wrote:
> >
> > > Hi,
> > >
> > > We're still investigating, turning up logging etc. but we've come
> across
> > > two issues:
> > >
> > > 1. At our site deployment with default memory usage (1GB) AMQ threw an
> > > out of memory exception. We couldn't determine exactly why, whether it
> > > was cumulative memory use or a peak memory use. We have around 50
> > > connections and perhaps a few thousand topics with quite a lot of data,
> > > perhaps 4GB/hour going in and 15 x that much going out.
> > >
> > > 2. In our lab we increased the memory available to 4GB by modifying env
> > > (see attached) and turned up logging (also see attached); within about
> > > 5 hours AMQ had reached 5GB and hung without an exception.
> > > Unfortunately the system wasn't being monitored, and apparently the
> > > logs weren't any good because they'd rolled over too many times.
> > >
> > > I realise the information is a little vague at this stage so I'm only
> > > looking for pointers on where to look.
> > >
> >