[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Artemis 2.4.0 message loss in durability tests upon system power-off


Thanks so much for following up.  It will be helpful for those who might
run into this issue in the future.


Justin

On Thu, Jun 14, 2018 at 11:48 AM, Anindya Haldar <anindya.haldar@xxxxxxxxxx>
wrote:

> Sorry I forgot to update the thread on this. Yes we tested the scenarios
> in bare metal systems, and found the durability tests passed there. So I am
> guessing the issue we saw came from using the VirtualBox VM. It worked as
> expected when we eliminated VirtualBox from the equation.
>
> Thanks,
>
> Anindya Haldar
> Oracle Marketing Cloud
>
>
>
> > On Jun 13, 2018, at 7:31 PM, Justin Bertram <jbertram@xxxxxxxxxx> wrote:
> >
> > Did you have a chance to test this scenario on a bare metal system?  If
> so,
> > what were the results?  If not, did you find the root cause of the
> missing
> > messages in something related to the VM?
> >
> > Based on your other recent email to the list about HA I assume you've
> moved
> > past this issue, but I wanted to confirm for sure.
> >
> > Thanks!
> >
> >
> > Justin
> >
> > On Wed, Feb 14, 2018 at 11:32 AM, Anindya Haldar <
> anindya.haldar@xxxxxxxxxx>
> > wrote:
> >
> >> We powered off the VM while the producers were kicking and alive, and no
> >> one was consuming. Then we tallied for the number of committed messages
> by
> >> the producers. After that we restart the VM, and then restart the
> broker,
> >> and take the queue stats. Then we use the JMS QueueBrowser API to count
> the
> >> number of actual messages in the queues. Finally we consumer all
> messages
> >> from the queues and tally them against the number of messages committed
> by
> >> the producers at the time the failure was triggered.
> >>
> >> We are looking forward to run the tests using a bare metal system in
> order
> >> to eliminate VirtualBox VM from the picture.
> >>
> >> Thanks,
> >> Anindya Haldar
> >>
> >> Oracle Marking Cloud
> >>
> >> -----Original Message-----
> >> From: Justin Bertram [mailto:jbertram@xxxxxxxxxx]
> >> Sent: Wednesday, February 14, 2018 6:58 AM
> >> To: users@xxxxxxxxxxxxxxxxxxx
> >> Subject: Re: Artemis 2.4.0 message loss in durability tests upon system
> >> power-off
> >>
> >> The "messages added" metric for a queue is volatile so when the broker
> is
> >> stopped it will be reset.  When the broker is started again the
> "messages
> >> added" will be 0.  In your test you say the broker is "powered off" and
> >> then you "resume" the broker.  What exactly does this mean?  It seems
> clear
> >> that you aren't actually shutting down the broker otherwise the
> "messages
> >> added" would be 0 when you started your consumers.  Please clarify.
> >>
> >> Also, how do the broker's metrics compare with the producer's and
> >> consumer's metrics?  I assume here that the producer and consumer are
> both
> >> tracking the number of messages they produce/consume.
> >>
> >> Also, do you have a way to reproduce this without a VM?
> >>
> >>
> >> Justin
> >>
> >> On Mon, Feb 5, 2018 at 7:11 PM, Anindya Haldar <
> anindya.haldar@xxxxxxxxxx>
> >> wrote:
> >>
> >>> We are in the process of qualifying Artemis 2.4.0 for our stack. We
> >>> ran some message durability related tests in the face of a power
> >>> failure. The broker is running in a VirtualBox VM, and is set up in a
> >>> system where disk caching is disabled. The VM runs OEL Linux 7, and
> >>> the VirtualBox Manger itself is running under Windows 7 Enterprise.
> >>>
> >>>
> >>>
> >>> We use JMS API and persistent messaging. The transaction batch size in
> >>> the producers is 1, and the message size for the tests in 1024 bytes.
> >>> No consumers are running at this time, and we let the queues build up.
> >>> Then the VirtualBox VM running the broker is 'powered off' (using
> >>> VirtualBox
> >>> facilities) 5 minutes along the way. The producers detect the broker's
> >>> absence and stop.
> >>>
> >>>
> >>>
> >>> Then we resume the VM and the broker. The broker starts up and we get
> >>> the queue stats from it before anything else:
> >>>
> >>>
> >>>
> >>> |NAME                     |ADDRESS                  |CONSUMER_COUNT
> >>> |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
> >>> |testQueue1               |testQueue1               |0
> >>> |106988        |106988         |0                |0              |
> >>> |testQueue2               |testQueue2               |0
> >>> |107077        |107077         |0                |0              |
> >>> |testQueue3               |testQueue3               |0
> >>> |106996        |106996         |0                |0              |
> >>> |testQueue4               |testQueue4               |0
> >>> |107076        |107076         |0                |0              |
> >>>
> >>>
> >>>
> >>> The total message count across the queues is 428137.
> >>>
> >>> Now we start the consumers (no producers this time). Finally when the
> >>> consumers finish, we get the stats again. The consumers are claiming
> >>> that they received and acknowledged 428126 messages, which is
> >>> corroborated by the broker in the MESSAGES_ACKED column.
> >>>
> >>>
> >>>
> >>> |NAME                     |ADDRESS                  |CONSUMER_COUNT
> >>> |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
> >>>
> >>> |testQueue1               |testQueue1               |0              |0
> >>>         |106988         |0                |106984         |
> >>>
> >>> |testQueue2               |testQueue2               |0              |0
> >>>         |107077         |0                |107074         |
> >>>
> >>> |testQueue3               |testQueue3               |0              |0
> >>>         |106996         |0                |106992         |
> >>>
> >>> |testQueue4               |testQueue4               |0              |0
> >>>         |107076         |0                |107076         |
> >>>
> >>>
> >>>
> >>> You can clearly see some apparent anomalies:
> >>>
> >>> 1)      Post failure, and upon resumption, the broker said it had
> 428,137
> >>> messages in the test queues, all combined (column MESSAGES_ADDED).
> >>>
> >>> 2)      When the consumers consumed it got 428,126 messages and
> >>> acknowledged all of them. That is 11 short of 428,137.
> >>>
> >>> 3)      The broker, upon the consumers' completion reported 0 queue
> >> depth,
> >>> but also said it got acknowledgements on 428,126 messages (column
> >>> MESSAGES_ACKED).
> >>>
> >>>
> >>>
> >>> Questions:
> >>>
> >>> 1)      If we assume the 'MESSAGES_ADDED' column is accurate, then what
> >>> happed to additional 11 messages that the consumers never received,
> >>> and, as a result never acknowledged?
> >>>
> >>> 2)      If, according to the broker, the number of acknowledged
> messages
> >>> is 11 less than the number of messages added to the queue, why did it
> >>> declare the queues to be empty when 11 of the messages were not
> >>> acknowledged?
> >>>
> >>> 3)      If we trust the 'MESSAGES_ADDED' stats as a baseline number
> then
> >>> the system lost messages. And if we do not trust that statistic then
> >>> what do we trust, and how do we know if it lost messages?
> >>>
> >>>
> >>>
> >>> The system ran into this issue 3 out of 4 times I ran the VM power
> >>> failure test (with slightly different statistics, of course). We are
> >>> very concerned that it is symptom of message loss in the system, and
> >>> are also concerned about how to explain the anomalies. Will greatly
> >>> appreciate any pointer that can help us understand and address the
> >> underlying issue here.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> Anindya Haldar
> >>>
> >>> Oracle Marketing Cloud
> >>>
> >>>
> >>>
> >>
>
>