Re: Artemis 2.4.0 message loss in durability tests upon system power-off


Sorry, I forgot to update the thread on this. Yes, we tested the scenarios on bare metal systems, and the durability tests passed there. So I am guessing the issue we saw came from using the VirtualBox VM; everything worked as expected once we eliminated VirtualBox from the equation.

Thanks,

Anindya Haldar
Oracle Marketing Cloud



> On Jun 13, 2018, at 7:31 PM, Justin Bertram <jbertram@xxxxxxxxxx> wrote:
> 
> Did you have a chance to test this scenario on a bare metal system?  If so,
> what were the results?  If not, did you find the root cause of the missing
> messages in something related to the VM?
> 
> Based on your other recent email to the list about HA I assume you've moved
> past this issue, but I wanted to confirm for sure.
> 
> Thanks!
> 
> 
> Justin
> 
> On Wed, Feb 14, 2018 at 11:32 AM, Anindya Haldar <anindya.haldar@xxxxxxxxxx>
> wrote:
> 
>> We powered off the VM while the producers were up and running and no
>> one was consuming. Then we tallied the number of messages committed by
>> the producers. After that we restarted the VM, restarted the broker,
>> and took the queue stats. Then we used the JMS QueueBrowser API to count
>> the number of actual messages in the queues. Finally we consumed all
>> messages from the queues and tallied them against the number of messages
>> committed by the producers at the time the failure was triggered.
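
The tally described above can be sketched as a three-way comparison. This is only an illustration under stated assumptions: the class name, the verdict names, and the counts in `main` are hypothetical and do not come from the test runs in this thread. It also assumes that a commit whose reply never reached the producer before the power-off may still have been persisted by the broker, so a surplus on the consumer side is treated differently from a shortfall.

```java
// Hypothetical helper for the post-failure tally; not part of Artemis or JMS.
final class DurabilityTally {

    enum Verdict { CONSISTENT, POSSIBLE_LOSS, UNCONFIRMED_EXTRAS, BROWSE_MISMATCH }

    /**
     * @param committed messages the producers saw committed before the failure
     * @param browsed   messages counted with a queue browser after restart
     * @param consumed  messages actually received and acknowledged by consumers
     */
    static Verdict check(long committed, long browsed, long consumed) {
        // Fewer messages delivered than were confirmed as committed: suspect loss.
        if (consumed < committed) {
            return Verdict.POSSIBLE_LOSS;
        }
        // The browser count and the consumed count should agree with each other.
        if (browsed != consumed) {
            return Verdict.BROWSE_MISMATCH;
        }
        // More delivered than confirmed: assumed to be commits that reached the
        // broker but whose reply the producer never saw before the power-off.
        if (consumed > committed) {
            return Verdict.UNCONFIRMED_EXTRAS;
        }
        return Verdict.CONSISTENT;
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        System.out.println(check(1000, 1000, 1000));
        System.out.println(check(1000, 1000, 997));
        System.out.println(check(1000, 1002, 1002));
    }
}
```

The producer-side committed count is used as the baseline here because, as noted further down the thread, the broker's "messages added" metric is volatile.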
>> 
>> We are looking forward to running the tests on a bare metal system in
>> order to eliminate the VirtualBox VM from the picture.
>> 
>> Thanks,
>> Anindya Haldar
>> 
>> Oracle Marketing Cloud
>> 
>> -----Original Message-----
>> From: Justin Bertram [mailto:jbertram@xxxxxxxxxx]
>> Sent: Wednesday, February 14, 2018 6:58 AM
>> To: users@xxxxxxxxxxxxxxxxxxx
>> Subject: Re: Artemis 2.4.0 message loss in durability tests upon system
>> power-off
>> 
>> The "messages added" metric for a queue is volatile so when the broker is
>> stopped it will be reset.  When the broker is started again the "messages
>> added" will be 0.  In your test you say the broker is "powered off" and
>> then you "resume" the broker.  What exactly does this mean?  It seems clear
>> that you aren't actually shutting down the broker otherwise the "messages
>> added" would be 0 when you started your consumers.  Please clarify.
>> 
>> Also, how do the broker's metrics compare with the producer's and
>> consumer's metrics?  I assume here that the producer and consumer are both
>> tracking the number of messages they produce/consume.
>> 
>> Also, do you have a way to reproduce this without a VM?
>> 
>> 
>> Justin
>> 
>> On Mon, Feb 5, 2018 at 7:11 PM, Anindya Haldar <anindya.haldar@xxxxxxxxxx>
>> wrote:
>> 
>>> We are in the process of qualifying Artemis 2.4.0 for our stack. We
>>> ran some message durability related tests in the face of a power
>>> failure. The broker is running in a VirtualBox VM, and is set up in a
>>> system where disk caching is disabled. The VM runs OEL Linux 7, and
>>> the VirtualBox Manager itself is running under Windows 7 Enterprise.
>>> 
>>> 
>>> 
>>> We use the JMS API and persistent messaging. The transaction batch size
>>> in the producers is 1, and the message size for the tests is 1024 bytes.
>>> No consumers are running at this time, and we let the queues build up.
>>> Then the VirtualBox VM running the broker is 'powered off' (using
>>> VirtualBox facilities) 5 minutes along the way. The producers detect the
>>> broker's absence and stop.
>>> 
>>> 
>>> 
>>> Then we resume the VM and the broker. The broker starts up and we get
>>> the queue stats from it before anything else:
>>> 
>>> 
>>> 
>>> |NAME       |ADDRESS    |CONSUMER_COUNT |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
>>> |testQueue1 |testQueue1 |0              |106988        |106988         |0                |0              |
>>> |testQueue2 |testQueue2 |0              |107077        |107077         |0                |0              |
>>> |testQueue3 |testQueue3 |0              |106996        |106996         |0                |0              |
>>> |testQueue4 |testQueue4 |0              |107076        |107076         |0                |0              |
>>> 
>>> 
>>> 
>>> The total message count across the queues is 428137.
>>> 
>>> Now we start the consumers (no producers this time). Finally when the
>>> consumers finish, we get the stats again. The consumers are claiming
>>> that they received and acknowledged 428126 messages, which is
>>> corroborated by the broker in the MESSAGES_ACKED column.
>>> 
>>> 
>>> 
>>> |NAME       |ADDRESS    |CONSUMER_COUNT |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
>>> |testQueue1 |testQueue1 |0              |0             |106988         |0                |106984         |
>>> |testQueue2 |testQueue2 |0              |0             |107077         |0                |107074         |
>>> |testQueue3 |testQueue3 |0              |0             |106996         |0                |106992         |
>>> |testQueue4 |testQueue4 |0              |0             |107076         |0                |107076         |
>>> 
>>> 
>>> 
>>> You can clearly see some apparent anomalies:
>>> 
>>> 1)      Post failure, and upon resumption, the broker said it had 428,137
>>> messages in the test queues, all combined (column MESSAGES_ADDED).
>>> 
>>> 2)      When the consumers ran, they received 428,126 messages and
>>> acknowledged all of them. That is 11 short of 428,137.
>>> 
>>> 3)      The broker, upon the consumers' completion, reported 0 queue
>>> depth, but also said it got acknowledgements on 428,126 messages (column
>>> MESSAGES_ACKED).
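
The arithmetic behind these three points can be checked per queue. The sketch below hard-codes the MESSAGES_ADDED and MESSAGES_ACKED values from the two stats tables in this message; the class and method names are hypothetical and only serve the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that reconciles the figures quoted in this message.
final class QueueStatsDelta {

    /** Per queue: {MESSAGES_ADDED, MESSAGES_ACKED}, copied from the tables. */
    static Map<String, long[]> stats() {
        Map<String, long[]> s = new LinkedHashMap<>();
        s.put("testQueue1", new long[] {106988L, 106984L});
        s.put("testQueue2", new long[] {107077L, 107074L});
        s.put("testQueue3", new long[] {106996L, 106992L});
        s.put("testQueue4", new long[] {107076L, 107076L});
        return s;
    }

    /** Sums one column (0 = added, 1 = acked) across all queues. */
    static long total(Map<String, long[]> stats, int column) {
        long sum = 0;
        for (long[] row : stats.values()) {
            sum += row[column];
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, long[]> s = stats();
        // Print the added-minus-acked gap for each queue, then the totals.
        for (Map.Entry<String, long[]> e : s.entrySet()) {
            long[] row = e.getValue();
            System.out.println(e.getKey() + " unacked: " + (row[0] - row[1]));
        }
        long added = total(s, 0);
        long acked = total(s, 1);
        System.out.println("added=" + added + " acked=" + acked
                + " gap=" + (added - acked));
    }
}
```

Broken down this way, the gap of 11 is spread over testQueue1, testQueue2 and testQueue3 (4, 3 and 4 messages), while testQueue4 shows no gap.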
>>> 
>>> 
>>> 
>>> Questions:
>>> 
>>> 1)      If we assume the 'MESSAGES_ADDED' column is accurate, then what
>>> happened to the additional 11 messages that the consumers never received,
>>> and, as a result, never acknowledged?
>>> 
>>> 2)      If, according to the broker, the number of acknowledged messages
>>> is 11 less than the number of messages added to the queue, why did it
>>> declare the queues to be empty when 11 of the messages were not
>>> acknowledged?
>>> 
>>> 3)      If we trust the 'MESSAGES_ADDED' stats as a baseline number then
>>> the system lost messages. And if we do not trust that statistic then
>>> what do we trust, and how do we know if it lost messages?
>>> 
>>> 
>>> 
>>> The system ran into this issue 3 out of 4 times I ran the VM power
>>> failure test (with slightly different statistics, of course). We are
>>> very concerned that it is a symptom of message loss in the system, and
>>> are also concerned about how to explain the anomalies. We would greatly
>>> appreciate any pointer that can help us understand and address the
>>> underlying issue here.
>>> 
>>> 
>>> 
>>> Thanks,
>>> 
>>> Anindya Haldar
>>> 
>>> Oracle Marketing Cloud
>>> 
>>> 
>>> 
>>