Hello everyone,
We just had a nightmarish incident with our Spread setup.
It consists of about 15 Spread daemons, all in one gigabit Ethernet segment.
First, one of the daemons started to spit out these exact same messages about once per second until it was killed:
[Sun 13 Jan 2008 15:41:26] Prot_handle_token: BUG WORKAROUND: Too many rounds in EVS state; swallowing token; state:
[Sun 13 Jan 2008 15:41:26] Aru: 3334
[Sun 13 Jan 2008 15:41:26] My_aru: 3334
[Sun 13 Jan 2008 15:41:26] Highest_seq: 2147482050
[Sun 13 Jan 2008 15:41:26] Highest_fifo_seq: 23401
[Sun 13 Jan 2008 15:41:26] Last_discarded: 2147482050
[Sun 13 Jan 2008 15:41:26] Last_delivered: 2147482050
[Sun 13 Jan 2008 15:41:26] Last_seq: 3334
[Sun 13 Jan 2008 15:41:26] Token_rounds: 501
[Sun 13 Jan 2008 15:41:26] Last Token:
[Sun 13 Jan 2008 15:41:26] type: 0x80050080
[Sun 13 Jan 2008 15:41:26] transmiter_id: -1062731508
[Sun 13 Jan 2008 15:41:26] seq: 1
[Sun 13 Jan 2008 15:41:26] proc_id: -1062731508
[Sun 13 Jan 2008 15:41:26] aru: 3334
[Sun 13 Jan 2008 15:41:26] aru_last_id: -1062731505
[Sun 13 Jan 2008 15:41:26] flow_control: 0
[Sun 13 Jan 2008 15:41:26] rtr_len: 1440
[Sun 13 Jan 2008 15:41:26] conf_hash: -2002019299
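In case it helps with reading the dump: the negative IDs look like IPv4 addresses printed as signed 32-bit integers. That is an assumption, not something the log itself confirms. The small sketch below reinterprets them as unsigned values and formats them as dotted-quad addresses. It also shows how far Highest_seq is from the largest signed 32-bit value, which may or may not be related to the problem. The helper name `signed32_to_ipv4` is made up for this illustration.

```python
import ipaddress

INT32_MAX = 2**31 - 1


def signed32_to_ipv4(value: int) -> str:
    """Reinterpret a signed 32-bit integer as unsigned and format it as IPv4.

    Assumption: the daemon IDs in the log are IPv4 addresses printed as
    signed integers.
    """
    return str(ipaddress.IPv4Address(value & 0xFFFFFFFF))


# Values copied from the log above.
log_ids = {
    "transmiter_id": -1062731508,
    "proc_id": -1062731508,
    "aru_last_id": -1062731505,
}
highest_seq = 2147482050

for name, raw in log_ids.items():
    print(f"{name}: {raw} -> {signed32_to_ipv4(raw)}")

# Distance of Highest_seq from the signed 32-bit maximum.
print(f"Highest_seq is {INT32_MAX - highest_seq} below INT32_MAX")
```

Under that assumption, the token was last sent by 192.168.1.12 and aru_last_id points at 192.168.1.15, while Highest_seq sits 1597 below the signed 32-bit maximum.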
After killing this daemon, all of the other daemons stopped processing messages and client connections. New client connections were immediately closed.
We then stopped all Spread clients so that no messages would be injected into the Spread ring. Still the same.
Then we restarted one Spread daemon after the other, and still nothing worked. The only thing that helped was killing all daemons but one, effectively doing a 'cold boot' of the whole Spread segment.
It seems like something was totally out of sync and would even upset a daemon after it was restarted.
I believe the reason for this problem is that we lowered the timeouts in membership.c too much. I roughly divided everything by 10 in the beginning, when we only had 3 daemons and far fewer messages/s, because back then everything seemed to stop every Hurry_timeout (2 s) for a few hundred ms (enough to be noticed by our application). Lowering the Hurry_timeout made these 'hiccups' appear more often, but they were also shorter. Another workaround was producing messages with sp_flood, which made the hiccups go away (but I didn't like this very much).
So my questions:
1) Is my guess correct that this error might be triggered by timeouts that are too small under heavy load?
2) Can you explain this very nasty behaviour where all daemons are stuck and seem to get 'infected' even after restarting?
3) Does anyone have an explanation for the original problem with the Hurry_timeout?
Sorry, I don't have a lot of information to go on, but our first priority was to get the system up again, and the log didn't contain much information.
Have a nice weekend,
Nico Meyer
_______________________________________________
Spread-users mailing list
Spread-users@xxxxxxxxxxxxxxxx
http://lists.spread.org/mailman/listinfo/spread-users