Subject: Error condition stopping the whole ring for good - msg#00007

List: network.spread.user

Hello everyone,

We just had a nightmarish incident with our Spread setup.

It consists of about 15 Spread daemons, all on one Gigabit Ethernet segment.

First, one of the daemons started to print exactly the same messages about once per second until it was killed:

[Sun 13 Jan 2008 15:41:26] Prot_handle_token: BUG WORKAROUND: Too many rounds in EVS state; swallowing token; state:
[Sun 13 Jan 2008 15:41:26] Aru: 3334
[Sun 13 Jan 2008 15:41:26] My_aru: 3334
[Sun 13 Jan 2008 15:41:26] Highest_seq: 2147482050
[Sun 13 Jan 2008 15:41:26] Highest_fifo_seq: 23401
[Sun 13 Jan 2008 15:41:26] Last_discarded: 2147482050
[Sun 13 Jan 2008 15:41:26] Last_delivered: 2147482050
[Sun 13 Jan 2008 15:41:26] Last_seq: 3334
[Sun 13 Jan 2008 15:41:26] Token_rounds: 501
[Sun 13 Jan 2008 15:41:26] Last Token:
[Sun 13 Jan 2008 15:41:26] type: 0x80050080
[Sun 13 Jan 2008 15:41:26] transmiter_id: -1062731508
[Sun 13 Jan 2008 15:41:26] seq: 1
[Sun 13 Jan 2008 15:41:26] proc_id: -1062731508
[Sun 13 Jan 2008 15:41:26] aru: 3334
[Sun 13 Jan 2008 15:41:26] aru_last_id: -1062731505
[Sun 13 Jan 2008 15:41:26] flow_control: 0
[Sun 13 Jan 2008 15:41:26] rtr_len: 1440
[Sun 13 Jan 2008 15:41:26] conf_hash: -2002019299
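
For readers trying to interpret the dump above, here is a minimal Python sketch that parses the numeric fields and decodes the negative IDs. It assumes (this is not confirmed anywhere in this message) that the daemon prints its IDs as signed 32-bit integers holding IPv4 addresses. The helper names parse_dump and int32_to_ipv4 are made up for illustration and are not part of Spread. The sample text is copied from the log above.

```python
import ipaddress
import re

# One dump line looks like "[<timestamp>] <field>: <value>".
LINE_RE = re.compile(r"^\[[^\]]+\]\s+([A-Za-z_ ]+):\s*(\S*)$")

INT32_MAX = 2**31 - 1

SAMPLE = """\
[Sun 13 Jan 2008 15:41:26] Prot_handle_token: BUG WORKAROUND: Too many rounds in EVS state; swallowing token; state:
[Sun 13 Jan 2008 15:41:26] Highest_seq: 2147482050
[Sun 13 Jan 2008 15:41:26] Token_rounds: 501
[Sun 13 Jan 2008 15:41:26] Last Token:
[Sun 13 Jan 2008 15:41:26] type: 0x80050080
[Sun 13 Jan 2008 15:41:26] transmiter_id: -1062731508
[Sun 13 Jan 2008 15:41:26] aru_last_id: -1062731505
"""


def parse_dump(text):
    """Return {field: int} for every line that carries a numeric value."""
    fields = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or not match.group(2):
            continue  # free-text lines and headers such as "Last Token:"
        try:
            # Base 0 accepts both decimal and 0x-prefixed values.
            fields[match.group(1).strip()] = int(match.group(2), 0)
        except ValueError:
            pass  # not a number, ignore
    return fields


def int32_to_ipv4(value):
    """Reinterpret a signed 32-bit integer as a dotted-quad IPv4 address.

    ASSUMPTION: the ID is an IPv4 address printed as a signed int.
    """
    return str(ipaddress.IPv4Address(value & 0xFFFFFFFF))


if __name__ == "__main__":
    fields = parse_dump(SAMPLE)
    for key in ("transmiter_id", "aru_last_id"):
        print(key, "->", int32_to_ipv4(fields[key]))
    print("Highest_seq distance to INT32_MAX:", INT32_MAX - fields["Highest_seq"])
```

Under that assumption, the two IDs decode to 192.168.1.12 and 192.168.1.15. The sketch also shows that Highest_seq lies 1597 below the largest signed 32-bit value. Whether that has anything to do with the failure is not established here.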

After we killed this daemon, all of the other daemons stopped processing messages and client connections. New client connections were immediately closed.

We then stopped all Spread clients, so that no messages would be injected into the Spread ring. Still the same.

Then we began restarting the Spread daemons one after the other, and still nothing would work. The only thing that helped was killing all daemons but one, effectively doing a 'cold boot' of the whole Spread segment.

It seems that something was totally out of sync, and that it would even upset a daemon after it had been restarted.

I believe the cause of this problem is that we lowered the timeouts in membership.c too much. Early on, when we had only 3 daemons and far fewer messages per second, I divided roughly everything by 10, because back then everything seemed to stop every Hurry_timeout (2 s) for a few hundred milliseconds (long enough to be noticed by our application). Lowering the Hurry_timeout made these 'hiccups' occur more often, but they were also shorter. Another workaround was to generate messages with sp_flood, which made the hiccups go away (but I didn't like this very much).

So my questions:

1) Is my guess correct that this error might be triggered by timeouts that are too small under heavy load?

2) Can you explain this very nasty behaviour, where all daemons are stuck and seem to get 'infected' even after restarting?

3) Does anyone have an explanation for the original problem with the Hurry_timeout?

Sorry, I don't have a lot of information to go on, but our first priority was to get the system up again, and the log didn't contain much information.

Have a nice weekend,

Nico Meyer

_______________________________________________
Spread-users mailing list
Spread-users@xxxxxxxxxxxxxxxx
http://lists.spread.org/mailman/listinfo/spread-users
