To whom it might concern,
We have observed TCP bandwidth problems when running benchmarks that
gradually increase their payload and have most of their data flowing in
one direction. The degradation is observed between two dual Xeon/E7500
machines connected by GbE over a cross-over cable. The GbE interface is
eth1 and has an MTU of 9000; eth0 is Fast Ethernet with an MTU of 1500.
The machines are running Linux 2.4.20, and the NIC in question is an
Intel Corp. 82544GC Gigabit Ethernet Controller (rev 02).
While changing TCP parameter settings and systematically varying the
socket rx/tx buffer sizes, I discovered that once in a while the
benchmark ran well. The good runs were not deterministic and had nothing
to do with the parameter changes. Hence, I let the machines run over the
weekend and saved the tcpdump of a successful run. An analysis of the
"OK" run vs. the "BAD" run reveals a couple of interesting things. The
first problem is that the advertised window of the receiver does not
increase in the "BAD" run. The second problem, which also affects the
"OK" run, is that the ratio of the number of packets received
(#advertisements) to the number of packets sent (from src to dst) is 1
for the "BAD" scenario (this is probably a consequence of the window
being only ~2xMTU in size). Surprisingly, it is as large as 0.5 for the
"OK" scenario. IMHO, this violates RFC813, which states:
  There are two reasons for prompt acknowledgement. One is to prevent
  retransmission. We will discuss later how to determine whether
  unnecessary retransmission is occurring. The other reason one
  acknowledges promptly is to permit further data to be sent. However,
  the previous section makes quite clear that it is not always desirable
  to send a little bit of data, even though the receiver may have room
  for it. Therefore, one can state a general rule that under normal
  operation, the receiver of data need not, and for efficiency reasons
  should not, acknowledge the data unless either the acknowledgement is
  intended to produce an increased useable window, is necessary in order
  to prevent retransmission or is being sent as part of a reverse
  direction segment being sent for some other reason. We will consider an
  algorithm to achieve these goals.
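To make the quoted rule concrete, here is a minimal Python sketch of the three conditions as a predicate. It only illustrates the RFC813 wording; it is not the Linux 2.4.20 logic, and the names ReceiverState and should_ack, as well as the fields, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReceiverState:
    """Hypothetical snapshot of a receiver; not taken from any real stack."""
    advertised_window: int      # bytes last advertised to the sender
    usable_window: int          # bytes the receiver could advertise right now
    retransmit_risk: bool       # withholding the ACK may trigger a retransmission
    reverse_data_pending: bool  # a segment is going back to the sender anyway


def should_ack(state: ReceiverState) -> bool:
    """Return True if any of the three conditions in the quoted rule holds."""
    opens_window = state.usable_window > state.advertised_window
    return opens_window or state.retransmit_risk or state.reverse_data_pending
```

Under this reading, a receiver whose window stays at ~2xMTU and that has no reverse traffic would only need to acknowledge to prevent retransmission, which is why one advertisement per two data packets in the "OK" run looks high.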
The two tcpdumps have been analyzed by a simple awk script, which
gives the following information for every 1/10 of a second of the runtime:
time: elapsed time relative to the first packet
MB/s: sum of TCP payload (based on TCP sequence numbers) * 1e-6 / delta time
avg_len: average packet length (TCP payload) sent
nsent: number of packets sent from src to dst
avg_win: average window size advertised by the src
adv/sent: ratio between #advertisements (sum of prompt acks and
piggybacked acks) and #packets sent
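The awk script itself is not included here, so the following Python sketch only illustrates how such per-interval figures could be derived from already-parsed packets. Several things are assumptions: the Packet record and the summarize function are hypothetical names, payload is summed from per-packet lengths rather than from sequence numbers (so retransmissions would be counted twice), and avg_win is taken from the packets flowing back towards the sender.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical parsed tcpdump record."""
    time: float     # capture timestamp in seconds
    from_src: bool  # True for src -> dst (data), False for dst -> src (acks)
    payload: int    # TCP payload length in bytes
    window: int     # window field of the TCP header in bytes


def summarize(packets, bin_size=0.1):
    """Group packets into bins of bin_size seconds and compute the metrics."""
    if not packets:
        return []
    t0 = packets[0].time
    bins = defaultdict(list)
    for p in packets:
        bins[int((p.time - t0) / bin_size)].append(p)

    rows = []
    for idx in sorted(bins):
        sent = [p for p in bins[idx] if p.from_src]
        advs = [p for p in bins[idx] if not p.from_src]
        nbytes = sum(p.payload for p in sent)
        rows.append({
            "time": idx * bin_size,
            "MB/s": nbytes * 1e-6 / bin_size,
            "avg_len": nbytes / len(sent) if sent else 0.0,
            "nsent": len(sent),
            "avg_win": sum(p.window for p in advs) / len(advs) if advs else 0.0,
            "adv/sent": len(advs) / len(sent) if sent else 0.0,
        })
    return rows
```

Intervals that contain no packets at all are simply absent from the result; a fuller version might emit zero rows for them so that the graphs have no gaps.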
Extracts of the analysis are enclosed as "ok.txt" and "bad.txt". The
information is also presented as graphs in "bw_winsiz.png" and
"bw_adv-sent.png". In the graphs, the "OK" data uses the bottom x-axis
and the "BAD" data uses the upper x-axis (the "BAD" run-time is longer
due to the lower bandwidth). Bandwidth uses the left y-axis, whereas the
average advertised window size and the ratio of #advertisements to
#packets sent use the right y-axis.
I do hope this information is useful to you and that the problem can be
fixed. Since I do not read the mailing lists, I would appreciate a note
back by email if someone picks up on this.
Cheers, Håkon
--
Håkon Bugge; VP Product Development; Scali AS;
mailto:hob@xxxxxxxx;
http://www.scali.com;
fax: +47 22 62 89 51;
Voice: +47 22 62 89 50; Cellular (Europe+US): +47 924 84 514;
Visiting Addr: Olaf Helsets vei 6, Bogerud, N-0621 Oslo, Norway;
Mail Addr: Scali AS, Postboks 150, Oppsal, N-0619 Oslo,
Norway;
Attachments:
bw_adv-sent.png (PNG image)
bad.txt (text document)
bw_winsiz.png (PNG image)
ok.txt (text document)