logo       

Re: Disk cache: msg#00023

network.bit-torrent.libtorrent

Subject: Re: Disk cache

On Apr 11, 2006, at 21:56, Radu Hociung wrote:
Arvid Norberg wrote:
On Apr 10, 2006, at 20:53, Radu Hociung wrote:
[...]
It is true though, that in the case where the swarm has tendencies of
being "serialized", it is very common to request pieces which were just
made available from a peer. But I'm not sure that is a very common
case. Obviously some data is needed here.

The serialized downloads make sense in at least two scenarios:

a. well-seeded torrents, and/or
b. applications where it is desirable to start playback as soon as
possible after the download starts. Obviously this makes sense mainly
for media-type torrents.

With serialized, I meant that the pieces downloaded are distributed in a chain, rather than in a star. Possibly a poorly chosen word.

I think that in the case where peers join at different times, and have different download speeds, the rare first algorithm might actually make it less likely that a newly downloaded piece is requested from a peer. Since that piece just got bumped up to a lower priority level (having one extra peer).

[...]
The client isn't completely free of choosing who to upload to/whose
requests to respond do. In order to maximize download speed, it has to
upload to those that it can download from. I also think that the most
common case is that the output pipe very seldom is saturated. So, I'm
not sure weighting peers depending on the pieces they request would
work very well, it would affect the behavior of the client in other ways.

Your distinction of cheap and expensive data doesn't take seek distance
into account. I think that may be the most important thing to
concentrate on, at least when it comes to writing.

I thought your number 1 aim deals with seeds on high-speed links, hence
no downloading.

I think the cases of downloading, uploading and seeding are a bit mixed up.

Since there are different scenarios which have distinct properties,
maybe caching behaviour could be altered as necessary. The 4 scenarios
I'm thinking of are:

a. seeding with out pipe saturated
b. seeding with out pipe underutilized
c. downloading from torrent "good" or better
d. downloading from torrent "acceptable" or worse

The key difference between a-b is that the data being requested is
likely available cheaper from another seed.

The key difference between c-d is that a random dl pattern is
unnecessary in c.

True. The seeding case seems to be more difficult to optimze. Since it would require to pick the "best" piece requests and respond to those, and still saturate the out pipe. It is difficult partly because the limit of the out pipe is unknown or hard to estimate unless the user sets it (and I think the likelyhood of it being set accurately is probably not too high). And partly because it involves picking piece requests that may belong to different files and estimate the seek distance for the drive's head. The selection would also make sure that requests are answered sooner or later, to avoid creating dead locks.

Number 2 is a little bit more interesting though. Since every block
that is downloaded has to be written to disk, and since every block is
only downloaded once, you cannot really optimize away disk writes.
What
you can do though, is to optimize the disk seeking.


Absolutely. Also, since while downloading, the client also uploads, it
would be more disk efficient to only seek for disk writes, but never for
disk reads. Ie, the client should prefer to respond to requests for data
is has freshly downloaded, instead of reading other data from disk.
Assuming that the client downloads least-common data first, it's a
reasonable assumtion that the data it holds in memory buffers is still
rare to other clients.


Yes, at least to a certain degree. One thing to keep in mind though, is
that the number of pieces in a torrent is generally much more than the
number of peers that a client can see. How rare a piece is is quantized
to the number of peers that has it. This means that the number of
equally rare pieces is usually very high, far higher than a write cache
would be.

I think you're making the (healthy) assumption that a client would cache
one, or very few pieces for upload? This is very good for lowering
memory footprints of the client.

What I am thinking is that as soon as a client advertises a completed
piece, many swarm members would request it immediately, instead of
requesting other pieces. The just-downloaded piece is cheap data, and
the downloading client would prioritize its upload.

I'm not sure that is very common. Some measurement should be done here. Parsing libtorrent logs would be enough here.

In an ideal swarm, each piece in a torrent would be read from a disk
exactly once, and written to disk exactly N-1 times (for N clients in
the swarm). A torrent that is well seeded ("good" or better) should be
able to approach this ideal, without risking the availability of any
piece to be so low as to endanger the torrent. For a torrent that is
"acceptable" or worse, the rarest-first approach may be given a higher
preference than the ideal.

Yes, this is probably a very good idea.

In this case, the library should write to disk when downloaded pieces
are complete, but do not free the buffers until other clients have
downloaded them enough to cause the uplink utilization to drop.

I have thought about optimizing the dataflow a lot. I believe there are
at least two good reasons for improving caching of torrents:

1. Fewer seeks == less noise == drives lasting longer.
2. Fewer seeks == drive is more responsive to other apps that run
concurently.


Definitely, I think the disk perfomance is very important.

I might add that there's a huge difference for me (OSX 10.4) when
downloading in full allocation mode and compact allocation mode. The
compact allocation mode will have the effect that most reads and writes
are kept relatively close to each other. They will spread out over time
of course though. But just downloading with client_test will saturate
my connection at slightly above 1 MB/s if I'm in compact mode. In full
allocation mode though, I will not get above about 600 kB/s, and my
drive is really making a noise and slowing down the computer to be
virtually useless.

Fascinating! This would seem to support compact/serialized downloading.

[...]
So, I think my first attempt will be to cache pieces, and only write
whole pieces instead of blocks to disk. To see if the performance is
improved. I don't know how to measure it though, maybe I could try
saturating my download link in full allocation mode and see if it
increases.

But would this not increase the memory footprint of the application,
seeing how as many download buffers are needed as there are parallel
downloads, while none of these buffers can be used to upload from, as
their data has not been checksummed until they have completed downloading?

If the download cache has a fixed size, finished pieces can be kept in there as long as possible, allowing it to be used as a read cache as well.

Do you have any other suggestions that are reasonably easy to try out?

I think the first thing to do in studying the effects of libtorrent
caching would be to disable the O/S cache, in order to make comparisons
between versions relevant.

It would be interesting to find out how much riskier it is for a client
to serialize its downloads on poor torrents. Obviously it is safest to
do rarest-first on such torrents, but I am curious as to the difference
between rarest-first and serialized.

I would say it's a huge difference. This is not too hard to calculate analytically though, no real need to test it.

Also, we should not forget of the relationship between memory footprint
and caching effectiveness. It's easy to get more effective by allowing
increased memory use, but we should resist increased memory use when
possible.

Yes. It has to be configurable and possible to disable.

How are you able to get 1MB/s ? Do you have torrents running on a test
network? Are the experiments reproducible? Would you mind describing
your test setup?

I have a 10 Mb/s symmetric connection, and so do most other people on this tracker I use.


--
Arvid Norberg




-------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642


<Prev in Thread] Current Thread [Next in Thread>
Google Custom Search

News | FAQ | advertise