logo       

Re: Disk cache: msg#00026

network.bit-torrent.libtorrent

Subject: Re: Disk cache

(adding very little, but keeping most old text for context)

Arvid Norberg wrote:
> On Apr 11, 2006, at 21:56, Radu Hociung wrote:
>
>> Arvid Norberg wrote:
>>
>>> On Apr 10, 2006, at 20:53, Radu Hociung wrote:
>>> [...]
>>> It is true though, that in the case where the swarm has tendencies of
>>> being "serialized", it is very common to request pieces which were
>>> just
>>> made available from a peer. But I'm not sure that is a very common
>>> case. Obviously some data is needed here.
>>
>>
>> The serialized downloads make sense in at least two scenarios:
>>
>> a. well-seeded torrents, and/or
>> b. applications where it is desirable to start playback as soon as
>> possible after the download starts. Obviously this makes sense mainly
>> for media-type torrents.
>
>
> With serialized, I meant that the pieces downloaded are distributed in
> a chain, rather than in a star. Possibly a poorly chosen word.
>
> I think that in the case where peers join at different times, and have
> different download speeds, the rare first algorithm might actually make
> it less likely that a newly downloaded piece is requested from a peer.
> Since that piece just got bumped up to a lower priority level (having
> one extra peer).
>
>> [...]
>>
>>> The client isn't completely free of choosing who to upload to/whose
>>> requests to respond do. In order to maximize download speed, it has to
>>> upload to those that it can download from. I also think that the most
>>> common case is that the output pipe very seldom is saturated. So, I'm
>>> not sure weighting peers depending on the pieces they request would
>>> work very well, it would affect the behavior of the client in other
>>> ways.
>>>
>>> Your distinction of cheap and expensive data doesn't take seek
>>> distance
>>> into account. I think that may be the most important thing to
>>> concentrate on, at least when it comes to writing.
>>
>>
>> I thought your number 1 aim deals with seeds on high-speed links, hence
>> no downloading.
>
>
> I think the cases of downloading, uploading and seeding are a bit mixed
> up.
>
>> Since there are different scenarios which have distinct properties,
>> maybe caching behaviour could be altered as necessary. The 4 scenarios
>> I'm thinking of are:
>>
>> a. seeding with out pipe saturated
>> b. seeding with out pipe underutilized
>> c. downloading from torrent "good" or better
>> d. downloading from torrent "acceptable" or worse
>>
>> The key difference between a-b is that the data being requested is
>> likely available cheaper from another seed.
>>
>> The key difference between c-d is that a random dl pattern is
>> unnecessary in c.
>
>
> True. The seeding case seems to be more difficult to optimze. Since it
> would require to pick the "best" piece requests and respond to those,
> and still saturate the out pipe. It is difficult partly because the
> limit of the out pipe is unknown or hard to estimate unless the user
> sets it (and I think the likelyhood of it being set accurately is
> probably not too high). And partly because it involves picking piece
> requests that may belong to different files and estimate the seek
> distance for the drive's head. The selection would also make sure that
> requests are answered sooner or later, to avoid creating dead locks.
>
>>>>> Number 2 is a little bit more interesting though. Since every block
>>>>> that is downloaded has to be written to disk, and since every
>>>>> block is
>>>>> only downloaded once, you cannot really optimize away disk writes.
>>>>> What
>>>>> you can do though, is to optimize the disk seeking.
>>>>
>>>>
>>>>
>>>> Absolutely. Also, since while downloading, the client also uploads, it
>>>> would be more disk efficient to only seek for disk writes, but
>>>> never for
>>>> disk reads. Ie, the client should prefer to respond to requests
>>>> for data
>>>> is has freshly downloaded, instead of reading other data from disk.
>>>> Assuming that the client downloads least-common data first, it's a
>>>> reasonable assumtion that the data it holds in memory buffers is still
>>>> rare to other clients.
>>>
>>>
>>>
>>> Yes, at least to a certain degree. One thing to keep in mind
>>> though, is
>>> that the number of pieces in a torrent is generally much more than the
>>> number of peers that a client can see. How rare a piece is is
>>> quantized
>>> to the number of peers that has it. This means that the number of
>>> equally rare pieces is usually very high, far higher than a write
>>> cache
>>> would be.
>>
>>
>> I think you're making the (healthy) assumption that a client would cache
>> one, or very few pieces for upload? This is very good for lowering
>> memory footprints of the client.
>>
>> What I am thinking is that as soon as a client advertises a completed
>> piece, many swarm members would request it immediately, instead of
>> requesting other pieces. The just-downloaded piece is cheap data, and
>> the downloading client would prioritize its upload.
>
>
> I'm not sure that is very common. Some measurement should be done here.
> Parsing libtorrent logs would be enough here.
>
>> In an ideal swarm, each piece in a torrent would be read from a disk
>> exactly once, and written to disk exactly N-1 times (for N clients in
>> the swarm). A torrent that is well seeded ("good" or better) should be
>> able to approach this ideal, without risking the availability of any
>> piece to be so low as to endanger the torrent. For a torrent that is
>> "acceptable" or worse, the rarest-first approach may be given a higher
>> preference than the ideal.
>
>
> Yes, this is probably a very good idea.
>
>>>> In this case, the library should write to disk when downloaded pieces
>>>> are complete, but do not free the buffers until other clients have
>>>> downloaded them enough to cause the uplink utilization to drop.
>>>>
>>>> I have thought about optimizing the dataflow a lot. I believe
>>>> there are
>>>> at least two good reasons for improving caching of torrents:
>>>>
>>>> 1. Fewer seeks == less noise == drives lasting longer.
>>>> 2. Fewer seeks == drive is more responsive to other apps that run
>>>> concurently.
>>>
>>>
>>>
>>> Definitely, I think the disk perfomance is very important.
>>>
>>> I might add that there's a huge difference for me (OSX 10.4) when
>>> downloading in full allocation mode and compact allocation mode. The
>>> compact allocation mode will have the effect that most reads and
>>> writes
>>> are kept relatively close to each other. They will spread out over
>>> time
>>> of course though. But just downloading with client_test will saturate
>>> my connection at slightly above 1 MB/s if I'm in compact mode. In full
>>> allocation mode though, I will not get above about 600 kB/s, and my
>>> drive is really making a noise and slowing down the computer to be
>>> virtually useless.
>>
>>
>> Fascinating! This would seem to support compact/serialized downloading.
>>
>> [...]
>>
>>> So, I think my first attempt will be to cache pieces, and only write
>>> whole pieces instead of blocks to disk. To see if the performance is
>>> improved. I don't know how to measure it though, maybe I could try
>>> saturating my download link in full allocation mode and see if it
>>> increases.
>>
>>
>> But would this not increase the memory footprint of the application,
>> seeing how as many download buffers are needed as there are parallel
>> downloads, while none of these buffers can be used to upload from, as
>> their data has not been checksummed until they have completed
>> downloading?
>
>
> If the download cache has a fixed size, finished pieces can be kept in
> there as long as possible, allowing it to be used as a read cache as well.

I would just like to suggest a possible definition for "as long as
possible": the algorithm for adding/replacing cached data should be "add
rarest pieces, retire most plentiful pieces, and replace only when the
new piece is rarer than the replaced piece", since a rare piece is more
likely to be requested again.


>>> Do you have any other suggestions that are reasonably easy to try out?
>>
>>
>> I think the first thing to do in studying the effects of libtorrent
>> caching would be to disable the O/S cache, in order to make comparisons
>> between versions relevant.
>>
>> It would be interesting to find out how much riskier it is for a client
>> to serialize its downloads on poor torrents. Obviously it is safest to
>> do rarest-first on such torrents, but I am curious as to the difference
>> between rarest-first and serialized.
>
>
> I would say it's a huge difference. This is not too hard to calculate
> analytically though, no real need to test it.

Right. I gave it a shot in the other subthread with the
quarter-by-quarter-random vs. piece-by-piece-random strategy comparison.
Is that what you had in mind?


>> Also, we should not forget of the relationship between memory footprint
>> and caching effectiveness. It's easy to get more effective by allowing
>> increased memory use, but we should resist increased memory use when
>> possible.
>
>
> Yes. It has to be configurable and possible to disable.
>
>> How are you able to get 1MB/s ? Do you have torrents running on a test
>> network? Are the experiments reproducible? Would you mind describing
>> your test setup?
>
>
> I have a 10 Mb/s symmetric connection, and so do most other people on
> this tracker I use.

Wow... I'm feeling envy and an itch to immigrate :)

Cheers,
Radu.


-------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642


<Prev in Thread] Current Thread [Next in Thread>
Google Custom Search

News | FAQ | advertise