|
Re: Disk cache: msg#00026network.bit-torrent.libtorrent
(adding very little, but keeping most old text for context) Arvid Norberg wrote: > On Apr 11, 2006, at 21:56, Radu Hociung wrote: > >> Arvid Norberg wrote: >> >>> On Apr 10, 2006, at 20:53, Radu Hociung wrote: >>> [...] >>> It is true though, that in the case where the swarm has tendencies of >>> being "serialized", it is very common to request pieces which were >>> just >>> made available from a peer. But I'm not sure that is a very common >>> case. Obviously some data is needed here. >> >> >> The serialized downloads make sense in at least two scenarios: >> >> a. well-seeded torrents, and/or >> b. applications where it is desirable to start playback as soon as >> possible after the download starts. Obviously this makes sense mainly >> for media-type torrents. > > > With serialized, I meant that the pieces downloaded are distributed in > a chain, rather than in a star. Possibly a poorly chosen word. > > I think that in the case where peers join at different times, and have > different download speeds, the rare first algorithm might actually make > it less likely that a newly downloaded piece is requested from a peer. > Since that piece just got bumped up to a lower priority level (having > one extra peer). > >> [...] >> >>> The client isn't completely free of choosing who to upload to/whose >>> requests to respond do. In order to maximize download speed, it has to >>> upload to those that it can download from. I also think that the most >>> common case is that the output pipe very seldom is saturated. So, I'm >>> not sure weighting peers depending on the pieces they request would >>> work very well, it would affect the behavior of the client in other >>> ways. >>> >>> Your distinction of cheap and expensive data doesn't take seek >>> distance >>> into account. I think that may be the most important thing to >>> concentrate on, at least when it comes to writing. >> >> >> I thought your number 1 aim deals with seeds on high-speed links, hence >> no downloading. > > > I think the cases of downloading, uploading and seeding are a bit mixed > up. > >> Since there are different scenarios which have distinct properties, >> maybe caching behaviour could be altered as necessary. The 4 scenarios >> I'm thinking of are: >> >> a. seeding with out pipe saturated >> b. seeding with out pipe underutilized >> c. downloading from torrent "good" or better >> d. downloading from torrent "acceptable" or worse >> >> The key difference between a-b is that the data being requested is >> likely available cheaper from another seed. >> >> The key difference between c-d is that a random dl pattern is >> unnecessary in c. > > > True. The seeding case seems to be more difficult to optimze. Since it > would require to pick the "best" piece requests and respond to those, > and still saturate the out pipe. It is difficult partly because the > limit of the out pipe is unknown or hard to estimate unless the user > sets it (and I think the likelyhood of it being set accurately is > probably not too high). And partly because it involves picking piece > requests that may belong to different files and estimate the seek > distance for the drive's head. The selection would also make sure that > requests are answered sooner or later, to avoid creating dead locks. > >>>>> Number 2 is a little bit more interesting though. Since every block >>>>> that is downloaded has to be written to disk, and since every >>>>> block is >>>>> only downloaded once, you cannot really optimize away disk writes. >>>>> What >>>>> you can do though, is to optimize the disk seeking. >>>> >>>> >>>> >>>> Absolutely. Also, since while downloading, the client also uploads, it >>>> would be more disk efficient to only seek for disk writes, but >>>> never for >>>> disk reads. Ie, the client should prefer to respond to requests >>>> for data >>>> is has freshly downloaded, instead of reading other data from disk. >>>> Assuming that the client downloads least-common data first, it's a >>>> reasonable assumtion that the data it holds in memory buffers is still >>>> rare to other clients. >>> >>> >>> >>> Yes, at least to a certain degree. One thing to keep in mind >>> though, is >>> that the number of pieces in a torrent is generally much more than the >>> number of peers that a client can see. How rare a piece is is >>> quantized >>> to the number of peers that has it. This means that the number of >>> equally rare pieces is usually very high, far higher than a write >>> cache >>> would be. >> >> >> I think you're making the (healthy) assumption that a client would cache >> one, or very few pieces for upload? This is very good for lowering >> memory footprints of the client. >> >> What I am thinking is that as soon as a client advertises a completed >> piece, many swarm members would request it immediately, instead of >> requesting other pieces. The just-downloaded piece is cheap data, and >> the downloading client would prioritize its upload. > > > I'm not sure that is very common. Some measurement should be done here. > Parsing libtorrent logs would be enough here. > >> In an ideal swarm, each piece in a torrent would be read from a disk >> exactly once, and written to disk exactly N-1 times (for N clients in >> the swarm). A torrent that is well seeded ("good" or better) should be >> able to approach this ideal, without risking the availability of any >> piece to be so low as to endanger the torrent. For a torrent that is >> "acceptable" or worse, the rarest-first approach may be given a higher >> preference than the ideal. > > > Yes, this is probably a very good idea. > >>>> In this case, the library should write to disk when downloaded pieces >>>> are complete, but do not free the buffers until other clients have >>>> downloaded them enough to cause the uplink utilization to drop. >>>> >>>> I have thought about optimizing the dataflow a lot. I believe >>>> there are >>>> at least two good reasons for improving caching of torrents: >>>> >>>> 1. Fewer seeks == less noise == drives lasting longer. >>>> 2. Fewer seeks == drive is more responsive to other apps that run >>>> concurently. >>> >>> >>> >>> Definitely, I think the disk perfomance is very important. >>> >>> I might add that there's a huge difference for me (OSX 10.4) when >>> downloading in full allocation mode and compact allocation mode. The >>> compact allocation mode will have the effect that most reads and >>> writes >>> are kept relatively close to each other. They will spread out over >>> time >>> of course though. But just downloading with client_test will saturate >>> my connection at slightly above 1 MB/s if I'm in compact mode. In full >>> allocation mode though, I will not get above about 600 kB/s, and my >>> drive is really making a noise and slowing down the computer to be >>> virtually useless. >> >> >> Fascinating! This would seem to support compact/serialized downloading. >> >> [...] >> >>> So, I think my first attempt will be to cache pieces, and only write >>> whole pieces instead of blocks to disk. To see if the performance is >>> improved. I don't know how to measure it though, maybe I could try >>> saturating my download link in full allocation mode and see if it >>> increases. >> >> >> But would this not increase the memory footprint of the application, >> seeing how as many download buffers are needed as there are parallel >> downloads, while none of these buffers can be used to upload from, as >> their data has not been checksummed until they have completed >> downloading? > > > If the download cache has a fixed size, finished pieces can be kept in > there as long as possible, allowing it to be used as a read cache as well. I would just like to suggest a possible definition for "as long as possible": the algorithm for adding/replacing cached data should be "add rarest pieces, retire most plentiful pieces, and replace only when the new piece is rarer than the replaced piece", since a rare piece is more likely to be requested again. >>> Do you have any other suggestions that are reasonably easy to try out? >> >> >> I think the first thing to do in studying the effects of libtorrent >> caching would be to disable the O/S cache, in order to make comparisons >> between versions relevant. >> >> It would be interesting to find out how much riskier it is for a client >> to serialize its downloads on poor torrents. Obviously it is safest to >> do rarest-first on such torrents, but I am curious as to the difference >> between rarest-first and serialized. > > > I would say it's a huge difference. This is not too hard to calculate > analytically though, no real need to test it. Right. I gave it a shot in the other subthread with the quarter-by-quarter-random vs. piece-by-piece-random strategy comparison. Is that what you had in mind? >> Also, we should not forget of the relationship between memory footprint >> and caching effectiveness. It's easy to get more effective by allowing >> increased memory use, but we should resist increased memory use when >> possible. > > > Yes. It has to be configurable and possible to disable. > >> How are you able to get 1MB/s ? Do you have torrents running on a test >> network? Are the experiments reproducible? Would you mind describing >> your test setup? > > > I have a 10 Mb/s symmetric connection, and so do most other people on > this tracker I use. Wow... I'm feeling envy and an itch to immigrate :) Cheers, Radu. ------------------------------------------------------- Using Tomcat but need to do more? Need to support web services, security? Get stuff done quickly with pre-integrated technology to make your job easier Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642 |
|
| <Prev in Thread] | Current Thread | [Next in Thread> |
|---|---|---|
| Previous by Date: | Re: Disk cache: 00026, Radu Hociung |
|---|---|
| Next by Date: | Re: Disk cache: 00026, Arvid Norberg |
| Previous by Thread: | Re: Disk cachei: 00026, Arvid Norberg |
| Next by Thread: | Re: Disk cache: 00026, Arvid Norberg |
| Indexes: | [Date] [Thread] [Top] [All Lists] |
| News | FAQ | advertise |