|
Re: Disk cache: msg#00009network.bit-torrent.libtorrent
Hello Arvid, and all, I think caching at the application level can be far more efficient than at the O/S level, as the application typically can manage characteristics of the dataflow, whereas the O/S cannot. Please see below. I will use "cheap data" and "expensive data" to refer to in-buffer and on-disk data respectively. The theory of Nash equilibrium applies very much to torrents, I think. Downloading torrents is a non-cooperative game, in that each client is concerned with ensuring only its own success, and the strategy it must employ is of maximizing the chance for success while minimizing its cost of achieving success. Please see below for my comments on caching. Arvid Norberg wrote: > There has been some discussion on the IRC channel about benefits and > possible cache algorithms for a disk cache in libtorrent. I'm posting > this hoping that somebody could comment on some of the techniques or > assumptions. > > The way I see it, there are two problems that a cache aims to solve: > > 1. seeding on a high speed connection will make the hard drive seek a > lot back and forth, just to do short reads. > > 2. downloading on a high speed connection will make the hard drive seek > a lot back and forth. > > > For 1, one assumption that can be made is that if block 0 is requested > from a piece, it's very likely that many more blocks from that piece > will be requested soon. So caching the entire piece at once would > probably be a good idea. > > This is exactly what a normal OS' disk cache does. It reads ahead into > the cache. So I don't really see any benefit if caching this at > application level. One possible optimization would be if the disk cache > could be bypassed so that only one piece was cached (since the disk > cache in the OS may cache more than the piece we're reading from). This > would be hard to do, and especially hard to do in a platform > independent manner. > > Is this a reasonable assumption? If the application is able to fill the available outgoing bandwidth with cheap data, there is no benefit to reading more/other data from the disk, is there? I think it would be a good idea for the library to prefer to respond to requests for cheap data, and postpone responses to requests for expensive data until the utilization of the available outgoing bandwidth drops. Clients that request expensive data should be choked until cheap data can no longer fill the outgoing pipe, and it becomes necessary to read expensive data to fill the pipe. > Number 2 is a little bit more interesting though. Since every block > that is downloaded has to be written to disk, and since every block is > only downloaded once, you cannot really optimize away disk writes. What > you can do though, is to optimize the disk seeking. Absolutely. Also, since while downloading, the client also uploads, it would be more disk efficient to only seek for disk writes, but never for disk reads. Ie, the client should prefer to respond to requests for data is has freshly downloaded, instead of reading other data from disk. Assuming that the client downloads least-common data first, it's a reasonable assumtion that the data it holds in memory buffers is still rare to other clients. In this case, the library should write to disk when downloaded pieces are complete, but do not free the buffers until other clients have downloaded them enough to cause the uplink utilization to drop. I have thought about optimizing the dataflow a lot. I believe there are at least two good reasons for improving caching of torrents: 1. Fewer seeks == less noise == drives lasting longer. 2. Fewer seeks == drive is more responsive to other apps that run concurently. I believe the on-drive cache is better than nothing, O/S cache is better than the on-drive cache, and application cache is better than O/S cache. The closer the cache is to the data generation/consumption algorithm, the more information is available for optimizing it. I think this approach to caching gives an answer to multiple-torrent caching as well. One more thing: I have considered working on a test-bench that is able to run a swarm. The testbench would be able to report disk-access efficiency, by measuring some metrics, like how far apart the written pieces are in memory (ie, seek throw), total number of accesses, etc. The idea is that it can be calculated empirically what the minimum number of accesses, and long seeks is to fully download a file, while saturating the output link. The metrics given by the testbench could be compared to the ideal, and a measure of client efficiency would thus be obtained. (Naturally, min-long-seeks = 0, and min-accesses = file-size/buffer-size) The test-bench would hook O/S calls in a manner similar to strace, and would be able to run multiple makes/brands/versions of clients, such that swarm behaviour can be observed. This would be very useful to study potential protocol changes and their effect on older-version clients. I am very interested in your thoughts and comments on caching, Regards, Radu. ------------------------------------------------------- This SF.Net email is sponsored by xPML, a groundbreaking scripting language that extends applications into web and mobile media. Attend the live webcast and join the prime developer group breaking into this new coding territory! http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642 |
|
| <Prev in Thread] | Current Thread | [Next in Thread> |
|---|---|---|
| Previous by Date: | Re: Disk cache: 00009, Arvid Norberg |
|---|---|
| Next by Date: | Re: Disk cache: 00009, Arvid Norberg |
| Previous by Thread: | Re: Disk cachei: 00009, Arvid Norberg |
| Next by Thread: | Re: Disk cache: 00009, Arvid Norberg |
| Indexes: | [Date] [Thread] [Top] [All Lists] |
| News | FAQ | advertise |