Re: RFC HTTP::Cache module: msg#00041

lang.perl.modules.lwp

Subject: Re: RFC HTTP::Cache module

Mattias Holmlund wrote:

Hi,

I have written a perl module that I want to publish on CPAN. The module
implements a cache for http requests. It provides a single method get
that fetches a url via http. The result of each get is stored in a
cache on disk, and if the same url has been requested before, the
proper If-None-Match (carrying the stored ETag) and If-Modified-Since headers are sent to the http server.
The server can then respond that the object in the cache is up-to-date
and HTTP::Cache will return a cached version of the data instead of
fetching it from the server. This speeds up the HTTP get and saves
bandwidth for both the server and the client.
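
A minimal sketch of the revalidation step described above, not the module's actual code. The helper names conditional_headers and choose_content and the layout of the cache entry are assumptions made for illustration. Under HTTP/1.1 the stored ETag value travels back to the server in an If-None-Match request header, and a 304 Not Modified reply means the cached copy can be reused.

```perl
use strict;
use warnings;

# Hypothetical helper: build the conditional request headers from the
# metadata that was saved alongside the cached copy.
sub conditional_headers {
    my ($entry) = @_;
    my %headers;

    # The ETag from the earlier response goes back in If-None-Match.
    $headers{'If-None-Match'} = $entry->{etag}
        if defined $entry->{etag};

    # The earlier Last-Modified value goes back in If-Modified-Since.
    $headers{'If-Modified-Since'} = $entry->{last_modified}
        if defined $entry->{last_modified};

    return \%headers;
}

# Hypothetical helper: pick the body to hand back to the caller.
# 304 means the cached copy is still valid, 200 means fresh data
# arrived, anything else is treated as a failure here.
sub choose_content {
    my ($status, $body, $entry) = @_;
    return $entry->{content} if $status == 304;
    return $body             if $status == 200;
    return;
}

# Example cache entry (all values are made up).
my $entry = {
    etag          => '"abc123"',
    last_modified => 'Tue, 15 Nov 1994 12:45:26 GMT',
    content       => 'cached body',
};

my $headers = conditional_headers($entry);
print "$_: $headers->{$_}\n" for sort keys %$headers;
```

With LWP::UserAgent the resulting pairs could be passed straight to get, for example $ua->get( $url, %$headers ), and the response code fed to choose_content.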

The module is very simple to use. Simply create an HTTP::Cache object
and call the get-method on the returned object to fetch a url. You do
not have to care if the data is fetched from the server or from the
cache (but you can find out if you want to know).


Sample usage:

    my $c = HTTP::Cache->new( {
        BasePath  => "/tmp/cache", # Directory to store the cache in.
        MaxAge    => 8*24,         # How many hours should items be
                                   # kept in the cache after they
                                   # were last accessed?
                                   # Default is 8*24.
        Verbose   => 1,            # Print messages to STDERR.
                                   # Default is 0.
        UserAgent => "my-spider",  # The user-agent string to use.
                                   # Default is "perl-http-cache".
    } );

    my( $content, $error ) = $c->get( $url );

    if( defined( $content ) )
    {
        # Data retrieved and stored in $content.
        # $error indicates if the data was found in the cache (0),
        # if it was fetched from the server but equal to the cache (1),
        # or if it was fetched from the server and different from the
        # cache (2).
    }
    else
    {
        print STDERR "Failed to fetch $url. " .
                     "Error returned by server: $error";
    }

Does anyone object to putting this module on CPAN or is it redundant?
Is HTTP::Cache a good name for it?


Regards,


Mattias Holmlund


I'm sure the real programmers on the list will chime in regarding the redundancy issues, as I'm sure there are already plenty of proxy modules around that might have caching ability, if not caching modules directly. However, I haven't yet gotten around to exploring that domain on CPAN, so I can't comment on it myself.

However, a few things to consider:

1) It sucks having to re-implement a subset of LWP::UserAgent parameters in your module (like UserAgent). Even if you're simply passing them along verbatim to the UserAgent constructor, you still have to provide some documentation in your module, and you can't possibly cover all of the params. You could simply say that params get passed through to LWP::UserAgent, I suppose.

2) If you try to over-simplify the process, you eliminate the option of using any libwww functionality that is more involved than simply calling get(). Eventually people will want to be able to cache POSTs, or check the HTTP status code of the response, and other such things, and you will be busy re-implementing everything that's already implemented.

How about instead of providing "get" methods and returning "content" directly, you integrate properly into the libwww module and cache/return HTTP::Response objects? You can still key on the url (ignoring the parameters, unlike Apache::DBI), although POST content might need to be part of the cache key.

Perhaps you could make HTTP::Cache one of those "magic" modules that if you simply "use" it, or load it and set a global variable, caching starts happening automagically (in the background you could override a few pieces of libwww to insert the caching in the appropriate place - should be fairly seamless).

3) Some global cache configuration options would be nice (instead of per-request). You could look at squid as a model (squid being the premier open source web caching application), but off the top of my head:

a) set a max-live time (global, or per mime-type, or per domain.... you can get as fancy as you dream)
b) turn on/off depending on verb (like GET, POST) or if query-string params detected
c) set default "expires" time if the web server doesn't offer one
d) whether or not to even bother trying to HEAD the url or just go straight for the goods
e) yes, a user-agent string
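
To make points 2 and 3 a bit more concrete, here is a rough sketch of a cache that stores whole response objects under a key built from the verb, the url and any POST body, and that is driven by a few global settings. Everything in it is hypothetical: the %config keys, the function names and the in-memory %store (standing in for a cache on disk) are made up for illustration and are not part of HTTP::Cache or libwww.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical global settings, loosely following a) and b) above.
my %config = (
    max_age_hours => 8 * 24,       # how long an entry may live
    cache_methods => { GET => 1 }, # verbs whose responses are cached
    cache_queries => 0,            # cache urls with a query string?
);

my %store;   # in-memory stand-in for the cache on disk

# Build the cache key from the verb, the url and (for POST) the body.
sub cache_key {
    my ($method, $url, $body) = @_;
    $body = '' unless defined $body;
    return md5_hex( join( "\n", uc($method), $url, $body ) );
}

# Decide from the global settings whether a request may be cached.
sub cacheable {
    my ($method, $url) = @_;
    return 0 unless $config{cache_methods}{ uc($method) };
    return 0 if !$config{cache_queries} && $url =~ /\?/;
    return 1;
}

# Store a response (for example an HTTP::Response object).
# $now is optional and only there to make the code easy to test.
sub cache_store {
    my ($method, $url, $body, $response, $now) = @_;
    return unless cacheable($method, $url);
    $store{ cache_key($method, $url, $body) } = {
        response => $response,
        stored   => defined $now ? $now : time,
    };
    return 1;
}

# Return the stored response, or nothing if it is missing or too old.
sub cache_lookup {
    my ($method, $url, $body, $now) = @_;
    $now = time unless defined $now;
    my $hit = $store{ cache_key($method, $url, $body) } or return;
    return if $now - $hit->{stored} > $config{max_age_hours} * 3600;
    return $hit->{response};
}
```

Because the stored value is opaque to the cache, the caller gets back whatever was put in, so status codes and headers survive along with the content.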

-ofer




