
Re: CML RSS: msg#00040

science.chemistry.blue-obelisk

Subject: Re: CML RSS

Hi Henry, all,

Thanks for the comments, it's been helpful to see where CMLRSS is coming from. I still think enclosures are needed to take things forward.

A quick scenario that hopefully shows why I think enclosures are a much better idea: let's suppose I subscribe to a feed that's 15 items long, where each item includes a 1MB CML file. I poll the feed every hour for updates. Let's say that client and server are well written and use If-Modified-Since, so if there are no changes I don't download the feed. Using inline CML, if one item changes, I've got to retrieve the whole 15MB feed again, 14MB of which I already have. Using enclosures I retrieve a few KB of feed and then the one 1MB file that changed. N.B. filtering the inline CML doesn't help me here - I still have to download the lot before I can filter it.
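
To put rough numbers on that scenario, here is a minimal back-of-envelope sketch in Python. The item count and file size are the ones above; the 5 KB figure for a feed without its CML payloads is purely my assumption, and the function names are made up for illustration.

```python
# Rough transfer cost of one poll, using the figures from the scenario.
ITEMS = 15              # items in the feed (from the scenario)
CML_SIZE_KB = 1024      # 1 MB of CML per item (from the scenario)
FEED_OVERHEAD_KB = 5    # ASSUMED size of the feed minus its CML payloads


def inline_cost_kb(changed: int) -> int:
    """Inline CML: any change means re-downloading the entire feed."""
    if changed == 0:
        return 0  # If-Modified-Since yields 304 Not Modified, no body
    return FEED_OVERHEAD_KB + ITEMS * CML_SIZE_KB


def enclosure_cost_kb(changed: int) -> int:
    """Enclosures: fetch the small feed, then only the changed files."""
    if changed == 0:
        return 0
    return FEED_OVERHEAD_KB + changed * CML_SIZE_KB


if __name__ == "__main__":
    for changed in (0, 1, 15):
        print(changed, inline_cost_kb(changed), enclosure_cost_kb(changed))
```

Under these assumptions the two approaches only cost the same when every item changes at once; for a single changed item the inline feed costs roughly fifteen times as much.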

More discussion inline:

Rzepa, Henry wrote:
At 12:09 +0100 24/5/07, Jim Downing wrote:
Hi all,

I've been talking to Nick Day about the issues he's been having with CMLRSS,
i.e. the feed itself is problematically large.

The original intent was to provide CML within the feed to provide
metainformation
about the molecule and to allow eg visual display (using Jmol/JChemPaint)
and allow opportunities to filter the feeds according to eg molecular formula,
or connectivity (which Jmol implemented in part).
Having to retrieve each entry as an enclosure would inhibit/
prevent such uses, particularly if there were many (1000s)
of such enclosures.

This is a valid point - it wouldn't be as easy to filter enclosures (although it's not much harder), and I take Daniel's point that filtering needs to be done on the data.... but... do we really need megabytes of data in the stream so we can filter on molecular formula or a connection table? Couldn't the CMLRSS have an extract inline and then an enclosure link to the full data?
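
As a rough illustration of what I mean by "an extract inline plus an enclosure link", here is a sketch in Python that builds one such entry. It assumes Atom as the container and uses Atom's `link rel="enclosure"`; the `build_entry` and `formula_of` helpers, the choice of a CML `formula` element as the extract, the `chemical/x-cml` media type and the example.org URL are all illustrative assumptions, not an agreed CMLRSS format. A real Atom entry would also need `id` and `updated`, which I've left out for brevity.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
CML = "http://www.xml-cml.org/schema"
ET.register_namespace("atom", ATOM)
ET.register_namespace("cml", CML)


def build_entry(title, formula, enclosure_url, length_bytes):
    """Build an entry with a small inline CML extract and an enclosure link."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    # Inline extract: just enough CML to filter on (assumed: the formula).
    mol = ET.SubElement(entry, f"{{{CML}}}molecule")
    ET.SubElement(mol, f"{{{CML}}}formula", {"concise": formula})
    # The full CML document stays out of the feed, behind an enclosure link.
    ET.SubElement(entry, f"{{{ATOM}}}link", {
        "rel": "enclosure",
        "href": enclosure_url,
        "type": "chemical/x-cml",
        "length": str(length_bytes),
    })
    return entry


def formula_of(entry):
    """Read the inline formula without touching the enclosure."""
    node = entry.find(f"{{{CML}}}molecule/{{{CML}}}formula")
    return node.get("concise") if node is not None else None


if __name__ == "__main__":
    e = build_entry("Ethanol", "C 2 H 6 O 1",
                    "http://example.org/cml/ethanol.cml", 1048576)
    print(ET.tostring(e, encoding="unicode"))
```

A client can then filter on `formula_of(entry)` and fetch the enclosure only for the entries it actually wants.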

Another intent of CMLRSS is that it would in fact
be dynamically generated by eg a PHP/MySQL query, which would
restrict the answers returned. Obviously, if CMLRSS is in fact a full
expression of a database containing 1M molecules, this would not
be practical.

This is a strawman - enclosures don't depend on the idea of streaming an entire database in order to be a useful development.

We (Nick Day and I) are having performance problems now, using CMLRSS to do what it was designed to do. The CMLRSS Nick generates from a single edition of Acta E is somewhere between 20MB and 40MB. Built with a DOM approach, it occupies over 64MB of RAM, so Nick has had to resort to using StAX to generate the file. I know this doesn't sound like much, but in a server environment you don't need too many requests like that in parallel to cause difficulties.
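
For anyone unfamiliar with the distinction, here is a minimal Python sketch of streaming generation, where each entry is written out as it is produced instead of building the whole tree in memory first. This is only an analogue of the approach, not Nick's code (which uses StAX in Java); the `write_feed` function and the bare `feed`/`entry`/`formula` element names are hypothetical.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_feed(out, molecules):
    """Stream a feed to `out`, one entry at a time.

    `molecules` may be any iterable (e.g. a generator over a database
    cursor) of (id, formula) pairs; only one pair is held at a time.
    """
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("feed", {})
    for mol_id, formula in molecules:
        gen.startElement("entry", {"id": mol_id})
        gen.startElement("formula", {})
        gen.characters(formula)
        gen.endElement("formula")
        gen.endElement("entry")
    gen.endElement("feed")
    gen.endDocument()


if __name__ == "__main__":
    buf = io.StringIO()
    write_feed(buf, (("m%d" % i, "C2H6O") for i in range(3)))
    print(buf.getvalue())
```

Streaming keeps the generator's memory use flat, but it does nothing for the client, which still has to download the whole feed.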

Yes, these files contain more than 15 molecules, because each issue of Acta E contains more than 15 molecules. I suppose we could "save up" the updates and let them out every other day in batches of 15, but I feel that's a pound of complexity in implementation for the sake of a penny of simplicity in the protocol.

As we all know, CML files can get a lot bigger than the ones Nick's generating.

Granted, if it's used to provide a feed for 1000s of molecules or a smaller
number of very large molecules, the feed itself does get large. However,
recollect that another use of RSS is audio/video. Here, one item may be
50-500 Mbytes in size. Remember, RSS is not designed as "real time"
system, but designed to work automatically "overnight". Thus size
may not be a particular concern when machine is talking to machine.
Having one file containing eg 1000 molecules might be more efficient
than one file containing merely pointers to 1000 molecules, which
would require 1000 http requests.

I'm no expert, but the podcasts I've seen have all been delivered as enclosures. Speaking as a part-time sysadmin, I'd prefer the load generated by an enclosure-based approach (content spread over many short-lived connections) to that of an inline approach (much content over a single long-lived connection).

RSS 1.0 which was used for CMLRSS does not in fact support the concept
of an enclosure, for which RSS 2.0 was developed specifically. Atom
may also support enclosure, but as I understand it, only RSS 1.0 allows
RDF to be delivered (neither RSS 2.0 nor Atom do this). So
if we do go down this route, we might have to produce separate RSS 1.0
and eg Atom 1.0 feeds. I have not studied the specifications recently,
so I may well need to be corrected on this. Part of the original intention
was that CMLRSS could be used to automatically populate an RDF
triple store, so the loss of RDF would be missed.

Atom specifies that any foreign markup must be tolerated, so RDF/XML should be fine.

Daniel Zaharevitz wrote:
<snip/>I also echo Henry's misgivings about moving the actual chemical information out of the main feed. The most important summary IS the chemical structure (in the way I think chemists would be interested in using it) and while there is no big problem with having a display of structures with a "click here for additional data", I think it would be pretty useless to have a list of "new compound", "new compound", etc. with a "click here to see structure". <snip/>

It's not a case of clicks - a CMLRSS client would have to know what to do with the enclosure just the same as it currently has to know what to do with the inline data. Programmatically the differences aren't great.

Best regards,

jim

