osdir.com
mailing list archive

Subject: Re: Thought on future of XMLC - msg#00123

List: java.enhydra.xmlc

On Friday 29 November 2002 11:38, Arno Schatz wrote:
> David,
>
> you would need to modify the DOM implementation such that it will be aware
> if there was any change which the application does to the tree. Also you
> would need more information from the parsing process: When the DOM is
> created you would need to store the beginning offset and ending offset
> (from the original HTML string) within the DOM node. The original HTML
> string must be stored in memory of course (little overhead). In the output
> process, for each DOM node we would need to check whether it was changed in
> some way by the application. If yes, produce the html from the node output
> as it is done now. If no, take the substring from the original html from
> the beginning offset to the ending offset and return that as result.

That's basically what the LazyDOM does, with one small but important
difference: The LazyDOM doesn't store the *original* HTML, but rather caches
the HTML that is constructed by the "standard" output process. The reason for
this is simply that it is much (orders of magnitude :-) easier to create HTML
text from a DOM than to create a DOM-like structure from broken HTML-like
text.
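
A minimal Java sketch of the idea in the paragraph above: a node caches the
text produced by the normal output step, and any change throws the cache away.
The names here (CachedTextSketch, TextNode, escape) are hypothetical and are
not the LazyDOM API; this only illustrates the caching pattern.

```java
final class CachedTextSketch {
    /** Hypothetical text node; not a LazyDOM class. */
    static final class TextNode {
        private String data;
        private String cachedHtml;   // null means "not generated yet" or "invalidated"
        int conversions = 0;         // counts how often the expensive path ran

        TextNode(String data) { this.data = data; }

        void setData(String data) {
            this.data = data;
            this.cachedHtml = null;  // any change invalidates the cached text
        }

        String toHtml() {
            if (cachedHtml == null) {
                cachedHtml = escape(data); // the "standard" output step
                conversions++;
            }
            return cachedHtml;
        }
    }

    /** Converts the characters that must become HTML entities. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TextNode n = new TextNode("a < b & c");
        System.out.println(n.toHtml()); // generated once
        System.out.println(n.toHtml()); // served from the cache
        n.setData("x > y");
        System.out.println(n.toHtml()); // regenerated after the change
    }
}
```

Because the cache is filled from the DOM, nothing here ever has to parse
HTML-like input; the original page text is not needed at all.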

The other difference is that the LazyDOM caches preformatted texts per DOM
node - so, you still have to walk the tree and output each node. But the
tree walk really isn't that much of a performance hit - the big hit is the
text conversion (especially detecting characters that need to be converted to
HTML entities). That said, changing the text cache so that a complete,
unchanged subtree can be output in a single operation is something I've
wanted to do for a while now, and I'll probably implement it along the way
when XMLC is changed to no longer depend on a specific version of Xerces - so
expect this for XMLC 3.something :-)
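
What "output a complete, unchanged subtree in a single operation" could look
like, as a rough Java sketch. Everything here (SubtreeCacheSketch, Node,
invalidate, write) is a hypothetical illustration and not how XMLC or the
LazyDOM is implemented. Text content is assumed to be escaped already.

```java
import java.util.ArrayList;
import java.util.List;

final class SubtreeCacheSketch {
    /** Hypothetical node: an element if tag != null, otherwise a text node. */
    static final class Node {
        final String tag;
        private String text;                  // only used by text nodes
        private Node parent;
        private final List<Node> children = new ArrayList<>();
        private String cachedSubtree;         // this node plus all descendants, serialized

        private Node(String tag, String text) { this.tag = tag; this.text = text; }
        static Node element(String tag) { return new Node(tag, null); }
        static Node text(String text)   { return new Node(null, text); }

        Node append(Node child) {
            child.parent = this;
            children.add(child);
            invalidate();
            return this;
        }

        void setText(String text) {
            this.text = text;
            invalidate();
        }

        /** A change makes the cache of this node and of every ancestor stale. */
        private void invalidate() {
            for (Node n = this; n != null; n = n.parent) {
                n.cachedSubtree = null;
            }
        }

        void write(StringBuilder out) {
            if (cachedSubtree == null) {
                StringBuilder sb = new StringBuilder();
                if (tag == null) {
                    sb.append(text);
                } else {
                    sb.append('<').append(tag).append('>');
                    for (Node c : children) c.write(sb);
                    sb.append("</").append(tag).append('>');
                }
                cachedSubtree = sb.toString();
            }
            out.append(cachedSubtree);        // unchanged subtree: a single append
        }
    }

    public static void main(String[] args) {
        Node cell = Node.text("old");
        Node table = Node.element("table")
                .append(Node.element("tr").append(Node.element("td").append(cell)));
        StringBuilder out = new StringBuilder();
        table.write(out);                     // walks the tree and fills the caches
        cell.setText("new");                  // invalidates cell, td, tr and table
        out.setLength(0);
        table.write(out);                     // regenerates only the stale path
        System.out.println(out);
    }
}
```

This sketch keeps a string at every level, so the same text is stored once per
level of depth. A real implementation would have to be selective about which
nodes keep a cache.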

> If you look at the changes we really make, (even if we consider URL
> mapping) mostly we are changing some leaves. (copying template rows is not
> really a changing operation on the node, as you still can use the original
> html for outputting, because the original html is of course immutable)
>
> So we would have
> 1) the size of the html as memory overhead.
> 2) need to change the parsing process to keep track of the offsets
> 3) need to have a DOM implementation which has a modified flag and a
> beginning and ending offset (probably integer)

LazyDOM already does that, minus the offset stuff.
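
For comparison, the offset variant from the quoted proposal might be sketched
in Java roughly as follows. This is only an illustration under assumptions:
the class and field names are made up, the offsets are found with indexOf for
the demo (in the proposal the parser would record them), and the "regenerated"
string stands in for whatever the normal output process would produce.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetOutputSketch {
    /** Hypothetical node carrying offsets into the original HTML string. */
    static final class Node {
        final int begin;             // offset of the node's first character
        final int end;               // offset just past the node's last character
        boolean modified;            // this node itself was changed
        boolean descendantModified;  // some node below this one was changed
        String regenerated;          // markup produced for a modified node
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(int begin, int end) { this.begin = begin; this.end = end; }

        void addChild(Node child) {
            child.parent = this;
            children.add(child);
        }

        void replaceWith(String markup) {
            modified = true;
            regenerated = markup;
            for (Node p = parent; p != null; p = p.parent) {
                p.descendantModified = true;
            }
        }
    }

    static void write(String html, Node n, StringBuilder out) {
        if (n.modified) {                 // changed: use the generated markup
            out.append(n.regenerated);
            return;
        }
        if (!n.descendantModified) {      // untouched subtree: copy the original text
            out.append(html, n.begin, n.end);
            return;
        }
        int pos = n.begin;                // mixed: copy the gaps, recurse into children
        for (Node c : n.children) {
            out.append(html, pos, c.begin);
            write(html, c, out);
            pos = c.end;
        }
        out.append(html, pos, n.end);
    }

    static String render(String html, Node root) {
        StringBuilder out = new StringBuilder();
        write(html, root, out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b id=\"name\">World</b>!</p>";
        Node p = new Node(0, html.length());
        Node b = new Node(html.indexOf("<b"), html.indexOf("</b>") + "</b>".length());
        p.addChild(b);
        System.out.println(render(html, p));  // identical to the input
        b.replaceWith("<b id=\"name\">Arno</b>");
        System.out.println(render(html, p));  // only the <b> element differs
    }
}
```

The output step is simple. The sketch leaves out the hard part, which is
getting reliable offsets out of a parser that has to cope with broken HTML.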

> And we get
> 1) quite some speed in spitting out html (over the current process)

A bit, but not too much faster than the LazyDOM is my guess.

> 2) large parts of the output html will be exactly what the input (the
> original html) was.

But you spend a huge amount of time on parsing "HTML-like" stuff and forcing
it into something that resembles a DOM - that's a can of worms I definitely
don't want to open.


--
Richard Kunze

[ t]ivano Software, Bahnhofstr. 18, 63263 Neu-Isenburg
Tel.: +49 6102 80 99 07 - 0, Fax.: +49 6102 80 99 07 - 1
http://www.tivano.de, kunze@xxxxxxxxx


Thread at a glance:

Previous Message by Date:

Re: Thought on future of XMLC

David,

you would need to modify the DOM implementation such that it will be aware if there was any change which the application does to the tree. Also you would need more information from the parsing process: When the DOM is created you would need to store the beginning offset and ending offset (from the original HTML string) within the DOM node. The original HTML string must be stored in memory of course (little overhead). In the output process, for each DOM node we would need to check whether it was changed in some way by the application. If yes, produce the html from the node output as it is done now. If no, take the substring from the original html from the beginning offset to the ending offset and return that as result.

If you look at the changes we really make, (even if we consider URL mapping) mostly we are changing some leaves. (copying template rows is not really a changing operation on the node, as you still can use the original html for outputting, because the original html is of course immutable)

So we would have
1) the size of the html as memory overhead.
2) need to change the parsing process to keep track of the offsets
3) need to have a DOM implementation which has a modified flag and a beginning and ending offset (probably integer)

And we get
1) quite some speed in spitting out html (over the current process)
2) large parts of the output html will be exactly what the input (the original html) was.

-Arno

David Li wrote:

Arno,

The problem with the approach you are proposing here is that it's impossible to predict which parts of the HTML pages will be modified and which parts won't. It may be possible for small projects that only use simple get/set methods. A lot of XMLC programming is done with the DOM API, which can potentially traverse the entire page.

An alternative is possible with LazyDOM. DOM is a tree structure. At each node, we can keep a serialized string of the node as it and its subtree would look after being serialized. As LazyDOM keeps track of which nodes are modified, we can assume that a modified node's copy of the serialized string is invalid and traverse the subtree to generate the new serialized string. However, this would cause a large increase in memory usage, approximately O(filesize * height of the DOM tree * 2). For a 50 K page with 10 levels of depth, it comes out to be 2M of memory (ASCII goes Unicode in Java). Some smart pruning of the tree is necessary to reduce the memory footprint and make this a feasible solution.

David Li
---
"It spells Mac OS X but pronounces NeXTSTEP"

On Friday, Nov 29, 2002, at 05:10 Asia/Shanghai, Arno Schatz wrote:

Hi Jake,

sorry to not explain properly, I guess some others understood me only because I had mentioned this somewhere else before.

When the DOM tree is created, there are a lot of nodes which will not be changed by the program. (Mostly an application program only changes nodes which have an id attribute.) So there are whole subtrees of the created DOM tree which will never be changed by the application. Such a subtree is created from an html string (a substring of the original html page). So if you want to output such an unchanged subtree, you could output the original string from the html file.

For generating the output from the DOM, xmlc runs through the whole tree, even through these unchanged nodes, and generates the html. If it had a reference to the original html, it could output the part of the original html it was created from. The current time consumption for outputting html is quite high, as you might know. But there are other ways to speed up as well.

So the thing in question between me and Mark is whether it is better to use the original html string or the html produced by the DOM.

Is that understandable?

Arno

_______________________________________________
XMLC mailing list
XMLC@xxxxxxxxxxx
http://www.enhydra.org/mailman/listinfo.cgi/xmlc

Next Message by Date:

Re: Thought on future of XMLC

> you would need to modify the DOM implementation such that it will be aware if there was any change which the application does to the tree.

This is tracked by the LazyDOM already.

> Also you would need more information from the parsing process: When the DOM is created you would need to store the beginning offset and ending offset (from the original HTML string) within the DOM node.

There is no reason to do this. The information is already in the subtree of the node. However, using one long string and keeping an index into it may not be a bad idea. This would only increase the memory footprint by the size of file * 2 + 4 bytes * number of nodes. Hmm... looks more feasible now. This would be some optimization to the performance of LazyDOM.

Can we start establishing some benchmark numbers for DOM so we know how much we gain from using different DOM implementations?

Mark, what do you think of this?

David Li
---
"It spells Mac OS X but pronounces NeXTSTEP"
