David,
you would need to modify the DOM implementation so that it is aware of any change the
application makes to the tree. You would also need more information from the parsing
process: when the DOM is created, each DOM node has to store the beginning offset and
ending offset of its source text within the original HTML string. The original HTML
string must of course be kept in memory (little overhead). In the output process, for
each DOM node we would need to check whether it was changed in some way by the
application. If yes, produce the HTML from the node as it is done now. If no, take the
substring of the original HTML from the beginning offset to the ending offset and return
that as the result.
If you look at the changes we really make (even if we consider URL mapping), we are
mostly changing some leaves. (Copying template rows is not really a changing operation on
the node, as you can still use the original HTML for output, because the original HTML
is of course immutable.)
So we would have
1) the size of the html as memory overhead.
2) need to change the parsing process to keep track of the offsets
3) need to have a DOM implementation which has a modified flag and a beginning and ending
offset (probably integer)
And we get
1) quite some speed in spitting out html (over the current process)
2) large parts of the output html will be exactly what the input (the original
html) was.
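
A minimal Java sketch of the idea above, for illustration only. The Node class,
its fields and the render() helper are hypothetical and are not part of XMLC or
the W3C DOM API; attributes are left out, and the offsets are filled in by hand
where a real parser would record them.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetNodeSketch {
    /** Hypothetical node type for illustration; not the XMLC or W3C DOM API. */
    static class Node {
        final String tag;       // element name, or null for a text node
        String text;            // content of a text node
        final int begin, end;   // [begin, end) offsets into the original HTML
        boolean modified;       // set when this node or a descendant changes
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String tag, String text, int begin, int end) {
            this.tag = tag; this.text = text; this.begin = begin; this.end = end;
        }

        // Used while building the tree (the "parser"); does not count as a change.
        Node add(Node child) { child.parent = this; children.add(child); return this; }

        // An application change: flag this node and all of its ancestors.
        void setText(String newText) {
            text = newText;
            for (Node n = this; n != null; n = n.parent) n.modified = true;
        }

        void write(String original, StringBuilder out) {
            if (!modified) {                      // untouched subtree: copy the source text
                out.append(original, begin, end);
                return;
            }
            if (tag == null) { out.append(text); return; }
            out.append('<').append(tag).append('>');
            for (Node c : children) c.write(original, out);
            out.append("</").append(tag).append('>');
        }
    }

    static String render(Node root, String original) {
        StringBuilder out = new StringBuilder();
        root.write(original, out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<UL><li>one</li><LI >two</LI></UL>";
        Node one = new Node(null, "one", 8, 11);
        Node two = new Node(null, "two", 21, 24);
        Node ul = new Node("ul", null, 0, 34)
            .add(new Node("li", null, 4, 16).add(one))
            .add(new Node("li", null, 16, 29).add(two));
        System.out.println(render(ul, html));   // whole tree unchanged: original text
        one.setText("ONE");
        System.out.println(render(ul, html));   // second <LI > subtree is still verbatim
    }
}
```

After the change, only the path from the changed leaf up to the root is
regenerated; the untouched second list item comes out exactly as it was written
in the input.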
-Arno
David Li wrote:
Arno,
The problem with the approach you are proposing here is that it's
impossible to predict which parts of the HTML pages will be modified and
which parts won't. It may be possible for small projects that only use
simple get/set methods. A lot of XMLC programming is done with the DOM API,
which can potentially traverse the entire page.
An alternative is possible with LazyDOM. The DOM is a tree structure. At
each node, we can keep a serialized string of how the node and its
subtree would look after being serialized. As LazyDOM keeps track of
which nodes are modified, we can assume that a modified node's copy of
the serialized string is invalid and traverse the subtree to generate the new serialized
string. However, this would cause a large increase in memory usage,
approximately O(filesize * height of the DOM tree * 2). For a 50 K page
with a depth of 10 levels, that comes out to about 1M of memory (the
factor of 2 is because ASCII becomes Unicode in Java). Some smart pruning
of the tree is necessary to reduce the memory footprint and make this a
feasible solution.
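
A rough Java sketch of the per-node cache described above, for illustration
only. The Node class and the rebuilds counter are hypothetical and do not
reflect how LazyDOM is actually implemented; attributes are omitted.

```java
import java.util.ArrayList;
import java.util.List;

class CachedNodeSketch {
    static int rebuilds = 0;   // counts how many nodes were actually re-serialized

    /** Hypothetical node type for illustration; not the LazyDOM API. */
    static class Node {
        final String tag;      // element name, or null for a text node
        String text;           // content of a text node
        Node parent;
        final List<Node> children = new ArrayList<>();
        private String cached; // serialized subtree, or null if invalid

        Node(String tag, String text) { this.tag = tag; this.text = text; }

        Node add(Node child) {
            child.parent = this;
            children.add(child);
            invalidate();
            return this;
        }

        void setText(String newText) { text = newText; invalidate(); }

        // A change makes the cached string of this node and every ancestor stale.
        private void invalidate() {
            for (Node n = this; n != null; n = n.parent) n.cached = null;
        }

        String serialize() {
            if (cached == null) {
                rebuilds++;
                if (tag == null) {
                    cached = text;
                } else {
                    StringBuilder sb = new StringBuilder();
                    sb.append('<').append(tag).append('>');
                    for (Node c : children) sb.append(c.serialize());
                    sb.append("</").append(tag).append('>');
                    cached = sb.toString();
                }
            }
            return cached;     // valid cache: no traversal below this node
        }
    }

    public static void main(String[] args) {
        Node one = new Node(null, "one");
        Node ul = new Node("ul", null)
            .add(new Node("li", null).add(one))
            .add(new Node("li", null).add(new Node(null, "two")));
        System.out.println(ul.serialize());
        one.setText("ONE");    // invalidates the text node, its <li> and the <ul>
        System.out.println(ul.serialize());
        System.out.println("nodes rebuilt: " + rebuilds);
    }
}
```

Because every node keeps the text of its whole subtree, each character of the
page is stored once per level of nesting above it, which is where the
filesize * height term in the estimate comes from.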
David Li
---
"It spells Mac OS X but pronounces NeXTSTEP"
On Friday, Nov 29, 2002, at 05:10 Asia/Shanghai, Arno Schatz wrote:
Hi Jake,
sorry for not explaining properly; I guess some others understood me
only because I had mentioned this somewhere else before.
When the DOM tree is created, there are a lot of nodes which will not
be changed by the program. (Mostly an application program only changes
nodes which have an id attribute.) So there are whole subtrees of the
created DOM tree which will never be changed by the application. Such a
subtree is created from an HTML string (a substring of the original
HTML page). So if you want to output such an unchanged subtree, you
could output the original string from the HTML file. To generate
the output from the DOM, XMLC runs through the whole tree, even through
these unchanged nodes, and generates the HTML. If it had a reference to
the original HTML, it could output the part of the original HTML the
subtree was created from.
The current time consumption for outputting HTML is quite high, as you
might know. But there are other ways to speed it up as well. So the
question between me and Mark is whether it is better to use the
original HTML string or the HTML produced by the DOM.
Is that understandable?
Arno
_______________________________________________
XMLC mailing list
XMLC@xxxxxxxxxxx
http://www.enhydra.org/mailman/listinfo.cgi/xmlc