Hi David,
David Li wrote:
Arno,
Jivan looks great.
thanks, good to see xmlc developers look at it.
It takes a slightly different approach to the problem than XMLC.
As I have posted on the mailing list, I prefer a purer DOM approach
with XPath support to replace the Java interface with setText.
However, one of the usage patterns of XMLC is as follows:
interface NewsItem {
    public void setTextTitle(String text);
    public void setTextContent(String text);
}
I have used this interface quite a few times too. You could easily write a dynamic proxy which
translates a method call 'setTextTitle(..)' into 'setText("Title",...)'. Most of the work
could be done in a generic proxy, which I wanted to provide in Jivan but haven't gotten
around to writing.
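A minimal sketch of such a generic proxy, using java.lang.reflect.Proxy. The TextSetter interface and the SetTextProxy class are hypothetical names made up for this illustration; they are not part of Jivan or XMLC. The sketch assumes every page method is named setText<Id> and takes a single String.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Hypothetical target of the translated calls: anything that offers a
// generic setText(id, text) operation.
interface TextSetter {
    void setText(String id, String text);
}

// The XMLC-style page interface from the example above.
interface NewsItem {
    void setTextTitle(String text);
    void setTextContent(String text);
}

final class SetTextProxy {
    private static final String PREFIX = "setText";

    private SetTextProxy() {}

    // Returns an implementation of pageInterface whose setTextXxx(String)
    // methods are forwarded to target.setText("Xxx", text).
    static <T> T create(Class<T> pageInterface, TextSetter target) {
        InvocationHandler handler = (proxy, method, args) -> {
            String name = method.getName();
            boolean isSetter = name.startsWith(PREFIX)
                    && name.length() > PREFIX.length()
                    && method.getParameterCount() == 1
                    && method.getParameterTypes()[0] == String.class;
            if (isSetter) {
                target.setText(name.substring(PREFIX.length()), (String) args[0]);
                return null;
            }
            // Anything else (including toString/equals/hashCode) is not
            // handled in this sketch.
            throw new UnsupportedOperationException(name);
        };
        Object instance = Proxy.newProxyInstance(
                pageInterface.getClassLoader(),
                new Class<?>[] {pageInterface},
                handler);
        return pageInterface.cast(instance);
    }
}
```

With this, SetTextProxy.create(NewsItem.class, target).setTextTitle("Hello") ends up as target.setText("Title", "Hello"), so one handler serves every page interface.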
The other way, which I actually use in my applications nowadays, is the small code
generator (org.jivan.html.testDOMTest.printIdTree(...)), which produces a Java constant
for each ID in a document. I include the generated code in my program, so I only have to
write:
setText(Constants.TITLE, ..)
I have a requirement in (most) applications that all IDs are optional. If an ID is omitted
from the HTML, the program should silently skip that part.
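As an illustration of the constants approach combined with optional IDs, here is a minimal sketch on top of the plain W3C DOM. The Constants class only imitates what a generated file might look like, and OptionalIds with its setText, findById and parse helpers is a hypothetical stand-in, not Jivan's API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Imitation of generated code: one constant per id found in the template.
final class Constants {
    static final String TITLE = "Title";
    static final String CONTENT = "Content";

    private Constants() {}
}

final class OptionalIds {
    private OptionalIds() {}

    // Depth-first search for the element whose id attribute equals id;
    // returns null when the template does not contain it.
    static Element findById(Node node, String id) {
        if (node instanceof Element && id.equals(((Element) node).getAttribute("id"))) {
            return (Element) node;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            Element hit = findById(child, id);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Replaces the content of the element with the given id. If the id is
    // missing from the template, nothing happens and false is returned.
    static boolean setText(Document doc, String id, String text) {
        Element element = findById(doc, id);
        if (element == null) {
            return false;
        }
        element.setTextContent(text);
        return true;
    }

    // Demo helper: parses a well-formed XHTML snippet into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A designer can then drop the element with id "Content" from the page, and setText(doc, Constants.CONTENT, ..) simply does nothing instead of failing.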
These two techniques have the same effect, I guess, as the 'traditional' xmlc
approach.
As for performance, the major bottleneck of XMLC is actually the
serialization process. On a good-sized HTML page (120k), measuring from
parsing to output with Xerces DOM and without any DOM manipulation,
serialization takes up about 65%~70% of the time. Several
functions have been built into the XMLC serialization to deal with
the HTML spec and some specific requirements of particular output formats.
For the HTML spec part, HTML supports Unicode entity (like &#1232;)
notation to make the document output language neutral. So, if I have a
document mixing Japanese and Chinese, I am free to output in UTF8,
Shift JIS (Japanese), Big5 (Chinese) or any other encoding and the
characters should still show up in the browser correctly. Prior to JDK
1.4's NIO, there was no built-in error handling to recover from encoding
errors. Java basically prints '?' for any character it can't convert to
the output encoding. XMLC's serialization library has to build its own
table for legal encoding checking and actually does a character-by-character
check while outputting. For any character that doesn't have a
mapping, XMLC outputs the Unicode entity instead.
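The per-character fallback described above can be sketched with the JDK 1.4+ NIO charset API. This is only an illustration of the idea, not XMLC's serialization code; the EntityFallback class and its escapeUnmappable method are names invented here, and the sketch does not escape markup characters such as '<' or '&'.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class EntityFallback {
    private EntityFallback() {}

    // Returns text in which every character the charset cannot encode is
    // replaced by a decimal numeric character reference such as &#1232;.
    static String escapeUnmappable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // Work on code points so that surrogate pairs stay together.
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint);
            CharSequence unit = text.subSequence(i, i + length);
            if (encoder.canEncode(unit)) {
                out.append(unit);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += length;
        }
        return out.toString();
    }
}
```

The result contains only characters the target charset can represent, so it can be written through an ordinary Writer for that encoding without any '?' substitution.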
There are also some modifications in the serialization to deal with the
emoticons for cHTML (Compact HTML for Japanese DoCoMo phones). cHTML's
output is Shift-JIS only, and DoCoMo mandates the Unicode notation
to denote icons like smileys. These characters fall into
an unused Unicode region, but we have to support them in order to fully
support cHTML with XMLC.
Well... These are the kinds of things that need to be supported, and we
have to make trade-offs between supporting features and speed. :)
For serialization Jivan uses 'org.apache.xml.serialize.HTMLSerializer'. It checks for every
character whether it is printable in the output encoding. If it is not printable, it escapes
the character (like &#123;).
What Jivan adds is that for untouched nodes it will output the corresponding string from
the input. This means that you need to have the same encoding for the input (while
parsing) and the output (during serialisation). Jivan ensures that they are the same. I
can't imagine this is a limitation.
I found that a mid-sized HTML page (~66K) has about 350 nodes, of which about 20 are
touched by my application. So the penalty of outputting character by character is reduced
by well over an order of magnitude!
So Jivan should be able to output UTF8, Japanese and Chinese in one document correctly and
still remain speedy.
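To make the copy-through idea concrete, here is a rough sketch. It is not Jivan's implementation: the Span and CopyThrough classes are hypothetical, and the sketch assumes the parser has recorded, for each touched node, the character range it occupies in the template source.

```java
import java.util.List;

// Hypothetical record of one touched node: the range [start, end) it
// occupies in the template source and the markup that replaces it.
final class Span {
    final int start;
    final int end;
    final String replacement;

    Span(int start, int end, String replacement) {
        this.start = start;
        this.end = end;
        this.replacement = replacement;
    }
}

final class CopyThrough {
    private CopyThrough() {}

    // Serializes the page. The touched spans must be sorted by start and
    // must not overlap. Everything between them is copied from the
    // template in bulk, without looking at individual characters.
    static String serialize(String template, List<Span> touched) {
        StringBuilder out = new StringBuilder(template.length());
        int position = 0;
        for (Span span : touched) {
            out.append(template, position, span.start); // untouched: bulk copy
            out.append(span.replacement);               // touched: new output
            position = span.end;
        }
        out.append(template, position, template.length());
        return out.toString();
    }
}
```

Only the replacement strings would need the character-by-character encoding check; the bulk-copied regions already passed through the same encoding when the template was read.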
Anyway, great to see Jivan and I would love to learn more about it. I am also
working on an NIO-based XMLC serialization.
Sounds interesting. Have you looked into other HTML serialisers, like the one from Xalan?
Will this only be usable with JDK 1.4?
-Arno
David Li
On Tuesday, Sep 16, 2003, at 20:08 Asia/Tokyo, Arno Schatz wrote:
Hi,
I have been working on the Jivan project (www.jivan.org) for a while.
It is an open source project. Similar to XMLC, it can parse HTML, lets
you manipulate the HTML DOM and can serialize the result. It is
optimized for use in web applications.
There were 3 reasons to write Jivan and not use xmlc for HTML pages:
- performance: Jivan is more than 3 times faster in table replication
(see www.jivan.org for performance result)
- ease of use: Jivan doesn't compile anything and does not need to be
included in the build process (hence no ant taskdef necessary)
- dealing with invalid HTML: Jivan leaves the HTML page as it is,
except in those places you change dynamically. Jivan parses almost any
HTML without fixing it (unlike xmlc). During serialisation, for all nodes
(and subtrees) which are not touched by the programmer, Jivan will
copy the corresponding HTML string from the supplied template directly
to the output.
Jivan uses the latest from Apache.org: Xerces-2 and nekoHTML.
Similar to xmlc, you manipulate the web-page through the standard
interfaces from W3C DOM for HTML.
check it out at www.jivan.org and give me some feedback,
Arno
_______________________________________________
XMLC mailing list
XMLC@xxxxxxxxxxx
http://www.enhydra.org/mailman/listinfo.cgi/xmlc