Re: Xmlc: [Ann] Jivan 1.0 RC 1: msg#00026

java.enhydra.general

Subject: Re: Xmlc: [Ann] Jivan 1.0 RC 1

Arno,

Jivan looks great.

It takes a slightly different approach to the problem than XMLC does. As I have posted on the mailing list, I prefer a purer DOM approach, with XPath support replacing the generated Java interface with its setText methods.
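To make the DOM-plus-XPath idea concrete, here is a minimal sketch using only the JDK's standard javax.xml APIs. It is not XMLC or Jivan code: the class name XPathSetText, the helper names parse and setText, and the id-based expressions are all assumptions made for illustration, and it relies on a JDK recent enough to provide Node.setTextContent.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: locate a node by XPath and replace its text,
// instead of calling a generated setTextXxx method.
final class XPathSetText {

    // Parses a well-formed XML/XHTML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse template", e);
        }
    }

    // Sets the text of the first node matching the XPath expression.
    static void setText(Document doc, String expression, String text) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate(expression, doc, XPathConstants.NODE);
            if (node == null) {
                throw new IllegalArgumentException("No node matches " + expression);
            }
            node.setTextContent(text);
        } catch (javax.xml.xpath.XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<div><h1 id=\"title\">x</h1><p id=\"content\">y</p></div>");
        setText(doc, "//*[@id='title']", "Jivan 1.0 RC 1");
        setText(doc, "//*[@id='content']", "Release candidate announced.");
        System.out.println(doc.getElementsByTagName("h1").item(0).getTextContent());
    }
}
```

The trade-off in this style is that a mistyped expression only fails at run time, whereas a generated interface fails at compile time.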

However, one of the usage patterns of XMLC is as follows:

interface NewsItem {
    public void setTextTitle(String text);
    public void setTextContent(String text);
}

One could use XMLC to generate Java objects from HTML/WML/cHTML that implement this interface. This gives the Java program a presentation-neutral way to deal with the output, while the decision of which DOM to load is made by a lower layer.
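As a rough sketch of that pattern: the two implementing classes below are hand-written stand-ins for what XMLC would generate from an HTML and a WML template, and NewsItemDemo, load and fill are hypothetical names. The markup they emit is simplified and unescaped, purely to show that the application code touches only the interface.

```java
// Presentation-neutral contract, matching the NewsItem interface in this message.
interface NewsItem {
    void setTextTitle(String text);
    void setTextContent(String text);
}

// Stand-in for a class generated from an HTML template (assumption).
final class HtmlNewsItem implements NewsItem {
    private String title = "";
    private String content = "";
    public void setTextTitle(String text) { title = text; }
    public void setTextContent(String text) { content = text; }
    @Override public String toString() {
        return "<h1>" + title + "</h1><p>" + content + "</p>";
    }
}

// Stand-in for a class generated from a WML template (assumption).
final class WmlNewsItem implements NewsItem {
    private String title = "";
    private String content = "";
    public void setTextTitle(String text) { title = text; }
    public void setTextContent(String text) { content = text; }
    @Override public String toString() {
        return "<card title=\"" + title + "\"><p>" + content + "</p></card>";
    }
}

final class NewsItemDemo {
    // Lower layer: decides which template-backed object to load.
    static NewsItem load(String format) {
        return "wml".equals(format) ? new WmlNewsItem() : new HtmlNewsItem();
    }

    // Application layer: knows nothing about the output format.
    static void fill(NewsItem item) {
        item.setTextTitle("Jivan 1.0 RC 1");
        item.setTextContent("Release candidate announced.");
    }

    public static void main(String[] args) {
        for (String format : new String[] {"html", "wml"}) {
            NewsItem item = load(format);
            fill(item);
            System.out.println(item);
        }
    }
}
```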

As for performance, the major bottleneck of XMLC is actually the serialization process. On a good-sized HTML page (120k), measuring from parsing to output without DOM manipulation and using the Xerces DOM, serialization takes up about 65% to 70% of the time. Several functions have been built into the XMLC serialization to deal with the HTML spec and with specific requirements of particular output formats.

For the HTML spec part: HTML supports numeric character references (like &#1232;, which renders as Ӑ) to make the document output encoding-neutral. So, if I have a document mixing Japanese and Chinese, I am free to output it in UTF-8, Shift-JIS (Japanese), Big5 (Chinese) or any other encoding, and the characters should still show up correctly in the browser. Prior to JDK 1.4's NIO, there was no built-in exception handling to recover from an encoding error; Java basically printed '?' for any character it couldn't convert to the output encoding. XMLC's serialization library therefore has to build its own table of legal characters per encoding and check character by character while outputting. For any character that doesn't have a mapping, XMLC outputs the numeric character reference instead.
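The character-by-character fallback can be sketched with the NIO charset API that arrived in JDK 1.4. This is not XMLC's serializer: XMLC uses its own lookup tables, whereas this sketch swaps in CharsetEncoder.canEncode for the check. EntityFallback and escapeUnmappable are hypothetical names, and the method deliberately leaves '&' and '<' alone, so it is not a complete HTML escaper.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper: replace characters the target encoding cannot
// represent with decimal numeric character references.
final class EntityFallback {

    static String escapeUnmappable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Work on whole code points so surrogate pairs stay together.
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint);
            if (encoder.canEncode(text.subSequence(i, i + length))) {
                out.appendCodePoint(codePoint);
            } else {
                // No mapping in the output encoding: emit &#NNNN; instead of '?'.
                out.append("&#").append(codePoint).append(';');
            }
            i += length;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Charset latin1 = Charset.forName("ISO-8859-1");
        // U+00E9 exists in Latin-1; U+4E2D does not and becomes a reference.
        System.out.println(escapeUnmappable("caf\u00e9 \u4e2d", latin1));
    }
}
```

The per-character canEncode call is simple but not fast, which fits the point above that this kind of checking is where serialization time goes.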

There are also some modifications in the serialization to deal with the emoticons for cHTML (Compact HTML for Japanese DoCoMo phones). cHTML output is Shift-JIS only, and DoCoMo mandates using the Unicode notation to denote icons such as smileys. These characters fall into an unused Unicode region, but we have to support them in order to fully support cHTML with XMLC.

Well... These are the kinds of things that need to be supported, and we have to make trade-offs between supporting features and speed. :)

Anyway, it is great to see Jivan, and I would love to learn more about it. I am also working on an NIO-based XMLC serialization.

David Li

On Tuesday, Sep 16, 2003, at 20:08 Asia/Tokyo, Arno Schatz wrote:

Hi,

I have been working on the Jivan Project (www.jivan.org) for a while. It is an open source project. Similar to XMLC, it can parse HTML, lets you manipulate the HTML DOM and can serialize the result. It is optimized for use in web applications.

There were 3 reasons to write Jivan and not use xmlc for HTML pages:
- performance: Jivan is more than 3 times faster in table replication (see www.jivan.org for performance results)
- ease of use: Jivan doesn't compile anything and does not need to be included in the build process (hence no ant taskdef necessary)
- dealing with invalid HTML: Jivan leaves the HTML page as it is, except in those places you change dynamically. Jivan parses almost any HTML without fixing it (unlike xmlc). During serialisation, for all nodes (and subtrees) that are not touched by the programmer, Jivan copies the corresponding HTML string from the supplied template directly to the output.


Jivan uses the latest from Apache.org: Xerces-2 and nekoHTML.

Similar to xmlc, you manipulate the web-page through the standard interfaces from W3C DOM for HTML.

Check it out at www.jivan.org and give me some feedback,
Arno


_______________________________________________
XMLC mailing list
XMLC@xxxxxxxxxxx
http://www.enhydra.org/mailman/listinfo.cgi/xmlc

