
Re: Re: Xmlc: [Ann] Jivan 1.0 RC 1: msg#00028

java.enhydra.general

Subject: Re: Re: Xmlc: [Ann] Jivan 1.0 RC 1

Hi David,

David Li wrote:

Arno,

Jivan looks great.

thanks, good to see xmlc developers look at it.

It takes a slightly different approach to the problem than XMLC. As I have posted on the mailing list, I prefer a purer DOM approach with XPath support to replace the Java interface with setText.

However, one of the usage patterns of XMLC is as follows:

interface NewsItem {
    // Each setter fills the element whose id matches the method suffix.
    void setTextTitle(String text);
    void setTextContent(String text);
}

I used this interface quite a few times too. You could easily write a dynamic proxy which translates a method call 'setTextTitle(..)' into 'setText("Title",...)'. Most of the work could be done in a generic proxy, which I wanted to provide in Jivan but haven't gotten around to writing.
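A minimal sketch of such a generic proxy, built on java.lang.reflect.Proxy and written for modern Java (8 or later). It is not part of Jivan: TextSetter and SetTextProxy are hypothetical names, and TextSetter merely stands in for whatever object really performs setText(id, text).

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Hypothetical stand-in for the object that actually performs setText(id, text).
interface TextSetter {
    void setText(String id, String text);
}

// XMLC-style interface, repeated here so the sketch is self-contained.
interface NewsItem {
    void setTextTitle(String text);
    void setTextContent(String text);
}

final class SetTextProxy {
    private static final String PREFIX = "setText";

    private SetTextProxy() {}

    // Returns a proxy that maps setTextXxx(value) onto target.setText("Xxx", value).
    static <T> T create(Class<T> type, TextSetter target) {
        InvocationHandler handler = (proxy, method, args) -> {
            // toString/hashCode/equals are passed through to the target object.
            if (method.getDeclaringClass() == Object.class) {
                return method.invoke(target, args);
            }
            String name = method.getName();
            if (name.startsWith(PREFIX) && name.length() > PREFIX.length()
                    && args != null && args.length == 1) {
                String id = name.substring(PREFIX.length());
                target.setText(id, String.valueOf(args[0]));
                return null;
            }
            throw new UnsupportedOperationException(name);
        };
        return type.cast(Proxy.newProxyInstance(
                type.getClassLoader(), new Class<?>[] {type}, handler));
    }
}
```

With this in place, SetTextProxy.create(NewsItem.class, setter).setTextTitle("Hello") ends up as setter.setText("Title", "Hello"), so one handler serves every interface that follows the naming convention.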
The other way, which I actually use in my applications nowadays, is the small code generator (org.jivan.html.testDOMTest.printIdTree(...)), which produces a Java constant for each ID in a document. I include the generated code in my program, so I only have to write:

setText(Constants.TITLE, ..)

I have a requirement in (most) applications that any ID is optional. If it is omitted in the HTML, the program should silently skip that part.

These two techniques have the same effect, I guess, as the 'traditional' xmlc
approach.
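The constants-plus-optional-ID idea can be sketched with the plain W3C DOM API as follows. This is an illustration under assumptions, not Jivan's code: OptionalIds, its constants and sampleDocument() are hypothetical, and the sample document marks its id attribute explicitly because a DOM built by hand has no DTD to do so.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class OptionalIds {
    // Hypothetical generated constants, one per id found in the template.
    static final String TITLE = "Title";
    static final String CONTENT = "Content";

    private OptionalIds() {}

    // Sets the text of the element with the given id.
    // Returns false and does nothing when the template has no such id.
    static boolean setText(Document doc, String id, String text) {
        Element element = doc.getElementById(id);
        if (element == null) {
            return false; // id was omitted in the HTML: skip silently
        }
        element.setTextContent(text);
        return true;
    }

    // Builds a tiny document that contains a Title element but no Content element.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            doc.appendChild(html);
            Element h1 = doc.createElement("h1");
            html.appendChild(h1);
            h1.setAttribute("id", TITLE);
            h1.setIdAttribute("id", true); // make getElementById find it
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling OptionalIds.setText(doc, OptionalIds.CONTENT, "...") on the sample document simply returns false, which is the "silently skip" behaviour described above.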

As for performance, the major bottleneck of XMLC is actually the serialization process. On a good-sized HTML page (120k), measured from parsing to output with Xerces DOM and no DOM manipulation, serialization takes about 65% to 70% of the time. Several functions have been built into the XMLC serialization to deal with the HTML spec and with specific requirements of particular output formats.

For the HTML spec part, HTML supports Unicode entity notation (like &#1232;) to make the document output language-neutral. So, if I have a document mixing Japanese and Chinese, I am free to output in UTF-8, Shift JIS (Japanese), Big5 (Chinese) or any other encoding, and the characters should still show up correctly in the browser. Prior to JDK 1.4's NIO, there was no built-in exception handling to recover from encoding errors. Java basically prints '?' for any character it cannot convert to the output encoding. XMLC's serialization library has to build its own table for legal encoding checking and actually does a character-by-character check while outputting. For any character that doesn't have a mapping, XMLC outputs the Unicode entity instead.

There are also some modifications in the serialization to deal with the emoticons for cHTML (Compact HTML for Japanese DoCoMo phones). cHTML's output is Shift-JIS only, and DoCoMo mandates the Unicode notation to denote icons such as smileys. These characters fall into the unused Unicode region, but we have to support them in order to fully support cHTML with XMLC.

Well... These are the kinds of things that need to be supported, and we have to make a trade-off between supporting features and speed. :)

For serialization Jivan uses 'org.apache.xml.serialize.HTMLSerializer'. It checks every character to see whether it is printable in the output encoding. If it is not printable, it escapes the character (like &#123;).
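The per-character check can be illustrated with java.nio.charset.CharsetEncoder, which is available from JDK 1.4 on. The sketch below is neither the XMLC nor the Xerces implementation; EntityEscaper and escapeUnencodable are hypothetical names, and the code assumes modern Java.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class EntityEscaper {
    private EntityEscaper() {}

    // Replaces every code point the charset cannot encode with a
    // decimal numeric character reference such as &#12354;.
    static String escapeUnencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint); // 2 for surrogate pairs
            if (encoder.canEncode(text.subSequence(i, i + length))) {
                out.append(text, i, i + length);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += length;
        }
        return out.toString();
    }
}
```

For example, escaping "a\u3042b" for US-ASCII gives "a&#12354;b", while the same string passes through unchanged for UTF-8.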

What Jivan adds is that for untouched nodes it will output the corresponding string from the input. This means that you need to have the same encoding for the input (while parsing) and the output (during serialisation). Jivan ensures that they are the same. I can't imagine this is a limitation.

I found that a mid-sized HTML page (~66K) has about 350 nodes, 20 of which are touched by my application. So the penalty of outputting character by character is reduced by well over an order of magnitude!
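The pass-through idea can be sketched roughly as follows. This is a simplified model and not Jivan's actual data structure: the template is assumed to be already split into segments, PassThroughTemplate and its methods are hypothetical, and only touched segments go through the per-character escaping loop.

```java
import java.util.ArrayList;
import java.util.List;

final class PassThroughTemplate {
    // One slice of the template: its original source plus an optional new text.
    private static final class Segment {
        final String source;
        String replacement; // null means "untouched"

        Segment(String source) {
            this.source = source;
        }
    }

    private final List<Segment> segments = new ArrayList<>();

    // Adds a slice of the original template and returns its index.
    int add(String source) {
        segments.add(new Segment(source));
        return segments.size() - 1;
    }

    // Marks a slice as touched by giving it new text content.
    void setText(int index, String text) {
        segments.get(index).replacement = text;
    }

    String serialize() {
        StringBuilder out = new StringBuilder();
        for (Segment segment : segments) {
            if (segment.replacement == null) {
                out.append(segment.source); // verbatim copy of the input string
            } else {
                appendEscaped(out, segment.replacement); // per-character work
            }
        }
        return out.toString();
    }

    private static void appendEscaped(StringBuilder out, String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default: out.append(c);
            }
        }
    }
}
```

Because untouched slices are copied as strings, even markup that is not strictly valid survives unchanged, and the cost of escaping grows with the number of touched nodes rather than with the page size.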

So Jivan should be able to output UTF-8, Japanese and Chinese in one document correctly and still remain speedy.


Anyway, great to see Jivan, and I would love to learn more about it. I am also working on an NIO-based XMLC serialization.

Sounds interesting, have you looked into other HTML serialisers, like the one from Xalan? Will this be only usable in JDK 1.4?

-Arno


David Li

On Tuesday, Sep 16, 2003, at 20:08 Asia/Tokyo, Arno Schatz wrote:

Hi,

I have been working on the Jivan Project (www.jivan.org) for a while. It is an open source project. Similar to XMLC, it can parse HTML, lets you manipulate the HTML DOM and can serialize the result. It is optimized for use in web applications.

There were 3 reasons to write Jivan and not use xmlc for HTML pages:
- performance: Jivan is more than 3 times faster in table replication (see www.jivan.org for performance results)
- ease of use: Jivan doesn't compile anything and does not need to be included in the build process (hence no ant taskdef necessary)
- dealing with invalid HTML: Jivan leaves the HTML page as it is, except in those places you change dynamically. Jivan parses almost any HTML without fixing it (unlike xmlc). During serialisation, for all nodes (and subtrees) which are not touched by the programmer, Jivan copies the corresponding HTML string from the supplied template directly to the output.


Jivan uses the latest from Apache.org: Xerces-2 and nekoHTML.

Similar to xmlc, you manipulate the web-page through the standard interfaces from W3C DOM for HTML.

check it out at www.jivan.org and give me some feedback,
Arno


_______________________________________________
XMLC mailing list
XMLC@xxxxxxxxxxx
http://www.enhydra.org/mailman/listinfo.cgi/xmlc


_______________________________________________
Enhydra mailing list
Enhydra@xxxxxxxxxxx
http://www.enhydra.org/mailman/listinfo.cgi/enhydra


