
Re: parsing fragments of a larger file: msg#00224

Subject: Re: parsing fragments of a larger file
On Fri, 2003-08-29 at 02:52, Daniel Veillard wrote:
> On Thu, Aug 28, 2003 at 07:16:54PM -0700, Patrick wrote:
> > Hello.
> > I've been searching through the documentation and archives for some time
> > now and I'm finding it a little hard to get a cohesive picture of what
> > is possible with libxml. I am trying to accomplish the following:
> > 
> > - Pass through the entire XML document, recording the offset (and possibly
> > the line, column pair) of each element within each file and, where the
> > document is malformed, retrieving the malformed portion as text. This seems
> > fairly easy to do with the library but I still have two questions with
> > regard to this:
> 
>   I think what you're asking is not realistic with respect to the
> XML specification:
>      http://www.w3.org/TR/REC-xml#dt-fatal
> 
> "[Definition: An error which a conforming XML processor must detect
>   and report to the application. After encountering a fatal error, the
>   processor may continue processing the data to search for further errors
>   and may report such errors to the application. In order to support
>   correction of errors, the processor may make unprocessed data from the
>   document (with intermingled character data and markup) available to
>   the application. Once a fatal error is detected, however, the processor
>   must not continue normal processing (i.e., it must not continue to pass
>   character data and information about the document's logical structure
>   to the application in the normal way).]"
> 
> It's very clear. You cannot get an XML parser to "recover" from a
> well-formedness error. Either something is XML or it is not, and the kind
> of processing you're asking for is clearly special-cased by the normative
> wording in the spec.

OK, I have no problem with the spec enforcing the definition of
well-formedness. In fact I think it's good. However, my application has
particular needs, and they are ones which the spec writers might appreciate
regardless of their wording. I'm designing an XML editor of sorts which
tries to cope with malformed documents so that they can be repaired and
made well-formed again.

> > (1) The documentation for xmlParserNodeInfo says the following: "The
> > parser can be asked to collect Node informations, i.e. at what place in
> > the file they were detected. NOTE: This is off by default and not very
> > well tested." Is this still true? Can I rely on this working?
> 
>   This is still true. There is no guarantee. You can get line numbers
> from the parser context ctxt->input.
>
> > (2) The Parser portion of the library is the quickest, least memory
> > intensive way to parse the document right?
> 
>   A parser is a parser is a parser. You can process stuff faster
> by sending it to /dev/null; libxml follows the specs and operates as
> recommended.
>   You seem to be in the process of "quick recovery of badly formed
> data" and this is not what the XML spec was designed for nor what the
> library is aiming at. You may have trouble: your project description
> immediately puts you in a grey area where you may have a very hard time
> finding software to build upon because you're operating outside the
> boundaries of the XML specification.

I'm satisfied that libxml could handle the bulk of my XML processing
needs, so my efficiency question was only about which of the libxml
modules to use (Parser, SAX, XmlReader, etc.), not about the fastest way
to parse XML in general. The context of my original question is that I
hope to build some indices with information from the initial scan of the
document to permit random access later.
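
To make the index idea concrete, here is a rough sketch of the kind of
single-pass scan I mean. It is not libxml: it uses Python's standard
expat binding purely as an illustration, and the function name
index_elements and the tuple layout are made up for the example. It
records the byte offset, line and column of every start tag and, when
the parser hits a fatal error, hands back the error position together
with the unprocessed remainder of the input as raw bytes.

```python
import xml.parsers.expat


def index_elements(data: bytes):
    """Scan `data` once and return (index, error).

    index: list of (name, byte_offset, line, column), one per start tag.
    error: None if the input is well-formed, otherwise a tuple
           (message, byte_offset, line, column, remainder).
    Both names and the tuple layout are hypothetical, for illustration.
    """
    parser = xml.parsers.expat.ParserCreate()
    index = []

    def on_start(name, attrs):
        # Position attributes refer to the start of the current event,
        # i.e. the start tag that triggered this callback.
        index.append((name,
                      parser.CurrentByteIndex,
                      parser.CurrentLineNumber,
                      parser.CurrentColumnNumber))

    parser.StartElementHandler = on_start

    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as exc:
        # Parsing stops at the first fatal error; the entries collected
        # so far remain valid, and the rest is returned untouched.
        offset = parser.ErrorByteIndex
        return index, (str(exc), offset, exc.lineno, exc.offset,
                       data[offset:])
    return index, None
```

Calling index_elements(b"<a><b/><c></a>") should give an index for a, b
and c plus an error tuple describing the mismatched end tag. An editor
could keep the index for random access and show the remainder to the
user for repair. Whether the same information can be coaxed out of
libxml is exactly what I am asking.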

Do you know of any other XML libraries which are designed to fail softly
and provide enough meta information that they can be used intelligently
by an editor? If not, would you advise me to write my own library rather
than try to coax information out of libxml which it doesn't want to
give? I mean, is it worth trying to figure out a way to make it work in
libxml by extending the API or data structures somewhere or should I
make a custom library suited for my particular purpose?

Thanks,
Patrick

