
Re: lhtml: msg#00064

Subject: Re: lhtml

Ian Bicking wrote:
> Stefan Behnel wrote:
>> Ian Bicking wrote:
>>> Stefan Behnel wrote:
>>> It relies on a different parser from lxml.etree.HTML, and I would guess
>>> that elements created with etree.Element wouldn't necessarily use the
>>> right class.
>>
>> objectify replicates the XML() and Element() factories for exactly this
>> purpose. lxml.html could do likewise.
> 
> Sure.  Presumably at least a parser would be in there (HTML()).  I
> suppose no reason Element can't be too.
> 
> How does this interact with XSLT translations?  When you translate a
> document, it keeps the parser and hence the custom classes?

Exactly. It Does What You'd Expect(TM).
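
To make the factory idea quoted above concrete, here is a minimal sketch of
module-level HTML() and Element() functions bound to one parser, so that both
produce the same custom class. The HtmlElement class and the two wrappers are
hypothetical names for illustration, not an existing lxml.html API, and the
sketch does not cover the XSLT case.

```python
from lxml import etree


class HtmlElement(etree.ElementBase):
    """Hypothetical custom element class for HTML trees."""

    def text_length(self):
        # Small convenience method, just to show the class is in use.
        return len(self.text or "")


# One parser carries the class lookup; both factories go through it.
_lookup = etree.ElementDefaultClassLookup(element=HtmlElement)
_parser = etree.HTMLParser()
_parser.set_element_class_lookup(_lookup)


def HTML(text):
    """Parse a complete HTML document with the custom-class parser."""
    return etree.HTML(text, _parser)


def Element(tag, **attrib):
    """Create a standalone element that uses the same custom class."""
    return _parser.makeelement(tag, attrib)


root = HTML("<p>hi</p>")
para = Element("p", id="intro")
print(type(root).__name__, type(para).__name__)
```

Routing Element() through the parser's makeelement() is what keeps elements
created by hand consistent with elements that came out of parsing.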


> It would make me more comfortable if at least it was a separate module.
>  So there'd be an lxml.xmldoctest module, and an lxml.usexmldoctest

And since lxml.usexmldoctest and lxml.usehtmldoctest would be the ones you'd
import, the xmldoctest would just be the implementation detail in the 
background.


> There's also some ambiguity between HTML and XML.  When do you parse
> something as HTML, and when only as XML?  It depends on the doctest. You
> can kind of tell by looking for <html>, but I actually spend more time
> looking at HTML snippets than documents when doing testing.

Doing that on trees is possible, but when comparing serialised HT/XML, you've
already lost the information about how it was parsed. So there's no reason why
we shouldn't have two modules that can be imported.
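
As a rough illustration of what the XML side of such a module might do, here
is a sketch of a doctest output checker that compares parsed trees instead of
raw strings. It uses only the standard library (xml.etree rather than lxml) to
stay self-contained; XMLOutputChecker and _same_tree are hypothetical names,
and ignoring surrounding whitespace is an assumption about what "equal" should
mean.

```python
import doctest
import xml.etree.ElementTree as ET


def _same_tree(a, b):
    """Compare two elements, ignoring attribute order and edge whitespace."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib
        and (a.text or "").strip() == (b.text or "").strip()
        and (a.tail or "").strip() == (b.tail or "").strip()
        and len(a) == len(b)
        and all(_same_tree(x, y) for x, y in zip(a, b))
    )


class XMLOutputChecker(doctest.OutputChecker):
    """Hypothetical checker: XML output matches if the parsed trees match."""

    def check_output(self, want, got, optionflags):
        try:
            return _same_tree(ET.fromstring(want), ET.fromstring(got))
        except ET.ParseError:
            # Not XML on one side: fall back to the normal string comparison.
            return super().check_output(want, got, optionflags)


checker = XMLOutputChecker()
```

A checker like this can be handed to doctest.DocTestRunner(checker=...); how
an import would install it automatically is a separate question.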


> With enough work it would probably be possible to use that import to
> selectively activate the checker only during the doctest it was imported
> into.  That would be ideal to me.  Then you could use that to indicate
> if you prefer HTML or XML parsing your checking.  I generally like
> doctests to be standalone, so being able to enable your preferred
> checker directly in the doctest would certainly be nice.

That would be great.

Another point: how do we deal with doctests that mix XML and HTML? That's
likely rarer, so we could still provide some kind of "use()" function in
the two modules that lets you switch between the two if you really need to,
but that would not have to be called if you just import one of them.
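
A very rough sketch of such a switch, using only the standard library. The
names use(), same_markup() and the two normalisers are made up for this
example, and the notion of "HTML equality" used here (case-insensitive tag
names, whitespace-insensitive text) is an assumption, not a decided design.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _HTMLEvents(HTMLParser):
    """Collect a normalised event stream; HTMLParser lowercases tag names."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("data", " ".join(data.split())))


def _normalise_html(text):
    collector = _HTMLEvents()
    collector.feed(text)
    collector.close()
    return collector.events


def _normalise_xml(text):
    # Canonical XML: sorted attributes, explicit end tags (Python 3.8+).
    return ET.canonicalize(text, strip_text=True)


_MODES = {"xml": _normalise_xml, "html": _normalise_html}
_current = "xml"


def use(mode):
    """Switch the comparison mode for subsequent checks."""
    global _current
    if mode not in _MODES:
        raise ValueError("unknown mode: %r" % (mode,))
    _current = mode


def same_markup(want, got):
    normalise = _MODES[_current]
    return normalise(want) == normalise(got)
```

A doctest that mixes both kinds of output would call use("html") or
use("xml") before the examples that need it; a doctest that sticks to one
kind would never call it.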


>>> It should include parsing HTML fragments too, which
>>> is a little hard (HTML() interprets all text as complete documents, and
>>> adds in elements to make the document valid, which often isn't what
>>> you'd want).
>>
>> Maybe a simple approach here would be to check if a string starts with a
>> known inner HTML tag, then just prefix it with <html><body> before parsing
>> and return their child (or children) after parsing.
> 
> I'm comfortable (probably more comfortable) with different parsing
> functions.  I imagine parse, parse_fragment, and parse_element.  parse
> is like HTML(), parse_fragment returns a list of elements, parse_element
> only returns a single element (and an exception if you give it a
> document with multiple elements).  Leading text for parse_fragment is a
> little awkward.

Sure, sounds reasonable.
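
For illustration, the three functions could be sketched on top of etree.HTML()
roughly like this. The names follow the proposal quoted above; the wrapping in
<html><body>, the ValueError and the silent dropping of leading text are
assumptions made for this sketch.

```python
from lxml import etree


def parse(html):
    """Parse a complete document, like HTML()."""
    return etree.HTML(html)


def parse_fragment(html):
    """Return the top-level elements of an HTML snippet as a list."""
    # Wrap the snippet ourselves so the parser has no structure to invent.
    doc = etree.HTML("<html><body>%s</body></html>" % html)
    body = doc.find("body")
    # Leading text ends up in body.text and is not returned here.
    return list(body)


def parse_element(html):
    """Return the single top-level element of a snippet, or raise."""
    elements = parse_fragment(html)
    if len(elements) != 1:
        raise ValueError("expected exactly one element, got %d" % len(elements))
    return elements[0]


paragraphs = parse_fragment("<p>one</p><p>two</p>")
bold = parse_element("<b>x</b>")
print([p.text for p in paragraphs], bold.tag)
```

Note that the returned elements still sit inside the artificial <body>, so
getparent() on them returns that wrapper.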


> In addition to returning the children, I'd like to break the reference
> to the artificial parent that was added in.  You can get at the parent
> with many kinds of queries, which can be confusing.

That's harder, though. Once they are in the document, it's hard to change the
root node from Python code. Maybe we can come up with a solution that allows
us to hide that in the parse functions.
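
One workaround that can be hidden inside a parse function is to hand back a
deep copy of the element, which lives in a document of its own. This is only a
sketch of a workaround, not a real solution to the re-rooting problem: it
copies the whole subtree, and parse_element_detached is a hypothetical name.

```python
import copy

from lxml import etree


def parse_element_detached(html):
    """Parse a one-element snippet and return it without the wrapper parent."""
    doc = etree.HTML("<html><body>%s</body></html>" % html)
    children = list(doc.find("body"))
    if len(children) != 1:
        raise ValueError("expected exactly one element, got %d" % len(children))
    # deepcopy() creates a new tree rooted at the copied element, so the
    # artificial <html>/<body> ancestors are no longer reachable from it.
    return copy.deepcopy(children[0])


div = parse_element_detached("<div><p>hi</p></div>")
print(div.getparent(), div.xpath("ancestor::*"))
```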


>> Fredrik wrote a nice factory class for generating (X|HT)ML a while ago, I
>> felt free to add it as "lxml.htmlbuilder" (although I'm still waiting for
>> his reply to see if it can stay there to become part of lxml 1.3). But the
>> other API side of parsing and treating HTML documents in a convenient way
>> is much more ambitious.
> 
> How are attributes handled in his version?  That's always the place
> where opinions vary on builders.

Keyword arguments. See

http://online.effbot.org/2006_11_01_archive.htm#et-builder
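
A short sketch of that style, using the E factory from lxml.builder as it
exists in current lxml releases (the module name differs from the
"lxml.htmlbuilder" mentioned above, and the tags and attribute values here are
made up for the example):

```python
from lxml import etree
from lxml.builder import E

# Positional arguments become text or children, keyword arguments attributes.
link = E.a("effbot", href="http://online.effbot.org/", title="builder notes")

# "class" is a reserved word in Python, so it is passed in a dict instead.
para = E.p("See ", link, {"class": "note"})

page = E.html(E.body(para))
print(etree.tostring(page))
```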

Stefan

