Author: scoder
Date: Mon May 7 23:35:00 2007
New Revision: 42840
Modified:
lxml/trunk/doc/parsing.txt
lxml/trunk/doc/validation.txt
lxml/trunk/src/lxml/parser.pxi
Log:
clarifications in parser docs
Modified: lxml/trunk/doc/parsing.txt
==============================================================================
--- lxml/trunk/doc/parsing.txt (original)
+++ lxml/trunk/doc/parsing.txt Mon May 7 23:35:00 2007
@@ -18,26 +18,56 @@
Parsers
--------
+=======
Parsers are represented by parser objects. There is support for parsing both
-XML and (broken) HTML (note that XHTML is best parsed as XML). Both are based
-on libxml2 and therefore only support options that are backed by the library.
-Parsers take a number of keyword arguments. The following is an example for
-namespace cleanup during parsing, first with the default parser, then with a
-parametrized one::
+XML and (broken) HTML. Note that XHTML is best parsed as XML; parsing it with
+the HTML parser can lead to unexpected results. Here is a simple example for
+XML parsing::
>>> xml = '<a xmlns="test"><b xmlns="test"/></a>'
- >>> et = etree.parse(StringIO(xml))
+ >>> et = etree.parse(StringIO(xml))
>>> print etree.tostring(et.getroot())
<a xmlns="test"><b xmlns="test"/></a>
+
+Parser options
+--------------
+
+The parsers accept a number of setup options as keyword arguments. The above
+example is easily extended to clean up namespaces during parsing::
+
>>> parser = etree.XMLParser(ns_clean=True)
>>> et = etree.parse(StringIO(xml), parser)
>>> print etree.tostring(et.getroot())
<a xmlns="test"><b/></a>
+The keyword arguments in the constructor are mainly based on the libxml2
+parser configuration. A DTD will also be loaded if validation or attribute
+default values are requested.
+
+Available boolean keyword arguments:
+
+* attribute_defaults - read the DTD (if referenced by the document) and add
+ the default attributes from it
+
+* dtd_validation - validate while parsing (if a DTD was referenced)
+
+* load_dtd - load and parse the DTD while parsing (no validation is performed)
+
+* no_network - prevent network access when looking up external documents
+
+* ns_clean - try to clean up redundant namespace declarations
+
+* recover - try hard to parse through broken XML
+
+* remove_blank_text - discard blank text nodes between tags
+
+
+Parsing HTML
+------------
+
HTML parsing is similarly simple. The parsers have a ``recover`` keyword
argument that the HTMLParser sets by default. It lets libxml2 try its best to
return something usable without raising an exception. You should use libxml2
@@ -48,15 +78,29 @@
>>> parser = etree.HTMLParser()
>>> et = etree.parse(StringIO(broken_html), parser)
- >>> print etree.tostring(et.getroot())
- <html><head><title>test</title></head><body><h1>page title</h1></body></html>
+ >>> print etree.tostring(et.getroot(), pretty_print=True)
+ <html>
+ <head>
+ <title>test</title>
+ </head>
+ <body>
+ <h1>page title</h1>
+ </body>
+ </html>
Lxml has an HTML function, similar to the XML shortcut known from
ElementTree::
>>> html = etree.HTML(broken_html)
- >>> print etree.tostring(html)
- <html><head><title>test</title></head><body><h1>page title</h1></body></html>
+ >>> print etree.tostring(html, pretty_print=True)
+ <html>
+ <head>
+ <title>test</title>
+ </head>
+ <body>
+ <h1>page title</h1>
+ </body>
+ </html>
The support for parsing broken HTML depends entirely on libxml2's recovery
algorithm. It is *not* the fault of lxml if you find documents that are so
@@ -66,6 +110,10 @@
parsing. Especially misplaced meta tags can suffer from this, which may lead
to encoding problems.
+
+Doctype information
+-------------------
+
The use of the libxml2 parsers makes some additional information available at
the API level. Currently, ElementTree objects can access the DOCTYPE
information provided by a parsed document, as well as the XML version and the
@@ -93,7 +141,7 @@
iterparse and iterwalk
-----------------------
+======================
As known from ElementTree, the ``iterparse()`` utility function returns an
iterator that generates parser events for an XML file (or file-like object),
@@ -125,7 +173,7 @@
>>> context.root.tag
'root'
-The other types can be activated with the ``events`` keyword argument::
+The other event types can be activated with the ``events`` keyword argument::
>>> events = ("start", "end")
>>> context = etree.iterparse(StringIO(xml), events=events)
@@ -140,6 +188,32 @@
end {testns}empty-element
end root
+
+Selective tag events
+--------------------
+
+As an extension over ElementTree, lxml.etree accepts a ``tag`` keyword
+argument just like ``element.getiterator(tag)``. This restricts events to a
+specific tag or namespace::
+
+ >>> context = etree.iterparse(StringIO(xml), tag="element")
+ >>> for action, elem in context:
+ ... print action, elem.tag
+ end element
+ end element
+
+ >>> events = ("start", "end")
+ >>> context = etree.iterparse(
+ ... StringIO(xml), events=events, tag="{testns}*")
+ >>> for action, elem in context:
+ ... print action, elem.tag
+ start {testns}empty-element
+ end {testns}empty-element
+
+
+Modifying the tree
+------------------
+
You can modify the element and its descendants when handling the 'end' event.
To save memory, for example, you can remove subtrees that are no longer
needed::
@@ -170,11 +244,12 @@
... if element.getprevious(): # clean up preceding siblings
... del element.getparent()[0]
-You can use ``while`` instead of ``if`` if you skipped siblings using the
-``tag`` keyword argument. The more selective your tag is, however, the more
-thought you will have to put into finding the right way to clean up the
-elements that were skipped. Therefore, it is sometimes easier to traverse all
-elements and do the tag selection by hand in the event handler code.
+You can use ``while`` instead of the ``if`` to delete multiple siblings in a
+row if you skipped over them using the ``tag`` keyword argument. The more
+selective your tag is, however, the more thought you will have to put into
+finding the right way to clean up the elements that were skipped. Therefore,
+it is sometimes easier to traverse all elements and do the tag selection by
+hand in the event handler code.
The 'start-ns' and 'end-ns' events notify about namespace declarations and
generate tuples ``(prefix, URI)``::
@@ -189,28 +264,16 @@
It is common practice to use a list as namespace stack and pop the last entry
on the 'end-ns' event.
-lxml.etree supports two extensions compared to ElementTree. It accepts a
-``tag`` keyword argument just like ``element.getiterator(tag)``. This
-restricts events to a specific tag or namespace.
- >>> context = etree.iterparse(StringIO(xml), tag="element")
- >>> for action, elem in context:
- ... print action, elem.tag
- end element
- end element
+iterwalk
+--------
- >>> events = ("start", "end")
- >>> context = etree.iterparse(StringIO(xml), events=events, tag="{testns}*")
- >>> for action, elem in context:
- ... print action, elem.tag
- start {testns}empty-element
- end {testns}empty-element
-
-The second extension is the ``iterwalk()`` function. It behaves exactly like
-``iterparse()``, but works on Elements and ElementTrees::
+A second extension over ElementTree is the ``iterwalk()`` function. It
+behaves exactly like ``iterparse()``, but works on Elements and ElementTrees::
- >>> root = context.root
- >>> context = etree.iterwalk(root, events=events, tag="element")
+ >>> root = etree.XML(xml)
+ >>> context = etree.iterwalk(
+ ... root, events=("start", "end"), tag="element")
>>> for action, elem in context:
... print action, elem.tag
start element
@@ -220,7 +283,7 @@
Python unicode strings
-----------------------
+======================
lxml.etree has broader support for Python unicode strings than the ElementTree
library. First of all, where ElementTree would raise an exception, the
@@ -246,6 +309,10 @@
should generally avoid converting XML/HTML data to unicode before passing it
into the parsers. It is both slower and error prone.
+
+Serialising to Unicode strings
+------------------------------
+
To serialize the result, you would normally use the ``tostring`` module
function, which serializes to plain ASCII by default or a number of other
encodings if asked for::
Modified: lxml/trunk/doc/validation.txt
==============================================================================
--- lxml/trunk/doc/validation.txt (original)
+++ lxml/trunk/doc/validation.txt Mon May 7 23:35:00 2007
@@ -4,7 +4,7 @@
Apart from the built-in DTD support in parsers, lxml currently supports three
schema languages: DTD_, `Relax NG`_ and `XML Schema`_. All three provide
-identical APIs in lxml, represented by a validator class with the obvious
+identical APIs in lxml, represented by validator classes with the obvious
names.
.. _DTD: http://en.wikipedia.org/wiki/Document_Type_Definition
Modified: lxml/trunk/src/lxml/parser.pxi
==============================================================================
--- lxml/trunk/src/lxml/parser.pxi (original)
+++ lxml/trunk/src/lxml/parser.pxi Mon May 7 23:35:00 2007
@@ -664,14 +664,9 @@
* recover - try hard to parse through broken XML
* remove_blank_text - discard blank text nodes
- For read-only documents that will not be altered after parsing, you can
- also pass the following keyword arguments:
- * compact - compactly store short element text content
-
- Note that you should avoid sharing parsers between threads. This does not
+ Note that you should avoid sharing parsers between threads. While this is
+ not harmful, it is more efficient to use separate parsers. This does not
apply to the default parser.
-
- You must not modify documents that were parsed with the 'compact' option.
"""
def __init__(self, attribute_defaults=False, dtd_validation=False,
load_dtd=False, no_network=False, ns_clean=False,
@@ -794,12 +789,8 @@
* no_network - prevent network access
* remove_blank_text - discard empty text nodes
- For read-only documents that will not be altered after parsing, you can
- also pass the following keyword arguments:
- * compact - compactly store short element text content
-
- Note that you should avoid sharing parsers between threads. You must not
- modify documents that were parsed with the 'compact' option.
+ Note that you should avoid sharing parsers between threads for performance
+ reasons.
"""
def __init__(self, recover=True, no_network=False, remove_blank_text=False,
compact=True):
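The parsing.txt hunk above notes that it is common practice to use a list as a namespace stack and pop the last entry on the 'end-ns' event. A minimal sketch of that practice follows. It is an illustration only and not part of this commit. It uses the standard library's ``xml.etree.ElementTree`` so that it runs without lxml, on the assumption that its ``iterparse()`` accepts the same 'start-ns' and 'end-ns' event names. The sample document and the helper name ``tags_with_namespaces`` are made up for the example.

```python
from io import BytesIO
import xml.etree.ElementTree as ET

# Hypothetical sample document: one prefix declared on the root element.
XML = b'<root xmlns:t="testns"><t:item/><plain/></root>'


def tags_with_namespaces(source):
    """Collect (tag, namespaces in scope) for every element 'start' event.

    Returns the collected list and whatever is left on the namespace stack
    once parsing has finished.
    """
    ns_stack = []  # list used as a stack of (prefix, URI) tuples
    seen = []
    events = ("start", "start-ns", "end-ns")
    for event, payload in ET.iterparse(source, events=events):
        if event == "start-ns":
            # payload is the (prefix, URI) tuple of the new declaration
            ns_stack.append(payload)
        elif event == "end-ns":
            # the most recently declared namespace goes out of scope
            ns_stack.pop()
        else:
            # 'start' event: payload is the element; copy the current stack
            seen.append((payload.tag, list(ns_stack)))
    return seen, ns_stack


result, leftover = tags_with_namespaces(BytesIO(XML))
for tag, namespaces in result:
    print(tag, namespaces)
```

Because every 'start-ns' is matched by an 'end-ns', the stack should be empty again when the iterator is exhausted. With lxml installed, swapping the import for ``from lxml import etree as ET`` is expected to give the same events for this input.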