r42840 - in lxml/trunk: doc src/lxml: msg#00024

Subject: r42840 - in lxml/trunk: doc src/lxml
Author: scoder
Date: Mon May  7 23:35:00 2007
New Revision: 42840

Modified:
   lxml/trunk/doc/parsing.txt
   lxml/trunk/doc/validation.txt
   lxml/trunk/src/lxml/parser.pxi
Log:
clarifications in parser docs

Modified: lxml/trunk/doc/parsing.txt
==============================================================================
--- lxml/trunk/doc/parsing.txt  (original)
+++ lxml/trunk/doc/parsing.txt  Mon May  7 23:35:00 2007
@@ -18,26 +18,56 @@
 
 
 Parsers
--------
+=======
 
 Parsers are represented by parser objects.  There is support for parsing both
-XML and (broken) HTML (note that XHTML is best parsed as XML).  Both are based
-on libxml2 and therefore only support options that are backed by the library.
-Parsers take a number of keyword arguments.  The following is an example for
-namespace cleanup during parsing, first with the default parser, then with a
-parametrized one::
+XML and (broken) HTML.  Note that XHTML is best parsed as XML; parsing it with
+the HTML parser can lead to unexpected results.  Here is a simple example for
+XML parsing::
 
   >>> xml = '<a xmlns="test"><b xmlns="test"/></a>'
 
-  >>> et     = etree.parse(StringIO(xml))
+  >>> et = etree.parse(StringIO(xml))
   >>> print etree.tostring(et.getroot())
   <a xmlns="test"><b xmlns="test"/></a>
 
+
+Parser options
+--------------
+
+The parsers accept a number of setup options as keyword arguments.  The above
+example is easily extended to clean up namespaces during parsing::
+
   >>> parser = etree.XMLParser(ns_clean=True)
   >>> et     = etree.parse(StringIO(xml), parser)
   >>> print etree.tostring(et.getroot())
   <a xmlns="test"><b/></a>
 
+The keyword arguments in the constructor are mainly based on the libxml2
+parser configuration.  A DTD will also be loaded if validation or attribute
+default values are requested.
+
+Available boolean keyword arguments:
+
+* attribute_defaults - read the DTD (if referenced by the document) and add
+  the default attributes from it
+
+* dtd_validation - validate while parsing (if a DTD was referenced)
+
+* load_dtd - load and parse the DTD while parsing (no validation is performed)
+
+* no_network - prevent network access when looking up external documents
+
+* ns_clean - try to clean up redundant namespace declarations
+
+* recover - try hard to parse through broken XML
+
+* remove_blank_text - discard blank text nodes between tags
+
+
+Parsing HTML
+------------
+
 HTML parsing is similarly simple.  The parsers have a ``recover`` keyword
 argument that the HTMLParser sets by default.  It lets libxml2 try its best to
 return something usable without raising an exception.  You should use libxml2
@@ -48,15 +78,29 @@
   >>> parser = etree.HTMLParser()
   >>> et     = etree.parse(StringIO(broken_html), parser)
 
-  >>> print etree.tostring(et.getroot())
-  <html><head><title>test</title></head><body><h1>page title</h1></body></html>
+  >>> print etree.tostring(et.getroot(), pretty_print=True)
+  <html>
+    <head>
+      <title>test</title>
+    </head>
+    <body>
+      <h1>page title</h1>
+    </body>
+  </html>
 
 Lxml has an HTML function, similar to the XML shortcut known from
 ElementTree::
 
   >>> html = etree.HTML(broken_html)
-  >>> print etree.tostring(html)
-  <html><head><title>test</title></head><body><h1>page title</h1></body></html>
+  >>> print etree.tostring(html, pretty_print=True)
+  <html>
+    <head>
+      <title>test</title>
+    </head>
+    <body>
+      <h1>page title</h1>
+    </body>
+  </html>
 
 The support for parsing broken HTML depends entirely on libxml2's recovery
 algorithm.  It is *not* the fault of lxml if you find documents that are so
@@ -66,6 +110,10 @@
 parsing.  Especially misplaced meta tags can suffer from this, which may lead
 to encoding problems.
 
+
+Doctype information
+-------------------
+
 The use of the libxml2 parsers makes some additional information available at
 the API level.  Currently, ElementTree objects can access the DOCTYPE
 information provided by a parsed document, as well as the XML version and the
@@ -93,7 +141,7 @@
 
 
 iterparse and iterwalk
-----------------------
+======================
 
 As known from ElementTree, the ``iterparse()`` utility function returns an
 iterator that generates parser events for an XML file (or file-like object),
@@ -125,7 +173,7 @@
   >>> context.root.tag
   'root'
 
-The other types can be activated with the ``events`` keyword argument::
+The other event types can be activated with the ``events`` keyword argument::
 
   >>> events = ("start", "end")
   >>> context = etree.iterparse(StringIO(xml), events=events)
@@ -140,6 +188,32 @@
   end {testns}empty-element
   end root
 
+
+Selective tag events
+--------------------
+
+As an extension over ElementTree, lxml.etree accepts a ``tag`` keyword
+argument just like ``element.getiterator(tag)``.  This restricts events to a
+specific tag or namespace::
+
+  >>> context = etree.iterparse(StringIO(xml), tag="element")
+  >>> for action, elem in context:
+  ...     print action, elem.tag
+  end element
+  end element
+
+  >>> events = ("start", "end")
+  >>> context = etree.iterparse(
+  ...             StringIO(xml), events=events, tag="{testns}*")
+  >>> for action, elem in context:
+  ...     print action, elem.tag
+  start {testns}empty-element
+  end {testns}empty-element
+
+
+Modifying the tree
+------------------
+
 You can modify the element and its descendants when handling the 'end' event.
 To save memory, for example, you can remove subtrees that are no longer
 needed::
@@ -170,11 +244,12 @@
   ...     if element.getprevious():      # clean up preceding siblings
   ...         del element.getparent()[0]
 
-You can use ``while`` instead of ``if`` if you skipped siblings using the
-``tag`` keyword argument.  The more selective your tag is, however, the more
-thought you will have to put into finding the right way to clean up the
-elements that were skipped.  Therefore, it is sometimes easier to traverse all
-elements and do the tag selection by hand in the event handler code.
+You can use ``while`` instead of the ``if`` to delete multiple siblings in a
+row if you skipped over them using the ``tag`` keyword argument.  The more
+selective your tag is, however, the more thought you will have to put into
+finding the right way to clean up the elements that were skipped.  Therefore,
+it is sometimes easier to traverse all elements and do the tag selection by
+hand in the event handler code.
 
 The 'start-ns' and 'end-ns' events notify about namespace declarations and
 generate tuples ``(prefix, URI)``::
@@ -189,28 +264,16 @@
 It is common practice to use a list as namespace stack and pop the last entry
 on the 'end-ns' event.
 
-lxml.etree supports two extensions compared to ElementTree.  It accepts a
-``tag`` keyword argument just like ``element.getiterator(tag)``.  This
-restricts events to a specific tag or namespace.
 
-  >>> context = etree.iterparse(StringIO(xml), tag="element")
-  >>> for action, elem in context:
-  ...     print action, elem.tag
-  end element
-  end element
+iterwalk
+--------
 
-  >>> events = ("start", "end")
-  >>> context = etree.iterparse(StringIO(xml), events=events, tag="{testns}*")
-  >>> for action, elem in context:
-  ...     print action, elem.tag
-  start {testns}empty-element
-  end {testns}empty-element
-
-The second extension is the ``iterwalk()`` function.  It behaves exactly like
-``iterparse()``, but works on Elements and ElementTrees::
+A second extension over ElementTree is the ``iterwalk()`` function.  It
+behaves exactly like ``iterparse()``, but works on Elements and ElementTrees::
 
-  >>> root = context.root
-  >>> context = etree.iterwalk(root, events=events, tag="element")
+  >>> root = etree.XML(xml)
+  >>> context = etree.iterwalk(
+  ...             root, events=("start", "end"), tag="element")
   >>> for action, elem in context:
   ...     print action, elem.tag
   start element
@@ -220,7 +283,7 @@
 
 
 Python unicode strings
-----------------------
+======================
 
 lxml.etree has broader support for Python unicode strings than the ElementTree
 library.  First of all, where ElementTree would raise an exception, the
@@ -246,6 +309,10 @@
 should generally avoid converting XML/HTML data to unicode before passing it
 into the parsers.  It is both slower and error prone.
 
+
+Serialising to Unicode strings
+------------------------------
+
 To serialize the result, you would normally use the ``tostring`` module
 function, which serializes to plain ASCII by default or a number of other
 encodings if asked for::
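
Reader's note on the parsing.txt hunks above, not part of the commit: the namespace-stack practice mentioned there (push on 'start-ns', pop on 'end-ns') might look roughly like the sketch below. It is written for Python 3 and uses the standard library's xml.etree.ElementTree.iterparse as a stand-in, on the assumption that it reports these four event types the same way lxml.etree does. The sample document and the helper name collect_events are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical sample document with one prefixed namespace.
XML = b'<root xmlns:t="testns"><t:item>1</t:item><plain/></root>'


def collect_events(data):
    """Return (event, detail) pairs while maintaining a namespace stack."""
    ns_stack = []  # (prefix, URI) tuples that are currently in scope
    seen = []
    events = ("start-ns", "end-ns", "start", "end")
    for event, payload in ET.iterparse(io.BytesIO(data), events=events):
        if event == "start-ns":
            # payload is a (prefix, URI) tuple
            ns_stack.append(payload)
            seen.append((event, payload))
        elif event == "end-ns":
            # the declaration going out of scope is the last one pushed
            seen.append((event, ns_stack.pop()))
        else:
            # 'start' and 'end' deliver the element itself
            seen.append((event, payload.tag))
    return seen


if __name__ == "__main__":
    for event, detail in collect_events(XML):
        print(event, detail)
```

The stack is only needed because 'end-ns' does not say which declaration ended; popping the last entry recovers it.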

Modified: lxml/trunk/doc/validation.txt
==============================================================================
--- lxml/trunk/doc/validation.txt       (original)
+++ lxml/trunk/doc/validation.txt       Mon May  7 23:35:00 2007
@@ -4,7 +4,7 @@
 
 Apart from the built-in DTD support in parsers, lxml currently supports three
 schema languages: DTD_, `Relax NG`_ and `XML Schema`_.  All three provide
-identical APIs in lxml, represented by a validator class with the obvious
+identical APIs in lxml, represented by validator classes with the obvious
 names.
 
 .. _DTD:          http://en.wikipedia.org/wiki/Document_Type_Definition
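
Reader's note on the validation.txt hunk above, not part of the commit: the "identical APIs" wording can be illustrated with a small sketch, assuming Python 3 and an installed lxml. The two tiny schemas and the test documents are invented for this example; only etree.DTD, etree.RelaxNG and their validate() method are taken from lxml.

```python
from io import StringIO

from lxml import etree

# Two hypothetical schemas that both require <a> to contain exactly one empty <b>.
dtd = etree.DTD(StringIO("<!ELEMENT a (b)><!ELEMENT b EMPTY>"))
relaxng = etree.RelaxNG(etree.XML(
    '<element name="a" xmlns="http://relaxng.org/ns/structure/1.0">'
    '<element name="b"><empty/></element>'
    '</element>'))

valid = etree.XML("<a><b/></a>")
invalid = etree.XML("<a/>")  # the required <b> child is missing

# Both validator objects are driven through the same validate() call.
results = {
    type(validator).__name__: (validator.validate(valid),
                               validator.validate(invalid))
    for validator in (dtd, relaxng)
}

if __name__ == "__main__":
    print(results)
```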

Modified: lxml/trunk/src/lxml/parser.pxi
==============================================================================
--- lxml/trunk/src/lxml/parser.pxi      (original)
+++ lxml/trunk/src/lxml/parser.pxi      Mon May  7 23:35:00 2007
@@ -664,14 +664,9 @@
     * recover            - try hard to parse through broken XML
     * remove_blank_text  - discard blank text nodes
 
-    For read-only documents that will not be altered after parsing, you can
-    also pass the following keyword arguments:
-    * compact            - compactly store short element text content
-
-    Note that you should avoid sharing parsers between threads.  This does not
+    Note that you should avoid sharing parsers between threads.  While this is
+    not harmful, it is more efficient to use separate parsers.  This does not
     apply to the default parser.
-
-    You must not modify documents that were parsed with the 'compact' option.
     """
     def __init__(self, attribute_defaults=False, dtd_validation=False,
                  load_dtd=False, no_network=False, ns_clean=False,
@@ -794,12 +789,8 @@
     * no_network         - prevent network access
     * remove_blank_text  - discard empty text nodes
 
-    For read-only documents that will not be altered after parsing, you can
-    also pass the following keyword arguments:
-    * compact            - compactly store short element text content
-
-    Note that you should avoid sharing parsers between threads.  You must not
-    modify documents that were parsed with the 'compact' option.
+    Note that you should avoid sharing parsers between threads for performance
+    reasons.
     """
     def __init__(self, recover=True, no_network=False, remove_blank_text=False,
                  compact=True):
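
Reader's note on the parser options documented in this commit, not part of the commit itself: a minimal sketch of how the boolean keyword arguments combine, assuming Python 3 and an installed lxml. The helper name parse_with and the sample inputs are hypothetical; the ns_clean, remove_blank_text and recover options are the ones listed in the docstrings above.

```python
from io import BytesIO

from lxml import etree

# Sample input with a redundant namespace declaration and blank text between tags.
XML = b'<a xmlns="test">\n  <b xmlns="test"/>\n</a>'


def parse_with(parser=None):
    """Parse XML with the given parser (None selects lxml's default parser)
    and return the serialized root element."""
    tree = etree.parse(BytesIO(XML), parser)
    return etree.tostring(tree.getroot())


default_output = parse_with()
cleaned_output = parse_with(
    etree.XMLParser(ns_clean=True, remove_blank_text=True))

# recover=True asks libxml2 to build a tree even though <b> is never closed.
recovered_root = etree.fromstring(b"<a><b></a>", etree.XMLParser(recover=True))

if __name__ == "__main__":
    print(default_output)
    print(cleaned_output)
    print(recovered_root.tag)
```

Creating one parser object per configuration, as done here, also fits the docstring's advice to use separate parsers rather than sharing one between threads.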

