r28707 - lxml/trunk: msg#00158

Subject: r28707 - lxml/trunk
Author: scoder
Date: Mon Jun 12 15:05:55 2006
New Revision: 28707

Modified:
   lxml/trunk/TODO.txt
Log:
copied notes on iterparse() implementation to TODO.txt

Modified: lxml/trunk/TODO.txt
==============================================================================
--- lxml/trunk/TODO.txt (original)
+++ lxml/trunk/TODO.txt Mon Jun 12 15:05:55 2006
@@ -17,8 +17,6 @@
 
 * will namespaces nodes of unknown namespaces be added (and never freed?)
 
-* iterparse support would be nice.
-
 Top level
 ---------
 
@@ -42,3 +40,42 @@
   integrating this:
 
   http://www.gnosis.cx/download/relax/
+
+Notes on implementing iterparse
+-------------------------------
+
+"iterparse" will be (or will return) an iterable object, let's call it
+IterParse for clarity. A class is basically the only way of implementing
+iterators in Pyrex. For the internal SAX part, IterParse will likely work a
+lot like lxml.sax.ElementTreeContentHandler.
+
+We'd need a custom wrapper to the default libxml2 SAX handler to intercept the
+parse events (this means implementing C helper functions for the SAX events)
+/after/ they were processed by libxml2. See xmlSAXVersion (SAX2.c) on how to
+retrieve the SAX2 default parser structure.
+
+IterParse should pass chunks into the parser and buffer the events it
+receives. When its __next__() method is called, it returns one event or passes
+new chunks until there is an event to return. This is needed as IterParse has
+to convert between libxml2 push (SAX) and Python pull (iter).
+
+As for the input to the libxml2 parser, there are two possible ways: one is to
+pass data chunks in through xmlParseChunk and the other is to use
+xmlCreateIOParserCtxt and implement xmlInputReadCallback (xmlIO.h) to have
+libxml2 request data by itself. However, xmlParseChunk allows us to control
+how far libxml2 parses in advance, so this is preferable.
+
+Python events (start, end, start-ns, end-ns) are created as follows:
+
+* "*-ns" events must be extracted from the libxml2 xmlSAX2StartElementNs call
+(passed in arguments "prefix"/"URI" and the char* array "namespaces"). They
+must be stored on a stack to build the respective "end-ns" events.
+
+* "start" is somewhat tricky, as it would be a bad idea to allow modifications
+of the XML structure during that iterator cycle. Maybe it's enough to document
+that, but there may be ways to crash lxml with certain tree operations. Note
+also that care has to be taken to prevent Python from garbage collecting the
+element before the "end" event. The best way to do that is to store a Python
+reference to that element on a stack.
+
+* "end" is simple then: pop the element from the stack and return it.
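
The notes above describe a push-to-pull conversion: feed chunks to the parser, buffer the events it fires, and hand them out one per __next__() call. The following is only a rough sketch of that idea, not the planned lxml implementation. It assumes Python's standard-library expat binding as a stand-in for libxml2's xmlParseChunk, and the class name IterParse, the chunk_size parameter and all private attribute names are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import deque
from xml.parsers import expat


def _qname(name):
    # With namespace_separator="}", expat reports "uri}local";
    # convert that to the ElementTree form "{uri}local".
    return "{" + name if "}" in name else name


class IterParse:
    """Pull-style iterator on top of a push-style, chunk-fed parser."""

    def __init__(self, source, events=("end",), chunk_size=64):
        self._source = source          # binary file-like object
        self._events = frozenset(events)
        self._chunk_size = chunk_size
        self._buffer = deque()         # events received but not yet returned
        self._stack = []               # open elements, kept alive until "end"
        self._last_closed = None       # decides between .text and .tail
        self._done = False
        parser = expat.ParserCreate(namespace_separator="}")
        parser.StartNamespaceDeclHandler = self._start_ns
        parser.EndNamespaceDeclHandler = self._end_ns
        parser.StartElementHandler = self._start
        parser.EndElementHandler = self._end
        parser.CharacterDataHandler = self._data
        self._parser = parser

    def _emit(self, event, value):
        if event in self._events:
            self._buffer.append((event, value))

    def _start_ns(self, prefix, uri):
        self._emit("start-ns", (prefix or "", uri))

    def _end_ns(self, prefix):
        self._emit("end-ns", None)

    def _start(self, name, attrs):
        attrib = {_qname(key): value for key, value in attrs.items()}
        element = ET.Element(_qname(name), attrib)
        if self._stack:
            self._stack[-1].append(element)
        self._stack.append(element)    # reference held until the "end" event
        self._last_closed = None
        self._emit("start", element)

    def _end(self, name):
        element = self._stack.pop()
        self._last_closed = element
        self._emit("end", element)

    def _data(self, text):
        if self._last_closed is not None:
            self._last_closed.tail = (self._last_closed.tail or "") + text
        elif self._stack:
            top = self._stack[-1]
            top.text = (top.text or "") + text

    def __iter__(self):
        return self

    def __next__(self):
        # Pass in new chunks only until at least one event is available.
        while not self._buffer:
            if self._done:
                raise StopIteration
            chunk = self._source.read(self._chunk_size)
            if chunk:
                self._parser.Parse(chunk, False)
            else:
                self._parser.Parse(b"", True)   # signal end of input
                self._done = True
        return self._buffer.popleft()


if __name__ == "__main__":
    data = io.BytesIO(b'<root xmlns:a="urn:a"><a:item>x</a:item></root>')
    for event, value in IterParse(data, events=("start-ns", "start", "end")):
        print(event, getattr(value, "tag", value))
```

A small chunk_size keeps the parser from running far ahead of the consumer, which is the reason given above for preferring xmlParseChunk. As with the "start" caveat in the notes, an element returned on "start" (and the tail of one returned on "end") may still be incomplete at that point.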
