Parsing Nested List

On Sun, 04 Feb 2018 14:26:10 -0800, Stanley Denman wrote:

> I am trying to parse a Python nested list that is the result of the
> getOutlines() function of module PyPDF2 using pyparsing module.

pyparsing parses strings, not lists.

I fear that you have completely misunderstood what pyparsing does: it 
isn't a general-purpose parser of arbitrary Python objects like lists. 
Like most parsers (actually, all parsers that I know of...) it takes text 
as input and produces some sort of machine representation of that text.
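
To illustrate the general point with a parser from the standard library (json here, not pyparsing, and the sample text is made up): a string goes in, a Python object comes out, and handing the parser something that is already a list is rejected.

```python
import json

# A string is valid input for a parser...
text = '["ABCD 01 of 99", "EFGH 02 of 99"]'

# ...and the result is a machine representation, here a Python list.
data = json.loads(text)
print(type(text).__name__, '->', type(data).__name__)

# Feeding the already-parsed list back into the parser fails,
# because there is no text left to parse.
rejected = False
try:
    json.loads(data)
except TypeError as err:
    rejected = True
    print('TypeError:', err)
```

How pyparsing itself reacts to a list argument may differ in detail, but the principle is the same.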


So your code is not working because you are calling parseString() with a 
list argument.


The name of the function, parseString(), should have been a hint that it 
requires a *string* as argument.
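
If you do want to stay with pyparsing, a rough sketch would be to call parseString() once per string rather than once on the whole list. The grammar below is only my guess at your template, and the sample list is invented for illustration:

```python
from pyparsing import Literal, Word, alphas, nums

# Assumed grammar: a word of letters, two digits, the word "of", two digits.
entry = Word(alphas) + Word(nums, exact=2) + Literal("of") + Word(nums, exact=2)

# Made-up sample data standing in for the real outline.
outline = ['ABCD 01 of 99', 'EFGH 02 of 99']

parsed = []
for item in outline:
    # parseString() receives one string at a time, never the list itself.
    tokens = entry.parseString(item, parseAll=True)
    parsed.append(tokens.asList())

print(parsed)
```

Each element of parsed is then a list of four string tokens, which you can convert to integers as needed.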

You have generated an outline:

    List = pdfReader.getOutlines()

but do you know what the format of that list is? I'm going to assume that 
it looks something like this:

['ABCD 01 of 99', 'EFGH 02 of 99', 'IJKL 03 of 99', ...]

since that matches the template you gave to pyparsing. Notice that:

- words are separated by spaces;

- the first word is any arbitrary word, made up of just letters;

- followed by EXACTLY two digits;

- followed by the word "of";

- followed by EXACTLY two digits.

Furthermore, I'm assuming it is a simple, non-nested list. If that is not 
the case, you will need to explain precisely what the format of the 
outline actually is.
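
If the outline does turn out to be nested, one possible approach (a sketch only, assuming the nesting consists of plain lists inside lists with strings at the leaves, which I have not verified against real getOutlines() output) is to flatten it first. The flatten() helper and the sample data are hypothetical:

```python
def flatten(outline):
    """Yield the non-list entries of an arbitrarily nested list, in order."""
    for entry in outline:
        if isinstance(entry, list):
            # Recurse into sub-lists and pass their entries through.
            yield from flatten(entry)
        else:
            yield entry

# Made-up nested data for illustration.
nested = ['ABCD 01 of 99', ['EFGH 02 of 99', ['IJKL 03 of 99']], 'MNOP 04 of 99']
flat = list(flatten(nested))
print(flat)
```

The result is a flat list of the same shape as the one assumed above.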

Parsing this list is simple, and pyparsing is not required:

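for item in List:
    words = item.split()
    # Expect exactly four words: name, number, "of", total.
    if len(words) != 4 or words[2] != 'of':
        raise ValueError('bad input data: %r' % item)
    first, number, _, total = words
    # int() raises ValueError if either field is not numeric.
    number = int(number)
    total = int(total)
    print(first, number, total)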
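
If you want to enforce the format more strictly (letters only in the first word, EXACTLY two digits in each number), a regular expression is one option. This is a sketch under the same assumptions about the data; the names ENTRY and parse_entry are mine:

```python
import re

# Letters, two digits, the word "of", two digits, separated by whitespace.
ENTRY = re.compile(r'([A-Za-z]+)\s+([0-9]{2})\s+of\s+([0-9]{2})')

def parse_entry(item):
    """Return (first, number, total) or raise ValueError on bad input."""
    match = ENTRY.fullmatch(item)
    if match is None:
        raise ValueError('bad input data: %r' % item)
    first, number, total = match.groups()
    return first, int(number), int(total)

print(parse_entry('ABCD 01 of 99'))
```

Unlike the split() version, this rejects items such as 'ABCD 1 of 99' and does not tolerate leading or trailing whitespace.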

Hope this helps.

(Please keep any replies on the list.)