osdir.com


Extract sentences in nested parentheses using Python


A S wrote:

I think I've seen this question before ;)

> I am trying to extract all strings in nested parentheses (along with the
> parentheses themselves) from my .txt file. Please see the sample .txt file
> that I have used in this example here:
> (https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).
> 
> I have written three different versions of the code, but none of them
> seems to be able to extract all the nested parentheses. They extract only
> a portion of the nested parentheses. Any advice on what I've done wrong
> would really help!
> 
> Here are the three attempts I have made so far:
> 
> 1st attempt:
> 
> import re
> from os.path import join
> 
> def balanced_braces(args):
>     parts = []
>     for arg in args:
>         if '(' not in arg:
>             continue

This skips every piece that has no "(", but such a piece can still contain
a ")" that closes an earlier "(", and you miss it.

>         chars = []
>         n = 0
>         for c in arg:
>             if c == '(':
>                 if n > 0:
>                     chars.append(c)
>                 n += 1
>             elif c == ')':
>                 n -= 1
>                 if n > 0:
>                     chars.append(c)
>                 elif n == 0:
>                     parts.append(''.join(chars).lstrip().rstrip())
>                     chars = []
>             elif n > 0:
>                 chars.append(c)
>     return parts
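
To see what gets lost, here is a small check. It assumes args is a list of
pieces of the file (lines, for instance); the two inputs are made up for
illustration and are not taken from your sample file:

```python
def balanced_braces(args):
    # Copied from the first attempt above.
    parts = []
    for arg in args:
        if '(' not in arg:
            continue  # a piece holding only ")" is skipped here
        chars = []
        n = 0  # the nesting depth restarts at 0 for every piece
        for c in arg:
            if c == '(':
                if n > 0:
                    chars.append(c)
                n += 1
            elif c == ')':
                n -= 1
                if n > 0:
                    chars.append(c)
                elif n == 0:
                    parts.append(''.join(chars).lstrip().rstrip())
                    chars = []
            elif n > 0:
                chars.append(c)
    return parts


# Hypothetical input: one expression in a single piece ...
one_piece = ["(a (b c) d)"]
# ... and the same expression split into two pieces.
two_pieces = ["(a (b", "c) d)"]

print(balanced_braces(one_piece))   # ['a (b c) d']
print(balanced_braces(two_pieces))  # []
```

The split input yields nothing: the first piece never gets back to depth 0,
and the second piece is skipped because it has no "(". Note also that even
the unsplit case drops the outer parentheses, which you said you want to
keep.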

It's probably easier to understand and implement if you process the
complete text at once. Then arbitrary splits don't get in the way of your
search for ( and ). You just have to remember the position of the first
opening ( and the number of opening parens that still have to be closed
before you take the complete expression:

level:  00011112222100
text:   abc(def(gh))ij
   when we are here^
    we need^
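
The level row can be reproduced with a few lines. This is only an
illustration; levels is a name I made up for this sketch, and it assumes the
nesting never goes deeper than 9 so that each level fits in one digit. A
paren character is counted at the level it opens or closes, which is how the
row above is drawn:

```python
def levels(text):
    """Return one digit per character: the nesting level at that position."""
    level = 0
    digits = []
    for c in text:
        if c == "(":
            level += 1          # the "(" already belongs to the new level
            digits.append(level)
        elif c == ")":
            digits.append(level)
            level -= 1          # the ")" still belongs to the level it closes
        else:
            digits.append(level)
    return "".join(str(d) for d in digits)


print(levels("abc(def(gh))ij"))  # 00011112222100
```

The expression to extract starts at the first position where the level goes
from 0 to 1 and ends at the next position where it drops back to 0.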

A tentative implementation:

$ cat parse.py
import re

NOT_SET = object()

def scan(text):
    level = 0
    start = NOT_SET
    for m in re.compile("[()]").finditer(text):
        if m.group() == ")":
            level -= 1
            if level < 0:
                raise ValueError("underflow: more closing than opening parens")
            if level == 0:
                # outermost closing parenthesis:
                # deliver enclosed string including parens.
                yield text[start:m.end()]
                start = NOT_SET
        elif m.group() == "(":
            if level == 0:
                # outermost opening parenthesis: remember position.
                assert start is NOT_SET
                start = m.start()
            level += 1
        else:
            assert False
    if level > 0:
        raise ValueError("unclosed parens remain")


if __name__ == "__main__":
    with open("lan sample text file.txt") as instream:
        text = instream.read()
    for chunk in scan(text):
        print(chunk)
$ python3 parse.py
("xE'", PUT(xx.xxxx.),"'")
("TRUuuuth")
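
If you want to try scan() without a file, it works on any string. The
snippet below repeats the function in slightly condensed form so that it
runs on its own; the test strings are invented, the first one being the
example from the diagram above:

```python
import re

NOT_SET = object()


def scan(text):
    # Same logic as in parse.py, without the internal assertions.
    level = 0
    start = NOT_SET
    for m in re.finditer(r"[()]", text):
        if m.group() == ")":
            level -= 1
            if level < 0:
                raise ValueError("underflow: more closing than opening parens")
            if level == 0:
                # outermost closing parenthesis: deliver the whole chunk
                yield text[start:m.end()]
                start = NOT_SET
        else:
            if level == 0:
                # outermost opening parenthesis: remember position
                start = m.start()
            level += 1
    if level > 0:
        raise ValueError("unclosed parens remain")


print(list(scan("abc(def(gh))ij")))  # ['(def(gh))']
print(list(scan("a(b)c(d(e))")))     # ['(b)', '(d(e))']

chunks = scan("oops)")  # no error yet, the generator has not started running
try:
    list(chunks)
except ValueError as err:
    print(err)          # underflow: more closing than opening parens
```

Because scan() is a generator, the ValueError for unbalanced input only
shows up while you iterate over the result, not at the call itself. Chunks
found before the problem spot have already been delivered by then.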