Extract sentences in nested parentheses using Python


A S wrote:

> On Tuesday, 3 December 2019 01:01:25 UTC+8, Peter Otten  wrote:
>> A S wrote:
>> 
>> I think I've seen this question before ;)
>> 
>> > I am trying to extract all strings in nested parentheses (along with
>> > the parentheses themselves) from my .txt file. Please see the sample
>> > .txt file that I have used in this example here:
>> > (https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).
>> > 
>> > I have written three different versions, but none of them seems to be
>> > able to extract all the nested parentheses. They only extract a portion
>> > of the nested parentheses. Any advice on what I've done wrong would
>> > really help!
>> > 
>> > Here are the three attempts I have made so far:
>> > 
>> > 1st attempt:
>> > 
>> > import re
>> > from os.path import join
>> > 
>> > def balanced_braces(args):
>> >     parts = []
>> >     for arg in args:
>> >         if '(' not in arg:
>> >             continue
>> 
>> There could still be a ")" that you miss
>> 
>> >         chars = []
>> >         n = 0
>> >         for c in arg:
>> >             if c == '(':
>> >                 if n > 0:
>> >                     chars.append(c)
>> >                 n += 1
>> >             elif c == ')':
>> >                 n -= 1
>> >                 if n > 0:
>> >                     chars.append(c)
>> >                 elif n == 0:
>> >                     parts.append(''.join(chars).strip())
>> >                     chars = []
>> >             elif n > 0:
>> >                 chars.append(c)
>> >     return parts
>> 
>> It's probably easier to understand and implement when you process the
>> complete text at once. Then arbitrary splits don't get in the way of your
>> quest for ( and ). You just have to remember the position of the first
>> opening ( and number of opening parens that have to be closed before you
>> take the complete expression:
>> 
>> level:  00011112222100
>> text:   abc(def(gh))ij
>>    when we are here^
>>     we need^
>> 
>> A tentative implementation:
>> 
>> $ cat parse.py
>> import re
>> 
>> NOT_SET = object()
>> 
>> def scan(text):
>>     level = 0
>>     start = NOT_SET
>>     for m in re.compile("[()]").finditer(text):
>>         if m.group() == ")":
>>             level -= 1
>>             if level < 0:
>>                 raise ValueError("underflow: more closing than opening parens")
>>             if level == 0:
>>                 # outermost closing parenthesis:
>>                 # deliver enclosed string including parens.
>>                 yield text[start:m.end()]
>>                 start = NOT_SET
>>         elif m.group() == "(":
>>             if level == 0:
>>                 # outermost opening parenthesis: remember position.
>>                 assert start is NOT_SET
>>                 start = m.start()
>>             level += 1
>>         else:
>>             assert False
>>     if level > 0:
>>         raise ValueError("unclosed parens remain")
>> 
>> 
>> if __name__ == "__main__":
>>     with open("lan sample text file.txt") as instream:
>>         text = instream.read()
>>     for chunk in scan(text):
>>         print(chunk)
>> $ python3 parse.py
>> ("xE'", PUT(xx.xxxx.),"'")
>> ("TRUuuuth")
> 
> Hello Peter! I tried this on my actual working files and it returned this
> error: "unclosed parens remain". In this case, how can I continue to parse
> through my text files by only extracting those with balanced parentheses
> and ignore those that are incomplete?

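Catch the exception for each file and carry on with the next one. Note that
this skips a file with unbalanced parentheses entirely:

import sys

filenames = ...
for filename in filenames:
    with open(filename) as instream:
        text = instream.read()
    try:
        # list() runs the scan to completion, so an error is raised
        # before anything from this file is printed.
        chunks = list(scan(text))
    except ValueError as err:
        # Unbalanced parentheses: report the file and move on.
        print(f"{err} in file {filename!r}", file=sys.stderr)
    else:
        for chunk in chunks:
            print(chunk)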
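
The loop above gives up on a whole file as soon as scan() raises. If you
would rather keep the balanced groups from such a file and drop only the
incomplete ones, something along these lines might work. This is only a
sketch: scan_balanced() is a hypothetical name, not part of parse.py, and it
assumes that "incomplete" means an opening paren that is never closed or a
closing paren that was never opened. Such stray parens are ignored silently.

```python
import re


def scan_balanced(text):
    """Yield outermost balanced "(...)" groups; ignore unmatched parens.

    Assumption: an unmatched "(" or ")" is skipped without an error.
    """
    open_positions = []  # start offsets of "(" still waiting for a ")"
    spans = []           # (start, end) of balanced groups found so far
    for m in re.finditer(r"[()]", text):
        if m.group() == "(":
            open_positions.append(m.start())
        elif open_positions:
            start = open_positions.pop()
            # Groups that began after this "(" are nested inside the new
            # group, so the enclosing group replaces them.
            while spans and spans[-1][0] > start:
                spans.pop()
            spans.append((start, m.end()))
        # else: a ")" with no matching "(" is ignored.
    for start, end in spans:
        yield text[start:end]


if __name__ == "__main__":
    # The inner "(c)" is balanced; the first "(" is never closed.
    for chunk in scan_balanced("a (b (c) d"):
        print(chunk)
```

Unlike scan(), this has to read the whole text before it can yield anything,
because an early "(" may still be closed by the very last character.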