
Extract sentences in nested parentheses using Python


On 3/12/19 6:00 AM, Peter Otten wrote:
> A S wrote:
> I think I've seen this question before ;)

In addition to 'other reasons' behind @Peter's comment, this is a common 
ComSc worked-problem or assignment (in which case, we'd appreciate 
being told that you/the OP are asking for help with "homework").


>> I am trying to extract all strings in nested parentheses (along with the
>> parentheses itself) in my .txt file. Please see the sample .txt file that
>> I have used in this example here:
>> (https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).
>>
>> I have tried and done up three different codes but none of them seems to
>> be able to extract all the nested parentheses. They can only extract a
>> portion of the nested parentheses. Any advice on what I've done wrong
>> could really help!

One approach is to research whether there are already existing tools or 
techniques which may help/save you from 'reinventing the wheel' - 
which, when you think about it, is a re-statement of open-source objectives.

How does the Python interpreter break-down Python (text) code into its 
constituent parts ("tokens") *including* parentheses? Such are made 
available in (perhaps) a lesser-known corner of the PSL (Python Standard 
Library). Might you be able to use one such tool?

The ComSc technique which sprang to mind involves "stacks" (a LIFO data 
structure), as used when evaluating "RPN" (Reverse Polish Notation). 
Whereas we like people to take their turn when it comes to limited 
resources, eg forming a "queue" (FIFO: first-in, first-out) to 
purchase/pay for goods at the local store; a "stack" (LIFO: last-in, 
first-out) takes a little more thought to put into practical 
application. There are plenty of Python implementations, or you can 
'roll your own' with a list. Again, I'd likely employ a "deque" from 
the PSL's collections module (although as a "stack" rather than as a 
"double-ended queue"), because the optimisation comes "free". (to my 
laziness, but after some kind soul sweated-bullets to make it fast (in 
both senses) for 'the rest of us'!)
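By way of illustration (nothing here beyond the PSL), a deque pressed 
into service as a stack - append() is the push, pop() takes from the 
same end:

```python
# collections.deque used as a LIFO stack: append = push, pop = pop.
from collections import deque

stack = deque()
stack.append("abc")     # push
stack.append("(def")    # push
top = stack.pop()       # last-in, first-out
print(top)              # -> (def
```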


> It's probably easier to understand and implement when you process the
> complete text at once. Then arbitrary splits don't get in the way of your
> quest for ( and ). You just have to remember the position of the first
> opening ( and number of opening parens that have to be closed before you
> take the complete expression:

+1
but as a 'silver surfer', I don't like to be asked to "remember" anything!


> level:  00011112222100
>      we need^


Consider:
- original_text (the contents of the .txt file - add buffering if 
volumes are huge)
- current_text (the characters we have processed/"recognised" so-far)
- stack (what an original name for such a data-structure! This will 
contain each of the partial parenthetical expressions found - but yet 
to be proven/made complete)

set current_text to NULSTRING
for each current_character in original_text:
	if current_character is LEFT_PARENTHESIS:
		push current_text to stack
		set current_text to NULSTRING
	concatenate current_character with current_text
	if current_character is RIGHT_PARENTHESIS:
		# current_text is a parenthetical expression
		# do with it what you will
		pop the stack
		set current_text to the popped string \
			concat current_text (the just-completed p-expn)

Once working: cover 'special cases' (after the above loop), eg 
original_text which doesn't begin and/or end with parentheses; and 
error cases, eg pop-ping an empty stack, or reaching end-of-input while 
the stack is not yet empty - both likely events given unbalanced 
parentheses!

original_text = "abc(def(gh))ij"

event 1: in-turn, concatenate characters "abc" as current_text
event 2: locate (first) left-parenthesis, push current_text to stack
event 3: concatenate "(def"
event 4: push, likewise
event 5: concatenate "(gh"
event 6: locate (first) right-parenthesis (which matches the 
left-parenthesis beginning current_text!)
result?: ?print current_text, ie "(gh)"?
event 7: pop stack and redefine current_text as "(def(gh)"
event 8: repeat, per event 6
event 9: current_text will now become "(def(gh))"
event 10: (special case - had the input been the unbalanced 
"abc(def(gh)ij") we run out of input with current_text = "(def(gh)ij"
event 11: (special case, continued) the stack is not empty - pop "abc" 
and report the problem at "abc(def(gh)ij"
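Putting the pseudocode and the special cases together, here is a 
runnable Python sketch (my translation - the function name and error 
messages are mine, and it has only been tried on the toy string above):

```python
def extract_parenthesised(original_text):
    """Collect every parenthesised expression, nested ones included."""
    results = []
    stack = []                           # partial, yet-unproven expressions
    current_text = ""
    for ch in original_text:
        if ch == "(":
            stack.append(current_text)   # push what we have so-far
            current_text = ""            # restart inside the new '('
        current_text += ch
        if ch == ")":
            if not stack:
                raise ValueError("unbalanced ')'")
            results.append(current_text)             # a complete (...) found
            current_text = stack.pop() + current_text
    if stack:
        raise ValueError("unbalanced '('")
    return results

print(extract_parenthesised("abc(def(gh))ij"))
# -> ['(gh)', '(def(gh))']
```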


NB not sure of need for a "level" number; but if required, you can infer 
that at any time, from the depth/len() of the stack!

NBB being a simple-boy, my preference is for a 'single layer' of code, 
cf @Peter's generator. Regardless, the processes are "tokenisation" and 
"recognition".

At the back of my mind was the notion that you may (eventually) be 
required to work with more than parentheses, eg pair-wise 
square-brackets and/or quotation-marks. In that case, you will also 
need to 'push' the token itself and check for matching token-pairs when 
'pop-ping', as well as (perhaps) recognising a list of tokens instead 
of the two parenthesis characters alone. If so, I'd take a serious look 
at the Python Language Services rather than 'rolling your own'!
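A sketch of that pair-checking idea (the PAIRS table and the balanced() 
name are mine, not from any library) - push each opener, and on a 
closer verify it matches the most recent opener:

```python
# Balanced-pair check: the stack holds opening tokens; each closing
# token must match the opener most recently pushed.
PAIRS = {")": "(", "]": "["}

def balanced(text):
    stack = []
    for ch in text:
        if ch in PAIRS.values():      # an opener: ( or [
            stack.append(ch)
        elif ch in PAIRS:             # a closer: ) or ]
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack                  # anything left un-closed?

print(balanced("a(b[c]d)e"))   # -> True
print(balanced("a(b[c)d]e"))   # -> False (mismatched pair)
```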

Contrarily, if this spec is 'it', then you might consider optimising 
the searches which 'trigger' the two stack operations, by re-working 
the for-loop to utilise str.find() - prioritising whichever parenthesis 
is left-most/comes first - assuming LtR text. (apologies if you have 
already tried this in one of your previous approaches) Unfortunately, 
such a change likely results in 'layers' of code, and a generator might 
well become the tool-of-choice (I say this before @Peter comes back and 
(quite deservedly) flays me alive!).
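Such a re-working might start from something like this (my sketch; 
whether it out-performs the simple character loop is for profiling to 
decide):

```python
# Jump between parentheses with str.find() rather than visiting every
# character; take whichever of '(' or ')' comes first (LtR text).
def next_paren(text, start):
    """Index of the left-most '(' or ')' at/after start, or -1."""
    lp = text.find("(", start)
    rp = text.find(")", start)
    if lp == -1:
        return rp
    if rp == -1:
        return lp
    return min(lp, rp)

text = "abc(def(gh))ij"
pos, found = 0, []
while (i := next_paren(text, pos)) != -1:
    found.append((i, text[i]))   # where each stack operation would fire
    pos = i + 1
print(found)  # -> [(3, '('), (7, '('), (10, ')'), (11, ')')]
```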


WebRefs:
Python Language Services: https://docs.python.org/3/library/language.html
collections - Container datatypes: 
https://docs.python.org/3/library/collections.html

See also your ComSc text/reference materials.
-- 
Regards =dn