Extract sentences in nested parentheses using Python
On Tuesday, 3 December 2019 06:22:52 UTC+8, DL Neil wrote:
> On 3/12/19 6:00 AM, Peter Otten wrote:
> > A S wrote:
> > I think I've seen this question before ;)
> In addition to 'other reasons' for @Peter's comment, it is a common
> ComSc worked-problem or assignment. (in which case, we'd appreciate
> being told that you/OP is asking for help with "homework")
> >> I am trying to extract all strings in nested parentheses (along with the
> >> parentheses itself) in my .txt file. Please see the sample .txt file that
> >> I have used in this example here:
> >> (https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).
> >> I have tried and done up three different codes but none of them seems to
> >> be able to extract all the nested parentheses. They can only extract a
> >> portion of the nested parentheses. Any advice on what I've done wrong
> >> could really help!
> One approach is to research in the hope that there are already existing
> tools or techniques which may help/save you from 'reinventing the wheel'
> - when you think about it, a re-statement of open-source objectives.
> How does the Python interpreter break-down Python (text) code into its
> constituent parts ("tokens") *including* parentheses? Such are made
> available in (perhaps) a lesser-known corner of the PSL (Python Standard
> Library). Might you be able to use one such tool?
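[Editor's note: to make the PSL hint above concrete, here is a minimal sketch using the `tokenize` module, one such lesser-known corner. Note that `tokenize` expects roughly Python-shaped input and can raise `tokenize.TokenError` on arbitrary prose, so it may not suit a free-form .txt file without care.]

```python
import io
import tokenize

# Sketch only: tokenize expects roughly Python-shaped source and can
# raise tokenize.TokenError on arbitrary prose.
source = "abc(def(gh))ij"
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
parens = [tok.string for tok in tokens if tok.string in ("(", ")")]
# parens is now ["(", "(", ")", ")"]
```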
> The ComSc technique which sprang to mind involves "stacks" (a LIFO data
> structure) and "RPN" (Reverse Polish Notation). Whereas we like people
> to take their turn when it comes to limited resources, eg to form a
> "queue" to purchase/pay for goods at the local store, which is "FIFO"
> (first-in, first-out); a "stack"/LIFO (last-in, first-out) can be
> problematic to put into practical application. There are plenty of
> Python implementations or you can 'roll your own' with a list. Again,
> I'd likely employ a "deque" from the PSL's Collections library (although
> as a "stack" rather than as a "double-ended queue"), because the
> optimisation comes "free". (to my laziness, but after some kind soul
> sweated-bullets to make it fast (in both senses) for 'the rest of us'!)
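[Editor's note: the deque-as-a-stack suggestion in practice, a minimal sketch.]

```python
from collections import deque

# A deque used purely as a LIFO stack: append() pushes and pop() pops
# from the same (right) end, both O(1).
stack = deque()
stack.append("abc")     # push
stack.append("(def")    # push
top = stack.pop()       # last in, first out -> "(def"
```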
> > It's probably easier to understand and implement when you process the
> > complete text at once. Then arbitrary splits don't get in the way of your
> > quest for ( and ). You just have to remember the position of the first
> > opening ( and number of opening parens that have to be closed before you
> > take the complete expression:
> but as a 'silver surfer', I don't like to be asked to "remember" anything!
> > level: 00011112222100
> > we need^
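[Editor's note: the level-counting idea @Peter describes can be sketched as a short generator that yields only the outermost groups; the function name is illustrative, not from the thread.]

```python
def outermost(text):
    """Yield the outermost (...) groups by counting nesting level."""
    level = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if level == 0:
                start = i       # remember the first opening paren
            level += 1
        elif ch == ")":
            level -= 1
            if level == 0:      # back at the top: the group is complete
                yield text[start:i + 1]
```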
> original_text (the contents of the .txt file - add buffering if volumes
> are huge)
> current_text (the characters we have processed/"recognised" so-far)
> stack (what an original name for such a data-structure! Which will
> contain each of the partial parenthetical expressions found - but yet to
> be proven/made complete)
> set current_text to NULSTRING
> for each current_character in original_text:
>     if current_character is LEFT_PARENTHESIS:
>         push current_text to stack
>         set current_text to NULSTRING
>     concatenate current_character onto current_text
>     if current_character is RIGHT_PARENTHESIS:
>         # current_text is a parenthetical expression
>         # do with it what you will
>         pop the stack
>         set current_text to the popped string \
>             concatenated with current_text
> Once working: cover 'special cases' (after above loop), eg original_text
> which doesn't begin and/or end with parentheses; and error cases, eg
> pop-ping a NULSTRING, or thinking things are finished but the stack is
> not yet empty - likely events from unbalanced parentheses!
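[Editor's note: the pseudocode above, plus the error cases just mentioned, translates fairly directly into Python; names are illustrative. It returns every parenthesised expression, innermost first, nesting included.]

```python
def extract_parenthesised(text):
    """Return every (...) expression, innermost first, nesting included."""
    found = []
    stack = []        # a plain list serves as the stack of interrupted prefixes
    current = ""
    for ch in text:
        if ch == "(":
            stack.append(current)   # push what came before, start afresh
            current = ""
        current += ch
        if ch == ")":
            if not stack:
                raise ValueError("unbalanced ')'")
            found.append(current)            # a complete expression
            current = stack.pop() + current  # splice it back into its context
    if stack:
        raise ValueError("unbalanced '('")
    return found
```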
> original text = "abc(def(gh))ij"
> event 1: in-turn, concatenate characters "abc" as current_text
> event 2: locate (first) left-parenthesis, push current_text to stack(&)
> event 3: concatenate "(def"
> event 4: push, likewise
> event 5: concatenate "(gh"
> event 6: locate (first) right-parenthesis (matches to left-parenthesis
> beginning the current_string!)
> result?: ?print current_text?
> event 7: pop stack and redefine current_text as "(def(gh)"
> event 8: repeat, per event 6
> event 9: current_text will now become "(def(gh))"
> event 10: (special case) run-out of input at "(def(gh)ij"
> event 11: (special case) pop (!any) stack and report "abc(def(gh)ij"
> NB not sure of need for a "level" number; but if required, you can infer
> that at any time, from the depth/len() of the stack!
> NBB being a simple-boy, my preference is for a 'single layer' of code,
> cf @Peter's generator. Regardless, the processes are "tokenisation" and
> "parsing".
> At the back of my mind, was the notion that you may (eventually) be
> required to work with more than parentheses, eg pair-wise
> square-brackets and/or quotation-marks. In which case, you will need to
> also 'push' the token and check for token-pairs when 'pop-ping', as well
> as (perhaps) recognising lists of tokens to tokenise instead of the two
> parenthesis characters alone. In which case, I'd take a serious look at
> the Python Language Services rather than taking a 'roll your own' approach!
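[Editor's note: a hypothetical sketch of the multi-token extension described above: push the opening token itself and check the pair when popping, so brackets of different kinds must nest correctly. Names are illustrative.]

```python
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())

def balanced(text):
    """True when every bracket kind in text nests and matches correctly."""
    stack = []
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)        # push the opening token itself
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False        # wrong kind, or nothing left to close
    return not stack                # leftover openers mean unbalanced text
```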
> Contrarily, if this spec is 'it', then you might consider optimising the
> search processes which 'trigger' the two stack operations, by re-working
> the for-loop and utilising string.find() - prioritising whichever
> parenthesis is left-most/comes-first - assuming LtR text. (apologies if
> you have already tried this in one of your previous approaches)
> Unfortunately, such likely results in 'layers' of code, and a generator
> might well become the tool-of-choice (I say this before @Peter comes
> back and (quite deservedly) flays me alive!).
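[Editor's note: a sketch of that str.find()-driven re-working; as predicted it falls out most naturally as a generator. Names are illustrative. Jumping straight to the next parenthesis of either kind replaces the character-by-character loop, and pushing only the index of each unmatched '(' avoids copying partial strings around.]

```python
def parenthesised(text):
    """Yield each (...) expression, using str.find to skip plain text."""
    starts = []                     # indices of as-yet-unmatched '('
    pos = 0
    while True:
        open_at = text.find("(", pos)
        close_at = text.find(")", pos)
        if open_at == -1 and close_at == -1:
            break                   # no more parentheses at all
        if open_at != -1 and (close_at == -1 or open_at < close_at):
            starts.append(open_at)  # left paren comes first: push its index
            pos = open_at + 1
        else:
            if not starts:
                raise ValueError("unbalanced ')'")
            yield text[starts.pop():close_at + 1]
            pos = close_at + 1
    if starts:
        raise ValueError("unbalanced '('")
```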
> Python Language Services: https://docs.python.org/3/library/language.html
> collections - Container datatypes: https://docs.python.org/3/library/collections.html
> See also your ComSc text/reference materials.
> Regards =dn
Hey DL Neil, this is rather sophisticated for me as I am still learning the basics of Python... but I truly appreciate your help and effort! I did try to read through what you said, though some parts I could not register!