osdir.com



Extract sentences in nested parentheses using Python


On Tuesday, 3 December 2019 06:22:52 UTC+8, DL Neil  wrote:
> On 3/12/19 6:00 AM, Peter Otten wrote:
> > A S wrote:
> > I think I've seen this question before ;)
> 
> In addition to 'other reasons' for @Peter's comment, it is a common 
> ComSc worked-problem or assignment. (in which case, we'd appreciate 
> being told that you/OP is asking for help with "homework")
> 
> 
> >> I am trying to extract all strings in nested parentheses (along with the
> >> parentheses itself) in my .txt file. Please see the sample .txt file that
> >> I have used in this example here:
> >> (https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).
> >>
> >> I have tried and done up three different codes but none of them seems to
> >> be able to extract all the nested parentheses. They can only extract a
> >> portion of the nested parentheses. Any advice on what I've done wrong
> >> could really help!
> 
> One approach is to research in the hope that there are already existing 
> tools or techniques which may help/save you from 'reinventing the wheel' 
> - when you think about it, a re-statement of open-source objectives.
> 
> How does the Python interpreter break-down Python (text) code into its 
> constituent parts ("tokens") *including* parentheses? Such are made 
> available in (perhaps) a lesser-known corner of the PSL (Python Standard 
> Library). Might you be able to use one such tool?
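[That lesser-known corner is presumably the tokenize module. A minimal illustration of how it breaks source text into tokens, parentheses included — my addition, not part of the quoted message:]

```python
import io
import tokenize

source = "(a(b))"
# generate_tokens() wants a readline callable; io.StringIO supplies one.
# Parentheses come out as OP tokens, names as NAME tokens, etc.
ops = [tok.string
       for tok in tokenize.generate_tokens(io.StringIO(source).readline)
       if tok.type == tokenize.OP]
print(ops)  # ['(', '(', ')', ')']
```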
> 
> The ComSc technique which sprang to mind involves "stacks" (a LIFO data 
> structure) and "RPN" (Reverse Polish Notation). Whereas we like people 
> to take their turn when it comes to limited resources, eg to form a 
> "queue" to purchase/pay for goods at the local store, which is "FIFO" 
> (first-in, first-out); a "stack"/LIFO (last-in, first-out) can be 
> problematic to put into practical application. There are plenty of 
> Python implementations or you can 'roll your own' with a list. Again, 
> I'd likely employ a "deque" from the PSL's Collections library (although 
> as a "stack" rather than as a "double-ended queue"), because the 
> optimisation comes "free". (to my laziness, but after some kind soul 
> sweated-bullets to make it fast (in both senses) for 'the rest of us'!)
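[As an aside — my sketch, not from the message: a deque serves as a stack simply by pushing and popping at the same end.]

```python
from collections import deque

stack = deque()
stack.append("abc")    # push
stack.append("(def")   # push
print(stack.pop())     # "(def" -- last in, first out
print(stack.pop())     # "abc"
```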
> 
> 
> > It's probably easier to understand and implement when you process the
> > complete text at once. Then arbitrary splits don't get in the way of your
> > quest for ( and ). You just have to remember the position of the first
> > opening ( and number of opening parens that have to be closed before you
> > take the complete expression:
> 
> +1
> but as a 'silver surfer', I don't like to be asked to "remember" anything!
> 
> 
> > level:  00011112222100
> >      we need^
> 
> 
> Consider:
> original_text (the contents of the .txt file - add buffering if volumes 
> are huge)
> current_text (the characters we have processed/"recognised" so-far)
> stack (what an original name for such a data-structure! Which will 
> contain each of the partial parenthetical expressions found - but yet to 
> be proven/made complete)
> 
> set current_text to NULSTRING
> for each current_character in original_text:
> 	if current_character is LEFT_PARENTHESIS:
> 		push current_text to stack
> 		set current_text to NULSTRING
> 	concatenate current_character onto current_text
> 	if current_character is RIGHT_PARENTHESIS:
> 		# current_text is now a complete parenthetical expression
> 		# do with it what you will
> 		pop the stack
> 		set current_text to the popped string \
> 			concatenated with current_text
> 
> Once working: cover 'special cases' (after above loop), eg original_text 
> which doesn't begin and/or end with parentheses; and error cases, eg 
> pop-ping a NULSTRING, or thinking things are finished but the stack is 
> not yet empty - likely events from unbalanced parentheses!
> 
> original text = "abc(def(gh))ij"
> 
> event 1: in-turn, concatenate characters "abc" as current_text
> event 2: locate (first) left-parenthesis, push current_text to stack(&)
> event 3: concatenate "(def"
> event 4: push, likewise
> event 5: concatenate "(gh"
> event 6: locate (first) right-parenthesis (matches the left-parenthesis 
> beginning the current_text!)
> result?: ?print current_text?
> event 7: pop stack and redefine current_text as "(def(gh)"
> event 8: repeat, per event 6
> event 9: current_text will now become "(def(gh))"
> event 10: (special case) had the input instead been "abc(def(gh)ij", we 
> would run out of input with the stack non-empty
> event 11: (special case) pop the (non-empty) stack and report the 
> unbalanced "abc(def(gh)"
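[The walkthrough above can be turned into working Python — a sketch under my own naming, not code from the thread:]

```python
def extract_parenthesised(text):
    """Return every parenthetical expression in text, innermost first,
    using the stack-based approach described above."""
    results = []
    stack = []      # partial text pending at each enclosing level
    current = ""    # characters recognised so far at the current level
    for ch in text:
        if ch == "(":
            stack.append(current)   # remember what preceded this '('
            current = ""
        current += ch
        if ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' in input")
            results.append(current)          # a complete (...) expression
            current = stack.pop() + current  # fold back into outer level
    if stack:
        raise ValueError("unclosed '(' in input")
    return results

print(extract_parenthesised("abc(def(gh))ij"))  # ['(gh)', '(def(gh))']
```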
> 
> 
> NB not sure of need for a "level" number; but if required, you can infer 
> that at any time, from the depth/len() of the stack!
> 
> NBB being a simple-boy, my preference is for a 'single layer' of code, 
> cf @Peter's generator. Regardless the processes are "tokenisation" and 
> "recognition".
> 
> At the back of my mind, was the notion that you may (eventually) be 
> required to work with more than parentheses, eg pair-wise 
> square-brackets and/or quotation-marks. In which case, you will need to 
> also 'push' the token and check for token-pairs when 'pop-ping', as well 
> as (perhaps) recognising lists of tokens to tokenise instead of the two 
> parenthesis characters alone. In which case, I'd take a serious look at 
> the Python Language Services rather than taking a 'roll your own' approach!
> 
> Contrarily, if this spec is 'it', then you might consider optimising the 
> search processes which 'trigger' the two stack operations, by re-working 
> the for-loop and utilising string.find() - prioritising whichever 
> parenthesis is left-most/comes-first - assuming LtR text. (apologies if 
> you have already tried this in one of your previous approaches) 
> Unfortunately, such likely results in 'layers' of code, and a generator 
> might well become the tool-of-choice (I say this before @Peter comes 
> back and (quite deservedly) flays me alive!).
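[A sketch of that find()-based scan — my illustration; the function name and shape are assumptions, not anything from the thread:]

```python
def next_paren(text, start):
    """Index of the next '(' or ')' at or after start, or -1 if neither
    occurs; uses str.find rather than a character-by-character loop."""
    left = text.find("(", start)
    right = text.find(")", start)
    if left == -1:
        return right
    if right == -1:
        return left
    return min(left, right)   # whichever parenthesis comes first (LtR)

print(next_paren("abc(def)", 0))   # 3: the '('
print(next_paren("abc(def)", 4))   # 7: the ')'
print(next_paren("abcdef", 0))     # -1: no parenthesis at all
```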
> 
> 
> WebRefs:
> Python Language Services: https://docs.python.org/3/library/language.html
> collections — Container datatypes: 
> https://docs.python.org/3/library/collections.html
> 
> See also your ComSc text/reference materials.
> -- 
> Regards =dn

Hey DL Neil, this is rather sophisticated for me, as I am still learning the basics of Python... but I truly appreciate your help and effort! I did try to read through what you said, though some parts did not register!