Extract all words between two keywords in .txt file (Python)


On Thursday, 12 December 2019 04:55:46 UTC+8, Joel Goldstick  wrote:
> On Wed, Dec 11, 2019 at 1:31 PM Ben Bacarisse <ben.usenet at bsb.me.uk> wrote:
> >
> > A S <aishan0403 at gmail.com> writes:
> >
> > > I would like to extract all words within specific keywords in a .txt
> > > file. For the keywords, there is a starting keyword of "PROC SQL;" (I
> > > need this to be case insensitive) and the ending keyword could be
> > > either "RUN;", "quit;" or "QUIT;". This is my sample .txt file.
> > >
> > > Thus far, this is my code:
> > >
> > > with open('lan sample text file1.txt') as file:
> > >     text = file.read()
> > >     regex = re.compile(r'(PROC SQL;|proc sql;(.*?)RUN;|quit;|QUIT;)')
> > >     k = regex.findall(text)
> > >     print(k)
> >
> > Try
> >
> >   re.compile(r'(?si)(PROC SQL;.*(?:QUIT|RUN);)')
> >
> > Read up on what (?si) means and what (?:...) means.  You can do the
> > same by passing flags to the compile method.
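A small sketch of what those flags do, using a made-up sample string rather than the poster's actual file. It assumes the file may hold several blocks, so it swaps in a non-greedy .*? for the .* above: with (?s) in effect, a greedy .* would produce one match running from the first PROC SQL; to the last terminator. The capture group is also moved so that only the text between the keywords is returned, which is an assumption about what is wanted.

```python
import re

# Hypothetical sample text standing in for the .txt file.
sample = """\
proc sql;
  select * from a;
quit;
data x; run;
PROC SQL;
  select * from b;
RUN;
"""

# (?s) lets . match newlines, (?i) ignores case,
# (?:...) groups the alternatives without capturing them.
# .*? is non-greedy, so each match stops at the nearest terminator.
pattern = re.compile(r'(?si)PROC SQL;(.*?)(?:QUIT|RUN);')

# The same pattern with the flags passed to compile() instead.
pattern_flags = re.compile(r'PROC SQL;(.*?)(?:QUIT|RUN);',
                           re.DOTALL | re.IGNORECASE)

blocks = [body.strip() for body in pattern.findall(sample)]
print(blocks)
```

The stray "run;" on the "data x;" line is skipped because it does not follow a PROC SQL; that is still open.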
> >
> > > Output:
> > >
> > > [('quit;', ''), ('quit;', ''), ('PROC SQL;', '')]
> >
> > Your main issue is that | binds weakly.  Your whole pattern tries to
> > match any one of just four short sub-patterns:
> >
> > PROC SQL;
> > proc sql;(.*?)RUN;
> > quit;
> > QUIT;
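To make the precedence point concrete, here is a minimal sketch on a one-line made-up input (the string and the names loose and tight are only for illustration). The first pattern is the one from the original post; the second confines the alternation to the ending keyword with a non-capturing group.

```python
import re

text = "PROC SQL; select 1; quit;"

# Without grouping, | splits the whole pattern into four alternatives,
# so each keyword is matched on its own and nothing in between is captured.
loose = re.compile(r'(PROC SQL;|proc sql;(.*?)RUN;|quit;|QUIT;)')
print(loose.findall(text))

# With (?:...), the alternation applies only to the terminator.
tight = re.compile(r'PROC SQL;(.*?)(?:RUN;|quit;|QUIT;)', re.DOTALL)
print(tight.findall(text))
```

The first call returns only the keywords themselves paired with empty strings, the same shape as the output shown above; the second returns the text between them.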
> >
> > --
> > Ben.
> > --
> > https://mail.python.org/mailman/listinfo/python-list
> 
> Consider using Python string methods.
> 
> 1. Read your file into a string; let's call it s.
> 2. start = s.find("PROC SQL;")
>    This finds the starting point. It returns an index, or -1 if the
>    string is not found.
> 3. Do the same for each of the three possible ending strings. Use if/else.
> 4. This will give you your ending index.
> 5. Slice out the included string, taking into account that the start is
>    start + len("PROC SQL;") and the end is the ending index itself,
>    since find returns the position where the ending string begins.
> 
> Regular expressions are powerful, but not so easy to read unless you
> are really into them.
> -- 
> Joel Goldstick
> http://joelgoldstick.com/blog
> http://cc-baseballstats.info/stats/birthdays
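The steps Joel lists could be sketched roughly as follows. This is one possible reading, not his code: the function name extract_blocks, its parameters, and the sample text are all made up for illustration. Because str.find is case-sensitive, the sketch searches a lowercased copy of the text, which assumes plain ASCII-style content where lowercasing does not change string length.

```python
def extract_blocks(text, start_kw="proc sql;", end_kws=("run;", "quit;")):
    """Return the text between each start keyword and the nearest end keyword."""
    # Search a lowercased copy so matching is case-insensitive;
    # slice the original text so the result keeps its original case.
    lowered = text.lower()
    blocks = []
    pos = 0
    while True:
        start = lowered.find(start_kw, pos)
        if start == -1:          # no more start keywords
            break
        body_start = start + len(start_kw)
        # Look for every possible terminator and keep the ones that occur.
        candidates = [(lowered.find(kw, body_start), kw) for kw in end_kws]
        found = [(i, kw) for i, kw in candidates if i != -1]
        if not found:            # start keyword with no terminator
            break
        end, kw = min(found)     # nearest terminator wins
        blocks.append(text[body_start:end].strip())
        pos = end + len(kw)      # continue after this terminator
    return blocks


# Hypothetical sample text standing in for the .txt file.
sample = "proc sql;\n  select * from a;\nquit;\nPROC SQL;\n  select * from b;\nRUN;\n"
print(extract_blocks(sample))
```

The slice ends at the index find returns for the terminator, so the ending keyword itself is left out of the result.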

Hey Joel, not too sure if I get the idea of your code implementation