Extract all words between two keywords in .txt file (Python)
On Wed, Dec 11, 2019 at 1:31 PM Ben Bacarisse <ben.usenet at bsb.me.uk> wrote:
> A S <aishan0403 at gmail.com> writes:
> > I would like to extract all words within specific keywords in a .txt
> > file. For the keywords, there is a starting keyword of "PROC SQL;" (I
> > need this to be case insensitive) and the ending keyword could be
> > either "RUN;", "quit;" or "QUIT;". This is my sample .txt file.
> > Thus far, this is my code:
> > with open('lan sample text file1.txt') as file:
> >     text = file.read()
> > regex = re.compile(r'(PROC SQL;|proc sql;(.*?)RUN;|quit;|QUIT;)')
> > k = regex.findall(text)
> > print(k)
> re.compile(r'(?si)(PROC SQL;.*(?:QUIT|RUN);)')
> Read up on what (?si) means and what (?:...) means. You can do the
> same by passing flags to the compile method.
> > Output:
> > [('quit;', ''), ('quit;', ''), ('PROC SQL;', '')]
> Your main issue is that | binds weakly. Your whole pattern tries to
> match any one of just four short sub-patterns:
> PROC SQL;
> proc sql;(.*?)RUN;
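The weak binding of | described in the quote above is easy to check on a
small string. The sketch below is only an illustration: the sample text is
made up (it is not the original poster's file), and the grouped pattern is
one possible way to scope the alternatives, not necessarily the one
suggested earlier in the thread.

```python
import re

# Made-up sample text, not the original poster's file.
text = "proc sql; select a from b; quit;"

# Ungrouped: the top-level alternatives are 'PROC SQL;',
# 'proc sql;(.*?)RUN;', 'quit;' and 'QUIT;', so only the bare 'quit;'
# can match in this sample.
loose = re.compile(r'(PROC SQL;|proc sql;(.*?)RUN;|quit;|QUIT;)')
print(loose.findall(text))    # [('quit;', '')]

# Grouped: (?:...) limits each | to the keywords it is meant to separate,
# and re.DOTALL lets . cross line breaks in multi-line input.
grouped = re.compile(r'(?:PROC SQL;|proc sql;)(.*?)(?:RUN;|quit;|QUIT;)',
                     re.DOTALL)
print(grouped.findall(text))  # [' select a from b; ']
```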
Consider using Python string methods instead.
1. Read the file into a string; let's call it s.
2. start = s.find("PROC SQL;")
This finds the starting point. find() returns an index, or -1 if the
string is not found. It is case sensitive, so search a lowercased copy
of s (s.lower()) if you need the match to be case insensitive.
3. Do the same for each of the three possible ending strings, searching
from start onward. Use if/else to pick the one that was found.
4. This will give you your ending index.
5. Slice out the included string: the slice begins at start +
len("PROC SQL;") and stops at the ending index, since find() returns
the position where the ending keyword begins.
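
The steps above can be sketched roughly as follows. This is a minimal
sketch under a few assumptions: extract_blocks and the sample string are
hypothetical names made up for illustration, the search runs on a
lowercased copy of the text (so every keyword is matched case
insensitively, which is slightly broader than the three endings listed),
the text is plain ASCII so that lowercasing does not shift any index, and
min() is used to pick the nearest ending in place of an if/else chain.

```python
def extract_blocks(s):
    """Return the text between each "PROC SQL;" and the nearest ending keyword."""
    start_kw = "proc sql;"
    end_kws = ("run;", "quit;")  # lowercased, so "QUIT;" is covered as well
    lowered = s.lower()          # search copy; slices are taken from s itself
    blocks = []
    pos = 0
    while True:
        start = lowered.find(start_kw, pos)
        if start == -1:          # no further starting keyword
            break
        body_start = start + len(start_kw)
        # find() returns -1 when a keyword is absent, so drop those hits.
        hits = [i for i in (lowered.find(kw, body_start) for kw in end_kws)
                if i != -1]
        if not hits:             # starting keyword without any ending keyword
            break
        end = min(hits)          # nearest ending keyword closes the block
        blocks.append(s[body_start:end])
        pos = end                # continue searching after this block
    return blocks


sample = ("PROC SQL; select a from b; quit;\n"
          "other text\n"
          "proc sql; select c from d; RUN;")
print(extract_blocks(sample))  # [' select a from b; ', ' select c from d; ']
```

Call .split() on each returned block if you want the individual words
rather than the raw text.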
Regular expressions are powerful, but not so easy to read unless you
are really into them.