Finding lines in .txt file that contain keywords from two different set()
On 9/09/19 4:02 AM, A S wrote:
> My problem is seemingly profound but I hope to make it sound as simplified as possible.....Let me unpack the details..:
> These are the folders used for a better reference ( https://drive.google.com/open?id=1_LcceqcDhHnWW3Nrnwf5RkXPcnDfesq ). The files are found in the folder.
The link resulted in a 404 page (for me - but then I don't use Google).
So, without any sample data...
> 1. I have one folder of Excel (.xlsx) files that serve as a data
> -In Cell A1, the data source name is written in between brackets
> -In Cols C:D, it contains the data field names (It could be in either
> col C or D in my actual Excel sheet. So I had to search both columns
> -*Important: I need to know which data source the field names come from
> 2. I have another folder of Text (.txt) files that I need to parse
> through to find these keywords.
Recommend you start with a set of test data/directories. For the first
run, have one of each type of file, where the keywords correlate. Thus
prove that the system works when you know it should.
Next, try the opposite, to ensure that it equally-happily ignores, when
Then expand to having multiple records, so that you can see what happens
when some files correlate, and some don't.
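As a rough sketch of such a first run (not your actual code): the snippet below builds a throw-away folder holding one file that should match and one that should not. The function name files_containing, the file names, and the keyword "customer_id" are all invented for illustration, and a plain substring test stands in for whatever matching you really need.

```python
import tempfile
from pathlib import Path


def files_containing(folder, keyword):
    """Return the sorted names of .txt files in folder whose text contains keyword."""
    return sorted(
        path.name
        for path in Path(folder).glob("*.txt")
        if keyword in path.read_text()
    )


# Hypothetical test data: one file that correlates, one that does not.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "match.txt").write_text("the customer_id field appears here\n")
    Path(folder, "no_match.txt").write_text("nothing relevant\n")

    # Works when we know it should...
    assert files_containing(folder, "customer_id") == ["match.txt"]
    # ...and equally-happily ignores when nothing correlates.
    assert files_containing(folder, "absent_keyword") == []
```

Because the folder is created and deleted by the script itself, the real data is never touched and every run starts from the same known state.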
ie take a large problem and break it down into smaller units. This is a
An alternate design approach (which works very well in Python - see also
"PyTest") is to embrace the principles of TDD (Test-Driven Development).
This is a process that builds 'from the ground, up'. In this, we design
a small part of the process - let's call it a function/method: first we
code some test data *and* the expected answer, eg if one input is 1 and
another is 2, is their sum 3? (running such a test at this stage will
fail - badly - because the code it tests does not exist yet!); and then
we write some code - and keep perfecting it until it passes the test.
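A minimal sketch of that addition example, assuming PyTest is installed. The names add and test_add_one_and_two are made up for illustration. Under TDD the test is written first and fails until add() exists; both are shown together here only to keep the example short.

```python
# test_add.py - pytest collects files named test_*.py and functions named test_*


def add(a, b):
    """Return the sum of the two inputs (hypothetical function under test)."""
    return a + b


def test_add_one_and_two():
    # The test data (1 and 2) *and* the expected answer (3).
    assert add(1, 2) == 3
```

Running "pytest test_add.py" from the same directory should report one passing test. Delete the body of add() and the same command shows what a failure looks like.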
Repeat, stage-by-stage, to build the complete program - meantime, every
change you make to the code should be tested against not just 'its own'
test, but all of the tests which originally related to some other
smaller unit of the whole. In this way, 'new code' can be shown to break
(or not - hopefully) previously implemented, tested, and 'proven' code!
Notice how you have broken-down the larger problem in the description
(points 1 to 5, above)! Design the tests similarly, to *only* test one
small piece of the puzzle. Often you will have to 'fake' or "mock" the
data-inputs to the process, particularly if the code to produce that
unit's input has yet to be written; regardless, 'mock data' is
thoroughly controlled and thus produces (more) predictable results.
Plus, it's much easier to spot errors and omissions when you don't have
to wade through a mass of print-outs that (attempt to) cover
*everything*! (IMHO)
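For instance, point 1 says the data source name sits "in between brackets" in cell A1. That one piece can be tested with faked cell text, before any Excel-reading code exists. This is only a sketch: the function name source_name is invented, round brackets are an assumption (the sample files were not available), and the cell contents are made-up mock data.

```python
import re


def source_name(cell_text):
    """Return the text between the first pair of round brackets, or None.

    Assumption: 'brackets' means ( and ); adjust the pattern if the
    real sheets use [ ] or { }.
    """
    match = re.search(r"\(([^)]*)\)", cell_text)
    return match.group(1).strip() if match else None


def test_source_name_found():
    # Mock data standing in for the contents of cell A1.
    assert source_name("Source (Customer DB)") == "Customer DB"


def test_source_name_missing():
    # A cell without brackets should not produce a bogus name.
    assert source_name("no brackets here") is None
```

Once this passes, the code that actually opens the .xlsx file only has to hand the A1 text to source_name(), and any later fault can be pinned on one side or the other.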
Plus, when a problem is well-confined, there's less example code and
data to insert into list questions, and the responses will be
Referring back to the question: it seems that the issue is either that
the keywords are not being (correctly) picked-out of the sets of files
(easy tests - for *only* those small sections of the code!), or that the
logic linking the keywords is faulty (another *small* test, easily
coded - and at first fed with 'fake' keywords which exercise the various
test cases, and thus, when run, (attempt to) prove your logic and code!)
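The second kind of test might look something like this. It is a guess at the linking logic, going by the subject line (lines containing keywords from two different sets), and is not your code: lines_matching_both and the fake keywords are invented. It also assumes lower-case keywords and whole-word matches on whitespace-separated text.

```python
def lines_matching_both(lines, set_a, set_b):
    """Return the lines that contain at least one keyword from each set."""
    hits = []
    for line in lines:
        words = set(line.lower().split())
        # A line qualifies only if it shares a word with *both* sets.
        if words & set_a and words & set_b:
            hits.append(line)
    return hits


def test_line_needs_keyword_from_each_set():
    # 'Fake' keywords, chosen so that every case is covered exactly once.
    fake_a = {"alpha", "beta"}
    fake_b = {"gamma"}
    lines = [
        "alpha meets gamma",  # keyword from both sets -> kept
        "alpha alone",        # only set A -> dropped
        "gamma alone",        # only set B -> dropped
        "nothing here",       # neither -> dropped
    ]
    assert lines_matching_both(lines, fake_a, fake_b) == ["alpha meets gamma"]
```

If this test passes but the full program still misbehaves, the fault is more likely in how the keywords are being extracted from the files than in the logic that links them.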