Finding sentinel text when using a thread pool...
I'm developing a web scraper script. It takes 25 minutes to process 590
pages and ~9,000 comments. I've been told that the script is taking too long.
Currently, the page requester is a generator function that requests a
page, checks whether the page contains the sentinel text (i.e., "Sorry,
no more comments."), and either yields the page and requests the next
page or exits the function. Every yielded page is parsed by Beautiful
Soup and saved to disk.
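For reference, the current page requester looks roughly like the sketch below. It is simplified: fetch_page is a hypothetical stand-in for whatever function performs the HTTP request and returns the page text, and request_pages is an illustrative name.

```python
SENTINEL = "Sorry, no more comments."

def request_pages(fetch_page, first_page=1):
    """Yield pages one at a time until a page contains the sentinel text.

    fetch_page is a hypothetical callable: page number -> page text.
    """
    page_number = first_page
    while True:
        page = fetch_page(page_number)
        if SENTINEL in page:
            return  # no more comments, so stop the generator
        yield page
        page_number += 1
```

Each yielded page is then handed to the Beautiful Soup parsing step.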
When I time the page requester separately from the rest of the script,
with the end value set to 590, each page request takes 1.5 seconds.
If I use a thread pool of 16 threads, each request takes 0.1 seconds.
(Higher thread counts result in the server forcibly closing the connection.)
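To put those two timings in perspective, here is the arithmetic for all 590 pages, using only the per-request figures above. Requesting pages one at a time accounts for roughly 15 of the 25 minutes, while the 16-thread pool would bring that part down to about one minute.

```python
PAGES = 590

# Per-request timings measured above (seconds).
sequential_seconds = PAGES * 1.5  # one request at a time
pooled_seconds = PAGES * 0.1      # thread pool of 16 threads

print(f"sequential: {sequential_seconds / 60:.1f} min")
print(f"pooled:     {pooled_seconds / 60:.1f} min")
```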
I'm trying to figure out how I would find the sentinel text by using a
thread pool. It seems like I need to request an arbitrary number of pages
(perhaps one page per thread), evaluate the contents of each page for
the sentinel text, and either request another set of pages or exit the
Is that the most efficient approach for using a thread pool?
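Here is a minimal sketch of the batch approach I have in mind, using concurrent.futures from the standard library. It is not tested against the real site: fetch_page is again a hypothetical callable that returns the text of a given page number, and fetch_pages_in_batches and workers are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import count

SENTINEL = "Sorry, no more comments."

def fetch_pages_in_batches(fetch_page, workers=16):
    """Yield pages in order, requesting them one batch at a time.

    fetch_page is a hypothetical callable: page number -> page text.
    Each batch holds `workers` consecutive page numbers, one per thread.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in count(1, workers):
            # Executor.map returns results in the same order as the inputs,
            # so the pages come back in page-number order.
            batch = pool.map(fetch_page, range(start, start + workers))
            for page in batch:
                if SENTINEL in page:
                    return  # sentinel found: skip the rest of this batch
                yield page
```

The trade-off is that the final batch can request up to workers - 1 pages past the sentinel page, which seems acceptable next to the time saved.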
I'm using this article for the thread pool coding example.