Error getting data from website

On Sat, Dec 7, 2019 at 1:21 PM DL Neil via Python-list
<python-list at python.org> wrote:
> On 7/12/19 1:51 PM, Chris Angelico wrote:
> > On Sat, Dec 7, 2019 at 11:46 AM Michael Torrie <torriem at gmail.com> wrote:
> >>
> >> On 12/6/19 5:31 PM, DL Neil via Python-list wrote:
> >>> If you read the HTML data that the REPL has happily splattered all over
> >>> your terminal's screen (scroll back) (NB "soup" is easier to read than
> >>> is "content"!) you will observe that what you saw in your web-browser is
> >>> not what Amazon served in response to the Python "requests.get()"!
> >>
> >> Sadly, it's likely that Amazon's page is largely built from JavaScript,
> >> so scraping static HTML is probably not going to get you where you want
> >> to go.  There are heavier tools, such as Selenium, which uses a real
> >> browser to grab a page; you can then parse and search the result.
> >
> > Or look for an API instead.
> Both +1
> However, Selenium is possibly less-manageable for a 'beginner'.
> (NB my poorly-based assumption of OP)
> Amazon's HTML-response actually says this/these, but I left it open as a
> (learning) exercise for the OP. They likely prefer the API approach,
> because it can be measured...

Yes, and because it's way WAY easier to guarantee API stability than
Selenium-based page parseability.

But even when there's no *actual* API, you can sometimes delve into
the page and find the actual useful content, perhaps as a big blob of
JSON inside a <script> tag. There'll be no guarantees, of course (but
there aren't any with parsing the HTML either), but it'll be way
easier to parse.
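
To make the <script> idea concrete, here is a minimal sketch using only the standard library. It assumes the page embeds a script element whose entire body is valid JSON; ScriptCollector, find_json_blobs and SAMPLE are made-up names for illustration, and SAMPLE is an invented stand-in, not anything Amazon actually serves.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append rather than assign.
        if self._in_script:
            self.scripts[-1] += data


def find_json_blobs(html):
    """Return every <script> body that parses as a JSON object or array."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    blobs = []
    for text in collector.scripts:
        try:
            data = json.loads(text)
        except ValueError:
            continue  # ordinary JavaScript, not pure JSON
        if isinstance(data, (dict, list)):
            blobs.append(data)
    return blobs


# Invented sample page: one plain JavaScript block, one JSON block.
SAMPLE = """<html><body>
<script>var x = 1;</script>
<script type="application/json">{"title": "Widget", "price": 9.99}</script>
</body></html>"""

blobs = find_json_blobs(SAMPLE)
print(blobs[0]["title"])
```

In practice the blob is often wrapped in an assignment (something like "var data = {...};"), in which case you would need to strip the JavaScript around it before handing it to json.loads(). This sketch only handles the case where the script body is pure JSON.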