My wife and I are big fans of the late film critic Roger Ebert. We also share an Amazon Prime membership.
I wondered: which of Roger Ebert’s favorite movies are available to watch for free on Prime? Since there are hundreds of reviews by Roger Ebert, I had the perfect excuse for writing a web scraper!
In this article, I will:
- Show my not so pretty scraping code
- Discuss some roadblocks / gotchas I ran into along the way
- Share with you the list of movies rated as great by Roger Ebert. That’s what you’re here for, right?
PS: If you just want to see the list of movies, jump to the end of this article.
Code Quality Warning: I hacked this together as fast as I could without much refactoring, so it’s not the most readable or optimized. But it mostly works… for now.
I hit a few roadblocks while working on this that I think are worth calling out, since they explain some of the decisions I made in the implementation.
Performing a regular GET with an Accept: text/html header (I assumed this was the default for the requests library, though requests actually sends Accept: */* unless told otherwise) against the url assigned to the variable ebert_url will always return the first page of movies, regardless of what you set the page query parameter to.
The Accept header field needs to be set to application/json for the server to return JSON containing the movies for that specific page.
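To make the header requirement concrete, here is a minimal sketch of a paginated request that asks for JSON. It uses only the standard library (urllib) rather than requests so it runs without extra dependencies. The function names and the example.com URL are placeholders of mine, not the scraper's actual code, and the shape of the returned JSON is not assumed.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_page_request(base_url, page):
    """Build a GET request for one page of results, asking for JSON."""
    query = urlencode({"page": page})
    # Without this Accept header the server reportedly ignores `page`
    # and always answers with the first page.
    return Request(f"{base_url}?{query}", headers={"Accept": "application/json"})


def fetch_page(base_url, page, timeout=10):
    """Send the request and decode the JSON body (performs network I/O)."""
    with urlopen(build_page_request(base_url, page), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder URL: substitute the value of ebert_url from the scraper.
request = build_page_request("https://example.com/great-movies", 3)
print(request.full_url, request.get_header("Accept"))
```

With requests, the equivalent would be to pass the same dictionary through the headers argument and the page number through params.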
No public API
First, there is no publicly available Amazon API for their catalog search. It seems like you could email them to get authorization, but I didn’t want to waste my time doing that.
Not automation friendly
I started off using the requests library. It turns out that if you don’t set a proper browser user agent, you’ll get a 503 and a message about how automation isn’t welcome. If you do fake a proper user agent but don’t send back the cookies set by the server’s response, you’ll get:
Sorry, we just need to make sure you’re not a robot. For best results, please make sure your browser is accepting cookies.
I got frustrated and switched over to using a more stateful HTTP tool: mechanize.
That worked… 80% of the time? I noticed that if I run my scraper repeatedly, it starts getting the anti-robot message again. Maybe there’s some pattern detection happening on Amazon’s servers?
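Both failures described above come down to client state: a browser-like user agent and cookies that persist between requests. As a rough illustration of what a stateful client does, here is a sketch that uses the standard library instead of mechanize. The user agent string is just an example value, and nothing here is claimed to get past Amazon's bot detection.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

# Example browser-style user agent; any realistic value could be used.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def make_stateful_opener(user_agent=USER_AGENT):
    """Return an opener that keeps cookies across requests, plus its cookie jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    # Replace the default "Python-urllib/x.y" user agent on every request.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_stateful_opener()
# Cookies set by one response are stored in `jar` and sent on later calls,
# e.g. opener.open(some_url) followed by opener.open(another_url).
print(len(jar), opener.addheaders)
```

Reusing one opener for the whole run is the point of the design: creating a fresh client per request would throw the cookies away each time.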
Bad HTML …
You’ll notice that I’m using some regex in the function amazon_search to parse the movie titles out of the search results on the page. The reason is that when I tried using the find_all function on the search result tags, I got nothing. My guess is that there’s some invalid HTML on the page that confused the html.parser parser, which isn’t super lenient.
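For readers who have not seen this kind of workaround, here is a small sketch of pulling titles out of malformed markup with a regex. The sample HTML, the result-title class name, and the extract_titles function are all invented for illustration; they are not Amazon's real markup or the regex used in amazon_search.

```python
import html
import re

# Matches an <h2> whose class list contains the (hypothetical) "result-title".
TITLE_PATTERN = re.compile(
    r'<h2[^>]*class="[^"]*\bresult-title\b[^"]*"[^>]*>(.*?)</h2>',
    re.IGNORECASE | re.DOTALL,
)


def extract_titles(page_html):
    """Return the text of every matching heading, ignoring nested tags."""
    titles = []
    for raw in TITLE_PATTERN.findall(page_html):
        text = re.sub(r"<[^>]+>", "", raw)  # drop any tags inside the heading
        titles.append(html.unescape(text).strip())
    return titles


# Deliberately broken sample: unclosed <li> elements and an unclosed <span>.
sample = """
<div><li><h2 class="result-title">Paths of Glory</h2>
<li><h2 class="a result-title"><span>Some Like It Hot</h2></div>
"""
print(extract_titles(sample))
```

A regex like this does not care whether the surrounding tags are balanced, which is why it keeps working on pages that trip up a stricter parser, but it breaks as soon as the markup around the titles changes.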
It turns out that, rather than using regex, I could have switched over to the html5lib parser, which is the most lenient parser, much more lenient than html.parser. So if I needed to make additional changes to this function, I’d refactor it to use that parser and get rid of the nasty-looking regex.
Without further ado, here’s a table of all the great movies that are included with Prime (sorted by most recent release).
If you want the full dataset, I’ve shared it via this Google spreadsheet.
| Title | Year Released | Review URL | Prime URL |
|---|---|---|---|
| Nosferatu the Vampyre | 1979 | Link | Link |
| The Long Goodbye | 1973 | Link | Link |
| Aguirre, the Wrath of God | 1972 | Link | Link |
| The Good, the Bad and the Ugly | 1968 | Link | Link |
| Gospel According to St. Matthew | 1964 | Link | Link |
| The Man Who Shot Liberty Valance | 1962 | Link | Link |
| Some Like It Hot | 1959 | Link | Link |
| Paths of Glory | 1957 | Link | Link |
| The Sweet Smell of Success | 1957 | Link | Link |
| The Night of the Hunter | 1955 | Link | Link |
| Beat the Devil | 1954 | Link | Link |
| It’s a Wonderful Life | 1946 | Link | Link |
| My Man Godfrey | 1936 | Link | Link |
Lots of really neat discussion happened when I submitted this to Hacker News. I’ll just highlight a few additional resources and things I learned that are useful.
- More streaming info on the rogerebert site itself: https://www.rogerebert.com/features/where-to-find-roger-eberts-great-movies-streaming
- requests.session() is another way to get a more stateful HTTP client
- Available movies on Amazon can differ substantially between countries! I did not know that. This list I made has only been tested with the U.S. Amazon site
- Roger Ebert biopic: https://www.rogerebert.com/reviews/life-itself-2014
And, of course, I learned that there are fans of Roger Ebert everywhere. I’m glad some of you found this useful. Thank you.