Scraping Roger Ebert’s reviews and finding his favorite movies on Amazon Prime

My wife and I are big fans of the late film critic Roger Ebert. We also share an Amazon Prime membership.

I wondered: which of Roger Ebert’s favorite movies are available to watch for free on Prime? Since there are hundreds of reviews by Roger Ebert, I had the perfect excuse for writing a web scraper!

In this article, I will:

Show my not-so-pretty scraping code
Discuss some roadblocks / gotchas I ran into along the way
Share with you the list of movies rated as great by Roger Ebert. That’s what you’re here for, right?

PS: If you just want to see the list of movies, jump to the end of this article.

Code Quality Warning: I hacked this together as fast as I could without much refactoring, so it’s not the most readable or optimized. But it mostly works… for now.

Roadblocks

I hit a few roadblocks while working on this. They are worth calling out because they explain some of the decisions I made in the implementation.

Scraping rogerebert.com

Performing a regular GET with the requests library’s default headers (it sends Accept: */* unless you override it, and never asks for JSON specifically) against the URL assigned to the variable ebert_url will always return the first page of movies, regardless of what you set the page query parameter to.

Solution? The Accept header field needs to be set to application/json for the server to return JSON containing movies for that specific page.
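Here is a minimal sketch of that fix. It uses only the standard library so that it runs without installing anything. EBERT_URL is a made-up placeholder (not the real value of ebert_url), build_page_request and fetch_page are names I invented for illustration, and the sketch assumes the base URL has no query string of its own.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder; substitute the real value of ebert_url.
EBERT_URL = "https://example.com/reviews"


def build_page_request(base_url: str, page: int) -> urllib.request.Request:
    """Build a GET request for one page of results, asking the server for JSON."""
    query = urllib.parse.urlencode({"page": page})
    return urllib.request.Request(
        f"{base_url}?{query}",
        # Without this header the server described above keeps returning page 1.
        headers={"Accept": "application/json"},
    )


def fetch_page(base_url: str, page: int):
    """Send the request and decode the JSON body (this performs network I/O)."""
    request = build_page_request(base_url, page)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Inspect the request without touching the network.
request = build_page_request(EBERT_URL, 3)
print(request.full_url)
print(request.get_header("Accept"))
```

With the requests library, the equivalent call is requests.get(url, params={"page": page}, headers={"Accept": "application/json"}).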

Scraping amazon.com

No public API

First, there is no publicly available Amazon API for their catalog search. It seems like you could email them to get authorization, but I didn’t want to waste my time doing that.

Not automation friendly

I started off using the requests library. Turns out that if you don’t set a proper browser user agent, you’ll get a 503 and a message about how automation isn’t welcome. If you do fake a proper user agent but don’t send back the cookies set in the server’s response, you’ll get:

Sorry, we just need to make sure you’re not a robot. For best results, please make sure your browser is accepting cookies.

I got frustrated and switched over to using a more stateful HTTP tool: mechanize.
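To make "stateful" concrete, here is a rough sketch of the same idea. It is not my mechanize code. It is a standard-library stand-in that stores the cookies the server sets and replays them on later requests, along with a browser-like user agent. The user-agent string and the name make_stateful_opener are illustrative assumptions.

```python
import http.cookiejar
import urllib.request

# Illustrative user-agent string (an assumption); any current browser string would do.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def make_stateful_opener(user_agent: str = USER_AGENT):
    """Return (opener, jar): the opener saves cookies into jar and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Replace the default "Python-urllib/x.y" user agent on every request.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_stateful_opener()
# opener.open(some_url) would now send USER_AGENT plus any cookies held in jar.
print(len(jar))  # number of cookies collected so far
```

Every request made through the same opener shares one cookie jar, which is the behavior the anti-robot page is asking for.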

That worked… 80% of the time? I noticed that if I run my scraper repeatedly, it starts getting the anti-robot message again. Maybe there’s some pattern detection happening on Amazon’s servers?
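One way to cope with the intermittent blocking is to at least detect it, so a blocked response is never mistaken for an empty search result. The sketch below is an assumption-heavy illustration, not part of my scraper: is_robot_check and fetch_with_retries are hypothetical helpers, the marker phrase comes from the message quoted above, and whether pausing actually helps is a guess.

```python
import time

# Phrase from the anti-robot page quoted above; compared in lowercase.
ROBOT_MARKER = "not a robot"


def is_robot_check(html: str) -> bool:
    """Return True if the page looks like the anti-robot interstitial."""
    return ROBOT_MARKER in html.lower()


def fetch_with_retries(fetch, url, attempts=3, pause_seconds=30.0, sleep=time.sleep):
    """Call fetch(url) until the result is not the robot check.

    fetch is any callable that takes a URL and returns the page HTML as a string.
    Returns the HTML, or None if every attempt was blocked.
    """
    for attempt in range(attempts):
        html = fetch(url)
        if not is_robot_check(html):
            return html
        if attempt < attempts - 1:
            # Back off before trying again; skipped after the final attempt.
            sleep(pause_seconds)
    return None


# Demo with a fake fetcher so nothing touches the network.
fake_pages = iter(["Please make sure you're not a robot.", "<html>results</html>"])
print(fetch_with_retries(lambda url: next(fake_pages), "https://example.com", sleep=lambda s: None))
```

Returning None instead of raising keeps the calling loop simple: a movie whose search was blocked can be logged and retried later.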

Bad HTML …

You’ll notice that I’m using some regex in the function amazon_search to parse the movie titles out of the search results on the page. The reason is that when I tried using BeautifulSoup’s find_all function on the search result tags, I got nothing. My guess is that there’s some invalid HTML on the page that confused BeautifulSoup’s html.parser parser, which isn’t super lenient.

Turns out, rather than using regex, I could have switched over to use the html5lib parser.

For example: BeautifulSoup(match, features="html5lib").

The html5lib parser is the most lenient parser, much more lenient than html.parser. So if I needed to make additional changes to this function, I’d refactor it to use that parser and get rid of the nasty-looking regex.
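A rough sketch of what that refactor could look like follows. The markup, the h2 tag choice, and the name extract_titles are all made up for illustration, since Amazon’s real search result markup is different. The sketch falls back to html.parser if html5lib is not installed (pip install html5lib).

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up, deliberately sloppy markup: the <div> tags are never closed.
MESSY_HTML = (
    '<div class="result"><h2>Moonstruck</h2>'
    '<div class="result"><h2>Fitzcarraldo</h2>'
)


def extract_titles(markup: str) -> list[str]:
    """Return the text of every <h2> in the markup, in document order."""
    try:
        soup = BeautifulSoup(markup, features="html5lib")
    except FeatureNotFound:
        # html5lib is not installed; fall back to the stricter built-in parser.
        soup = BeautifulSoup(markup, features="html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]


print(extract_titles(MESSY_HTML))
```

html5lib repairs broken markup the way a browser would, so find_all sees a complete tree even when tags are left open.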

Results

Without further ado, here’s a table of all the great movies that are included with Prime (sorted by most recent release).

If you want the full dataset, I’ve shared it via this Google spreadsheet.

Title	Year Released	Review URL	Prime URL
Moonstruck	1987	Link	Link
Fitzcarraldo	1982	Link	Link
Atlantic City	1980	Link	Link
Nosferatu the Vampyre	1979	Link	Link
The Long Goodbye	1973	Link	Link
Aguirre, the Wrath of God	1972	Link	Link
The Good, the Bad and the Ugly	1968	Link	Link
Gospel According to St. Matthew	1964	Link	Link
The Man Who Shot Liberty Valance	1962	Link	Link
Some Like It Hot	1959	Link	Link
Paths of Glory	1957	Link	Link
The Sweet Smell of Success	1957	Link	Link
The Night of the Hunter	1955	Link	Link
Johnny Guitar	1954	Link	Link
Beat the Devil	1954	Link	Link
Sunset Boulevard	1950	Link	Link
It’s a Wonderful Life	1946	Link	Link
Detour	1945	Link	Link
My Man Godfrey	1936	Link	Link
The General	1927	Link	Link

Enjoy.

Update (2020-6-10)

Lots of really neat discussion happened when I submitted this to Hacker News. I’ll just highlight a few additional resources and things I learned that are useful.

More streaming info on the rogerebert site itself: https://www.rogerebert.com/features/where-to-find-roger-eberts-great-movies-streaming
requests.Session() is another way to get a more stateful HTTP client, since it keeps cookies across requests
Available movies on Amazon can differ substantially between countries! I did not know that. The list I made has only been tested with the U.S. Amazon site
Roger Ebert biopic: https://www.rogerebert.com/reviews/life-itself-2014

And, of course, I learned that there are fans of Roger Ebert everywhere. I’m glad some of you found this useful. Thank you.