A beginner's guide to scraping IMDb user review data with Selenium and BeautifulSoup

An ELI5 guide for beginners on using Selenium and BeautifulSoup for data scraping.

Recently I had a project that required user review data from IMDb, but I couldn't find any guide on the web for it. So I decided to learn Selenium and BS4 and apply them to this small project.

First things first, let's quickly review how Selenium and BS4 work and how they differ, as both are common tools for scraping data from the web for data analytics.

For this guide, I will assume that you (just like me) have no prior knowledge of HTML but are familiar with basic Python functionality.

Both work by loading the whole webpage in its HTML form; we can then search for and extract elements from the page using their HTML tags.

That being said, there are a few key differences between the two (Selenium has many other applications, so we will only talk about the web scraping aspect here):

Selenium:


Selenium’s main application is web automation, which means we can program it to replicate what we want to do with a webpage, such as:

  • Open it (.get())
  • Find something within the web page (.find_element_by…())
  • Click on something (.click())
  • Close it (.quit())
  • Etc. We can do much more with Selenium, but that is out of the scope of this article (a short sketch of these operations follows below).
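To make these operations concrete, here is a minimal sketch (the element id is hypothetical, and the `find_element_by_*` syntax matches the older Selenium versions used in this guide; newer versions use `find_element(By.ID, ...)` instead):

```python
from selenium import webdriver

driver = webdriver.Chrome()                              # open a Chrome window
driver.get('https://www.imdb.com/')                      # open a webpage
element = driver.find_element_by_id('some-button-id')    # find an element (hypothetical id)
element.click()                                          # click on it
driver.quit()                                            # close the browser
```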

BeautifulSoup:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

  • BeautifulSoup (BS4), on the other hand, requires help from the `requests` library to load the page; only then can it extract elements from the webpage.
  • BS4 is faster than Selenium when extracting data (see the sketch below).
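A minimal sketch of that workflow, with `requests` fetching the raw HTML and BS4 parsing it:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.imdb.com/search/title/?groups=top_100')  # load the page
soup = BeautifulSoup(response.text, 'html.parser')                            # parse the HTML
print(soup.title.text)                                                        # e.g. print the page title
```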

So, back to our main topic: our goal here is to extract user reviews for the 50 highest-rated movies. We can find such lists easily using Google.

For this project, we will use this link:

https://www.imdb.com/search/title/?groups=top_100

The process, step by step, would be:

  • Use Selenium to open the top 100 list page and grab the link to each individual movie (we only need 50, so the first page is enough).
  • Go to each movie's page and grab the link to its `User Reviews` page.
  • Go to each movie's `User Reviews` page, extract the content of the reviews, store it in a list, and build our dataframe.

All set, let’s check out some code:

** Note: I will put a link to the full notebook on GitHub at the end; the embedded code here is for illustration purposes only. **

1. Install the packages:

Both can be installed with pip from the command line, for example `pip install selenium beautifulsoup4` (we will also use `requests` and `pandas`, which can be installed the same way).

2. Import the libraries:
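The exact import list in the notebook may differ slightly, but it looks roughly like this:

```python
import time                      # to pause between clicks later on

import pandas as pd              # to build the dataframes
import requests                  # to fetch pages for BS4
from bs4 import BeautifulSoup    # to parse HTML
from selenium import webdriver   # to automate the browser
```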

3. Set up the webdriver for Selenium and load the webpage

To use Selenium, we need to download the webdriver file corresponding to our browser. We're using Chrome here: https://chromedriver.chromium.org/downloads

Make sure that our Chrome version matches the webdriver (for example, we will use version 89.0 here; matching the first two digits of the version is enough).
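A minimal version of this step, assuming the chromedriver file sits next to the notebook (older Selenium versions take the path directly; newer ones expect a `Service` object):

```python
url = 'https://www.imdb.com/search/title/?groups=top_100'

# Point Selenium at the downloaded chromedriver and open the top 100 list page
driver = webdriver.Chrome(executable_path='./chromedriver')
driver.get(url)
```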

From now on, we can treat the driver object as the webpage itself and do all the things discussed above to it. If you run the code up to this point, you will see a browser window pop up with the website loaded.

4. Extract data

One of the most common tasks we want to do when scraping data is to locate and extract a specific element from the webpage. We can manually find it by inspecting the HTML code of the page.

We want to extract the name of, and hyperlink to, each individual movie. Each movie's block can be found with the class name 'lister-item'; within that block, we can then locate the movie's name and link.

It sounds complicated at first, but if you inspect the page and hover the mouse over each element, it will make much more sense.

Note that the hyperlink requires a little cleaning once we have grabbed it along with the movie's name. After that, we can put it all in a for loop to run through all 50 movies.
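A sketch of that loop, with selectors assumed from inspecting the page at the time of writing (class names on IMDb may have changed since):

```python
movie_names, movie_links = [], []

# Each movie on the list page sits in a block with the class 'lister-item'
for item in driver.find_elements_by_class_name('lister-item')[:50]:
    title_tag = item.find_element_by_css_selector('.lister-item-header a')
    movie_names.append(title_tag.text)
    # the href carries extra tracking parameters, so trim it back to the base title URL
    movie_links.append(title_tag.get_attribute('href').split('?')[0])
```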

Now that we have the link to every movie we want to check, we can use BS4 to go to each movie's page and grab the link to its user reviews page. BS4's concept of finding elements by HTML tag is pretty similar to Selenium's, with only a few small differences in syntax.
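One possible version, assuming each movie page has an anchor whose href points at its reviews page (inspect the page to confirm the markup on your end):

```python
review_links = []

for link in movie_links:
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'html.parser')
    # grab the first link on the movie page that points at a reviews URL
    review_tag = soup.find('a', href=lambda href: href and 'reviews' in href)
    review_links.append('https://www.imdb.com' + review_tag['href'].split('?')[0])
```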

Great, now we have a full list of the movies' names, the links to their individual pages, and the links to their user reviews. If we make a dataframe from them, we get something like this:
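Building that dataframe is a one-liner with pandas (the column names here are my own choice):

```python
movies_df = pd.DataFrame({
    'movie': movie_names,
    'movie_link': movie_links,
    'review_link': review_links,
})
movies_df.head()
```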

Now the fun begins. We can go to each User Reviews page and grab all the content. But by default, IMDb only shows 25 reviews, and we have to click the 'Load More' button to see the rest.

Well, being able to click on something automatically is the reason we wanted Selenium in the first place.

After that, it’s the same method as before:
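A sketch of the full loop: Selenium opens each reviews page and clicks 'Load More' a few times, then BS4 parses whatever has been rendered. The button id and the review class are assumptions from inspecting the page:

```python
all_reviews = {}

for name, review_link in zip(movie_names, review_links):
    driver.get(review_link)
    for _ in range(5):                                     # click 'Load More' a few times; adjust as needed
        try:
            driver.find_element_by_id('load-more-trigger').click()
            time.sleep(2)                                  # give the new reviews time to load
        except Exception:
            break                                          # button is gone, so no more reviews
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # each review body sits in a div whose class includes 'text'
    all_reviews[name] = [tag.get_text(strip=True) for tag in soup.find_all('div', class_='text')]
```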

Finally, after getting the full list of review content, we can store the reviews in separate files to use later:
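For example, one CSV file per movie (the naming is my own convention; you may want to strip characters that are not valid in filenames):

```python
for name, reviews in all_reviews.items():
    pd.DataFrame({'movie': name, 'review': reviews}).to_csv(f'{name}.csv', index=False)
```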

If we want a master file with all the reviews, we can simply concatenate the individual files:
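A minimal version using `glob` and `pd.concat` (this assumes only the per-movie CSVs are in the folder):

```python
import glob

master_df = pd.concat([pd.read_csv(f) for f in glob.glob('*.csv')], ignore_index=True)
master_df.to_csv('all_reviews.csv', index=False)
```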

Our final dataframe should look like this:

Final dataframe for IMDb user reviews

Conclusion

So, depending on the data you want, you can use either Selenium or BeautifulSoup on its own, or mix and match both libraries as in the example above. Either way, the scraping process normally follows the same steps:

  • Open the website you want to scrape and inspect it to find the HTML tag of the element you want.
  • If you need to automate some tasks (click somewhere, fill out a search field, etc.), use Selenium.
  • Then load the webpage into an object and extract the desired element by its HTML tag.
  • If you need multiple items, run a loop to store them in a list and create a dataframe from it.

On a final note, data scraping is quite time-consuming, as finding the correct HTML tag involves a lot of trial and error if you don't have prior knowledge of HTML structure. Also, scraped text data is (most of the time) quite messy and requires extra cleaning steps.

If you are a beginner like me, I hope this guide clearly explained some of the concepts of data scraping, as well as a practical use for them.

You can find the full code of the notebook on my Github here:

https://github.com/hungpham89/IMDb_scraper

Happy coding!
