A beginner's guide to scraping IMDb user review data using Selenium and BeautifulSoup

Hung Pham
7 min read · Apr 13, 2021

An ELI5 version for beginners on using Selenium and BeautifulSoup for data scraping.

Recently I had a project that required user review data from IMDb but couldn't find any guide on the web for it, so I decided to learn Selenium and BeautifulSoup (BS4) and apply them in this small project.

First things first, let's quickly review how Selenium and BS4 work and how they differ, as both are common tools for scraping data from the web for data analytics.

For this guide, I will assume that you (just like me) have no prior knowledge of HTML code but are familiar with basic Python functionality.

Both tools scrape data by loading the whole webpage in its HTML form; we can then search for and extract elements from the webpage using their HTML tags.

That being said, there are a few key differences between the two (Selenium has many other applications, so we will only talk about the web scraping aspect here):

Selenium:

https://www.selenium.dev/documentation/

Selenium’s main application is web automation, which means we can program it to replicate what we want to do with a webpage, such as:

  • Open it (.get())
  • Find something within the web page (.find_element_by…())
  • Click on something (.click())
  • Close it (.quit())
  • Etc. We can do much more with Selenium, but that is beyond the scope of this article; a minimal sketch of the operations above follows this list.
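Here is that sketch, assuming a Chrome webdriver at a hypothetical path and a hypothetical element ID (both are placeholders, not part of this project):

from selenium import webdriver

#Start Chrome through the webdriver (hypothetical path)
driver = webdriver.Chrome(r"C:\chromedriver.exe")
#Open a page
driver.get('https://www.imdb.com')
#Find an element on the page (hypothetical element ID)
button = driver.find_element_by_id('some-button-id')
#Click on it
button.click()
#Close the browser
driver.quit()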

BeautifulSoup:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

  • BeautifulSoup (BS4), on the other hand, requires the help of the `requests` library to load the page, and only then can it extract elements from the webpage. A minimal sketch of that flow follows this list.
  • BS4 is faster than Selenium when extracting data.
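Here is that sketch (the URL is just an example; the user agent header is explained later in this guide):

import requests
from bs4 import BeautifulSoup

#requests loads the raw page first
response = requests.get('https://www.imdb.com', headers={'User-agent': 'Mozilla/5.0'})
#then BS4 parses the HTML so we can extract elements from it
soup = BeautifulSoup(response.text, 'html.parser')
#for example, grab the text of the page's <title> tag
print(soup.title.text)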

So back to our main topic: our goal here is to extract the user reviews of the top 50 highest-rated movies. We can find such a list easily using Google.

For this project, we will use this link:

https://www.imdb.com/search/title/?groups=top_100

The process, step by step, would be:

  • Use Selenium to open the top 100 page and grab the link to each individual movie (we only need 50, so the first page is enough).
  • Go to each movie's page and grab the link to its 'User Reviews' page.
  • Then go to each movie's 'User Reviews' page, extract the content of the reviews, store it in lists, and create our dataframe.

All set, let’s check out some code:

** Note that I will put a link to GitHub for the full notebook at the end; the embedded code here is for illustration purposes only **

1. Install the packages:

Both can be installed with pip in the command line:

pip install selenium
pip install beautifulsoup4

2. Import the libraries:

import selenium
#webdriver is our tool to interact with the webpage
from selenium import webdriver
#requests is needed to load the page for BS4
import requests
from bs4 import BeautifulSoup
#Using pandas to create our dataframe
import pandas as pd

3. Setup webdriver for Selenium and load the webpage

To use Selenium, we need to download the webdriver file corresponding to our browser. We're using Chrome here: https://chromedriver.chromium.org/downloads

Make sure that your Chrome version matches the webdriver's (for example, we will use version 89.0 here; matching the first two digits of the version is enough).

#path to the webdriver file
PATH = r"C:\chromedriver.exe"
#tell selenium to use Chrome and find the webdriver file in this location
driver = webdriver.Chrome(PATH)
#Set the url link and load the webpage
url = 'https://www.imdb.com/search/title/?groups=top_100'
driver.get(url)

From now on, we can treat the driver object as the webpage itself and perform all the actions discussed above on it. If you run the code up to this point, you will see the website pop up.

4. Extract data

One of the most common tasks we want to do when scraping data is to locate and extract a specific element from the webpage. We can manually find it by inspecting the HTML code of the page.

We want to extract each movie's name and hyperlink. Each movie's block can be found with the class name 'lister-item'; within that block, we can find the movie's name and link.

It sounds complicated at first, but if you inspect the page and hover the mouse over each element, it will make much more sense.

Note that getting the hyperlink requires a little bit of cleaning after getting the movie's name. After that, we can put it all into a for loop to run through all 50 movies.

#Set initial empty lists for each element:
title = []
link = []
year = []
#Grab the block of each individual movie
block = driver.find_elements_by_class_name('lister-item')
#Set up a for loop to run through all 50 movies
for i in range(0, 50):
    #Extract the title
    ftitle = block[i].find_element_by_class_name('lister-item-header').text
    #The extracted title has extra elements, so we have to do some cleaning
    #Remove the order in front of the title
    forder = block[i].find_element_by_class_name('lister-item-index').text
    #Extract the year at the end
    fyear = ftitle[-6:]
    #Drop the order and year and only keep the movie's name
    ftitle = ftitle.replace(forder + ' ', '')[:-7]
    #Then extract the link with the cleaned title
    flink = block[i].find_element_by_link_text(ftitle).get_attribute('href')
    #Add each item to its respective list
    title.append(ftitle)
    year.append(fyear)
    link.append(flink)
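To make the cleaning steps concrete, here is a worked example with an illustrative raw title (the exact text depends on IMDb's page):

#Worked example of the cleaning above, using an illustrative raw string
ftitle = '1. The Shawshank Redemption (1994)'
forder = '1.'
#The last 6 characters hold the year
fyear = ftitle[-6:]                             #'(1994)'
#Drop the leading order, then the trailing ' (1994)' (7 characters)
ftitle = ftitle.replace(forder + ' ', '')[:-7]  #'The Shawshank Redemption'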

Now that we have the links to each movie we want to check, we can use BS4 to go to these movies' pages and grab the link to their user review pages. BS4's concept of finding elements by HTML tag is pretty similar to Selenium's, except for a few small differences in syntax.
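For instance, finding an element by class name looks like this in each library (the class name is a hypothetical placeholder):

#Hypothetical class name, just to show the syntax difference
element_selenium = driver.find_element_by_class_name('some-class')
element_bs4 = soup.find(class_='some-class')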

#Set an empty list to store the user review links
user_review_links = []
for url in link:
    #Set up a user agent for BS4; except in some rare cases, it is the same for most browsers
    user_agent = {'User-agent': 'Mozilla/5.0'}
    #Use requests.get to load the whole page
    response = requests.get(url, headers=user_agent)
    #Parse the response with BS4 to transform it into an HTML structure
    soup = BeautifulSoup(response.text, 'html.parser')
    #Find the link marked by the USER REVIEWS link text
    review_link = 'https://www.imdb.com' + soup.find('a', text='USER REVIEWS').get('href')
    #Append the newly grabbed link to its list
    user_review_links.append(review_link)

Great, now we have a full list of all the movies' names, their individual pages' links, and their user review links. If we want to make a dataframe from them, we would get something like this:
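A minimal sketch of building that master dataframe (the `Movie_name` column name matches the code later in this guide; the other column names are my own choices):

#Build the master dataframe from the lists collected above
top50 = pd.DataFrame({'Movie_name': title,
                      'Year': year,
                      'Movie_link': link,
                      'Review_link': user_review_links})
top50.head()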

Now is when the fun begins. We can then go to each User Reviews page and grab all the content. But by default, IMDb only shows 25 reviews, and we have to click on the Load More button to show the rest.

Well, being able to automatically click on something is exactly why we wanted to use Selenium in the first place.
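One structural note before the code: the snippets below show the steps for a single movie. In practice they would sit inside a loop over all 50 review links, which is also where the index `i` used later (to name each movie's CSV file) comes from; the full notebook on GitHub has the exact structure. A rough sketch of that outer shell:

#Rough sketch of the outer shell; the snippets below fill in each step
for i in range(len(user_review_links)):
    #Open this movie's user review page
    driver.get(user_review_links[i])
    #1. Click 'Load More' until enough reviews are shown
    #2. Extract each review's fields into lists
    #3. Build a dataframe and save it to f'data/{i}.csv'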

page = 1
#We want at least 1,000 reviews, so 50 clicks (25 reviews each) is a safe number
while page < 50:
    try:
        #Find the 'Load More' button on the webpage
        load_more = driver.find_element_by_id('load-more-trigger')
        #Click on that button
        load_more.click()
        page += 1
    except:
        #If there is no button left to click, stop
        break

After that, it’s the same method as before:

#Set up empty lists for each review element
user_name = []
title = []
rating = []
date = []
content = []
#Grab every loaded review block (the class name follows IMDb's review page layout at the time of writing)
review = driver.find_elements_by_class_name('review-container')
for n in range(0, 1200):
    try:
        #Some reviewers only give review text or a rating without the other,
        #so we use try/except here to make sure each block of content has all
        #the elements before appending them to the lists
        ftitle = review[n].find_element_by_class_name('title').text
        #Some review contents are hidden as spoilers, so we use the attribute
        #'textContent' here after extracting the 'content' tag
        fcontent = review[n].find_element_by_class_name('content').get_attribute("textContent").strip()
        frating = review[n].find_element_by_class_name('rating-other-user-rating').text
        fdate = review[n].find_element_by_class_name('review-date').text
        fname = review[n].find_element_by_class_name('display-name-link').text
        #Then add them to the respective lists
        title.append(ftitle)
        content.append(fcontent)
        rating.append(frating)
        date.append(fdate)
        user_name.append(fname)
    except:
        continue

Finally, after collecting all the reviews' content, we can store each movie's reviews in a separate file to use later:

#Build a data dictionary for the dataframe
data = {'User_name': user_name,
        'Review title': title,
        'Review Rating': rating,
        'Review date': date,
        'Review_body': content
        }
#Build the dataframe to export
review = pd.DataFrame(data=data)
movie = top50['Movie_name'][i] #grab the movie's name from the master list
review['Movie_name'] = movie #create a new column with that movie's name
review.to_csv(f'data/{i}.csv') #store each movie's reviews in its own file
driver.quit() #tell Selenium to close the browser once all done

If we want a master file with all the reviews, we can simply concatenate them from individual files:

df_raw = pd.read_csv('data/0.csv') #Create a base dataframe from the first movie's review file
for i in range(1, 50):
    add = pd.read_csv(f'data/{i}.csv')
    df_raw = pd.concat([df_raw, add], ignore_index=True) #Concat the rest into the main dataframe

Our final dataframe should look like this:

Final dataframe for IMDb user reviews

Conclusion

So with Selenium and BeautifulSoup, depending on the data you want, you can use either one or mix and match both libraries, as in the example above. Either way, the scraping process normally follows these steps:

  • Open the website that you want to scrape and inspect it to find the HTML tag of the element you want.
  • If you need to automate some tasks (click somewhere, fill out a search field, etc.), use Selenium.
  • Then store the webpage in an object and extract the desired elements with their HTML tags.
  • If you need multiple items, you can run a loop to store them in a list and create a dataframe based on it.

On a final note, data scraping is quite time-consuming, as finding the correct HTML tag involves a lot of trial and error if you don't have prior knowledge of HTML structure. Also, scraped text data is (most of the time) quite messy and requires extra cleaning steps.

If you are a beginner like me, I hope this guide clearly explained some concepts of data scraping, as well as a practical use for them.

You can find the full code of the notebook on my Github here:

https://github.com/hungpham89/IMDb_scraper

Happy coding!
