A beginner's guide to scraping IMDB User Reviews using Selenium and BeautifulSoup

An ELI5 walkthrough for beginners on using Selenium and BeautifulSoup to scrape data.

With Selenium, we can control a web page from code:

  • Open it (.get())
  • Find something within the web page (.find_element_by…())
  • Click on something (.click())
  • Close the browser (.quit())
  • And more: Selenium can do much else, but that is outside the scope of this article.

  • BeautifulSoup (BS4), on the other hand, needs the `requests` library to load the page first, and only then can it extract elements from the webpage.
  • BS4 is faster than Selenium at extracting data.

The plan for this article:

  • Use Selenium to open the top 100 page and grab the link to each individual movie (we only need 50, so the first page is enough).
  • Go to each movie’s page and grab the link to its `User Reviews` page.
  • Then go to each movie's `User Reviews` page, extract the content of the reviews, store them in lists and create our dataframe.
pip install selenium
pip install beautifulsoup4
import selenium
#webdriver is our tool to interact with the webpage
from selenium import webdriver
import requests #needed to load the page for BS4
from bs4 import BeautifulSoup
import pandas as pd #Using pandas to create our dataframe
#path to the webdriver file
PATH = r"C:\chromedriver.exe"
#tell selenium to use Chrome and find the webdriver file in this location
driver = webdriver.Chrome(PATH)
#Set the url link and load the webpage
url = 'https://www.imdb.com/search/title/?groups=top_100'
driver.get(url)
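The lookups in the rest of this article use the `find_element_by_…` helpers, which Selenium removed in version 4.3 in favor of `find_element(by, value)` and `find_elements(by, value)`. For readers on a newer Selenium, here is a rough sketch of how one lookup could be wrapped. The helper `find_all_by_class` is hypothetical, the locator string is the value behind `By.CLASS_NAME`, and `FakeDriver` is only a stand-in so the snippet runs without a browser.

```python
# Locator strategy string; this is the value of selenium's By.CLASS_NAME constant.
CLASS_NAME = "class name"


def find_all_by_class(driver_or_element, class_name):
    """Selenium 4 style replacement for find_elements_by_class_name (hypothetical helper)."""
    return driver_or_element.find_elements(CLASS_NAME, class_name)


class FakeDriver:
    """Stand-in for a WebDriver, used only to show the shape of the call."""

    def __init__(self, elements_by_class):
        self._elements_by_class = elements_by_class

    def find_elements(self, by, value):
        # Return a list, empty when nothing matches (as Selenium's find_elements does).
        if by != CLASS_NAME:
            raise ValueError(f"unsupported locator strategy: {by}")
        return list(self._elements_by_class.get(value, []))


fake = FakeDriver({"lister-item": ["movie 1", "movie 2"]})
blocks = find_all_by_class(fake, "lister-item")
print(len(blocks))  # 2
```

With a real driver, the equivalent call would be `driver.find_elements(By.CLASS_NAME, 'lister-item')` after `from selenium.webdriver.common.by import By`.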
#Set initial empty list for each element:
title = []
link = []
year = []
#Grab the block of each individual movie
block = driver.find_elements_by_class_name('lister-item')
#Set up for loop to run through all 50 movies
for i in range(0,50):
    #Extracting title
    ftitle = block[i].find_element_by_class_name('lister-item-header').text

    #The extracted title has extra elements, so we will have to do some cleaning

    #Grab the order in front of the title so we can remove it
    forder = block[i].find_element_by_class_name('lister-item-index').text
    #Extract the year at the end
    fyear = ftitle[-6:]
    #Drop the order and year and only keep the movie's name
    ftitle = ftitle.replace(forder+' ', '')[:-7]
    #Then extract the link with the cleaned title
    flink = block[i].find_element_by_link_text(ftitle).get_attribute('href')
    #Add each item to its respective list
    title.append(ftitle)
    year.append(fyear)
    link.append(flink)
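The slicing above relies on the header text having a fixed shape. Below is a minimal sketch of the same cleaning as a standalone function, assuming the header looks like `1. The Shawshank Redemption (1994)`. The sample strings and the name `split_header` are illustrative and not taken from the page. Passing a count of 1 to `replace` is a small addition, so that only the leading rank is removed.

```python
def split_header(header, order):
    """Split a header such as '1. Some Movie (1994)' into (title, year).

    `order` is the rank text in front of the title, e.g. '1.'.
    """
    year = header[-6:]                                   # last 6 characters: '(1994)'
    without_order = header.replace(order + ' ', '', 1)   # drop the leading rank
    title = without_order[:-7]                           # drop ' (1994)': a space plus 6 characters
    return title, year


movie_title, movie_year = split_header('1. The Shawshank Redemption (1994)', '1.')
print(movie_title)  # The Shawshank Redemption
print(movie_year)   # (1994)
```

If the year part ever has a different length, the fixed offsets `-6` and `-7` would cut the title in the wrong place, so it is worth printing a few cleaned titles before trusting the whole list.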
#Set an empty list to store user review link
user_review_links = []
for url in link:
    #Set up a user agent for the request; apart from some rare cases, this value works for most browsers
    user_agent = {'User-agent': 'Mozilla/5.0'}
    #Use requests.get to load the whole page
    response = requests.get(url, headers = user_agent)
    #Pass the response text to BS4 to parse it into an HTML structure
    soup = BeautifulSoup(response.text, 'html.parser')
    #Find the link marked by the USER REVIEWS link text.
    review_link = 'https://www.imdb.com'+soup.find('a', text = 'USER REVIEWS').get('href')
    #Append the newly grabbed link to its list
    user_review_links.append(review_link)
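`soup.find(...)` returns `None` when no matching tag exists, so chaining `.get('href')` raises an `AttributeError` on a page without that link. One possible guard is sketched below, using the standard library's `urljoin` instead of string concatenation. The name `absolute_link` and the sample path are illustrative assumptions.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.imdb.com'


def absolute_link(href):
    """Turn a relative href into a full URL, or return None if no link was found."""
    if href is None:
        return None
    return urljoin(BASE_URL, href)


print(absolute_link('/title/tt0111161/reviews'))  # https://www.imdb.com/title/tt0111161/reviews
print(absolute_link(None))                        # None
```

In the loop, this would mean storing the result of `soup.find(...)` in a variable first and skipping the movie when it is `None`.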
#This step assumes `driver` has already loaded one movie's User Reviews page
page = 1
#We want at least 1,000 reviews, so 50 pages is a safe number
while page<50:
    try:
        #find the load more button on the webpage
        load_more = driver.find_element_by_id('load-more-trigger')
        #click on that button
        load_more.click()
        page+=1
    except Exception:
        #If we couldn't find any more button to click, stop
        break
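The loop above is a bounded "click until the button is gone" pattern. Here is a hedged sketch of it as a reusable function. The names `click_until_gone` and `FakeButton` are hypothetical, and the `pause` parameter is an assumption added to give the page time to load between clicks.

```python
import time


def click_until_gone(find_button, max_clicks=50, pause=0.0):
    """Click a 'load more' button until it disappears or max_clicks is reached.

    find_button is any callable that returns an object with .click(),
    or raises an exception when the button is no longer on the page.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            find_button().click()
        except Exception:
            break          # no button left (or it is not clickable): stop
        clicks += 1
        time.sleep(pause)  # give the page time to load the next batch
    return clicks


class FakeButton:
    """Stand-in button that can be clicked a limited number of times."""

    def __init__(self, remaining):
        self.remaining = remaining

    def click(self):
        if self.remaining == 0:
            raise RuntimeError('button is gone')
        self.remaining -= 1


button = FakeButton(remaining=3)
print(click_until_gone(lambda: button))                         # 3
print(click_until_gone(lambda: FakeButton(100), max_clicks=5))  # 5
```

With the driver from this article, `find_button` could be `lambda: driver.find_element_by_id('load-more-trigger')`.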
#`review` is assumed to hold the list of review blocks found on the page, and
#title, content, rating, date and user_name are assumed to start as empty lists for each movie
for n in range(0,1200):
    try:
        #Some reviewers only give review text or a rating without the other,
        #so we use try/except here to make sure each block of content has all the elements before appending them to the lists
        #Check if each review has all the elements
        ftitle = review[n].find_element_by_class_name('title').text
        #For the review content, some of them are hidden as spoilers,
        #so we use the attribute 'textContent' here after extracting the 'content' tag
        fcontent = review[n].find_element_by_class_name('content').get_attribute("textContent").strip()
        frating = review[n].find_element_by_class_name('rating-other-user-rating').text
        fdate = review[n].find_element_by_class_name('review-date').text
        fname = review[n].find_element_by_class_name('display-name-link').text
        #Then add them to the respective lists
        title.append(ftitle)
        content.append(fcontent)
        rating.append(frating)
        date.append(fdate)
        user_name.append(fname)
    except Exception:
        continue
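The try/except above works because every field is read before anything is appended, so a review with a missing element never leaves the lists with different lengths. Here is a small sketch of that all-or-nothing idea, with plain dicts standing in for Selenium elements. The function name and the sample data are made up for illustration.

```python
def collect_complete(reviews, fields):
    """Keep only the reviews that have every field in `fields`.

    Each review is a plain dict here, standing in for a Selenium element.
    All fields are read first; the row is stored only if none is missing.
    """
    rows = []
    for review in reviews:
        try:
            row = {field: review[field] for field in fields}
        except KeyError:
            continue  # a field is missing: skip the whole review
        rows.append(row)
    return rows


sample = [
    {'title': 'Great', 'content': 'Loved it', 'rating': '9/10'},
    {'title': 'No rating', 'content': 'It was fine'},  # missing 'rating'
]
rows = collect_complete(sample, ['title', 'content', 'rating'])
print(len(rows))  # 1
```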
#Build data dictionary for dataframe
data = {'User_name': user_name,
        'Review title': title,
        'Review Rating': rating,
        'Review date' : date,
        'Review_body' : content
       }
#Build dataframe to export
review = pd.DataFrame(data = data)
movie = top50['Movie_name'][i] #grab the movie name from the master list
review['Movie_name'] = movie #create a new column holding the movie name on every row
review.to_csv(f'data/{i}.csv') #store each movie's reviews in its own file (the data folder must already exist)
driver.quit() #tell Selenium to close the browser
df_raw = pd.read_csv('data/0.csv') #Create a base dataframe from the first movie review file
for i in range(1,50):
    add = pd.read_csv(f'data/{i}.csv')
    df_raw = pd.concat([df_raw,add], ignore_index=True) #Concat the rest into main dataframe
Final dataframe for IMDB User Reviews
  • Open the website that you want to scrape and inspect it to find the HTML tag of the element you want.
  • If you need to automate some tasks (click somewhere, fill out a search field, etc…), use Selenium.
  • Then store the webpage in an object and extract the desired element by its HTML tag.
  • If you need multiple items, you can run a loop to store them in a list and create a dataframe based on that.

Hung Pham

Data Scientist | Gamer | Thinker | Adventurer