Data Scraping and Data Wrangling using Python BeautifulSoup
Scraping data from a website is one way to get valuable insight into current trends, especially since so much of today's data comes from websites such as YouTube, Facebook, Twitter, and other social media sites.
If you want to analyze which movies are currently trending and use that data for personal or business purposes, scraping a popular movie website like IMDb is the way to go.
We use Python because it is one of the most widely used languages in data science, and because it is the language I am most familiar with.
(If you have already completed this part, just skip it.)
Note: Make sure Python is already installed before pip-installing the following packages (Windows).
Open Command Prompt or cmd
Type the following:
pip install lxml
pip install numpy
pip install pandas
pip install bs4
pip install requests
Note: Make sure Python is already installed before apt-get-installing the following packages (Linux/Ubuntu).
Open the Terminal
Type the following:
sudo apt-get install python3-lxml
sudo apt-get install python3-numpy
sudo apt-get install python3-pandas
sudo apt-get install python3-bs4
sudo apt-get install python3-requests
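To verify the installation, an optional sanity check (not part of the original guide) is to import each package from a Python shell and confirm nothing errors out:
# Optional sanity check (my own addition): each import should succeed without errors
import lxml
import numpy
import pandas
import bs4
import requests
print("All packages imported successfully")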
- Go to the IMDb movie website
- Hover your mouse over the Watchlist menu
- Click the Popular Movies section
If you want to use my Jupyter notebook, use this link: IMDb-most-popular-2019
Use the Google Chrome Developer Tools to inspect the elements that hold the data on the website: right-click the page, click Inspect, and find the elements that correspond to the data we want to get. Keep in mind that you need to find the source of the data before extracting it. In the case of the IMDb movie website, the HTML structure of the first movie entry is the same as that of the remaining 99 movies; we can take advantage of that later.
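If you want to confirm that every movie entry really does share the same structure, here is a small sketch of my own (not part of the original notebook) that fetches the page and prints the HTML of the first two entries so you can compare them:
# Quick structural check (my own sketch): print the first two movie entries
# and compare their tag/class layout.
from bs4 import BeautifulSoup
from requests import get

url = "https://www.imdb.com/search/title?count=100&title_type=feature,tv_series&ref_=nv_wl_img_2"
soup = BeautifulSoup(get(url).content, "lxml")
entries = soup.find_all("div", class_="lister-item mode-advanced")
print(entries[0].prettify())   # structure of the 1st movie
print(entries[1].prettify())   # structure of the 2nd movie; should match the 1st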
Step-by-Step Jupyter Notebook Code
Import modules
import pandas as pd
import numpy as np
import re
import lxml
from bs4 import BeautifulSoup
from requests import get
%matplotlib inline
Get the page link
url= "https://www.imdb.com/search/title?count=100&title_type=feature,tv_series&ref_=nv_wl_img_2"
- Get the page data
  - Get the page using requests.get
  - Parse the page using BeautifulSoup and lxml
page = get(url)
soup = BeautifulSoup(page.content, 'lxml')
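Depending on how IMDb responds to the default requests user agent, it can also help to check the HTTP status before parsing. A minimal sketch of my own (the header value is an assumption, not from the original notebook):
# Optional robustness check (my own addition): verify the request succeeded
# before handing the content to BeautifulSoup.
headers = {"User-Agent": "Mozilla/5.0"}  # assumed value; adjust if needed
page = get(url, headers=headers)
if page.status_code != 200:
    raise RuntimeError(f"Request failed with status {page.status_code}")
soup = BeautifulSoup(page.content, "lxml")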
Get the Element or tag that holds the movie contents
content = soup.find(id="main")
soup.find("h1", class_="header")** finds the first line that has h1 tag and has a class header.
.text** gets the text of that line or that element.
.replace("\n","")** just erases \n.
articleTitle = soup.find("h1", class_="header").text.replace("\n","")
find_all returns a list of all elements that match the specified tag and attributes (e.g., a "div" with a given class).
To get the first movie only, use movieFrame[0]
movieFrame = content.find_all("div", class_="lister-item mode-advanced")
We first need to get the element that contains both the title and the date, because the tags that hold those values are too common and a direct find might not return the right one.
.find("a") returns the first element with the tag "a".
.find_all("span") returns all elements with the tag "span". Because we only want the date, we take the last one, denoted by find_all("span")[-1].
.text returns the text value of that element.
movieFirstLine = movieFrame[0].find("h3", class_="lister-item-header")
movieTitle = movieFirstLine.find("a").text
movieDate = re.sub(r"[()]","", movieFirstLine.find_all("span")[-1].text)
Finding the other fields works the same way as the first ones. Just be more specific when describing the attributes (e.g., class, id) so that the search directly returns the element we want.
movieRunTime = movieFrame[0].find("span", class_="runtime").text[:-4]
movieGenre = movieFrame[0].find("span", class_="genre").text.rstrip().replace("\n","").split(",")
movieRating = movieFrame[0].find("strong").text
movieScore = movieFrame[0].find("span", class_="metascore unfavorable").text.rstrip()
movieDesc = movieFrame[0].find_all("p", class_="text-muted")[-1].text.lstrip()
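Note that the metascore span's class carries a sentiment suffix (here "unfavorable"), so class_="metascore unfavorable" only matches that variant. A looser alternative of my own (not from the original notebook) is to match on the shared metascore class alone, which BeautifulSoup treats as "has this CSS class":
# Looser match (my own variation): class_="metascore" matches any span whose
# class list contains "metascore", regardless of the sentiment suffix.
scoreTag = movieFrame[0].find("span", class_="metascore")
movieScore = scoreTag.text.rstrip() if scoreTag else np.nan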
Movies that do not list a director are troublesome, and we anticipate that by setting the missing value to NaN using np.nan.
Getting the movie cast is a bit tricky because each movie can list any number of cast members, sometimes none, sometimes a few. That is why we need to anticipate these scenarios.
Take a look at the code:
#Movie Director and Movie Stars
movieCast = movieFrame[0].find("p", class_="")
try:
    casts = movieCast.text.replace("\n","").split('|')
    casts = [x.strip() for x in casts]
    casts = [casts[i].replace(j, "") for i,j in enumerate(["Director:", "Stars:"])]
    movieDirector = casts[0]
    movieStars = [x.strip() for x in casts[1].split(",")]
except:
    casts = movieCast.text.replace("\n","").strip()
    movieDirector = np.nan
    movieStars = [x.strip() for x in casts.split(",")]
We can filter elements by an attribute by passing the attribute name and its value in the attrs dictionary.
movieNumbers = movieFrame[0].find_all("span", attrs={"name": "nv"})
if len(movieNumbers) == 2:
    movieVotes = movieNumbers[0].text
    movieGross = movieNumbers[1].text
else:
    movieVotes = movieNumbers[0].text
    movieGross = np.nan
'''
Author: Reljod T. Oreta PUP-Manila
BSECE 5th year
'''
import lxml
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from requests import get
url1 = "https://www.imdb.com/search/title?count=100&title_type=feature,tv_series&ref_=nv_wl_img_2"
class IMDB(object):
    """docstring for IMDB"""
    def __init__(self, url):
        super(IMDB, self).__init__()
        page = get(url)
        self.soup = BeautifulSoup(page.content, 'lxml')

    def articleTitle(self):
        return self.soup.find("h1", class_="header").text.replace("\n","")

    def bodyContent(self):
        content = self.soup.find(id="main")
        return content.find_all("div", class_="lister-item mode-advanced")

    def movieData(self):
        movieFrame = self.bodyContent()
        movieTitle = []
        movieDate = []
        movieRunTime = []
        movieGenre = []
        movieRating = []
        movieScore = []
        movieDescription = []
        movieDirector = []
        movieStars = []
        movieVotes = []
        movieGross = []
        for movie in movieFrame:
            movieFirstLine = movie.find("h3", class_="lister-item-header")
            movieTitle.append(movieFirstLine.find("a").text)
            movieDate.append(re.sub(r"[()]","", movieFirstLine.find_all("span")[-1].text))
            try:
                movieRunTime.append(movie.find("span", class_="runtime").text[:-4])
            except:
                movieRunTime.append(np.nan)
            movieGenre.append(movie.find("span", class_="genre").text.rstrip().replace("\n","").split(","))
            try:
                movieRating.append(movie.find("strong").text)
            except:
                movieRating.append(np.nan)
            try:
                movieScore.append(movie.find("span", class_="metascore unfavorable").text.rstrip())
            except:
                movieScore.append(np.nan)
            movieDescription.append(movie.find_all("p", class_="text-muted")[-1].text.lstrip())
            movieCast = movie.find("p", class_="")
            try:
                casts = movieCast.text.replace("\n","").split('|')
                casts = [x.strip() for x in casts]
                casts = [casts[i].replace(j, "") for i,j in enumerate(["Director:", "Stars:"])]
                movieDirector.append(casts[0])
                movieStars.append([x.strip() for x in casts[1].split(",")])
            except:
                casts = movieCast.text.replace("\n","").strip()
                movieDirector.append(np.nan)
                movieStars.append([x.strip() for x in casts.split(",")])
            movieNumbers = movie.find_all("span", attrs={"name": "nv"})
            if len(movieNumbers) == 2:
                movieVotes.append(movieNumbers[0].text)
                movieGross.append(movieNumbers[1].text)
            elif len(movieNumbers) == 1:
                movieVotes.append(movieNumbers[0].text)
                movieGross.append(np.nan)
            else:
                movieVotes.append(np.nan)
                movieGross.append(np.nan)
        movieData = [movieTitle, movieDate, movieRunTime, movieGenre, movieRating, movieScore, movieDescription,
                     movieDirector, movieStars, movieVotes, movieGross]
        return movieData
id1 = IMDB(url1)

#Get Article Title
print(id1.articleTitle())

#Get the scraped movie data
movieData = id1.movieData()

#Print the first 5 values of the first 5 fields using a for loop
for i in range(5):
    print(movieData[i][:5])
The data we extracted from the website should be cleaned before it is used for data analysis or machine learning. That will be done in my next project, using exactly the data we have extracted so far.
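As a starting point for that cleaning step, the list of lists returned by movieData() can be assembled into a pandas DataFrame. A minimal sketch (the column names are my own choice, not from the original notebook):
# Assemble the scraped lists into a DataFrame (column names are assumptions)
columns = ["title", "date", "runtime", "genre", "rating", "metascore",
           "description", "director", "stars", "votes", "gross"]
movieDF = pd.DataFrame(dict(zip(columns, movieData)))
print(movieDF.head())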
- Reljod T. Oreta