There is a lot of information on the internet, and tons of data are generated every second. Web scraping, or web content extraction, is the practice of extracting data from websites, and it has become an essential part of many businesses. In this project, I’d like to show you some libraries that serve as building blocks for web scraping.
Urllib is a package in the Python standard library that collects several modules for working with URLs. Urllib3 is a separate, third-party HTTP client and one of the most widely downloaded packages on PyPI. It is often the first thing a web scraping script calls, since it handles the HTTP request.
Before coding the web scraper, I need to decide what data I need. Since I’m a mad movie-goer, I’d like to see which movies have been released this year (2022).
Let’s check the Metacritic website.
It shows 100 movies ranked by user score. I can see that I missed several movies on the list. Yikes! I would like to see the whole list of 100 movies along with some basic information, including release date, score, plot, and thumbnail.
To get started quickly, I import urllib3 along with ‘certifi’, which supplies the CA bundle used to validate SSL certificates and verify the identity of TLS hosts, and ‘re’, which provides regular expressions.
Then, I need to specify the URL to fetch.
Next, a urllib3 connection pool needs to be constructed.
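The three setup steps above can be sketched roughly as follows. This is a minimal sketch, not the exact code from the project: the `URL` value is only a placeholder (the exact list page is not given here), and the names `URL` and `http` are my own choices for illustration.

```python
import re  # regular expressions, used later to pull fields out of the HTML

import certifi  # provides a CA bundle for verifying TLS certificates
import urllib3

# Placeholder (assumption): replace with the exact Metacritic list page to scrape
URL = "https://www.metacritic.com/"

# Connection pool that verifies server certificates against certifi's CA bundle
http = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where())
```

Nothing is sent over the network yet. The pool only holds the settings that later requests will reuse.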
Once the pool is built, I can send a request.
I’d like to convert the response data into text format.
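Here is one way the request and the conversion to text might look. It is a sketch under a few assumptions: `fetch_html` is a hypothetical helper name, the page is assumed to be UTF-8 encoded, and the `User-Agent` header is an illustrative value, since some sites reject requests that do not look like they come from a browser.

```python
import certifi
import urllib3


def fetch_html(pool, url):
    """Send a GET request through the pool and return the body as text."""
    response = pool.request(
        "GET",
        url,
        # Assumption: a browser-like User-Agent; the value is only illustrative
        headers={"User-Agent": "Mozilla/5.0"},
    )
    if response.status != 200:
        raise RuntimeError(f"Request failed with HTTP status {response.status}")
    # response.data holds raw bytes; decode them to get a str (UTF-8 assumed)
    return response.data.decode("utf-8")


http = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where())
# Example call (placeholder URL, not executed here):
# html = fetch_html(http, "https://www.metacritic.com/")
```

Checking the status code before decoding makes a blocked or failed request show up right away instead of producing an empty result later.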
Now the setup is done. Let me capture the data I want.
I captured the title, release date, metascore, description, and thumbnail URL of each movie. Next, I need to create a dataframe from the captured data.
It’s time to check whether the data was scraped correctly. Let’s print it out.
Bam! I got it. To use this data, I’d like to export the dataframe to Excel.
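To give an idea of how the capture step works, here is a small sketch using `re.findall`. The HTML snippet and its class names are made up for illustration. The real Metacritic markup is different, so the patterns would need to be adapted to the actual page source.

```python
import re

# Hypothetical, simplified markup; NOT the real Metacritic HTML
sample_html = """
<div class="movie">
  <img class="thumb" src="https://example.com/one.jpg">
  <h3 class="title">Example One</h3>
  <span class="date">January 14, 2022</span>
  <div class="score">91</div>
  <div class="summary">A made-up plot for the first film.</div>
</div>
<div class="movie">
  <img class="thumb" src="https://example.com/two.jpg">
  <h3 class="title">Example Two</h3>
  <span class="date">April 8, 2022</span>
  <div class="score">85</div>
  <div class="summary">A made-up plot for the second film.</div>
</div>
"""

# Each pattern is anchored on a specific tag and class, so it matches one field only.
# The lazy group (.*?) stops at the first closing tag instead of running to the last.
titles = re.findall(r'<h3 class="title">(.*?)</h3>', sample_html)
dates = re.findall(r'<span class="date">(.*?)</span>', sample_html)
scores = re.findall(r'<div class="score">(\d+)</div>', sample_html)
summaries = re.findall(r'<div class="summary">\s*(.*?)\s*</div>', sample_html, flags=re.DOTALL)
thumbnails = re.findall(r'<img class="thumb" src="(.*?)"', sample_html)

print(titles)
print(scores)
```

Every list should end up with the same length, one entry per movie. If the lengths differ, one of the patterns is too loose or too strict.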
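Building the dataframe from the captured lists could look something like this. The values below are made up to keep the sketch self-contained, and the column names and the date format are my assumptions rather than what the site actually returns.

```python
import pandas as pd

# Made-up values standing in for the lists captured by the regular expressions
titles = ["Example One", "Example Two"]
dates = ["January 14, 2022", "April 8, 2022"]
scores = ["91", "85"]
summaries = ["A made-up plot for the first film.", "A made-up plot for the second film."]
thumbnails = ["https://example.com/one.jpg", "https://example.com/two.jpg"]

# One column per captured field; all lists must have the same length
df = pd.DataFrame(
    {
        "Title": titles,
        "Release Date": dates,
        "Metascore": scores,
        "Description": summaries,
        "Thumbnail": thumbnails,
    }
)

# Regex captures are strings, so convert scores and dates to proper types
df["Metascore"] = df["Metascore"].astype(int)
df["Release Date"] = pd.to_datetime(df["Release Date"], format="%B %d, %Y")

print(df.head())
```

Converting the types at this point makes later sorting by score or grouping by month straightforward.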
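The export itself can be done with pandas’ `DataFrame.to_excel`. Below is a minimal sketch: `export_to_excel` is a hypothetical helper, the file name and sheet name are placeholders, and writing `.xlsx` files assumes an Excel engine such as openpyxl is installed.

```python
from pathlib import Path

import pandas as pd


def export_to_excel(df, path):
    """Write the dataframe to an .xlsx file and return the resolved path."""
    out = Path(path)
    # index=False leaves pandas' row index out of the sheet.
    # Writing .xlsx requires an Excel engine such as openpyxl to be installed.
    df.to_excel(out, index=False, sheet_name="Movies")
    return out.resolve()


# Made-up rows standing in for the scraped data
movies = pd.DataFrame({"Title": ["Example One", "Example Two"], "Metascore": [91, 85]})
# Example call (hypothetical file name, not executed here):
# export_to_excel(movies, "movies_2022.xlsx")
```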
I noticed that a new Excel file was created at the location that I specified.
And here is the table I made in Excel.
If you click one of the thumbnails, it opens a new tab with the image.
All done! Now I can see all of the information I want at a glance.
I’d like to import the file into Power BI to create a dashboard. There’s not much information, but it’s good practice.
The dashboard has ‘Number of Movies Released’ by month, ‘Average Metascores’, and the table of movie titles and scores.
The second quarter (April, May, and June) has more releases among the top 100 movies than the other quarters. In the line chart, movies released in October have the highest average metascore, 88.8. A dotted line shows the average score of all movies in this dataset. The table on the right side lists the movies sorted by metascore in descending order.
I created a table-type tooltip to show the release date of each movie.
Web scraping is used in data science projects to support business growth in many sectors. There are many other tools for web scraping, and I explored one of them in this project. It isn’t very complicated once you’ve learned regular expressions. (If you would like to learn more about regex, here is a quick reference.) When you code a web scraper, it’s important to be as specific as possible about what you want to collect. If you keep things too vague, you will end up with a lot of data that you do not need. That’s why it took me a long time to figure out the optimal regular expression for each field I wanted. For the visualization, I could have collected more data for a more insightful analysis, although the main goal of the project was to demonstrate web scraping. Since I love using Power BI to create dashboards, I’d like to do more projects with the tool in the next post.
Thank you for reading this post. Please share any comments that could help me improve my work. I’d appreciate it.
- 11 reasons why you should use web scraping (https://www.captaindata.co/blog/11-reasons-why-use-web-scraping)
- Essential of Web scraping urllib & Requests With Python (https://analyticsindiamag.com/web-scraping-frameworks/#:~:text=Urllib3%20is%20one%20of%20the%20widely%20downloaded%20packages,exceptions%20and%20errors%20raised%20by%20the%20urllib.request%20command.)
- How to Export Pandas DataFrame to an Excel File (https://datatofish.com/export-dataframe-to-excel/)