Web scraping refers to the process of extracting the HTML source code of websites, which is then parsed to pull out useful information. It is an automation task that can quickly retrieve and display data according to our needs. And when it comes to automation, our hero is Python. In this article we will be using Python's Beautiful Soup library for web scraping.
Python's Beautiful Soup library provides methods to work with the entire HTML source code of a website and tools to extract the information we need. Before we dive into coding, we need to examine the website URL. A URL consists of two parts: the base URL and the query parameters. We will understand them using an example:
The URL above is for the Google Translate website. Here, the base URL is:
The rest are query parameters. They begin with a ‘?’ and are separated by ‘&’. In this URL, there are four query parameters:
- sl = en
- tl = es
- text = hello
- op = translate
Here, ‘sl’ and ‘tl’ specify the language codes, ‘text’ specifies the text to be translated, and ‘op’ stands for the operation, which in this case is translate. We can change these parameters in the URL to get new results. Let us change our text to bye and see the results:
We can see that our text has changed from hello to bye on the website as well. So query parameters can change the content shown on a website. This is the first step of web scraping: analyse the URL and work out how each parameter changes the content being displayed.
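The base-URL-plus-parameters structure described above can be built programmatically. A minimal sketch using the standard library's `urlencode` (the parameter names come from the Google Translate example above):

```python
from urllib.parse import urlencode

# Base URL of the Google Translate example above
base_url = "https://translate.google.com/"

# The four query parameters discussed above; changing 'text'
# changes the translation shown on the page
params = {"sl": "en", "tl": "es", "text": "bye", "op": "translate"}

url = base_url + "?" + urlencode(params)
print(url)  # https://translate.google.com/?sl=en&tl=es&text=bye&op=translate
```

Building URLs this way avoids manually concatenating `?` and `&` and handles special characters in the text automatically.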
So, let's get started with the installation of Beautiful Soup. To install it, run:
pip install beautifulsoup4
Along with this, we will also install the requests library to open URLs. To install requests, run:
pip install requests
We will perform web scraping on the following website, an online edition of the famous book “The Canterville Ghost” by Oscar Wilde:
If we navigate through the pages of the website, we notice that a number in the URL increments by 1 every time we move to the next page. We will use this to extract the content of multiple pages. But before that, we need to examine the HTML source code to identify exactly where the main content is stored. Right-click on the page and click Inspect; a side panel will open showing the page's source code.
We notice that all the content of the book is wrapped in <p> tags inside a <div> with class = ‘text text-margin’. So let's get started with the Python code:
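The extraction step can be sketched on a small inline HTML snippet that mimics this structure (the snippet below is illustrative, not the site's actual markup):

```python
from bs4 import BeautifulSoup

# Illustrative HTML mimicking the book page: <p> tags inside a
# <div class="text text-margin">, as identified via Inspect
html = """
<html><body>
  <div class="text text-margin">
    <p>The Canterville Ghost</p>
    <p>Chapter I</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# find() returns the first <div> whose class attribute matches
content = soup.find("div", class_="text text-margin")
for p in content.find_all("p"):
    print(p.get_text())
```

The same `find` call works on the real page once the HTML is fetched with requests.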
We begin by importing the required libraries. We want to extract the content of the first 10 pages of the book, so we run a for loop that repeatedly fetches and prints each page's content. We get the HTML source code using the get method of the requests library and pass it through an HTML parser. If we printed this, it would show the HTML of the entire web page, but we only need the book text, so we use the find method to search for the ‘text text-margin’ class. Yayy!! Our work is done! In this way we can extract a book in a matter of seconds. We can then write this content to a file and convert it to PDF format.
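The loop described above can be sketched as follows. Since the book's URL is not reproduced here, local HTML strings stand in for the ten fetched pages; with the real site, a call like `requests.get(f"{base_url}{page}").text` inside the loop would supply `page_html` instead:

```python
from bs4 import BeautifulSoup

# Stand-ins for the 10 book pages; replace with requests.get(...) per page
pages = [
    f'<div class="text text-margin"><p>Text of page {n}</p></div>'
    for n in range(1, 11)
]

# Extract the book text from each page and append it to a file
with open("book.txt", "w", encoding="utf-8") as f:
    for page_html in pages:
        soup = BeautifulSoup(page_html, "html.parser")
        block = soup.find("div", class_="text text-margin")
        if block:  # skip any page where the content div is missing
            f.write(block.get_text() + "\n")
```

The resulting `book.txt` holds the full extracted text, ready for conversion to PDF with a tool of your choice.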
With a few lines of code we can create our own PDF of a favourite book. This article covered a basic tutorial on scraping a static website; we can extract more from the source code using the many other tools in Beautiful Soup, or using regex. This was just one small example of web scraping; it can be used in many diverse ways to extract useful content. We can also export the content to a file or an Excel sheet, which can be converted to CSV/JSON format for advanced uses in Data Science and Machine Learning. Read my previous articles here, and stay tuned for more interesting programming and automation content :)
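As a hypothetical illustration of the JSON export mentioned above, scraped text keyed by page number can be dumped with the standard library (the page texts here are placeholders):

```python
import json

# Placeholder scraped content, keyed by page number
extracted = {1: "Text of page 1", 2: "Text of page 2"}

# Write the scraped content out as JSON
with open("book.json", "w", encoding="utf-8") as f:
    json.dump(extracted, f, indent=2)

# Read it back; note that JSON object keys are always strings
with open("book.json", encoding="utf-8") as f:
    data = json.load(f)
print(data["1"])  # Text of page 1
```

From here the data can be loaded straight into pandas or any other analysis tool.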