Learning Outcomes

  • Understand the benefits and use cases of web scraping.
  • Learn how to parse the HTML content of a webpage using BeautifulSoup to extract specific elements.
  • Learn how to scan the HTML for specific keywords.
  • Learn how to scrape multiple web pages.
  • Learn how to store your web scraped data into a pandas dataframe.
  • Learn how to save the web scraped data as a local .csv file.

The following installation commands are written for a Jupyter Notebook. If you are using a command line instead, simply omit the ! symbol.

!pip install beautifulsoup4
!pip install requests
# Library Imports
import pandas as pd
from bs4 import BeautifulSoup
import requests

Why Learn Web Scraping?

Learning web scraping is a useful skill, whether you work as a programmer, marketer or analyst.

It’s a fantastic way to analyse websites. Web scraping should never replace a dedicated crawler such as ScreamingFrog. However, when you’re creating data pipelines with Python or JavaScript scripts, you’ll likely want to write a custom scraper.

Because what’s the point of doing a website crawl if you only need a few pieces of information per page?


Once you have acquired advanced web scraping skills, you can:

  • Accurately monitor your competitors.
  • Create data pipelines that push fresh HTML data into a data warehouse such as BigQuery.
  • Blend the scraped data with other data sources such as Google Search Console or Google Analytics data.
  • Create your own APIs for websites that don’t publicly expose an API.

There are many other reasons why web scraping is a powerful skill to possess.


Challenges of Web Scraping

Firstly, every website is different, which makes it difficult to build a robust web scraper that will work on every website. You’ll likely need to create unique selectors for each website, which can be time-consuming.

Secondly, your scripts are likely to fail over time because websites change. Whenever a marketer, owner or developer makes changes to their website, it could break your script. Therefore, for larger projects it’s essential that you create a monitoring system so that you can fix these problems as they arise.


How To Web Scrape A Single HTML Page:

In order to scrape a web page in Python, or in any other programming language, we first need to download the HTML content.

The library that we’ll be using is requests.

url = 'https://www.indeed.co.uk/jobs?q=data%20scientist&l=london&start=40&advn=2102673149993430&vjk=40339845379bc411'
response = requests.get(url)
print(response)

As long as the status code is 200 (which means OK), we’ll be able to access the web page. You can always check the status code with:

print(response.status_code)
if response.status_code == 200:
   print(response)
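
A status check like the one above can be wrapped in a small helper. The following is a minimal sketch, not a definitive implementation: the function name fetch_html and the 10 second timeout are assumptions made for illustration. It returns None instead of crashing when a request fails.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as a string, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.RequestException as error:
        print(f'Request failed for {url}: {error}')
        return None
    return response.text
```

Returning None keeps a long scraping run alive when a single page is unavailable, at the cost of having to check for None before parsing.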

To access the content of a response, simply use:

# This will store the HTML content as bytes:
html_content = response.content
# This will store the HTML content as a decoded string:
html_content_string = response.text

Passing the HTML Content to a Parser

Simply downloading the HTML page is not enough, particularly if we would like to extract elements from it. Therefore we will use a Python package called BeautifulSoup, which provides a large number of methods for parsing the DOM (document object model).

In order to parse the DOM of a page, simply use:

soup = BeautifulSoup(html_content, 'html.parser')
help(soup)

We now have a BeautifulSoup object instead of a raw HTML byte string, and it provides many useful methods!
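
To see what the parsed object offers without downloading anything, here is a small sketch that parses a hypothetical byte string (sample_bytes stands in for response.content and is not taken from Indeed).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML bytes standing in for response.content.
sample_bytes = (
    b'<html><head><title>Example Page</title></head>'
    b'<body><h1>Hello</h1></body></html>'
)

sample_soup = BeautifulSoup(sample_bytes, 'html.parser')

print(type(sample_soup).__name__)  # name of the parsed document class
print(sample_soup.title.text)      # text inside the title tag
print(sample_soup.h1.text)         # text inside the first h1 tag
```

Tags can be reached as attributes of the soup object, which is handy for elements that appear only once, such as the title.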


In our example, we’ll be scraping job information from Indeed.co.uk:

  • The job will be: data scientist.
  • The area will be London.

Investigate The URL

url = 'https://www.indeed.co.uk/jobs?q=data%20scientist&l=london&start=40&advn=2102673149993430&vjk=40339845379bc411'

There can be a lot of information inside of a URL.

It’s important to be able to identify the structure of URLs and to reverse engineer how they might have been created.

  1. The base URL is the path to the jobs functionality of the website, which in this case is: https://www.indeed.co.uk/jobs
  2. Query parameters make the jobs search dynamic. In the above example they are: ?q=data%20scientist&l=london&start=40&advn=2102673149993430&vjk=40339845379bc411

Query parameters consist of:

  • The start of the query string, marked by a question mark (?).
  • A key and value for each query parameter (e.g. q=data%20scientist, l=london or start=40).
  • A separator, the ampersand symbol (&), between each key and value pair.
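
The structure described above can be checked with Python's standard library. This is a small sketch that uses urllib.parse, which is not part of the original tutorial, to split the example URL into its base and its query parameters.

```python
from urllib.parse import urlsplit, parse_qs

url = ('https://www.indeed.co.uk/jobs?q=data%20scientist&l=london'
       '&start=40&advn=2102673149993430&vjk=40339845379bc411')

parts = urlsplit(url)

# Everything before the question mark: scheme, domain and path.
base_url = f'{parts.scheme}://{parts.netloc}{parts.path}'

# Each key maps to a list of values, with percent-encoding decoded.
query_params = parse_qs(parts.query)

print(base_url)
print(query_params)
```

Note that parse_qs returns a list for every key, because a query string is allowed to repeat the same key more than once.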

Visually Inspect The Webpage In Google Chrome Dev Tools

Before jumping straight into coding, it’s worthwhile visually inspecting the HTML page content within your browser. This will give you a sense of how the website is constructed and what repeating patterns you can see within the HTML.

Google Chrome Developer Tools is a free tool that allows you to visually inspect the HTML code.

Navigate to it by:

  1. Opening up Google Chrome.
  2. Right clicking on a webpage.
  3. Clicking inspect.


Find Element By HTML ID

It is possible to select a specific HTML element by its id attribute, which is the equivalent of using the #id CSS selector.

appPromoBanner = soup.find('div', {'id':'appPromoBanner'})

Find Element By HTML Class Name

Alternatively, you can find elements by their class selector.

container_div = soup.find('div', class_='tab-container')
len(container_div)
10
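
Because the live Indeed markup changes over time, the two lookups are easier to try against a fixed snippet. The sketch below uses hypothetical HTML (sample_html) that only borrows the id and class names from the examples above, and it also shows that find returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the id and class names from the examples.
sample_html = """
<div id="appPromoBanner">Get the app</div>
<div class="tab-container"><p>Tab one</p><p>Tab two</p></div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

banner = sample_soup.find('div', id='appPromoBanner')
container = sample_soup.find('div', class_='tab-container')
missing = sample_soup.find('div', id='does-not-exist')

# find returns None when there is no match, so guard before using .text.
banner_text = banner.text if banner is not None else None
tab_count = len(container.find_all('p'))

print(banner_text, tab_count, missing)
```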

How To Extract Text From HTML Elements

As well as selecting the entire HTML element, you can also easily extract the text using BeautifulSoup.

Let’s see how this might work whilst scraping a single job advertisement:


job_url = 'https://www.indeed.co.uk/viewjob?cmp=Crowd-Link-Consulting&t=Business+Intelligence+Engineer&jk=9129263166da1718&q=data+engineer&vjs=3'
resp = requests.get(job_url)
soup = BeautifulSoup(resp.content, 'html.parser')

Extracting The Title Tag

Firstly let’s extract the title tag and then use .text to obtain the text:

title_tag_text = soup.title.text
print(title_tag_text)
Business Intelligence Engineer - Woking - Indeed.co.uk

Or we can extract the first paragraph on the webpage, then get the text for that element:

first_paragraph = soup.find('p')
print(first_paragraph.text)

Business Intelligence Engineer – Woking, Surrey


How To Extract Multiple HTML Elements

Sometimes you’ll want to store multiple elements, for example if there is a list of job advertisements on the same page. The following method will return a list of elements rather than just the first element:

# General pattern: soup.find_all(some_element)
all_paragraphs = soup.find_all('p')
print(all_paragraphs[0:3])
[<p>Business Intelligence Engineer – Woking, Surrey</p>, <p>Objective</p>, <p>This role needs to work closely with our client’s customers to turn data into critical information and knowledge that can be used to make sound business decisions. They provide data that is accurate, congruent, reliable and is easily accessible.</p>]

If we wanted to extract the text of every paragraph element, we could use a list comprehension:

all_paragraphs_text = [paragraph.text.strip() for paragraph in all_paragraphs]

It’s also possible to remove paragraph tags if they contain empty strings, by only including paragraphs which are truthy (don’t have empty strings).

# This will only return paragraphs that don't have empty strings!
full_paragraphs = [paragraph for paragraph in all_paragraphs_text if paragraph]
print(len(full_paragraphs))
12
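
The same two steps can be tried on a fixed snippet. This is a sketch with hypothetical paragraphs (sample_html), two of which are empty or contain only whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraphs; two of them carry no visible text.
sample_html = """
<p>Business Intelligence Engineer</p>
<p>   </p>
<p>Objective</p>
<p></p>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

paragraphs = sample_soup.find_all('p')

# get_text(strip=True) removes surrounding whitespace from the text.
paragraph_texts = [p.get_text(strip=True) for p in paragraphs]

# Empty strings are falsy, so this keeps only paragraphs with text.
non_empty_texts = [text for text in paragraph_texts if text]

print(non_empty_texts)
```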

How To Web Scrape Multiple HTML Pages:

If you’d like to web scrape multiple pages, simply create a for loop that builds a new BeautifulSoup object for each page.

The important things are:

  • Have a results dictionary or list(s) that is outside of the loop.
  • Extract either the result or a placeholder such as N/A or NaN (not a number). This is especially important when you’re using Python lists, as it ensures that all of your lists will always be the same length.

urls = ['http://understandingdata.com/', 'https://understandingdata.com/about-me/', 'https://understandingdata.com/contact/']

# 1. Create a results list to store all of the web scraped data:
results = []
for url in urls:
   # 2. Obtain the HTML response:
   response = requests.get(url)
   # 3. Create a BeautifulSoup object:
   soup = BeautifulSoup(response.content, 'html.parser')
   # 4. Extract the elements per URL, falling back to N/A when there is no title tag:
   title_tag = soup.title
   results.append(title_tag.text if title_tag is not None else 'N/A')
print(results)

[Image: website titles extracted with BeautifulSoup and Python]
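
The placeholder advice can be illustrated without any network access. In this sketch the pages dictionary and the extract_title helper are hypothetical; the dictionary stands in for HTML that would normally be downloaded with requests.

```python
from bs4 import BeautifulSoup

# Hypothetical pages keyed by URL, standing in for downloaded HTML.
pages = {
    'https://example.com/': '<html><head><title>Home</title></head></html>',
    'https://example.com/no-title/': '<html><body><p>No title here</p></body></html>',
}


def extract_title(html):
    """Return the page title, or 'N/A' when the page has no usable title."""
    soup = BeautifulSoup(html, 'html.parser')
    if soup.title is None or soup.title.string is None:
        return 'N/A'
    return soup.title.string.strip()


# One record per URL, so every page contributes exactly one row.
results = [{'url': url, 'title': extract_title(html)} for url, html in pages.items()]
print(results)
```

Storing one dictionary per URL, rather than several parallel lists, is one way to make sure a missing element can never shift the remaining values out of line.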

How To Scan HTML Content For Specific Keywords

Particularly in a marketing context, if one of your web pages is ranking for 5 keywords it would be beneficial to know:

  • Whether every keyword appears on a given HTML page.
  • Which keywords are present on, or missing from, the HTML page.

By writing a web scraper we can easily answer these questions at scale.
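
As a rough sketch of the idea for a single page, the snippet below checks a hypothetical list of keywords against a hypothetical HTML string. It searches the visible text returned by get_text rather than the raw HTML, which is an assumption made here to avoid matching tag names or attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical page and keyword list used only for illustration.
sample_html = (
    '<html><head><title>Understanding Data</title></head>'
    '<body><p>Learn web scraping with Python.</p></body></html>'
)
keywords = ['Understanding Data', 'web scraping', 'BigQuery']

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Lowercase both sides so that the comparison ignores case.
page_text = sample_soup.get_text(separator=' ').lower()

present = [kw for kw in keywords if kw.lower() in page_text]
missing = [kw for kw in keywords if kw.lower() not in page_text]

print('Present:', present)
print('Missing:', missing)
```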


Let’s say that our keyword is Understanding Data. We will normalise it to lowercase with .lower():

url_dict = {}

keyword = 'Understanding Data'.lower()

for url in urls:
    # Creating a new item in the dictionary:
    url_dict[url] = {'in_title': False, 'in_html': False}
    
    # Obtaining the HTML page with python requests:
    response = requests.get(url)
    
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extract the HTML content into a string and normalise it to be lowercase:
        cleaned_html_text = response.text.lower()
        # Extract the HTML elements using BeautifulSoup:
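        title_tag = soup.title
        # Normalise the title text to lowercase (empty string if there is no title tag):
        title_text = title_tag.text.lower() if title_tag is not None else ''
        # Check whether the keyword is present in the title tag and in the HTML content:
        if keyword in title_text:
            url_dict[url]['in_title'] = True
        if keyword in cleaned_html_text:
            url_dict[url]['in_html'] = True</code></pre>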
                <pre class="wp-block-preformatted">print(url_dict)</pre>
                <p>. . .</p>
                <figure class="wp-block-image">
                  <img class="lazy lazy-hidden" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" data-lazy-type="image" data-lazy-src="/wp-content/uploads/2020/11/keyword_in_html.png" alt="keyword in HTML content detection with python"><noscript><img src="/wp-content/uploads/2020/11/keyword_in_html.png" alt="keyword in HTML content detection with python"></noscript>
                </figure>
                <hr class="wp-block-separator">
                <p>Notice how easy it is to web scrape multiple pages and search both the HTML content and the title tag.</p>
                <p>This can be extended to search many more HTML elements rather than just two.</p>
                <p>If we would like to check 30 or 50 elements, it would be better to use this structure:</p>
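                <pre class="wp-block-code"><code># Compare against the lowercased title text, not the tag object itself:
title_text = title_tag.text.lower() if title_tag is not None else ''
for item_name, data in zip(['in_html', 'in_title'], [cleaned_html_text, title_text]):
    if keyword in data:
        url_dict[url][item_name] = True</code></pre>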
                <hr class="wp-block-separator">
                <h2 id="How-To-Create-A-Pandas-Dataframe-From-Web-Scraped-Data">
<span class="ez-toc-section" id="How_To_Create_A_Pandas_Dataframe_From_Web_Scraped_Data"></span>How To Create A Pandas Dataframe From Web Scraped Data<span class="ez-toc-section-end"></span>
</h2>
                <p>After web scraping and collecting your data from many web pages, it’s ideal to store it within a pandas dataframe. From there you’ll be able to push it directly to BigQuery or store it locally as a .csv.</p>
                <pre class="wp-block-preformatted"><strong>!</strong>pip install pandas</pre>
                <pre class="wp-block-preformatted"><strong>import</strong> pandas <strong>as</strong> pd</pre>
                <pre class="wp-block-preformatted">master_df = pd.DataFrame.from_dict(url_dict, orient='index')
# Move the URLs out of the index and into a named column:
master_df.reset_index(drop=False, inplace=True)
master_df.rename(columns={'index': 'URL'}, inplace=True)</pre>
                <pre class="wp-block-preformatted">master_df.head()</pre>
                <p>. . .</p>
                <figure class="wp-block-image">
                  <img class="lazy lazy-hidden" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" data-lazy-type="image" data-lazy-src="/wp-content/uploads/2020/11/how-to-store-data-into-a-pandas-dataframe.png" alt="how to store web scraped data in a pandas dataframe"><noscript><img src="/wp-content/uploads/2020/11/how-to-store-data-into-a-pandas-dataframe.png" alt="how to store web scraped data in a pandas dataframe"></noscript>
                </figure>
                <hr class="wp-block-separator">
                <h2 id="How-To-Save-The-Web-Scraped-Data-To-A-.CSV">
<span class="ez-toc-section" id="How_To_Save_The_Web_Scraped_Data_To_A_CSV"></span>How To Save The Web Scraped Data To A .CSV<span class="ez-toc-section-end"></span>
</h2>
                <p>Now that the data is inside of a pandas dataframe, we can easily save it with the following method:</p>
                <pre class="wp-block-code"><code>master_df.to_csv('file_name.csv')</code></pre>
                <pre class="wp-block-preformatted">master_df.to_csv('web_scraped_data.csv')</pre>
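                <p>For a quick check that does not depend on a live scrape, here is a minimal sketch using a hypothetical url_dict with the same shape as the one built earlier. Calling to_csv without a path returns the CSV text instead of writing a file, and index=False (an optional choice, not used in the line above) leaves out the row numbers.</p>

```python
import pandas as pd

# Hypothetical results in the same shape as url_dict.
url_dict = {
    'https://example.com/': {'in_title': True, 'in_html': True},
    'https://example.com/about/': {'in_title': False, 'in_html': True},
}

master_df = pd.DataFrame.from_dict(url_dict, orient='index')

# Name the index 'URL', then turn it into a regular column.
master_df = master_df.rename_axis('URL').reset_index()

# With no path, to_csv returns the CSV text; index=False omits row numbers.
csv_text = master_df.to_csv(index=False)
print(csv_text)
```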
                <hr class="wp-block-separator">
                <h2 id="Conclusion">
<span class="ez-toc-section" id="Conclusion"></span>Conclusion<span class="ez-toc-section-end"></span>
</h2>
                <p>Hopefully this tutorial has sparked your curiosity about web scraping. I’d recommend reviewing the following resources to learn more:</p>
                <ul>
                  <li>Beautiful Soup documentation</li>
                  <li>Python Requests documentation</li>
                </ul>
              </div>
            </div>
            <div class="media post_author_two">
              <img alt="James Phoenix" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" data-lazy-type="image" data-lazy-src="https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?#038;r=g" data-lazy-srcset="https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?#038;r=g 2x" class="lazy lazy-hidden avatar avatar-90 photo img_rounded" height="90" width="90" loading="lazy"><noscript><img alt="James Phoenix" src="https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?#038;r=g" srcset="https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?#038;r=g 2x" class="avatar avatar-90 photo img_rounded" height="90" width="90" loading="lazy"></noscript>
              <div class="media-body">
                <div class="comment_info">
                  <h3>James Phoenix</h3>
                </div>
                <p>A digital marketer turned data scientist. I love data, statistics, marketing and want to help you use analytics to drive actionable change.</p>
              </div>
            </div>
          </div>
        </div>
      </div>
    </section>
    <div data-elementor-type="wp-post" data-elementor-id="3189" class="elementor elementor-3189" data-elementor-settings="[]">
      <div class="elementor-inner">
        <div class="elementor-section-wrap">
          <section class="elementor-section elementor-top-section elementor-element elementor-element-7adf0727 elementor-section-boxed elementor-section-height-default elementor-section-height-default" data-id="7adf0727" data-element_type="section" data-settings='{"background_background":"gradient"}'>
            <div class="elementor-container elementor-column-gap-default">
              <div class="elementor-row">
                <div class="elementor-column elementor-col-33 elementor-top-column elementor-element elementor-element-67a119ba" data-id="67a119ba" data-element_type="column">
                  <div class="elementor-column-wrap elementor-element-populated">
                    <div class="elementor-widget-wrap">
                      <div class="elementor-element elementor-element-67cc80ad elementor-widget elementor-widget-text-editor" data-id="67cc80ad" data-element_type="widget" data-widget_type="text-editor.default">
                        <div class="elementor-widget-container">
                          <div class="elementor-text-editor elementor-clearfix">
                            <p>Copyright © 2019</p>
                          </div>
                        </div>
                      </div>
                    </div>
                  </div>
                </div>
              </div>
            </div>
          </section>
        </div>
      </div>
    </div>
  </div>
  <link rel="stylesheet" id="elementor-post-3189-css" href="/wp-content/cache/autoptimize/css/autoptimize_single_4782c5252a77882b9167449a2673400e.css" type="text/css" media="all">
  <link rel="stylesheet" id="elementor-post-5187-css" href="/wp-content/cache/autoptimize/css/autoptimize_single_63a91e691f5e0294a70ffcb2ce4f3d91.css" type="text/css" media="all">
  <link rel="stylesheet" id="google-fonts-1-css" href="//fonts.googleapis.com/css?family=Poppins%3A100%2C100italic%2C200%2C200italic%2C300%2C300italic%2C400%2C400italic%2C500%2C500italic%2C600%2C600italic%2C700%2C700italic%2C800%2C800italic%2C900%2C900italic%7CRoboto%3A100%2C100italic%2C200%2C200italic%2C300%2C300italic%2C400%2C400italic%2C500%2C500italic%2C600%2C600italic%2C700%2C700italic%2C800%2C800italic%2C900%2C900italic%7CRoboto+Slab%3A100%2C100italic%2C200%2C200italic%2C300%2C300italic%2C400%2C400italic%2C500%2C500italic%2C600%2C600italic%2C700%2C700italic%2C800%2C800italic%2C900%2C900italic&ver=5.5.3" type="text/css" media="all">
  <script defer src="/wp-content/cache/autoptimize/js/autoptimize_eedb0b7f0dbecf478604db1e4355bab8.js"></script>
</body>
</html>