How to retrieve a static webpage and parse information using Python

Simple way to retrieve a static webpage and parse out some information using Python

My experience with Python is quite limited - about 100 years ago I used it extensively to build software in an integration department. After setting up the environment for Python 3, my Mac should be able to run a Python script. A lot of people who want to pick up Python start by writing a script to rip a webpage, so that’s exactly what I am doing here.

Let’s download a webpage and list all image URLs in there!

What is a static webpage?

A static webpage is a page where the contents are rendered on the server and sent to the client. This means that the content is not loaded through Javascript, but all the tags and text are constructed on the server.

This website is completely static. Nothing runs on the client, so it is an ideal candidate for ripping. Even if the content is not that interesting yet :).

What do we need?

Aside from a text editor (or IDE), Python and some kind of terminal, we need to get a library called BeautifulSoup:

pip3 install beautifulsoup4 html5lib requests

Once the command above has finished rattling, we have a library that lets us parse HTML (aka The Beautiful Soup) without worrying too much about unclosed tags, encodings, or any of the other quirks HTML has. The command also pulls in html5lib, the parser we will hand to BeautifulSoup, and requests, which does the actual downloading.

Note that the BeautifulSoup package name ends with a 4. Omitting this results in errors. Don’t ask me how I know this.

Time to write some stuff!

Once everything is in place, fire up your favourite text editor (VS Code, for example), create a new file called ripper.py, and import the modules we need to list all the image URLs:

import requests
from bs4 import BeautifulSoup

URL = "https://www.stephanpoelwijk.com"       # This website

Next up, it gets more interesting - we’re getting something!

pageRequest = requests.get(URL)
soup = BeautifulSoup(pageRequest.content, "html5lib")

The first line makes a request to the server and downloads the HTML. The images, style sheets and other external references are not downloaded automatically, and links are not followed; only the HTML of the homepage is retrieved.

Since we are only listing the image URLs and are not downloading the actual images, this will do for now.
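Before handing the HTML to BeautifulSoup, it can be worth checking that the request actually succeeded. Here is a minimal sketch of how requests signals failures - it uses a hand-built Response object so it runs offline, but a real pageRequest from requests.get() behaves the same way:

```python
import requests

# Simulate a failed request with a hand-built Response object
resp = requests.Response()
resp.status_code = 404

try:
    resp.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
except requests.HTTPError as err:
    print("request failed:", err)
```

In the ripper script itself, a single pageRequest.raise_for_status() right after requests.get(URL) is enough to stop early when the page cannot be fetched.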

Because the soup variable contains the parsed HTML, we can query it for all the img tags and print their src attributes:

allImages = soup.find_all("img")  # find_all is the modern name; findAll still works

for image in allImages:
    src = image.get("src")  # .get avoids a KeyError for img tags without a src
    if src:
        print(src)

This will print the following output (currently there are only three images on the home page):

/img/logo.png
/img/hero-bg.jpeg
/img/pal.png
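The printed paths are relative to the site root. If you want absolute URLs (for downloading later, say), the standard library's urljoin can combine them with the base URL. This sketch just reuses the three paths from the output above:

```python
from urllib.parse import urljoin

URL = "https://www.stephanpoelwijk.com"

# Join each relative src with the base URL to get a full address
for src in ["/img/logo.png", "/img/hero-bg.jpeg", "/img/pal.png"]:
    print(urljoin(URL, src))
```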

Conclusion

That’s it for now. Next step is to download the images and store them locally in a tmp directory.
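As a preview, that download step could look roughly like the sketch below. This is an assumption about the next article, not its actual code, and local_path and download_image are hypothetical helpers made up for illustration:

```python
import os
from urllib.parse import urlsplit

def local_path(image_url, dest_dir="tmp"):
    # Hypothetical helper: derive a local filename from the URL path,
    # e.g. .../img/logo.png -> tmp/logo.png
    name = os.path.basename(urlsplit(image_url).path)
    return os.path.join(dest_dir, name)

def download_image(session, image_url, dest_dir="tmp"):
    # Hypothetical helper: fetch the image and write it to dest_dir.
    # session is expected to be a requests.Session (or the requests module itself).
    os.makedirs(dest_dir, exist_ok=True)
    response = session.get(image_url, timeout=10)
    response.raise_for_status()
    path = local_path(image_url, dest_dir)
    with open(path, "wb") as f:
        f.write(response.content)  # raw bytes, since images are not text
    return path
```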

Please keep in mind that not all websites allow ripping content; it is often forbidden in the Terms & Conditions, and website administrators frequently take measures to prevent it.
