Knowing how to download all of the sitemap.xml files for a particular website is an incredibly useful skill when you want to analyse that site.

Fortunately, there are Python packages that let us easily download all of a website’s sitemap.xml files by brute force!


NB: If you’re using a standard Python environment, drop the leading ! and run pip install from your terminal instead. The !pip install form is used here only because this guide was written in a Jupyter notebook.

!pip install ultimate-sitemap-parser
!pip install requests
from usp.tree import sitemap_tree_for_homepage
import requests

Download all of the sitemap.xml files based upon the URL of the homepage

The following call finds all of the sitemap files for the site and saves them to a variable called tree:

tree = sitemap_tree_for_homepage('https://understandingdata.com/')
print(tree)

sitemap_tree_for_homepage() returns a tree of AbstractSitemap subclass objects that represent the sitemap hierarchy found on a given website.
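Because the result is a tree, it can be handy to list the sitemap files themselves rather than the pages. The sketch below is one way to do that, not the library’s own method. It assumes each sitemap object has a url attribute and that index sitemaps expose their children through a sub_sitemaps attribute; iter_sitemap_urls is a hypothetical helper name introduced here for illustration.

```python
from typing import Iterator


def iter_sitemap_urls(sitemap) -> Iterator[str]:
    """Yield the URL of this sitemap and of every sitemap nested below it."""
    yield sitemap.url
    # Assumption: index sitemaps list their children in `sub_sitemaps`;
    # sitemaps that only contain pages may not have that attribute at all,
    # so fall back to an empty list.
    for child in getattr(sitemap, "sub_sitemaps", []):
        yield from iter_sitemap_urls(child)
```

With the tree from above, for sitemap_url in iter_sitemap_urls(tree): print(sitemap_url) would print one sitemap URL per line, parents before children.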

To list every page found across those sitemaps, we can simply iterate over tree.all_pages():

# all_pages() returns an Iterator
for page in tree.all_pages():
    print(page)
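For a quick overview of how a site is organised, the pages can be grouped by the first segment of their URL path. This is a minimal sketch using only the standard library. It assumes every page object has a url attribute, and count_by_section is a hypothetical helper name, not part of ultimate-sitemap-parser.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_section(pages) -> Counter:
    """Count pages per top-level path segment, e.g. 'blog' for /blog/my-post/."""
    counts = Counter()
    for page in pages:
        path = urlparse(page.url).path
        # Drop empty pieces produced by leading and trailing slashes.
        segments = [segment for segment in path.split("/") if segment]
        # Pages at the site root have no segment, so file them under "/".
        section = segments[0] if segments else "/"
        counts[section] += 1
    return counts
```

Calling count_by_section(tree.all_pages()).most_common(10) would then show the ten largest sections of the site.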

You can also save all of the URLs to a new variable via a list comprehension:

urls = [page.url for page in tree.all_pages()]
print(len(urls), urls[0:2])
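Once the URLs are in a list, you will often want to keep them for later analysis. The sketch below removes repeated URLs (a page may be listed in more than one sitemap) and writes the rest to a CSV file. The helper names unique_urls and save_urls and the file name used in the usage note are assumptions made for this example.

```python
import csv
from typing import Iterable, List


def unique_urls(urls: Iterable[str]) -> List[str]:
    """Drop repeated URLs while keeping the order in which they first appeared."""
    return list(dict.fromkeys(urls))


def save_urls(urls: Iterable[str], path: str) -> int:
    """Write one URL per row to a CSV file and return how many URLs were written."""
    rows = unique_urls(urls)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url"])  # header row
        writer.writerows([url] for url in rows)
    return len(rows)
```

For example, save_urls(urls, "sitemap_urls.csv") would write the list built above to a file in the current working directory.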

Conclusion

Now you’ll hopefully be able to find all of a website’s sitemap.xml files, and the web pages listed in them, in just a few lines of Python code!

Copyright © 2019
