Tiny basic multi-threaded web crawler in Python

If you need a simple web crawler that will scour the web for a while and download content from random sites, this code is for you.

Usage:

Where https://cnn.com is your seed site. It could be any site that contains content and links to other sites.

My colleagues described this piece of code I wrote as “Dirty”, “Iffy”, “Bad”, “Not very good”. I say it gets the job done and downloads thousands of pages from multiple sites in a matter of hours. No setup is required and there are no external imports: just run the following Python code with a seed site and sit back (or go do something else, because it could take a few hours, or days, depending on how much data you need).

tinyDirtyIffyGoodEnoughWebCrawler.py

Features:

  • Multi-threaded – for speed
  • Duplicate elimination (kinda) – for link uniqueness
  • Saves both the page source and its link – the purpose it was built for
  • FREE
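To make the features above concrete, here is a minimal Python 3 sketch of a crawler with the same three properties: worker threads, a shared "seen" set for duplicate elimination, and saving each page's source together with its link. It is not the original tinyDirtyIffyGoodEnoughWebCrawler.py. The names `extract_links`, `fetch`, `crawl` and `save_pages`, the regex-based link extraction, the 10-second timeout and the `max_pages` cap are all assumptions made for illustration.

```python
import os
import queue
import re
import threading
import urllib.request
from urllib.parse import urldefrag, urljoin

# Assumption: a crude regex is good enough to find links (no real HTML parsing).
LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def extract_links(html, base_url):
    """Return absolute http(s) links found in html, with #fragments removed."""
    links = []
    for href in LINK_RE.findall(html):
        absolute = urldefrag(urljoin(base_url, href)).url
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def fetch(url):
    """Download a page and return its text (assumed: 10 s timeout, lenient decoding)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seed, fetch=fetch, max_pages=100, num_threads=8):
    """Crawl outward from seed using worker threads; return {url: html}."""
    todo = queue.Queue()
    seen = {seed}                 # every URL ever queued, so nothing is queued twice
    pages = {}
    state = {"started": 0}        # number of fetches started, capped at max_pages
    lock = threading.Lock()
    todo.put(seed)

    def worker():
        while True:
            url = todo.get()
            try:
                if url is None:   # sentinel: no more work
                    return
                with lock:
                    if state["started"] >= max_pages:
                        continue  # budget spent, just drain the queue
                    state["started"] += 1
                try:
                    html = fetch(url)
                except Exception:
                    continue      # dirty but effective: skip pages that fail
                new_links = extract_links(html, url)
                with lock:
                    pages[url] = html
                    for link in new_links:
                        if link not in seen:
                            seen.add(link)
                            todo.put(link)
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_threads)]
    for t in threads:
        t.start()
    todo.join()                   # returns once every queued URL has been handled
    for _ in threads:
        todo.put(None)
    for t in threads:
        t.join()
    return pages


def save_pages(pages, directory):
    """Write each page to its own file, with its link on the first line."""
    os.makedirs(directory, exist_ok=True)
    for i, (url, html) in enumerate(pages.items()):
        path = os.path.join(directory, f"page_{i}.html")
        with open(path, "w", encoding="utf-8") as out:
            out.write(url + "\n" + html)
```

With a seed such as https://cnn.com, something like `save_pages(crawl("https://cnn.com", max_pages=1000), "pages")` would fetch up to 1000 pages and write them to a `pages` directory. The duplicate elimination is only "kinda" here as well: URLs are compared as strings after dropping the fragment, so `http://` versus `https://` or a trailing slash still count as different links.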

Enjoy,

Ran
