If you need a simple web crawler that will scour the web for a while and download random sites' content – this code is for you.
Usage:
$ python tinyDirtyIffyGoodEnoughWebCrawler.py https://cnn.com
Where https://cnn.com is your seed site. It could be any site that contains content and links to other sites.
My colleagues described this piece of code I wrote as “Dirty”, “Iffy”, “Bad”, “Not very good”. I say it gets the job done, downloading thousands of pages from multiple sites in a matter of hours. No setup is required and there are no external dependencies; just run the following Python code with a seed site and sit back (or go do something else, because it could take a few hours, or days, depending on how much data you need).
tinyDirtyIffyGoodEnoughWebCrawler.py
# Python 2 code: thread, Queue, urllib and urlparse are 2.x standard-library modules.
import sys, thread, Queue, re, urllib, urlparse, time

dupcheck = set()        # links we have already seen
q = Queue.Queue(100)    # work queue of links to fetch
q.put(sys.argv[1])      # seed site from the command line

def queueURLs(html, origLink):
    # Pull every href out of the page and queue it for crawling.
    for url in re.findall('''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
        # Absolute links are kept as-is; relative links are glued onto the origin's scheme://host.
        link = url.split("#", 1)[0] if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0]
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:
            dupcheck.clear()  # keep the dedup set from growing forever
        q.put(link)

def getHTML(link):
    try:
        html = urllib.urlopen(link).read()
        # Save the page, with its source URL as an HTML comment on the first line.
        open(str(time.time()) + ".html", "w").write("<!-- %s -->" % link + "\n" + html)
        queueURLs(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception:
        pass

while True:
    # Pull a link off the queue and fetch it on a new thread, one every half second.
    thread.start_new_thread(getHTML, (q.get(),))
    time.sleep(0.5)
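The script above is Python 2: thread, Queue, urllib.urlopen and urlparse don't exist under those names in Python 3. If you want to run the same idea on a modern interpreter, a rough port could look like the sketch below. This is my own untested adaptation, not part of the original script, and the snake_case names are mine.

# Rough Python 3 sketch of the same crawler (an adaptation, not the original code).
import sys, re, time, queue, threading
from urllib.request import urlopen
from urllib.parse import urlparse

dupcheck = set()        # links we have already seen
q = queue.Queue(100)    # work queue of links to fetch
q.put(sys.argv[1])      # seed site from the command line

def queue_urls(html, orig_link):
    # Pull every href out of the page and queue it for crawling.
    for url in re.findall(r'''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
        path = url.split("#", 1)[0]
        if url.startswith("http"):
            link = path
        else:
            # Relative link: glue it onto the origin's scheme://host.
            u = urlparse(orig_link)
            link = "{}://{}".format(u.scheme, u.netloc) + path
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:
            dupcheck.clear()  # keep the dedup set from growing forever
        q.put(link)

def get_html(link):
    try:
        html = urlopen(link).read().decode("utf-8", errors="replace")
        # Save the page, with its source URL as an HTML comment on the first line.
        with open(str(time.time()) + ".html", "w") as f:
            f.write("<!-- %s -->" % link + "\n" + html)
        queue_urls(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception:
        pass

while True:
    # Pull a link off the queue and fetch it on a new thread, one every half second.
    threading.Thread(target=get_html, args=(q.get(),), daemon=True).start()
    time.sleep(0.5)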
Features:
- Multi-threaded – for fastness
- Duplication elimination (kinda) – for link uniqueness
- Saves both the page source and its link – for the purpose it was built (a short snippet for reading the saved files back follows this list)
- FREE
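Because each saved file starts with an HTML comment holding the source URL, getting the link back out later is trivial. Here is a minimal sketch, assuming the "<!-- url -->" first-line format used above; the file name is just an example.

# Read one saved page back: the first line is the "<!-- url -->" comment,
# the rest is the raw HTML. The file name below is only an example.
with open("1510000000.0.html") as f:
    first_line, html = f.read().split("\n", 1)
source_url = first_line.strip()[len("<!-- "):-len(" -->")]
print(source_url)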
Enjoy,
Ran