Tiny basic multi-threaded web crawler in Python

August 12, 2015

Ran Geva

If you need a simple web crawler that will scour the web for a while and download content from random sites, this code is for you.

Usage:

$ python tinyDirtyIffyGoodEnoughWebCrawler.py https://cnn.com


Where https://cnn.com is your seed site. It could be any site that contains content and links to other sites.

My colleagues described this piece of code I wrote as “Dirty”, “Iffy”, “Bad”, “Not very good”. I say it gets the job done and downloads thousands of pages from multiple sites in a matter of hours. No setup is required and nothing outside the standard library is imported; just run the following Python 2 code with a seed site and sit back (or go do something else, because it could take a few hours or days, depending on how much data you need).

tinyDirtyIffyGoodEnoughWebCrawler.py

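import sys, thread, Queue, re, urllib, urlparse, time

dupcheck = set()      # links that have already been queued
q = Queue.Queue(100)  # bounded queue of links waiting to be fetched
q.put(sys.argv[1])    # the seed site from the command line

def queueURLs(html, origLink):
    # Pull href values out of anchor tags, drop #fragments, queue unseen links
    for url in re.findall('''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
        link = url.split("#", 1)[0] if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0]
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:  # cap memory use; old links may be revisited after this
            dupcheck.clear()
        q.put(link)

def getHTML(link):
    try:
        html = urllib.urlopen(link).read()
        # Prefix the saved page with its link; the HTML-comment wrapper is assumed
        open(str(time.time()) + ".html", "w").write("<!-- %s -->" % link + "\n" + html)
        queueURLs(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception:
        pass  # skip pages that fail to download or parse

# Start one thread per queued link, at most two per second
while True:
    thread.start_new_thread(getHTML, (q.get(),))
    time.sleep(0.5)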
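The script turns relative links into absolute ones by gluing the scheme and host of the current page onto the href, which works for root-relative links like /world but not for ones like story.html. Here is a minimal sketch of the same link-extraction step with the standard library doing the resolving. It assumes Python 3 (the script itself is Python 2), and the names extract_links and HREF_RE are made up for illustration, not part of the original code.

```python
import re
from urllib.parse import urljoin, urldefrag

# Same idea as the crawler's regex: grab the href value of anchor tags.
HREF_RE = re.compile(r'''<a[^>]+href=["']([^"']+)["']''', re.I)

def extract_links(html, base_url):
    """Return absolute http(s) links from html, fragments removed, first-seen order."""
    links = []
    seen = set()
    for href in HREF_RE.findall(html):
        # urljoin resolves relative hrefs against the page they came from;
        # urldefrag splits off the "#..." part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if not absolute.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript: and similar
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

Calling extract_links(page_source, "https://example.com/news/index.html") would hand back a list that is ready to be put on the queue. A regex is still a blunt tool for HTML, so this stays in the same "good enough" spirit as the script.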


Features:

Multi-threaded – for speed
Duplicate elimination (kinda) – for link uniqueness
Saves both the page source and its link – the purpose it was built for
FREE
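The first two features can also be sketched with a fixed pool of worker threads and a lock around the set of seen links, instead of one new thread per link. This is only an illustration, assuming Python 3; crawl, fetch_links, max_pages and workers are hypothetical names, and fetch_links stands in for whatever downloads a page and returns its outgoing links.

```python
import queue
import threading

def crawl(seed, fetch_links, max_pages=100, workers=4):
    """Visit up to max_pages distinct links, starting from seed.

    fetch_links(link) is a caller-supplied function (an assumption of this
    sketch) that returns the links found on that page.
    """
    seen = {seed}                  # every link ever queued
    seen_lock = threading.Lock()   # guards seen and visited across threads
    todo = queue.Queue()           # unbounded, so workers never block on put
    todo.put(seed)
    visited = []

    def worker():
        while True:
            link = todo.get()
            if link is None:       # sentinel: no more work
                todo.task_done()
                return
            try:
                for out in fetch_links(link):
                    with seen_lock:
                        if out in seen or len(seen) >= max_pages:
                            continue
                        seen.add(out)
                    todo.put(out)
                with seen_lock:
                    visited.append(link)
            except Exception:
                pass               # skip pages that fail, as the script does
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    todo.join()                    # wait until every queued link is handled
    for _ in threads:
        todo.put(None)             # tell each worker to exit
    for t in threads:
        t.join()
    return visited
```

Unlike the script, this version never forgets links it has seen, so it stops at max_pages rather than running until you kill it.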

Enjoy,

Ran
