If you need a simple web crawler that will scour the web for a while and download random sites' content – this code is for you.
Usage:
$ python tinyDirtyIffyGoodEnoughWebCrawler.py https://cnn.com
Where https://cnn.com is your seed site. It could be any site that contains content and links to other sites.
My colleagues described this piece of code I wrote as “Dirty”, “Iffy”, “Bad”, “Not very good”. I say it gets the job done, downloading thousands of pages from multiple sites in a matter of hours. No setup is required and there are no external dependencies; just run the following Python code with a seed site and sit back (or go do something else, because it could take a few hours, or days, depending on how much data you need).
tinyDirtyIffyGoodEnoughWebCrawler.py
# Python 2 code: thread, Queue, urllib and urlparse are 2.x standard-library modules.
import sys, thread, Queue, re, urllib, urlparse, time

dupcheck = set()        # links we have already seen
q = Queue.Queue(100)    # work queue of links to fetch
q.put(sys.argv[1])      # seed site from the command line

def queueURLs(html, origLink):
    # Pull every href out of the page and queue it for crawling.
    for url in re.findall('''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
        # Absolute links are kept as-is; relative links are glued onto the origin's scheme://host.
        link = url.split("#", 1)[0] if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0]
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:
            dupcheck.clear()  # keep the dedup set from growing forever
        q.put(link)

def getHTML(link):
    try:
        html = urllib.urlopen(link).read()
        # Save the page, with its source URL as an HTML comment on the first line.
        open(str(time.time()) + ".html", "w").write("<!-- %s -->" % link + "\n" + html)
        queueURLs(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception:
        pass

while True:
    # Pull a link off the queue and fetch it on a new thread, one every half second.
    thread.start_new_thread(getHTML, (q.get(),))
    time.sleep(0.5)
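The script above is Python 2: thread, Queue, urllib.urlopen and urlparse don't exist under those names in Python 3. If you want to run the same idea on a modern interpreter, a rough port could look like the sketch below. This is my own untested adaptation, not part of the original script, and the snake_case names are mine.

# Rough Python 3 sketch of the same crawler (an adaptation, not the original code).
import sys, re, time, queue, threading
from urllib.request import urlopen
from urllib.parse import urlparse

dupcheck = set()        # links we have already seen
q = queue.Queue(100)    # work queue of links to fetch
q.put(sys.argv[1])      # seed site from the command line

def queue_urls(html, orig_link):
    # Pull every href out of the page and queue it for crawling.
    for url in re.findall(r'''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
        path = url.split("#", 1)[0]
        if url.startswith("http"):
            link = path
        else:
            # Relative link: glue it onto the origin's scheme://host.
            u = urlparse(orig_link)
            link = "{}://{}".format(u.scheme, u.netloc) + path
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:
            dupcheck.clear()  # keep the dedup set from growing forever
        q.put(link)

def get_html(link):
    try:
        html = urlopen(link).read().decode("utf-8", errors="replace")
        # Save the page, with its source URL as an HTML comment on the first line.
        with open(str(time.time()) + ".html", "w") as f:
            f.write("<!-- %s -->" % link + "\n" + html)
        queue_urls(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception:
        pass

while True:
    # Pull a link off the queue and fetch it on a new thread, one every half second.
    threading.Thread(target=get_html, args=(q.get(),), daemon=True).start()
    time.sleep(0.5)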
Features:
- Multi-threaded – for fastness
- Duplication elimination (kinda) – for link uniqueness
- Saves both the page source and its link – for the purpose it was built (a short snippet for reading the saved files back follows this list)
- FREE
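Because each saved file starts with an HTML comment holding the source URL, getting the link back out later is trivial. Here is a minimal sketch, assuming the "<!-- url -->" first-line format used above; the file name is just an example.

# Read one saved page back: the first line is the "<!-- url -->" comment,
# the rest is the raw HTML. The file name below is only an example.
with open("1510000000.0.html") as f:
    first_line, html = f.read().split("\n", 1)
source_url = first_line.strip()[len("<!-- "):-len(" -->")]
print(source_url)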
Enjoy,
Ran