In my previous post I wrote about a very basic web crawler that can randomly scour the web and mirror/download websites. Today I want to share a very simple script that can extract structured data from (almost) any website.
Use the following script to extract specific information from any website (e.g. prices, IDs, titles, phone numbers, etc.). Populate the "fields" dictionary with the names and patterns (regular expressions) of the data you want to extract. In this example, I extract the product titles, prices, ratings and images from Amazon.com.
import thread, Queue, re, urllib2, urlparse, time, csv

### Set the site you want to crawl & the patterns of the fields you want to extract ###
siteToCrawl = "https://www.amazon.com/"
fields = {}
fields["Title"] = '<title>(.*?)</title>'
fields["Rating"] = 'title="(\S+) out of 5 stars"'
fields["Price"] = 'data-price="(.*?)"'
fields["Image"] = 'src="(https://ecx.images-amazon.com/images/I/.*?)"'
########################################################################

dupcheck = set()                      # links we have already seen
q = Queue.Queue(25)                   # bounded queue of links waiting to be crawled
q.put(siteToCrawl)

csvFile = open("output.csv", "w", 0)  # unbuffered, overwritten on every run
csvTitles = dict(fields)
csvTitles["Link"] = ""
writer = csv.DictWriter(csvFile, fieldnames=csvTitles)
writer.writeheader()

def queueURLs(html, origLink):
    for url in re.findall('''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
        try:
            # Make sure we keep crawling the same domain
            if url.startswith("http") and urlparse.urlparse(url).netloc != urlparse.urlparse(siteToCrawl).netloc:
                continue
        except Exception:
            continue
        # Strip fragments; turn relative links into absolute ones
        link = url.split("#", 1)[0] if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0]
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:
            dupcheck.clear()
        q.put(link)

def analyzePage(html, link):
    print "Analyzing %s" % link
    output = {}
    for key, value in fields.iteritems():
        m = re.search(fields[key], html, re.I | re.S)
        if m:
            output[key] = m.group(1)
    output["Link"] = link
    writer.writerow(output)

def getHTML(link):
    try:
        request = urllib2.Request(link)
        request.add_header('User-Agent', 'Structured Data Extractor')
        html = urllib2.build_opener().open(request).read()
        analyzePage(html, link)
        queueURLs(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception, e:
        print e

while True:
    thread.start_new_thread(getHTML, (q.get(),))
    time.sleep(0.5)
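Adding one more entry to "fields" before the crawl starts is all it takes to capture another value. The snippet below is only an illustrative sketch: the "data-phone" attribute is hypothetical and not something Amazon actually exposes, so adjust the pattern to the markup of your target site.

# Hypothetical example: capture a phone number exposed in a data attribute.
# "data-phone" is made up for illustration; change the pattern to match
# the real markup of the site you are crawling.
fields["Phone"] = 'data-phone="(.*?)"'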
Some notes:
- I have set a user-agent header, as some websites block crawlers that don't send one
- No external libraries are required; everything used comes from the Python standard library
- You can define as many fields to extract as you'd like; the field name is the key in the "fields" dictionary
- As I use regular expressions to define where the content is, no DOM parsing is performed, so malformed HTML pages are not an issue
- Each time you run the script it will overwrite the content of output.csv (see the sketch after this list if you'd rather append)
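If you'd rather keep results across runs, one option is to open the file in append mode and only write the header on the first run. This is just a minimal sketch against the script above (Python 2, reusing the same csvTitles dictionary), not something the script does out of the box:

import os, csv

# Sketch: append instead of truncate, and only write the header when the
# file is new. Assumes csvTitles is defined as in the script above.
fileExists = os.path.exists("output.csv")
csvFile = open("output.csv", "a", 0)   # "a" appends instead of overwriting
writer = csv.DictWriter(csvFile, fieldnames=csvTitles)
if not fileExists:
    writer.writeheader()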
Enjoy,
Ran