In my previous post I wrote about a very basic web crawler that can randomly scour the web and mirror/download websites. Today I want to share a very simple script that can extract structured data from (almost) any website.
Use the following script to extract specific information from any website (e.g. prices, IDs, titles, phone numbers, etc.). Populate the "fields" dictionary with the names and patterns (regular expressions) of the data you want to extract. In this example, I extract the product titles, prices, ratings and images from Amazon.com.
import thread, Queue, re, urllib2, urlparse, time, csv

### Set the site you want to crawl & the patterns of the fields you want to extract ###
siteToCrawl = "https://www.amazon.com/"
fields = {}
fields["Title"] = '<title>(.*?)</title>'
fields["Rating"] = 'title="(\S+) out of 5 stars"'
fields["Price"] = 'data-price="(.*?)"'
fields["Image"] = 'src="(https://ecx.images-amazon.com/images/I/.*?)"'
########################################################################

dupcheck = set()                      # links we have already seen
q = Queue.Queue(25)                   # bounded queue of links waiting to be crawled
q.put(siteToCrawl)

csvFile = open("output.csv", "w", 0)  # unbuffered, overwritten on every run
csvTitles = dict(fields)
csvTitles["Link"] = ""
writer = csv.DictWriter(csvFile, fieldnames=csvTitles)
writer.writeheader()

def queueURLs(html, origLink):
    for url in re.findall('''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
        try:
            # Make sure we keep crawling the same domain
            if url.startswith("http") and urlparse.urlparse(url).netloc != urlparse.urlparse(siteToCrawl).netloc:
                continue
        except Exception:
            continue
        # Strip fragments; turn relative links into absolute ones
        link = url.split("#", 1)[0] if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0]
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:
            dupcheck.clear()
        q.put(link)

def analyzePage(html, link):
    print "Analyzing %s" % link
    output = {}
    for key, value in fields.iteritems():
        m = re.search(fields[key], html, re.I | re.S)
        if m:
            output[key] = m.group(1)
    output["Link"] = link
    writer.writerow(output)

def getHTML(link):
    try:
        request = urllib2.Request(link)
        request.add_header('User-Agent', 'Structured Data Extractor')
        html = urllib2.build_opener().open(request).read()
        analyzePage(html, link)
        queueURLs(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception, e:
        print e

while True:
    thread.start_new_thread(getHTML, (q.get(),))
    time.sleep(0.5)
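Adding one more entry to "fields" before the crawl starts is all it takes to capture another value. The snippet below is only an illustrative sketch: the "data-phone" attribute is hypothetical and not something Amazon actually exposes, so adjust the pattern to the markup of your target site.

# Hypothetical example: capture a phone number exposed in a data attribute.
# "data-phone" is made up for illustration; change the pattern to match
# the real markup of the site you are crawling.
fields["Phone"] = 'data-phone="(.*?)"'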
Some notes:
- I have set a user-agent header, as some websites block crawlers that don't send one
- No external libraries are required; everything used comes from the Python standard library
- You can define as many fields to extract as you'd like; the field name is the key in the "fields" dictionary
- As I use regular expressions to define where the content is, no DOM parsing is performed, so malformed HTML pages are not an issue
- Each time you run the script it will overwrite the content of output.csv (see the sketch after this list if you'd rather append)
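If you'd rather keep results across runs, one option is to open the file in append mode and only write the header on the first run. This is just a minimal sketch against the script above (Python 2, reusing the same csvTitles dictionary), not something the script does out of the box:

import os, csv

# Sketch: append instead of truncate, and only write the header when the
# file is new. Assumes csvTitles is defined as in the script above.
fileExists = os.path.exists("output.csv")
csvFile = open("output.csv", "a", 0)   # "a" appends instead of overwriting
writer = csv.DictWriter(csvFile, fieldnames=csvTitles)
if not fileExists:
    writer.writeheader()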
Enjoy,
Ran