How to Extract Structured Data from (Almost) Any Website
In my previous post I described a very basic web crawler that can randomly scour the web and mirror/download websites. Today I want to share a very simple script that can extract structured data from (almost) any website.
Use the following script to extract specific information from any website (e.g. prices, IDs, titles, phone numbers). Populate the "fields" parameter with the names and patterns (regular expressions) of the data you want to extract. In this example, I extract product names, prices, ratings, and images from Amazon.com.
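If you're not sure a pattern captures what you want, it's worth sanity-checking it against a small snippet before letting the crawler loose. A minimal sketch (the HTML sample below is made up for illustration, not Amazon's actual markup):

import re

# Made-up HTML snippet, just to verify the pattern captures the value we expect
sample = '<span data-price="19.99">$19.99</span>'
pattern = 'data-price="(.*?)"'

m = re.search(pattern, sample, re.I | re.S)
if m:
    print(m.group(1))  # prints: 19.99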
import sys, thread, Queue, re, urllib2, urlparse, time, csv

### Set the site you want to crawl & the patterns of the fields you want to extract ###
siteToCrawl = "https://www.amazon.com/"
fields = {}
fields["Title"] = '<title>(.*?)</title>'
fields["Rating"] = 'title="(\S+) out of 5 stars"'
fields["Price"] = 'data-price="(.*?)"'
fields["Image"] = 'src="(https://ecx.images-amazon.com/images/I/.*?)"'
########################################################################

dupcheck = set()  # URLs we have already queued
q = Queue.Queue(25)
q.put(siteToCrawl)

csvFile = open("output.csv", "w", 0)  # unbuffered, so rows appear immediately
csvTitles = dict(fields)
csvTitles["Link"] = ""  # one CSV column per field, plus the page link
writer = csv.DictWriter(csvFile, fieldnames=csvTitles)
writer.writeheader()

def queueURLs(html, origLink):
    for url in re.findall('<a[^>]+href=["\']([^"\']+)["\']', html, re.I):
        try:
            if url.startswith("http") and urlparse.urlparse(url).netloc != urlparse.urlparse(siteToCrawl).netloc:  # Make sure we keep crawling the same domain
                continue
        except Exception:
            continue
        # Drop fragments; resolve relative links against the originating page
        link = url.split("#", 1)[0] if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0]
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:
            dupcheck.clear()
        q.put(link)

def analyzePage(html, link):
    print "Analyzing %s" % link
    output = {}
    for key, pattern in fields.iteritems():
        m = re.search(pattern, html, re.I | re.S)
        if m:
            output[key] = m.group(1)
    output["Link"] = link
    writer.writerow(output)

def getHTML(link):
    try:
        request = urllib2.Request(link)
        request.add_header('User-Agent', 'Structured Data Extractor')
        html = urllib2.build_opener().open(request).read()
        analyzePage(html, link)
        queueURLs(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception, e:
        print e

while True:
    thread.start_new_thread(getHTML, (q.get(),))
    time.sleep(0.5)
Some notes:
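The CSV file is opened unbuffered, so rows are written as soon as each page is analyzed and you can inspect output.csv while the crawler is still running. A minimal sketch using the standard csv module (the column names match the fields defined above):

import csv

# Read back whatever rows the crawler has written so far
with open("output.csv") as f:
    for row in csv.DictReader(f):
        print("%s -> %s" % (row.get("Link"), row.get("Price")))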
Enjoy,
Ran