Dead simple {for devs} python crawler (script) for extracting structured data from any website into CSV

In my previous post I wrote about a very basic web crawler that can randomly scour the web and mirror/download websites. Today I want to share a very simple script that can extract structured data from almost any website.

Use the following script to extract specific information from any website (e.g. prices, IDs, titles, phone numbers, etc.). Populate the “fields” parameter with the names and the regular-expression patterns of the data you want to extract. In this example, I extract product names, prices, ratings, and images from Amazon.com.
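
Here is a minimal Python 3 sketch of the script. The Amazon URL and the regular expressions below are illustrative placeholders; page markup changes often, so swap in patterns that match whatever you want to extract:

```python
# Minimal structured-data extractor: fetch a page, run a set of regular
# expressions over the raw HTML, and write the matches to output.csv.
# Uses only the standard library.
import csv
import re
import urllib.request

# The page to crawl and the fields to extract. Each field maps a column
# name to a regular expression with a single capture group.
# NOTE: the URL and patterns below are illustrative placeholders.
URL = "https://www.amazon.com/s?k=laptops"
fields = {
    "name":   r'<span class="a-size-medium[^"]*">([^<]+)</span>',
    "price":  r'<span class="a-price-whole">([^<]+)</span>',
    "rating": r'<span class="a-icon-alt">([^<]+)</span>',
    "image":  r'<img[^>]+class="s-image"[^>]+src="([^"]+)"',
}

# Some sites block requests that don't send a user agent, so set one.
request = urllib.request.Request(URL, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(request).read().decode("utf-8", errors="ignore")

# Run every pattern against the raw HTML -- no DOM parsing involved.
columns = list(fields)
extracted = {name: re.findall(pattern, html) for name, pattern in fields.items()}

# Write the results to output.csv, overwriting it on every run.
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    for row in zip(*(extracted[name] for name in columns)):
        writer.writerow(row)
```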

Some notes:

  • I have set a user-agent string, as some websites block crawlers that don’t send one
  • No external imports are required
  • You can define as many fields to extract as you’d like. The field name is the “key” in the “fields” parameter
  • As I use regular expressions to define where the content is, no DOM parsing is performed, so malformed HTML pages are not an issue.
  • Each time you run the script it will overwrite the content in output.csv

Enjoy,

Ran
