Crawling Horrors – RSS Crawlers

One of the fastest, simplest and, unfortunately, wrong ways of extracting content from a website is reading its RSS feeds. I will show you how it’s done and why it’s useless.

Each RSS feed already contains the data, structured and ready for harvesting, so content extraction is indeed simple and fast. Take the RSS feed from TechCrunch as an example. You can often find the RSS feed URL by reading the <link rel="alternate" type="application/rss+xml"…> tag in the site’s main HTML page; in TechCrunch’s case, it’s https://techcrunch.com/feed/. The output is an XML document made up of <item> elements, within which you can find the author name, the post date, images and even part of the content.
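To make the steps above concrete, here is a minimal sketch using only the Python standard library. It assumes the feed is RSS 2.0 and that the author sits in a dc:creator element, with a plain author element as a fallback; individual feeds may differ. The names FeedLinkFinder, find_feed_urls and parse_items, along with the example.com sample data, are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Dublin Core namespace; assumed here to carry the author as <dc:creator>.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate" type="application/rss+xml">."""

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        mime = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        if "alternate" in rel and mime == "application/rss+xml" and href:
            self.feed_urls.append(href)


def find_feed_urls(html):
    """Return the RSS feed URLs advertised in an HTML page."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feed_urls


def parse_items(rss_xml):
    """Return one dict per <item> with the fields the feed exposes."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "author": item.findtext(DC_NS + "creator") or item.findtext("author"),
            "published": item.findtext("pubDate"),
            # Usually only an excerpt, not the full article body.
            "summary": item.findtext("description"),
        })
    return items


# Hypothetical sample data, not taken from any real site.
SAMPLE_HTML = (
    '<html><head><link rel="alternate" type="application/rss+xml" '
    'href="https://example.com/feed/"></head><body></body></html>'
)
SAMPLE_RSS = (
    '<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel>'
    "<item><title>Hello</title><link>https://example.com/hello</link>"
    "<dc:creator>Jane Doe</dc:creator>"
    "<pubDate>Mon, 01 Jan 2024 10:00:00 +0000</pubDate>"
    "<description>First two lines of the post</description></item>"
    "</channel></rss>"
)

feed_urls = find_feed_urls(SAMPLE_HTML)
items = parse_items(SAMPLE_RSS)
```

For a live site you would download the main page and the feed first and pass their contents to these two functions. Note that the summary field holds whatever the publisher chose to put in the feed, which is often just an excerpt.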

So why is this wrong, you ask? Because getting only part of the content defeats the purpose of a good crawler. Getting 2-3 lines out of the complete article is useless, not to mention that you don’t get the comments on the article (some sites provide a comments feed, but again it contains only a fraction of each comment’s content).

True, it’s fast, simple and very low on bandwidth, and you get structured data, but you don’t get the complete data, and in my book that disqualifies this method as a valid crawling option. You can use an RSS crawler as a starting point to discover article URLs, but not as a content extractor.
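Using the feed purely for URL discovery might look like the sketch below. It is one possible arrangement, not a complete crawler: discover_article_urls, fetch_full_page, the example-crawler/0.1 user agent string and the sample feed are all hypothetical, and extracting the article body and comments from the downloaded page is left out entirely.

```python
import urllib.request
import xml.etree.ElementTree as ET


def discover_article_urls(rss_xml, seen):
    """Return article links from the feed that are not already in `seen`.

    `seen` is a set that is updated in place, so repeated polls of the
    same feed only yield URLs that have not been queued before.
    """
    root = ET.fromstring(rss_xml)
    new_urls = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            new_urls.append(link)
    return new_urls


def fetch_full_page(url, timeout=10):
    """Download the complete article page; the feed only carried an excerpt."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-crawler/0.1"}  # assumed UA string
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Hypothetical feed with a duplicated link, to show the de-duplication.
SAMPLE_FEED = (
    '<rss version="2.0"><channel>'
    "<item><link>https://example.com/a</link></item>"
    "<item><link>https://example.com/b</link></item>"
    "<item><link>https://example.com/a</link></item>"
    "</channel></rss>"
)

seen_urls = set()
first_poll = discover_article_urls(SAMPLE_FEED, seen_urls)
second_poll = discover_article_urls(SAMPLE_FEED, seen_urls)
```

Each URL returned by discover_article_urls would then be passed to fetch_full_page, or to whatever page crawler you already have, so the feed serves only as a cheap index of what is new.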

Ran Geva

CEO
