Crawling Horrors – RSS Crawlers

One of the fastest, simplest and, unfortunately, wrong ways of extracting content from a website is to read its RSS feeds. I will show you how it’s done and why it’s useless.

Each RSS feed already contains the data, structured and ready for harvesting, so content extraction is indeed simple and fast. Take, for example, the RSS feed from TechCrunch. You can often find the feed URL by reading the <link rel="alternate" type="application/rss+xml"…> tag in the site’s main HTML page; in TechCrunch’s case, it’s https://techcrunch.com/feed/. The output is an XML document containing <item> elements, within which you can find the author name, the post date, images and even part of the content.
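To make the mechanics concrete, here is a minimal sketch in Python that uses only the standard library. It assumes a typical RSS 2.0 feed in which the author is carried in a <dc:creator> element; the sample HTML, the sample feed, the example.com URLs and the helper names (find_feed_urls, parse_items) are made up for illustration, and a real feed may use different fields.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Dublin Core namespace, commonly used for <dc:creator> (an assumption about the feed).
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical inputs, standing in for a real home page and a real feed.
SAMPLE_HTML = (
    '<html><head><link rel="alternate" type="application/rss+xml" '
    'href="https://example.com/feed/"></head><body></body></html>'
)
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first-post/</link>
      <dc:creator>Jane Doe</dc:creator>
      <pubDate>Mon, 01 Jan 2024 10:00:00 +0000</pubDate>
      <description>Only the first two or three lines of the article</description>
    </item>
  </channel>
</rss>"""


class FeedLinkFinder(HTMLParser):
    """Collects the href of every <link rel="alternate" type="application/rss+xml">."""

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (
            tag == "link"
            and attributes.get("rel") == "alternate"
            and attributes.get("type") == "application/rss+xml"
            and attributes.get("href")
        ):
            self.feed_urls.append(attributes["href"])


def find_feed_urls(html):
    """Return the RSS feed URLs advertised in an HTML page."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feed_urls


def parse_items(feed_xml):
    """Return one dict per <item>; fields missing from the feed come back as None."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "author": item.findtext("dc:creator", namespaces=NS),
            "published": item.findtext("pubDate"),
            "summary": item.findtext("description"),
        }
        for item in root.iter("item")
    ]
```

Calling find_feed_urls(SAMPLE_HTML) yields the advertised feed URL, and parse_items(SAMPLE_FEED) yields the structured fields of each item. Note that the "summary" field holds only whatever excerpt the publisher chose to put in the feed.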

So why is this wrong, you ask? Because getting only part of the content defeats the purpose of a good crawler. Getting 2-3 lines out of the complete article is useless, not to mention that you don’t get the comments for the article (some sites provide a comments feed, but again it contains only a fraction of the comment content).

True, it’s fast, simple, very low on bandwidth, and you get structured data, but you don’t get the complete data, and in my book that disqualifies this method as a valid crawling option. You can use an RSS crawler as a starting point to discover article URLs, but not as a content extractor.
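The "starting point" role can be sketched as follows: the feed is used only to collect article URLs, and the full page behind each URL is downloaded separately for real content extraction. This is a rough illustration under assumptions, not a complete crawler; the function names, the User-Agent string and the timeout are hypothetical choices.

```python
import urllib.request
import xml.etree.ElementTree as ET


def discover_article_urls(feed_xml):
    """Return the unique <link> values of all <item> elements, in feed order."""
    root = ET.fromstring(feed_xml)
    seen = set()
    urls = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            urls.append(link)
    return urls


def fetch_html(url, timeout=10):
    """Download a full page; extraction should run on this HTML, not on the feed excerpt."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "example-crawler/0.1"},  # hypothetical identifier
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A crawler built this way would pass the downloaded feed to discover_article_urls and then call fetch_html on each returned URL, handing the complete HTML (article body, comments and all) to whatever extraction step follows.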
