So if RSS crawlers are bad and browser scraping isn’t efficient, what about computer vision web-page analyzers? This technology uses machine learning and computer vision to extract information from web pages by interpreting them visually, much as a human would.
Computer vision crawlers present some great advantages over RSS/browser or even code-based crawlers. They offer simplicity when it comes to DIY crawlers, i.e. letting non-developers teach the system what content needs to be extracted. In many cases they do a decent job of extracting structured content from sources they have no prior knowledge of.
So how am I going to ruin this one for you? Well, they suffer from some of the downfalls of browser-based crawlers:
- They are slow, heavy, and resource-hungry, as they have to download all the content and render the page to “see” it.
- They won’t know what to do if the content is revealed only by an action (like clicking on a comment to show it).
- In many cases, if the page is complicated enough, e.g. a discussion thread or a site with dynamic ads, they can get “confused” and extract the wrong content.
As with many machine learning systems, there is a precision/recall tradeoff: if you want high precision (and you do), your recall will be low, meaning that for many pages you won’t be able to extract the right content at all.
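To make the tradeoff concrete, here is a minimal sketch with made-up numbers (a hypothetical extractor confidence score per page, not data from any real system): raising the confidence threshold at which you accept an extraction pushes precision up but drags recall down.

```python
# Toy illustration of the precision/recall tradeoff for a hypothetical
# extractor. Each tuple is (confidence score, was the extraction correct?).
# The scores and labels are invented for illustration only.
pages = [
    (0.95, True), (0.92, True), (0.90, True), (0.85, True), (0.80, False),
    (0.70, True), (0.60, False), (0.50, True), (0.40, False), (0.30, False),
]

def precision_recall(threshold):
    # Only keep extractions whose confidence clears the threshold.
    accepted = [correct for score, correct in pages if score >= threshold]
    true_positives = sum(accepted)
    total_correct = sum(correct for _, correct in pages)
    precision = true_positives / len(accepted) if accepted else 1.0
    recall = true_positives / total_correct
    return precision, recall

for threshold in (0.3, 0.6, 0.9):
    p, r = precision_recall(threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

With these toy numbers, a low threshold extracts something from every page but a lot of it is wrong, while a high threshold gets everything right but silently drops half the pages, which is exactly the failure mode that hurts at scale.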
Computer vision crawlers are great for DIY missions and for specific sites that all look about the same, but I’m afraid not for large-scale, precise crawling.