Crawling Horrors – Browser Scraping
In my previous blog post, I wrote about RSS crawlers, and why they don’t really work. In this post I want to discuss the technique of using a headless browser to parse a website and extract its content.
A headless browser is a web browser without a graphical user interface. The logic behind using a browser is solid. The browser will do all the rendering, manage AJAX requests and parse the DOM. Once the DOM is parsed, we could use XQuery or HTQL to extract the content we want. Simple? Yes. Easy? Kinda. Good practice? Nope!
I wouldn’t recommend this technique for the following reasons:
- It’s unreliable for two reasons:
  - If the HTML isn’t valid, or is poorly written, many DOM parsers will fail and you won’t be able to use a query language to extract the content.
  - Even the slightest change in the site’s layout will break your DOM query.
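To make those two failure modes concrete, here is a minimal Python sketch. It is not a real crawler: the markup samples, the `QUERY` path and the `extract` helper are all made up for illustration. The standard library's strict XML parser stands in for a DOM parser, and its limited XPath support stands in for a query language such as XQuery or HTQL. Real HTML parsers are often more forgiving, so treat this as a demonstration of the idea rather than of any specific tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, as a headless browser might hand it over after rendering.
ORIGINAL = (
    "<html><body><div class='post'>"
    "<h1>Title</h1><p>Body text</p>"
    "</div></body></html>"
)

# The same page after a small redesign: the post is wrapped in an extra
# <section> and its class is renamed from 'post' to 'article'.
REDESIGNED = (
    "<html><body><section><div class='article'>"
    "<h1>Title</h1><p>Body text</p>"
    "</div></section></body></html>"
)

# Invalid markup: the <p> element is never closed.
BROKEN = "<html><body><div class='post'><p>Body text</div></body></html>"

# A query written against the ORIGINAL layout (hypothetical).
QUERY = "./body/div[@class='post']/p"


def extract(markup):
    """Return the text matched by QUERY, or None if parsing or matching fails."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # Failure mode 1: the parser rejects invalid markup outright.
        return None
    node = root.find(QUERY)
    # Failure mode 2: the markup parses, but the query no longer matches.
    return node.text if node is not None else None


if __name__ == "__main__":
    for name, markup in [("original", ORIGINAL),
                         ("redesigned", REDESIGNED),
                         ("broken", BROKEN)]:
        print(name, "->", extract(markup))
```

Only the original layout yields the content. The redesigned page parses fine but the query silently matches nothing, and the broken page never gets as far as the query.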
One last thing: even if you skip the browser, download only the HTML and parse it with just a DOM parser, you will still face the same reliability problems described above.
Conclusion: don’t rely on a headless browser for crawling!