Crawling Horrors – Browser Scraping
In my previous blog post, I wrote about RSS crawlers, and why they don’t really work. In this post I want to discuss the technique of using a headless browser to parse a website and extract its content.
A headless browser is a web browser without a graphical user interface. The logic behind using one is solid: the browser does all the rendering, handles AJAX requests and parses the DOM. Once the DOM is parsed, we can use a query language such as XPath, XQuery or HTQL to extract the content we want. Simple? Yes. Easy? Kinda. Good practice? Nope!
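To make the extraction step concrete, here is a minimal sketch in Python. It assumes the browser has already rendered the page and handed back well-formed markup; the page content, the `extract_post` helper and the queries are all made up for illustration, and the standard library's ElementTree (which supports only a small XPath subset) stands in for whichever query language you would actually use.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for the DOM a headless
# browser would hand back after rendering the page.
RENDERED_PAGE = """
<html>
  <body>
    <div id="content">
      <h1>Crawling Horrors</h1>
      <div class="post">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </div>
    </div>
  </body>
</html>
"""

def extract_post(markup):
    """Return the title and the paragraph texts of the (hypothetical) post."""
    root = ET.fromstring(markup)
    # Both queries are tied to the exact nesting and attribute values above.
    title = root.findtext(".//div[@id='content']/h1")
    paragraphs = [p.text for p in root.findall(".//div[@class='post']/p")]
    return title, paragraphs

title, paragraphs = extract_post(RENDERED_PAGE)
print(title)
print(paragraphs)
```

Note how much the two queries know about the page: the element names, the nesting and the exact `id` and `class` values.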
I wouldn’t recommend this technique for the following reasons:
- It’s heavy and slow – remember, you are running a whole browser with a JavaScript engine, DOM parser, event handlers and many other features you don’t really use. It consumes a lot of memory and CPU, and it has to parse and render every page you load.
- It’s expensive in bandwidth, both for you and for the website you are crawling, as the browser also downloads all the images, JavaScript and CSS files.
- It’s unreliable for two reasons:
  - If the HTML isn’t valid, or is poorly written, many DOM parsers will fail and you won’t be able to use a query language to extract the content.
  - Even the slightest change in the site’s layout will break your DOM query.
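Both reliability problems are easy to reproduce. The sketch below is only an illustration under stated assumptions: the three markup snippets, the `QUERY` string and the `extract` helper are invented, and Python's strict ElementTree parser stands in for "a DOM parser" (other parsers may be more forgiving of bad markup).

```python
import xml.etree.ElementTree as ET

# Hypothetical query, written against the old layout.
QUERY = ".//div[@class='post']/p"

OLD_LAYOUT = "<html><body><div class='post'><p>Hello</p></div></body></html>"
# Same content after a small redesign: the wrapper became <article class='entry'>.
NEW_LAYOUT = "<html><body><article class='entry'><p>Hello</p></article></body></html>"
# Unclosed <p> and <br>: a strict parser rejects this markup.
INVALID_HTML = "<html><body><div class='post'><p>Hello<br></div></body></html>"

def extract(markup):
    """Return the matched paragraph texts, or None if the markup can't be parsed."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return [p.text for p in root.findall(QUERY)]

print(extract(OLD_LAYOUT))    # ['Hello']
print(extract(NEW_LAYOUT))    # []
print(extract(INVALID_HTML))  # None
```

The layout change is the nastier failure of the two: nothing raises an error, the query simply stops matching and the crawler quietly returns nothing.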
One last thing: even if you download only the HTML and parse it with just a DOM parser, you will still face the reliability problems described in the third point above.
Conclusion: don’t rely on a headless browser for crawling!