After bashing various crawling techniques, I would like to describe the one we use here at Webz.io, a technology we have developed over the past 8 years.
Our crawlers were developed with the following requirements in mind:
- Efficient use of server resources, i.e. CPU & bandwidth
- Fast fetching and extraction of content
- Easy addition of new sites to the crawling cycle
- A simple but powerful way of “teaching” the crawler the structure of new sites
- Robustness to changes in a site’s layout
We started by developing our crawlers in Python because of its dynamic module loading. This was important, as we wanted to be able to write new parsers and quickly add or fix existing ones without restarting the system.
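As a rough illustration of that idea (the `parsers` package name and the `parse()` interface here are placeholders, not our actual code), dynamic loading in Python can look like this:

```python
import importlib

def load_parser(site_name):
    """Load (or hot-reload) the parser module for a site, e.g. parsers/example_blog.py."""
    module = importlib.import_module(f"parsers.{site_name}")
    # Reloading picks up edits made while the crawler keeps running,
    # so fixing a parser never requires a restart.
    return importlib.reload(module)

# parser = load_parser("example_blog")
# article = parser.parse(page_html)
```

Because the module is re-imported at runtime, fixing a parser is as simple as editing its file; the next crawl cycle picks up the change.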
The crawler downloads only the HTML content, not the images, JS, and CSS files. It doesn’t wander around the site; it chooses the exact links to fetch, keeping bandwidth consumption to a minimum.
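A minimal sketch of that behavior, assuming a made-up listing page and article-link pattern (real per-site rules are more involved):

```python
import re
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Hypothetical article-link pattern for one site; in practice each site gets its own.
ARTICLE_LINK_RE = re.compile(r'href="(/posts/\d+[^"]*)"')

def fetch_html(url):
    """Fetch only the page's HTML; images, JS and CSS are never requested."""
    req = Request(url, headers={"User-Agent": "example-crawler/1.0"})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def article_links(listing_url):
    """Return only the links we explicitly chose to follow."""
    html = fetch_html(listing_url)
    return [urljoin(listing_url, path) for path in ARTICLE_LINK_RE.findall(html)]
```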
We don’t use headless browsers to parse the content, nor do we use a DOM parser. Instead, we extract the content with regular expressions and various heuristic functions, which makes the solution robust to changes in the HTML structure.
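Here is a simplified illustration of that approach; the patterns and their fallback order are examples, not the heuristics we actually ship:

```python
import html
import re

# Try several patterns in order of preference, so a layout change that
# breaks one pattern rarely breaks them all.
TITLE_PATTERNS = [
    re.compile(r'<meta property="og:title" content="([^"]+)"', re.I),
    re.compile(r"<h1[^>]*>(.*?)</h1>", re.I | re.S),
    re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S),
]

def extract_title(page_html):
    for pattern in TITLE_PATTERNS:
        match = pattern.search(page_html)
        if match:
            # Strip any leftover tags and decode HTML entities.
            text = re.sub(r"<[^>]+>", "", match.group(1))
            return html.unescape(text).strip()
    return None
```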
We have built up knowledge about multiple content platforms, and we leverage it to add new sources easily without writing new parsers, since the system recognizes the basic structure of the platform.
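To illustrate the idea (the signatures below are simplified examples, not our actual detection rules):

```python
import re

# If a known platform is detected, its existing generic parser can be reused.
PLATFORM_SIGNATURES = {
    "wordpress": re.compile(r'<meta name="generator" content="WordPress', re.I),
    "vbulletin": re.compile(r'content="vBulletin', re.I),
    "blogger": re.compile(r'<meta content="blogger" name="generator"', re.I),
}

def detect_platform(page_html):
    for platform, signature in PLATFORM_SIGNATURES.items():
        if signature.search(page_html):
            return platform   # reuse the parser we already have for this platform
    return None               # unknown platform: a custom parser is needed
```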
Since the crawlers are written in Python, writing a parser can range from a few minutes of work, when you only need to fill in a template with a regular expression, to a very powerful parser that handles a combination of JSON responses retrieved via AJAX, cookies, and different HTTP headers.
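To give a feel for both ends of that spectrum, here is a toy sketch; the field names, URL, and headers are placeholders rather than our production parser API:

```python
import json
import re
from urllib.request import Request, urlopen

# 1) The "few minutes" case: the parser is just a template of field regexes.
SIMPLE_PARSER = {
    "title": re.compile(r'<h1 class="entry-title">(.*?)</h1>', re.S),
    "author": re.compile(r'<span class="author">(.*?)</span>', re.S),
}

def parse_simple(page_html):
    result = {}
    for field, pattern in SIMPLE_PARSER.items():
        match = pattern.search(page_html)
        result[field] = match.group(1).strip() if match else None
    return result

# 2) The heavier case: an AJAX endpoint returning JSON, accessed with a
#    session cookie and custom HTTP headers.
def parse_ajax(article_id, session_cookie):
    req = Request(
        f"https://example.com/api/articles/{article_id}",
        headers={
            "Cookie": f"session={session_cookie}",
            "X-Requested-With": "XMLHttpRequest",
            "Accept": "application/json",
        },
    )
    with urlopen(req, timeout=10) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return {"title": data.get("title"), "author": data.get("author")}
```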
True, our solution requires basic knowledge of Python and regular expressions, but in return it provides power and efficiency unmatched by any other technique.