Good News: We Doubled Our Web News Coverage

If you have been using our news API, you may have noticed a steady increase in the number of posts you get. This increase is the result of a major update to our crawling technology that more than doubled the number of news articles we download per day. We call this new technology Adaptive Crawling.

Graph 1: News coverage over the past 18 months

The Benefits of Adaptive Crawling

Over the past 12 months, we have more than doubled the number of news articles we download per day (see the graph above). In July 2020 we averaged about 750,000 news articles a day; today the average is around 1,600,000 articles (excluding comments!).
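As a quick sanity check on "more than doubled", the two daily averages quoted above can be compared directly. The variable names are ours, chosen for illustration; the figures are the rounded averages from this post.

```python
# Rounded daily averages quoted in the post.
JULY_2020_DAILY = 750_000
CURRENT_DAILY = 1_600_000

# Ratio of the current average to the July 2020 average.
growth_factor = CURRENT_DAILY / JULY_2020_DAILY
# Relative increase expressed as a percentage.
percent_increase = (growth_factor - 1) * 100

print(f"Growth factor: {growth_factor:.2f}x")
print(f"Increase: {percent_increase:.0f}%")
```

With these rounded inputs the ratio comes out at roughly 2.13x, an increase of about 113%.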

The reason is simple: the better our crawlers are at anticipating when an article is going to be published, the lower the chances that we miss it in the clutter.

How did we do that? We recently launched our Adaptive Crawler. Based on a unique machine learning algorithm, the Adaptive Crawler learns the publishing trends of every section in each domain and predicts activity on the site. It assesses how frequently the site publishes and recognizes the peaks in which most of the posts in a given time frame appear.

By doing so, it can efficiently direct parsers to where they are needed at any given time, and optimize both latency and the time spent on each site.

This optimization also means that we can dive deeper into each domain and extract more information without slowing our crawling cycles, so we not only reduce latency but also increase the volume of content we collect.
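To make the idea concrete, here is a minimal sketch of frequency-based visit scheduling. It is an illustration only and not the Adaptive Crawler itself, which relies on a machine learning model: this version simply counts how many past articles a site section published in each hour of the day and splits a fixed daily visit budget in proportion. The function name `allocate_visits`, the hour-of-day input, and the notion of a "daily budget" are all assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List


def allocate_visits(publish_hours: Iterable[int], daily_budget: int) -> List[int]:
    """Split a daily visit budget across hours 0-23.

    publish_hours: hour of day (0-23) of each previously seen article
                   for one site section (hypothetical input format).
    daily_budget:  total number of visits we can afford per day.
    Returns a 24-element list of visits per hour that sums to daily_budget.
    """
    counts = Counter(publish_hours)
    total = sum(counts.values())
    if total == 0:
        # No history yet: treat every hour as equally likely.
        counts = Counter({hour: 1 for hour in range(24)})
        total = 24

    # Proportional share per hour, using integer math to avoid rounding drift.
    visits = [counts[hour] * daily_budget // total for hour in range(24)]
    remainders = [counts[hour] * daily_budget % total for hour in range(24)]

    # Hand out visits lost to rounding down, largest remainder first.
    leftover = daily_budget - sum(visits)
    ranked = sorted(range(24), key=lambda hour: remainders[hour], reverse=True)
    for hour in ranked[:leftover]:
        visits[hour] += 1
    return visits


# Example: a section that publishes mostly at 08:00, with some activity
# at 09:00 and 17:00, and nothing overnight.
history = [8] * 6 + [9] * 3 + [17] * 3
plan = allocate_visits(history, daily_budget=8)
print({hour: n for hour, n in enumerate(plan) if n})
```

In this toy history, half of the budget lands on the 08:00 peak and no visits are spent on hours with no recorded activity. A real system would also need to keep probing quiet hours occasionally so that it can notice when a site changes its publishing pattern.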

This technological upgrade resulted in two other major benefits:

  • Lower Latency – As explained above, the better we are at anticipating when an article will be published, the sooner after publication we can crawl it, which drastically decreases our latency.
  • Efficiency – We don’t need to visit websites at times when they don’t publish content. This allows more efficient resource allocation and, more importantly, reduces our crawling footprint.
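Both benefits can be illustrated with a small calculation. The sketch below measures latency as the gap between an article's publication and the first crawler visit at or after it, then compares a fixed-interval schedule with one aligned to the publishing peaks. The helper `mean_latency`, the publication times, and both schedules are made up for the example; they are not measurements from our crawlers.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def mean_latency(publish_times: Sequence[float],
                 visit_times: Sequence[float]) -> Optional[float]:
    """Average delay between publication and the next visit.

    Times are minutes since midnight. Articles published after the
    last visit of the day are ignored. Returns None if nothing is caught.
    """
    visits = sorted(visit_times)
    delays = []
    for published in publish_times:
        # Index of the first visit at or after the publication time.
        i = bisect_left(visits, published)
        if i < len(visits):
            delays.append(visits[i] - published)
    return sum(delays) / len(delays) if delays else None


# Hypothetical day: three articles around 08:00 and one at 17:00.
published = [480, 485, 490, 1020]

fixed_schedule = [0, 360, 720, 1080]  # a visit every six hours
adaptive_schedule = [495, 1025]       # visits timed just after the peaks

print(mean_latency(published, fixed_schedule))     # 191.25
print(mean_latency(published, adaptive_schedule))  # 8.75
```

In this toy scenario the peak-aligned schedule uses half as many visits and still cuts the average delay from about three hours to under ten minutes.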

These major improvements already help us extend our coverage while retaining high quality and low latency.

So what’s next?

Chasing the internet and trying to keep up is not only impossible but also extremely resource-intensive. Those resources are costly and, if freed, can be put to work elsewhere.

Our latest upgrade is only the beginning. The increase in coverage is ongoing, and we expect it to continue as we migrate more sites to this new technology and our crawler gets smarter.

We are deploying the new Adaptive Crawlers to our blogs and discussions data verticals as well, so expect to see an increase in content and an improvement in latency for these verticals in the coming months.

