Common Crawl vs. Webz.io

Web archives are an important resource for both academic and commercial research. Access to historical web data is crucial for political event analysis, fake news detection, financial trend correlation and training machine learning models, among other things.

If you would like to conduct large-scale data mining research and explore questions about the linking structure of the web or analyze the textual content of pages, you will need access to a web archive. 

In this post we will compare two leading archive solutions: Webz.io and Common Crawl. But before we dive into the detailed comparison, here is a brief overview of both.

Common Crawl crawls the web and freely provides its archives and datasets to the public. The Common Crawl corpus contains petabytes of data collected since 2011, including raw web page data, extracted metadata and plain text extractions. Amazon Web Services began hosting Common Crawl’s archive through its Public Datasets Program in 2012.

Webz.io offers an easy and cost-effective way to access segmented and structured web data. It provides access to pre-defined data verticals such as news, blogs, forums, reviews and dark nets. This includes access to both a live data stream and an archive going back to 2008.

Common Crawl vs. Webz

| | Common Crawl | Webz |
|---|---|---|
| Archive Time Frame | 2011 – Present | 2008 – Present |
| Site Types | Unclassified HTML pages from all around the web regardless of site type. | Data from pre-defined verticals: news, blogs, forums and review sites. |
| Data Structure | URL, raw HTML, HTML & server metadata, extracted plaintext | No HTML, rather clean structured data extracted out of the HTML: title, publication date, post text, comments, author, language, post URL, Section URL & title, country, entities, external links, # likes/shares |
| Data Format | WARC file format and also contains metadata (WAT) and text data (WET) extracts | NDJSON |
| Method to data access | Bulk download of all the data per crawled month | Filtered by Boolean keywords in the title/text or by any extracted metadata such as language, country, site type, date etc. |
| Support for present live data | No support for live data. Data is available at the end of the crawled month. | Live access to crawled data going back 30 days |
| API Access | | RESTful API |
| Support for AJAX based sites | No | Yes |
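To give a feel for what working with structured NDJSON records looks like, here is a minimal sketch in Python. It filters records by a keyword in the title or text and by language, which is the kind of filtering described in the table. The field names (`title`, `text`, `language`, `site_type`), the `filter_posts` helper and the sample records are assumptions made for illustration; they are not taken from the Webz.io API or its actual output schema.

```python
import json
from typing import Iterable, Iterator


def filter_posts(lines: Iterable[str], keyword: str, language: str) -> Iterator[dict]:
    """Yield NDJSON records that mention `keyword` in the title or text
    and whose language matches. Field names are hypothetical."""
    needle = keyword.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        post = json.loads(line)  # NDJSON: one JSON object per line
        haystack = f"{post.get('title', '')} {post.get('text', '')}".lower()
        if needle in haystack and post.get("language") == language:
            yield post


# Two made-up records standing in for an NDJSON export.
SAMPLE = "\n".join([
    json.dumps({"title": "Markets rally", "text": "Stocks rose today.",
                "language": "english", "site_type": "news"}),
    json.dumps({"title": "Bourse en hausse", "text": "Les actions montent.",
                "language": "french", "site_type": "blogs"}),
])

matches = list(filter_posts(SAMPLE.splitlines(), keyword="stocks", language="english"))
print([m["title"] for m in matches])  # prints ['Markets rally']
```

Because every line is a complete JSON object, records can be processed one at a time without loading the whole file into memory.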
   

Here’s the bottom line: Common Crawl has a huge archive available for free for anyone to download. The downside is that the data isn’t structured, cannot be filtered, and is only available in bulk. In comparison, Webz.io provides an affordable commercial solution for clean and structured data spanning more than 10 years. Unlike Common Crawl data, which isn’t limited to certain types of sites, Webz.io’s crawled data is available in pre-defined verticals (news, blogs, forums and reviews).
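With bulk data, any filtering is up to you. The sketch below shows roughly what that involves for plain text (WET) extracts: walk the WARC-style records one by one and keep the URLs whose text mentions a keyword. It is a simplified illustration that assumes an uncompressed stream and builds its own tiny sample in memory. The `iter_warc_records` and `make_record` helpers are hypothetical names, real Common Crawl files are gzip-compressed and would need to be decompressed first, and in practice a dedicated WARC library is usually a better choice.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC/WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers: Dict[str, str] = {}
        while True:
            header_line = stream.readline().decode("utf-8", "replace").strip()
            if not header_line:
                break  # an empty line ends the header block
            name, _, value = header_line.partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", "0"))
        yield headers, stream.read(length)  # payload is exactly Content-Length bytes


def make_record(uri: str, text: str) -> bytes:
    """Build a minimal WET-style record for the demo (not real crawl data)."""
    body = text.encode("utf-8")
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: conversion\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


SAMPLE = (
    make_record("http://example.com/a", "Election results are in.")
    + make_record("http://example.com/b", "A recipe for soup.")
)

hits = [
    headers["WARC-Target-URI"]
    for headers, payload in iter_warc_records(io.BytesIO(SAMPLE))
    if b"election" in payload.lower()
]
print(hits)  # prints ['http://example.com/a']
```

Even in this toy form, every record has to be read and checked, which is the practical cost of bulk-only access when you only need a small slice of the data.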

If you require access to free historical web data in bulk, Common Crawl is most likely your best solution. If you need filtered, granular, structured data, then Webz.io is probably a better tool for the job.
