Optimize LLM Data Preprocessing with Structured Historical Web Data

The race is on to build the next greatest large language model (LLM), with quite a few tech giants competing, including OpenAI, Google, Meta, and Baidu. All these companies have announced a new LLM or a new version of one in the past few months. Building an LLM requires a LOT of historical web data, data that teams must preprocess for model training. Preprocessing is a crucial step in ensuring your large language model performs well. And the data you choose to preprocess plays a key role in whether you will have quality training data for your model.

What is data preprocessing?

When referring to LLMs, data preprocessing is where you take raw or unstructured data and transform it into a format that algorithms can understand. An algorithm learns from the preprocessed data (training data), and what it learns is saved as a model. Data preprocessing forms the foundation for every advanced LLM. 

Main challenges of LLM data preprocessing

If you work on a team responsible for LLM data preprocessing, you face some daunting challenges, including:

  • The amount of historical web data needed to train and scale an LLM
  • Figuring out where and how to collect historical web data for model training
  • Ensuring the data is cleaned properly

You need to preprocess a LOT of historical web data for an LLM

LLMs need massive amounts of historical web data to learn from. How much training data an LLM needs varies from model to model. In most cases, you need gigabytes of training data at a minimum. Larger models train on datasets that are hundreds of gigabytes or even terabytes in size after filtering. Let’s look at a few examples of training data sizes. Here are some training data details for several of OpenAI’s GPT models:

  • GPT-2: OpenAI’s research paper says that GPT-2 was trained on a preliminary version of WebText, with the data cutting off around December 2017. WebText is a dataset created by OpenAI from about 45 million outbound Reddit links with at least 3 karma. After cleaning, the dataset consists of approximately 40GB of text from more than 8 million documents.
  • GPT-3: This LLM was trained on several different datasets, including Common Crawl, WebText2, and Wikipedia. The filtered Common Crawl data had 60% weight in the training mix and 410 billion tokens. OpenAI’s paper on GPT-3 says they downloaded roughly 45TB of compressed plaintext from Common Crawl, and after filtering, that dataset was 570GB in size. According to the paper, the Common Crawl data covers 2016 to 2019.
  • GPT-4: OpenAI is vague on the details regarding the training data for GPT-4. Their research paper says GPT-4 was pre-trained with publicly available data, like web data, and data licensed from third-party providers. The paper also says the training data cuts off in September 2021.

OpenAI trains each model using large sets of historical web data, and the training data grew dramatically from GPT-2 to GPT-3 (OpenAI has not disclosed the dataset size for GPT-4).

You can also see the trend of increasingly larger training datasets for LLMs through Google’s AI research. In January 2020, Google announced Meena, an end-to-end neural conversational model. According to Google’s announcement post, the model was trained on 341 GB of text. Compared to OpenAI’s GPT-2, Meena “has 1.7x greater model capacity and was trained on 8.5x more data.” In May 2021, Google introduced LaMDA, a family of language models for dialog applications. Google’s research paper says that LaMDA’s pre-training dataset had a total of 1.56T words, while Meena’s training set had 40B total words. That makes LaMDA’s training dataset almost 40 times larger than Meena’s by word count.
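The size comparisons above are simple ratios, and you can reproduce them in a few lines. The sketch below uses only the figures quoted in this section. It assumes decimal units (1 TB = 1,000 GB), and the variable names are ours, chosen for illustration. The GPT-3 fraction is only a rough indicator, because the 45TB figure refers to compressed text.

```python
# Figures quoted in this article; decimal units are an assumption (1 TB = 1,000 GB).
GPT2_WEBTEXT_GB = 40            # GPT-2: cleaned WebText
MEENA_TEXT_GB = 341             # Meena: training text
MEENA_WORDS = 40e9              # Meena: total words
LAMDA_WORDS = 1.56e12           # LaMDA: pre-training words
GPT3_CC_RAW_GB = 45 * 1000      # GPT-3: compressed Common Crawl download
GPT3_CC_FILTERED_GB = 570       # GPT-3: Common Crawl after filtering

# How many times larger one dataset is than another.
meena_vs_gpt2 = MEENA_TEXT_GB / GPT2_WEBTEXT_GB
lamda_vs_meena = LAMDA_WORDS / MEENA_WORDS

# Rough share of the Common Crawl download that survived filtering.
gpt3_kept_fraction = GPT3_CC_FILTERED_GB / GPT3_CC_RAW_GB

print(f"Meena vs. GPT-2 (text size): {meena_vs_gpt2:.1f}x")
print(f"LaMDA vs. Meena (words):     {lamda_vs_meena:.0f}x")
print(f"GPT-3 Common Crawl kept:     {gpt3_kept_fraction:.1%}")
```

The last ratio is a reminder of how much raw web data gets discarded during filtering before it ever reaches training.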

You have to figure out where and how to collect historical web data

You need to find huge volumes of historical web data, and then you need to figure out how to collect it. Fortunately, you can find publicly available historical data just about everywhere on the internet. Some LLM developers scrape data from public sites like Reddit and Wikipedia. Others rely on massive web datasets from repositories like Common Crawl (check out our recent review!). Common Crawl data goes back to 2008, and it is available in huge raw data files you can download. You can also find some public historical data available via APIs, both free and paid. You may have to use multiple methods to collect enough diverse training data for your LLM.
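As one illustration of the repository route, Common Crawl publishes a URL index you can query to find which archive files contain pages from a given site. The sketch below only builds such a query URL and makes no network request. The host, the `-index` path suffix, the `url` and `output` parameters, and the example crawl ID reflect our understanding of the public index server, so treat them as assumptions and check Common Crawl's own documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumptions: the public Common Crawl index host and one example crawl ID.
# Verify both against Common Crawl's documentation before use.
INDEX_HOST = "https://index.commoncrawl.org"
EXAMPLE_CRAWL_ID = "CC-MAIN-2023-14"


def build_index_query(url_pattern: str, crawl_id: str = EXAMPLE_CRAWL_ID) -> str:
    """Return an index query URL for pages matching url_pattern in one crawl."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


# Build (but do not send) a query for every captured page under example.com.
query = build_index_query("example.com/*")
print(query)
```

Each crawl has its own index, so covering several years of history means repeating the query per crawl ID and then downloading and parsing the raw archive files it points to.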

You must clean the data properly

Much of the preprocessing work revolves around cleaning the data you’ve collected so that it will work well for language model training. You must spend a significant amount of your time cleaning and structuring data because most publicly available historical web data is non-curated and unstructured. Cleaning the data takes a lot of time and effort because you often run into issues you must fix, such as:

  • Junk data (e.g., gibberish, boilerplate)
  • Inconsistent data (e.g., typos, incorrect spellings)
  • Incorrect data types (e.g., a number stored as a string instead of a float or integer)
  • Noisy data (e.g., spam data, raw HTML, offensive content)
  • Data outliers (extreme data points compared to the rest of the data)
  • Missing values, duplicate data, irrelevant data
  • Difficulty sorting by specific attributes or limited customization
  • Not representative (bias)
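To make a few of these issues concrete, here is a minimal cleaning sketch using only the Python standard library. It strips raw HTML (including script and style contents), normalizes whitespace, treats missing values as unusable, and drops documents that look like junk. The function names and the thresholds (`min_words`, `min_alpha_ratio`) are illustrative assumptions, not established values; a production pipeline would need far more than this.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(raw: str) -> str:
    """Remove tags, decode entities, and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def looks_like_junk(text: str, min_words: int = 5,
                    min_alpha_ratio: float = 0.6) -> bool:
    """Heuristic: too few words, or too few letters and spaces overall."""
    if len(text.split()) < min_words:
        return True
    readable = sum(ch.isalpha() or ch.isspace() for ch in text)
    return readable / len(text) < min_alpha_ratio


def clean_document(raw: Optional[str]) -> Optional[str]:
    """Return cleaned text, or None for missing values and junk."""
    if not raw:
        return None
    text = strip_html(raw)
    return None if looks_like_junk(text) else text


sample = "<p>Hello &amp; welcome to the   structured web data feed.</p><script>var x = 1;</script>"
print(clean_document(sample))
```

Heuristics like these catch the obvious cases only. Bias, outliers, and subtle inconsistencies usually require dataset-level analysis rather than per-document rules.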

If your job responsibilities include cleaning and preparing data for model training, you should expect to spend about 38% of your time doing just that. Expect to spend even more time on these tasks if the data you’ve collected has a multitude of problems. 

The size of the training data impacts LLM performance, but not nearly as much as the quality of that data. You could feed an LLM exabytes of training data, but if that data is riddled with problems, the model will perform poorly or may not work at all.

Without a doubt, big tech companies like Google, Microsoft, and OpenAI have teams spending a lot of their time preprocessing data for their LLMs. However, the LLMs these companies have developed to date still sometimes produce inaccurate and unpredictable results. 

Structured historical web data feeds can help improve data collection and preprocessing

Webz.io provides structured historical web data feeds that help reduce many issues with data cleaning. We provide our archived web data feeds via a RESTful API or Firehose, making it easy to collect and integrate the data with models and applications. Structured web data allows you to speed up data preprocessing because you don’t have to spend as much time fixing issues with the data. You can also quickly filter the data so that you can train your model with high-quality, relevant data.
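To show why structure speeds up filtering, here is a small sketch that selects training candidates from already-structured records by language, date range, and keywords. The `Post` record and its field names are hypothetical and chosen for illustration. They are not the Webz.io API schema, so consult the API documentation for the actual fields and query syntax.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Sequence


@dataclass(frozen=True)
class Post:
    # Hypothetical record shape for illustration; not an actual API schema.
    url: str
    language: str
    published: date
    text: str


def select_training_posts(posts: Iterable[Post], language: str,
                          start: date, end: date,
                          keywords: Sequence[str]) -> List[Post]:
    """Keep posts in one language, inside a date range, matching any keyword."""
    wanted = [k.lower() for k in keywords]
    selected = []
    for post in posts:
        if post.language != language:
            continue
        if not (start <= post.published <= end):
            continue
        body = post.text.lower()
        if wanted and not any(k in body for k in wanted):
            continue
        selected.append(post)
    return selected


posts = [
    Post("https://a.example/1", "english", date(2015, 3, 1), "New language model research"),
    Post("https://b.example/2", "spanish", date(2015, 3, 2), "language model"),
    Post("https://c.example/3", "english", date(2007, 1, 1), "language model"),
    Post("https://d.example/4", "english", date(2016, 1, 1), "Cooking tips"),
]
picked = select_training_posts(posts, "english", date(2008, 1, 1),
                               date(2020, 12, 31), ["language model"])
print([p.url for p in picked])
```

With unstructured data, you would first have to detect the language, extract the publication date, and isolate the body text before any of these filters could run.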

Problems and how Webz.io structured historical web data feeds help

Problem: Difficult to collect and prepare the massive amount of historical web data needed to train and scale an LLM
How the feeds help: You need massive volumes of quality historical web data to train and scale an LLM. We provide large-scale historical web data in multiple languages through a RESTful API. Nearly all of our archived web data goes back to 2008 and has already been cleaned, structured, and enriched. You only need to plug the API into your application or system. Our data is available in pre-defined verticals, including news, blogs, forums, and reviews.

Problem: Difficult to sort or customize the data
How the feeds help: You can easily sort our historical datasets by topics, organizations, time frames, social engagement (e.g., likes, shares), domain, country, and more.

Problem: Junk data, inconsistent data, missing values, incorrect data types
How the feeds help: We’ve already cleaned and structured the data, so you don’t have to spend time fixing these data issues.

Problem: Noisy data
How the feeds help: We only crawl useful data sites, so you won’t see offensive content, data from spam websites, raw HTML and code, boilerplate text, or lorem ipsum text. You don’t have to spend time removing unwanted content.

Problem: Data outliers
How the feeds help: In most cases, we clean the data to eliminate outliers, and you can choose to get higher accuracy if required.

Problem: Duplicate data
How the feeds help: We will only index the same URL once. However, the exact text on different domains can still be present. If you use our data from online discussion forums, you can use a filter to exclude duplication of the text of the original posts that could sometimes appear in the responses (on the same thread). Note: If you use Webz.io along with other sources, you’ll need to de-duplicate on your end.

Problem: Irrelevant data
How the feeds help: You can filter the data based on specific criteria, e.g., language, location, keywords, and sentiment. Filtering helps improve the relevance and quality of the data.

Problem: Not representative (bias)
How the feeds help: We provide archived web data from domains worldwide. It’s easier to reduce bias in your model if you leverage clean data from more domains and diverse datasets.
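As the note on duplicate data says, combining several sources means de-duplicating on your end, because the same text can appear under different URLs. Here is a minimal sketch of exact-text de-duplication, assuming each record is a (url, text) pair. The function names are ours, and the normalization (lowercasing and collapsing whitespace) is a deliberately simple assumption.

```python
import hashlib
import re
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (url, text); an assumed shape for illustration


def text_fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Keep the first record seen for each distinct text fingerprint."""
    seen = set()
    unique = []
    for url, text in records:
        fingerprint = text_fingerprint(text)
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        unique.append((url, text))
    return unique


records = [
    ("https://a.example/post", "Breaking  news: LLMs need data."),
    ("https://b.example/copy", "breaking news: llms need data. "),
    ("https://c.example/other", "Something else entirely."),
]
unique = deduplicate(records)
print([url for url, _ in unique])
```

This only catches texts that are identical after normalization. Near-duplicates, such as a syndicated article with a different footer, need fuzzier techniques like shingling or MinHash.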

Optimized data preprocessing means better LLM training results

The messier and more flawed your historical web data, the more difficult and time-consuming data preprocessing becomes, and the greater the risk that you wind up with training data that leads to inaccurate or unexpected outputs from the model. Structured historical web data lets you speed up and improve data preprocessing, which means you get better training data and a higher-performing LLM.

Interested in learning more about how structured historical web data can help you optimize data preprocessing for your LLM? Schedule a chat with one of our web data experts.

