The race is on to build the next great large language model (LLM), with tech giants such as OpenAI, Google, and Meta all competing. Each of these companies has announced a new LLM, or a new version of one, in the past few months. Building an LLM requires a lot of historical web data, and teams must preprocess that data for model training. Preprocessing is a crucial step in ensuring your large language model performs well, and the data you choose to preprocess plays a key role in whether you end up with quality training data for your model.
What is data preprocessing?
For LLMs, data preprocessing is the process of taking raw or unstructured data and transforming it into a format that algorithms can understand. Beyond changing the format, preprocessing can involve removing ad text from a news article or filtering out Reddit posts that contain nothing but random strings of numbers. Data preprocessing forms the foundation for every advanced LLM, since an algorithm learns from the preprocessed data (the training data). The training data is like the lesson plan, while the algorithm is the student. What it learns is encoded as a model.
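As a simple illustration, here is a minimal sketch of that kind of transformation: taking one raw scraped HTML page and turning it into a structured record a training pipeline could consume. The field names and the tag-stripping logic are illustrative assumptions, not a description of any particular pipeline.

```python
# Illustrative only: turn one raw scraped HTML page into a structured record
# that a training pipeline can consume. Field names are assumptions.
import json
import re

RAW_HTML = "<html><body><h1>Quarterly results</h1><p>Revenue grew 12%...</p></body></html>"

def to_training_record(url: str, html: str) -> dict:
    """Strip markup and package the text with minimal metadata."""
    text = re.sub(r"<[^>]+>", " ", html)      # drop HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return {"url": url, "text": text, "language": "en"}

print(json.dumps(to_training_record("https://example.com/article", RAW_HTML), indent=2))
```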
Main challenges of LLM data preprocessing
If you work on a team responsible for LLM data preprocessing, you face some daunting challenges, including:
- The large amount of historical web data necessary to train and scale an LLM.
- Figuring out where and how to collect historical web data for model training.
- Ensuring the data is cleaned properly.
- The time-consuming, laborious work of removing typos, irrelevant data, and other noise during cleaning.
- Not having the capacity to adjust data for certain LLM training needs.
You need to preprocess a lot of historical web data for an LLM
LLMs need massive amounts of historical web data to learn from. How much training data an LLM needs varies from model to model. In most cases, you need hundreds of gigabytes of training data, and larger models train on datasets that are terabytes in size or even larger. Let's look at a few examples of training data sizes for several of OpenAI's GPT models and Meta's Llama:
- GPT-3 — This LLM was trained on several datasets, including Common Crawl, WebText2, and Wikipedia. Common Crawl accounted for 60% of the training mix and 410 billion tokens. In total, GPT-3 was trained on around 570 gigabytes of text data and has 175 billion parameters. Its training data has a cutoff in 2019.
- GPT-4 — OpenAI has not disclosed the size of GPT-4's training data or its parameter count, except to say that both are larger than GPT-3's. The GPT-4 research paper says the model was pre-trained on publicly available data, such as web data, along with data licensed from third-party providers. The training data cuts off in September 2021.
- Llama 3.1 — According to Meta, Llama 3.1 was trained on 15 trillion tokens. A token is the smallest unit of text a model can process independently; it can be a whole word or part of a word (see the quick example below).
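As a quick illustration of how text breaks into tokens, here is a minimal sketch using the open-source tiktoken library. The choice of the cl100k_base encoding is an assumption for illustration; every model family uses its own tokenizer, so counts will differ.

```python
# Illustrative tokenization example using the open-source tiktoken library.
# The "cl100k_base" encoding is one of several; other models use different tokenizers.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Preprocessing historical web data"
token_ids = enc.encode(text)

print(len(token_ids))                        # number of tokens, not words
print([enc.decode([t]) for t in token_ids])  # longer words may split into pieces
```

Longer or rarer words typically split into multiple tokens, which is why token counts usually exceed word counts.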
OpenAI trains each model using large sets of historical web data, and it looks like the training data size increases dramatically with each new GPT version. You can also see the trend of increasingly larger training datasets for LLMs through Google’s AI research. In January 2020, Google announced Meena, an end-to-end, neural conversational model. According to Google’s announcement post, the model was trained on 341 GB of text. And when compared to OpenAI’s GPT-2, Meena “has 1.7x greater model capacity and was trained on 8.5x more data.” In May 2021, Google introduced LaMDA, a family of language models for dialog applications. Google’s research paper says that LaMDA’s pre-training dataset had a total of 1.56T words while Meena’s training set had 40B total words. The total number of words in LaMDA’s training dataset was almost 40X larger than Meena’s.
You have to figure out where and how to collect historical web data
You need to find huge volumes of historical web data, and then you need to figure out how to collect it. Fortunately, publicly available historical data can be found just about everywhere on the internet. Some LLM developers scrape data from public sites like Reddit and Wikipedia. Others rely on massive web datasets from repositories like Common Crawl (check out our recent review!). Common Crawl data goes back to 2008 and is available as huge raw data files you can download. You can also find some public historical data via APIs (both free and paid). You may have to use multiple methods to collect enough diverse training data for your LLM.
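For example, Common Crawl publishes WET files that contain plain text already extracted from crawled pages. The sketch below streams one such file with the open-source warcio library; the file path is a placeholder, since the real paths are listed in the wet.paths.gz index that Common Crawl publishes for each crawl.

```python
# A minimal sketch of streaming text out of a single Common Crawl WET file.
# The path below is a placeholder -- real paths come from each crawl's
# wet.paths.gz index at https://data.commoncrawl.org/.
import requests
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

WET_URL = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-10/.../xxxx.warc.wet.gz"  # placeholder

def iter_wet_records(url: str):
    """Stream a WET file and yield (source_url, plain_text) pairs."""
    resp = requests.get(url, stream=True)
    resp.raise_for_status()
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "conversion":  # WET records hold extracted text
            source = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            yield source, text
```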
You must clean the data properly
Much of the preprocessing work revolves around cleaning the data you've collected so that it works well for language model training. Because most publicly available historical web data is uncurated and unstructured, you must spend a significant amount of your time cleaning and structuring it. Cleaning takes a lot of time and effort because you often run into issues you must fix, such as the following (a minimal cleaning sketch follows the list):
- Junk data (e.g., gibberish, boilerplate)
- Inconsistent data (e.g., typos, incorrect spellings)
- Noisy data (e.g., spam data, raw HTML, offensive content)
- Data outliers (extreme data points compared to the rest of the data)
- Missing values, duplicate data, irrelevant data
- Difficulty sorting by specific attributes or limited customization
- Not representative (bias)
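As a rough illustration of that cleaning work, a first pass might normalize whitespace, drop junk or gibberish documents, and remove exact duplicates. The thresholds and heuristics below are assumptions for illustration, not a production pipeline.

```python
# Illustrative first-pass cleaning: drop junk, gibberish, and exact duplicates.
# Thresholds are assumptions; tune them for your own corpus.
import hashlib
import re

def looks_like_gibberish(text: str) -> bool:
    """Flag documents where too few characters are letters or spaces."""
    if not text:
        return True
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) < 0.8  # assumed threshold

def clean_corpus(docs: list[str]) -> list[str]:
    seen_hashes = set()
    kept = []
    for doc in docs:
        doc = re.sub(r"\s+", " ", doc).strip()  # normalize whitespace
        if len(doc) < 200:                       # drop boilerplate-only / junk docs
            continue
        if looks_like_gibberish(doc):            # drop noisy / garbled docs
            continue
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:                # drop exact duplicates
            continue
        seen_hashes.add(digest)
        kept.append(doc)
    return kept
```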
If your job responsibilities include cleaning and preparing data for model training, you should expect to spend about 38% of your time doing just that. Expect to spend even more time on these tasks if the data you’ve collected has a multitude of problems.
The size of the training data impacts LLM performance, but not nearly as much as the quality of that data. You could feed an LLM exabytes of training data, but if that data is riddled with problems, the model will perform poorly or may not work at all.
Without a doubt, big tech companies like Google, Microsoft, and OpenAI have teams spending a lot of their time preprocessing data for their LLMs. However, the LLMs these companies have developed to date still sometimes produce inaccurate and unpredictable results.
Benefits of using structured historical web data for LLM preprocessing
The quantity and quality of the training data both shape an LLM's performance. While massive amounts of historical web data are essential, the data's structure and cleanliness largely determine how smoothly preprocessing goes and, ultimately, how effective the model is.
The type of data you use depends on your training goals. Common forms of structured web data for LLM training include:
- News articles
- Social media posts
- Blog content
For example, news articles are ideal for chatbots focused on current events. Selecting a reputable provider is crucial for data quality, accuracy, and consistency. The cost of structured historical web data can vary depending on factors such as volume, data type, and provider. Some providers offer subscription-based plans or pay-as-you-go options.
Structured historical web data feeds, like those provided by Webz.io, offer a significant advantage over raw, unstructured data in the training stage and beyond. Some benefits include the following:
- The feeds come pre-cleaned and organized, eliminating the need for developers to spend time on extensive data cleaning so they can focus on more critical aspects of LLM training and development.
- Structured data allows for easier filtering and customization, ensuring the LLM is trained on relevant, high-quality data.
- Clean, diverse, well-filtered data helps mitigate the potential for bias in the LLM's outputs.
- Structured feeds serve as a valuable foundation for Retrieval-Augmented Generation (RAG), which relies on high-quality, relevant data for both the retriever and the generative component; well-structured, preprocessed data tailored to your specific domain makes a RAG system more effective.
Structured historical web data feeds can help improve data collection and preprocessing
Webz.io provides structured historical web data feeds that help reduce many issues with data cleaning. We provide our archived web data feeds via a RESTful API or Firehose, making it easy to collect and integrate the data with models and applications. Structured web data speeds up data preprocessing because you don't have to spend as much time fixing issues with the data. You can also quickly filter the data so that you train your model with high-quality, relevant data.
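To make that concrete, here is a rough sketch of what pulling filtered, structured records from such a feed could look like. The endpoint URL, parameter names, and response fields are illustrative assumptions rather than documentation of the Webz.io API; consult the provider's docs for the real interface.

```python
# Hypothetical sketch of pulling filtered, structured web data from a REST API.
# The endpoint, parameters, and response shape below are assumptions for
# illustration -- consult your provider's documentation for the real interface.
import requests

API_URL = "https://api.example-provider.com/historical"  # placeholder endpoint
API_TOKEN = "YOUR_TOKEN"                                  # placeholder credential

def fetch_articles(query: str, language: str, since: str) -> list[dict]:
    """Request structured article records matching a keyword, language, and date."""
    params = {
        "token": API_TOKEN,
        "q": query,               # e.g., topic or keyword filter
        "language": language,
        "published_since": since,
    }
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("posts", [])  # assumed response field

articles = fetch_articles("semiconductors", "english", "2020-01-01")
print(f"Collected {len(articles)} structured records")
```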
| Problem | How Webz.io structured historical web data feeds help |
| --- | --- |
| Difficult to collect and prepare the massive amount of historical web data needed to train and scale an LLM | You need massive volumes of quality historical web data to train and scale an LLM. We provide large-scale historical web data in multiple languages through a RESTful API. Nearly all of our archived web data goes back to 2008 and has already been cleaned, structured, and enriched. You only need to plug the API into your application or system. Our data is available in pre-defined verticals, including news, blogs, forums, and reviews. |
| Difficult to sort or customize the data | You can easily sort our historical datasets by topic, organization, time frame, social engagement (e.g., likes, shares), domain, country, and more. |
| Junk data, inconsistent data, missing values, incorrect data types | We've already cleaned and structured the data, so you don't have to spend time fixing these issues. |
| Noisy data | We only crawl useful data sites, so you won't see offensive content, data from spam websites, raw HTML and code, boilerplate text, or Lorem Ipsum text. You don't have to spend time removing unwanted content. |
| Data outliers | In most cases, we clean the data to eliminate outliers, and you can opt for higher accuracy if required. |
| Duplicate data | We only index the same URL once; however, the same text can still appear on different domains. If you use our data from online discussion forums, you can apply a filter to exclude duplicated text from original posts that sometimes appears in responses on the same thread. Note: if you use Webz.io alongside other sources, you'll need to de-duplicate on your end. |
| Irrelevant data | You can filter the data by specific criteria, e.g., language, location, keywords, and sentiment. Filtering helps improve the relevance and quality of the data. |
| Not representative (bias) | We provide archived web data from domains worldwide. It's easier to reduce bias in your model when you use clean data from more domains and more diverse datasets. |
Optimized data preprocessing means better LLM training results
The messier and more flawed your historical web data, the more difficult and time-consuming data preprocessing becomes. There is also a greater risk that you will wind up with training data that leads to inaccurate or unexpected outputs from the model. Structured historical web data lets you speed up and improve data preprocessing, which means better training data and a higher-performing LLM.
Interested in learning more about how structured historical web data can help you optimize data preprocessing for your LLM? Schedule a chat with one of our web data experts.