Web data has become a crucial resource for many companies, and the need to leverage it grew significantly in 2022. Companies across industries — finance, risk intelligence, manufacturing, and security, to name a few — have realized the benefits of incorporating web data into applications, platforms, and models. That realization will soon expand to many more industries and companies.
As a leading web data provider in the field since 2016, we live and breathe web data trends. We constantly look for new and emerging data trends from across the open, deep, and dark web. We also see how the use of web data continues to evolve, and we predict exciting developments for web data in the years to come, starting in 2023:
Prediction #1: More businesses will link web data with AI models
AI technologies have long relied heavily on web data for model training and fine-tuning. We’ve recently seen a lot of hype surrounding AI models like ChatGPT and DALL-E. We expect that in 2023, many more companies will link web data with AI models like these. Linking noise-free, low-latency web data with AI models enables them to provide up-to-date insights quickly and to produce better, more relevant content.
Web data allows AI models to analyze and better understand the online world, which helps businesses make better decisions because they can gain insights into consumer behavior and market trends. For example, companies could build web data-driven AI applications that track a company’s brand sentiment, identify its most popular content across platforms, and make sales predictions. ChatGPT uses textual web data, but AI models can take advantage of other types of web data. For example, OpenAI’s Whisper automatic speech recognition (ASR) system was trained on data from audio files, and Nvidia’s eDiff-I text-to-image model was trained on image data. DALL-E uses a combination of text and image data.
Prediction #2: We will see far more annotated data
Annotations make it easier for search engine crawlers to give structure to web page content. They essentially tell the search system how to “understand” the meaning of a particular value. For example, an annotation can tell the crawler whether a date is an “article publication date” or a number is a “salary.” When Google and Microsoft talk about annotated data, they mean structured data, which can be written in several formats, such as JSON-LD, Microdata, and RDFa. The movie database site IMDb, for example, uses structured data to help Google better understand the structure of the text on its page about the movie Terminator 2: Judgment Day.
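To make the idea concrete, here is a minimal sketch of what JSON-LD markup for a movie page could look like, built in Python. It uses the schema.org vocabulary, but the field values and the helper name `to_json_ld_script` are assumptions for illustration. This is not IMDb's actual markup.

```python
import json

# Illustrative values only: this is NOT IMDb's actual markup.
movie = {
    "@context": "https://schema.org",
    "@type": "Movie",
    "name": "Terminator 2: Judgment Day",
    "datePublished": "1991-07-03",
    "director": {"@type": "Person", "name": "James Cameron"},
}


def to_json_ld_script(data: dict) -> str:
    """Wrap a dict as a JSON-LD <script> tag that can be embedded in a page's HTML."""
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


print(to_json_ld_script(movie))
```

Each key tells a crawler what the value means: `datePublished` marks the string as a release date rather than an arbitrary date in the text, and the nested `director` object marks the name as a person.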
Search engines today do more than link to our web pages. They also help users better understand our web page content through features like “rich results” or “visually rich snippets” shown in the search results. Search engine crawlers need annotated data to give structure to the web page content and generate these rich result snippets.
In 2023 and beyond, we anticipate seeing far more annotated data because more companies will want search engines to understand their web page content. More annotated data means that search engines and crawling solutions will better “understand” both the meaning of the data and its structure, which makes the data easier to analyze.
Prediction #3: Companies will increasingly fuse web data types together
Traditionally, companies have used only social media sites and news data to monitor mentions of their brand or products. However, we predict that in 2023 many more companies will start leveraging multiple web data types from across the open, deep, and dark web. For example, a company might need to use data from news, government, and dark web sites to monitor potential threats and update its risk assessment solution accordingly. Analyzing multiple data types adds more dimensions to analysis and brings more profound insights.
In 2023, we expect media intelligence companies to start using web data from different sources, providing their customers a 360-degree view of brand and product mentions. More of their customers will want the ability to keep an eye on all mentions of their brand and products everywhere.
For example, a major technology company might monitor social media, blogs, and forums to discover customers’ thoughts about the brand or specific products. In addition, it could keep an eye on what customers say about its brand on alternative social media platforms so that it becomes aware of threats to the company or to individuals. And for an added layer of protection, the tech company could monitor deep and dark web hacking forums to discover cybersecurity threats against its software products.
By fusing multiple web data types together, companies can gain deeper insights into their customers and see early indications of cybersecurity threats, using this knowledge to protect the business and brand.
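As a rough sketch of what fusing data types can mean in practice, the Python example below merges mentions from three feeds into one timeline and counts them by source type. The `Mention` record, the `fuse` helper, the source labels, and the sample mentions of a fictional "ExampleCo" are all hypothetical and made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Mention:
    source_type: str  # hypothetical labels, e.g. "news", "forum", "dark_web"
    published: date
    text: str


def fuse(*feeds: list[Mention]) -> list[Mention]:
    """Merge any number of feeds into a single timeline, newest first."""
    merged = [mention for feed in feeds for mention in feed]
    return sorted(merged, key=lambda m: m.published, reverse=True)


# Made-up sample data for a fictional company.
news = [Mention("news", date(2023, 1, 5), "ExampleCo launches new product")]
forums = [Mention("forum", date(2023, 1, 7), "ExampleCo app keeps crashing")]
dark_web = [Mention("dark_web", date(2023, 1, 6), "Selling ExampleCo credentials")]

timeline = fuse(news, forums, dark_web)
counts = Counter(m.source_type for m in timeline)

for mention in timeline:
    print(mention.published, mention.source_type, mention.text)
```

Keeping the source type on every record lets one view serve both brand monitoring (news and forum mentions) and threat detection (dark web mentions).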
Prediction #4: More scraping prevention measures
The World Wide Web today consists of several billion pages, many of which provide valuable information to the public. However, not every site owner wants the data in their web pages scraped by third parties. Many ask web crawlers to stay away through robots.txt or block them with scraping prevention systems like Cloudflare, DataDome, or Imperva. Scraping data that is not publicly available (e.g., data behind a login screen) can also be illegal. But a login doesn’t stop bad actors from trying to scrape unauthorized web data.
We predict that in 2023 many more companies will start implementing measures to protect against illegal and unauthorized web scraping. In the future, we expect that fewer reputable sites will be available for scraping, and some web data solutions will resort to evasive scraping techniques by using proxies, automated website login tools, and CAPTCHA-solving solutions. Many companies have already implemented measures to prevent web scraping.
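A well-behaved crawler can check a site's robots.txt rules before fetching a page. The sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `example-crawler` user agent, the `example.com` URLs, and the `is_allowed` helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "example-crawler", "https://example.com/articles/1"))
print(is_allowed(ROBOTS_TXT, "example-crawler", "https://example.com/private/report"))
```

With these rules, the first URL is allowed and the second is not. A real crawler would load the live file with `set_url()` and `read()` instead of parsing a string. Note that robots.txt only states the site owner's wishes, so enforcement depends on the crawler honoring it or on a separate blocking system.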
Technavio estimates that the content delivery network (CDN) security market will grow at a compound annual growth rate (CAGR) of 30.72% from 2022 to 2027. This market growth indicates that more companies want to protect the content on their sites.
While more companies want to protect their data, a great many still allow the use of their data, just not through black hat scraping techniques. Those looking to gather public data from these sites can partner with a web data provider that uses an adaptive crawler and white hat crawling processes (like us!).
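To put that growth rate in perspective, a quick calculation shows what it implies when compounded. This sketch assumes five annual compounding periods between 2022 and 2027, and the `growth_multiple` helper is our own illustration rather than part of Technavio's estimate.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Total growth factor implied by a compound annual growth rate."""
    return (1 + cagr) ** years


# Assumption: 2022 to 2027 is treated as five annual compounding periods.
multiple = growth_multiple(0.3072, 5)
print(f"{multiple:.2f}x")
```

Under that assumption, the market would end the period at roughly 3.8 times its starting size.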
Web data uses will continue to evolve
What do we see for web data in the future? The use of web data in AI models will lead to new and exciting applications, like advanced virtual chat assistants and visual art generators. Many more companies will annotate their data to take advantage of search engine features that benefit their businesses. More organizations will fuse different web data types, bringing brand monitoring and brand protection closer together. And more companies will take steps to protect their data, while others will continue to make their data public and approve of white hat scraping techniques. The ways companies use web data will continue to evolve for many years to come.
Want to learn how to use web data effectively for your business? Contact us today to talk with one of our data experts.