How To Get The Data Your AI Application Needs

Do you remember the saying, “For people who do not know where they are going, any path will take them there”? How things change! Today, it’s “No problem, let AI and machine learning figure it out!” Artificial intelligence has advanced by leaps and bounds in recent years. AI has now beaten the world champion at Go, powers driverless cars, and carries out over two-thirds of all financial transactions in the world.

Smarter algorithms, more suitable programming languages, cheaper hardware, and more abundant data have all come together to create highly fertile conditions, especially for machine learning. The more data a machine learning algorithm has to work with, the better its chances of producing valuable, actionable insights. The web can be a particularly good source of that data: it is growing exponentially, is continually updated, and lends itself to data mining.

Data-First is the AI Way Today

Earlier AI efforts focused on rules and symbol manipulation. For applications like fraud detection, however, it became obvious that convoluted rule-based coding was expensive to develop and maintain, and not necessarily effective. Drawing on a volume of digital data that is doubling every nine months, many of today’s AI initiatives are based on machine learning, with data rather than rules as the starting point. Here’s how it plays out for the different kinds of learning:

  • Supervised learning. In this case, you know what you’re aiming for, for example, “recognize any picture of a dog”, and you want to train your machine learning program to acquire that capability. You’ll need data elements that are labeled as “corresponds” or “does not correspond” (or similar) for your program to train on. You either tag the elements yourself (which may be time-consuming) or acquire suitably structured and ready-tagged data from elsewhere.
  • Reinforcement learning. You declare a desired outcome, like winning a chess game, and let your program try out data and paths (or next events), learning from the results what works and what doesn’t.
  • Unsupervised learning. You don’t yet know what you’re looking for. Your data is not necessarily labeled, although it should still be clean, up-to-date, and machine-readable. Your machine learning program typically uses more sophisticated algorithms to detect patterns and correlations in a dataset.

Winning with Web Data

The web is potentially one of the best data sources available, offering huge amounts of data that are constantly updated. The range of content it covers keeps diversifying, which steadily broadens the relevance of web data. At the same time, sizable amounts of similarly structured data are available, which streamlines ingestion into different machine learning applications.

Successful data extraction at scale then means meeting the following requirements:

  • Web data cleansing. Like other data sources, web data must be accurate and free of unwarranted duplication or anomalies that could compromise its quality and, in turn, the results of machine learning programs.
  • Adaptation to web data pattern changes. Website designs change from time to time, altering data fields. Even smart web data extraction programs have their limits and may need to be updated in step with those changes, to avoid data being missed or misread.
  • Web data renewal. While some AI classifiers rely on looping through already acquired data to predict next steps as a function of past events, new data and updates to existing datasets are needed to improve most machine learning programs.
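
As a rough illustration of the cleansing requirement, here is a minimal Python sketch. The record layout (a "url" and a "text" field) and the function name clean_records are assumptions made for this example, not part of any particular extraction tool. It drops records with missing fields and near-identical duplicates, and keeps the reasons for each rejection. A sudden rise in "missing" rejections can be one hint that a website's layout has changed and the extraction rules need updating.

```python
def clean_records(records, required_fields=("url", "text")):
    """Split records into kept and rejected lists.

    A record is rejected when a required field is missing or empty, or when
    its text duplicates an earlier record after case and whitespace are
    normalized. Field names are hypothetical.
    """
    seen_texts = set()
    kept, rejected = [], []
    for record in records:
        missing = [f for f in required_fields if not record.get(f)]
        if missing:
            rejected.append((record, "missing: " + ", ".join(missing)))
            continue
        # Normalize so that "Great  product" and "great product" match.
        normalized = " ".join(str(record.get("text", "")).lower().split())
        if normalized in seen_texts:
            rejected.append((record, "duplicate"))
            continue
        seen_texts.add(normalized)
        kept.append(record)
    return kept, rejected


# Hypothetical sample data
sample = [
    {"url": "https://example.com/a", "text": "Great product"},
    {"url": "https://example.com/b", "text": "great   product"},
    {"url": "https://example.com/c", "text": ""},
]
kept, rejected = clean_records(sample)
print(len(kept), [reason for _, reason in rejected])
```

Real-world cleansing usually goes further (fuzzy duplicate detection, outlier checks on numeric fields), but the same keep-or-reject structure applies.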

Leave the Heavy Lifting to Someone Else

Web data, like other big data, is still the “heaviest” item in artificial intelligence. Although there are mountains of data available, they are not always easy to move across networks. Ideally, computation should go to the data, rather than all the raw data being brought to the place of computation. Data preparation may also take up to four times as long as the machine learning process itself, while the time to process the data is another potential bottleneck. These constraints often lead to compromises, such as using simpler machine learning classifiers that take less time to train than more complex ones but deliver weaker results.

Fortunately, there are solutions for these issues as well. Making cleansed, structured web data available via an API helps enterprises to avoid shifting data around the web, shortens the time-consuming data preparation phase, and facilitates integration with the machine learning program. Meanwhile, there is continued progress in speeding up learning with more complex classifiers, which lets enterprises get more out of the big web data resources available.

Need Examples?

As an example, you might use the following configuration of web datasets for a machine learning application to distinguish favorable comments and reviews from unfavorable ones, for a given product or service (sentiment classification):

  1. A structured web dataset with a “rating” label (number of stars, for instance) for each comment or review data element, to be used as a training dataset
  2. A testing web dataset to check performance of the machine learning program
  3. A structured web dataset for a live test of the program.

The first two items can come from one overall dataset, split into two parts with 80% used for the training and 20% for the performance checking. A second, separate dataset can then be used for the final test.
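
The 80/20 split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the review records are made up, and the rule that four or more stars counts as "favorable" is a hypothetical threshold chosen for the example.

```python
import random


def stars_to_label(stars, threshold=4):
    """Map a star rating to a sentiment label (assumed rule: >= threshold is favorable)."""
    return "favorable" if stars >= threshold else "unfavorable"


def split_dataset(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records, then split into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical reviews with star ratings from 1 to 5
reviews = [{"text": f"review {i}", "stars": (i % 5) + 1} for i in range(100)]
labeled = [
    {"text": r["text"], "label": stars_to_label(r["stars"])} for r in reviews
]

train, test = split_dataset(labeled)
print(len(train), len(test))
```

Shuffling before splitting matters when the source data is ordered (by date or by rating, for instance), since otherwise the training and testing parts may not resemble each other.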

You can also see a more detailed example in this initiative for detecting fake news using news data, NLP and learning algorithms.

Towards Better, More Accessible AI

In general, the more data you have available to use, the more you can reduce uncertainty in the functioning of your machine learning program, whether for training or for deriving insights. With web data, it also becomes more cost-efficient to improve training or insights by using more data than by slaving over a hot algorithm to try to get better results.

When provided in clean, structured, machine-readable form, web data offers the volume and the diversity that AI needs to home in on consistently successful outcomes, and the convenience to bring AI and machine learning within reach of any organization, whatever its size or sector.

Start building smarter artificial intelligence apps today using our free machine learning datasets.

