The following is an excerpt from our new Web Data Extraction Playbook. We’ll be publishing the second part next week, or you can grab the full guide here.
The internet has become an undeniable force in our lives over the past few decades, changing everything from the way we do our shopping to the way our brains are wired. In recent years, marketing and tech companies have started eyeing the vast troves of online information as a potential data source that can be mined for analytical insights, trends and patterns.
Today, many companies are already seeing actual value from analyzing the massive amounts of web content published daily – whether it’s media monitoring companies looking to follow the discussion around their clients, cyber security providers combing the dark web for criminal activity, or researchers looking for large datasets to train AI and machine learning algorithms.
Want to jump on this bandwagon? Great idea. But before you do that, take a moment to outline your needs and what you hope to accomplish. Luckily, we’ve prepared this checklist to help you do just that. Here are the 11 questions you need to ask before launching a web data initiative.
Building the Use Case
What is the product or service I am trying to deliver?
What type of analyses or reports will I want to generate?
Who are the end users consuming the data?
A general rule of thumb for any type of data analysis is to start with the question(s) you want to answer. Poking at the data in the hope that an insight drops into your lap rarely works; it is wiser to identify the business question first, and then find the best way to approach the data in order to answer it.
The same applies to extracting data from the web: if you don’t know what you’re looking for, you’re never going to find it. Some examples of the types of topics that can be examined through the prism of open web data could include:
- Price fluctuations of products or groups of products on e-commerce websites
- Monitoring news and online discussions to identify trends, sentiments, or mentions of a certain person or entity
- Predicting stock behavior based on information published on the web
Each of these types of analysis poses its own challenges and needs to be approached differently. Hence, you should start by having a clear idea of the product you’re trying to develop, its end users and the ways they will interact with the data you extract.
Finding the Data on the Web
What kind of information are you looking for (text / images / video)?
Where is this information typically published?
How often are these websites refreshed, and how fresh does your data need to be?
Are there any legal or technical requirements preventing you from accessing the data?
This next group of questions relates to the websites you want to extract the data from, and what type of data you’re looking at. Some websites are very easy to access via open APIs or manual crawling; in other cases, it might be very difficult for web crawlers to access the data, or possibly illegal to do so (read more about the legality of web scraping).
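One technical signal worth checking early is a site's robots.txt file, which states what the site owner asks crawlers to avoid. The sketch below uses Python's standard `urllib.robotparser` module on a made-up robots.txt; the rules, the `example-bot` user agent and the example.com URLs are all assumptions for illustration. Passing this check is not a legal determination, and a site's terms of service may add further restrictions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "example-bot" is an assumed user agent name for our crawler.
allowed_public = parser.can_fetch("example-bot", "https://example.com/products/1")
allowed_private = parser.can_fetch("example-bot", "https://example.com/private/report")

# Requested pause between requests, in seconds (None if the file sets none).
crawl_delay = parser.crawl_delay("example-bot")
```

In a real project you would load the live file with `set_url()` and `read()` instead of inlining the text.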
Within this group of requirements, you should also look at how frequently the information is updated, and whether you can settle for a snapshot of the data or must have the most up-to-date version. This brings us right back to your use case – if you’re training an AI agent, you might care more about bulk amounts of historical data; if you’re monitoring the latest news about an organization that is frequently making headlines, you’ll need to pay close attention to refresh frequency.
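The freshness requirement can be written down as a maximum acceptable age per use case. The following is a minimal sketch under assumed numbers: the 15-minute and 30-day thresholds and the timestamps are illustrative, not recommendations.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_fetched: datetime, max_age: timedelta, now: datetime) -> bool:
    """Return True when the stored copy is older than the use case tolerates."""
    return now - last_fetched > max_age


# Fixed, made-up timestamps so the result is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_fetched = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)  # three hours ago

# News monitoring: assume anything older than 15 minutes must be re-fetched.
news_needs_refetch = is_stale(last_fetched, timedelta(minutes=15), now)

# Bulk historical data for training: assume a 30-day-old snapshot is fine.
training_needs_refetch = is_stale(last_fetched, timedelta(days=30), now)
```

The same three-hour-old page is stale for the news monitor and perfectly usable for the training set, which is why the use case has to be settled first.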
Defining the Technical Requirements
Where will the extracted data be stored (cloud, on-premise, external database, etc.)?
How do you intend to query the data?
What is the optimal format for the data (JSON, XML, Excel, schema-less)?
Which analytics, visualization or other software do you intend to use?
Once you’ve understood the business use case or research question you want answered, it’s time to dive into the more technical side of things. This is the place to think about how the data needs to be structured in order to answer the questions you’re asking, and how you will integrate it into your existing technology stack.
Certain analytical queries you want to run might create prerequisites in terms of the data structure, which should be addressed in advance. There might be limitations around file formats and databases stemming from data visualization tools you plan to use. Text analytics and NLP sampling might benefit more from a schema-less data structure, while a SQL database might be a better fit for business intelligence analysis.
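To make the trade-off concrete, here is a small sketch that stores the same extracted records in two ways: as schema-less JSON lines and as rows in a SQL table. It uses only Python's standard library; the records, field names and the `pages` table are hypothetical.

```python
import json
import sqlite3

# Hypothetical extracted records; note that they do not share the same fields.
records = [
    {"url": "https://example.com/a", "title": "Post A", "price": 19.99, "tags": ["sale", "new"]},
    {"url": "https://example.com/b", "title": "Post B", "comments": 12},
]

# Schema-less: every record is kept whole, whatever fields it happens to have.
json_lines = [json.dumps(record) for record in records]

# Relational: columns are fixed up front, so fields outside the schema
# ("tags", "comments") are dropped and missing values become NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO pages (url, title, price) VALUES (?, ?, ?)",
    [(r["url"], r["title"], r.get("price")) for r in records],
)

# The payoff of a fixed schema: aggregate queries are one line of SQL.
avg_price = conn.execute("SELECT AVG(price) FROM pages").fetchone()[0]
priced_count = conn.execute("SELECT COUNT(price) FROM pages").fetchone()[0]
```

The JSON lines preserve everything for later text analysis, while the table gives up the irregular fields in exchange for simple aggregate queries.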
It’s important to start thinking about these things ahead of time, because they can deeply impact the types of tools and techniques you use to extract data from the web. In some cases this won’t be a big deal and you’ll be able to massage the data into whichever format you need after it is extracted, but taking these things into account beforehand can save you a lot of trouble down the road.
Before making any investment in web data extraction, make sure you have a comprehensive understanding of the technical considerations: how you will want the data to be structured, modeled and integrated into your IT infrastructure.
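As an example of the easy case, nested JSON can often be flattened into CSV for a spreadsheet or BI tool with a few lines of standard-library Python. The input below and its field names (`product`, `price`, `seller`) are made up for illustration.

```python
import csv
import io
import json

# Hypothetical extraction output: a JSON array with one nested object per item.
RAW_JSON = """
[
  {"product": "widget", "price": 9.5, "seller": {"name": "Acme"}},
  {"product": "gadget", "price": 12.0, "seller": {"name": "Globex"}}
]
"""


def json_to_csv(json_text: str) -> str:
    """Flatten each item into one row and return the result as CSV text."""
    rows = [
        {
            "product": item["product"],
            "price": item["price"],
            "seller_name": item["seller"]["name"],  # lift the nested field
        }
        for item in json.loads(json_text)
    ]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["product", "price", "seller_name"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = json_to_csv(RAW_JSON)
```

Conversions like this stay simple only while the nesting is shallow and every item has the same fields, which is one more reason to settle the target format early.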
In part 2 we’ll cover the three most common approaches to web data extraction and examine the pros and cons of each. Subscribe to get an email when it’s published, or get direct access to the entire guide by downloading the full Web Data Extraction Playbook. Ready to start working with web data? Check out our API Playground.