With data fully embedded in so much of our daily lives, it feels as though data normalization and the process of preparing data to draw insights should have become standardized and streamlined by now.
But the data community still seems to have a long way to go before that happens. According to a recent survey, data scientists still spend about 45% of their time on data preparation.
At least that appears to be an improvement over the last few years. Kirk Borne, currently the Principal Data Scientist and Executive Advisor at Booz Allen Hamilton, famously stated on Twitter:
“Data Scientists typically spend about 80% of their time preparing and cleaning their data. They spend the other 20% of their time complaining about preparing and cleaning their data.”
Data preparation is still the major challenge for data analysts, and according to Gartner, remains the single biggest inhibitor to data science.
There are a few reasons for this that I’ll explain in this post.
The Many Faces of Data Normalization and Standardization
First, there are now millions of different data sources and formats. These include articles and blogs from different news sites, forums, and review sites, each with its own HTML structure.
That makes it a lot more challenging to normalize data.
Here’s a simple example:
The publication date of a news article on the web can appear in many formats. It can be numerical (e.g., 01/02/2015), textual (e.g., yesterday), or a combination of both (e.g., Jan 1st, 2015). On top of that, separators vary, and a date such as 01/02/2015 can be read as January 2nd or February 1st, depending on whether it follows the American or the European convention.
You see, there are data standards that exist, but there happen to be a lot of them.
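To make the date example concrete, here is a minimal Python sketch that maps the three kinds of input mentioned above to a single ISO 8601 string. The function name `normalize_date`, the `convention` parameter, and the small table of formats are all assumptions made for illustration; a real pipeline would need many more patterns and a way to detect each source's convention.

```python
import re
from datetime import date, datetime, timedelta
from typing import Optional

# Hypothetical format table: the same digits mean different dates
# depending on which convention the source site follows.
NUMERIC_FORMATS = {
    "US": "%m/%d/%Y",  # month first
    "EU": "%d/%m/%Y",  # day first
}


def normalize_date(raw: str, convention: str = "US",
                   today: Optional[date] = None) -> str:
    """Return an ISO 8601 date (YYYY-MM-DD) for a few common input styles."""
    today = today or date.today()
    text = raw.strip()

    # Textual, relative dates such as "yesterday".
    if text.lower() == "today":
        return today.isoformat()
    if text.lower() == "yesterday":
        return (today - timedelta(days=1)).isoformat()

    # Purely numerical dates, interpreted under the given convention.
    try:
        return datetime.strptime(text, NUMERIC_FORMATS[convention]).date().isoformat()
    except ValueError:
        pass

    # Mixed dates such as "Jan 1st, 2015": drop the ordinal suffix first.
    # Note: %b relies on English month names in the active locale.
    cleaned = re.sub(r"\b(\d+)(st|nd|rd|th)\b", r"\1", text)
    return datetime.strptime(cleaned, "%b %d, %Y").date().isoformat()


print(normalize_date("01/02/2015", convention="US"))
print(normalize_date("01/02/2015", convention="EU"))
print(normalize_date("Jan 1st, 2015"))
```

The same string, 01/02/2015, yields two different normalized dates depending on the convention passed in, which is exactly why the convention has to be known per source rather than guessed per value.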
Let’s take a slightly more complex example: Each review site gives scores to organizations. Many of the reviews are given in star ratings. Since each review site may use a slightly different rating scale, the Webz.io Review API standardizes these ratings to a numerical star rating between 1 and 5. This standardization is essential for developing NLP and machine learning models that extract sentiment from these reviews.
Another reason data standardization is challenging is that data has traditionally been collected to be tracked or stored, not analyzed for insights. As a result, there are many different data warehouses and data storage systems, along with data aggregation and data preparation tools.
The idea of gathering data for insights is more recent.
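As a rough illustration of what rescaling ratings can look like, the sketch below linearly maps a score from a site's own scale onto a 1 to 5 range. This is an assumption about one simple approach, not a description of how the Webz.io Review API works internally; the function name `to_five_star` and its parameters are hypothetical.

```python
def to_five_star(score: float, scale_min: float, scale_max: float) -> float:
    """Linearly map a score from [scale_min, scale_max] onto [1, 5]."""
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    if not scale_min <= score <= scale_max:
        raise ValueError("score is outside the source scale")

    # Position of the score within the source scale, from 0.0 to 1.0.
    fraction = (score - scale_min) / (scale_max - scale_min)

    # Stretch that position over the four-point span between 1 and 5.
    return round(1 + 4 * fraction, 2)


print(to_five_star(10, 1, 10))   # top of a 1-10 scale
print(to_five_star(50, 0, 100))  # midpoint of a 0-100 scale
```

A linear map keeps the ordering and relative spacing of scores, but it does not correct for sites whose reviewers use their scale differently, so it is only a starting point.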
Organizations Still Face Major Obstacles in Data Preparation
In addition, organizations need data that is accurate and not too noisy, especially if they want to derive insights from it.
According to IBM, organizations still face three main challenges when trying to analyze data:
- Identifying and ingesting data sources
- Enriching data so it is searchable
- Building queries
On average, these three steps take companies four months to complete.
However, by automating data preparation and normalization, organizations can spend their time focusing on building the predictive, artificial intelligence, machine learning, and natural language processing (NLP) models that are the core of their business.
Delivering High-Quality Data at Scale for the Enterprise
Here at Webz, our goal is to deliver high-quality, structured data to enterprise organizations so they can focus on delivering insights. With this mission, we’ve partnered with IBM’s Watson Discovery News, a global leader in AI-powered search technology, to serve even more organizations than before. IBM clients now have access to Webz’s immense repository of online blogs, discussions, and reviews, crawled and indexed from a diverse range of sources.
With structured, high-quality data, more enterprise organizations are able to solve data normalization and standardization challenges, which translates into better, more accurate insights for your customers. We’re proud to have a front seat in the data preparation that is so critical for these insights.
Want to learn more about how you can access high-quality, structured data at scale from the open and dark web? Contact a data expert today!