Machine Learning in Modern Data Quality Solutions

The world’s data landscape has expanded and evolved constantly over the years. In 1997, when Google was released, the world was using 12,000 petabytes of data (1 petabyte is equal to 1,000,000 gigabytes). Since then, the quantity of data that companies, organizations and governments store and process every day has increased exponentially, and it will continue to do so. To put things into perspective, the International Data Corporation (IDC) estimated that by 2025 the world will be using around 175 zettabytes (1 zettabyte is equal to 1,000,000 petabytes), more than 14,000 times the 1997 figure. Organizations that are not yet data-driven will struggle to compete against more mature companies that are guided by a comprehensive data strategy suited to today’s data challenges.

The sheer amount of data available affects data analytics, as organizations realize that accurate insights require trustworthy data sources. In fact, Gartner predicts that by 2022, 70% of organizations will rigorously track data quality levels via metrics. Since these metrics will validate data-driven decisions, the data quality workstream is shifting from a specialized IT task to a holistic enterprise function.

Nevertheless, the more manual tasks prevail within data quality processes, the higher the risk of slowing critical business activities. Back in 2014, data scientists were describing reactive data quality as “janitor work”. Companies take a great first step when they become proactive with their data quality practices, but there is always room for improvement.

Best-in-class data quality solutions can now provide automation through statistical analysis and embedded AI, reducing the workload of data professionals and allowing them to concentrate on other important aspects of their company’s data efforts. These capabilities will be especially useful for organizations whose data quality programs have reached level 4 or 5 on the Data Management Maturity scale.

Table 1: CMMI's Data Management Maturity Levels


The benefits of leveraging machine learning are clearest when compared with the old method of manually writing and running data quality rules, whose number only grows as existing data sources change and new ones are added. In modern data quality applications, an algorithm takes a first pass at generating candidate data quality rules for the specific data set being analyzed. It generally starts by running basic empty checks and null checks, followed by the generation of statistical profiles. A baseline is usually created from past data, so that the data can be understood over time and data drift can be identified. This effectively allows the solution to continue learning as more data is analyzed, removing the need to manually write and adjust hundreds of rules per dataset.
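
To make the sequence above concrete, here is a minimal Python sketch of the idea: run null and empty checks, build a statistical profile, propose candidate rules from that baseline, and compare a later batch against it to spot drift. It is an illustration under stated assumptions, not how any particular product works. The function names, the order_total column, the sample values, and the three-standard-deviation drift tolerance are all hypothetical.

```python
from statistics import mean, stdev

def is_missing(value):
    """Treat None and blank strings as missing (the null and empty checks)."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def profile_column(values):
    """Build a small statistical profile for one column."""
    present = [v for v in values if not is_missing(v)]
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    profile = {
        "row_count": len(values),
        "missing_rate": (len(values) - len(present)) / len(values) if values else 0.0,
    }
    # Only add numeric statistics when every present value is a number.
    if len(numeric) >= 2 and len(numeric) == len(present):
        profile.update(min=min(numeric), max=max(numeric),
                       mean=mean(numeric), stdev=stdev(numeric))
    return profile

def suggest_rules(name, profile):
    """Turn a baseline profile into candidate rules for a person to review."""
    rules = []
    if profile["missing_rate"] == 0:
        rules.append(f"{name} must not be null or empty")
    if "min" in profile:
        rules.append(f"{name} must be between {profile['min']} and {profile['max']}")
    return rules

def has_drifted(baseline, current, tolerance=3.0):
    """Flag drift when the new mean sits more than `tolerance` baseline
    standard deviations away from the baseline mean (assumed heuristic)."""
    if "mean" not in baseline or "mean" not in current:
        return False
    if baseline["stdev"] == 0:
        return current["mean"] != baseline["mean"]
    return abs(current["mean"] - baseline["mean"]) / baseline["stdev"] > tolerance

# Hypothetical data: last month's order totals form the baseline.
baseline = profile_column([102.0, 98.5, 101.2, 99.8, 100.4])
rules = suggest_rules("order_total", baseline)

# A later batch with much higher values is compared against the baseline.
drifted = has_drifted(baseline, profile_column([151.0, 149.2, 150.7, 152.3, 148.9]))
```

In this sketch the suggested rules are only candidates: a data steward would still accept, reject, or loosen them. The point is that the first draft comes from the data itself rather than from someone writing each rule by hand.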

Another important application of AI in the data quality space is anomaly detection, which works by identifying patterns across millions of rows and columns of data. Once the data is processed, machine learning algorithms detect values that break these patterns and highlight them as outliers for data analysts to review. The analyst then validates each data point, and in doing so teaches the algorithm to better identify patterns in the future.
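
The review loop described above can be sketched in a few lines of Python. This example swaps in a deliberately simple statistical technique, a robust score based on the median and the median absolute deviation (MAD), in place of a trained model, and it represents analyst feedback as a threshold adjustment. The class name, the default threshold of 5.0, the 5% margin, and the sample readings are assumptions made for illustration.

```python
from statistics import median

def robust_scores(values):
    """Score each value by its distance from the median, in units of the
    median absolute deviation (MAD), which a few extreme values cannot skew."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No spread at all: anything different from the median is extreme.
        return [0.0 if v == center else float("inf") for v in values]
    return [abs(v - center) / mad for v in values]

class OutlierReviewer:
    """Flags unusual values and adapts to analyst feedback (hypothetical)."""

    def __init__(self, threshold=5.0):
        self.threshold = threshold

    def flag(self, values):
        """Return (index, value, score) for values scoring above the threshold."""
        scores = robust_scores(values)
        return [(i, v, s) for i, (v, s) in enumerate(zip(values, scores))
                if s > self.threshold]

    def record_feedback(self, score, is_real_anomaly):
        """If the analyst says a flagged value was fine, raise the threshold
        just above its score so similar values are no longer flagged."""
        if not is_real_anomaly and score >= self.threshold:
            self.threshold = score * 1.05

# Hypothetical sensor readings with two values that break the pattern.
readings = [20.1, 19.8, 20.4, 20.0, 19.9, 23.5, 20.2, 95.0]
reviewer = OutlierReviewer()
first_pass = reviewer.flag(readings)        # flags 23.5 and 95.0

# The analyst reviews 23.5 and marks it as a legitimate reading.
_, _, score_of_23_5 = first_pass[0]
reviewer.record_feedback(score_of_23_5, is_real_anomaly=False)
second_pass = reviewer.flag(readings)       # now flags only 95.0
```

A production system would learn from feedback in a far richer way than moving a single threshold, but the shape of the loop is the same: the algorithm proposes, the analyst disposes, and each decision narrows what gets flagged next time.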

Worldwide artificial intelligence (AI) software revenue is forecast to total $62.5 billion in 2022, an increase of 21.3% from 2021, a good indicator that machine learning will become more prevalent in modern data solutions. This includes data quality solutions, which have come a long way from the days of simply running SQL queries on a recurring basis. Supplementing data quality capabilities with machine learning will be critical for businesses that already store and process large amounts of data, and for future-proofing data quality teams that have been playing catch-up with a flood of data quality and remediation requests.

As we recommended in a previous article, there are factors one must consider before investing in data quality software solutions. However, if your company is at a stage where manual data quality tasks have started to become unmanageable, then a machine learning-aided solution like Collibra Data Quality may be a good fit for your problem.

At Kalypso, we are helping companies understand how to set up, leverage, and sustain modern data quality programs. Focusing on machine learning capabilities might help you operationalize data quality faster at scale. If this is an important driver for your company, we would love to talk to you.