The centerpiece of data governance: Making information quality pay off
Best practices for scaling quality
The primary challenge in implementing these data quality steps is doing so at enterprise scale in the era of big data and AI. When data was safely contained on-premises in traditional relational settings, manual approaches to information quality sufficed. Today, however, “the data is exploding and the number of sources is exploding,” Shankar said. In the wake of this inundation of data and the need to extract useful information from it, multiple approaches have emerged to produce information quality at scale. Some of the more useful include the following:
♦ Data integration: Although the transformations for information quality are distinct from those for integrating data (which usually involve ETL or ELT), it’s not uncommon to leverage the same engine for both. In this case, data quality is applied to the data prior to transforming it for integration as part of a larger data engineering process for data warehouses or data lakes. “As the data is being sourced from various systems—ERP, CRM, etc.—as we are bringing that data in, we’re also deploying a data quality check for a data quality transformation before the data hits the data lake,” Ghai said.
♦ Cognitive computing: A multitude of cognitive computing approaches are helpful for scaling data quality checks. Machine learning can create rules from manual input about quality in systems that “present a sample of potential matches to people that know the data well, and ask them if potential matches are true or false matches,” Franco noted. Machine learning can then apply those rules to run matching on millions of records in an automated way, said Franco. In other instances, cognitive computing techniques can support the domain discovery process that initiates information quality work by “scanning data lakes and extracting metadata, the labels over the columns, the schema,” Ghai stated. “We’re also profiling the data at scale: We’re profiling columns; we’re intelligently sampling them to understand the shape of the data. With AI and ML, now we’re able to do intelligent domain discovery. We can tell you what kind of data it is.”
♦ Centralization: The integration and cognitive computing methods above require data to move. Centralization approaches that leverage data virtualization technologies let data stay wherever it is (in its original format, structure, and data model) while being abstracted to a centralized layer for enterprise access. “You have a unified view of the data across the enterprise so all the data gets normalized, standardized with the best quality, and then it’s available to any consuming tool,” Shankar said.
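The column profiling that Ghai describes can be illustrated with a small sketch. This is not any vendor's implementation: it swaps the AI/ML domain discovery for plain regular expressions, and the names `DOMAIN_PATTERNS`, `profile_column`, `profile_table`, and the 0.8 match threshold are all hypothetical choices made for illustration.

```python
import re

# Hypothetical domain patterns (an assumption for this sketch); a production
# system would learn or configure these rather than hard-code them.
DOMAIN_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def profile_column(values, match_threshold=0.8):
    """Return the null rate, distinct count, and a guessed domain for a column."""
    # Treat None and blank strings as missing values.
    non_null = [v.strip() for v in values if v is not None and v.strip()]
    null_rate = 1 - len(non_null) / len(values) if values else 0.0

    guessed = "unknown"
    for name, pattern in DOMAIN_PATTERNS.items():
        hits = sum(1 for v in non_null if pattern.match(v))
        # Accept the first domain that most non-missing values conform to.
        if non_null and hits / len(non_null) >= match_threshold:
            guessed = name
            break

    return {
        "null_rate": null_rate,
        "distinct": len(set(non_null)),
        "domain": guessed,
    }


def profile_table(rows):
    """Profile every column of a list of dict records."""
    columns = list(rows[0].keys()) if rows else []
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Made-up sample records, used only to exercise the sketch.
sample = [
    {"contact": "ana@example.com", "postal": "30301", "joined": "2021-04-01"},
    {"contact": "bo@example.org", "postal": "30301-1234", "joined": "2020-11-15"},
    {"contact": None, "postal": "94105", "joined": ""},
    {"contact": "cy@example.net", "postal": "10001", "joined": "2019-07-30"},
]

report = profile_table(sample)
for column, stats in report.items():
    print(column, stats)
```

Even with generic column names such as `contact` or `postal`, the profile reports what kind of data each column appears to hold and how much of it is missing, which is the information a quality rule needs before it can be applied.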
The capability to scale data quality is far from a purely academic concern. Franco detailed a marketing use case in which a retailer with a database of 10 million customer records targeted buyers ages 18–34 (a fourth of its customers) living within 5 miles of a specific store (approximately 10% of these young adults). With flawless data, the campaign could reach 250,000 customers (10 million × 25% × 10%). But if 20% of the email list’s contact data was not accurate, 50% of the age attribute was undefined, and 20% of the addresses could not be standardized to calculate the distance from the store, it would only have been possible to reach 80,000 customers (250,000 × 80% × 50% × 80%), Franco reasoned.
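Franco's arithmetic can be reproduced in a few lines. The sketch below assumes, as the quoted figures imply, that the three defects are independent, so the usable fractions simply multiply; the function name `reachable_customers` and its parameters are hypothetical.

```python
def reachable_customers(total, segment_share, nearby_share, usable_rates):
    """Audience size after the segment filters and each data quality loss.

    Assumes the defects are independent, so their usable rates multiply.
    """
    audience = total * segment_share * nearby_share
    for rate in usable_rates:
        audience *= rate
    return round(audience)


# Flawless data: 10 million records, 25% aged 18-34, 10% of those within 5 miles.
ideal = reachable_customers(10_000_000, 0.25, 0.10, [])

# Usable fractions: 80% accurate emails, 50% defined ages, 80% standardizable addresses.
actual = reachable_customers(10_000_000, 0.25, 0.10, [0.80, 0.50, 0.80])

print(ideal)   # 250000
print(actual)  # 80000
```

Under these assumptions, only 32% of the intended audience remains reachable, even though no single defect affects more than half of the records.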