
Are You Creating a Data Swamp?

Organizations have invested in “small data” for years, and many have achieved data governance, or at least understand the gap they need to fill. They know how to work with the relatively small number of technologies in play, including databases (standardized on SQL), ETL, data quality (DQ), and BI, ideally all linked with modeling tools and/or a business glossary. These organizations are now embracing the promise of big data: a new frontier akin to the Wild West or the gold rush, with programmers and data scientists let loose on an ever-growing menagerie of languages and technologies outside the normal IT and governance structure. Sometimes this produces genuinely impressive-looking results and insights, especially those supporting marketing.

But decisions based on marketing insights are relatively low-consequence compared with other analytical outputs, such as risk analysis and pricing. Today there are more than 70 vendors in the big data space alone, and the number keeps growing, not to mention a large number of open source technologies, many of which are repackaged by those same vendors. The landscape is not only complex; it is getting more complex every day.

The use of big data technology (such as Hadoop) varies significantly:

Analysis. This in turn can be subdivided into:

Batch (using algorithms such as map-reduce applied to relatively static data);

Dynamic (applied to live event streams such as website clicks or system logs); or

Hybrid (dynamic assisted by static data such as customer information).

Data management (typically called a “data lake”). This provides a highly scalable, “schema-less” holding place for all source data, structured or unstructured, in its native form, without having to pre-design one specific format/schema as a traditional data warehouse requires. The idea is to create schemas dynamically for particular purposes, such as analysis, reporting, or use with traditional tools.
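To make the schema-on-read idea concrete, here is a minimal sketch using PySpark (an assumption; the article names no specific engine), with hypothetical paths and field names. Raw files stay in the lake in their native form, and a purpose-built schema is applied only when a particular report needs it.

```python
# Minimal schema-on-read sketch (PySpark assumed; paths and fields are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Raw click events were landed in the lake as-is; no schema was designed up front.
# For one particular report, a purpose-built schema is applied only at read time.
click_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("ts", TimestampType()),
])

clicks = spark.read.schema(click_schema).json("/lake/raw/web_clicks/")
clicks.createOrReplaceTempView("clicks")
spark.sql("SELECT page, COUNT(*) AS views FROM clicks GROUP BY page").show()
```

The same raw files can be read again tomorrow with a different schema for a different purpose, which is exactly what distinguishes the lake from a pre-designed warehouse schema.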

Is My Data Lake Really a Data Swamp?

To get the desired information, someone needs a basic understanding of where the data resides in order to extract it. In a data lake, massive amounts of data are “thrown” into the lake with little contextual information. No one knows what the files are for, how up to date they are, who is responsible for them, whether they can be used, and so on. Likewise, any “marts” formed out of the data lake need a detailed level of provenance back to the original data sources. Analysts already have that with traditional ETL tools; it is critical that the data lake provide the same capability. For any serious decision made from the analysis, that same level of provenance is needed again, and in fact regulators legally require it.
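As a rough illustration of the provenance a mart should carry back to the lake, here is a minimal record structure; the field names are hypothetical and not drawn from any specific tool.

```python
# Minimal provenance record for a mart built from the lake (illustrative only).
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ProvenanceRecord:
    mart_table: str          # derived table the analyst actually queries
    source_paths: list[str]  # raw lake files it was built from
    transformation: str      # job or script that produced it
    produced_at: datetime    # when the mart was last refreshed
    steward: str             # who to contact if something looks wrong

record = ProvenanceRecord(
    mart_table="risk.exposure_summary",
    source_paths=["/lake/raw/positions/2024-06-30.parquet"],
    transformation="jobs/build_exposure_summary.py",
    produced_at=datetime(2024, 7, 1, 2, 0),
    steward="risk-data-team@example.com",
)
```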

Adding a context/governance layer can provide the following benefits and answer critical questions (a small sketch of one such check follows the list):

Catalog. What information is out there and what can I use? Who do I contact if the information looks wrong or if I need more?

Governance. Who owns and supports what (e.g., files, technologies, analytic programs, results, and data coming into the data lake)? How up to date is the information, and who is using it?

Context/impact. Does the data have consistent meaning (name matches are not sufficient)? What are the technology dependencies and risks?
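The sketch below shows the kind of question such a layer makes answerable. It assumes a simple catalog of entries with a steward and a refresh date; the structure is illustrative, not any vendor's format.

```python
# Flag lake datasets that nobody stewards or that have gone stale -- two of the
# "data swamp" symptoms listed above. Catalog entries are illustrative.
from datetime import datetime, timedelta

catalog = [
    {"name": "raw/web_clicks", "steward": "marketing-data@example.com",
     "last_updated": datetime(2024, 6, 30)},
    {"name": "raw/legacy_dump", "steward": None,
     "last_updated": datetime(2022, 1, 15)},
]

def governance_gaps(entries, now, max_age_days=90):
    """Return dataset names with no steward or no refresh within max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return [e["name"] for e in entries
            if e["steward"] is None or e["last_updated"] < cutoff]

print(governance_gaps(catalog, now=datetime(2024, 7, 1)))
# -> ['raw/legacy_dump']
```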

Adding Governance and Provenance to Big Data

To help prepare organizations for big data initiatives, Adaptive has developed a software solution that ensures consistency and traceability of data across an organization. It helps organizations build the foundation needed to govern and control the data flowing into traditional data warehouses and big data environments and, from there, to drive big data analytics. Adaptive’s solution provides support for three critical capabilities:

Business glossary. A cloud-based, enterprise-wide glossary of an organization’s terms, definitions, and ontologies, providing a single point of truth for governance and knowledge transfer. Users can add stewardship and govern changes to any item in the repository.

Metadata capture. Using Adaptive’s 75+ bridges, users can extract metadata from databases, data modeling tools, ETL tools, BI tools and Hadoop ecosystems. These extracts can be versioned and audited, allowing users to document the full history of an object for regulatory reporting.

Alignment. A comprehensive capability to support traceability and lineage analytics. Even if a piece of data is stored in 42 databases and is being processed 2,000 times, the solution can be used to follow the thread and understand the impact, change and use of data throughout the enterprise.
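Lineage analytics of this kind is, at its core, a traversal of a directed graph of dependencies. The sketch below is a generic illustration of that idea, not Adaptive's implementation; the dataset names are made up.

```python
# Generic lineage graph: edges run from a source to everything derived from it.
from collections import defaultdict, deque

lineage = defaultdict(set)
lineage["crm.customer"].update({"lake/raw/customers", "dw.dim_customer"})
lineage["lake/raw/customers"].add("mart.churn_features")
lineage["mart.churn_features"].add("report.churn_dashboard")

def downstream_impact(start, graph):
    """Breadth-first walk: everything affected if `start` changes."""
    seen, queue = set(), deque([start])
    while queue:
        for child in graph.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(downstream_impact("crm.customer", lineage)))
# -> ['dw.dim_customer', 'lake/raw/customers', 'mart.churn_features', 'report.churn_dashboard']
```

Running the same walk in the opposite direction, from a derived artifact back to its sources, gives the provenance view that analysts and regulators ask for.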

Adaptive’s offerings are built on industry standards—providing easier interoperability with different types of technologies. “Adaptive can discover data structures from a variety of technology landscapes,” says Jeff Goins, president and CEO of Adaptive. “Many organizations are silo-centric in nature, and Adaptive’s expertise connects all the silos, providing enterprise-wide transparency and governance management practices that are critical to Big Data initiatives.”

A Client Success Story

Adaptive’s solutions are used by organizations in the banking, pharmaceutical, healthcare, retail, government, and energy sectors. Although the business drivers vary across the client base, each firm is striving to ensure the accuracy and consistency of disparate data sources to support its big data initiatives. For example, a large financial services organization was investing more than $10 million in a new data lake to streamline the reporting process and provide more “BI on demand” capabilities to its business stakeholders. The organization was importing thousands of artifacts into the lake and needed a solution to provide governance and provenance for the data within it. Adaptive was selected to provide three critical capabilities:

Catalog. An analyst first uses Adaptive to understand what data is available for reporting;

Governance. Analysts can confirm the data they are about to use for reporting is under change control and can verify that all data in the report has a steward assigned; and

Lineage/impact. Users can trace their reporting data all the way back to its source, ensuring they are using the correct data in their analysis.

The client can now analyze governed data faster and more reliably than ever before. They have the best of both worlds: a data lake for faster, on-demand reporting, and the governance the business requires for precise decision making and adherence to compliance regulations.
