
Data lakes are ideal solutions for organizations that prioritize data in their operations strategy.
Clean data adds value to the overall process and enables AI to reorganize existing processes and extract new patterns, supporting both existing and new revenue streams.
Huge amounts of clean data are made available to every participant in the company's value chain, and both the data sets and their security are guaranteed at all stages.
Secure data sharing is a crucial factor when multiple teams require access to enterprise data. These steps can’t be left to improvisation: there needs to be a consistent approach.
Before diving into the article, check the programme of this year's Data and AI Forum Italia, hosted on May 18. Three sessions of 45 minutes each focus on data as a key element of digital transformation in corporate environments: data governance, the transition from legacy to modern applications, data protection, distributed and edge data management, process mining, and AI infusion.
Data: Lakes or Swamps?
An unmanaged repository of large data sets can easily become a swamp of badly biased data.
To help manage this element of innovation, organizations need to govern their data lake. Raw structured and unstructured data – trusted, secured, and governed – are kept in the lake for as long as necessary. An arrangement of this kind is known as a “Governed Data Lake”.
For organizations that derive value from their data, including data about customers, employees, transactions, and other assets, governed data lakes create opportunities to identify, understand, share and confidently act upon information.
Governing data
A governed data lake contains clean, relevant data from structured and unstructured sources that can easily be found, accessed, managed, and protected. Organization-relevant data reside on a security-rich and reliable platform.
Data that comes into your data lake is properly cleaned, classified, and protected through timely, controlled data feeds that populate the lake with reliable information assets and document them with metadata.
Simply dumping data into a data platform won’t accelerate your analytics efforts on its own. Without appropriate governance or quality control, data lakes can quickly turn into unmanageable data swamps.
Data users know that the data they need lives in these swamps, but without a clear data governance strategy they won’t be able to find it, trust it, or use it.
A complete representation of the architecture of data lake governance can be found in the following image:
Data stakeholders
Users who consume data from the data lake vary in key ways. Understanding the difference between their approaches to data is an important aspect of successful governance.
With reference to the schematic organization of a governed data lake, the four categories of data lake users are: data stewards, analytics teams, governance-risk-compliance teams, and line-of-business teams.
Data stewards
Data stewards optimize data quality and prepare ETL jobs, i.e., extract, transform, and load processes, work normally assigned to software engineers with specific skills.
Stewards also catalog data and manage metadata. In a nutshell, data stewards strike the balance between making data usable and keeping it protected and private.
Analytics teams
These teams are made up of data scientists who manage data and build machine-learning models. With the help of analytics developers, those models are turned into applications: thanks to developers, analytics applications are incorporated into operational systems.
Governance, risk, and compliance teams
Data governance specialists build data governance and security policies, protect data to enforce privacy controls across all processes, compile retention, archival, and disposal requirements, and ensure that data complies with policy and regulations.
Line of business teams
Line-of-business (LOB) executives such as CMOs, CFOs, or CHROs belong to this category of data lake users. Chief data officers are emerging as business owners of data, while LOB executives implement systems for specific business outcomes or actionable insights.
Building blocks of a governed data lake
A governed data lake is a reference architecture, independent of any specific technology, that includes governance and management processes. It's not Hadoop or a generic enterprise data warehouse that you can simply buy or replace.
A governed data lake is an on-premises or cloud-based solution for organizations that want to put data at the core of their operations. The building blocks of a governed data lake include the following four elements: Enterprise IT data exchange, Catalog, Governance, and Self-service access.
Data Exchange
Enterprise IT data exchange can extract, analyze, refine, transform, and exchange data between data lakes and enterprise IT systems, moving it from data puddles to data lakes. The system cleanses data and monitors data quality on an ongoing basis.
Catalog
Catalog services describe the data in the data lake—what it means, how it’s classified and the resulting governance requirements this places on the data.
Governance
Governance services apply appropriate policies, security, data quality, and privacy measures to the data stored in the lake.
Self-service
Self-service access consists of three sets of services that provide on-demand access to the data lake. Self-service access for analytics users allows access to raw data as it’s stored. For LOB teams, the service provides normalized data in simplified data structures.
For governance, risk, and compliance teams, the service provides governed data for audits.
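To make the catalog and governance blocks a little more concrete, here is a minimal Python sketch of how a catalog entry might link an asset's classification to the governance requirements it inherits. The class, the policy table, and the asset names are invented for illustration; this is not the Watson Knowledge Catalog API.

```python
from dataclasses import dataclass, field

# Hypothetical governance policies keyed by data classification.
# In a real governed data lake these are defined and enforced centrally.
POLICIES = {
    "PII":       {"encrypt": True,  "retention_days": 365,  "self_service": False},
    "financial": {"encrypt": True,  "retention_days": 2555, "self_service": True},
    "public":    {"encrypt": False, "retention_days": 90,   "self_service": True},
}

@dataclass
class CatalogEntry:
    """One asset in the data lake catalog: what it is, what it means,
    how it is classified, and who is responsible for it."""
    name: str
    description: str
    classification: str
    steward: str
    tags: list = field(default_factory=list)

    def governance_requirements(self) -> dict:
        # The classification drives the policy that consumers must respect.
        return POLICIES[self.classification]

# Example: register a customer table and read back its obligations.
entry = CatalogEntry(
    name="crm.customers",
    description="Customer master data from the CRM system",
    classification="PII",
    steward="data.steward@example.com",
    tags=["crm", "customer", "governed"],
)
print(entry.governance_requirements())
# {'encrypt': True, 'retention_days': 365, 'self_service': False}
```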
Data ingestion
Ingestion is the process of extracting, transforming, quality-checking, and exchanging data between the data lake, the enterprise's IT systems, and other existing data lakes. Much of the data in a data lake comes from an organization's IT systems.
Data sources can be the systems that operate the business, website logs, internal data (about customers, employees, transactions, and other assets), or other sources that monitor activity. These data types can be structured, semi-structured, or unstructured.
Much of this data is part of what is called “dark data”, an evocative expression that encompasses all data whose value can't be properly extracted or classified with classical analysis approaches.
More recent data sources come from what is collectively called the IoT, the internet of things: a brave new world of countless small chunks of non-standard raw data.
Dark data also includes all data from industrial – and more generally IoT-based – devices that is not correctly processed and stored.
A scalability problem in data transfer, management, and storage can easily arise. To avoid this class of risks, IBM offers scalability in terms of both the volume of data and the richness of transformation and replication.
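As an illustration of a single controlled feed, the sketch below walks through one hypothetical extract, transform, quality-check, and load step in Python with pandas. The file paths, column names, and quality threshold are assumptions made for the example, not part of any IBM product.

```python
import pandas as pd

RAW_PATH = "landing/transactions.csv"    # hypothetical export from an IT system
LAKE_PATH = "lake/transactions.parquet"  # hypothetical governed zone of the lake

def extract(path: str) -> pd.DataFrame:
    """Pull raw data from an enterprise IT system export."""
    return pd.read_csv(path, parse_dates=["timestamp"])

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Cleanse: drop duplicate records and normalize currency codes."""
    df = df.drop_duplicates(subset="transaction_id")
    df["currency"] = df["currency"].str.upper().fillna("EUR")
    return df

def quality_check(df: pd.DataFrame, max_null_ratio: float = 0.05) -> None:
    """Reject feeds whose null ratio exceeds the agreed threshold."""
    null_ratio = df["amount"].isna().mean()
    if null_ratio > max_null_ratio:
        raise ValueError(f"Feed rejected: {null_ratio:.1%} null amounts")

def load(df: pd.DataFrame, path: str) -> None:
    """Write the cleansed, checked feed to the governed zone of the lake."""
    df.to_parquet(path, index=False)

if __name__ == "__main__":
    data = extract(RAW_PATH)
    data = transform(data)
    quality_check(data)
    load(data, LAKE_PATH)
```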
Predictive analytics is a key milestone
Once your data sets are cleanly and securely recorded and managed in your governed data lake, it's time to use them to feed the cascading analytics engines. The capabilities of AI-based systems can be exploited both to reorganize your data and existing processes and to generate new ideas for future processes and business lines.
Let’s delve deeper into the analytics we see in action today. Modern predictive analytics can empower your business to augment historical data with real-time insights, and then to harness this to predict and shape your future.
Predictive analytics is a key milestone on the analytics journey, a point of confluence where classical statistical analysis techniques meet the new world of artificial intelligence (AI).
IBM offers a prescriptive analytics solution, based on machine-learning techniques, that enables highly data-intensive industries to make better decisions and achieve business goals by solving complex optimization problems. It is this ability that has led to the solution being dubbed “Decision Optimization”.
During a decision process, even if you don’t know the ‘right’ answer, you already know a lot about what constitutes a good or bad answer. Taking the output from your machine learning, you can specify an action for Decision Optimization to take, which can include optimization rules and constraints to achieve business goals.
Decision Optimization returns answers that will deliver value to the business, such as actionable items and recommendations for change.
Business leaders use this tool to make their use of resources more efficient. Some activities that have already proven the value of Decision Optimization include inventory flow for the supply chain, workforce scheduling, and routing of transportation, and more are added to the list every day.
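To illustrate the pattern described above, a machine-learning prediction feeding rules and constraints that return an actionable plan, here is a small sketch that uses the open-source scipy linear-programming solver rather than Decision Optimization itself. The demand forecast, warehouse capacities, and shipping costs are invented numbers.

```python
import numpy as np
from scipy.optimize import linprog

# Step 1: the "prediction". In practice next week's demand per region would
# come from a machine-learning model; here the figures are invented.
predicted_demand = {"north": 120.0, "south": 80.0}

# Step 2: the prescriptive layer. Decide how much to ship from each warehouse
# to each region at minimum cost, subject to the business constraints.
warehouses = ["milan", "rome"]                 # hypothetical facilities
regions = list(predicted_demand)
capacity = {"milan": 150.0, "rome": 100.0}     # units each warehouse can ship
cost = {                                       # shipping cost per unit (invented)
    ("milan", "north"): 1.0, ("milan", "south"): 3.0,
    ("rome", "north"): 2.5, ("rome", "south"): 1.2,
}

# One decision variable per (warehouse, region) pair: the quantity to ship.
pairs = [(w, r) for w in warehouses for r in regions]
c = np.array([cost[p] for p in pairs])

A_ub, b_ub = [], []
# Meet the predicted demand in every region: -sum_w x[w, r] <= -demand[r]
for r in regions:
    A_ub.append([-1.0 if p[1] == r else 0.0 for p in pairs])
    b_ub.append(-predicted_demand[r])
# Respect each warehouse's capacity: sum_r x[w, r] <= capacity[w]
for w in warehouses:
    A_ub.append([1.0 if p[0] == w else 0.0 for p in pairs])
    b_ub.append(capacity[w])

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(pairs))
for p, qty in zip(pairs, result.x):
    print(f"ship {qty:6.1f} units {p[0]} -> {p[1]}")
```

The output is exactly the kind of actionable recommendation mentioned above: a concrete shipping plan that satisfies the forecast demand at the lowest total cost.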
The tipping point for AI adoption
Clean data lakes can now feed your analytical engines. It's time to embrace the new AI-based tools, but where and when should this happen? The situation has reached a tipping point – for the first time, organizations of all sizes can undertake the following activities:
- Embed predictive analytics into their business processes;
- Harness AI at scale;
- Extract value from previously unexplored “dark data” (big data, data of non-uniform size or type, or data gathered through unusual extraction techniques).
Building an enterprise data science platform will allow your organization to gain a significant competitive advantage.
Watson, the IBM Portfolio for AI
The IBM Portfolio for AI is richly modular and covers all stages of an AI strategy that is predictive and, where necessary, proactive. The IBM portfolio of pre-built AI services takes its name from the Watson application suite.
Created to work on IBM Cloud Private for Data, the truly multi-cloud data platform, Watson is the IBM solution portfolio that exploits the full potential of AI and ML algorithms in a business-ready approach.
It has four building blocks:
- Build – Watson Studio
- Deploy – Watson Machine Learning
- Manage – Watson OpenScale
- Catalog – Watson Knowledge Catalog
Watson Studio
Watson Studio builds and deploys models for online, batch, or streaming deployments. It finds data, prepares it, and develops the complete model.
Watson Machine Learning
Watson Machine Learning optimizes the first model, improving its performance and mitigating emerging cognitive biases that could make the model less effective. WML manages the base model, updates data, and refines processes, allowing the machine-learning model to be deployed.
Watson OpenScale
Watson OpenScale is the manager of the entire structure. It continuously evolves models with fairness and explainability built in, extracting and monitoring all production metrics and the chosen business KPIs. WOS monitors and orchestrates the models served by the other two Watson components, Studio and Machine Learning.
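To give a flavour of what “fairness built in” can mean in practice, the sketch below computes one metric that monitoring tools of this kind typically track: the disparate-impact ratio between a monitored group and a reference group. It is a simplified illustration with invented scoring data, not the Watson OpenScale API.

```python
def disparate_impact(predictions, groups, favorable=1,
                     monitored="female", reference="male"):
    """Ratio of favorable-outcome rates: monitored group vs. reference group.
    A value well below 1.0 suggests the model may be treating the monitored
    group unfavorably and should be investigated."""
    def favorable_rate(group):
        selected = [p for p, g in zip(predictions, groups) if g == group]
        return sum(1 for p in selected if p == favorable) / len(selected)
    return favorable_rate(monitored) / favorable_rate(reference)

# Invented production scoring log: model outputs plus the protected attribute.
preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
groups = ["female", "female", "male", "male", "female",
          "male", "female", "male", "male", "female"]

ratio = disparate_impact(preds, groups)
print(f"disparate impact ratio: {ratio:.2f}")  # alert if it drifts below ~0.8
```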
Watson Knowledge Catalog
Watson Knowledge Catalog is a data catalog tightly integrated with an enterprise data governance platform. A data catalog allows data citizens to easily find, organize and understand the data they need.
Watson Knowledge Catalog enables business users to locate, manage, categorize, and share data assets, datasets, analytical models and their relationships with other members of the organization.
Serving as a single source of truth for data engineers, data stewards, data scientists and business analysts to gain self-service access to data they can trust, Watson Knowledge Catalog is based on IBM Cloud Pak for Data.
IBM Cloud Pak for Data
Cloud Pak for Data is a bridge to AI analytics through clean data. It offers an integrated end-to-end platform for high-performance analytics that enables enterprises to achieve their data maturity goals.
This solution allows critical data to remain protected by a private firewall, while still being accessible from cloud-based applications in order to generate new insights.
Using Kubernetes, Cloud Pak for Data customers can provision new infrastructure in minutes. The platform's in-memory database can ingest more than 1 million events per second.
Cloud Pak for Data minimizes the time and expense required to create meaningful insights while expanding analytics capabilities. To successfully adopt machine learning and AI, organizations must be able to put their trust in meaningful and reliable information.
The disparate data must be in a consistent format and be organized in a single access point for maximum value. With Cloud Pak for Data, you can move from raw data to reliable data.
The Event
All these subjects will be discussed in depth during the Data and AI Forum Italia event hosted on May 18. Three sessions of 45 minutes each will explore the subject in full.
The first and last sessions will be plenaries, while the central slot will offer six parallel sessions on how to modernize, secure, and automate your IT, automate your business, make predictions from data, and use AI in prediction.
Recommended reads:
Enabling the Data Lakehouse
Actor Model for Managing Big Data