Overview of Data Lakes
Defining Big Data
Big Data is a broad term for data sets so large or complex that traditional tools and solutions are inadequate for processing and analyzing them.
The Characteristics of Big Data: The Four V’s
Volume
- Extremely large volumes of data.
- Data is increasing at a rapid rate.
- From terabytes to petabytes of data.
Variety
- Diverse data sets, multiple sources.
- Most sources are in the Cloud.
- ‘Legacy’ systems are still present.
- Various forms of data: structured, semi-structured, and unstructured.
Velocity
- Increased speed of users, devices, and applications.
- 75 billion connected devices by 2020.
- MB/s is normal, GB/s is common.
- One million transactions per second.
- Processed in real time or in batch.
Veracity
- Data reliability.
- Inherent differences in all collected data.
- Inconsistent, sometimes inaccurate, data that varies.
The Evolution Of Data Analysis
Why did “X” happen? Descriptive Analysis uses data aggregation and data mining techniques to provide insight into the past.
What is the probability that “X” happens? Predictive Analysis uses statistical modeling and forecasting technologies to understand what might happen in the future.
What should be done if “X” happens? Prescriptive Analysis uses optimization and simulation algorithms to assess possible outcomes and answer “What should be done?”
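The three kinds of analysis can be illustrated on a toy data set. The following is a minimal sketch, not a recommended method: the monthly sales figures, the straight-line trend model, the prices, and the stocking options are all hypothetical and chosen only to show how each step builds on the previous one.

```python
from statistics import mean

# Hypothetical monthly unit sales for months 1..6 (illustration only).
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 120, 130, 140, 150]

# Descriptive: aggregate past data to summarize what happened.
average_sales = mean(sales)


# Predictive: fit a least-squares line and forecast the next month.
def fit_line(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(months, sales)
forecast = slope * 7 + intercept  # expected demand in month 7


# Prescriptive: evaluate each candidate action against the forecast
# and pick the one with the highest simulated profit.
def profit(stock, demand, price=5.0, unit_cost=3.0):
    return price * min(stock, demand) - unit_cost * stock


options = [140, 160, 180]  # hypothetical stocking levels
best_stock = max(options, key=lambda stock: profit(stock, forecast))

print(average_sales, forecast, best_stock)
```

The descriptive step only looks backward, the predictive step extrapolates, and the prescriptive step turns the forecast into a decision by comparing alternatives.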
Why Every Company Needs a Data Strategy
There is more data than people think:
- Data grows > 10x every 5 years.
- Data platform needs to live for 15 years
There are more consumers accessing data:
- Data Scientists, Data Engineers, Data Product Managers, Data Visualizers, Business Users, Analysts, Applications, Developers.
And more requirements for making data available:
- Secure, Real time, Flexible, Scalable.
Source: IDC, Data Age 2025: The Evolution of Data to Life-Critical. Don’t Focus on Big Data; Focus on the Data That’s Big. April 2017
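The two growth figures above compound. As a rough worked example, assuming the growth rate stays constant at 10x every 5 years over the platform's 15-year life (a simplification, and the starting volume is hypothetical), the platform would have to hold about 1,000 times its initial data by the end:

```python
def growth_factor(years, factor=10.0, period_years=5.0):
    """Compound growth multiplier after `years`, given `factor` growth per period."""
    return factor ** (years / period_years)


start_tb = 50  # hypothetical starting volume in terabytes

for year in (5, 10, 15):
    projected = start_tb * growth_factor(year)
    print(f"year {year}: about {projected:,.0f} TB")
```

Since the source states growth of more than 10x, these numbers are a lower bound under that assumption.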
"The world's most valuable resource is no longer oil, but data".
Source: The Economist, 2017
Data as a Strategic Asset
- Collect and retain all data
- Turn data into insights
- Make data available to intended users and customers
- Create new products and services
- Invest in data processing technologies
Data as a Differentiator
Organizations that successfully generate business value from their data outperform their peers.
They are able to:
- Identify and act on opportunities
- Attract and retain customers
- Boost productivity
- Proactively maintain devices
- Make informed decisions
Source: Aberdeen, Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
What is Dark Data?
Source: Datumize / Factor Daily
In this age of technology-driven enlightenment, data is our competitive currency.
The raw information, buried in the mind-blowing volumes generated by transactional systems … are critical operational, customer and strategic insights that, once illuminated by analysis, can validate or clarify assumptions, inform decision-making, and help map out new paths to the future.
– Tracie Kambies, Nitin Mittal, Paul Roma, Sandeep Kumar Sharma
Tech Trends 2017, from https://www2.deloitte.com/content/dam/Deloitte/au/Documents/technology/deloitte-au-technology-dark-analytics-061017.pdf
Leak or loss of sensitive information or Personally Identifiable Information (PII)
Intellectual Property Risk:
Failure to protect Intellectual Property
Missing opportunities for improvement
Journey to a Modern Data Architecture
At Morris & Opazo, we help you innovate and gain value from your data.
Our clients usually need technical and strategic help migrating on-premises workloads to the AWS Cloud. They:
- Are overwhelmed by the exponential growth of data.
- Need guidance and roadmaps for storing and managing data.
- Need advice and solutions to help extract and visualize data insights.
To help our clients succeed, Morris & Opazo:
- Engages them with a top-down approach
- Becomes a strategic ally
- Focuses on creating solutions
Challenges of on-premises data warehouses
- Cost of scalability
- Long implementation cycles and high failure rates
- Failure to adapt to new technologies
- Proprietary data formats
- Governance and control issues
- Cost of maintenance
Top Areas with Negative Impact on Data Analytics Strategies
Top Goals for Using a Data Lake
Source: Enterprise Strategy Group
What is a Data Lake?
A Data Lake is a centralized repository that allows you to store:
- Any Data
- At any Scale
- At a Low Cost
What is NOT a Data Lake?
- It is not a database (OLTP).
- It is not a data warehouse (OLAP).
- It is not a product.
- It is not the property of any one team.
- It is not Hadoop.
- It does not replace other data stores.
Data Lakes compared to Data Warehouses
|Characteristics||Data Warehouse||Data Lake|
|Data||Relational from transactional systems, operational databases, and line of business applications||Non-relational and relational from IoT devices, web sites, mobile apps, social media, and corporate applications|
|Schema||Designed prior to the DW implementation (schema-on-write)||Written at the time of analysis (schema-on-read)|
|Price/Performance||Fastest query results using higher cost storage||Query results getting faster using low-cost storage|
|Data Quality||Highly curated data that serves as the central version of the truth||Any data that may or may not be curated (i.e., raw data)|
|Users||Business analysts||Data scientists, Data developers, and Business analysts (using curated data)|
|Analytics||Batch reporting, BI and visualizations||Machine Learning, Predictive analytics, data discovery and profiling|
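The Schema row of the table is the difference that most affects day-to-day work. The sketch below illustrates schema-on-read under simple assumptions: the raw records are hypothetical JSON lines, and `read_with_schema` is an illustrative helper, not part of any library. The raw data is stored untouched, and each analysis decides at read time which fields it needs and how to cast them.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample records).
raw_events = [
    '{"device": "pump-1", "temp_c": 71.5, "ts": "2020-01-01T00:00:00Z"}',
    '{"device": "pump-2", "temp_c": "68", "firmware": "1.2"}',
    '{"user": "alice", "action": "login"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep records that have every requested
    field, project those fields, and cast each one to the requested type."""
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# Two analyses read the same raw data with two different schemas.
temperature_schema = {"device": str, "temp_c": float}
audit_schema = {"user": str, "action": str}

temperatures = list(read_with_schema(raw_events, temperature_schema))
logins = list(read_with_schema(raw_events, audit_schema))

print(temperatures)
print(logins)
```

With schema-on-write, the third record would have been rejected or forced into the temperature table's shape at load time; here it simply stays in storage until an analysis that understands it comes along.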
Data Access Characteristics
|Characteristic||Hot data||Warm data||Cold data|
|Volume||MB – GB||GB – TB||PB|
|Item Size||B – KB||KB – MB||KB – TB|
|Latency||ms||sec||min, hrs|
|Durability||Low – High||High||Very High|
|Request rate||Very High||High||Low|
|Cost / GB||$$-$||$-¢¢||¢|
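The table can be read as a decision aid. The sketch below assumes the three value columns correspond to what are commonly called hot, warm, and cold data (those labels are an assumption here), and uses only the row that lists response times (ms, sec, min/hrs) to suggest a tier. The one-second and one-minute cut-offs are illustrative, not prescribed by the table.

```python
def suggest_tier(required_latency_seconds):
    """Suggest a storage tier from the response time a workload can tolerate."""
    if required_latency_seconds < 1:
        return "hot"   # millisecond access, highest cost per GB
    if required_latency_seconds < 60:
        return "warm"  # access within seconds
    return "cold"      # minutes to hours is acceptable, lowest cost per GB


for latency in (0.005, 5, 3600):
    print(latency, "->", suggest_tier(latency))
```

A real placement decision would also weigh the other rows, such as request rate and item size.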
The Data Lake Approach
Challenges in the Management of Data
Customers are challenged to:
- Collect a variety of data types that accumulate at varying velocities.
- Collect data from numerous sources.
- Store massive amounts of data without running out of space.
- Cleanse data and improve its quality so that it can be analyzed.
Can they automate these steps?
Basic Principle of Data Lake
Separating your Storage and Compute allows you to scale each component as required
Concept of a Data Lake
- All data in one place, a single source of truth.
- Stores in native format.
- Handles structured and unstructured data.
- Supports fast ingestion and consumption.
- Schema on read.
- Designed for low-cost storage.
- Supports protection and security rules.
- Cloud Object Storage.
- Store everything now so that you can extract insights later
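Storing data in its native format on object storage usually comes down to choosing a consistent key layout. The following sketch shows one possible convention, assuming `key=value` path segments for source and date. The zone name, the source name, and the `object_key` helper are hypothetical and not taken from the original material.

```python
from datetime import datetime, timezone


def object_key(zone, source, event_time, filename):
    """Build a partitioned object-storage key (hypothetical layout).

    The file itself is stored unchanged; only its location encodes
    where it came from and when it was produced.
    """
    return (
        f"{zone}/source={source}/"
        f"year={event_time:%Y}/month={event_time:%m}/day={event_time:%d}/"
        f"{filename}"
    )


key = object_key(
    "raw",
    "crm",
    datetime(2020, 3, 7, tzinfo=timezone.utc),
    "contacts.json",
)
print(key)
```

Keeping the layout predictable is what lets later consumers locate the subset they need without the data having been modeled in advance.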
Key Benefits of Data Lake
The value of a Data Lake
The ability to harness more data from more sources in less time, and to empower users to collaborate and analyze data in different ways, leads to better, faster decision making. Examples where Data Lakes have added value include:
Improve customer interactions
A Data Lake can combine customer data from a CRM platform with social media analytics, a marketing platform that includes buying history, and incident tickets to empower the business to understand the most profitable customer cohort, the cause of customer churn, and the promotions or rewards that will increase loyalty.
Improve R&D innovation choices
A data lake can help your R&D teams test their hypotheses, refine assumptions, and assess results, such as choosing the right materials in your product design to achieve faster performance, doing genomic research that leads to more effective medication, or understanding customers' willingness to pay for different attributes.
Increase operational efficiencies
The Internet of Things (IoT) introduces more ways to collect data on processes like manufacturing, with real-time data coming from internet-connected devices. A data lake makes it easy to store and run analytics on machine-generated IoT data to discover ways to reduce operational costs and increase quality.
Data Lake Reference Architecture
- Build decoupled systems:
data -> store -> process -> store -> analyze -> insights
- Use the right tool for the job:
Data structures, latency, throughput, access patterns.
- Leverage managed and serverless services:
Scalable/elastic, available, reliable, secure, little or no administration.
- Use log-centric design patterns:
Immutable logs (Data Lake), materialized views
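The log-centric pattern in the last principle can be sketched in a few lines. This is an in-memory illustration only: `EventLog`, `revenue_per_customer`, and the event fields are hypothetical names, and in a real data lake the log would live in object storage instead of a Python list. The point is that the log is only ever appended to, and any view can be rebuilt from it at any time.

```python
from collections import defaultdict


class EventLog:
    """Append-only log: records are added, never updated or deleted."""

    def __init__(self):
        self._events = []

    def append(self, event):
        # Copy the event so later changes to the caller's dict do not alter the log.
        self._events.append(dict(event))

    def replay(self):
        # Iterate over a snapshot of the events in arrival order.
        return iter(tuple(self._events))


def revenue_per_customer(log):
    """Materialized view: derived state rebuilt by replaying the whole log."""
    view = defaultdict(float)
    for event in log.replay():
        if event["type"] == "order_placed":
            view[event["customer"]] += event["amount"]
        elif event["type"] == "order_refunded":
            view[event["customer"]] -= event["amount"]
    return dict(view)


log = EventLog()
log.append({"type": "order_placed", "customer": "acme", "amount": 120.0})
log.append({"type": "order_placed", "customer": "globex", "amount": 80.0})
log.append({"type": "order_refunded", "customer": "acme", "amount": 20.0})

view = revenue_per_customer(log)
print(view)
```

Note that the refund is recorded as a new event; the original order is never edited. If the view's logic changes, it is recomputed from the same log.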
Big Data ≠ Big Costs
- Enable AI/ML Applications
Queries to the Data Lake
Data Catalog Definition
- There are more people working with data than ever before.
- Businesses are concerned with: Data privacy, Data security
Data Lakes and Analytics on AWS
AWS Analytics services
|Category||Use cases||AWS Service|
|Analytics||Big data processing; Dashboards and visualizations||Amazon Elasticsearch Service|
|Data movement||Real-time data movement||Amazon Managed Streaming for Apache Kafka (MSK); Amazon Kinesis Data Streams; Amazon Kinesis Data Firehose; Amazon Kinesis Data Analytics; Amazon Kinesis Video Streams|
|Data lake||Object storage; Backup and archive||Amazon S3; AWS Lake Formation; Amazon S3 Glacier; AWS Data Exchange|
|Predictive analytics and machine learning||Frameworks and interfaces||AWS Deep Learning AMIs|