Overview of Data Lakes
Defining Big Data
Big Data is a broad term to describe data sets so large or complex that traditional tools and solutions are inadequate for processing and performing analysis.
The Characteristics of Big Data: The Four V’s
Volume
Solutions must work efficiently in distributed systems and must be easily expandable to accommodate increases in traffic.
- Extremely large volumes of data.
- Data is increasing at a rapid rate.
- Terabytes of data are growing into petabytes of data.
Variety
Solutions need to be sophisticated enough to manage all the different types of data, while providing accurate analysis.
- Diverse data sets, multiple sources
- Most sources are in the Cloud
- ‘Legacy’ systems are still present
- Various forms of data – structured, semi-structured, and unstructured.
Velocity
Solutions must be able to manage this speed efficiently, and processing systems must be able to return results in an acceptable time frame.
- Increased speed of users, devices, applications
- 75 billion connected devices by 2020
- MB/s is normal, GB/s is common
- One million transactions per second
- Processed in real time or in batches.
Veracity
The data must remain consolidated, clean, consistent and up-to-date to make the correct decisions.
- Data reliability.
- Inherent variability across all collected data.
- Data can be inconsistent and sometimes inaccurate.
The Evolution Of Data Analysis

Descriptive
What happened? Descriptive Analysis uses data aggregation and data mining techniques to provide insight into the past and answer this question.

Predictive
What is the probability that “X” happens? Predictive Analysis uses statistical modeling and forecasting technologies to understand what might happen in the future.

Prescriptive
What to do if “X” happens? This type of analysis uses optimization and simulation algorithms to assess possible results and answer “What should be done?”
Why Every Company Needs a Data Strategy
There is more data than people think:
- Data grows > 10x every 5 years.
- A data platform needs to live for 15 years.
There are more consumers accessing data:
- Data Scientists, Data Engineers, Data Product Managers, Data Visualizers, Business Users, Analysts, Applications, Developers.
And more requirements for making data available:
- Secure, Real time, Flexible, Scalable.
Source: IDC, Data Age 2025: The Evolution of Data to Life-Critical. Don’t Focus on Big Data, Focus on the Data That’s Big. April 2017

Data Strategy
"The world's most valuable resource is no longer oil, but data".
Source: The Economist, 2017
Data as a Strategic Asset
- Collect and retain all data
- Turn data into insights
- Make data available to intended users and customers
- Create new products and services
- Invest in data processing technologies.
Data as a Differentiator
Organizations that successfully generate business value from their data outperform their peers.
They were able to:
- Identify and act on opportunities
- Attract and retain customers
- Boost productivity
- Proactively maintain devices
- Make informed decisions
(Aberdeen: Angling for insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence)
What is Dark Data?
Source: Datumize / Factor Daily
In this age of technology-driven enlightenment, data is our competitive currency.
The raw information, buried in the mind-blowing volumes generated by transactional systems … are critical operational, customer and strategic insights that, once illuminated by analysis, can validate or clarify assumptions, inform decision-making, and help map out new paths to the future.
– Tracie Kambies, Nitin Mittal, Paul Roma, Sandeep Kumar Sharma
Tech Trends 2017, from https://www2.deloitte.com/content/dam/Deloitte/au/Documents/technology/deloitte-au-technology-dark-analytics-061017.pdf
Regulatory Risk:
Leak or loss of sensitive information or Personally Identifiable Information (PII)
Intellectual Property Risk:
Failure to protect Intellectual Property
Opportunity Risk:
Missing opportunities for improvement
Journey to a Modern Data Architecture
Data Lakes on AWS


Data warehouse modernization
Real-time analytics with streaming data


Data Governance
Machine Learning
At Morris & Opazo we help you to innovate and gain value from data that is:
- Growing exponentially
- From new sources
- Increasingly diverse
- Used by many people
- Analyzed by many applications
Our clients usually need technical and strategic help migrating on-premises workloads to the AWS Cloud. They:
- Are overwhelmed with the exponential growth of data.
- Need guidance and roadmaps for storing and managing data.
- Need advice and solutions to help extract and visualize data insights.
To help our clients succeed, Morris & Opazo:
- Engages them with a top-down approach
- Becomes a strategic ally
- Focuses on creating solutions
Challenges of on-premises data warehouses
- Cost of scalability
- Long implementation cycles and high failure rates
- Failure to adapt to new technologies
- Proprietary data formats
- Governance and control issues
- Cost of maintenance
Top Areas with Negative Impact on Data Analytics Strategies
Top Goals for Using a Data Lake
Source: Enterprise Strategy Group
What is a Data Lake?
A centralized repository
that allows you to store:
– Any Data
– At any Scale
– At a Low Cost

What is NOT a Data Lake?
- It is not a database (OLTP).
- It is not a data warehouse (OLAP).
- It is not a product.
- It is not the property of any single vendor.
- It is not Hadoop.
- It does not replace other data stores.
Data Lakes compared to Data Warehouses
Characteristics | Data Warehouse | Data Lake |
---|---|---|
Data | Relational from transactional systems, operational databases, and line of business applications | Non-relational and relational from IoT devices, web sites, mobile apps, social media, and corporate applications |
Schema | Designed prior to the DW implementation (schema-on-write) | Written at the time of analysis (schema-on-read) |
Price/Performance | Fastest query results using higher cost storage | Query results getting faster using low-cost storage |
Data Quality | Highly curated data that serves as the central version of the truth | Any data that may or may not be curated (i.e., raw data) |
Users | Business analysts | Data scientists, Data developers, and Business analysts (using curated data) |
Analytics | Batch reporting, BI and visualizations | Machine Learning, Predictive analytics, data discovery and profiling |
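The schema-on-read distinction in the table above can be sketched in a few lines: raw records land in the lake exactly as produced, and a schema is applied only when a consumer reads them. This is a minimal illustration; the field names and records are hypothetical:

```python
import json

# Raw events land in the lake exactly as produced -- no upfront schema.
raw_events = [
    '{"user": "a1", "amount": "19.99", "ts": "2021-03-01T10:00:00"}',
    '{"user": "b2", "amount": "5.00", "extra_field": "ignored"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: pick and cast only the fields needed."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({name: cast(record.get(name)) for name, cast in schema.items()})
    return rows

# The analyst defines the schema when querying, not when ingesting.
schema = {"user": str, "amount": float}
rows = read_with_schema(raw_events, schema)
print(rows)  # [{'user': 'a1', 'amount': 19.99}, {'user': 'b2', 'amount': 5.0}]
```

A data warehouse would instead reject or transform `extra_field` at load time (schema-on-write); here it is simply ignored until a schema that wants it comes along.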
Data Temperature

Data Access Characteristics
Hot | Warm | Cold | |
---|---|---|---|
Volume | MB – GB | GB – TB | PB |
Item Size | B – KB | KB – MB | KB – TB |
Latency | ms | sec | min, hrs |
Durability | Low – High | High | Very High |
Request rate | Very High | High | Low |
Cost / GB | $$-$ | $-¢¢ | ¢ |
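These temperature tiers map naturally onto object-storage classes. As a hedged sketch (the bucket name and prefix are hypothetical), an S3 lifecycle configuration that demotes aging data from hot to warm to cold storage might look like this:

```python
# Lifecycle rules in the shape accepted by boto3's
# put_bucket_lifecycle_configuration; shown as a plain dict, not executed.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "demote-aging-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                # Warm: infrequently accessed after 30 days.
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                # Cold: archived after a year.
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# With AWS credentials configured, this would be applied as:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake", LifecycleConfiguration=lifecycle_configuration)
```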
The Data Lake Approach

- Devices
- Social
- Web
- Applications
- Video
- Sensor
- Database
- Clickstream



- Enterprise Search
- Interactive Fast Queries
- Reports / Dashboard
- Machine Learning
- Ad-hoc Analysis
Challenges in the Management of Data
Customers are challenged to:
- Collect a variety of data types from numerous sources, accumulating at varying velocities.
- Store massive amounts of data without running out of space
- Cleanse data and improve its quality before it is analyzed.
Can they automate these steps?
Analytics Pipeline
Basic Principle of Data Lake
Separating your Storage and Compute allows you to scale each component as required

Concept of a Data Lake
- All data in one place, a single source of truth.
- Stores in native format.
- Handles structured and unstructured data.
- Supports fast ingestion and consumption.
- Schema on read.
- Designed for low-cost storage.
- Supports protection and security rules.
- Cloud Object Storage.
- Store everything now so that you can extract insights later
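One common way these bullets play out in practice is a partitioned key layout in object storage: each source writes in its native format under a date-partitioned prefix, so everything lives in one place but stays discoverable. A sketch, where the prefix scheme is just one conventional choice:

```python
from datetime import date

def lake_key(source, fmt, event_date, filename):
    """Build an S3-style object key: source and date partitions, native format."""
    return (f"raw/source={source}/"
            f"year={event_date.year}/month={event_date.month:02d}/"
            f"day={event_date.day:02d}/{filename}.{fmt}")

print(lake_key("clickstream", "json", date(2021, 3, 5), "part-0001"))
# raw/source=clickstream/year=2021/month=03/day=05/part-0001.json
```

Because the partition values are encoded in the key, query engines can later prune entire prefixes without opening a single file.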
Key Benefits of Data Lake
Performance
Easy Data Collection
High Availability and Durability
Cost efficiency
Flexible processing
Security and Compliance
Scalability
Strong consistency
The value of a Data Lake
The ability to harness more data, from more sources, in less time, and empowering users to collaborate and analyze data in different ways leads to better, faster decision making. Examples where Data Lakes have added value include:
Improved customer interactions
A Data Lake can combine customer data from a CRM platform with social media analytics, a marketing platform that includes buying history, and incident tickets to empower the business to understand the most profitable customer cohort, the cause of customer churn, and the promotions or rewards that will increase loyalty.
Improve R&D innovation choices
A data lake can help your R&D teams test their hypothesis, refine assumptions, and assess results—such as choosing the right materials in your product design resulting in faster performance, doing genomic research leading to more effective medication, or understanding the willingness of customers to pay for different attributes.
Increase operational efficiencies
The Internet of Things (IoT) introduces more ways to collect data on processes like manufacturing, with real-time data coming from internet connected devices. A data lake makes it easy to store, and run analytics on machine-generated IoT data to discover ways to reduce operational costs, and increase quality.
Consumption Pattern

Data Lake Reference Architecture

Architecture Principles
- Build decoupled systems:
data -> store -> process -> store -> analyze -> insights
- Use the right tool for the job:
data structures, latency, throughput, access patterns
- Leverage managed and serverless services:
scalable/elastic, available, reliable, secure, little or no management
- Use log-centric design patterns:
immutable logs (Data Lake), materialized views
- Be cost-effective:
Big Data ≠ Big Costs
- Enable AI/ML applications
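The decoupled data -> store -> process -> store -> analyze flow can be sketched with each stage as an independent step that communicates only through stored, immutable data. This toy uses an in-memory dict as a stand-in for an object store such as Amazon S3:

```python
# Each stage reads from and writes to storage only -- it never calls the
# next stage directly -- so any stage can be scaled or replaced on its own.
storage = {}  # stand-in for an object store

def ingest(events):                       # data -> store
    storage["raw"] = tuple(events)        # immutable log: written once, never mutated

def process():                            # store -> process -> store
    storage["clean"] = tuple(e for e in storage["raw"] if e["amount"] > 0)

def analyze():                            # store -> analyze -> insights
    clean = storage["clean"]
    return sum(e["amount"] for e in clean) / len(clean)

ingest([{"amount": 10.0}, {"amount": -1.0}, {"amount": 20.0}])
process()
print(analyze())  # 15.0
```

Keeping the raw tuple untouched is the log-centric pattern from the list above: `clean` is just a materialized view that can be rebuilt from the raw log at any time.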
Queries to the Data Lake
Object Storage

Data Catalog Definition

Query Engine
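On AWS the query engine is typically Amazon Athena reading files in place from S3, with table definitions coming from the data catalog. A hedged sketch of the DDL and query; the bucket, table, and column names are hypothetical:

```python
# External table: the engine reads objects in place; no data is loaded.
create_table = """
CREATE EXTERNAL TABLE IF NOT EXISTS clicks (
    user_id string,
    url     string,
    ts      timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-data-lake/raw/clicks/'
"""

query = "SELECT url, count(*) AS hits FROM clicks GROUP BY url ORDER BY hits DESC"

# With AWS credentials configured, both statements would be submitted via boto3:
# import boto3
# athena = boto3.client("athena")
# athena.start_query_execution(
#     QueryString=query,
#     ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"})
```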

Metadata Management
Metadata Classification
Lineage
Discovery
Searching
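A data catalog, in miniature, is a registry that maps dataset names to location, schema, classification, and lineage, so engines and users can discover data without scanning it. A toy sketch of the four capabilities listed above (on AWS, AWS Glue or Lake Formation plays this role):

```python
# Minimal in-memory catalog: each entry records where a dataset lives,
# its schema, a classification, and lineage (which datasets it came from).
catalog = {}

def register(name, location, schema, classification, lineage=()):
    catalog[name] = {"location": location, "schema": schema,
                     "classification": classification, "lineage": list(lineage)}

def search(term):
    """Discovery/searching: find datasets by name or classification."""
    return [n for n, meta in catalog.items()
            if term in n or term == meta["classification"]]

register("raw_clicks", "s3://lake/raw/clicks/",
         {"user": "string", "url": "string"}, classification="raw")
register("daily_clicks", "s3://lake/curated/daily_clicks/",
         {"day": "date", "clicks": "bigint"},
         classification="curated", lineage=["raw_clicks"])

print(search("curated"))                   # ['daily_clicks']
print(catalog["daily_clicks"]["lineage"])  # ['raw_clicks']
```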

Data Governance
- There are more people working with data than ever before.
- Businesses are concerned with data privacy and data security.

Data Lakes and Analytics on AWS

AWS Analytics services
Category | Use cases | AWS Services |
---|---|---|
Analytics | Interactive analytics; big data processing; data warehousing; real-time analytics; operational analytics; dashboards and visualizations | Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon Elasticsearch Service, Amazon QuickSight |
Data movement | Real-time data movement | Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics, Amazon Kinesis Video Streams, AWS Glue |
Data lake | Object storage; backup and archive; data catalog; third-party data | Amazon S3, AWS Lake Formation, Amazon S3 Glacier, AWS Backup, AWS Glue, AWS Data Exchange |
Predictive analytics and machine learning | Frameworks and interfaces; platform services | AWS Deep Learning AMIs, Amazon SageMaker |
Best Practices for Cloud Data Management
Catalog your data, prevent the data lake from becoming a swamp
Leverage AI/Machine Learning to enhance productivity of all users of the platform
Curate and cleanse data for consumption to increase trust
Integrate data pipeline development into your CI/CD/DevOps flow
Empower collaboration so the data lake is everyone's lake
Ensure you apply data governance and security policies to protect sensitive data
Ecosystem of Services for Big Data

Write to us!