
Overview of Data Lakes


Defining Big Data

Big Data is a broad term for data sets so large or complex that traditional tools and solutions are inadequate for processing and analysis.

The Characteristics of Big Data: The Four V's

Volume

Solutions must work efficiently in distributed systems and must be easily expandable to accommodate increases in traffic.

  • Extremely large volumes of data.
  • Data is increasing at a rapid rate.
  • From terabytes of data to petabytes of data

Variety

Solutions need to be sophisticated enough to manage all the different types of data, while providing accurate analysis.

  • Diverse data sets, multiple sources
  • Most sources are in the Cloud
  • ‘Legacy’ systems are still present
  • Various forms of data – structured, semi-structured, and unstructured.

Velocity

Solutions must be able to manage this speed efficiently, and processing systems must be able to return results in an acceptable time frame.

  • Increased speed of users, devices, applications
  • An estimated 75 billion connected devices by 2025
  • MB/s is normal, GB/s is common
  • One million transactions per second
  • Processing in real time and in batches

Veracity

The data must remain consolidated, clean, consistent and up-to-date to make the correct decisions.

  • Data reliability is essential.
  • All collected data carries inherent discrepancies.
  • Data can be inconsistent, sometimes inaccurate, and highly variable.

The Evolution Of Data Analysis

Descriptive

What happened? Descriptive analysis uses data aggregation and data mining techniques to provide insight into the past.

Predictive

What is the probability that “X” happens? Predictive analysis uses statistical modeling and forecasting techniques to understand what might happen in the future.

Prescriptive

What to do if “X” happens? This type of analysis uses optimization and simulation algorithms to assess possible results and answer “What should be done?”

Why Every Company Needs a Data Strategy

There is more data than people think:

  • Data grows more than 10x every 5 years.
  • A data platform needs to live for 15 years.

There are more consumers accessing data:

  • Data scientists, data engineers, data product managers, data visualizers, business users, analysts, applications, and developers.

And more requirements for making data available:

  • Secure, Real time, Flexible, Scalable.

Source: IDC, Data Age 2025: The Evolution of Data to Life-Critical. Don’t Focus on Big Data, Focus on the Data That’s Big. April 2017

Data Strategy

"The world's most valuable resource is no longer oil, but data".

Source: The Economist, 2017

Data as a Strategic Asset

  • Collect and retain all data
  • Turn data into insights
  • Make data available to intended users and customers
  • Create new products and services
  • Invest in data processing technologies.

Data as a Differentiator

Organizations that successfully generate business value from their data outperform their peers.

They were able to:

  • Identify and act on opportunities
  • Attract and retain customers
  • Boost productivity
  • Proactively maintain devices
  • Make informed decisions

(Aberdeen: Angling for insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence)

What is Dark Data?


Source: Datumize / Factor Daily

In this age of technology-driven enlightenment, data is our competitive currency.

The raw information, buried in the mind-blowing volumes generated by transactional systems … are critical operational, customer and strategic insights that, once illuminated by analysis, can validate or clarify assumptions, inform decision-making, and help map out new paths to the future.

– Tracie Kambies, Nitin Mittal, Paul Roma, Sandeep Kumar Sharma
Tech Trends 2017, from https://www2.deloitte.com/content/dam/Deloitte/au/Documents/technology/deloitte-au-technology-dark-analytics-061017.pdf

  • Regulatory risk: leak or loss of sensitive information or personally identifiable information (PII)
  • Intellectual property risk: failure to protect intellectual property
  • Opportunity risk: missing opportunities for improvement

Journey to a Modern Data Architecture

  • Data lakes on AWS
  • Data warehouse modernization
  • Real-time analytics with streaming data
  • Data governance
  • Machine learning

At Morris & Opazo we help you to innovate and gain value from data that is:

  • Growing exponentially
  • Coming from new sources
  • Increasingly diverse
  • Used by many people
  • Analyzed by many applications

Our clients usually need technical and strategic help migrating on-premises workloads to the AWS Cloud. They:

  • Are overwhelmed by the exponential growth of data
  • Need guidance and roadmaps for storing and managing data
  • Need advice and solutions to help extract and visualize data insights

To help our clients succeed, Morris & Opazo:

  • Engages clients with a top-down approach
  • Becomes a strategic ally
  • Focuses on creating solutions

Challenges of on-premises data warehouses

  • Cost of scalability
  • Long implementation cycles and high failure rates
  • Failure to adapt to new technologies
  • Proprietary data formats
  • Governance and control issues
  • Cost of maintenance

Top Areas with Negative Impact on Data Analytics Strategies

  • Cost: 37%
  • Too many disparate data sources: 35%
  • Lack of skills necessary to properly manage data sets and derive value from them: 35%
  • Limited analysts and/or line of business: 33%
  • Meeting security, governance, and compliance requirements: 32%

Top Goals for Using a Data Lake

  • Improve scalability: 39%
  • Merge structured and unstructured data: 32%
  • Improve application development times: 28%
  • Improve data sharing and collaboration: 27%
  • Analyze data in place: 24%

Source: Enterprise Strategy Group

What is a Data Lake?

A centralized repository that allows you to store:

– Any data
– At any scale
– At a low cost

What is NOT a Data Lake?

  • It is not a database (OLTP).
  • It is not a data warehouse (OLAP).
  • It is not a product.
  • It is not the property of any single team or vendor.
  • It is not Hadoop.
  • It does not replace other data stores.

Data Lakes compared to Data Warehouses

Characteristic | Data Warehouse | Data Lake
Data | Relational, from transactional systems, operational databases, and line-of-business applications | Non-relational and relational, from IoT devices, web sites, mobile apps, social media, and corporate applications
Schema | Designed prior to the DW implementation (schema-on-write) | Written at the time of analysis (schema-on-read)
Price/Performance | Fastest query results using higher-cost storage | Query results getting faster using low-cost storage
Data Quality | Highly curated data that serves as the central version of the truth | Any data, which may or may not be curated (i.e., raw data)
Users | Business analysts | Data scientists, data developers, and business analysts (using curated data)
Analytics | Batch reporting, BI, and visualizations | Machine learning, predictive analytics, data discovery and profiling
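The schema-on-write versus schema-on-read distinction in the table above can be sketched in a few lines of Python. This is an illustrative toy, not any specific query engine: the raw records and field names are invented for the example.

```python
import json

# Raw events land in the "lake" exactly as produced -- no schema enforced on write.
raw_lake = [
    '{"user": "ana", "amount": 120.5, "channel": "web"}',
    '{"user": "luis", "amount": 80}',         # missing "channel"
    '{"user": "sofia", "device": "mobile"}',  # missing "amount"
]

def read_with_schema(lines, fields):
    """Schema-on-read: the schema is applied at query time.

    Missing fields become None instead of failing the load."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

# Two consumers project different schemas over the same raw data.
billing = list(read_with_schema(raw_lake, ["user", "amount"]))
devices = list(read_with_schema(raw_lake, ["user", "device"]))

print(billing[1])  # {'user': 'luis', 'amount': 80}
print(devices[2])  # {'user': 'sofia', 'device': 'mobile'}
```

A schema-on-write system would instead reject the second and third records at load time; here each consumer decides at read time which fields matter.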

Data Temperature

Data Access Characteristics

Characteristic | Hot | Warm | Cold
Volume | MB–GB | GB–TB | PB
Item size | B–KB | KB–MB | KB–TB
Latency | ms | sec | min, hrs
Durability | Low–High | High | Very High
Request rate | Very High | High | Low
Cost/GB | $$–$ | $–¢ | ¢
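A tier choice along these lines can be sketched as a small decision function. The numeric thresholds below are illustrative assumptions for the example, not AWS-defined limits:

```python
def suggest_tier(requests_per_sec, latency_budget_s):
    """Suggest a storage temperature from access characteristics.

    Thresholds are assumed for illustration: "hot" data is requested
    very frequently and needs millisecond latency; "cold" data tolerates
    minutes-to-hours retrieval at the lowest cost per GB."""
    if requests_per_sec >= 1000 and latency_budget_s <= 0.1:
        return "hot"    # e.g. cache or NoSQL store
    if requests_per_sec >= 1 and latency_budget_s <= 60:
        return "warm"   # e.g. standard object storage
    return "cold"       # e.g. archive storage

print(suggest_tier(5000, 0.005))   # hot
print(suggest_tier(10, 1))         # warm
print(suggest_tier(0.01, 3600))    # cold
```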

The Data Lake Approach


Challenges in the Management of Data

Customers are challenged to:

  • Collect data of many types from numerous sources, accumulating at varying velocities
  • Store massive amounts of data without running out of space
  • Cleanse and improve the quality of the data to be analyzed

Can they automate these steps?

Analytics Pipeline

Basic Principle of Data Lake

Separating your Storage and Compute allows you to scale each component as required

Concept of a Data Lake

  • All data in one place, a single source of truth.
  • Stores in native format.
  • Handles structured and unstructured data.
  • Supports fast ingestion and consumption.
  • Schema on read.
  • Designed for low-cost storage.
  • Supports protection and security rules.
  • Cloud Object Storage.
  • Store everything now so that you can extract insights later

Key Benefits of Data Lake

  • Performance
  • Easy data collection
  • High availability and durability
  • Cost efficiency
  • Flexible processing
  • Security and compliance
  • Scalability
  • Strong consistency

The value of a Data Lake

The ability to harness more data, from more sources, in less time, and empowering users to collaborate and analyze data in different ways leads to better, faster decision making. Examples where Data Lakes have added value include:

Improved customer interactions

A Data Lake can combine customer data from a CRM platform with social media analytics, a marketing platform that includes buying history, and incident tickets to empower the business to understand the most profitable customer cohort, the cause of customer churn, and the promotions or rewards that will increase loyalty.

Improve R&D innovation choices

A data lake can help your R&D teams test their hypotheses, refine assumptions, and assess results—such as choosing the right materials in your product design resulting in faster performance, doing genomic research leading to more effective medication, or understanding the willingness of customers to pay for different attributes.

Increase operational efficiencies

The Internet of Things (IoT) introduces more ways to collect data on processes like manufacturing, with real-time data coming from internet-connected devices. A data lake makes it easy to store machine-generated IoT data and run analytics on it to discover ways to reduce operational costs and increase quality.

Consumption Pattern

Data Lake Reference Architecture

Architecture Principles

  • Build decoupled systems:
    data -> store -> process -> store -> analyze -> insights
  • Use the right tool for the job:
    Data structures, latency, throughput, access patterns.
  • Leverage managed and serverless services:
    Scalable/elastic, available, reliable, secure, little or no management.
  • Use log-centric design patterns:
    Immutable logs (Data Lake), materialized views.
  • Be cost-effective:
    Big Data ≠ Big Costs.
  • Enable AI/ML applications.
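The decoupled, log-centric pattern (data -> store -> process -> store -> analyze -> insights) can be sketched with plain Python. The dicts below stand in for object storage zones, and the store names and record fields are invented for the example:

```python
raw_store = {}      # immutable raw zone: written once, never updated
curated_store = {}  # materialized view derived from the raw zone

def ingest(key, records):
    """data -> store: land raw data untouched (the immutable log)."""
    assert key not in raw_store, "raw objects are immutable"
    raw_store[key] = records

def process(key):
    """process -> store: derive a curated view; raw data is never mutated."""
    curated_store[key] = [r for r in raw_store[key] if r["amount"] > 0]

def analyze(key):
    """analyze -> insights: aggregate the curated view."""
    return sum(r["amount"] for r in curated_store[key])

ingest("sales/2024-01.json", [{"amount": 10}, {"amount": -3}, {"amount": 5}])
process("sales/2024-01.json")
print(analyze("sales/2024-01.json"))  # 15
```

Because each stage reads from and writes to a store rather than calling the next stage directly, storage and compute can scale independently, and the curated view can always be rebuilt from the immutable raw zone.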

Queries to the Data Lake

  • Object storage
  • Data catalog definition
  • Query engine

Metadata Management

  • Metadata classification
  • Lineage
  • Discovery
  • Searching
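A toy metadata catalog makes these ideas concrete: classification via tags, lineage via a derived-from chain, and discovery via tag search. The dataset names, column names, and tags below are invented for the example and do not reflect any specific catalog product:

```python
catalog = {}

def register(name, columns, tags, derived_from=None):
    """Record a dataset's schema, classification tags, and upstream source."""
    catalog[name] = {"columns": columns, "tags": set(tags),
                     "derived_from": derived_from}

def lineage(name):
    """Walk the derived_from chain back to the raw source."""
    chain = [name]
    while catalog[chain[-1]]["derived_from"]:
        chain.append(catalog[chain[-1]]["derived_from"])
    return chain

def search(tag):
    """Discovery: find all datasets carrying a classification tag."""
    return sorted(n for n, meta in catalog.items() if tag in meta["tags"])

register("raw/orders", ["id", "email", "total"], ["pii", "raw"])
register("curated/orders", ["id", "total"], ["curated"],
         derived_from="raw/orders")

print(lineage("curated/orders"))  # ['curated/orders', 'raw/orders']
print(search("pii"))              # ['raw/orders']
```

In practice this role is filled by a managed catalog service (on AWS, the AWS Glue Data Catalog), but the core operations are the same: register, classify, trace, and search.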

Data Governance

  • There are more people working with data than ever before.
  • Businesses are concerned with: Data privacy, Data security

Data Lakes and Analytics on AWS

AWS Analytics services

Category | Use cases | AWS services
Analytics | Interactive analytics; big data processing; data warehousing; real-time analytics; operational analytics; dashboards and visualizations | Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon Elasticsearch Service, Amazon QuickSight
Data movement | Real-time data movement | Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics, Amazon Kinesis Video Streams, AWS Glue
Data lake | Object storage; backup and archive; data catalog; third-party data | Amazon S3, AWS Lake Formation, Amazon S3 Glacier, AWS Backup, AWS Glue, AWS Data Exchange
Predictive analytics and machine learning | Frameworks and interfaces; platform services | AWS Deep Learning AMIs, Amazon SageMaker

Best Practices for Cloud Data Management

  • Catalog your data; prevent the data lake from becoming a swamp.
  • Leverage AI/machine learning to enhance the productivity of all users of the platform.
  • Curate and cleanse data for consumption to increase trust.
  • Integrate data pipeline development into your CI/CD/DevOps flow.
  • Empower collaboration so the data lake is everyone's lake.
  • Apply data governance and security policies to protect sensitive data.

Ecosystem of Services for Big Data

Write us!