AWS Lake Formation
Building a Data Lake is a task that requires a lot of care. Its level of complexity depends on several factors, including:
- diversity in the types and origins of the data
- storage required
- the level of security demanded
However these factors (and many others) combine, every Data Lake implementation scenario shares one common element: a global, general knowledge of the data lifecycle.
By definition, a Data Lake is an architectural approach that allows massive amounts of data to be stored in a central location, so that the data is readily available to be categorized, processed, analyzed and consumed by different teams within an organization.
A typical Data Lake build consists of:
- Storage configuration
- Data movement
- Data cleaning, preparation and cataloging
- Configuration and enforcement of security and compliance policies
- Making the data available (for analytical or visualization applications, among others)
As expected, each of these points represents a more detailed procedure that must be handled with greater rigor. Again, the possible combinations of these steps can make a person (or company) who wants to start ‘playing’ with a Data Lake feel overwhelmed.
Curiously, AWS offers such a huge range of services for working with data that the choice itself can sometimes ‘force’ a person to simply give up on their goal of working with a Data Lake.
Before reviewing the AWS Lake Formation proposal, we could consider some of the steps required to build a Data Lake in AWS:
- Identify data sources (such as RDBMS, files, streams, transactions)
- Create required S3 buckets to store this data, with their corresponding policies
- Create the ETLs that transform this data and move it from its origins to the S3 buckets. This involves managing the audit policies and permissions of the ETLs, and creating a data cleansing strategy.
- Allow analytics services to access this data in S3.
In the process above, each point carries a risk of human error. There are two ‘alternative routes’: execute the process manually with total precision, building a checklist and identifying each new alternative at each point; or rely on a managed service that handles as many of the tasks of creating the Data Lake as possible. This is where AWS Lake Formation comes in.
AWS Lake Formation is an attractive option for those who do not have the technical knowledge or enough time to take on a project that involves a Data Lake.
As can be seen in the previous image, AWS Lake Formation covers the 4 basic stages of a Data Lake, allowing as much human interaction in each of them as the user wants. Simply put, anyone who wants to ‘double click’ on any of these 4 phases can do so.
In AWS Lake Formation, S3 manages the storage layer. Users keep access to their data without any type of ‘blocking’ by this service. Loads can be one-time or incremental. It is only a matter of indicating the origin and destination of the data, and how often the load will run. This configuration is done through ‘blueprints’.
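As an illustration of the storage step, the sketch below registers an S3 path as a data lake location. It is a minimal example under assumptions: the bucket name and prefix are hypothetical, the helper `register_lake_location` is not part of any AWS library, and the client passed in is expected to be a boto3 `lakeformation` client (whose `register_resource` call the helper wraps).

```python
def register_lake_location(client, bucket, prefix=""):
    """Register an S3 path with Lake Formation and return the request sent.

    `client` is assumed to be boto3.client("lakeformation") or any object
    exposing a compatible register_resource(**kwargs) method.
    """
    # S3 ARNs have no region or account part: arn:aws:s3:::bucket/prefix
    resource_arn = f"arn:aws:s3:::{bucket}"
    cleaned = prefix.strip("/")
    if cleaned:
        resource_arn += "/" + cleaned

    request = {
        "ResourceArn": resource_arn,
        # Assumption: let Lake Formation use its service-linked role
        # instead of a custom IAM role.
        "UseServiceLinkedRole": True,
    }
    client.register_resource(**request)
    return request


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# register_lake_location(boto3.client("lakeformation"),
#                        "example-data-lake", "raw/")
```

Passing the client in as a parameter keeps the helper easy to try against a stub before pointing it at a real account.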
Lake Formation relies on AWS Glue ML Transforms to create Machine Learning transformations that can be used, for example, in an ETL job. One of these transformations is FindMatches, which helps identify duplicate records.
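To make the FindMatches idea more concrete, here is a rough sketch of the request that could be passed to the AWS Glue `create_ml_transform` API through boto3. The transform name, database, table, key column and role ARN are all hypothetical, the helper `find_matches_request` is illustrative only, and the trade-off value is an arbitrary choice rather than a recommendation.

```python
def find_matches_request(name, database, table, primary_key, role_arn):
    """Build keyword arguments for glue.create_ml_transform (FindMatches)."""
    return {
        "Name": name,
        # Catalog table whose records will be checked for duplicates.
        "InputRecordTables": [
            {"DatabaseName": database, "TableName": table}
        ],
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                # Column that uniquely identifies each row.
                "PrimaryKeyColumnName": primary_key,
                # Value between 0 and 1; higher values lean toward
                # precision (fewer false matches). 0.9 is an assumption.
                "PrecisionRecallTradeoff": 0.9,
                "EnforceProvidedLabels": False,
            },
        },
        # IAM role that Glue assumes when running the transform.
        "Role": role_arn,
    }


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# request = find_matches_request(
#     "dedupe-customers", "sales_db", "customers", "customer_id",
#     "arn:aws:iam::123456789012:role/ExampleGlueRole")
# boto3.client("glue").create_ml_transform(**request)
```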
To catalog the data, AWS Lake Formation uses AWS Glue crawlers to obtain the metadata and build a catalog from it. This metadata can then be used to label the information (for example, to mark information as sensitive).
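The cataloging step can be sketched as a Glue crawler definition. The example below only builds the arguments for boto3's Glue `create_crawler` call; the crawler name, role ARN, database, S3 path and schedule are hypothetical values chosen for illustration, and `crawler_request` is not an AWS-provided helper.

```python
def crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for glue.create_crawler over an S3 path."""
    request = {
        "Name": name,
        "Role": role_arn,
        # Data Catalog database where the discovered tables are written.
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Glue expects a cron expression such as "cron(0 2 * * ? *)".
        request["Schedule"] = schedule
    return request


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_request(
#     "raw-zone-crawler",
#     "arn:aws:iam::123456789012:role/ExampleGlueRole",
#     "sales_db",
#     "s3://example-data-lake/raw/",
#     schedule="cron(0 2 * * ? *)"))
# glue.start_crawler(Name="raw-zone-crawler")
```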
The administrator can grant whatever access they consider appropriate, and AWS Lake Formation takes care of allowing or blocking access to the data (or to the services that use it) for other users.
These permissions can be set at the table and column level.
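A table-level or column-level grant could look roughly like the sketch below, which wraps the `grant_permissions` call of the boto3 `lakeformation` client. The principal ARN, database, table and column names are hypothetical, and `grant_select` is an illustrative helper rather than an AWS API.

```python
def grant_select(client, principal_arn, database, table, columns=None):
    """Grant SELECT on a whole table, or only on the listed columns.

    `client` is assumed to be boto3.client("lakeformation") or any object
    exposing a compatible grant_permissions(**kwargs) method.
    """
    if columns:
        # Column-level grant: only the named columns become readable.
        resource = {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        }
    else:
        # Table-level grant: every column of the table is readable.
        resource = {"Table": {"DatabaseName": database, "Name": table}}

    request = {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": resource,
        "Permissions": ["SELECT"],
    }
    client.grant_permissions(**request)
    return request


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# grant_select(boto3.client("lakeformation"),
#              "arn:aws:iam::123456789012:user/example-analyst",
#              "sales_db", "customers", columns=["customer_id", "country"])
```

Leaving sensitive columns out of `columns` is one way an administrator could keep them hidden from a given principal while still sharing the rest of the table.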
AWS Lake Formation allows auditing from a single point, without having to go through multiple services. This monitoring can be done in real time.
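One possible way to look at this audit trail programmatically is through AWS CloudTrail, which records Lake Formation API activity. The sketch below assumes a boto3 `cloudtrail` client and uses its `lookup_events` call filtered by event source; the helper name `recent_lake_formation_events` and the result limit are illustrative choices, not something prescribed by the service.

```python
def recent_lake_formation_events(client, max_results=20):
    """Return (time, user, event name) tuples for recent Lake Formation calls.

    `client` is assumed to be boto3.client("cloudtrail") or any object
    exposing a compatible lookup_events(**kwargs) method.
    """
    response = client.lookup_events(
        LookupAttributes=[
            {
                "AttributeKey": "EventSource",
                "AttributeValue": "lakeformation.amazonaws.com",
            }
        ],
        MaxResults=max_results,
    )
    return [
        # Username can be absent for some events, hence .get().
        (event.get("EventTime"), event.get("Username"), event.get("EventName"))
        for event in response.get("Events", [])
    ]


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# for when, who, what in recent_lake_formation_events(
#         boto3.client("cloudtrail")):
#     print(when, who, what)
```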
A strong point of this new service is its cost: AWS Lake Formation itself is free. You are charged only for the underlying services that are invoked from it.
After this initial review, we could say that AWS Lake Formation allows users (both beginners and experts) to get started almost immediately with a basic Data Lake. Beyond that, it ‘abstracts’ the complex technical details of implementing a basic Data Lake, so that effort can go into more specific business tasks. However, it is worth highlighting the word “basic”: as the complexity of the Big Data solution grows, it may become necessary to review the requirements completely, and at that point the scope goes beyond AWS Lake Formation.