Data Lake in Amazon Web Services (AWS)
Success Story: Big Data & Analytics
RIMAC Seguros is the leading company in the Peruvian insurance market. It is part of Breca, a Peruvian business conglomerate with an international presence and more than one hundred years of history, founded by the Brescia Cafferata family.
Rimac ran a large, very heavy workload for its data and analysis processes, all of it on on-premises servers.
Each new process generated a new table in its database, which made data management even more complex and contributed to costly data processing times.
These factors, among others, left the business process unable to scale with the increasing volume of data and process complexity, and improving processing time became paramount.
To guarantee the operational continuity of the platform, we chose services that are mostly “serverless”, meaning they do not depend on server infrastructure for the execution of their processes. This not only simplifies platform administration but also delivers high availability in the services automatically, without additional effort.
The AWS resources provisioned to support the operation of Rimac’s Data Lake are flexible and can scale vertically or horizontally as needed.
To access the Data Lake services, users must authenticate with their AWS platform credentials through the Identity & Access Management (IAM) service. The same IAM service sets the level of access and privileges each user receives for each service, which ultimately determines who can access each service and what actions they can perform.
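As an illustration of how IAM can limit what a user may do, the sketch below builds a read-only policy document for a single S3 bucket. The bucket name, the function name, and the choice of actions are assumptions made for the example; they are not taken from Rimac's actual configuration.

```python
import json


def read_only_bucket_policy(bucket_name: str) -> dict:
    """Build an IAM policy document granting read-only access to one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


# Hypothetical bucket name, used only for illustration.
policy = read_only_bucket_policy("example-data-lake-bucket")
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

A user holding only this policy could list and read the files but not write or delete them, because IAM denies any action that is not explicitly allowed.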
Protection of Data at Rest
The data storage services of the AWS Cloud encrypt data at rest, so the data cannot be read without going through the corresponding decryption system.
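For S3, one way to obtain this behavior is a default encryption rule on the bucket. The sketch below only builds the rule in the shape that the S3 bucket encryption API expects; whether Rimac uses S3-managed keys or KMS keys is not stated here, so both options and the key identifier are assumptions.

```python
from typing import Optional


def default_encryption_config(kms_key_id: Optional[str] = None) -> dict:
    """Build a default server-side encryption configuration for an S3 bucket.

    Without a key id the rule uses S3-managed keys (AES256);
    with a key id it uses the given KMS key.
    """
    rule = {"SSEAlgorithm": "aws:kms" if kms_key_id else "AES256"}
    if kms_key_id:
        rule["KMSMasterKeyID"] = kms_key_id
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": rule}]}


# Hypothetical KMS key alias, used only for illustration.
s3_managed = default_encryption_config()
kms_managed = default_encryption_config("alias/example-data-lake-key")
print(s3_managed)
print(kms_managed)
```

With boto3 installed, such a dictionary could be passed as the `ServerSideEncryptionConfiguration` argument of the S3 client's `put_bucket_encryption` call.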
CloudWatch and CloudTrail continuously monitor the resources and services of the Rimac Data Lake, providing alerts and notifications based on the metrics of the services and infrastructure in use. All events generated by users are recorded and stored, which allows auditing and compliance processes to be carried out.
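To give an idea of what an audit over the recorded events can look like, the sketch below counts actions per user in a CloudTrail-style log. The sample records are invented for the example and contain only a few of the fields a real CloudTrail record carries; the user names are hypothetical.

```python
import json
from collections import Counter

# Invented sample in the CloudTrail log file layout: a "Records" array of events.
SAMPLE_LOG = json.dumps({
    "Records": [
        {"eventTime": "2020-01-01T10:00:00Z", "eventSource": "s3.amazonaws.com",
         "eventName": "GetObject",
         "userIdentity": {"type": "IAMUser", "userName": "analyst"}},
        {"eventTime": "2020-01-01T10:05:00Z", "eventSource": "s3.amazonaws.com",
         "eventName": "GetObject",
         "userIdentity": {"type": "IAMUser", "userName": "analyst"}},
        {"eventTime": "2020-01-01T11:00:00Z", "eventSource": "athena.amazonaws.com",
         "eventName": "StartQueryExecution",
         "userIdentity": {"type": "IAMUser", "userName": "engineer"}},
    ]
})


def actions_by_user(log_text: str) -> Counter:
    """Count (user, service, action) combinations found in a CloudTrail log."""
    counts = Counter()
    for record in json.loads(log_text).get("Records", []):
        identity = record.get("userIdentity", {})
        # Fall back to the ARN when the identity has no user name (e.g. roles).
        user = identity.get("userName") or identity.get("arn", "unknown")
        counts[(user, record["eventSource"], record["eventName"])] += 1
    return counts


audit = actions_by_user(SAMPLE_LOG)
for (user, service, action), total in sorted(audit.items()):
    print(user, service, action, total)
```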
AWS Direct Connect
AWS Storage Gateway
The primary data source for the Rimac Data Lake is the Amazon S3 service. This is where the flat files generated by the Data Extractors are stored.
Amazon Redshift is the repository for the data transformed in the Data Lake. It receives data from the ETL script transformations running in EMR and from Amazon S3.
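Loading from S3 into Redshift is typically done with a COPY statement. The sketch below builds one for a CSV file with a header row; the table name, S3 path, and IAM role ARN are hypothetical, and the file format is an assumption, since the actual layout of the flat files is not described here.

```python
import re

# Accepts "table" or "schema.table" made of plain SQL identifiers.
_TABLE_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement that loads a CSV file from S3."""
    if not _TABLE_PATTERN.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"unexpected S3 URI: {s3_uri!r}")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


# Hypothetical names, used only for illustration.
statement = build_copy_statement(
    "staging.policies",
    "s3://example-data-lake-bucket/transformed/policies.csv",
    "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

The statement would then be executed against the cluster with any SQL client; the validation here is deliberately minimal and is not a substitute for proper input handling.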
The Data Lake allows the use of EMR clusters to perform predictive analysis of the data stored in the Amazon S3 service.
Amazon Athena allows interactive queries in the Data Lake.
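As a sketch of what an interactive query involves, the snippet below prepares the request parameters for an Athena query. The database, table, column, and results location are hypothetical; only the parameter names follow the Athena StartQueryExecution API.

```python
def athena_query_request(sql: str, database: str, output_location: str) -> dict:
    """Build the parameters for an Athena StartQueryExecution request."""
    if not output_location.startswith("s3://"):
        raise ValueError("Athena writes query results to an S3 location")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical database, table, and results bucket, used only for illustration.
request = athena_query_request(
    sql="SELECT product, COUNT(*) AS total FROM policies GROUP BY product",
    database="example_data_lake",
    output_location="s3://example-athena-results/queries/",
)
print(request)

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("athena").start_query_execution(**request)
```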
With Amazon SageMaker, Rimac can create, train and run its Machine Learning models.
AWS Storage Gateway facilitates loading data into the Data Lake and offers a secure, easy-to-use data transfer mechanism: it provides a single point of access to the Data Lake structure in S3, through which files can be copied into the Data Lake.
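If the gateway is exposed to the extractors as a mounted file share, which is an assumption since the gateway type is not specified here, copying files into the Data Lake can be as simple as a file copy. The sketch below copies the CSV files of a local folder into a folder of the share; all paths and the function name are hypothetical.

```python
import shutil
from pathlib import Path


def copy_to_share(source_dir, share_root, prefix):
    """Copy every CSV file in source_dir into share_root/prefix.

    share_root is assumed to be the mount point of the gateway file share;
    prefix is the folder inside the share that maps to a location in S3.
    Returns the list of destination paths that were written.
    """
    target = Path(share_root) / prefix
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(source_dir).glob("*.csv")):
        destination = target / path.name
        shutil.copy2(path, destination)  # copy2 also preserves timestamps
        copied.append(destination)
    return copied


# Example call with hypothetical paths:
#   copy_to_share("/data/extracts", "/mnt/example-gateway-share", "raw/policies")
```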
The implemented solution moved the data analysis and processing workload from on-premises servers to the AWS cloud, delivering a world-class platform for processing and storing the data. This in turn optimizes and accelerates the process of obtaining results from the new platform and relieves the load on on-premises infrastructure and resources.