Using Bootstrap Actions in EMR – Amazon (AWS)

Within the tool set AWS offers for Big Data, EMR is one of the most versatile and powerful services, giving the user an almost endless range of hardware and software options to face, and succeed at, virtually any challenge related to processing large volumes of data.

However, a user working with the EMR console for the very first time will find that the options are packaged into a generous but limited list of software and hardware settings, and may reach the wrong conclusion that EMR doesn't have what it takes for the task.

 

Web Console: creating an EMR cluster

In this article we will focus on how to include software elements that are additional to, or different from, those offered in the Web console package; the hardware settings and other options are left for another time.

 

Master vs. Slave

One of the first challenges in understanding how to really incorporate software into EMR is to be clear about the hardware and software infrastructure that supports it. In very simple terms, there are only two categories of node: Master and Slave (Master + Slaves = Cluster).

Map-Reduce is a programming model built on "divide and conquer": it establishes workloads for the multiple cluster nodes, distributing and organizing the tasks of each node so that one big job becomes many small ones, usually easier and faster to complete than attempting the work as a single unit.
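To make the idea concrete, here is a minimal sketch of the model using plain shell tools (this is only an illustration, not an EMR command; input.txt is any text file you have at hand): a classic word count expressed as map, shuffle and reduce stages. On EMR, it is exactly this kind of per-stage work that gets spread across the slave nodes.

# "map": emit one word per line; "shuffle": sort so identical words sit together;
# "reduce": count each group. A single machine does all three stages here;
# a cluster distributes them across nodes.
cat input.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -rn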

 

EMR programming model

 

In this model each node receives a workload, works on it, and finally delivers a result that the master node consolidates.

EMR cluster: the master node and the slave nodes working together to complete the job according to the Map-Reduce model.

 

For all of this to work, the software installed on the slave nodes must correspond to the master node's software, so the nodes can "talk" to each other properly while the cluster executes the job.

When we connect to the cluster, what we normally do is connect to the master node, and through it we launch the execution of the job. It is on this same master node that we can take remote control via SSH and perform activities at the operating-system level, such as installing new software or altering the configuration of the software already installed.

However, it is essential to keep in mind that the master node does not replicate or distribute these modifications to the rest of the cluster nodes. If we install a new library this way, such as Boto3, it will only be available on the master node, leaving the slave nodes unable to handle tasks that require it, and any job that needs the library will fail.
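To make the pitfall concrete, this is a sketch of the tempting but insufficient approach (the host name and the key file are placeholders for your own cluster):

# Connect to the master node and install the library by hand:
ssh -i mi_cluster.pem hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com
sudo pip install -U boto3    # reaches the master node only

# Any task that runs "import boto3" on a slave node will still fail with an
# ImportError, because the slave nodes never received this installation.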

The lesson is very simple: the software must be installed and configured BEFORE the cluster exists as such. Otherwise, the configuration changes and the installed software will only be available on the master node.

How? Very simple: with bootstrap actions.

 

Bootstrapping

 

Well, it may not be that simple at first, especially if you are not used to writing Bash scripts on Linux. Even so, that is, broadly speaking, the only hurdle to start working with EMR and bootstrap actions.

Bootstrap actions are essentially Bash scripts for Linux that automate installation steps or changes to the configuration of the installed software and of the operating system in general.

To oversimplify: every step that a human would carry out at a Linux SSH prompt can be turned into a line of a Bash script and executed automatically, without human supervision.

For example, to install Boto3:

 

sudo pip install -U \
awscli \
boto3

 

Then we turn it into a Bash script. We will name the file install_boto3.sh:

#!/bin/bash
sudo pip install -U \
awscli \
boto3

 

And finally we save it in S3:

s3://mi_bucket/scripts/install_boto3.sh
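For completeness, assuming the AWS CLI is already configured and the bucket mi_bucket exists, the upload itself is a single command:

aws s3 cp install_boto3.sh s3://mi_bucket/scripts/install_boto3.sh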

 

Simple, isn’t it?

Now we only need to reference this script as a bootstrap action in the configuration during cluster creation. There are basically two ways to do this (there are more ways to launch a cluster, but these are the most commonly used):

 

1. Using the Web console

Following the advanced options, in step 3 of the settings during cluster creation, open the "Bootstrap Actions" section:

 


 

The options to reference the script already saved in S3 will appear:

 


 

Finally, it only remains to launch the cluster creation, following the remaining configuration steps.

 

2. Using AWS CLI

This option is the easiest to control and execute. It requires having the AWS CLI correctly installed and configured with our AWS account. Then we just execute the following command:

 

aws emr create-cluster \
--applications Name=Hadoop Name=Hive Name=Pig Name=Hue Name=Spark Name=Zeppelin \
--ec2-attributes '{"KeyName":"mi_cluster","InstanceProfile":"EMR_EC2_Profile","SubnetId":"subnet-XXXXXXXXX","EmrManagedSlaveSecurityGroup":"sg-XXXXXXXXX","EmrManagedMasterSecurityGroup":"sg-XXXXXXXXX"}' \
--release-label emr-5.12.0 \
--log-uri 's3n://aws-logs-XXXXXXXXXXXXXX-us-east-1/elasticmapreduce/' \
--steps '[{"Args":["spark-submit","--deploy-mode","cluster","--driver-cores","1","--maximizeResourceAllocation","s3://mi_bucket/scripts/mi_script_python.py"],"Type":"CUSTOM_JAR","ActionOnFailure":"CANCEL_AND_WAIT","Jar":"command-runner.jar","Properties":"","Name":"Mi Programa Spark"}]' \
--instance-groups '[{"InstanceCount":4,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"CORE","InstanceType":"r4.xlarge","Name":"Core - 2"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"MASTER","InstanceType":"r4.xlarge","Name":"Master - 1"}]' \
--bootstrap-actions '[{"Path":"s3://mi_bucket/scripts/install_boto3.sh","Name":"Install Boto3"}]' \
--ebs-root-volume-size 50 \
--service-role EMR_Role \
--enable-debugging \
--name 'mi_aplicacion' \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION \
--region us-east-1

 

 

It's worth noting that this is logically a single command; the backslashes simply split it across lines to make it easier to read.

Some parameters are optional (e.g. --enable-debugging) and others depend on the infrastructure resources of your own AWS account (e.g. security group IDs, instance profile names such as "EMR_EC2_Profile", and service roles such as --service-role EMR_Role, among others).

It may seem a bit hard to handle at first, but in practice it is the easiest and fastest way to launch a cluster, and over time it becomes second nature.
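For a first test, a much smaller invocation is usually enough. The following sketch assumes the default EMR roles already exist in the account (they can be created once with aws emr create-default-roles) and reuses the placeholder names from above:

aws emr create-cluster \
--name 'mi_aplicacion' \
--release-label emr-5.12.0 \
--applications Name=Spark \
--instance-type r4.xlarge \
--instance-count 3 \
--use-default-roles \
--ec2-attributes KeyName=mi_cluster \
--bootstrap-actions 'Path=s3://mi_bucket/scripts/install_boto3.sh,Name=Install Boto3' \
--region us-east-1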

 

3. Public Resources

Since we are rarely the first to face a problem, most of the time it is possible to find someone who has solved it before us, and quite often that someone has been kind enough to make the solution available for reuse. That is how we find bootstrap scripts for AWS EMR that can be used directly to achieve specific goals, such as installing Jupyter Notebook on our EMR cluster.

 

The script is constantly updated, and can be found in the S3 path:

s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh
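As a hedged example, you can first download a copy of the script to review what it installs (the bucket allows public reads, so a configured AWS CLI can fetch it), and then reference it exactly like our own script:

# Download a local copy to review it:
aws s3 cp s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh .

# Then point the bootstrap action at the public path, for example:
# --bootstrap-actions '[{"Path":"s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh","Name":"Install Jupyter"}]'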

 

I hope you find this information useful and that it eases your access to this cool technology from AWS.

Best regards!
Marcelo

AWS Partner
Microsoft Partner




