
Amazon EMR for Data Science

Create on-demand Apache Hadoop clusters to analyze Big Data with Spark

Amazon EMR

Easily Run and Scale Apache Spark, Hadoop, HBase, Presto, Hive, and other Big Data Frameworks

Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB. EMR Notebooks, based on the popular Jupyter Notebook, provide a development and collaboration environment for ad hoc querying and exploratory analysis.

EMR securely and reliably handles a broad set of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.

EMR benefits

Since you can set EMR to install Apache Spark, the service is well suited for cleaning, reformatting, and analyzing big data. You can use EMR on demand: set it to grab the code and data from a source (e.g. S3 for the code, and S3 or RDS for the data), run the task on the cluster, store the results somewhere (again S3, RDS, or Redshift), and terminate the cluster.

By using the service in such a way, you can reduce the cost of your cluster significantly. In my opinion, EMR is one of the most useful AWS services for data scientists.
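As a sketch of that workflow, the command below launches a transient cluster that runs a single Spark step and shuts itself down when the step finishes. The bucket and script names are placeholders, not part of this guide's setup:

# Hypothetical transient cluster: it pulls job.py from S3, runs it as a
# Spark step, then terminates itself instead of idling in the WAITING state.
aws emr create-cluster --release-label emr-5.12.1 \
  --name 'Transient Spark job' \
  --applications Name=Spark \
  --use-default-roles --instance-type m4.large --instance-count 3 \
  --steps Type=Spark,Name='My Spark job',ActionOnFailure=TERMINATE_CLUSTER,Args=[s3://<your-s3-bucket>/code/job.py] \
  --auto-terminate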

This guide shows how to automate the creation of such an EMR cluster for data science purposes using the AWS CLI.

Install AWS CLI

Install the AWS command line client:

pip install awscli

Configure the AWS command line client:

aws configure
  • AWS Access Key ID: <<Your public access key>>
  • AWS Secret Access Key: <<Your private access key>>
  • Default region name: us-east-1
  • Default output format: json
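To confirm that the credentials work before going further, you can ask AWS who you are (a standard STS call, not specific to EMR):

aws sts get-caller-identity

It should print the account ID and the ARN of the configured user.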

Create the cluster

The following example CLI command launches a three-node (m4.large) EMR 5.12.1 cluster with a bootstrap action. The bootstrap action will install all the available kernels. It will also install the ggplot and nilearn Python packages and set:

  • the Jupyter port to 8885
  • the password to jupyter
  • the JupyterHub port to 8005
aws emr create-cluster --release-label emr-5.12.1 \
  --name 'My emr-5.12.1 cluster' \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Pig Name=Tez Name=Ganglia Name=Presto \
  --region us-east-1 \
  --use-default-roles --ec2-attributes KeyName=<your-ec2-key> \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
  --log-uri s3://<your-s3-bucket>/emr-logs/ \
  --bootstrap-actions Name='Install Jupyter notebook',Path="s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh",Args=[--r,--julia,--toree,--torch,--ruby,--ds-packages,--ml-packages,--python-packages,'ggplot nilearn',--port,8885,--password,jupyter,--jupyterhub,--jupyterhub-port,8005,--cached-install,--notebook-dir,s3://<your-s3-bucket>/notebooks/,--copy-samples]

Replace <your-ec2-key> with the name of your EC2 key pair and <your-s3-bucket> with the S3 bucket where you store notebooks. You can also change the instance types to suit your needs and budget.

Example:

aws emr create-cluster --release-label emr-5.12.1 \
  --name 'My emr-5.12.1 cluster' \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Pig Name=Tez Name=Ganglia Name=Presto \
  --region us-east-1 \
  --use-default-roles --ec2-attributes KeyName=cluster_keypair \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
  --log-uri s3://achilleaskn-emr-data-science/emr-logs/ \
  --bootstrap-actions Name='Install Jupyter notebook',Path="s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh",Args=[--r,--julia,--toree,--torch,--ruby,--ds-packages,--ml-packages,--python-packages,'ggplot nilearn',--port,8885,--password,mystrongjupyter,--jupyterhub,--jupyterhub-port,8005,--cached-install,--notebook-dir,s3://achilleaskn-emr-data-science/notebooks/,--copy-samples]

A cluster ID should be returned, as below:

{
    "ClusterId": "j-XXXXXXXXXXXXXXX"
}
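If you script these steps, it can be handy to capture the cluster ID in a shell variable at creation time using the CLI's built-in --query filter (a sketch; the create-cluster options are elided for brevity):

# Hypothetical wrapper: store the new cluster's ID for the commands below.
CLUSTER_ID=$(aws emr create-cluster --release-label emr-5.12.1 ... --query ClusterId --output text)
echo "$CLUSTER_ID"   # e.g. j-XXXXXXXXXXXXXXX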

Get information about the cluster

Cluster details
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX
Cluster state
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX | grep \"State\"
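Instead of grepping the JSON, you can also let the CLI extract the state directly, which returns a single clean value:

aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX \
  --query Cluster.Status.State --output text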

What are the different cluster states?

State                   Description
STARTING                The cluster provisions, starts, and configures EC2 instances.
BOOTSTRAPPING           Bootstrap actions are being executed on the cluster.
RUNNING                 A step for the cluster is currently being run.
WAITING                 The cluster is currently active, but has no steps to run.
TERMINATING             The cluster is in the process of shutting down.
TERMINATED              The cluster was shut down without error.
TERMINATED_WITH_ERRORS  The cluster was shut down with errors.
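If you would rather block until the cluster is ready than poll the state yourself, the CLI provides a waiter that returns once the cluster reaches the RUNNING or WAITING state:

aws emr wait cluster-running --cluster-id j-XXXXXXXXXXXXXXX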

Master node public DNS:
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX | grep \"MasterPublicDnsName\"

You should get something like:

"MasterPublicDnsName": "ec2-###-##-##-###.compute-1.amazonaws.com",
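The same --query trick extracts the bare hostname, which is convenient for feeding into ssh:

aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX \
  --query Cluster.MasterPublicDnsName --output text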

Add rules to access the cluster

Adding Rules to Your Security Group

Get Master's security group
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX | grep \"EmrManagedMasterSecurityGroup\"

It should return the ID of the security group applied to the master node:

"EmrManagedMasterSecurityGroup": "sg-0ea4d4bxxx8fe235e"

To add a rule that allows inbound SSH traffic

The following command adds a rule to enable inbound traffic on TCP port 22 (SSH) to the security group with the ID sg-XXXXXXXXXXX:

aws ec2 authorize-security-group-ingress --group-id sg-XXXXXXXXXXX --protocol tcp --port 22 --cidr 0.0.0.0/0

Replace sg-XXXXXXXXXXX with the master's security group ID described above.

To add a rule that allows inbound Jupyter & JupyterHub traffic

For Jupyter:

aws ec2 authorize-security-group-ingress --group-id sg-XXXXXXXXXXX --protocol tcp --port 8885 --cidr 0.0.0.0/0

For JupyterHub:

aws ec2 authorize-security-group-ingress --group-id sg-XXXXXXXXXXX --protocol tcp --port 8005 --cidr 0.0.0.0/0

Replace sg-XXXXXXXXXXX with the master's security group ID.
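Note that 0.0.0.0/0 opens these ports to the entire internet. A safer variant (a sketch, using Amazon's public check-ip service) restricts access to your current IP only:

# Allow SSH only from your current public IP (/32 = a single address).
aws ec2 authorize-security-group-ingress --group-id sg-XXXXXXXXXXX \
  --protocol tcp --port 22 --cidr "$(curl -s https://checkip.amazonaws.com)/32"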

Connect to the cluster via SSH

You can also connect to the master node via SSH with the following command:

aws emr ssh --cluster-id j-XXXXXXXXXXXXXXX --key-pair-file <your-ec2-key>

Replace <your-ec2-key> with the path to your key file.
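If you opened the Jupyter port only to your own IP, or prefer not to expose it at all, you can instead reach Jupyter through an SSH tunnel to the master node (a sketch; the hostname and key path are placeholders):

# Forward local port 8885 to Jupyter on the master node, then point your
# browser at localhost:8885. The default user on EMR nodes is hadoop.
ssh -i <your-ec2-key> -N -L 8885:localhost:8885 \
  hadoop@ec2-###-##-##-###.compute-1.amazonaws.com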

Terminate the cluster

When you are done, remember to terminate the cluster!

aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXXXX

...and confirm that it is terminating:

aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX | grep \"State\"

You should see:

    "State": "TERMINATING"
        "State": "TERMINATING"
        "State": "TERMINATING"

References

http://people.duke.edu/~ccc14/sta-663-2016/21H_Spark_Cloud.html

https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/