Running Spark ML models on Amazon EMR
As data volumes grow, businesses increasingly need scalable solutions for machine learning (ML) training and inference. Apache Spark MLlib, a powerful ML library built on top of Spark, offers distributed processing capabilities for handling large-scale datasets. When combined with Amazon EMR (Elastic MapReduce)—a cloud-native big data platform—you get the ability to run Spark ML models at scale with cost efficiency and flexibility.
In this blog, we’ll walk through how to run Spark ML models on Amazon EMR, including setup, deployment, and performance tips.
🚀 Why Spark ML + Amazon EMR?
Apache Spark MLlib provides scalable machine learning algorithms, such as classification, regression, clustering, and collaborative filtering. Amazon EMR simplifies the setup and management of Spark clusters in the cloud, enabling:
Distributed training over large datasets
On-demand compute scaling
Integration with S3, Hive, HDFS, and more
Lower operational overhead vs. self-managed clusters
This combination is ideal for teams building ML models that need both speed and scalability.
🧰 Step-by-Step: Running Spark ML on EMR
1. Prepare Your Data
Store your training data in Amazon S3, ideally in a partitioned and columnar format like Parquet for faster processing:
```plaintext
s3://your-bucket/ml-data/training-data.parquet
```
Ensure the data is clean, well-structured, and includes labels if you're working on supervised learning.
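If your raw data lands in S3 as CSV, a short PySpark job can convert it to partitioned Parquet before training. The sketch below is illustrative: the input path, the header/schema-inference options, and the partition column `event_date` are assumptions to adapt to your own data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PrepareTrainingData").getOrCreate()

# Read raw CSV (placeholder path; schema inference used only for illustration)
raw_df = spark.read.csv("s3://your-bucket/raw-data/", header=True, inferSchema=True)

# Write partitioned Parquet for faster downstream reads
(raw_df
    .write
    .mode("overwrite")
    .partitionBy("event_date")  # assumed partition column
    .parquet("s3://your-bucket/ml-data/training-data.parquet"))
```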
2. Launch an Amazon EMR Cluster with Spark
Use the AWS Console, AWS CLI, or a CloudFormation script to create an EMR cluster:
Choose a recent EMR release (e.g., emr-6.10.0 or later, based on Amazon Linux 2)
Enable Spark as an application
Add necessary instance types (e.g., m5.xlarge)
Configure auto-scaling if needed
Specify an EC2 key pair for SSH access (optional)
You can also enable JupyterHub for interactive development.
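If you prefer launching the cluster from code rather than the console or CLI, the boto3 sketch below creates a comparable cluster. The region, key pair name, log bucket, and instance counts are placeholders, and the default EMR roles (EMR_DefaultRole, EMR_EC2_DefaultRole) are assumed to already exist in your account:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

response = emr.run_job_flow(
    Name="spark-ml-cluster",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}, {"Name": "JupyterHub"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "Ec2KeyName": "your-key-pair",  # optional, for SSH access
    },
    LogUri="s3://your-bucket/emr-logs/",  # placeholder log bucket
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Cluster ID:", response["JobFlowId"])
```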
3. Develop Your Spark ML Model
Use PySpark or Scala to build and train your ML model. Below is sample PySpark code for training a logistic regression model:
```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("SparkML-EMR").getOrCreate()

# Load training data from S3
df = spark.read.parquet("s3://your-bucket/ml-data/training-data.parquet")

# Feature engineering
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "feature3"],
    outputCol="features"
)
assembled_df = assembler.transform(df)

# Train logistic regression model
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(assembled_df)

# Save model to S3
model.save("s3://your-bucket/models/logistic-regression/")
```
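Before saving a model you intend to rely on, it is worth holding out a test split and checking a metric. The snippet below is a sketch that continues from the `assembled_df` above and assumes a binary `label` column:

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Hold out 20% of the data for evaluation
train_df, test_df = assembled_df.randomSplit([0.8, 0.2], seed=42)

model = lr.fit(train_df)
predictions = model.transform(test_df)

# Area under ROC as a quick sanity check (assumes a binary label)
evaluator = BinaryClassificationEvaluator(labelCol="label")
print("AUC:", evaluator.evaluate(predictions))
```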
4. Run Your Script on EMR
Upload your PySpark script to S3 (e.g., s3://your-bucket/scripts/run_model.py), then submit it as a step using the AWS CLI:
```bash
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="RunSparkML",ActionOnFailure=CONTINUE,\
Args=[--deploy-mode,cluster,--master,yarn,s3://your-bucket/scripts/run_model.py]
```
Alternatively, SSH into the master node, copy the script onto it, and run:
```bash
spark-submit run_model.py
```
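Once the model is saved to S3, a later job or EMR step can load it for distributed batch inference. The sketch below uses placeholder scoring and output paths; the feature columns must match those used during training:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegressionModel
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("SparkML-EMR-Inference").getOrCreate()

# Load the model saved in step 3
model = LogisticRegressionModel.load("s3://your-bucket/models/logistic-regression/")

# Assemble features on new data (placeholder path and column names)
new_df = spark.read.parquet("s3://your-bucket/ml-data/scoring-data.parquet")
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "feature3"],
    outputCol="features"
)
scored = model.transform(assembler.transform(new_df))

# Persist predictions back to S3
scored.select("prediction", "probability").write.mode("overwrite") \
    .parquet("s3://your-bucket/predictions/")
```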
📈 Performance Tips
Use Parquet or ORC for input data to improve I/O performance.
Tune Spark executor settings such as spark.executor.memory, spark.executor.cores, and spark.executor.instances (see the example after this list).
Use auto-scaling to manage costs and performance dynamically.
Monitor jobs using the Spark History Server or Ganglia in EMR.
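To make the tuning tip concrete, executor resources can be set when the Spark session is created, or passed as --conf flags to spark-submit. The values below are illustrative placeholders, not recommendations; size them to your instance types and data volume:

```python
from pyspark.sql import SparkSession

# Example executor settings (placeholder values; adjust to your cluster)
spark = (
    SparkSession.builder
    .appName("SparkML-EMR-Tuned")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.instances", "10")
    .getOrCreate()
)
```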
✅ Conclusion
Running Spark ML models on Amazon EMR provides a powerful and scalable way to handle large-scale machine learning workloads. By leveraging the flexibility of Spark and the cloud-native capabilities of EMR, you can focus on building models—without worrying about infrastructure.
Whether you're training models on terabytes of data or performing distributed inference, EMR offers a production-ready environment tailored for enterprise-grade machine learning.