Creating Glue bookmarks for incremental ETL

When working with ETL (Extract, Transform, Load) pipelines in AWS Glue, handling incremental data is crucial for optimizing performance and reducing costs. Instead of processing the entire dataset every time, Glue Bookmarks enable you to process only new or changed data. This functionality is essential for building scalable and efficient data pipelines in a production environment.

In this blog, we’ll explore what Glue bookmarks are, why they are important, and how to use them to build incremental ETL pipelines.


What are Glue Bookmarks?

AWS Glue job bookmarks track data that a job has already processed and help avoid reprocessing it in future runs. When bookmarks are enabled, AWS Glue stores state information (such as file names, timestamps, and read positions) so that each job run knows where to pick up.

This is particularly useful when:

  • You're reading data from S3 or JDBC sources.
  • Your data is appended regularly (e.g., logs, transactions).
  • You want to implement incremental or delta processing.


Why Use Glue Bookmarks in ETL?

Here are the key benefits of using bookmarks:

  • Efficiency: Process only new records, reducing processing time.
  • Cost Saving: Lower data scanning and compute costs.
  • Automation: No need for manual filtering or tracking state externally.
  • Reliability: Avoid duplicate data processing in your data lake or warehouse.


How to Enable Glue Bookmarks

Follow these steps to create and use Glue bookmarks for incremental ETL:

Step 1: Prepare the Data Source

Ensure that your source data is in a compatible format and supports incremental reading. Common sources include:

  • Amazon S3
  • Amazon RDS
  • Amazon Redshift

For S3, it’s best if the data is partitioned by timestamp or other logical partitions (e.g., year=2025/month=06/day=09).
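If the partitioned data has already been crawled into the Glue Data Catalog, you can combine a partition pushdown predicate with a transformation_ctx so each run reads only the relevant partitions. The snippet below is a minimal sketch; it assumes a GlueContext like the one set up in Step 3, and the names my_database, events, and events_source are hypothetical:

python

# Sketch: incremental read of a cataloged, partitioned table.
# "my_database", "events", and "events_source" are hypothetical names.
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="events",
    push_down_predicate="year = '2025' AND month = '06'",  # prune partitions at read time
    transformation_ctx="events_source"  # lets job bookmarks track this source
)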


Step 2: Create a Glue Job

  • Go to the AWS Glue Console.
  • Navigate to Jobs and click Add job.
  • Provide the job details and choose the Spark (glueetl) job type; job bookmarks are not supported for Python shell jobs.
  • In the job properties, enable bookmarks by setting the Job bookmark option to "Enable".
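
If you create jobs programmatically rather than through the console, the same setting is passed as the --job-bookmark-option default argument. Below is a minimal boto3 sketch; the job name, IAM role ARN, and script location are placeholders to replace with your own:

python

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="incremental-etl-job",  # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder IAM role
    Command={
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://your-bucket/scripts/incremental_etl.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--job-bookmark-option": "job-bookmark-enable",  # turn bookmarks on
    },
    GlueVersion="4.0",
)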


Step 3: Write a Bookmark-Aware Script

Here’s a basic example using Python with AWS Glue's DynamicFrame. The transformation_ctx argument on each read and write is what ties that step to the bookmark state, so make sure it is set; sources read without it are not tracked by bookmarks:

python

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Initialize the job; this is required for bookmark state to be saved on commit
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from S3 with bookmark support.
# The transformation_ctx is the key Glue uses to store bookmark state for this source.
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://your-bucket/input-data/"],
        "recurse": True
    },
    format="json",  # or "csv", "parquet", etc.
    transformation_ctx="datasource"
)

# Apply transformations if needed
transformed = ApplyMapping.apply(
    frame=datasource,
    mappings=[("id", "string", "id", "string"),
              ("timestamp", "string", "timestamp", "string")]
)

# Write to the output destination
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/output-data/"},
    format="parquet",
    transformation_ctx="writer"
)

# Commit the job so the bookmark state is saved for the next run
job.commit()

Step 4: Run and Monitor the Job

  • Run the job manually or schedule it via triggers.
  • Check the job bookmark state in the AWS Glue Console.
  • View logs in CloudWatch to confirm that incremental loading is working (look for messages like “Skipping already processed files”).
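
You can also trigger a run and inspect the stored bookmark entry programmatically. The snippet below is a small boto3 sketch, reusing the hypothetical job name from Step 2:

python

import boto3

glue = boto3.client("glue")

# Start a run of the job ("incremental-etl-job" is a hypothetical name)
run = glue.start_job_run(JobName="incremental-etl-job")
print("Started run:", run["JobRunId"])

# Once the run finishes, fetch the bookmark entry that Glue stored for the job
bookmark = glue.get_job_bookmark(JobName="incremental-etl-job")
print(bookmark["JobBookmarkEntry"])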


Best Practices

  • Partition your data: Partitioned layouts make incremental reads more efficient.
  • Avoid schema changes: Drastic changes may cause bookmark issues.
  • Test your logic: Use bookmarks in development carefully to avoid skipping data unintentionally; while testing you can reset or pause the bookmark, as shown in the sketch after this list.
  • Keep job names and transformation contexts stable: Bookmark state is keyed to the job and its transformation_ctx values, so renaming them or heavily rewriting the script can reset or invalidate bookmarks.
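
Here is a minimal boto3 sketch of both development-time options; "incremental-etl-job" is again a hypothetical job name:

python

import boto3

glue = boto3.client("glue")

# Rewind: clear the stored bookmark so the next run reprocesses all data
glue.reset_job_bookmark(JobName="incremental-etl-job")

# Freeze: run once with bookmarks paused, so the existing state is used
# to filter input but is not advanced by this run
glue.start_job_run(
    JobName="incremental-etl-job",
    Arguments={"--job-bookmark-option": "job-bookmark-pause"},
)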


Conclusion

Glue bookmarks are a powerful feature that help automate and optimize incremental ETL workflows in AWS Glue. By enabling them, you ensure that your jobs run faster, cost less, and avoid reprocessing already handled data. Whether you're building a data lake or streaming transaction logs, mastering Glue bookmarks is a must-have skill for every data engineer working in AWS.
