Writing custom job bookmarks in AWS Glue

AWS Glue is a powerful serverless data integration service that enables data engineers to discover, catalog, clean, and transform data. One of its most valuable features for building scalable ETL workflows is job bookmarks—a mechanism to keep track of previously processed data to avoid duplicates. While Glue automatically manages bookmarks for many jobs, sometimes custom logic is needed. In this blog, we’ll explore what job bookmarks are, why you might need to create custom ones, and how to implement them in AWS Glue.


📌 What are AWS Glue Job Bookmarks?

Job bookmarks help AWS Glue track the state of data processing. When a job runs with bookmarks enabled, it stores metadata about what data has already been read and processed. This allows you to run incremental ETL jobs by only processing new or changed data.

For example, if you have a job that loads files from S3 every hour, Glue will only pick the files added since the last job run—thanks to bookmarks.


🧠 Why Use Custom Job Bookmarks?

Out-of-the-box job bookmarks are useful but may not cover all scenarios. Custom bookmarks are helpful when:

You process non-standard data sources.

Your dataset lacks a timestamp or watermark column.

You need fine-grained control over what is considered “new” data.

You want to integrate bookmark logic across multiple datasets or jobs.


🛠️ How to Write Custom Job Bookmarks

Let’s walk through implementing a custom job bookmark system using Python in AWS Glue.

✅ Step 1: Enable Bookmarks in the Job

In your AWS Glue job configuration:

Go to Job details > Job parameters.

Add: --job-bookmark-option=job-bookmark-enable

This turns on Glue's built-in bookmark state. The custom logic in the following steps manages its own bookmark value in an external store, alongside the built-in mechanism.


✅ Step 2: Design a Bookmark Strategy

Define how you’ll identify new data. Common approaches include:

Timestamps (e.g., last_updated)

Sequential IDs (e.g., transaction_id)

File names or modification dates (for S3-based sources)
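Whichever strategy you choose, the core operation is the same: derive the next bookmark value from the batch you just processed. A minimal, store-agnostic sketch (the record shape and key names here are illustrative, not part of any Glue API):

```python
def next_bookmark(records, key, previous):
    """Return the highest value of `key` in the batch, or the previous
    bookmark when the batch is empty or the key is missing."""
    values = [r[key] for r in records if r.get(key) is not None]
    return max(values, default=previous)

batch = [
    {"transaction_id": 101},
    {"transaction_id": 105},
    {"transaction_id": 103},
]
print(next_bookmark(batch, "transaction_id", 0))  # 105
print(next_bookmark([], "transaction_id", 99))    # 99: empty batch keeps the old mark
```

The same helper works for timestamps or file modification dates, since it only relies on the values being comparable.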


✅ Step 3: Retrieve Previous Bookmark

Glue does not expose a public API for reading a bookmark's value inside a script, so a practical pattern is to keep the custom bookmark in an external store you control. The sketch below uses AWS Systems Manager Parameter Store; the parameter name is an example:

python

import boto3
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

ssm = boto3.client("ssm")
bookmark_key = "my_custom_bookmark_key"  # example SSM parameter name

try:
    bookmark_value = int(ssm.get_parameter(Name=bookmark_key)["Parameter"]["Value"])
except ssm.exceptions.ParameterNotFound:
    bookmark_value = 0  # first run: process everything

print(f"Previous bookmark: {bookmark_value}")


✅ Step 4: Filter New Data Based on Bookmark

Use the retrieved value to filter new records in your dataset:

python

# from_catalog returns a DynamicFrame (database and table names are examples)
df = glueContext.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")

# Keep only rows strictly newer than the stored bookmark
df_filtered = df.filter(lambda row: row["order_id"] > bookmark_value)


✅ Step 5: Save New Bookmark After Processing

After successfully processing the filtered data, write the latest value back to the external store:

python

import boto3

# toDF() converts the DynamicFrame to a Spark DataFrame for aggregation
max_id = df_filtered.toDF().agg({"order_id": "max"}).collect()[0][0]
if max_id is not None:  # empty batch: keep the old bookmark
    boto3.client("ssm").put_parameter(Name=bookmark_key, Value=str(max_id), Type="String", Overwrite=True)

This updates the bookmark with the most recent order_id, allowing the next job run to continue from there.
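Putting Steps 3–5 together, the control flow can be sketched with an in-memory dict standing in for the real bookmark store (all names are illustrative):

```python
store = {}  # stands in for SSM, DynamoDB, or S3

def run_job(rows, key="order_id"):
    """One simulated job run: read the bookmark, filter, process, advance."""
    last = store.get(key, 0)                           # Step 3: previous bookmark
    new_rows = [r for r in rows if r[key] > last]      # Step 4: keep newer rows
    if new_rows:                                       # Step 5: advance the mark
        store[key] = max(r[key] for r in new_rows)
    return new_rows

rows = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]
print(len(run_job(rows)))  # 3: first run processes everything
print(len(run_job(rows)))  # 0: second run sees nothing new
```

Note that the bookmark only advances when the run succeeds, which is what makes re-runs after a failure safe.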


🧪 Tips for Reliable Bookmarking

Always validate the bookmark value before using it in filters.

Store bookmarks in external metadata stores (like DynamoDB or S3) for more control.

Combine bookmarks with job arguments for dynamic filtering.
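As a concrete example of the external-store tip, a small DynamoDB-backed bookmark store might look like the sketch below. The table name etl_bookmarks and the partition key job_name are assumptions, not Glue conventions; the table handle is injectable, which also makes the class easy to exercise without AWS access:

```python
class BookmarkStore:
    """Read/write a bookmark value keyed by job name in a DynamoDB table."""

    def __init__(self, table=None):
        if table is None:
            import boto3  # only needed when no table handle is injected
            table = boto3.resource("dynamodb").Table("etl_bookmarks")
        self.table = table

    def get(self, job_name, default=None):
        item = self.table.get_item(Key={"job_name": job_name}).get("Item")
        return item["value"] if item else default

    def put(self, job_name, value):
        self.table.put_item(Item={"job_name": job_name, "value": str(value)})
```

Inside a Glue script you would call store.get(...) before filtering and store.put(...) after a successful run, mirroring Steps 3 and 5 above.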


🏁 Conclusion

Custom job bookmarks in AWS Glue offer a flexible way to track and manage incremental data processing when default bookmarking isn’t enough. By building your own logic to read, filter, and update bookmarks, you can create robust ETL pipelines tailored to your business needs. Whether you’re processing streaming data, syncing across jobs, or applying custom thresholds, implementing custom bookmarks ensures efficiency and reliability in your data workflows. 
