Writing custom job bookmarks in AWS Glue
AWS Glue is a powerful serverless data integration service that enables data engineers to discover, catalog, clean, and transform data. One of its most valuable features for building scalable ETL workflows is job bookmarks—a mechanism to keep track of previously processed data to avoid duplicates. While Glue automatically manages bookmarks for many jobs, sometimes custom logic is needed. In this blog, we’ll explore what job bookmarks are, why you might need to create custom ones, and how to implement them in AWS Glue.
🔖 What are AWS Glue Job Bookmarks?
Job bookmarks help AWS Glue track the state of data processing. When a job runs with bookmarks enabled, it stores metadata about what data has already been read and processed. This allows you to run incremental ETL jobs by only processing new or changed data.
For example, if you have a job that loads files from S3 every hour, bookmarks let Glue pick up only the files added since the last successful run.
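For context, here is roughly what a script using the built-in mechanism looks like. The transformation_ctx string is the handle Glue keys bookmark state on; the catalog names are placeholders for this sketch:
python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# transformation_ctx tells Glue which bookmark state belongs to this source
orders = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders", transformation_ctx="orders_src"
)

job.commit()  # persists the bookmark state for the next run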
🧠 Why Use Custom Job Bookmarks?
Out-of-the-box job bookmarks are useful but may not cover all scenarios. Custom bookmarks are helpful when:
You process non-standard data sources.
Your dataset lacks a timestamp or watermark column.
You need fine-grained control over what is considered “new” data.
You want to integrate bookmark logic across multiple datasets or jobs.
🛠️ How to Write Custom Job Bookmarks
Let’s walk through implementing a custom job bookmark system using Python in AWS Glue.
✅ Step 1: Enable Bookmarks in the Job
In your AWS Glue job configuration:
Go to Job details > Job parameters.
Add the key --job-bookmark-option with the value job-bookmark-enable.
This tells Glue to persist bookmark state between runs. Note that this option drives Glue's built-in mechanism; the custom logic below manages its own state alongside it.
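If you start runs programmatically rather than through the console, the same option can be passed per run. A sketch with boto3 (the job name my_job is a placeholder):
python
import boto3

glue = boto3.client("glue")
# Pass the bookmark option as a run argument
glue.start_job_run(
    JobName="my_job",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)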
✅ Step 2: Design a Bookmark Strategy
Define how you’ll identify new data (a sketch of the timestamp approach follows this list). Common approaches include:
Timestamps (e.g., last_updated)
Sequential IDs (e.g., transaction_id)
File names or modification dates (for S3-based sources)
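As a quick illustration of the timestamp approach, a stored watermark can be turned into a pushdown predicate so Glue skips old data at read time. This sketch assumes the catalog table is partitioned by a dt column:
python
# Hypothetical watermark loaded from your bookmark store
last_processed_dt = "2024-06-01"

# push_down_predicate prunes partitions before any data is read
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders",
    push_down_predicate=f"dt > '{last_processed_dt}'",
)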
✅ Step 3: Retrieve Previous Bookmark
Glue doesn’t expose a public API for reading a custom bookmark value by key, so a common pattern is to keep the bookmark in an external store such as S3 or DynamoDB and read it when the job starts. Here is a minimal sketch using an S3 object (the bucket name is an assumption):
python
import boto3
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# The custom bookmark lives in an external store; here, a small S3 object
s3 = boto3.client("s3")
bookmark_key = "my_custom_bookmark_key"
try:
    obj = s3.get_object(Bucket="my-bookmark-bucket", Key=bookmark_key)
    bookmark_value = int(obj["Body"].read())
except s3.exceptions.NoSuchKey:
    bookmark_value = 0  # first run: process everything
print(f"Previous bookmark: {bookmark_value}")
✅ Step 4: Filter New Data Based on Bookmark
Use the retrieved value to filter new records in your dataset:
python
# The catalog read returns a DynamicFrame (not a Spark DataFrame)
dyf = glueContext.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")
# Keep only records whose order_id is above the stored bookmark
dyf_filtered = dyf.filter(f=lambda row: row["order_id"] > bookmark_value)
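On large tables, the same filter expressed as a Spark column expression avoids calling a Python lambda per record and lets Spark optimize the comparison; a sketch under the same schema assumptions:
python
from pyspark.sql import functions as F

# Equivalent filter as a Catalyst expression on the underlying DataFrame
df_new = dyf.toDF().where(F.col("order_id") > bookmark_value)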
✅ Step 5: Save New Bookmark After Processing
After successfully processing the filtered data, store the latest value:
python
# Highest order_id just processed; None if there were no new rows
max_id = dyf_filtered.toDF().agg({"order_id": "max"}).collect()[0][0]
if max_id is not None:
    s3.put_object(Bucket="my-bookmark-bucket", Key=bookmark_key, Body=str(max_id))
This updates the bookmark to the most recent order_id, so the next run continues from there.
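Putting Steps 4 and 5 together, one ordering detail is worth making explicit: write your output before advancing the bookmark, so a failed bookmark update causes reprocessing rather than data loss; that in turn means the output write should be idempotent. A sketch, with the output path as an assumption:
python
# 1. Write the processed slice first
glueContext.write_dynamic_frame.from_options(
    frame=dyf_filtered,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/orders/"},
    format="parquet",
)
# 2. Only then advance the bookmark; a failure here re-runs the same slice
s3.put_object(Bucket="my-bookmark-bucket", Key=bookmark_key, Body=str(max_id))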
🧪 Tips for Reliable Bookmarking
Always validate the bookmark value before using it in filters (see the sketch after this list).
Store bookmarks in external metadata stores (like DynamoDB or S3) for more control.
Combine bookmarks with job arguments for dynamic filtering.
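For the first tip, a small guard like this keeps a corrupt or missing bookmark from silently dropping data; the non-negative integer assumption matches the order_id example above:
python
def load_bookmark(raw, default=0):
    """Return a validated integer bookmark, or a safe default that reprocesses everything."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return default
    return value if value >= 0 else default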
🚀 Conclusion
Custom job bookmarks in AWS Glue offer a flexible way to track and manage incremental data processing when default bookmarking isn’t enough. By building your own logic to read, filter, and update bookmarks, you can create robust ETL pipelines tailored to your business needs. Whether you’re processing streaming data, syncing across jobs, or applying custom thresholds, implementing custom bookmarks ensures efficiency and reliability in your data workflows.