Handling Corrupt Files in S3 During ETL

In modern data pipelines, Amazon S3 often serves as a landing zone for raw data before it's transformed and loaded into data warehouses or data lakes. While S3 offers scalability and durability, ETL (Extract, Transform, Load) processes that consume files from S3 must handle the reality that some files may be corrupt, incomplete, or improperly formatted. If not handled properly, these corrupt files can break your pipeline or produce inaccurate analytics.

This blog explores how to detect, handle, and mitigate corrupt files in S3 during ETL to ensure reliable and robust data processing.


What is a Corrupt File in the Context of ETL?

A corrupt file in S3 typically refers to any file that:

  • Is partially uploaded (incomplete multipart uploads).
  • Has invalid format or encoding (e.g., malformed JSON, wrong CSV delimiter).
  • Is empty or has unreadable content.
  • Fails schema validation or throws exceptions during parsing.

These corrupt files can cause job failures, especially in distributed processing frameworks like Apache Spark, AWS Glue, or even simple Lambda-based ETL jobs.


Why Handling Corrupt Files is Important

  • Prevent pipeline failures: One bad file shouldn't crash the entire ETL job.
  • Ensure data accuracy: Prevent bad records from polluting clean datasets.
  • Save time and cost: Avoid rerunning large jobs due to a few problematic files.
  • Improve operational visibility: Identify bad data sources or upstream issues early.


Strategies to Handle Corrupt Files in S3

1. Validation Before Processing

Run a pre-check script using AWS Lambda or a lightweight EC2 job to validate file size, type, encoding, and schema before a file is included in the ETL job.

Use checks like the following, illustrated in the sketch after this list:

  • File size > 0 bytes
  • File extension validation
  • Readability using test parsers (e.g., json.loads, pandas.read_csv)
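As an illustration, here is a minimal pre-check sketch using boto3; the validate_s3_object helper and its return convention are assumptions for this example, not a fixed API:

python
import json
import boto3

s3 = boto3.client("s3")

def validate_s3_object(bucket, key):
    # Lightweight pre-checks before a file enters the ETL job
    head = s3.head_object(Bucket=bucket, Key=key)

    # Check 1: file size > 0 bytes
    if head["ContentLength"] == 0:
        return False, "empty_file"

    # Check 2: file extension validation
    if not key.endswith((".json", ".csv")):
        return False, "unexpected_extension"

    # Check 3: readability with a test parser (JSON shown; use pandas.read_csv for CSV)
    if key.endswith(".json"):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        try:
            json.loads(body)
        except (UnicodeDecodeError, ValueError):
            return False, "malformed_json"

    return True, "ok"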


2. Isolate and Quarantine

Move detected corrupt files to a separate S3 folder such as s3://bucket/quarantine/.

Tag the files with metadata (x-amz-meta-error: malformed_json) for downstream analysis.

Maintain a log of corrupt files for auditing and reporting.
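Since S3 has no native move operation, quarantining is a copy followed by a delete. A minimal sketch with boto3; the bucket layout and the quarantine_file helper name are illustrative:

python
import boto3

s3 = boto3.client("s3")

def quarantine_file(bucket, key, reason):
    # Copy the corrupt object into quarantine/ with an error tag,
    # then remove the original. Metadata keys surface as x-amz-meta-*.
    s3.copy_object(
        Bucket=bucket,
        Key=f"quarantine/{key}",
        CopySource={"Bucket": bucket, "Key": key},
        Metadata={"error": reason},
        MetadataDirective="REPLACE",  # required to attach new metadata on copy
    )
    s3.delete_object(Bucket=bucket, Key=key)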


3. Schema Validation with Tools

Use tools like Great Expectations, Apache Deequ, or Cerberus to validate schema consistency and catch anomalies.

If using AWS Glue or PySpark, catch AnalysisException, MalformedInputException, or custom validation errors.

Example in PySpark:

python
try:
    df = spark.read.json("s3://bucket/data/")
except Exception as e:
    # log_error_and_move_file is a custom helper that records the error
    # and quarantines the offending file
    log_error_and_move_file(e, "corrupt_file.json")
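For record-level checks with Cerberus, a short sketch (the schema below is illustrative):

python
from cerberus import Validator

schema = {
    "id": {"type": "integer", "required": True},
    "email": {"type": "string", "required": True},
    "amount": {"type": "float", "min": 0},
}
v = Validator(schema)

record = {"id": "not-an-int", "email": "a@example.com", "amount": -5.0}
if not v.validate(record):
    # v.errors maps each failing field to its validation messages
    print(v.errors)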


4. Set ETL to Skip Corrupt Files

Some engines let you skip or redirect corrupt records instead of failing the whole job. On Databricks, Spark exposes a badRecordsPath option that writes unparsable records to a side location:

python
df = spark.read.option("badRecordsPath", "s3://bucket/bad-records/").json("s3://bucket/data/")

AWS Glue supports similar behavior for DynamicFrames, which track error records separately from successfully parsed ones.
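On open-source Spark, where badRecordsPath is not available, two built-in mechanisms cover similar ground. A minimal sketch, assuming JSON input and an inferred schema:

python
# Skip files that cannot be read at all (e.g., truncated or corrupt objects)
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# PERMISSIVE mode (the default for JSON/CSV) routes unparsable rows
# into a side column instead of failing the job
df = (spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("s3://bucket/data/"))

df.cache()  # Spark requires caching before queries that reference only the corrupt-record column
bad_rows = df.filter("_corrupt_record IS NOT NULL")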


5. Alert and Notify

Trigger SNS alerts, CloudWatch alarms, or send Slack notifications when corrupt files are detected.

This ensures that teams are immediately aware and can act before the data pipeline fails.
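A minimal notification sketch with boto3; the topic ARN is a placeholder:

python
import boto3

sns = boto3.client("sns")

def alert_corrupt_file(bucket, key, reason):
    # Subscribers on the topic (email, a Slack-relay Lambda, etc.) fan out from here
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:etl-corrupt-files",  # placeholder
        Subject="Corrupt file detected in ETL pipeline",
        Message=f"s3://{bucket}/{key} failed validation: {reason}",
    )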

Best Practices

  • Always log failed records or corrupt files with clear error messages.
  • Automate file validation as part of your ETL orchestration.
  • Implement versioning in S3 to track and recover previous good versions of files.
  • Schedule regular scans of your raw data folders for potential corrupt files.


Conclusion

Corrupt files in S3 are inevitable in large-scale ETL pipelines — but they don’t have to break your process. By proactively validating, isolating, and gracefully handling these files, you can build more resilient, reliable, and scalable data pipelines. Whether you're using Spark, AWS Glue, or custom ETL scripts, incorporating these strategies will help ensure that your analytics and reporting stay accurate and dependable.
