Managing Multiple File Formats in Spark

📌 Why File Format Choice Matters

In production Spark + ADLS pipelines, your file format directly impacts performance, cost, schema drift handling, and cross-system compatibility. Here’s a real-world, engineer-level guide for Parquet, Avro, CSV, and SequenceFile — with practical tips and robust Spark code.

📂 Parquet — The Workhorse for Analytics

What It Is

Columnar storage: compressed, splittable, and optimized for heavy scans over a subset of columns. It is Spark's default DataFrame format and the de facto standard for structured big data lakes.

When to Use

Large fact tables, OLAP workloads, feature stores, BI dashboards — perfect when data is mostly append-only and read-heavy.

Read / Write

val df = spark.read
  .option("mergeSchema", "true") // ensures new columns are included
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write
  .mode("overwrite")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

By default, Spark infers the schema from a single file's footer, so new columns that appear only in other files or partitions can be silently dropped. Use mergeSchema=true (or set spark.sql.parquet.mergeSchema at the session level) for folders where columns evolve over time; the merge reads every footer, so expect a small extra cost on large directories.
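
A minimal sketch of what this looks like in practice, assuming illustrative date-partitioned subfolders under the same base path:

import spark.implicits._

// Older partition written with only (id, name)
Seq((1L, "a")).toDF("id", "name")
  .write.mode("append")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/dt=2024-01-01/")

// Newer partition adds new_column
Seq((2L, "b", "x")).toDF("id", "name", "new_column")
  .write.mode("append")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/dt=2024-01-02/")

// Without mergeSchema the read may expose only one footer's columns;
// with it, the union appears and new_column is null for the older rows
spark.read
  .option("mergeSchema", "true")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/")
  .printSchema()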

🧬 Avro — Reliable for Streaming & Kafka

What It Is

Row-based binary format with schema embedded in each file. Designed for high-efficiency record serialization across systems and languages.

When to Use

Kafka pipelines, real-time streaming ETL, cross-platform data exchange between Spark, Java, Flink, Python, etc.

Read / Write

import org.apache.spark.sql.types._

val mySchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("new_column", StringType)
))

val df = spark.read
  .format("avro")
  .schema(mySchema)
  .load("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write.format("avro")
  .save("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

Avro supports backward/forward compatibility — but Spark won’t auto-merge multiple Avro versions in one read. Use an explicit StructType or a Schema Registry if working with Kafka topics.
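
If the records arrive through Kafka, a common pattern is to decode the binary payload with from_avro from the spark-avro module (Spark 3.x). A minimal sketch, assuming a plain Avro payload (no Confluent wire-format header) and placeholder broker, topic, and schema:

import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

// Writer's schema as JSON; in practice this usually comes from a Schema Registry
val avroSchema =
  """{"type":"record","name":"Event","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"name","type":"string"},
    |  {"name":"new_column","type":["null","string"],"default":null}
    |]}""".stripMargin

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder topic
  .load()

// value is the binary Avro payload; from_avro decodes it with the supplied schema
val decoded = kafkaDf
  .select(from_avro(col("value"), avroSchema).as("event"))
  .select("event.*")
// decoded is a streaming DataFrame and can then be written out with writeStream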

📑 CSV — Simple but Fragile for Big Data

What It Is

Plain text, delimiter-separated, no built-in schema or metadata. Good for humans, risky for pipelines at scale.

When to Use

Small config tables, vendor dumps, quick exports for manual checks. Not recommended for high-volume core datasets.

Read / Write

import org.apache.spark.sql.types._

val mySchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("new_col", StringType)
))

val df = spark.read
  .schema(mySchema)
  .option("header", "true")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write
  .option("header", "true")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

CSV files don’t store a schema — so if columns change, inferSchema often guesses wrong. For production jobs, always use a predefined StructType.
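
Beyond pinning the schema, it often helps to quarantine rows that do not match it instead of failing the job. A minimal sketch using the standard PERMISSIVE mode and a corrupt-record column (column names are illustrative):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Extra string column to capture rows that don't fit the declared schema
val csvSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("new_col", StringType),
  StructField("_corrupt_record", StringType)
))

val df = spark.read
  .schema(csvSchema)
  .option("header", "true")
  .option("mode", "PERMISSIVE") // keep bad rows instead of failing the job
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/base/")

// Spark requires a cache before querying the corrupt column on its own
df.cache()
val badRows = df.filter(col("_corrupt_record").isNotNull)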

🗝️ SequenceFile — For Legacy Hadoop Only

What It Is

Hadoop’s classic binary key-value format. Still works but mostly historical in modern Spark setups.

When to Use

Bridging old MapReduce jobs, or when migrating legacy key-value logs from HDFS.

Read / Write

// Keys and values come back as Strings (Hadoop Text writables under the hood)
val rdd = spark.sparkContext.sequenceFile[String, String](
  "abfss://container@storageaccount.dfs.core.windows.net/base/")

rdd.saveAsSequenceFile("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

No schema. Pure key-value pairs. No merging or evolution — use modern formats instead for new systems.
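
When migrating, the usual move is a one-off conversion of the legacy key-value pairs into Parquet. A minimal sketch, with illustrative column names:

import spark.implicits._

// Read the legacy pairs, name the columns, and land them as Parquet
val legacy = spark.sparkContext.sequenceFile[String, String](
  "abfss://container@storageaccount.dfs.core.windows.net/base/")

legacy.toDF("key", "value")
  .write
  .mode("overwrite")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/output/")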

⚙️ How to Handle Schema Changes (Quick Guide)

Format        | Best Practice
Parquet       | Use mergeSchema=true for evolving partitions
Avro          | Define an explicit StructType; use a Schema Registry for Kafka
CSV           | Always define a StructType; never rely on inferSchema
SequenceFile  | No schema merging, key-value only

✅ Generic Reader

val df = spark.read
  .format("parquet") // or "avro", "csv"
  .option("mergeSchema", "true") // Parquet only
  .schema(mySchema) // Avro & CSV: always define
  .load("abfss://container@storageaccount.dfs.core.windows.net/base/")

🎓 Final Takeaway

Parquet: Best for batch analytics — merge schema when columns evolve.
Avro: Best for streaming and cross-language exchange — version carefully.
CSV: Fine for humans — explicit schema is non-negotiable.
SequenceFile: Use only for legacy MapReduce interop.

