Managing Multiple File Formats in Spark

📌 Why File Format Choice Matters

In production Spark + ADLS pipelines, your file format directly impacts performance, cost, schema drift handling, and cross-system compatibility. Here’s a real-world, engineer-level guide for Parquet, Avro, CSV, and SequenceFile — with practical tips and robust Spark code.

📂 Parquet — The Workhorse for Analytics

What It Is

Columnar storage: compressed, splittable, and optimized for heavy scans over a subset of columns. It is Spark's default DataFrame format and the de facto standard for structured big data lakes.

When to Use

Large fact tables, OLAP workloads, feature stores, BI dashboards — perfect when data is mostly append-only and read-heavy.

Read / Write

val df = spark.read
  .option("mergeSchema", "true") // ensures new columns are included
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write
  .mode("overwrite")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

By default, Spark infers the schema from a single file's footer, so new columns that appear only in other files or partitions can be silently dropped. Use mergeSchema=true (or set spark.sql.parquet.mergeSchema at the session level) for folders where columns evolve over time; the merge reads every footer, so expect a small extra cost on large directories.
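
A minimal sketch of what this looks like in practice, assuming illustrative date-partitioned subfolders under the same base path:

import spark.implicits._

// Older partition written with only (id, name)
Seq((1L, "a")).toDF("id", "name")
  .write.mode("append")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/dt=2024-01-01/")

// Newer partition adds new_column
Seq((2L, "b", "x")).toDF("id", "name", "new_column")
  .write.mode("append")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/dt=2024-01-02/")

// Without mergeSchema the read may expose only one footer's columns;
// with it, the union appears and new_column is null for the older rows
spark.read
  .option("mergeSchema", "true")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/")
  .printSchema()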

🧬 Avro — Reliable for Streaming & Kafka

What It Is

Row-based binary format with schema embedded in each file. Designed for high-efficiency record serialization across systems and languages.

When to Use

Kafka pipelines, real-time streaming ETL, cross-platform data exchange between Spark, Java, Flink, Python, etc.

Read / Write

import org.apache.spark.sql.types._

val mySchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("new_column", StringType)
))

val df = spark.read
  .format("avro")
  .schema(mySchema)
  .load("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write.format("avro")
  .save("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

Avro supports backward/forward compatibility — but Spark won’t auto-merge multiple Avro versions in one read. Use an explicit StructType or a Schema Registry if working with Kafka topics.
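
If the records arrive through Kafka, a common pattern is to decode the binary payload with from_avro from the spark-avro module (Spark 3.x). A minimal sketch, assuming a plain Avro payload (no Confluent wire-format header) and placeholder broker, topic, and schema:

import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

// Writer's schema as JSON; in practice this usually comes from a Schema Registry
val avroSchema =
  """{"type":"record","name":"Event","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"name","type":"string"},
    |  {"name":"new_column","type":["null","string"],"default":null}
    |]}""".stripMargin

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder topic
  .load()

// value is the binary Avro payload; from_avro decodes it with the supplied schema
val decoded = kafkaDf
  .select(from_avro(col("value"), avroSchema).as("event"))
  .select("event.*")
// decoded is a streaming DataFrame and can then be written out with writeStream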

📑 CSV — Simple but Fragile for Big Data

What It Is

Plain text, delimiter-separated, no built-in schema or metadata. Good for humans, risky for pipelines at scale.

When to Use

Small config tables, vendor dumps, quick exports for manual checks. Not recommended for high-volume core datasets.

Read / Write

import org.apache.spark.sql.types._

val mySchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("new_col", StringType)
))

val df = spark.read
  .schema(mySchema)
  .option("header", "true")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write
  .option("header", "true")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

CSV files don’t store a schema — so if columns change, inferSchema often guesses wrong. For production jobs, always use a predefined StructType.
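
Beyond pinning the schema, it often helps to quarantine rows that do not match it instead of failing the job. A minimal sketch using the standard PERMISSIVE mode and a corrupt-record column (column names are illustrative):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Extra string column to capture rows that don't fit the declared schema
val csvSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("new_col", StringType),
  StructField("_corrupt_record", StringType)
))

val df = spark.read
  .schema(csvSchema)
  .option("header", "true")
  .option("mode", "PERMISSIVE") // keep bad rows instead of failing the job
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/base/")

// Spark requires a cache before querying the corrupt column on its own
df.cache()
val badRows = df.filter(col("_corrupt_record").isNotNull)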

🗝️ SequenceFile — For Legacy Hadoop Only

What It Is

Hadoop’s classic binary key-value format. Still works but mostly historical in modern Spark setups.

When to Use

Bridging old MapReduce jobs, or when migrating legacy key-value logs from HDFS.

Read / Write

// Keys and values come back as Strings (Hadoop Text writables under the hood)
val rdd = spark.sparkContext.sequenceFile[String, String](
  "abfss://container@storageaccount.dfs.core.windows.net/base/")

rdd.saveAsSequenceFile("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

No schema. Pure key-value pairs. No merging or evolution — use modern formats instead for new systems.
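
When migrating, the usual move is a one-off conversion of the legacy key-value pairs into Parquet. A minimal sketch, with illustrative column names:

import spark.implicits._

// Read the legacy pairs, name the columns, and land them as Parquet
val legacy = spark.sparkContext.sequenceFile[String, String](
  "abfss://container@storageaccount.dfs.core.windows.net/base/")

legacy.toDF("key", "value")
  .write
  .mode("overwrite")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/output/")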

⚙️ How to Handle Schema Changes (Quick Guide)

Format        | Best Practice
Parquet       | Use mergeSchema=true for evolving partitions
Avro          | Define an explicit StructType; use a Schema Registry for Kafka
CSV           | Always define a StructType; never rely on inferSchema
SequenceFile  | No schema merging, key-value only

✅ Generic Reader

val df = spark.read
  .format("parquet") // or "avro", "csv"
  .option("mergeSchema", "true") // Parquet only
  .schema(mySchema) // Avro & CSV: always define
  .load("abfss://container@storageaccount.dfs.core.windows.net/base/")

🎓 Final Takeaway

Parquet: Best for batch analytics — merge schema when columns evolve.
Avro: Best for streaming and cross-language exchange — version carefully.
CSV: Fine for humans — explicit schema is non-negotiable.
SequenceFile: Use only for legacy MapReduce interop.

