Why File Format Choice Matters
In production Spark + ADLS pipelines, your file format directly impacts performance, cost, schema drift handling, and cross-system compatibility. Here’s a real-world, engineer-level guide for Parquet, Avro, CSV, and SequenceFile — with practical tips and robust Spark code.
Parquet — The Workhorse for Analytics
What It Is
Columnar storage — compressed, optimized for heavy scans. Spark’s standard for structured big data lakes.
When to Use
Large fact tables, OLAP workloads, feature stores, BI dashboards — perfect when data is mostly append-only and read-heavy.
Read / Write
val df = spark.read
  .option("mergeSchema", "true") // ensures new columns across partitions are included
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write
  .mode("overwrite")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/output/")
Schema Evolution Tip
By default, Spark derives the schema from a single file’s footer, so new columns that exist only in other partitions may be silently ignored. Use mergeSchema=true when reading folders whose columns evolve over time.
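To see what that buys you, here is a minimal sketch assuming two hypothetical batches land under the same base path and the second one carries an extra column:

import spark.implicits._

// Hypothetical layout: two batches under one base path, with schema drift between them.
val base = "abfss://container@storageaccount.dfs.core.windows.net/base"

Seq((1L, "alice")).toDF("id", "name")
  .write.mode("overwrite").parquet(s"$base/batch=1")

Seq((2L, "bob", "extra")).toDF("id", "name", "new_column")
  .write.mode("overwrite").parquet(s"$base/batch=2")

// Without mergeSchema, whichever footer Spark samples decides the schema;
// with it, you get the union of columns and nulls where a batch lacks a column.
val merged = spark.read.option("mergeSchema", "true").parquet(base)
merged.printSchema() // id, name, new_column, plus the discovered batch partition column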
Avro — Reliable for Streaming & Kafka
What It Is
Row-based binary format with schema embedded in each file. Designed for high-efficiency record serialization across systems and languages.
When to Use
Kafka pipelines, real-time streaming ETL, cross-platform data exchange between Spark, Java, Flink, Python, etc.
Read / Write
import org.apache.spark.sql.types._

// Reading and writing Avro requires the external spark-avro module on the classpath.
val mySchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("new_column", StringType)
))

val df = spark.read
  .format("avro")
  .schema(mySchema)
  .load("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write.format("avro")
  .save("abfss://container@storageaccount.dfs.core.windows.net/output/")
Schema Evolution Tip
Avro supports backward/forward compatibility, but Spark won’t auto-merge multiple Avro versions in one read. Use an explicit StructType or a Schema Registry when working with Kafka topics.
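For the Kafka path, here is a minimal sketch assuming plain Avro-encoded message values (no Confluent wire-format prefix) and a writer schema you have already fetched from your registry; the broker address and topic name are placeholders:

import spark.implicits._
import org.apache.spark.sql.avro.functions.from_avro

// Writer schema as JSON; in practice, pull this from your Schema Registry out of band.
val avroSchemaJson =
  """{"type":"record","name":"Event","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"name","type":"string"}
    |]}""".stripMargin

// Requires the spark-sql-kafka and spark-avro modules on the classpath.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
  .option("subscribe", "events")                    // placeholder topic
  .load()
  .select(from_avro($"value", avroSchemaJson).as("event"))
  .select("event.*")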
CSV — Simple but Fragile for Big Data
What It Is
Plain text, delimiter-separated, no built-in schema or metadata. Good for humans, risky for pipelines at scale.
When to Use
Small config tables, vendor dumps, quick exports for manual checks. Not recommended for high-volume core datasets.
Read / Write
import org.apache.spark.sql.types._

val mySchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("new_col", StringType)
))

val df = spark.read
  .schema(mySchema)
  .option("header", "true")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write
  .option("header", "true")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/output/")
Schema Evolution Tip
CSV files don’t store a schema, so when columns change, inferSchema often guesses wrong. For production jobs, always use a predefined StructType.
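One way to surface drift instead of silently mis-parsing it is CSV’s PERMISSIVE mode with a corrupt-record column; a sketch of that variant (column names are illustrative) looks like this:

import spark.implicits._
import org.apache.spark.sql.types._

// Add a corrupt-record column to the explicit schema so malformed rows are captured, not dropped.
val strictSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("new_col", StringType),
  StructField("_corrupt_record", StringType)
))

val df = spark.read
  .schema(strictSchema)
  .option("header", "true")
  .option("mode", "PERMISSIVE") // keep bad rows instead of failing the whole job
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/base/")

// Rows that did not match the declared schema carry the raw line here.
val badRows = df.filter($"_corrupt_record".isNotNull)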
SequenceFile — For Legacy Hadoop Only
What It Is
Hadoop’s classic binary key-value format. Still works but mostly historical in modern Spark setups.
When to Use
Bridging old MapReduce jobs, or when migrating legacy key-value logs from HDFS.
Read / Write
// Text keys and values are converted from Hadoop Writables via Spark's implicit converters.
val rdd = spark.sparkContext.sequenceFile[String, String](
  "abfss://container@storageaccount.dfs.core.windows.net/base/")

rdd.saveAsSequenceFile("abfss://container@storageaccount.dfs.core.windows.net/output/")
Schema Evolution Tip
No schema. Pure key-value pairs. No merging or evolution — use modern formats instead for new systems.
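If you are migrating such data, a minimal sketch of the usual one-way move (paths and column names are placeholders) is to lift the pairs into a DataFrame and land them as Parquet:

import spark.implicits._

// Read legacy key-value pairs, then persist them in a modern, schema-carrying format.
val legacy = spark.sparkContext.sequenceFile[String, String](
  "abfss://container@storageaccount.dfs.core.windows.net/legacy/")

legacy.toDF("key", "value")
  .write
  .mode("overwrite")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/migrated/")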
How to Handle Schema Changes (Quick Guide)
Format | Best Practice
---|---
Parquet | Use mergeSchema=true for evolving partitions
Avro | Define an explicit StructType; use a Schema Registry for Kafka
CSV | Always define a StructType; never rely on inferSchema
SequenceFile | No schema merging; key-value only
Generic Reader
val df = spark.read
  .format("parquet")             // or "avro", "csv"
  .option("mergeSchema", "true") // Parquet only
  .schema(mySchema)              // Avro & CSV: always define an explicit schema
  .load("abfss://container@storageaccount.dfs.core.windows.net/base/")
Final Takeaway
Parquet: Best for batch analytics — merge schema when columns evolve.
Avro: Best for streaming and cross-language exchange — version carefully.
CSV: Fine for humans — explicit schema is non-negotiable.
SequenceFile: Use only for legacy MapReduce interop.