
Spark Installation

Apache Spark Installation Guide for Windows & macOS

A. Prerequisites (Both Windows & macOS)

  • Java JDK 11+
    Download and install from OpenJDK 11 or Oracle JDK.
    java -version
  • Python 3.8+ (Optional for PySpark)
    Install from python.org or via your package manager.
    python3 --version
  • (Windows only) Hadoop winutils.exe
    Download matching version (e.g. Hadoop 3.3.1) from GitHub and place winutils.exe in C:\hadoop\bin.

B. Installation on Windows

  1. Create Folders
    
    C:\spark
    C:\hadoop\bin  ← place winutils.exe here
        
  2. Download & Unpack Spark
    1. Go to spark.apache.org/downloads.html
    2. Select “Spark 3.5.0, pre-built for Apache Hadoop 3.3 and later” and extract it into C:\spark (e.g. C:\spark\spark-3.5.0-bin-hadoop3).
  3. Configure Environment Variables
    In **System → Advanced → Environment Variables** add:
    
    HADOOP_HOME = C:\hadoop
    SPARK_HOME  = C:\spark\spark-3.5.0-bin-hadoop3
    JAVA_HOME   = C:\Program Files\Java\jdk-11.x.x
        
    Then prepend to **Path**:
    
    %HADOOP_HOME%\bin
    %SPARK_HOME%\bin
        
  4. Verify Spark Shell
    Open a new PowerShell or CMD and run the command below (a quick in-shell sanity check follows this list):
    spark-shell
  5. Optional: PySpark
    pyspark
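Once spark-shell starts (step 4), you can confirm that the environment variables from step 3 are visible to Spark with a minimal in-shell check (a sketch; the printed paths should match your own install):

// Run inside spark-shell
println(sys.env.get("SPARK_HOME"))   // expected: Some(C:\spark\spark-3.5.0-bin-hadoop3)
println(sys.env.get("HADOOP_HOME"))  // expected: Some(C:\hadoop)
println(spark.version)               // expected: 3.5.0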

C. Installation on macOS

  1. Install Java
    brew install openjdk@11
    echo 'export JAVA_HOME="/usr/local/opt/openjdk@11"' >> ~/.zshrc
    echo 'export PATH="$JAVA_HOME/bin:$PATH"' >> ~/.zshrc
    source ~/.zshrc
    (On Apple Silicon Macs, Homebrew installs under /opt/homebrew, so use /opt/homebrew/opt/openjdk@11 as JAVA_HOME.)
        
  2. Install Scala (Optional)
    brew install scala
  3. Download & Unpack Spark
    
    curl -O https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
    tar xzf spark-3.5.0-bin-hadoop3.tgz
    mv spark-3.5.0-bin-hadoop3 ~/spark
        
  4. Configure Environment Variables
    Add to ~/.zshrc:
    
    export SPARK_HOME=~/spark
    export PATH="$SPARK_HOME/bin:$PATH"
        
    Then run:
    source ~/.zshrc
  5. Verify Spark Shell & PySpark
    spark-shell
    pyspark

D. Quick Smoke Test

In either OS, run one of these in the shell to confirm:


// Scala (spark-shell)
spark.range(1, 1000000).selectExpr("sum(id)").show()

# Python (pyspark)
df = spark.range(1, 1000000)
df.selectExpr("sum(id)").show()

If you see the sum output without errors, your Spark setup is complete!


Spark GraphX & GraphFrames

1. Overview

Apache Spark offers two powerful libraries for graph processing over distributed data:
GraphX and GraphFrames. Both let you model your data
as vertices (nodes) and edges (relationships), but they differ in APIs
and capabilities.

Figure 1: GraphX represents graphs via RDD-backed vertex and edge tables.

2. Usage & Use Cases

2.1 GraphX

  • API: Scala/Java only, built on RDDs with Graph objects.
  • Common Use Cases:
    • PageRank on web graphs
    • Connected components for community detection
    • Shortest-path algorithms in transportation networks
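For reference, a minimal GraphX PageRank sketch in Scala (the vertices and edges are illustrative; an existing SparkSession named spark is assumed):

// Build a tiny property graph from RDDs and rank its vertices
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

val users: RDD[(Long, String)] = spark.sparkContext.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Cathy")))

val follows: RDD[Edge[String]] = spark.sparkContext.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(users, follows)

// Run PageRank until it converges within a tolerance of 0.0001
val ranks = graph.pageRank(0.0001).vertices

// Join the ranks back to the user names and print them
ranks.join(users).collect().foreach { case (_, (rank, name)) =>
  println(f"$name%-6s $rank%.4f")
}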

2.2 GraphFrames

  • API: Python, Scala, and SQL support over DataFrames.
  • Common Use Cases:
    • Motif finding to detect fraud patterns
    • Label propagation for clustering social networks
    • SQL-style shortest-path queries

Figure 2: GraphFrames builds on DataFrames for SQL-like graph queries.

3. Example: Simple PageRank with GraphFrames

Below is a PySpark example that constructs a small graph, runs PageRank, and shows the top-ranked vertices.


# spark-submit --packages graphframes:graphframes:0.8.1-spark3.3-s_2.12

from pyspark.sql import SparkSession
from graphframes import GraphFrame

# Initialize Spark
spark = SparkSession.builder \
    .appName("GraphFramesPageRank") \
    .getOrCreate()

# Define vertices and edges
v = spark.createDataFrame([
    ("A", "Alice"),
    ("B", "Bob"),
    ("C", "Cathy"),
    ("D", "David")
], ["id", "name"])

e = spark.createDataFrame([
    ("A", "B"),
    ("B", "C"),
    ("C", "A"),
    ("A", "D")
], ["src", "dst"])

# Create GraphFrame and run PageRank
g = GraphFrame(v, e)
results = g.pageRank(resetProbability=0.15, maxIter=10)

# Display the top PageRank scores
results.vertices.orderBy("pagerank", ascending=False).show()
    
Figure 3: Example motif query detecting “A follows B and B follows C.”

Spark Streaming

Spark Streaming is a scalable, high-throughput, fault-tolerant stream-processing system that supports both batch and streaming workloads. It is used to process real-time data from sources such as file system folders, TCP sockets, S3, Kafka, Flume, Twitter, and Kinesis, to name a few. The processed data can be pushed to databases, Kafka, live dashboards, etc.
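For example, a minimal Structured Streaming word count in Scala (a sketch assuming an existing SparkSession named spark and a text source on TCP port 9999, e.g. started with nc -lk 9999):

// Read lines from the socket as an unbounded table
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words and count occurrences of each word
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Continuously print the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()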
				
					

Managing Multiple File Formats in Spark

📌 Why File Format Choice Matters

In production Spark + ADLS pipelines, your file format directly impacts performance, cost, schema drift handling, and cross-system compatibility. Here’s a real-world, engineer-level guide for Parquet, Avro, CSV, and SequenceFile — with practical tips and robust Spark code.

📂 Parquet — The Workhorse for Analytics

What It Is

Columnar storage — compressed, optimized for heavy scans. Spark’s standard for structured big data lakes.

When to Use

Large fact tables, OLAP workloads, feature stores, BI dashboards — perfect when data is mostly append-only and read-heavy.

Read / Write

val df = spark.read
  .option("mergeSchema", "true") // ensures new columns are included
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write
  .mode("overwrite")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

By default, Spark picks one file’s footer — new columns in other partitions may be ignored. Use mergeSchema=true for folders where columns evolve over time.
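To make this concrete, here is a hedged sketch (reusing the placeholder abfss path from above; import spark.implicits._ is assumed for toDF) of two appends with diverging schemas followed by a merged read:

import spark.implicits._

// First load: two columns
Seq((1L, "a")).toDF("id", "name")
  .write.mode("append")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/")

// A later load adds a third column
Seq((2L, "b", 2024)).toDF("id", "name", "year")
  .write.mode("append")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/")

// Without mergeSchema, `year` may be silently dropped; with it, the schemas are unioned
spark.read
  .option("mergeSchema", "true")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/")
  .printSchema()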

🧬 Avro — Reliable for Streaming & Kafka

What It Is

Row-based binary format with schema embedded in each file. Designed for high-efficiency record serialization across systems and languages.

When to Use

Kafka pipelines, real-time streaming ETL, cross-platform data exchange between Spark, Java, Flink, Python, etc.

Read / Write

import org.apache.spark.sql.types._

val mySchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("new_column", StringType)
))

val df = spark.read
  .format("avro")
  .schema(mySchema)
  .load("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write.format("avro")
  .save("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

Avro supports backward/forward compatibility — but Spark won’t auto-merge multiple Avro versions in one read. Use an explicit StructType or a Schema Registry if working with Kafka topics.
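For a Kafka source, a hedged sketch of decoding Avro values with an explicit schema (assumes the spark-avro and spark-sql-kafka packages on the classpath and a hypothetical topic named events carrying plain Avro payloads; Confluent Schema Registry framing needs additional handling):

import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

// Explicit Avro schema (JSON) for the records on the topic
val eventSchema =
  """{"type": "record", "name": "Event", "fields": [
    |  {"name": "id", "type": "long"},
    |  {"name": "name", "type": ["null", "string"], "default": null}
    |]}""".stripMargin

val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker address
  .option("subscribe", "events")                      // hypothetical topic
  .load()

// Decode the binary `value` column with the explicit schema
val events = kafkaDF
  .select(from_avro(col("value"), eventSchema).as("event"))
  .select("event.*")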

📑 CSV — Simple but Fragile for Big Data

What It Is

Plain text, delimiter-separated, no built-in schema or metadata. Good for humans, risky for pipelines at scale.

When to Use

Small config tables, vendor dumps, quick exports for manual checks. Not recommended for high-volume core datasets.

Read / Write

import org.apache.spark.sql.types._

val mySchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("new_col", StringType)
))

val df = spark.read
  .schema(mySchema)
  .option("header", "true")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write
  .option("header", "true")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

CSV files don’t store a schema — so if columns change, inferSchema often guesses wrong. For production jobs, always use a predefined StructType.
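As a hedged companion to that advice, the sketch below keeps a strict schema and also captures malformed rows instead of failing the job (the _corrupt_record column name is just a convention):

import org.apache.spark.sql.types._

val strictSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("_corrupt_record", StringType)   // receives raw rows that fail to parse
))

val df = spark.read
  .schema(strictSchema)
  .option("header", "true")
  .option("mode", "PERMISSIVE")                           // keep bad rows rather than failing
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/base/")

// Malformed rows show up with nulls in the typed columns and the raw line in _corrupt_record
df.show(truncate = false)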

🗝️ SequenceFile — For Legacy Hadoop Only

What It Is

Hadoop’s classic binary key-value format. Still works but mostly historical in modern Spark setups.

When to Use

Bridging old MapReduce jobs, or when migrating legacy key-value logs from HDFS.

Read / Write

val rdd = spark.sparkContext.sequenceFile[String, String](
  "abfss://container@storageaccount.dfs.core.windows.net/base/")

rdd.saveAsSequenceFile("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

No schema. Pure key-value pairs. No merging or evolution — use modern formats instead for new systems.
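If you do need to carry such data forward, here is a hedged one-off migration sketch (the /migrated/ output path is illustrative):

import spark.implicits._

// Read the legacy key-value pairs, lift them into a DataFrame, and land them as Parquet
val kv = spark.sparkContext.sequenceFile[String, String](
  "abfss://container@storageaccount.dfs.core.windows.net/base/")

kv.toDF("key", "value")
  .write.mode("overwrite")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/migrated/")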

⚙️ How to Handle Schema Changes (Quick Guide)

Format         Best Practice
Parquet        Use mergeSchema=true for evolving partitions
Avro           Define a StructType; use a Schema Registry for Kafka
CSV            Always define a StructType; never rely on inferSchema
SequenceFile   No schema merging; key-value pairs only

✅ Generic Reader

val df = spark.read
  .format("parquet") // or "avro", "csv"
  .option("mergeSchema", "true") // Parquet only
  .schema(mySchema) // Avro & CSV: always define
  .load("abfss://container@storageaccount.dfs.core.windows.net/base/")

🎓 Final Takeaway

Parquet: Best for batch analytics — merge schema when columns evolve.
Avro: Best for streaming and cross-language exchange — version carefully.
CSV: Fine for humans — explicit schema is non-negotiable.
SequenceFile: Use only for legacy MapReduce interop.



SQL Spark

Spark SQL is one of the most used Spark modules and is designed for processing structured, columnar data. Once you have created a DataFrame, you can interact with the data by using SQL syntax. In other words, Spark SQL brings native raw SQL queries to Spark, meaning you can run traditional ANSI SQL on Spark DataFrames. In a later section of this Apache Spark tutorial, you will learn in detail how to use SQL select, where, group by, join, union, etc.

				
					val groupDF = spark.sql("SELECT gender, count(*) from PERSON_DATA group by gender")
groupDF.show()
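The same temporary view supports the other clauses mentioned above. A hedged sketch (the DEPT_DATA view and the dept_id/dept_name columns are illustrative and assume you registered such a view yourself):

// Filter rows with a WHERE clause
spark.sql("SELECT * FROM PERSON_DATA WHERE gender = 'M'").show()

// Join against a second (hypothetical) view registered the same way
val joined = spark.sql(
  """SELECT p.*, d.dept_name
    |FROM PERSON_DATA p
    |JOIN DEPT_DATA d ON p.dept_id = d.dept_id""".stripMargin)
joined.show()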
				
			

DataFrame Spark

DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

				
					
// Create SparkSession
import org.apache.spark.sql.SparkSession
val spark:SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()   
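With the session in place, a minimal sketch of constructing a DataFrame from a local collection (the column names and values are illustrative):

import spark.implicits._

// Build a DataFrame from an in-memory Seq with named columns
val df = Seq(
  ("James", "M", 60000),
  ("Anna",  "F", 72000)
).toDF("name", "gender", "salary")

df.printSchema()
df.show()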
				
			

RDD Spark

RDD (Resilient Distributed Dataset) is the fundamental data structure and primary data abstraction of Apache Spark and Spark Core. RDDs are fault-tolerant, immutable distributed collections of objects, which means that once you create an RDD you cannot change it. Each dataset in an RDD is divided into logical partitions, which can be computed on different nodes of the cluster.
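For example, a minimal sketch of creating an RDD and applying transformations and actions (assumes an existing SparkSession named spark):

val nums = spark.sparkContext.parallelize(1 to 10)

val squares = nums.map(n => n * n)            // transformation: lazy, nothing runs yet
val evenSquares = squares.filter(_ % 2 == 0)  // transformation: still lazy

println(evenSquares.count())                    // action: triggers the computation
println(evenSquares.collect().mkString(", "))   // action: brings the results to the driver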

				
					
// `df` is a DataFrame (for example, one created from an RDD via toDF); register it as a SQL view
df.createOrReplaceTempView("PERSON_DATA")
val df2 = spark.sql("SELECT * from PERSON_DATA")
df2.printSchema()
df2.show()
				
			
				
					

Apache Spark Architecture

Spark works in a master-slave architecture where the master is called the “Driver” and the slaves are called “Workers”. When you run a Spark application, the Spark Driver creates a context that is the entry point to your application; all operations (transformations and actions) are executed on the worker nodes, and the resources are managed by the cluster manager.

				
					//Create RDD from parallelize    
val dataSeq = Seq(("Java", 20000), ("Python", 100000), ("Scala", 3000))   
val rdd=spark.sparkContext.parallelize(dataSeq)
				
			
				
					
//Create RDD from external Data source
val rdd2 = spark.sparkContext.textFile("/path/textFile.txt")
				
			

Features & Advantages of Apache Spark

  • In-memory computation
  • Distributed processing using parallelize
  • Can be used with many cluster managers (Spark Standalone, YARN, Mesos, etc.)
  • Fault-tolerant
  • Immutable
  • Lazy evaluation
  • Cache & persistence
  • In-built optimization when using DataFrames
  • Supports ANSI SQL
  • Spark is a general-purpose, in-memory, fault-tolerant, distributed processing engine that allows you to process data efficiently in a distributed fashion.
  • Applications running on Spark can be up to 100x faster than traditional MapReduce-based systems.
  • You will get great benefits from using Spark for data ingestion pipelines.
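To illustrate lazy evaluation and caching from the list above, a minimal sketch (assumes an existing SparkSession named spark):

val rdd = spark.sparkContext.parallelize(1 to 1000000)

val evens = rdd.filter(_ % 2 == 0)   // lazy: no work happens yet
evens.cache()                        // mark the result for in-memory reuse

println(evens.count())               // first action computes and caches the RDD
println(evens.sum())                 // second action reuses the cached data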
				
For example, event logging for the Spark History Server is enabled in conf/spark-defaults.conf:

spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/path
				
			
				
Then start the history server so completed applications can be browsed in its web UI:

$SPARK_HOME/sbin/start-history-server.sh
				
			

What is Apache Spark

Apache Spark Tutorial – Apache Spark is an open-source analytical processing engine for large-scale, powerful distributed data processing and machine learning applications. Spark was originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation. In February 2014, Spark became a Top-Level Apache Project; contributions from thousands of engineers have made Spark one of the most active open-source projects at Apache.

				
					SPARK_HOME  = C:\apps\spark-3.5.0-bin-hadoop3
HADOOP_HOME = C:\apps\spark-3.5.0-bin-hadoop3
PATH=%PATH%;C:\apps\spark-3.5.0-bin-hadoop3\bin
				
			
