
Spark Installation

Apache Spark Installation Guide for Windows & macOS

A. Prerequisites (Both Windows & macOS)

  • Java JDK 11+
    Download and install from OpenJDK 11 or Oracle JDK.
    java -version
  • Python 3.8+ (Optional for PySpark)
    Install from python.org or via your package manager.
    python3 --version
  • (Windows only) Hadoop winutils.exe
    Download matching version (e.g. Hadoop 3.3.1) from GitHub and place winutils.exe in C:\hadoop\bin.

B. Installation on Windows

  1. Create Folders
    
    C:\spark
    C:\hadoop\bin  ← place winutils.exe here
        
  2. Download & Unpack Spark
    1. Go to spark.apache.org/downloads.html
    2. Select “Spark 3.5.0, pre-built for Apache Hadoop 3.3 and later” and extract it into C:\spark (e.g. C:\spark\spark-3.5.0-bin-hadoop3).
  3. Configure Environment Variables
    In **System → Advanced → Environment Variables** add:
    
    HADOOP_HOME = C:\hadoop
    SPARK_HOME  = C:\spark\spark-3.5.0-bin-hadoop3
    JAVA_HOME   = C:\Program Files\Java\jdk-11.x.x
        
    Then prepend to **Path**:
    
    %HADOOP_HOME%\bin
    %SPARK_HOME%\bin
        
  4. Verify Spark Shell
    Open a new PowerShell or CMD and run the command below (a quick in-shell sanity check follows this list):
    spark-shell
  5. Optional: PySpark
    pyspark
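Once spark-shell starts (step 4), you can confirm that the environment variables from step 3 are visible to Spark with a minimal in-shell check (a sketch; the printed paths should match your own install):

// Run inside spark-shell
println(sys.env.get("SPARK_HOME"))   // expected: Some(C:\spark\spark-3.5.0-bin-hadoop3)
println(sys.env.get("HADOOP_HOME"))  // expected: Some(C:\hadoop)
println(spark.version)               // expected: 3.5.0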

C. Installation on macOS

  1. Install Java
    brew install openjdk@11
    echo 'export JAVA_HOME="/usr/local/opt/openjdk@11"' >> ~/.zshrc
    echo 'export PATH="$JAVA_HOME/bin:$PATH"' >> ~/.zshrc
    source ~/.zshrc
    (On Apple Silicon Macs, Homebrew installs under /opt/homebrew, so use /opt/homebrew/opt/openjdk@11 as JAVA_HOME.)
        
  2. Install Scala (Optional)
    brew install scala
  3. Download & Unpack Spark
    
    curl -O https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
    tar xzf spark-3.5.0-bin-hadoop3.tgz
    mv spark-3.5.0-bin-hadoop3 ~/spark
        
  4. Configure Environment Variables
    Add to ~/.zshrc:
    
    export SPARK_HOME=~/spark
    export PATH="$SPARK_HOME/bin:$PATH"
        
    Then run:
    source ~/.zshrc
  5. Verify Spark Shell & PySpark
    spark-shell
    pyspark

D. Quick Smoke Test

In either OS, run one of these in the shell to confirm:


// Scala (spark-shell)
spark.range(1, 1000000).selectExpr("sum(id)").show()

# Python (pyspark)
df = spark.range(1, 1000000)
df.selectExpr("sum(id)").show()

If you see the sum output without errors, your Spark setup is complete!


Spark GraphX & GraphFrames

1. Overview

Apache Spark offers two powerful libraries for graph processing over distributed data:
GraphX and GraphFrames. Both let you model your data
as vertices (nodes) and edges (relationships), but they differ in APIs
and capabilities.

Figure 1: GraphX represents graphs via RDD-backed vertex and edge tables.

2. Usage & Use Cases

2.1 GraphX

  • API: Scala/Java only, built on RDDs with Graph objects.
  • Common Use Cases:
    • PageRank on web graphs
    • Connected components for community detection
    • Shortest-path algorithms in transportation networks
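For reference, a minimal GraphX PageRank sketch in Scala (the vertices and edges are illustrative; an existing SparkSession named spark is assumed):

// Build a tiny property graph from RDDs and rank its vertices
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

val users: RDD[(Long, String)] = spark.sparkContext.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Cathy")))

val follows: RDD[Edge[String]] = spark.sparkContext.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(users, follows)

// Run PageRank until it converges within a tolerance of 0.0001
val ranks = graph.pageRank(0.0001).vertices

// Join the ranks back to the user names and print them
ranks.join(users).collect().foreach { case (_, (rank, name)) =>
  println(f"$name%-6s $rank%.4f")
}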

2.2 GraphFrames

  • API: Python, Scala, and SQL support over DataFrames.
  • Common Use Cases:
    • Motif finding to detect fraud patterns
    • Label propagation for clustering social networks
    • SQL-style shortest-path queries

Figure 2: GraphFrames builds on DataFrames for SQL-like graph queries.

3. Example: Simple PageRank with GraphFrames

Below is a PySpark example that constructs a small graph, runs PageRank, and shows the top-ranked vertices.


# spark-submit --packages graphframes:graphframes:0.8.1-spark3.3-s_2.12

from pyspark.sql import SparkSession
from graphframes import GraphFrame

# Initialize Spark
spark = SparkSession.builder \
    .appName("GraphFramesPageRank") \
    .getOrCreate()

# Define vertices and edges
v = spark.createDataFrame([
    ("A", "Alice"),
    ("B", "Bob"),
    ("C", "Cathy"),
    ("D", "David")
], ["id", "name"])

e = spark.createDataFrame([
    ("A", "B"),
    ("B", "C"),
    ("C", "A"),
    ("A", "D")
], ["src", "dst"])

# Create GraphFrame and run PageRank
g = GraphFrame(v, e)
results = g.pageRank(resetProbability=0.15, maxIter=10)

# Display the top PageRank scores
results.vertices.orderBy("pagerank", ascending=False).show()
    
Figure 3: Example motif query detecting “A follows B and B follows C.”

Spark Streaming

Spark Streaming is a scalable, high-throughput, fault-tolerant stream-processing system that supports both batch and streaming workloads. It is used to process real-time data from sources such as file system folders, TCP sockets, S3, Kafka, Flume, Twitter, and Kinesis, to name a few. The processed data can be pushed to databases, Kafka, live dashboards, etc.
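For example, a minimal Structured Streaming word count in Scala (a sketch assuming an existing SparkSession named spark and a text source on TCP port 9999, e.g. started with nc -lk 9999):

// Read lines from the socket as an unbounded table
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words and count occurrences of each word
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Continuously print the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()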
				
					

Managing Multiple File Formats in Spark

📌 Why File Format Choice Matters

In production Spark + ADLS pipelines, your file format directly impacts performance, cost, schema drift handling, and cross-system compatibility. Here’s a real-world, engineer-level guide for Parquet, Avro, CSV, and SequenceFile — with practical tips and robust Spark code.

📂 Parquet — The Workhorse for Analytics

What It Is

Columnar storage — compressed, optimized for heavy scans. Spark’s standard for structured big data lakes.

When to Use

Large fact tables, OLAP workloads, feature stores, BI dashboards — perfect when data is mostly append-only and read-heavy.

Read / Write

val df = spark.read
  .option("mergeSchema", "true") // ensures new columns are included
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write
  .mode("overwrite")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

By default, Spark picks one file’s footer — new columns in other partitions may be ignored. Use mergeSchema=true for folders where columns evolve over time.
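To make this concrete, here is a hedged sketch (reusing the placeholder abfss path from above; import spark.implicits._ is assumed for toDF) of two appends with diverging schemas followed by a merged read:

import spark.implicits._

// First load: two columns
Seq((1L, "a")).toDF("id", "name")
  .write.mode("append")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/")

// A later load adds a third column
Seq((2L, "b", 2024)).toDF("id", "name", "year")
  .write.mode("append")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/")

// Without mergeSchema, `year` may be silently dropped; with it, the schemas are unioned
spark.read
  .option("mergeSchema", "true")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/base/")
  .printSchema()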

🧬 Avro — Reliable for Streaming & Kafka

What It Is

Row-based binary format with schema embedded in each file. Designed for high-efficiency record serialization across systems and languages.

When to Use

Kafka pipelines, real-time streaming ETL, cross-platform data exchange between Spark, Java, Flink, Python, etc.

Read / Write

import org.apache.spark.sql.types._

val mySchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("new_column", StringType)
))

val df = spark.read
  .format("avro")
  .schema(mySchema)
  .load("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write.format("avro")
  .save("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

Avro supports backward/forward compatibility — but Spark won’t auto-merge multiple Avro versions in one read. Use an explicit StructType or a Schema Registry if working with Kafka topics.
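For a Kafka source, a hedged sketch of decoding Avro values with an explicit schema (assumes the spark-avro and spark-sql-kafka packages on the classpath and a hypothetical topic named events carrying plain Avro payloads; Confluent Schema Registry framing needs additional handling):

import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

// Explicit Avro schema (JSON) for the records on the topic
val eventSchema =
  """{"type": "record", "name": "Event", "fields": [
    |  {"name": "id", "type": "long"},
    |  {"name": "name", "type": ["null", "string"], "default": null}
    |]}""".stripMargin

val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker address
  .option("subscribe", "events")                      // hypothetical topic
  .load()

// Decode the binary `value` column with the explicit schema
val events = kafkaDF
  .select(from_avro(col("value"), eventSchema).as("event"))
  .select("event.*")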

📑 CSV — Simple but Fragile for Big Data

What It Is

Plain text, delimiter-separated, no built-in schema or metadata. Good for humans, risky for pipelines at scale.

When to Use

Small config tables, vendor dumps, quick exports for manual checks. Not recommended for high-volume core datasets.

Read / Write

import org.apache.spark.sql.types._

val mySchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("new_col", StringType)
))

val df = spark.read
  .schema(mySchema)
  .option("header", "true")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/base/")

df.write
  .option("header", "true")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

CSV files don’t store a schema — so if columns change, inferSchema often guesses wrong. For production jobs, always use a predefined StructType.
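As a hedged companion to that advice, the sketch below keeps a strict schema and also captures malformed rows instead of failing the job (the _corrupt_record column name is just a convention):

import org.apache.spark.sql.types._

val strictSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("_corrupt_record", StringType)   // receives raw rows that fail to parse
))

val df = spark.read
  .schema(strictSchema)
  .option("header", "true")
  .option("mode", "PERMISSIVE")                           // keep bad rows rather than failing
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("abfss://container@storageaccount.dfs.core.windows.net/base/")

// Malformed rows show up with nulls in the typed columns and the raw line in _corrupt_record
df.show(truncate = false)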

🗝️ SequenceFile — For Legacy Hadoop Only

What It Is

Hadoop’s classic binary key-value format. Still works but mostly historical in modern Spark setups.

When to Use

Bridging old MapReduce jobs, or when migrating legacy key-value logs from HDFS.

Read / Write

val rdd = spark.sparkContext.sequenceFile[String, String](
  "abfss://container@storageaccount.dfs.core.windows.net/base/")

rdd.saveAsSequenceFile("abfss://container@storageaccount.dfs.core.windows.net/output/")

Schema Evolution Tip

No schema. Pure key-value pairs. No merging or evolution — use modern formats instead for new systems.
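If you do need to carry such data forward, here is a hedged one-off migration sketch (the /migrated/ output path is illustrative):

import spark.implicits._

// Read the legacy key-value pairs, lift them into a DataFrame, and land them as Parquet
val kv = spark.sparkContext.sequenceFile[String, String](
  "abfss://container@storageaccount.dfs.core.windows.net/base/")

kv.toDF("key", "value")
  .write.mode("overwrite")
  .parquet("abfss://container@storageaccount.dfs.core.windows.net/migrated/")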

⚙️ How to Handle Schema Changes (Quick Guide)

Format         Best Practice
Parquet        Use mergeSchema=true for evolving partitions
Avro           Define a StructType; use a Schema Registry for Kafka
CSV            Always define a StructType; never rely on inferSchema
SequenceFile   No schema merging; key-value pairs only

✅ Generic Reader

val df = spark.read
  .format("parquet") // or "avro", "csv"
  .option("mergeSchema", "true") // Parquet only
  .schema(mySchema) // Avro & CSV: always define
  .load("abfss://container@storageaccount.dfs.core.windows.net/base/")

🎓 Final Takeaway

Parquet: Best for batch analytics — merge schema when columns evolve.
Avro: Best for streaming and cross-language exchange — version carefully.
CSV: Fine for humans — explicit schema is non-negotiable.
SequenceFile: Use only for legacy MapReduce interop.



SQL Spark

Spark SQL is one of the most used Spark modules and is designed for processing structured, columnar data. Once you have created a DataFrame, you can interact with the data by using SQL syntax. In other words, Spark SQL brings native raw SQL queries to Spark, meaning you can run traditional ANSI SQL on Spark DataFrames. In a later section of this Apache Spark tutorial, you will learn in detail how to use SQL select, where, group by, join, union, etc.

				
					val groupDF = spark.sql("SELECT gender, count(*) from PERSON_DATA group by gender")
groupDF.show()
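The same temporary view supports the other clauses mentioned above. A hedged sketch (the DEPT_DATA view and the dept_id/dept_name columns are illustrative and assume you registered such a view yourself):

// Filter rows with a WHERE clause
spark.sql("SELECT * FROM PERSON_DATA WHERE gender = 'M'").show()

// Join against a second (hypothetical) view registered the same way
val joined = spark.sql(
  """SELECT p.*, d.dept_name
    |FROM PERSON_DATA p
    |JOIN DEPT_DATA d ON p.dept_id = d.dept_id""".stripMargin)
joined.show()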
				
			

DataFrame Spark

DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

				
					
// Create SparkSession
import org.apache.spark.sql.SparkSession
val spark:SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()   
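With the session in place, a minimal sketch of constructing a DataFrame from a local collection (the column names and values are illustrative):

import spark.implicits._

// Build a DataFrame from an in-memory Seq with named columns
val df = Seq(
  ("James", "M", 60000),
  ("Anna",  "F", 72000)
).toDF("name", "gender", "salary")

df.printSchema()
df.show()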
				
			

RDD Spark

RDD (Resilient Distributed Dataset) is the fundamental data structure and primary data abstraction of Apache Spark and Spark Core. RDDs are fault-tolerant, immutable distributed collections of objects, which means that once you create an RDD you cannot change it. Each dataset in an RDD is divided into logical partitions, which can be computed on different nodes of the cluster.
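For example, a minimal sketch of creating an RDD and applying transformations and actions (assumes an existing SparkSession named spark):

val nums = spark.sparkContext.parallelize(1 to 10)

val squares = nums.map(n => n * n)            // transformation: lazy, nothing runs yet
val evenSquares = squares.filter(_ % 2 == 0)  // transformation: still lazy

println(evenSquares.count())                    // action: triggers the computation
println(evenSquares.collect().mkString(", "))   // action: brings the results to the driver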

				
					
// `df` is a DataFrame (for example, one created from an RDD via toDF); register it as a SQL view
df.createOrReplaceTempView("PERSON_DATA")
val df2 = spark.sql("SELECT * from PERSON_DATA")
df2.printSchema()
df2.show()
				
			
				
					

Apache Spark Architecture

Spark works in a master-slave architecture where the master is called the “Driver” and the slaves are called “Workers”. When you run a Spark application, the Spark Driver creates a context that is the entry point to your application; all operations (transformations and actions) are executed on the worker nodes, and the resources are managed by the cluster manager.

				
					//Create RDD from parallelize    
val dataSeq = Seq(("Java", 20000), ("Python", 100000), ("Scala", 3000))   
val rdd=spark.sparkContext.parallelize(dataSeq)
				
			
				
					
//Create RDD from external Data source
val rdd2 = spark.sparkContext.textFile("/path/textFile.txt")
				
			

Features & Advantages of Apache Spark

  • In-memory computation
  • Distributed processing using parallelize
  • Can be used with many cluster managers (Spark Standalone, YARN, Mesos, etc.)
  • Fault-tolerant
  • Immutable
  • Lazy evaluation
  • Cache & persistence
  • In-built optimization when using DataFrames
  • Supports ANSI SQL
  • Spark is a general-purpose, in-memory, fault-tolerant, distributed processing engine that allows you to process data efficiently in a distributed fashion.
  • Applications running on Spark can be up to 100x faster than traditional MapReduce-based systems.
  • You will get great benefits from using Spark for data ingestion pipelines.
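To illustrate lazy evaluation and caching from the list above, a minimal sketch (assumes an existing SparkSession named spark):

val rdd = spark.sparkContext.parallelize(1 to 1000000)

val evens = rdd.filter(_ % 2 == 0)   // lazy: no work happens yet
evens.cache()                        // mark the result for in-memory reuse

println(evens.count())               // first action computes and caches the RDD
println(evens.sum())                 // second action reuses the cached data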
				
For example, event logging for the Spark History Server is enabled in conf/spark-defaults.conf:

spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/path
				
			
				
Then start the history server so completed applications can be browsed in its web UI:

$SPARK_HOME/sbin/start-history-server.sh
				
			

What is Apache Spark

Apache Spark Tutorial – Apache Spark is an open-source analytical processing engine for large-scale, powerful distributed data processing and machine learning applications. Spark was originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation. In February 2014, Spark became a Top-Level Apache Project; contributions from thousands of engineers have made Spark one of the most active open-source projects at Apache.

				
					SPARK_HOME  = C:\apps\spark-3.5.0-bin-hadoop3
HADOOP_HOME = C:\apps\spark-3.5.0-bin-hadoop3
PATH=%PATH%;C:\apps\spark-3.5.0-bin-hadoop3\bin
				
			
