
Spark Installation

To run the Apache Spark examples in this tutorial, you need Spark and its required tools installed on your computer. Since most developers use Windows for development, I will explain how to install Spark on Windows in this tutorial. You can also install Spark on a Linux server if needed.

// Sample row and column names; assumes an existing SparkSession named spark
val data = Seq(("James","","Smith","1991-04-01","M",3000))

val columns = Seq("firstname","middlename","lastname","dob","gender","salary")
val df = spark.createDataFrame(data).toDF(columns:_*)

Spark GraphX and GraphFrames

GraphFrames is a separately distributed package that supports graphs on DataFrames. Spark's built-in GraphX library runs on RDDs and therefore does not offer DataFrame capabilities.

GraphFrames is a graph processing library for Apache Spark that provides high-level abstractions for working with graphs and performing graph analytics. It extends Spark’s DataFrame API to support graph operations, allowing users to express complex graph queries using familiar DataFrame operations.

// Import necessary libraries
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

// Create a Spark session
val spark = SparkSession.builder.appName("GraphFramesExample").getOrCreate()
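
As a minimal sketch of the idea described above, the snippet below builds a small GraphFrame from two DataFrames and runs two queries against it. It assumes the `graphframes` package is on the classpath. The vertex and edge data, the application name, and the local master setting are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

// Assumption: local mode, with the graphframes package on the classpath
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("GraphFramesSketch")
  .getOrCreate()

// Vertices need an "id" column; edges need "src" and "dst" columns
val vertices = spark.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"), ("c", "Carol")
)).toDF("id", "name")

val edges = spark.createDataFrame(Seq(
  ("a", "b", "follows"), ("b", "c", "follows"), ("a", "c", "follows")
)).toDF("src", "dst", "relationship")

val graph = GraphFrame(vertices, edges)

// Graph queries return ordinary DataFrames
val inDegrees = graph.inDegrees                  // columns: id, inDegree
inDegrees.show()

// Edges are a DataFrame too, so regular filters apply
val followedByA = graph.edges.filter("src = 'a'")
followedByA.show()
```

Because `inDegrees` and `followedByA` are plain DataFrames, they can be joined, filtered, or written out like any other DataFrame.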


Spark Streaming

Spark Streaming is a scalable, high-throughput, fault-tolerant stream processing system that supports both batch and streaming workloads. It is used to process real-time data from sources such as file system folders, TCP sockets, S3, Kafka, Flume, Twitter, and Kinesis, to name a few. The processed data can be pushed to databases, Kafka, live dashboards, etc.
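
To make the source-to-sink flow concrete, here is a small sketch. Note that it uses the newer Structured Streaming API rather than the DStream-based Spark Streaming API, and that it reads from Spark's built-in `rate` test source instead of a real source such as Kafka or a socket. The application name, the row rate, and the five-second run time are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode for experimentation
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("StreamingSketch")
  .getOrCreate()

// The built-in "rate" test source continuously emits (timestamp, value) rows
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()

// Streaming DataFrames accept the usual DataFrame transformations
val evens = stream.filter("value % 2 = 0")

// Print each micro-batch to the console; a real job would write to Kafka, a database, etc.
val query = evens.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination(5000)   // let the query run for about five seconds
query.stop()
```

Swapping the `rate` source for a production source changes only the `format` and its options; the transformation and sink code keeps the same shape.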

Spark Data Source

Spark SQL supports operating on a variety of data sources through the DataFrame interface. This section of the tutorial describes reading and writing data using the Spark Data Sources with Scala examples. Using the Data Source API, we can load data from, or save data to, RDBMS databases, Avro, Parquet, XML, etc.
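
The read and write paths can be sketched as a round trip through two of the built-in formats. This is only an illustration: the column names, the application name, and the temporary output directory are assumptions, and formats such as Avro or XML typically require additional packages.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: local mode for experimentation
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("DataSourceSketch")
  .getOrCreate()

val langs = spark.createDataFrame(
  Seq(("Java", 20000), ("Python", 100000), ("Scala", 3000))
).toDF("language", "users")

// Hypothetical output location: a fresh temporary directory
val outDir = Files.createTempDirectory("spark-datasource-").toString

// Parquet stores the schema alongside the data
langs.write.mode("overwrite").parquet(s"$outDir/langs.parquet")
val fromParquet = spark.read.parquet(s"$outDir/langs.parquet")

// CSV needs the header and schema inference to be requested explicitly
langs.write.mode("overwrite").option("header", "true").csv(s"$outDir/langs.csv")
val fromCsv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(s"$outDir/langs.csv")

fromParquet.show()
fromCsv.printSchema()
```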


Spark SQL

Spark SQL is one of the most used Spark modules and is designed for processing structured, columnar data. Once you have created a DataFrame, you can interact with the data using SQL syntax. In other words, Spark SQL brings native raw SQL queries to Spark, meaning you can run traditional ANSI SQL on a Spark DataFrame. Later sections of this Apache Spark tutorial cover SQL select, where, group by, join, union, etc. in detail.

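df.createOrReplaceTempView("PERSON_DATA") // assumes df is an existing DataFrame with a gender column
val groupDF = spark.sql("SELECT gender, count(*) FROM PERSON_DATA GROUP BY gender")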
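
A SQL query can only see a DataFrame after it has been registered as a view. The following sketch shows the full sequence under some assumptions: the sample rows, the column names, and the application name are invented for illustration, and only the `PERSON_DATA` view name comes from the query above.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode for experimentation
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SparkSqlSketch")
  .getOrCreate()

// Made-up sample data
val people = spark.createDataFrame(Seq(
  ("James", "M", 3000),
  ("Anna", "F", 4100),
  ("Robert", "M", 6200)
)).toDF("firstname", "gender", "salary")

// Register the DataFrame under a name that SQL queries can reference
people.createOrReplaceTempView("PERSON_DATA")

// Aggregate with GROUP BY
val byGender = spark.sql(
  "SELECT gender, count(*) AS cnt FROM PERSON_DATA GROUP BY gender ORDER BY gender")
byGender.show()

// Filter with WHERE
val highEarners = spark.sql("SELECT firstname FROM PERSON_DATA WHERE salary > 4000")
highEarners.show()
```

The results of `spark.sql` are themselves DataFrames, so SQL and DataFrame operations can be mixed freely.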

Spark DataFrame

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

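// Create SparkSession
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder()
  .master("local[*]")        // assumption: run locally on all available cores
  .appName("SparkExample")   // placeholder application name
  .getOrCreate()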
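
To show what "named columns" buys you in practice, here is a short sketch of typical DataFrame operations. The column names and the first row follow the sample used earlier in this tutorial; the other rows and the application name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

// Assumption: local mode for experimentation
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("DataFrameSketch")
  .getOrCreate()

val columns = Seq("firstname", "middlename", "lastname", "dob", "gender", "salary")
val data = Seq(
  ("James", "", "Smith", "1991-04-01", "M", 3000),
  ("Maria", "Anne", "Jones", "1967-12-01", "F", 4000),  // made-up row
  ("Jen", "Mary", "Brown", "1980-02-17", "F", 5000)     // made-up row
)
val df = spark.createDataFrame(data).toDF(columns: _*)

// Column names and inferred types
df.printSchema()

// Filter rows and project columns by name
val wellPaid = df.filter(col("salary") > 3500).select("firstname", "salary")
wellPaid.show()

// Aggregate by a named column
val avgByGender = df.groupBy("gender").agg(avg("salary").as("avg_salary"))
avgByGender.show()
```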

Spark RDD

RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark and the primary data abstraction in Apache Spark and Spark Core. RDDs are fault-tolerant, immutable distributed collections of objects, which means that once you create an RDD you cannot change it. Each RDD is divided into logical partitions, which can be computed on different nodes of the cluster.

val df2 = spark.sql("SELECT * from PERSON_DATA")
val groupDF = spark.sql("SELECT gender, count(*) from PERSON_DATA group by gender")
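
The immutability and partitioning of RDDs can be observed directly. The sketch below is illustrative only: the partition count of 3, the data, and the application name are arbitrary assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode for experimentation
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("RddSketch")
  .getOrCreate()
val sc = spark.sparkContext

// Explicitly request 3 logical partitions
val numbers = sc.parallelize(1 to 10, numSlices = 3)
println(s"partitions: ${numbers.getNumPartitions}")

// glom() groups the elements of each partition into an array, making the layout visible
numbers.glom().collect().foreach(p => println(p.mkString(", ")))

// A transformation never modifies an RDD; it returns a new one
val doubled = numbers.map(_ * 2)

// Actions trigger the computation; the original RDD is unchanged
val total = doubled.sum()
println(s"sum of doubled values: $total")
println(s"original values: ${numbers.collect().mkString(", ")}")
```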

Apache Spark Architecture

Spark works in a master-slave architecture where the master is called the “Driver” and the slaves are called “Workers”. When you run a Spark application, the Spark Driver creates a context that serves as the entry point to your application. All operations (transformations and actions) are executed on worker nodes, and resources are managed by the Cluster Manager.

// Create an RDD from a local collection with parallelize
val dataSeq = Seq(("Java", 20000), ("Python", 100000), ("Scala", 3000))
val rdd = spark.sparkContext.parallelize(dataSeq)
// Create an RDD from an external data source
val rdd2 = spark.sparkContext.textFile("/path/textFile.txt")

Features & Advantages of Apache Spark

  • In-memory computation
  • Distributed processing using parallelize
  • Can be used with many cluster managers (Spark, YARN, Mesos, etc.)
  • Fault-tolerant
  • Immutable
  • Lazy evaluation
  • Cache & persistence
  • Built-in optimization when using DataFrames
  • Supports ANSI SQL
  • Spark is a general-purpose, in-memory, fault-tolerant, distributed processing engine that allows you to process data efficiently in a distributed fashion.
  • Applications running on Spark can be up to 100x faster than traditional systems.
  • You will get great benefits from using Spark for data ingestion pipelines.

spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/path
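
Two of the features listed above, lazy evaluation and caching, are easy to observe with an accumulator that counts how often a transformation function actually runs. This is a rough sketch for local experimentation; the accumulator name, the data, and the application name are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode for experimentation
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("LazyEvaluationSketch")
  .getOrCreate()
val sc = spark.sparkContext

// Counts how many times the map function is invoked
val calls = sc.longAccumulator("mapCalls")

// Defining a transformation only records it in the lineage; nothing runs yet
val squares = sc.parallelize(1 to 5).map { x => calls.add(1); x * x }
val callsBefore: Long = calls.value
println(s"map calls before any action: $callsBefore")

// Mark the RDD for caching so later actions can reuse the computed values
squares.cache()

// The first action triggers the actual computation
val total = squares.sum()
val callsAfter: Long = calls.value
println(s"sum of squares: $total, map calls after the action: $callsAfter")
```

Accumulator updates made inside transformations can be repeated if a task is retried, so treat this counter as a teaching aid rather than an exact metric.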

What is Apache Spark

Apache Spark is an open-source analytical processing engine for large-scale distributed data processing and machine learning applications. Spark was originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation. In February 2014, Spark became a Top-Level Apache Project, and thousands of engineers have since contributed to it, making Spark one of the most active open-source projects in Apache.

SPARK_HOME  = C:\apps\spark-3.5.0-bin-hadoop3
HADOOP_HOME = C:\apps\spark-3.5.0-bin-hadoop3
