Spark GraphX & GraphFrames
1. Overview
Two libraries support graph processing over distributed data in Apache Spark:
GraphX, which ships with Spark, and GraphFrames, which is distributed as a
separate package. Both let you model your data as vertices (nodes) and edges
(relationships), but they differ in APIs and capabilities.

2. Usage & Use Cases
2.1 GraphX
- API: Scala/Java only, built on RDDs with Graph objects.
- Common Use Cases:
- PageRank on web graphs
- Connected components for community detection
- Shortest-path algorithms in transportation networks
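
GraphX has no Python API, so the following is not GraphX code. It is a plain-Python sketch of what a connected-components pass returns on a toy edge list, assuming the convention that each vertex is labeled with the smallest vertex ID in its component. The function name `connected_components` and the sample data are made up for illustration.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component.

    Hypothetical helper for illustration; edges are treated as undirected.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Walk up to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        a, b = find(src), find(dst)
        if a != b:
            # Always keep the smaller id as the root, so the root of a
            # component is its minimum vertex id.
            if a < b:
                parent[b] = a
            else:
                parent[a] = b

    return {v: find(v) for v in vertices}


# Toy data: two linked groups and one isolated vertex.
labels = connected_components(
    vertices=[1, 2, 3, 4, 5, 6],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

Vertices that share a label belong to the same component, which is the grouping a community-detection step would start from.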
2.2 GraphFrames
- API: Python, Scala, and Java, built on DataFrames, so vertices, edges, and results can also be queried with Spark SQL.
- Common Use Cases:
- Motif finding to detect fraud patterns
- Label propagation for clustering social networks
- SQL-style shortest-path queries
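
As a rough sketch, the three use cases above map onto the GraphFrames Python methods `find`, `labelPropagation`, and `shortestPaths`. The wrapper function `explore_graph`, its dictionary return layout, the motif pattern, and the `maxIter` value are assumptions chosen for illustration; `g` is expected to be an existing GraphFrame.

```python
def explore_graph(g, landmarks):
    """Run three GraphFrames queries on an existing GraphFrame `g`.

    `explore_graph` is a hypothetical helper, not part of GraphFrames.
    Each value in the returned dict is a Spark DataFrame.
    """
    # Motif finding: match directed 3-cycles a -> b -> c -> a.
    cycles = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[ca]->(a)")

    # Label propagation: adds a `label` column with a community id per vertex.
    communities = g.labelPropagation(maxIter=5)

    # Shortest paths: adds a `distances` map from landmark id to hop count.
    distances = g.shortestPaths(landmarks=landmarks)

    return {
        "cycles": cycles,
        "communities": communities,
        "distances": distances,
    }
```

With the graph built in section 3, `explore_graph(g, ["A"])["cycles"].show()` should list the cycle among A, B, and C once per starting vertex.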

3. Example: Simple PageRank with GraphFrames
Below is a PySpark example that constructs a small graph, runs PageRank, and shows the top-ranked vertices.
# GraphFrames is not bundled with Spark. Pass it with --packages, using the
# build whose suffix matches your Spark and Scala versions, for example:
# spark-submit --packages graphframes:graphframes:0.8.1-spark3.3-s_2.12
from pyspark.sql import SparkSession
from graphframes import GraphFrame

# Initialize Spark
spark = (
    SparkSession.builder
    .appName("GraphFramesPageRank")
    .getOrCreate()
)

# Define vertices (GraphFrames requires an "id" column)
v = spark.createDataFrame([
    ("A", "Alice"),
    ("B", "Bob"),
    ("C", "Cathy"),
    ("D", "David"),
], ["id", "name"])

# Define directed edges (GraphFrames requires "src" and "dst" columns)
e = spark.createDataFrame([
    ("A", "B"),
    ("B", "C"),
    ("C", "A"),
    ("A", "D"),
], ["src", "dst"])

# Create GraphFrame and run PageRank for a fixed number of iterations
g = GraphFrame(v, e)
results = g.pageRank(resetProbability=0.15, maxIter=10)

# Display vertices sorted by PageRank score, highest first
results.vertices.orderBy("pagerank", ascending=False).show()
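
To make the `resetProbability` and `maxIter` parameters more concrete, here is a plain-Python sketch of textbook PageRank on the same four-vertex graph. It is an illustration, not the GraphFrames implementation: the function name `pagerank_sketch`, the uniform redistribution of rank from vertices without outgoing edges, and the normalization of scores to a sum of 1 are assumptions of this sketch, so absolute scores will generally differ from what `g.pageRank` reports.

```python
def pagerank_sketch(vertices, edges, reset_probability=0.15, max_iter=10):
    """Textbook PageRank by repeated rank redistribution (illustrative only)."""
    n = len(vertices)

    # Adjacency list of outgoing edges per vertex.
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution.
    ranks = {v: 1.0 / n for v in vertices}

    for _ in range(max_iter):
        # Rank held by vertices with no outgoing edges is spread uniformly
        # (an assumption of this sketch).
        dangling = sum(ranks[v] for v in vertices if not out_links[v])
        base = reset_probability / n + (1 - reset_probability) * dangling / n
        new_ranks = {v: base for v in vertices}

        # Every other vertex splits its damped rank evenly among its targets.
        for src, targets in out_links.items():
            if targets:
                share = (1 - reset_probability) * ranks[src] / len(targets)
                for dst in targets:
                    new_ranks[dst] += share

        ranks = new_ranks

    return ranks


# Same vertices and edges as the GraphFrames example above.
ranks = pagerank_sketch(
    vertices=["A", "B", "C", "D"],
    edges=[("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")],
)
for vertex, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(vertex, round(score, 4))
```

A higher `reset_probability` pulls all scores toward the uniform value, while more iterations bring the scores closer to their converged values.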
