
Apache Spark Installation Guide for Windows & macOS

A. Prerequisites (Both Windows & macOS)

  • Java JDK 11+
    Download and install an OpenJDK 11+ build (e.g. Eclipse Temurin) or Oracle JDK, then confirm:
    java -version
  • Python 3.8+ (optional, for PySpark)
    Install from python.org or via your package manager, then confirm:
    python3 --version
  • (Windows only) Hadoop winutils.exe
    Download the winutils.exe matching your Hadoop line (e.g. Hadoop 3.3.x) from GitHub and place it in C:\hadoop\bin.
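
If you want a single copy-paste check of both version requirements, a minimal sketch for a POSIX shell (macOS/Linux) is below; on Windows, run the same two commands individually in PowerShell:

    # java -version prints to stderr, hence the redirect
    java -version 2>&1 | head -n 1
    python3 --version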

B. Installation on Windows

  1. Create Folders
    
    C:\spark
    C:\hadoop\bin  ← place winutils.exe here
        
  2. Download & Unpack Spark
    1. Go to spark.apache.org/downloads.html
    2. Select “Spark 3.5.0, pre-built for Apache Hadoop 3.3 and later” and extract the archive (e.g. with 7-Zip, or tar in PowerShell) into C:\spark, giving C:\spark\spark-3.5.0-bin-hadoop3.
  3. Configure Environment Variables
    In **System → Advanced → Environment Variables** add (or see the setx alternative after this list):
    
    HADOOP_HOME = C:\hadoop
    SPARK_HOME  = C:\spark\spark-3.5.0-bin-hadoop3
    JAVA_HOME   = C:\Program Files\Java\jdk-11.x.x
        
    Then prepend to **Path**:
    
    %HADOOP_HOME%\bin
    %SPARK_HOME%\bin
        
  4. Verify Spark Shell
    Open a new PowerShell or CMD window (so the updated variables are picked up) and run:
    spark-shell
  5. Optional: PySpark
    pyspark
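
If you prefer the command line to the System dialog, a sketch using setx is below; it persists the first two variables for your user (Path itself is still easiest to edit in the GUI, and only newly opened windows see the changes). The C:\tmp\hive step is a common extra fix if spark-shell later complains about /tmp/hive permissions. Paths assume the layout from steps 1–3:

    :: persist variables for the current user (takes effect in new windows only)
    setx HADOOP_HOME "C:\hadoop"
    setx SPARK_HOME "C:\spark\spark-3.5.0-bin-hadoop3"

    :: pre-create Hive's scratch directory and open its permissions via winutils
    mkdir C:\tmp\hive
    C:\hadoop\bin\winutils.exe chmod 777 C:\tmp\hive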

C. Installation on macOS

  1. Install Java
    brew install openjdk@11
    # brew --prefix resolves the right path on both Intel and Apple Silicon Macs
    echo 'export JAVA_HOME="$(brew --prefix openjdk@11)"' >> ~/.zshrc
    echo 'export PATH="$JAVA_HOME/bin:$PATH"' >> ~/.zshrc
    source ~/.zshrc
        
  2. Install Scala (Optional)
    brew install scala
  3. Download & Unpack Spark
    
    # if the main mirror has retired 3.5.0, fetch it from archive.apache.org/dist/spark/ instead
    curl -O https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
    tar xzf spark-3.5.0-bin-hadoop3.tgz
    mv spark-3.5.0-bin-hadoop3 ~/spark
        
  4. Configure Environment Variables
    Add to ~/.zshrc:
    
    export SPARK_HOME=~/spark
    export PATH="$SPARK_HOME/bin:$PATH"
        
    Then run:
    source ~/.zshrc
  5. Verify Spark Shell & PySpark
    spark-shell
    pyspark
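
As a quick sanity check that the variables resolve and the Spark binaries are on your PATH, spark-submit --version prints the build info without starting a shell:

    echo "$JAVA_HOME" "$SPARK_HOME"
    spark-submit --version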

D. Quick Smoke Test

On either OS, run one of the following in the shell to confirm everything works:

// Scala (spark-shell)
spark.range(1, 1000000).selectExpr("sum(id)").show()

# Python (pyspark)
df = spark.range(1, 1000000)
df.selectExpr("sum(id)").show()

If you see the sum output without errors, your Spark setup is complete!
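
For a test beyond the interactive shells, you can also submit a batch job using the example jar that ships with every Spark download. The jar name below assumes the default 3.5.0 prebuilt package (Scala 2.12 build); on Windows, use %SPARK_HOME%\examples\jars\... instead:

spark-submit --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar" 100

A value of around 3.14 in the "Pi is roughly ..." output line confirms the job ran end to end.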
