If you’ve ever wanted to explore big data processing with Apache Spark but felt intimidated by the setup, you’re not alone! Spark is a powerful tool, but getting it running smoothly on Windows requires some careful version matching and configuration. In this blog post, I’ll walk you through my journey of setting up Spark on Windows 11 and running successful experiments with it.

Getting Started: Installing the Right Software

Before diving into Spark, it’s crucial to have the necessary dependencies installed. Here’s what I used:

Install Java (Version 21.0.4)

Spark runs on the Java Virtual Machine (JVM), so having Java installed is a must. I downloaded Java version 21.0.4 and verified the installation by running:

java -version

This should return the installed Java version.

Install Python (Version 3.12.9)

Since I was working with PySpark, I needed Python installed. I downloaded Python 3.12.9 and verified it with:

python --version

Download Apache Spark (Version 3.5.5)

I downloaded Apache Spark 3.5.5 and extracted the .tgz file into a folder named C:\spark-3.5.5-bin-hadoop3.

Download Hadoop Winutils and hadoop.dll

Spark relies on Hadoop binaries for Windows compatibility, so I had to download Hadoop Winutils. I made sure to get Hadoop 3.3.0 or later, and then:

  • Created a folder: C:\Winutils
  • Inside Winutils, created a bin subfolder
  • Placed hadoop.dll and winutils.exe inside C:\Winutils\bin
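To double-check that the binaries landed where Spark expects them, a quick sanity check from Python works well. This is just a convenience sketch; the C:\Winutils\bin path matches the folder created above:

import os

# Confirm winutils.exe and hadoop.dll are where HADOOP_HOME will point
for name in ("winutils.exe", "hadoop.dll"):
    path = os.path.join(r"C:\Winutils\bin", name)
    print(path, "found" if os.path.exists(path) else "MISSING")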

Install DBeaver and Set Up SQLite Database

To explore data, I installed DBeaver, a powerful database management tool. With DBeaver, I:

  • Downloaded the SQLite JDBC driver to ensure compatibility.
  • Created a new SQLite database and connected it through DBeaver.
  • Loaded the Chinook database, which contains a collection of tables related to music data.

Configuring Environment Variables

With all the downloads in place, I had to set up environment variables to let Windows know where to find Spark and Hadoop.

System Variables

  • SPARK_HOME = C:\spark-3.5.5-bin-hadoop3
  • HADOOP_HOME = C:\Winutils

Path Variables

Under the Path section, I added:

%SPARK_HOME%\bin
%HADOOP_HOME%\bin
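Note that environment variable changes only apply to newly opened terminals. To confirm the variables are visible, a quick check from Python (a minimal sketch, nothing Spark-specific) looks like this:

import os

# Both should print the paths configured above, not None
print("SPARK_HOME =", os.environ.get("SPARK_HOME"))
print("HADOOP_HOME =", os.environ.get("HADOOP_HOME"))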

Running PySpark

After all the setup, it was time to see if everything worked! I opened Command Prompt (CMD) and simply ran:

pyspark

And voilà! The PySpark shell launched successfully, confirming that my setup was correct.
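As an extra sanity check, you can run a tiny job right inside the shell. The pyspark shell pre-creates the spark session object, so this should print a five-row DataFrame:

# spark is created automatically by the pyspark shell
spark.range(5).show()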

Exploring Data with Spark

With Spark up and running, I used CMD to query the SQLite database through Spark. First, I ensured that my JDBC driver was properly installed by downloading the correct version for SQLite and placing it in the appropriate directory (C:\Program Files\DBeaver\drivers\sqlite). To confirm the installation, I added the JDBC driver path in DBeaver, tested the connection, and ensured it could successfully access the database.

To query the SQLite database using Spark via CMD, I used the following command:

pyspark --driver-class-path "C:\Program Files\DBeaver\drivers\sqlite\sqlite-jdbc-3.34.0.jar" --jars "C:\Program Files\DBeaver\drivers\sqlite\sqlite-jdbc-3.34.0.jar"

This command allowed Spark to interact with the SQLite database seamlessly, ensuring that all dependencies were correctly linked.
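If you prefer launching from a standalone script rather than passing flags on the command line, the same jar can be wired in when building the SparkSession. This is a sketch of the equivalent configuration, using the same jar path as above:

from pyspark.sql import SparkSession

jar = r"C:\Program Files\DBeaver\drivers\sqlite\sqlite-jdbc-3.34.0.jar"

# Equivalent to the --jars / --driver-class-path flags above
spark = (SparkSession.builder
         .appName("SQLite Integration")
         .config("spark.jars", jar)
         .config("spark.driver.extraClassPath", jar)
         .getOrCreate())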

To make data exploration more efficient, I wrote a Python script using PySpark. This script connected to the Chinook SQLite database, executed queries, and applied filters to retrieve relevant data.

Establishing the Connection

First, I established a connection to the database using the SQLite JDBC driver and Spark's built-in JDBC data source. The connection goes through spark.read.format("jdbc"), which lets Spark read from SQLite like any other DataFrame source.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLite Integration").getOrCreate()
db_path = "jdbc:sqlite:C:/Users/REDACTED/AppData/Roaming/DBeaverData/workspace6/.metadata/sample-database-sqlite-1/Chinook.db"

# Load the Album table through the SQLite JDBC driver
df = (spark.read.format("jdbc")
      .option("url", db_path)
      .option("dbtable", "Album")
      .option("driver", "org.sqlite.JDBC")
      .load())

Verifying Data

Once the connection was established, I loaded the Album table into a Spark DataFrame and printed the first 20 rows to verify that the data had loaded correctly.

df.show(20)
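It's also worth checking that the column types survived the trip through the JDBC driver; printSchema gives a quick summary:

# Print column names and inferred types
df.printSchema()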

Filtering Data

To demonstrate Spark’s ability to handle large datasets efficiently, I applied a filter operation to extract albums by a specific artist (Artist ID 2).

filtered_df = df.filter(df.ArtistId == 2)
filtered_df.show()
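The same filter can also be written in plain SQL by registering the DataFrame as a temporary view, which some may find more readable. A small sketch (the view name album is my own choice):

# Expose the DataFrame to Spark SQL under a temporary name
df.createOrReplaceTempView("album")
spark.sql("SELECT AlbumId, Title FROM album WHERE ArtistId = 2").show()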

The results were displayed in a structured format, making it easy to analyze the dataset within the Python script. This approach provided an efficient way to explore large amounts of data without needing to manually query the database repeatedly.
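As one more illustration of that idea, an aggregation such as counting albums per artist runs entirely inside Spark, with no extra round trips to SQLite (a sketch reusing the same df as above):

from pyspark.sql import functions as F

# Count albums per artist, most prolific first
(df.groupBy("ArtistId")
   .agg(F.count("*").alias("albums"))
   .orderBy(F.desc("albums"))
   .show(5))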

Loading Data into Spark

Using PySpark, I connected to the Chinook SQLite database and successfully queried the Album table, printing the first 20 rows:

+-------+-----------------------+--------+
|AlbumId|Title                  |ArtistId|
+-------+-----------------------+--------+
|1      |For Those About To Rock|1       |
|2      |Balls to the Wall      |2       |
|3      |Restless and Wild      |2       |
...

Filtering Data

I then applied a filter to find albums by Artist ID 2, and Spark returned the correct results!

+-------+-----------------+--------+
|AlbumId|Title            |ArtistId|
+-------+-----------------+--------+
|2      |Balls to the Wall|2       |
|3      |Restless and Wild|2       |
+-------+-----------------+--------+

Wrapping Up

Setting up Spark on Windows 11 can be tricky, but with the right versions and careful configuration, it’s completely doable! By following these steps, I was able to successfully install Spark, configure DBeaver, work with a SQLite database, and query data with PySpark.

If you’re looking to get started with Spark, I highly recommend paying close attention to version matching for Java, Python, and Hadoop. This will save you from many headaches down the road!