If you’ve ever wanted to explore big data processing with Apache Spark but felt intimidated by the setup, you’re not alone! Spark is a powerful tool, but getting it running smoothly on Windows requires some careful version matching and configuration. In this blog post, I’ll walk you through my journey of setting up Spark on Windows 11 and running successful experiments with it.
Getting Started: Installing the Right Software
Before diving into Spark, it’s crucial to have the necessary dependencies installed. Here’s what I used:
Install Java (Version 21.0.4)
Spark runs on the Java Virtual Machine (JVM), so having Java installed is a must. I downloaded Java version 21.0.4 and verified the installation by running:
java -version
This should return the installed Java version.
Install Python (Version 3.12.9)
Since I was working with PySpark, I needed Python installed. I downloaded Python 3.12.9 and verified it with:
python --version
Download Apache Spark (Version 3.5.5)
I downloaded Apache Spark 3.5.5 and extracted the .tgz file into a folder named C:\spark-3.5.5-bin-hadoop3.
Download Hadoop Winutils and hadoop.dll
Spark relies on Hadoop binaries for Windows compatibility, so I had to download Hadoop Winutils. I made sure to get Hadoop 3.3.0 or later, and then:
- Created a folder: C:\Winutils
- Inside Winutils, created a bin subfolder
- Placed hadoop.dll and winutils.exe inside C:\Winutils\bin (a quick sanity check for this layout follows below)
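Before moving on, it can save a headache to confirm the files actually landed where Spark will look for them. Here is a minimal sketch in Python, assuming the C:\Winutils layout above:

from pathlib import Path

# Folder layout from the steps above; adjust if you picked another location
hadoop_bin = Path(r"C:\Winutils\bin")

for name in ("winutils.exe", "hadoop.dll"):
    status = "found" if (hadoop_bin / name).exists() else "MISSING"
    print(f"{name}: {status}")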
Install DBeaver and Set Up SQLite Database
To explore data, I installed DBeaver, a powerful database management tool. With DBeaver, I:
- Downloaded the SQLite JDBC driver to ensure compatibility.
- Created a new SQLite database and connected it through DBeaver.
- Loaded the Chinook database, which contains a collection of tables related to music data.
Configuring Environment Variables
With all the downloads in place, I had to set up environment variables to let Windows know where to find Spark and Hadoop.
System Variables
- SPARK_HOME → C:\spark-3.5.5-bin-hadoop3
- HADOOP_HOME → C:\Winutils
Path Variables
Under the Path section, I added:
%SPARK_HOME%\bin
%HADOOP_HOME%\bin
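Environment variable changes only apply to terminals opened after the change, so it is worth verifying from a fresh Command Prompt. A tiny Python check, assuming the variable names set above:

import os

# Both variables were set in the System Variables step above
for var in ("SPARK_HOME", "HADOOP_HOME"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")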
Running PySpark
After all the setup, it was time to see if everything worked! I opened Command Prompt (CMD) and simply ran:
pyspark
And voilà! The Spark shell launched successfully, confirming that my setup was correct.
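If you want more than a launch banner as proof, a couple of commands inside the shell make a quick smoke test; pyspark predefines the spark session object for you:

# `spark` already exists inside the pyspark shell
print(spark.version)   # should print 3.5.5
spark.range(5).show()  # builds and displays a tiny DataFrame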
Exploring Data with Spark
With Spark up and running, I used CMD to query the SQLite database through Spark. First, I ensured that my JDBC driver was properly installed by downloading the correct version for SQLite and placing it in C:\Program Files\DBeaver\drivers\sqlite. To confirm the installation, I added the JDBC driver path in DBeaver, tested the connection, and made sure it could successfully access the database.
To query the SQLite database using Spark via CMD, I used the following command:
pyspark --driver-class-path "C:\Program Files\DBeaver\drivers\sqlite\sqlite-jdbc-3.34.0.jar" --jars "C:\Program Files\DBeaver\drivers\sqlite\sqlite-jdbc-3.34.0.jar"
Passing the jar via both flags puts the SQLite JDBC driver on the driver's classpath and ships it to the executors, so all dependencies are correctly linked and Spark can talk to the SQLite database.
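If you would rather not pass the jar on the command line every time, the same classpath can be configured from inside a script. This is a sketch assuming the same jar location as above:

from pyspark.sql import SparkSession

jar = r"C:\Program Files\DBeaver\drivers\sqlite\sqlite-jdbc-3.34.0.jar"

# Equivalent to --jars and --driver-class-path on the pyspark command line
spark = (SparkSession.builder
         .appName("SQLite Integration")
         .config("spark.jars", jar)
         .config("spark.driver.extraClassPath", jar)
         .getOrCreate())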
To make data exploration more efficient, I wrote a Python script using PySpark. This script connected to the Chinook SQLite database, executed queries, and applied filters to retrieve relevant data.
Establishing the Connection
First, I established a connection to the database using the appropriate JDBC driver and Spark's built-in support for reading SQL databases. The connection was made using the read.format function, which lets Spark read from SQLite over JDBC.
from pyspark.sql import SparkSession
# Start a Spark session (the SQLite JDBC jar must be on the classpath, e.g. via the pyspark command above)
spark = SparkSession.builder.appName("SQLite Integration").getOrCreate()
# JDBC URL pointing at the Chinook SQLite database file
db_path = "jdbc:sqlite:C:/Users/REDACTED/AppData/Roaming/DBeaverData/workspace6/.metadata/sample-database-sqlite-1/Chinook.db"
# Load the Album table through the SQLite JDBC driver
df = spark.read.format("jdbc").option("url", db_path).option("dbtable", "Album").option("driver", "org.sqlite.JDBC").load()
Verifying Data
Once the connection was established, I loaded the Album table into a Spark DataFrame and printed the first 20 rows to verify that the data had loaded correctly.
df.show(20)
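Alongside show, printSchema is a handy extra check (my addition, not part of the original script) that the column types came through the JDBC read as expected:

# Optional: inspect the column types Spark inferred from the JDBC read
df.printSchema()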
Filtering Data
To demonstrate Spark’s ability to handle large datasets efficiently, I applied a filter operation to extract albums by a specific artist (Artist ID 2).
filtered_df = df.filter(df.ArtistId == 2)
filtered_df.show()
The results were displayed in a structured format, making it easy to analyze the dataset within the Python script. This approach provided an efficient way to explore large amounts of data without needing to manually query the database repeatedly.
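If you prefer SQL to the DataFrame API, the same filter can be written as a query by registering the DataFrame as a temporary view. A short sketch using the df loaded above ("album" is just a view name I chose):

# Register the DataFrame so it can be queried with plain SQL
df.createOrReplaceTempView("album")

# Same result as df.filter(df.ArtistId == 2)
spark.sql("SELECT AlbumId, Title, ArtistId FROM album WHERE ArtistId = 2").show()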
Loading Data into Spark
Using PySpark, I connected to the Chinook SQLite database and successfully queried the Album table, printing the first 20 rows:
+-------+----------------------------+--------+
|AlbumId|Title |ArtistId|
+-------+----------------------------+--------+
|1 |For Those About To Rock |1 |
|2 |Balls to the Wall |2 |
|3 |Restless and Wild |2 |
...
Filtering Results
I then applied a filter to find albums by Artist ID 2, and Spark returned the correct results!
+-------+-----------------+--------+
|AlbumId|Title |ArtistId|
+-------+-----------------+--------+
|2 |Balls to the Wall|2 |
|3      |Restless and Wild|2       |
+-------+-----------------+--------+
Wrapping Up
Setting up Spark on Windows 11 can be tricky, but with the right versions and careful configuration, it’s completely doable! By following these steps, I was able to successfully install Spark, configure DBeaver, work with a SQLite database, and query data with PySpark.
If you’re looking to get started with Spark, I highly recommend paying close attention to version matching for Java, Python, and Hadoop. This will save you from many headaches down the road!