Apache Spark

Lightning-fast unified analytics engine for large-scale data processing. Process massive datasets with SQL, streaming, machine learning, and graph processing.

Quick Start

Get Apache Spark up and running in minutes

Download Apache Spark

Download Spark from the Apache Spark downloads page. Choose a package pre-built for your Hadoop version, or download the source and build it yourself.

# Extract the downloaded archive
tar xvf spark-*.tgz
cd spark-*

Start the Interactive Shell

Launch the Spark shell to start working with your data interactively. Spark provides shells for Scala, Python, R, and SQL.

# Python shell
./bin/pyspark

# Scala shell
./bin/spark-shell

# R shell
./bin/sparkR

# SQL shell
./bin/spark-sql

Make sure you have Java 17 or later installed. Set JAVA_HOME to point to your Java installation.
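To check the Java requirement above from a script, one option is to parse the version string that `java -version` reports. The following is a minimal sketch, not part of Spark: the helper names `java_major_version` and `meets_spark_requirement` are hypothetical, and the parser assumes the usual `version "..."` format printed by common JDK builds.

```python
import re


def java_major_version(version_output: str) -> int:
    """Extract the Java major version from `java -version` style output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    first, second = match.group(1), match.group(2)
    # Legacy numbering: "1.8.0_392" means Java 8.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def meets_spark_requirement(version_output: str, minimum: int = 17) -> bool:
    """Return True when the reported Java version is at least `minimum`."""
    return java_major_version(version_output) >= minimum


print(meets_spark_requirement('openjdk version "17.0.9" 2023-10-17'))
```

In practice the input would come from the standard error stream of `java -version`, for example via `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr`.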

Run Your First Query

Try running a simple example to verify your installation:

# In the Python shell: create a DataFrame from a range
df = spark.range(1000 * 1000 * 1000)

# Count the records
df.count()
# Output: 1000000000

Example Output (Scala shell)

scala> spark.range(1000 * 1000 * 1000).count()
res0: Long = 1000000000

Run an Example Application

Spark includes several sample programs. Run the SparkPi example to calculate the value of Pi:

./bin/run-example SparkPi 10
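SparkPi is understood to estimate Pi by Monte Carlo sampling: it draws random points in a square and counts how many fall inside the inscribed circle, with the numeric argument controlling how many partitions share the work. The sketch below shows the same idea in plain, single-process Python. It is an illustration of the technique, not the bundled example's code, and the function name `estimate_pi` and the fixed seed are assumptions made for reproducibility.

```python
import random


def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate Pi from the fraction of random points inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        # Sample a point uniformly from the square [-1, 1] x [-1, 1].
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Circle area / square area = Pi / 4.
    return 4.0 * inside / num_samples


print(estimate_pi(200_000))
```

In a distributed run, each partition would count its own hits and the partial counts would be summed before the final division.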

You can also submit your own applications using spark-submit. Learn more in the Submitting Applications guide.
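As a rough illustration of what a spark-submit invocation looks like, the sketch below assembles one as an argument list. The `--master`, `--deploy-mode`, and `--conf` options are standard spark-submit flags; the helper `build_spark_submit`, the application file `my_app.py`, and the chosen configuration value are hypothetical.

```python
import shlex


def build_spark_submit(app, master, deploy_mode="client", conf=None, app_args=()):
    """Assemble a spark-submit command as a list of arguments."""
    cmd = ["./bin/spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Each Spark property is passed as its own --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    # The application file comes last, followed by its own arguments.
    cmd.append(app)
    cmd.extend(str(arg) for arg in app_args)
    return cmd


cmd = build_spark_submit(
    "my_app.py",  # hypothetical application file
    "local[4]",
    conf={"spark.executor.memory": "2g"},
    app_args=[100],
)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting problems.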

Explore by Component

Apache Spark provides a rich set of libraries for different data processing needs

Spark SQL

Work with structured data using SQL queries and DataFrames. Connect to data sources like Parquet, JSON, Hive, and JDBC.

Structured Streaming

Build scalable and fault-tolerant streaming applications. Process real-time data from Kafka, Kinesis, and more.

MLlib

Scale machine learning with distributed algorithms for classification, regression, clustering, and collaborative filtering.

GraphX

Analyze graph-structured data with Spark’s graph computation framework and built-in graph algorithms.

Spark Connect

Connect to Spark clusters remotely using the decoupled client-server architecture introduced in Spark 3.4.
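A Spark Connect client locates its server through a connection string that starts with `sc://`. The sketch below builds such a string. It assumes the commonly used default port 15002 and the `/;key=value` parameter syntax; the helper name `connect_url` and the host names are hypothetical.

```python
def connect_url(host, port=15002, **params):
    """Build a Spark Connect connection string of the form sc://host:port."""
    url = f"sc://{host}:{port}"
    if params:
        # Optional parameters follow "/;" as key=value pairs separated by ";".
        url += "/;" + ";".join(f"{k}={v}" for k, v in sorted(params.items()))
    return url


url = connect_url("localhost")
print(url)
# With PySpark 3.4 or later installed, a session could then be opened with:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.remote(url).getOrCreate()
```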

Core API

Understand RDDs and the fundamental distributed computing primitives that power all Spark components.
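An RDD is a collection split into partitions, with operations such as map and reduce applied per partition and then combined. The sketch below mimics that flow in plain Python to show the shape of the computation. It is an analogy only and does not use the Spark API; all function names are hypothetical.

```python
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def map_partitions(partitions, fn):
    """Apply fn to every element, one partition at a time."""
    return [[fn(x) for x in part] for part in partitions]


def reduce_partitions(partitions, op):
    """Reduce each non-empty partition, then combine the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)


parts = split_into_partitions(list(range(1, 11)), 3)
squares = map_partitions(parts, lambda x: x * x)
total = reduce_partitions(squares, lambda a, b: a + b)
print(total)  # 385, the sum of the squares of 1..10
```

The two-stage reduce only gives a well-defined result when the operator is associative, which is why reductions over partitioned data are usually restricted to such operators.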

Choose Your Language

Write Spark applications in Scala, Java, Python, R, or SQL

Scala

Native Spark language with type-safe Dataset APIs and functional programming support.

Java

Full-featured Java APIs for enterprise applications with familiar syntax and tooling.

Python

PySpark brings Spark to the Python ecosystem with pandas-like APIs and native library integration.

R

SparkR enables R users to leverage Spark’s distributed computing capabilities with familiar R syntax.

SQL

Write standard SQL queries with ANSI SQL support, window functions, and advanced analytics.

Deploy Anywhere

Run Spark on your preferred cluster manager

Standalone Mode

Deploy Spark on a private cluster with the built-in standalone cluster manager. Simple setup with minimal dependencies.

Kubernetes

Run Spark natively on Kubernetes with container orchestration and resource isolation. Ideal for cloud-native deployments.

Apache YARN

Integrate with Hadoop YARN for resource management in Hadoop clusters. Leverage existing Hadoop infrastructure.

Cluster Overview

Understand Spark’s cluster architecture, deployment modes, and how applications are executed across a cluster.
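Each cluster manager above is selected through the master URL given to Spark, for example with the `--master` option. The sketch below maps a manager name to the corresponding URL format. The URL schemes (`local[N]`, `spark://`, `k8s://https://`, `yarn`) and the standalone default port 7077 are standard; the helper `master_url` and the host names are hypothetical.

```python
def master_url(manager, host=None, port=None, cores=None):
    """Return a Spark master URL for the given cluster manager."""
    if manager not in {"local", "yarn", "standalone", "kubernetes"}:
        raise ValueError(f"unknown cluster manager: {manager}")
    if manager == "local":
        # local[*] uses as many worker threads as there are logical cores.
        return f"local[{'*' if cores is None else cores}]"
    if manager == "yarn":
        # The cluster location comes from the Hadoop configuration, not the URL.
        return "yarn"
    if host is None:
        raise ValueError(f"{manager} requires a host")
    if manager == "standalone":
        return f"spark://{host}:{7077 if port is None else port}"
    # Kubernetes: the API server port is given explicitly.
    if port is None:
        raise ValueError("kubernetes requires an explicit port")
    return f"k8s://https://{host}:{port}"


print(master_url("standalone", host="10.0.0.1"))
```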

Ready to process big data at scale?

Start building distributed data processing applications with Apache Spark. From batch processing to real-time analytics, Spark powers some of the world’s largest data workloads.
