PySpark is the Python API for Apache Spark, allowing you to harness the power of distributed computing with Python’s simplicity.
API Documentation
Access the complete PySpark API documentation at: Spark Python API (Sphinx)
Main Modules
pyspark.sql
The core module for working with structured data in PySpark.
Key Classes:
- SparkSession - Entry point for Spark functionality
- DataFrame - Distributed collection of data organized into named columns
- Column - Column expression in a DataFrame
- Row - Row of data in a DataFrame
- functions - Module of built-in functions for DataFrame operations
pyspark.sql.types
Data types for defining schemas.
Key Classes:
- StructType - Schema definition
- StructField - Field in a schema
- Data types: StringType, IntegerType, DoubleType, DateType, TimestampType, etc.
pyspark.sql.streaming
Streaming API for real-time data processing.
Key Classes:
- DataStreamReader - Read streaming data
- DataStreamWriter - Write streaming results
- StreamingQuery - Handle to a running query
pyspark.ml
Machine learning library built on DataFrames.
Key Modules:
- pyspark.ml.feature - Feature transformers and extractors
- pyspark.ml.classification - Classification algorithms
- pyspark.ml.regression - Regression algorithms
- pyspark.ml.clustering - Clustering algorithms
pyspark.pandas
Pandas API on Spark for seamless transition from pandas.
Key Features:
- Drop-in replacement for pandas operations
- Distributed computing on large datasets
- Compatible with most pandas APIs
Quick Start Example
Getting started with PySpark takes only a few lines: create a SparkSession, build a DataFrame, and run an operation on it.
Reading Data
PySpark can read data from various sources, such as CSV, JSON, and Parquet files.
DataFrame Operations
Common DataFrame transformations include selecting, filtering, grouping, and aggregating.
User-Defined Functions (UDFs)
UDFs let you create custom functions for logic that the built-in functions do not cover.
SQL Queries
You can execute SQL queries on DataFrames by registering them as temporary views.
Pandas API on Spark
The pandas API on Spark lets you use familiar pandas syntax with Spark’s distributed computing.
Installation
Install PySpark using pip: pip install pyspark
Spark Connect Support
Since Spark 3.4, most PySpark APIs are supported in Spark Connect, including DataFrame, functions, and Column. However, SparkContext and RDD are not supported.
Performance Tips
- Use built-in functions instead of UDFs when possible
- Cache DataFrames that are reused multiple times
- Use broadcast joins for small tables
- Partition data appropriately for your workload
- Use columnar formats like Parquet for better performance
Additional Resources
- PySpark Getting Started Guide
- Spark SQL Programming Guide
- PySpark Migration Guide
- Spark Connect Overview
- Python Examples
Always check the API reference documentation for the “Supports Spark Connect” label to verify compatibility when using Spark Connect.
