PySpark Feature Comparison
This document provides a comprehensive comparison of PySpark DataFrame API features against all 6 Moltres interfaces:
PySpark-Style (Sync) - DataFrame - Primary PySpark-compatible API
PySpark-Style (Async) - AsyncDataFrame - Async version of PySpark-style API
Pandas-Style (Sync) - PandasDataFrame - Pandas-compatible API
Pandas-Style (Async) - AsyncPandasDataFrame - Async version of Pandas-style API
Polars-Style (Sync) - PolarsDataFrame - Polars LazyFrame-compatible API
Polars-Style (Async) - AsyncPolarsDataFrame - Async version of Polars-style API
Status Indicators
✅ Supported - Implemented with full feature parity
⚠️ Partial - Partially implemented or with limitations
❌ Not Implemented - Not available in this interface
🔄 Different API - Available but with different method name/API signature
📝 Notes - Additional implementation details, differences, or limitations
Selection & Projection
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ | ✅ ( | ✅ ( | ✅ | ✅ | All interfaces support column selection |
| | ✅ | ✅ | ✅ ( | ✅ ( | ✅ ( | ✅ ( | SQL expression selection |
| Column access | ✅ ( | ✅ ( | 🔄 ( | 🔄 ( | 🔄 ( | 🔄 ( | PySpark-style supports dot notation; Pandas/Polars use bracket notation |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Pandas/Polars-style column access (bracket notation) |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Multi-column selection (Pandas/Polars) |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Boolean indexing (Pandas/Polars) |
Filtering
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ | 🔄 ( | 🔄 ( | ✅ ( | ✅ ( | All support filtering |
| | ✅ | ✅ | 🔄 ( | 🔄 ( | ✅ | ✅ | Alias for |
| | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | Pandas-style string query syntax |
| | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | Pandas-style membership check |
| | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | Pandas-style range check |
Joins
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ | ✅ ( | ✅ ( | ✅ | ✅ | All support joins |
| | ✅ | ✅ | ❌ | ❌ | ✅ ( | ✅ ( | Cross join support |
| | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | Semi-join (filter rows with matches) |
| | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | Anti-join (filter rows without matches) |
| Join types: | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | All join types supported |
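
The semi-join and anti-join rows above describe filtering by the presence or absence of a match. As a rough illustration of those semantics at the SQL level, the sketch below uses only the standard-library `sqlite3` module with `EXISTS` and `NOT EXISTS`. The table and column names are made up for the example, and this is not necessarily the SQL that Moltres generates.

```python
import sqlite3

# Hypothetical sample data: three customers, two of whom have orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# Semi-join: keep customers with at least one matching order.
# Each customer appears once, however many orders match.
semi = conn.execute("""
    SELECT c.name FROM customers AS c
    WHERE EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id)
    ORDER BY c.id
""").fetchall()

# Anti-join: keep customers with no matching order.
anti = conn.execute("""
    SELECT c.name FROM customers AS c
    WHERE NOT EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id)
    ORDER BY c.id
""").fetchall()

semi_names = [row[0] for row in semi]
anti_names = [row[0] for row in anti]
print(semi_names, anti_names)
```

Neither result carries columns from `orders`, which is what separates these operations from an inner or left join.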
GroupBy & Aggregations
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ | ✅ ( | ✅ ( | ✅ ( | ✅ ( | All support grouping |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Aggregation expressions |
| | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | Dictionary syntax (PySpark/Pandas) |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Basic aggregations |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | First/last value per group |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Count distinct (Pandas/Polars) |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Statistical aggregations |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Pivot operation |
Sorting
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ | ✅ ( | ✅ ( | ✅ ( | ✅ ( | All support sorting |
| | ✅ | ✅ | ✅ ( | ✅ ( | ✅ | ✅ | Alias for |
| | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | Pandas-style sorting |
| | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | Polars-style sorting |
Window Functions
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Window function support |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Row number window function |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Ranking functions |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Lead/lag functions |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | First/last value in window |
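
To make the window-function categories above concrete, here is a small sketch of the underlying SQL constructs (row number, rank, lag, first value), run through the standard-library `sqlite3` module. It assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added. The `sales` table is hypothetical and the query is illustrative only, not output captured from Moltres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 1, 100), ('east', 2, 150), ('east', 3, 120),
        ('west', 1, 200), ('west', 2, 180);
""")

rows = conn.execute("""
    SELECT region, month, amount,
           -- position of the row within its region, ordered by month
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY month) AS rn,
           -- rank by amount within the region, largest first
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank,
           -- previous month's amount, NULL for the first month
           LAG(amount) OVER (PARTITION BY region ORDER BY month) AS prev_amount,
           -- amount of the first month in the region
           FIRST_VALUE(amount) OVER (PARTITION BY region ORDER BY month) AS first_amount
    FROM sales
    ORDER BY region, month
""").fetchall()

for row in rows:
    print(row)
```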
Set Operations
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ | ✅ ( | ✅ ( | ✅ | ✅ | Union operation |
| | ✅ | ✅ | ✅ ( | ✅ ( | ✅ | ✅ | Union all (same as union) |
| | ✅ | ✅ | ✅ ( | ✅ ( | ✅ | ✅ | Intersection |
| | ✅ | ✅ | ✅ ( | ✅ ( | ✅ ( | ✅ ( | Set difference |
| | ✅ | ✅ | ✅ ( | ✅ ( | ✅ | ✅ | Remove duplicates |
| | ✅ | ✅ | ✅ | ✅ | ✅ ( | ✅ ( | Drop duplicates with subset |
Column Manipulation
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ | ✅ ( | ✅ ( | ✅ ( | ✅ ( | Add/replace column |
| | ❌ | ❌ | ✅ ( | ✅ ( | ✅ ( | ✅ ( | Multiple columns |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Drop columns |
| | ✅ | ✅ | ✅ ( | ✅ ( | ✅ ( | ✅ ( | Rename column |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Column alias |
| | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | Pandas-style column assignment |
Null Handling
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ | ✅ | ✅ | ✅ ( | ✅ ( | Fill null values |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Column-specific fill |
| | ✅ | ✅ | ✅ | ✅ | ✅ ( | ✅ ( | Drop null rows |
| | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | Null handling property |
| | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | Null handling property |
String Operations
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String accessor |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String accessor |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String contains |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String startswith |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String endswith |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String replace |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String split |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String length |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | String functions (all) |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Substring function |
Date/Time Operations
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Date accessor |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Date accessor |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Date accessor |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Date accessor |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Date functions (all) |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Date conversion |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Timestamp conversion |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Date arithmetic |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Date arithmetic |
File I/O
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ ( | ✅ ( | ❌ | ❌ | ✅ ( | ✅ ( | Read CSV |
| | ✅ ( | ✅ ( | ❌ | ❌ | ✅ ( | ✅ ( | Read JSON |
| | ✅ ( | ✅ ( | ❌ | ❌ | ✅ ( | ✅ ( | Read Parquet |
| | ✅ ( | ✅ ( | ❌ | ❌ | ✅ ( | ✅ ( | Read text |
| | ✅ | ✅ | ❌ | ❌ | ✅ ( | ✅ ( | Write CSV |
| | ✅ | ✅ | ❌ | ❌ | ✅ ( | ✅ ( | Write JSON |
| | ✅ | ✅ | ❌ | ❌ | ✅ ( | ✅ ( | Write Parquet |
| | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | Save to table |
| | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | Insert into table |
| | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | Write mode |
Schema Operations
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Column names |
| | ✅ | ✅ | 🔄 ( | 🔄 ( | ✅ | ✅ | Schema information |
| | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | Column types (Pandas-style) |
| | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | Print schema tree |
| | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | Polars schema format |
Execution Methods
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ ( | ✅ | ✅ ( | ✅ | ✅ ( | Execute and return results |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Streaming execution |
| | ✅ | ✅ ( | ❌ | ❌ | ❌ | ❌ | Display results |
| | ✅ | ✅ ( | ❌ | ❌ | ✅ ( | ✅ ( | Take n rows |
| | ✅ | ✅ ( | ❌ | ❌ | ❌ | ❌ | First row |
| | ✅ | ✅ ( | ✅ | ✅ | ✅ | ✅ | First n rows |
| | ✅ | ✅ ( | ✅ | ✅ | ✅ | ✅ | Last n rows |
| | ✅ | ✅ ( | ❌ | ❌ | ❌ | ❌ | Row count |
| | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ ( | Polars-style fetch |
Statistics & Descriptive
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ ( | ✅ | ✅ ( | ✅ ( | ✅ ( | Statistical summary |
| | ✅ | ✅ ( | ❌ | ❌ | ❌ | ❌ | Custom statistics |
| | ❌ | ❌ | ✅ | ✅ ( | ✅ | ✅ | Count unique values |
| | ❌ | ❌ | ✅ | ✅ ( | ❌ | ❌ | Value frequency (Pandas) |
| | ❌ | ❌ | ✅ | ✅ ( | ❌ | ❌ | DataFrame info (Pandas) |
| | ❌ | ❌ | ✅ | ✅ ( | ❌ | ❌ | Check if empty (Pandas) |
| | ❌ | ❌ | ✅ | ✅ ( | ❌ | ❌ | DataFrame shape (Pandas) |
| | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ ( | Polars dimensions |
Data Reshaping
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Pivot operation |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Explode array/JSON |
| | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Unpivot (Pandas/Polars) |
| | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | Unnest nested structures (Polars) |
| | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | Slice rows (Polars) |
Sampling & Limiting
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Random sampling |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Limit rows |
CTEs & SQL
| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|---|---|---|---|
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Common Table Expression |
| | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | Recursive CTE (Polars) |
| | ✅ ( | ✅ ( | ❌ | ❌ | ❌ | ❌ | Raw SQL execution |
| | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | Get SQL string |
| | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Get SQLAlchemy statement |
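
For readers unfamiliar with recursive CTEs, the following sketch shows the SQL construct that the "Recursive CTE" row refers to, again using only the standard-library `sqlite3` module. The `employees` table and the `reports` CTE name are hypothetical, and the query is a plain-SQL illustration rather than anything emitted by Moltres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ada', NULL), (2, 'Grace', 1), (3, 'Linus', 2), (4, 'Guido', 1);
""")

# The anchor member selects the root (no manager); the recursive member
# repeatedly joins employees to rows already in the CTE, one level deeper.
chain = conn.execute("""
    WITH RECURSIVE reports(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, r.depth + 1
        FROM employees AS e
        JOIN reports AS r ON e.manager_id = r.id
    )
    SELECT name, depth FROM reports ORDER BY depth, id
""").fetchall()

print(chain)
```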
Interface-Specific Features
Pandas-Style Unique Features
| Feature | Pandas-Style (Sync) | Pandas-Style (Async) | Notes |
|---|---|---|---|
| | ✅ | ✅ | String-based query syntax |
| | ✅ | ✅ ( | Label-based indexing |
| | ✅ | ✅ ( | Integer-based indexing |
| | ✅ | ✅ | Append DataFrames |
| | ✅ | ✅ | Membership check |
| | ✅ | ✅ | Range check |
| | ✅ | ✅ | Column assignment |
| | ✅ | ✅ | Sorting with parameters |
| | ✅ | ✅ | Value frequency |
| | ✅ | ✅ | DataFrame info |
| | ✅ | ✅ | Unpivot operation |
Polars-Style Unique Features
| Feature | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---|---|---|---|
| | ✅ | ✅ | Mark as lazy (no-op in Moltres) |
| | ✅ | ✅ ( | Fetch n rows |
| | ✅ | ✅ | Multiple column operations |
| | ✅ | ✅ | Rename multiple columns |
| | ✅ | ✅ | Add row number column |
| | ✅ | ✅ | Add context DataFrame |
| | ✅ | ✅ | Recursive CTE |
| | ✅ | ✅ | Unnest nested structures |
| | ✅ | ✅ | Slice rows |
| | ✅ | ✅ | Sample every nth row |
| | ✅ | ✅ | Interpolate missing values |
| | ✅ | ✅ | Quantile calculation |
| | ✅ | ✅ | Horizontal stack |
| | ✅ | ✅ | Vertical stack |
| | ✅ | ✅ | Set difference |
| | ✅ | ✅ | Cross join |
| | ✅ | ✅ | Drop nulls |
| | ✅ | ✅ | Fill nulls |
| | ✅ | ✅ | Unique rows |
| | ✅ | ✅ ( | Query plan explanation |
Async-Specific Features
| Feature | PySpark-Style (Async) | Pandas-Style (Async) | Polars-Style (Async) | Notes |
|---|---|---|---|---|
| | ✅ | ✅ | ✅ | Async execution |
| | ✅ | ✅ | ✅ | Async streaming |
| | ✅ | ❌ | ❌ | Async display |
| | ✅ | ❌ | ✅ ( | Async take |
| | ✅ | ❌ | ❌ | Async first row |
| | ✅ | ✅ | ✅ | Async head |
| | ✅ | ✅ | ✅ | Async tail |
| | ✅ | ❌ | ❌ | Async count |
| | ✅ | ✅ | ✅ | Async describe |
| | ✅ | ❌ | ❌ | Async summary |
| | ❌ | ✅ | ✅ | Async nunique |
| | ❌ | ✅ | ❌ | Async value_counts |
| | ❌ | ✅ | ❌ | Async info |
| | ❌ | ✅ | ❌ | Async shape |
| | ❌ | ✅ | ❌ | Async empty |
| | ❌ | ❌ | ✅ | Async height |
| | ❌ | ❌ | ✅ | Async schema |
| | ❌ | ✅ | ❌ | Async dtypes |
| | ❌ | ❌ | ✅ | Async fetch |
| | ❌ | ❌ | ✅ | Async write CSV |
| | ❌ | ❌ | ✅ | Async write JSON |
| | ❌ | ❌ | ✅ | Async write Parquet |
| | ❌ | ❌ | ✅ | Async explain |
| | ❌ | ✅ | ❌ | Async loc indexer |
| | ❌ | ✅ | ❌ | Async iloc indexer |
Summary
Overall Coverage
PySpark-Style (Sync): ~98% API compatibility with PySpark DataFrame API
PySpark-Style (Async): Full async support for all sync methods
Pandas-Style (Sync): Comprehensive Pandas DataFrame API with SQL pushdown
Pandas-Style (Async): Full async support for all Pandas-style methods
Polars-Style (Sync): Comprehensive Polars LazyFrame API with SQL pushdown
Polars-Style (Async): Full async support for all Polars-style methods
Key Differences
Column Access: PySpark uses df.col attribute access, while Pandas/Polars use df['col'] bracket notation
Filtering: PySpark uses where()/filter(), Pandas uses query(), Polars uses filter()
GroupBy: PySpark uses groupBy(), Pandas uses groupby(), Polars uses group_by()
Sorting: PySpark uses orderBy(), Pandas uses sort_values(), Polars uses sort()
Null Handling: PySpark uses na property, Pandas/Polars use direct methods
File I/O: PySpark uses spark.read.*, Moltres uses db.read.* or db.scan_* (Polars)
Async: All async interfaces require await for execution methods
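
The last point, that async execution methods must be awaited, can be sketched with the standard library alone. `AsyncQuery` and its `collect()` method below are hypothetical stand-ins written for this example and are not the Moltres API. The sketch also assumes Python 3.9 or newer for `asyncio.to_thread`.

```python
import asyncio
import sqlite3


class AsyncQuery:
    """Hypothetical stand-in for an async DataFrame; not the Moltres API."""

    def __init__(self, sql: str) -> None:
        self.sql = sql  # building the query performs no I/O

    async def collect(self) -> list:
        # Run the blocking database call in a worker thread so the event
        # loop stays free. The connection is created and used in that thread.
        def run() -> list:
            conn = sqlite3.connect(":memory:")
            try:
                return conn.execute(self.sql).fetchall()
            finally:
                conn.close()

        return await asyncio.to_thread(run)


async def main() -> list:
    query = AsyncQuery("SELECT 1 + 1")
    # Without `await`, query.collect() would only create a coroutine object
    # and the query would never run.
    return await query.collect()


result = asyncio.run(main())
print(result)
```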
Implementation Notes
All interfaces maintain lazy evaluation until execution
SQL pushdown execution for all operations
Type safety with proper type hints
Comprehensive error handling and validation
Support for multiple database dialects (SQLite, PostgreSQL, MySQL, DuckDB)
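
The first two notes (lazy evaluation until execution, and SQL pushdown) can be illustrated with a deliberately tiny query builder. `LazyQuery` and all of its methods are hypothetical names invented for this sketch. It shows the general pattern only and says nothing about how Moltres is implemented internally.

```python
import sqlite3


class LazyQuery:
    """Hypothetical toy builder; an illustration, not the Moltres API."""

    def __init__(self, conn, table, columns=("*",), predicates=()):
        self.conn = conn
        self.table = table
        self.columns = tuple(columns)
        self.predicates = tuple(predicates)  # (sql_fragment, params) pairs

    def select(self, *columns):
        # Returns a new plan; nothing is sent to the database yet.
        return LazyQuery(self.conn, self.table, columns, self.predicates)

    def where(self, fragment, *params):
        extra = ((fragment, params),)
        return LazyQuery(self.conn, self.table, self.columns, self.predicates + extra)

    def to_sql(self):
        # Compile the accumulated plan into one SQL statement.
        # Identifiers are interpolated as-is; a real library would validate them.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        params = []
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({frag})" for frag, _ in self.predicates)
            for _, frag_params in self.predicates:
                params.extend(frag_params)
        return sql, params

    def collect(self):
        # Execution happens only here; filtering runs inside the database.
        sql, params = self.to_sql()
        return self.conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (name TEXT, age INTEGER);
    INSERT INTO users VALUES ('Ada', 36), ('Linus', 25), ('Grace', 45);
""")

query = LazyQuery(conn, "users").select("name").where("age > ?", 40)
sql, params = query.to_sql()
rows = query.collect()
print(sql, params, rows)
```

Because each method returns a new plan object, intermediate queries can be reused and extended without touching the database until `collect()` is called.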