PySpark Feature Comparison

This document compares PySpark DataFrame API features across all six Moltres interfaces:

  1. PySpark-Style (Sync) - DataFrame - Primary PySpark-compatible API

  2. PySpark-Style (Async) - AsyncDataFrame - Async version of PySpark-style API

  3. Pandas-Style (Sync) - PandasDataFrame - Pandas-compatible API

  4. Pandas-Style (Async) - AsyncPandasDataFrame - Async version of Pandas-style API

  5. Polars-Style (Sync) - PolarsDataFrame - Polars LazyFrame-compatible API

  6. Polars-Style (Async) - AsyncPolarsDataFrame - Async version of Polars-style API

Status Indicators

  • ✅ Supported - Fully implemented with feature parity

  • ⚠️ Partial - Partially implemented or with limitations

  • ❌ Not Implemented - Not available in this interface

  • 🔄 Different API - Available but with different method name/API signature

  • 📝 Notes - Additional implementation details, differences, or limitations

Selection & Projection

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

select(*cols)

✅ (select())

✅ (select())

All interfaces support column selection

selectExpr(*exprs)

✅ (select_expr())

✅ (select_expr())

✅ (select_expr())

✅ (select_expr())

SQL expression selection

Column access df.col

✅ (__getattr__)

✅ (__getattr__)

🔄 (df['col'])

🔄 (df['col'])

🔄 (df['col'])

🔄 (df['col'])

PySpark-style supports dot notation; Pandas/Polars use bracket notation df['col'] instead

df["col"]

Pandas/Polars-style column access (bracket notation)

df[["col1", "col2"]]

Multi-column selection (Pandas/Polars)

df[df["col"] > 5]

Boolean indexing (Pandas/Polars)

Filtering

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

where(condition)

🔄 (query())

🔄 (query())

✅ (filter())

✅ (filter())

All support filtering

filter(condition)

🔄 (query())

🔄 (query())

Alias for where() in PySpark-style

query(expr)

Pandas-style string query syntax

isin(values)

Pandas-style membership check

between(start, end)

Pandas-style range check
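The membership and range checks above map naturally onto SQL `IN` and `BETWEEN`. The sketch below uses only the standard-library `sqlite3` module to show the semantics involved; the table, column names, and generated SQL are assumptions made for illustration and are not taken from Moltres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        (1, 'east', 10), (2, 'west', 20), (3, 'north', 30), (4, 'east', 40);
""")

# isin(values): one placeholder per value keeps the values out of the SQL text.
regions = ["east", "west"]
placeholders = ", ".join("?" for _ in regions)
isin_rows = conn.execute(
    f"SELECT id FROM orders WHERE region IN ({placeholders}) ORDER BY id",
    regions,
).fetchall()

# between(start, end): SQL BETWEEN includes both endpoints.
between_rows = conn.execute(
    "SELECT id FROM orders WHERE amount BETWEEN ? AND ? ORDER BY id",
    (20, 40),
).fetchall()

print(isin_rows)     # [(1,), (2,), (4,)]
print(between_rows)  # [(2,), (3,), (4,)]
```

Binding the values as parameters rather than formatting them into the query string is the usual way to avoid SQL injection in this kind of translation.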

Joins

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

join(other, on, how)

✅ (merge())

✅ (merge())

All support joins

crossJoin(other)

✅ (cross_join())

✅ (cross_join())

Cross join support

semi_join(other, on)

Semi-join (filter rows with matches)

anti_join(other, on)

Anti-join (filter rows without matches)

Join types: inner, left, right, outer

All join types supported
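Semi-joins and anti-joins differ from the other join types in that they only filter the left side and never add columns from the right. One common SQL formulation uses `EXISTS` and `NOT EXISTS`. The sketch below illustrates that formulation with `sqlite3`; the schema is hypothetical, and the SQL Moltres actually generates may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (1, 10.0), (1, 25.0), (3, 5.0);
""")

# Semi-join: keep customers with at least one matching order.
# A customer appears once even when several orders match.
semi = conn.execute("""
    SELECT c.id, c.name FROM customers AS c
    WHERE EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id)
    ORDER BY c.id
""").fetchall()

# Anti-join: keep customers with no matching order.
anti = conn.execute("""
    SELECT c.id, c.name FROM customers AS c
    WHERE NOT EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id)
    ORDER BY c.id
""").fetchall()

print(semi)  # [(1, 'Ada'), (3, 'Linus')]
print(anti)  # [(2, 'Grace')]
```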

GroupBy & Aggregations

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

groupBy(*cols)

✅ (groupby())

✅ (groupby())

✅ (group_by())

✅ (group_by())

All support grouping

agg(*exprs)

Aggregation expressions

agg({"col": "func"})

Dictionary syntax (PySpark/Pandas)

sum(), mean(), min(), max(), count()

Basic aggregations

first(), last()

First/last value per group

nunique()

Count distinct (Pandas/Polars)

std(), var()

Statistical aggregations

pivot(pivot_col)

Pivot operation
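The dictionary syntax `agg({"col": "func"})` can be read as a compact description of a `GROUP BY` query. The helper below, `agg_sql`, is a hypothetical illustration of that reading and not a Moltres function; it accepts only a small whitelist of aggregate names and assumes trusted table and column identifiers.

```python
import sqlite3

# Map the aggregate names used in the table above to SQL functions.
SQL_AGGREGATES = {"sum": "SUM", "mean": "AVG", "min": "MIN",
                  "max": "MAX", "count": "COUNT"}


def agg_sql(table, by, spec):
    """Build a GROUP BY query from a {"column": "func"} mapping."""
    parts = [f"{SQL_AGGREGATES[func]}({col}) AS {col}_{func}"
             for col, func in spec.items()]
    return (f"SELECT {by}, {', '.join(parts)} FROM {table} "
            f"GROUP BY {by} ORDER BY {by}")


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('east', 'a', 10), ('east', 'a', 30), ('west', 'b', 5);
""")

sql = agg_sql("sales", "region", {"amount": "sum", "rep": "count"})
totals = conn.execute(sql).fetchall()

# nunique() corresponds to COUNT(DISTINCT ...).
distinct_reps = conn.execute(
    "SELECT region, COUNT(DISTINCT rep) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(totals)         # [('east', 40, 2), ('west', 5, 1)]
print(distinct_reps)  # [('east', 1), ('west', 1)]
```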

Sorting

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

orderBy(*cols)

✅ (sort_values())

✅ (sort_values())

✅ (sort())

✅ (sort())

All support sorting

sort(*cols)

✅ (sort_values())

✅ (sort_values())

Alias for orderBy()

sort_values(by, ascending)

Pandas-style sorting

sort(by, descending)

Polars-style sorting

Window Functions

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

over(windowSpec)

Window function support

row_number()

Row number window function

rank(), dense_rank()

Ranking functions

lead(), lag()

Lead/lag functions

first_value(), last_value()

First/last value in window
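To make the difference between the ranking functions concrete, the sketch below runs `ROW_NUMBER`, `RANK`, `DENSE_RANK`, and `LAG` over a small table with a tie. It uses `sqlite3` directly and assumes a SQLite build with window-function support (3.25 or later); the table is invented for the example and none of this is Moltres API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 'a', 100), ('east', 'b', 100), ('east', 'c', 50),
        ('west', 'd', 70);
""")

rows = conn.execute("""
    SELECT region, rep, amount,
           ROW_NUMBER() OVER tiebroken AS row_num,
           RANK()       OVER by_amount AS rnk,
           DENSE_RANK() OVER by_amount AS dense_rnk,
           LAG(amount)  OVER tiebroken AS prev_amount
    FROM sales
    WINDOW by_amount AS (PARTITION BY region ORDER BY amount DESC),
           tiebroken AS (PARTITION BY region ORDER BY amount DESC, rep)
    ORDER BY region, row_num
""").fetchall()

for row in rows:
    print(row)
# ('east', 'a', 100, 1, 1, 1, None)
# ('east', 'b', 100, 2, 1, 1, 100)
# ('east', 'c', 50, 3, 3, 2, 100)
# ('west', 'd', 70, 1, 1, 1, None)
```

Tied rows share a rank, `RANK` leaves a gap after the tie, and `DENSE_RANK` does not. `ROW_NUMBER` needs the extra `rep` ordering key here to produce a deterministic numbering.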

Set Operations

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

union(other)

✅ (concat())

✅ (concat())

Union operation

unionAll(other)

✅ (concat())

✅ (concat())

Union all (same as union)

intersect(other)

✅ (concat() + dedup)

✅ (concat() + dedup)

Intersection

except(other)

✅ (concat() + filter)

✅ (concat() + filter)

✅ (difference())

✅ (difference())

Set difference

distinct()

✅ (drop_duplicates())

✅ (drop_duplicates())

Remove duplicates

dropDuplicates(subset)

✅ (unique())

✅ (unique())

Drop duplicates with subset
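The set operations above have close SQL counterparts, with one point worth checking: in PySpark itself, `union()` and `unionAll()` both keep duplicates, which matches SQL `UNION ALL` rather than SQL `UNION`. The sketch below shows the SQL semantics with `sqlite3` on invented tables; it does not show which form Moltres emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2), (2), (3);
    INSERT INTO b VALUES (2), (3), (4);
""")


def run(sql):
    """Run a query and return the single column as a sorted list."""
    return [row[0] for row in conn.execute(sql + " ORDER BY x")]


union_all = run("SELECT x FROM a UNION ALL SELECT x FROM b")
union = run("SELECT x FROM a UNION SELECT x FROM b")
intersect = run("SELECT x FROM a INTERSECT SELECT x FROM b")
difference = run("SELECT x FROM a EXCEPT SELECT x FROM b")
distinct = run("SELECT DISTINCT x FROM a")

print(union_all)   # [1, 2, 2, 2, 3, 3, 4]
print(union)       # [1, 2, 3, 4]
print(intersect)   # [2, 3]
print(difference)  # [1]
print(distinct)    # [1, 2, 3]
```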

Column Manipulation

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

withColumn(name, expr)

✅ (assign())

✅ (assign())

✅ (with_column())

✅ (with_column())

Add/replace column

withColumns({name: expr})

✅ (assign())

✅ (assign())

✅ (with_columns())

✅ (with_columns())

Multiple columns

drop(*cols)

Drop columns

withColumnRenamed(old, new)

✅ (rename())

✅ (rename())

✅ (rename())

✅ (rename())

Rename column

alias(name)

Column alias

assign(**kwargs)

Pandas-style column assignment

Null Handling

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

fillna(value)

✅ (fill_null())

✅ (fill_null())

Fill null values

fillna({col: value})

Column-specific fill

dropna(how, subset)

✅ (drop_nulls())

✅ (drop_nulls())

Drop null rows

na.drop()

Null handling property

na.fill(value)

Null handling property
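Filling and dropping nulls can both be expressed in SQL, which is one way such operations could be pushed down to the database. The sketch below uses `COALESCE` for filling and `IS NOT NULL` predicates for dropping, following PySpark's meaning of `how="any"` and `how="all"`. The table and fill values are made up for the example, and the SQL is an assumption rather than Moltres output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (sensor TEXT, value REAL);
    INSERT INTO readings VALUES ('a', 1.5), ('b', NULL), (NULL, 3.0), (NULL, NULL);
""")

# fillna({col: value}): replace nulls column by column.
filled = conn.execute("""
    SELECT COALESCE(sensor, 'unknown'), COALESCE(value, 0.0)
    FROM readings ORDER BY rowid
""").fetchall()

# dropna(how="any"): drop a row if any column is null.
drop_any = conn.execute("""
    SELECT sensor, value FROM readings
    WHERE sensor IS NOT NULL AND value IS NOT NULL ORDER BY rowid
""").fetchall()

# dropna(how="all"): drop a row only if every column is null.
drop_all = conn.execute("""
    SELECT sensor, value FROM readings
    WHERE sensor IS NOT NULL OR value IS NOT NULL ORDER BY rowid
""").fetchall()

print(filled)    # [('a', 1.5), ('b', 0.0), ('unknown', 3.0), ('unknown', 0.0)]
print(drop_any)  # [('a', 1.5)]
print(drop_all)  # [('a', 1.5), ('b', None), (None, 3.0)]
```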

String Operations

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

df["col"].str.upper()

String accessor

df["col"].str.lower()

String accessor

df["col"].str.contains(pattern)

String contains

df["col"].str.startswith(pattern)

String startswith

df["col"].str.endswith(pattern)

String endswith

df["col"].str.replace(old, new)

String replace

df["col"].str.split(delimiter)

String split

df["col"].str.len()

String length

upper(col), lower(col)

String functions (all)

substring(col, pos, len)

Substring function

Date/Time Operations

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

df["col"].dt.year

Date accessor

df["col"].dt.month

Date accessor

df["col"].dt.day

Date accessor

df["col"].dt.hour

Date accessor

year(col), month(col), etc.

Date functions (all)

to_date(col)

Date conversion

to_timestamp(col)

Timestamp conversion

date_add(col, days)

Date arithmetic

date_sub(col, days)

Date arithmetic

File I/O

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

spark.read.csv(path)

✅ (db.read.csv())

✅ (db.read.csv())

✅ (db.scan_csv())

✅ (db.scan_csv())

Read CSV

spark.read.json(path)

✅ (db.read.json())

✅ (db.read.json())

✅ (db.scan_json())

✅ (db.scan_json())

Read JSON

spark.read.parquet(path)

✅ (db.read.parquet())

✅ (db.read.parquet())

✅ (db.scan_parquet())

✅ (db.scan_parquet())

Read Parquet

spark.read.text(path)

✅ (db.read.text())

✅ (db.read.text())

✅ (db.scan_text())

✅ (db.scan_text())

Read text

df.write.csv(path)

✅ (write_csv())

✅ (write_csv())

Write CSV

df.write.json(path)

✅ (write_json())

✅ (write_json())

Write JSON

df.write.parquet(path)

✅ (write_parquet())

✅ (write_parquet())

Write Parquet

df.write.saveAsTable(name)

Save to table

df.write.insertInto(table)

Insert into table

df.write.mode("overwrite")

Write mode

Schema Operations

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

df.columns

Column names

df.schema

🔄 (dtypes)

🔄 (dtypes)

Schema information

df.dtypes

Column types (Pandas-style)

df.printSchema()

Print schema tree

df.schema (Polars)

Polars schema format

Execution Methods

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

collect()

✅ (await collect())

✅ (await collect())

✅ (await collect())

Execute and return results

collect() (streaming)

Streaming execution

show(n, truncate)

✅ (await show())

Display results

take(n)

✅ (await take())

✅ (fetch())

✅ (await fetch())

Take n rows

first()

✅ (await first())

First row

head(n)

✅ (await head())

First n rows

tail(n)

✅ (await tail())

Last n rows

count()

✅ (await count())

Row count

fetch(n) (Polars)

✅ (await fetch())

Polars-style fetch

Statistics & Descriptive

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

describe(*cols)

✅ (await describe())

✅ (await describe())

✅ (await describe())

✅ (await describe())

Statistical summary

summary(*stats)

✅ (await summary())

Custom statistics

nunique(column)

✅ (await nunique())

Count unique values

value_counts(column)

✅ (await value_counts())

Value frequency (Pandas)

info()

✅ (await info())

DataFrame info (Pandas)

empty

✅ (await empty)

Check if empty (Pandas)

shape

✅ (await shape)

DataFrame shape (Pandas)

width, height (Polars)

✅ (await height)

Polars dimensions

Data Reshaping

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

pivot(pivot_col, values)

Pivot operation

explode(col)

Explode array/JSON

melt(id_vars, value_vars)

Unpivot (Pandas/Polars)

unnest(cols)

Unnest nested structures (Polars)

slice(offset, length)

Slice rows (Polars)

Sampling & Limiting

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

sample(fraction, seed)

Random sampling

limit(n)

Limit rows

CTEs & SQL

PySpark Method

PySpark-Style (Sync)

PySpark-Style (Async)

Pandas-Style (Sync)

Pandas-Style (Async)

Polars-Style (Sync)

Polars-Style (Async)

Notes

cte(name)

Common Table Expression

with_recursive(name)

Recursive CTE (Polars)

spark.sql(query)

✅ (db.sql())

✅ (await db.sql())

Raw SQL execution

to_sql()

Get SQL string

to_sqlalchemy()

Get SQLAlchemy statement
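For readers less familiar with recursive CTEs, the sketch below shows the kind of SQL that a recursive CTE helper would ultimately need to produce. It walks an employee hierarchy with `WITH RECURSIVE`, executed through `sqlite3`. The schema and CTE name are hypothetical, and the example does not use or describe the Moltres `cte()` or `with_recursive()` signatures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'root', NULL), (2, 'mid', 1), (3, 'leaf', 2), (4, 'other', 1);
""")

hierarchy = conn.execute("""
    WITH RECURSIVE reports(id, name, depth) AS (
        -- Anchor member: employees without a manager.
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        -- Recursive member: direct reports of rows found so far.
        SELECT e.id, e.name, r.depth + 1
        FROM employees AS e JOIN reports AS r ON e.manager_id = r.id
    )
    SELECT name, depth FROM reports ORDER BY depth, id
""").fetchall()

print(hierarchy)  # [('root', 0), ('mid', 1), ('other', 1), ('leaf', 2)]
```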

Interface-Specific Features

Pandas-Style Unique Features

Feature

Pandas-Style (Sync)

Pandas-Style (Async)

Notes

query(expr)

String-based query syntax

loc indexer

✅ (await loc)

Label-based indexing

iloc indexer

✅ (await iloc)

Integer-based indexing

append(other)

Append DataFrames

isin(values)

Membership check

between(start, end)

Range check

assign(**kwargs)

Column assignment

sort_values(by, ascending)

Sorting with parameters

value_counts(column)

Value frequency

info()

DataFrame info

melt()

Unpivot operation

Polars-Style Unique Features

Feature

Polars-Style (Sync)

Polars-Style (Async)

Notes

lazy()

Mark as lazy (no-op in Moltres)

fetch(n)

✅ (await fetch())

Fetch n rows

with_columns(*exprs)

Multiple column operations

with_columns_renamed(mapping)

Rename multiple columns

with_row_count(name)

Add row number column

with_context(df)

Add context DataFrame

with_recursive(name)

Recursive CTE

unnest(cols)

Unnest nested structures

slice(offset, length)

Slice rows

gather_every(n, offset)

Sample every nth row

interpolate(method)

Interpolate missing values

quantile(quantile)

Quantile calculation

hstack(other)

Horizontal stack

vstack(other)

Vertical stack

difference(other)

Set difference

cross_join(other)

Cross join

drop_nulls(subset)

Drop nulls

fill_null(value)

Fill nulls

unique(subset)

Unique rows

explain(format)

✅ (await explain())

Query plan explanation

Async-Specific Features

Feature

PySpark-Style (Async)

Pandas-Style (Async)

Polars-Style (Async)

Notes

await collect()

Async execution

await collect(stream=True)

Async streaming

await show()

Async display

await take()

✅ (await fetch())

Async take

await first()

Async first row

await head()

Async head

await tail()

Async tail

await count()

Async count

await describe()

Async describe

await summary()

Async summary

await nunique()

Async nunique

await value_counts()

Async value_counts

await info()

Async info

await shape

Async shape

await empty

Async empty

await height

Async height

await schema

Async schema

await dtypes

Async dtypes

await fetch()

Async fetch

await write_csv()

Async write CSV

await write_json()

Async write JSON

await write_parquet()

Async write Parquet

await explain()

Async explain

await loc

Async loc indexer

await iloc

Async iloc indexer

Summary

Overall Coverage

  • PySpark-Style (Sync): ~98% API compatibility with PySpark DataFrame API

  • PySpark-Style (Async): Full async support for all sync methods

  • Pandas-Style (Sync): Comprehensive Pandas DataFrame API with SQL pushdown

  • Pandas-Style (Async): Full async support for all Pandas-style methods

  • Polars-Style (Sync): Comprehensive Polars LazyFrame API with SQL pushdown

  • Polars-Style (Async): Full async support for all Polars-style methods

Key Differences

  1. Column Access: PySpark uses df.col attribute access, while Pandas/Polars use df['col'] bracket notation

  2. Filtering: PySpark uses where()/filter(), Pandas uses query(), Polars uses filter()

  3. GroupBy: PySpark uses groupBy(), Pandas uses groupby(), Polars uses group_by()

  4. Sorting: PySpark uses orderBy(), Pandas uses sort_values(), Polars uses sort()

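  5. Null Handling: PySpark-style exposes the na property (na.drop(), na.fill()), while Pandas-style and Polars-style call methods directly on the DataFrame, such as fillna()/dropna() or fill_null()/drop_nulls()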

  6. File I/O: PySpark uses spark.read.*, Moltres uses db.read.* or db.scan_* (Polars)

  7. Async: All async interfaces require await for execution methods
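The async rule in item 7 can be sketched with plain `asyncio`: transformations stay synchronous and only build up a plan, while execution methods are coroutines that must be awaited. `ToyAsyncFrame` below is a hypothetical stand-in written for this example; it is not the Moltres `AsyncDataFrame`, and it filters an in-memory list rather than querying a database.

```python
import asyncio


class ToyAsyncFrame:
    """Hypothetical stand-in for an async DataFrame; not a Moltres class."""

    def __init__(self, rows, predicates=()):
        self._rows = rows
        self._predicates = tuple(predicates)

    def filter(self, predicate):
        # Transformations are synchronous: they only record work to do.
        return ToyAsyncFrame(self._rows, self._predicates + (predicate,))

    async def collect(self):
        # Execution methods are coroutines, so callers must await them.
        await asyncio.sleep(0)  # stands in for a database round trip
        return [r for r in self._rows if all(p(r) for p in self._predicates)]

    async def count(self):
        return len(await self.collect())


async def main():
    df = ToyAsyncFrame([{"age": 20}, {"age": 35}, {"age": 50}])
    adults = df.filter(lambda r: r["age"] > 30)  # no await: nothing runs yet
    # Independent executions can be awaited concurrently.
    rows, n = await asyncio.gather(adults.collect(), adults.count())
    return rows, n


if __name__ == "__main__":
    print(asyncio.run(main()))  # ([{'age': 35}, {'age': 50}], 2)
```

Forgetting `await` on an execution method returns a coroutine object instead of data, which is the most common mistake when porting synchronous code to an async interface.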

Implementation Notes

  • All interfaces maintain lazy evaluation until execution

  • SQL pushdown execution for all operations

  • Type safety with proper type hints

  • Comprehensive error handling and validation

  • Support for multiple database dialects (SQLite, PostgreSQL, MySQL, DuckDB)
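The first two notes, lazy evaluation and SQL pushdown, can be illustrated with a toy query builder: each method returns a new plan object, and the database is only contacted when `collect()` is called with a single compiled statement. `ToyLazyFrame` is a hypothetical class written for this sketch and says nothing about how Moltres is implemented internally; it assumes trusted table and column names.

```python
import sqlite3


class ToyLazyFrame:
    """Hypothetical lazy frame; illustrates the idea, not Moltres internals."""

    def __init__(self, conn, table, columns=("*",), conditions=(), params=()):
        self._conn = conn
        self._table = table
        self._columns = tuple(columns)
        self._conditions = tuple(conditions)
        self._params = tuple(params)

    def select(self, *columns):
        # Returns a new plan; no query is sent to the database.
        return ToyLazyFrame(self._conn, self._table, columns,
                            self._conditions, self._params)

    def where(self, condition, *params):
        return ToyLazyFrame(self._conn, self._table, self._columns,
                            self._conditions + (condition,),
                            self._params + params)

    def to_sql(self):
        sql = f"SELECT {', '.join(self._columns)} FROM {self._table}"
        if self._conditions:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self._conditions)
        return sql

    def collect(self):
        # The only place where the database is touched.
        return self._conn.execute(self.to_sql(), self._params).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (name TEXT, age INTEGER);
    INSERT INTO users VALUES ('Ada', 36), ('Bob', 17), ('Cy', 52);
""")

query = ToyLazyFrame(conn, "users").select("name").where("age > ?", 18)
print(query.to_sql())           # SELECT name FROM users WHERE (age > ?)
print(sorted(query.collect()))  # [('Ada',), ('Cy',)]
```

Because the filter runs inside the database, only matching rows cross the connection, which is the main practical benefit of pushdown over fetching everything and filtering in Python.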