PySpark Feature Comparison

This document compares PySpark DataFrame API features across all six Moltres interfaces:

PySpark-Style (Sync) - DataFrame - Primary PySpark-compatible API
PySpark-Style (Async) - AsyncDataFrame - Async version of PySpark-style API
Pandas-Style (Sync) - PandasDataFrame - Pandas-compatible API
Pandas-Style (Async) - AsyncPandasDataFrame - Async version of Pandas-style API
Polars-Style (Sync) - PolarsDataFrame - Polars LazyFrame-compatible API
Polars-Style (Async) - AsyncPolarsDataFrame - Async version of Polars-style API

Status Indicators

✅ Supported - Fully implemented with feature parity
⚠️ Partial - Partially implemented or with limitations
❌ Not Implemented - Not available in this interface
🔄 Different API - Available but with different method name/API signature
📝 Notes - Additional implementation details, differences, or limitations

Selection & Projection

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`select(*cols)`	✅	✅	✅ (`select()`)	✅ (`select()`)	✅	✅	All interfaces support column selection
`selectExpr(*exprs)`	✅	✅	✅ (`select_expr()`)	✅ (`select_expr()`)	✅ (`select_expr()`)	✅ (`select_expr()`)	SQL expression selection
Column access `df.col`	✅ (`__getattr__`)	✅ (`__getattr__`)	🔄 (`df['col']`)	🔄 (`df['col']`)	🔄 (`df['col']`)	🔄 (`df['col']`)	PySpark-style supports dot notation; Pandas/Polars use bracket notation `df['col']` instead
`df["col"]`	❌	❌	✅	✅	✅	✅	Pandas/Polars-style column access (bracket notation)
`df[["col1", "col2"]]`	❌	❌	✅	✅	✅	✅	Multi-column selection (Pandas/Polars)
`df[df["col"] > 5]`	❌	❌	✅	✅	✅	✅	Boolean indexing (Pandas/Polars)

Filtering

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`where(condition)`	✅	✅	🔄 (`query()`)	🔄 (`query()`)	✅ (`filter()`)	✅ (`filter()`)	All support filtering
`filter(condition)`	✅	✅	🔄 (`query()`)	🔄 (`query()`)	✅	✅	Alias for `where()` in PySpark-style
`query(expr)`	❌	❌	✅	✅	❌	❌	Pandas-style string query syntax
`isin(values)`	❌	❌	✅	✅	❌	❌	Pandas-style membership check
`between(start, end)`	❌	❌	✅	✅	❌	❌	Pandas-style range check

Joins

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`join(other, on, how)`	✅	✅	✅ (`merge()`)	✅ (`merge()`)	✅	✅	All support joins
`crossJoin(other)`	✅	✅	❌	❌	✅ (`cross_join()`)	✅ (`cross_join()`)	Cross join support
`semi_join(other, on)`	✅	✅	❌	❌	❌	❌	Semi-join (filter rows with matches)
`anti_join(other, on)`	✅	✅	❌	❌	❌	❌	Anti-join (filter rows without matches)
Join types: `inner`, `left`, `right`, `outer`	✅	✅	✅	✅	✅	✅	All join types supported
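
The semi-join and anti-join rows above describe filtering semantics rather than ordinary joins. As a minimal sketch of what those semantics mean in SQL, the example below uses the standard-library `sqlite3` module with hypothetical `customers` and `orders` tables. It illustrates the concept only and makes no claim about the SQL that Moltres actually generates.

```python
import sqlite3

# In-memory tables standing in for two DataFrames (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bo'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 10.0), (1, 25.0), (3, 5.0);
""")

# Semi-join: keep customers that have at least one matching order.
# Each customer appears once, even with several matching orders.
semi = conn.execute("""
    SELECT c.id, c.name FROM customers AS c
    WHERE EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id)
    ORDER BY c.id
""").fetchall()

# Anti-join: keep customers that have no matching order.
anti = conn.execute("""
    SELECT c.id, c.name FROM customers AS c
    WHERE NOT EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id)
    ORDER BY c.id
""").fetchall()

print(semi)  # [(1, 'Ada'), (3, 'Cy')]
print(anti)  # [(2, 'Bo')]
```

Unlike an inner join, neither query returns columns from `orders` or duplicates a customer with several orders.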

GroupBy & Aggregations

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`groupBy(*cols)`	✅	✅	✅ (`groupby()`)	✅ (`groupby()`)	✅ (`group_by()`)	✅ (`group_by()`)	All support grouping
`agg(*exprs)`	✅	✅	✅	✅	✅	✅	Aggregation expressions
`agg({"col": "func"})`	✅	✅	✅	✅	❌	❌	Dictionary syntax (PySpark/Pandas)
`sum()`, `mean()`, `min()`, `max()`, `count()`	✅	✅	✅	✅	✅	✅	Basic aggregations
`first()`, `last()`	✅	✅	✅	✅	✅	✅	First/last value per group
`nunique()`	❌	❌	✅	✅	✅	✅	Count distinct (Pandas/Polars)
`std()`, `var()`	✅	✅	✅	✅	✅	✅	Statistical aggregations
`pivot(pivot_col)`	✅	✅	✅	✅	✅	✅	Pivot operation
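
To make the dictionary syntax `agg({"col": "func"})` concrete, here is a rough sketch of how such a mapping could be translated into a `GROUP BY` query. The helper `agg_dict_to_sql`, the `SQL_AGGREGATES` mapping, and the `sales` table are all hypothetical names introduced for illustration; the real translation inside Moltres may differ.

```python
import sqlite3

# Hypothetical mapping from aggregation names to SQL functions.
SQL_AGGREGATES = {"sum": "SUM", "mean": "AVG", "min": "MIN", "max": "MAX", "count": "COUNT"}

def agg_dict_to_sql(table, group_cols, aggs):
    """Build a GROUP BY query from a {"column": "function"} mapping.

    Identifiers are interpolated directly, so this toy helper is only
    suitable for trusted, hard-coded names.
    """
    select_parts = list(group_cols)
    for column, func in aggs.items():
        sql_func = SQL_AGGREGATES[func]  # KeyError for unsupported names
        select_parts.append(f"{sql_func}({column}) AS {func}_{column}")
    group_by = ", ".join(group_cols)
    return (
        f"SELECT {', '.join(select_parts)} FROM {table} "
        f"GROUP BY {group_by} ORDER BY {group_by}"
    )

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL, qty INTEGER);
    INSERT INTO sales VALUES ('east', 10.0, 1), ('east', 30.0, 4), ('west', 5.0, 2);
""")

sql = agg_dict_to_sql("sales", ["region"], {"amount": "sum", "qty": "max"})
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)  # [('east', 40.0, 4), ('west', 5.0, 2)]
```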

Sorting

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`orderBy(*cols)`	✅	✅	✅ (`sort_values()`)	✅ (`sort_values()`)	✅ (`sort()`)	✅ (`sort()`)	All support sorting
`sort(*cols)`	✅	✅	✅ (`sort_values()`)	✅ (`sort_values()`)	✅	✅	Alias for `orderBy()`
`sort_values(by, ascending)`	❌	❌	✅	✅	❌	❌	Pandas-style sorting
`sort(by, descending)`	❌	❌	❌	❌	✅	✅	Polars-style sorting

Window Functions

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`over(windowSpec)`	✅	✅	✅	✅	✅	✅	Window function support
`row_number()`	✅	✅	✅	✅	✅	✅	Row number window function
`rank()`, `dense_rank()`	✅	✅	✅	✅	✅	✅	Ranking functions
`lead()`, `lag()`	✅	✅	✅	✅	✅	✅	Lead/lag functions
`first_value()`, `last_value()`	✅	✅	✅	✅	✅	✅	First/last value in window
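
The ranking functions listed above differ only in how they treat ties. The sketch below shows that difference in plain SQL through the standard-library `sqlite3` module, assuming the bundled SQLite is version 3.25 or newer (the first release with window functions). The `scores` table is hypothetical, and the query is an illustration of the semantics, not output captured from Moltres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (team TEXT, player TEXT, points INTEGER);
    INSERT INTO scores VALUES
        ('red', 'a', 30), ('red', 'b', 30), ('red', 'c', 10),
        ('blue', 'd', 20), ('blue', 'e', 15);
""")

# row_num breaks ties by player name so it is deterministic;
# rnk and dense_rnk order by points only, so tied rows share a rank.
rows = conn.execute("""
    SELECT team, player, points,
           ROW_NUMBER() OVER (PARTITION BY team ORDER BY points DESC, player) AS row_num,
           RANK()       OVER (PARTITION BY team ORDER BY points DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY team ORDER BY points DESC) AS dense_rnk
    FROM scores
    ORDER BY team, row_num
""").fetchall()

for row in rows:
    print(row)
```

In the `red` partition the two players tied on 30 points both get rank 1; the next player gets `RANK()` 3 but `DENSE_RANK()` 2.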

Set Operations

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`union(other)`	✅	✅	✅ (`concat()`)	✅ (`concat()`)	✅	✅	Union operation
`unionAll(other)`	✅	✅	✅ (`concat()`)	✅ (`concat()`)	✅	✅	Union all (same as union)
`intersect(other)`	✅	✅	✅ (`concat()` + dedup)	✅ (`concat()` + dedup)	✅	✅	Intersection
`except(other)`	✅	✅	✅ (`concat()` + filter)	✅ (`concat()` + filter)	✅ (`difference()`)	✅ (`difference()`)	Set difference
`distinct()`	✅	✅	✅ (`drop_duplicates()`)	✅ (`drop_duplicates()`)	✅	✅	Remove duplicates
`dropDuplicates(subset)`	✅	✅	✅	✅	✅ (`unique()`)	✅ (`unique()`)	Drop duplicates with subset
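
For readers mapping these set operations onto SQL, the sketch below runs the corresponding statements with the standard-library `sqlite3` module on two hypothetical single-column tables. In PySpark itself, `union()` keeps duplicates (SQL `UNION ALL` semantics); this document does not state whether every Moltres interface does the same, so treat the example as a reference for the SQL semantics only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2), (2);
    INSERT INTO b VALUES (2), (3);
""")

def run(sql):
    """Execute a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]

union_all = run("SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x")
intersect = run("SELECT x FROM a INTERSECT SELECT x FROM b ORDER BY x")
difference = run("SELECT x FROM a EXCEPT SELECT x FROM b ORDER BY x")
distinct = run("SELECT DISTINCT x FROM a ORDER BY x")

print(union_all)   # [1, 2, 2, 2, 3]
print(intersect)   # [2]
print(difference)  # [1]
print(distinct)    # [1, 2]
```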

Column Manipulation

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`withColumn(name, expr)`	✅	✅	✅ (`assign()`)	✅ (`assign()`)	✅ (`with_column()`)	✅ (`with_column()`)	Add/replace column
`withColumns({name: expr})`	❌	❌	✅ (`assign()`)	✅ (`assign()`)	✅ (`with_columns()`)	✅ (`with_columns()`)	Multiple columns
`drop(*cols)`	✅	✅	✅	✅	✅	✅	Drop columns
`withColumnRenamed(old, new)`	✅	✅	✅ (`rename()`)	✅ (`rename()`)	✅ (`rename()`)	✅ (`rename()`)	Rename column
`alias(name)`	✅	✅	✅	✅	✅	✅	Column alias
`assign(**kwargs)`	❌	❌	✅	✅	❌	❌	Pandas-style column assignment

Null Handling

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`fillna(value)`	✅	✅	✅	✅	✅ (`fill_null()`)	✅ (`fill_null()`)	Fill null values
`fillna({col: value})`	✅	✅	✅	✅	✅	✅	Column-specific fill
`dropna(how, subset)`	✅	✅	✅	✅	✅ (`drop_nulls()`)	✅ (`drop_nulls()`)	Drop null rows
`na.drop()`	✅	✅	❌	❌	❌	❌	Null handling property
`na.fill(value)`	✅	✅	❌	❌	❌	❌	Null handling property

String Operations

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`df["col"].str.upper()`	❌	❌	✅	✅	✅	✅	String accessor
`df["col"].str.lower()`	❌	❌	✅	✅	✅	✅	String accessor
`df["col"].str.contains(pattern)`	❌	❌	✅	✅	✅	✅	String contains
`df["col"].str.startswith(pattern)`	❌	❌	✅	✅	✅	✅	String startswith
`df["col"].str.endswith(pattern)`	❌	❌	✅	✅	✅	✅	String endswith
`df["col"].str.replace(old, new)`	❌	❌	✅	✅	✅	✅	String replace
`df["col"].str.split(delimiter)`	❌	❌	✅	✅	✅	✅	String split
`df["col"].str.len()`	❌	❌	✅	✅	✅	✅	String length
`upper(col)`, `lower(col)`	✅	✅	✅	✅	✅	✅	String functions (all)
`substring(col, pos, len)`	✅	✅	✅	✅	✅	✅	Substring function

Date/Time Operations

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`df["col"].dt.year`	❌	❌	✅	✅	✅	✅	Date accessor
`df["col"].dt.month`	❌	❌	✅	✅	✅	✅	Date accessor
`df["col"].dt.day`	❌	❌	✅	✅	✅	✅	Date accessor
`df["col"].dt.hour`	❌	❌	✅	✅	✅	✅	Date accessor
`year(col)`, `month(col)`, etc.	✅	✅	✅	✅	✅	✅	Date functions (all)
`to_date(col)`	✅	✅	✅	✅	✅	✅	Date conversion
`to_timestamp(col)`	✅	✅	✅	✅	✅	✅	Timestamp conversion
`date_add(col, days)`	✅	✅	✅	✅	✅	✅	Date arithmetic
`date_sub(col, days)`	✅	✅	✅	✅	✅	✅	Date arithmetic

File I/O

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`spark.read.csv(path)`	✅ (`db.read.csv()`)	✅ (`db.read.csv()`)	❌	❌	✅ (`db.scan_csv()`)	✅ (`db.scan_csv()`)	Read CSV
`spark.read.json(path)`	✅ (`db.read.json()`)	✅ (`db.read.json()`)	❌	❌	✅ (`db.scan_json()`)	✅ (`db.scan_json()`)	Read JSON
`spark.read.parquet(path)`	✅ (`db.read.parquet()`)	✅ (`db.read.parquet()`)	❌	❌	✅ (`db.scan_parquet()`)	✅ (`db.scan_parquet()`)	Read Parquet
`spark.read.text(path)`	✅ (`db.read.text()`)	✅ (`db.read.text()`)	❌	❌	✅ (`db.scan_text()`)	✅ (`db.scan_text()`)	Read text
`df.write.csv(path)`	✅	✅	❌	❌	✅ (`write_csv()`)	✅ (`write_csv()`)	Write CSV
`df.write.json(path)`	✅	✅	❌	❌	✅ (`write_json()`)	✅ (`write_json()`)	Write JSON
`df.write.parquet(path)`	✅	✅	❌	❌	✅ (`write_parquet()`)	✅ (`write_parquet()`)	Write Parquet
`df.write.saveAsTable(name)`	✅	✅	❌	❌	❌	❌	Save to table
`df.write.insertInto(table)`	✅	✅	❌	❌	❌	❌	Insert into table
`df.write.mode("overwrite")`	✅	✅	❌	❌	✅	✅	Write mode

Schema Operations

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`df.columns`	✅	✅	✅	✅	✅	✅	Column names
`df.schema`	✅	✅	🔄 (`dtypes`)	🔄 (`dtypes`)	✅	✅	Schema information
`df.dtypes`	✅	✅	✅	✅	❌	❌	Column types (Pandas-style)
`df.printSchema()`	✅	✅	❌	❌	❌	❌	Print schema tree
`df.schema` (Polars)	❌	❌	❌	❌	✅	✅	Polars schema format

Execution Methods

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`collect()`	✅	✅ (`await collect()`)	✅	✅ (`await collect()`)	✅	✅ (`await collect()`)	Execute and return results
`collect(stream=True)`	✅	✅ (`await collect(stream=True)`)	✅	✅ (`await collect(stream=True)`)	✅	✅ (`await collect(stream=True)`)	Streaming execution
`show(n, truncate)`	✅	✅ (`await show()`)	❌	❌	❌	❌	Display results
`take(n)`	✅	✅ (`await take()`)	❌	❌	✅ (`fetch()`)	✅ (`await fetch()`)	Take n rows
`first()`	✅	✅ (`await first()`)	❌	❌	❌	❌	First row
`head(n)`	✅	✅ (`await head()`)	✅	✅	✅	✅	First n rows
`tail(n)`	✅	✅ (`await tail()`)	✅	✅	✅	✅	Last n rows
`count()`	✅	✅ (`await count()`)	❌	❌	❌	❌	Row count
`fetch(n)` (Polars)	❌	❌	❌	❌	✅	✅ (`await fetch()`)	Polars-style fetch

Statistics & Descriptive

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`describe(*cols)`	✅	✅ (`await describe()`)	✅	✅ (`await describe()`)	✅	✅ (`await describe()`)	Statistical summary
`summary(*stats)`	✅	✅ (`await summary()`)	❌	❌	❌	❌	Custom statistics
`nunique(column)`	❌	❌	✅	✅ (`await nunique()`)	✅	✅ (`await nunique()`)	Count unique values
`value_counts(column)`	❌	❌	✅	✅ (`await value_counts()`)	❌	❌	Value frequency (Pandas)
`info()`	❌	❌	✅	✅ (`await info()`)	❌	❌	DataFrame info (Pandas)
`empty`	❌	❌	✅	✅ (`await empty`)	❌	❌	Check if empty (Pandas)
`shape`	❌	❌	✅	✅ (`await shape`)	❌	❌	DataFrame shape (Pandas)
`width`, `height` (Polars)	❌	❌	❌	❌	✅	✅ (`await height`)	Polars dimensions

Data Reshaping

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`pivot(pivot_col, values)`	✅	✅	✅	✅	✅	✅	Pivot operation
`explode(col)`	✅	✅	✅	✅	✅	✅	Explode array/JSON
`melt(id_vars, value_vars)`	❌	❌	✅	✅	✅	✅	Unpivot (Pandas/Polars)
`unnest(cols)`	❌	❌	❌	❌	✅	✅	Unnest nested structures (Polars)
`slice(offset, length)`	❌	❌	❌	❌	✅	✅	Slice rows (Polars)

Sampling & Limiting

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`sample(fraction, seed)`	✅	✅	✅	✅	✅	✅	Random sampling
`limit(n)`	✅	✅	✅	✅	✅	✅	Limit rows

CTEs & SQL

PySpark Method	PySpark-Style (Sync)	PySpark-Style (Async)	Pandas-Style (Sync)	Pandas-Style (Async)	Polars-Style (Sync)	Polars-Style (Async)	Notes
`cte(name)`	✅	✅	✅	✅	✅	✅	Common Table Expression
`with_recursive(name)`	❌	❌	❌	❌	✅	✅	Recursive CTE (Polars)
`spark.sql(query)`	✅ (`db.sql()`)	✅ (`await db.sql()`)	❌	❌	❌	❌	Raw SQL execution
`to_sql()`	✅	✅	✅	✅	❌	❌	Get SQL string
`to_sqlalchemy()`	✅	✅	✅	✅	✅	✅	Get SQLAlchemy statement

Interface-Specific Features

Pandas-Style Unique Features

Feature	Pandas-Style (Sync)	Pandas-Style (Async)	Notes
`query(expr)`	✅	✅	String-based query syntax
`loc` indexer	✅	✅ (`await loc`)	Label-based indexing
`iloc` indexer	✅	✅ (`await iloc`)	Integer-based indexing
`append(other)`	✅	✅	Append DataFrames
`isin(values)`	✅	✅	Membership check
`between(start, end)`	✅	✅	Range check
`assign(**kwargs)`	✅	✅	Column assignment
`sort_values(by, ascending)`	✅	✅	Sorting with parameters
`value_counts(column)`	✅	✅	Value frequency
`info()`	✅	✅	DataFrame info
`melt()`	✅	✅	Unpivot operation (the Data Reshaping table lists it for Polars-style as well)

Polars-Style Unique Features

Feature	Polars-Style (Sync)	Polars-Style (Async)	Notes
`lazy()`	✅	✅	Mark as lazy (no-op in Moltres)
`fetch(n)`	✅	✅ (`await fetch()`)	Fetch n rows
`with_columns(*exprs)`	✅	✅	Multiple column operations
`with_columns_renamed(mapping)`	✅	✅	Rename multiple columns
`with_row_count(name)`	✅	✅	Add row number column
`with_context(df)`	✅	✅	Add context DataFrame
`with_recursive(name)`	✅	✅	Recursive CTE
`unnest(cols)`	✅	✅	Unnest nested structures
`slice(offset, length)`	✅	✅	Slice rows
`gather_every(n, offset)`	✅	✅	Sample every nth row
`interpolate(method)`	✅	✅	Interpolate missing values
`quantile(quantile)`	✅	✅	Quantile calculation
`hstack(other)`	✅	✅	Horizontal stack
`vstack(other)`	✅	✅	Vertical stack
`difference(other)`	✅	✅	Set difference
`cross_join(other)`	✅	✅	Cross join
`drop_nulls(subset)`	✅	✅	Drop nulls
`fill_null(value)`	✅	✅	Fill nulls
`unique(subset)`	✅	✅	Unique rows
`explain(format)`	✅	✅ (`await explain()`)	Query plan explanation

Async-Specific Features

Feature	PySpark-Style (Async)	Pandas-Style (Async)	Polars-Style (Async)	Notes
`await collect()`	✅	✅	✅	Async execution
`await collect(stream=True)`	✅	✅	✅	Async streaming
`await show()`	✅	❌	❌	Async display
`await take()`	✅	❌	✅ (`await fetch()`)	Async take
`await first()`	✅	❌	❌	Async first row
`await head()`	✅	✅	✅	Async head
`await tail()`	✅	✅	✅	Async tail
`await count()`	✅	❌	❌	Async count
`await describe()`	✅	✅	✅	Async describe
`await summary()`	✅	❌	❌	Async summary
`await nunique()`	❌	✅	✅	Async nunique
`await value_counts()`	❌	✅	❌	Async value_counts
`await info()`	❌	✅	❌	Async info
`await shape`	❌	✅	❌	Async shape
`await empty`	❌	✅	❌	Async empty
`await height`	❌	❌	✅	Async height
`await schema`	❌	❌	✅	Async schema
`await dtypes`	❌	✅	❌	Async dtypes
`await fetch()`	❌	❌	✅	Async fetch
`await write_csv()`	❌	❌	✅	Async write CSV
`await write_json()`	❌	❌	✅	Async write JSON
`await write_parquet()`	❌	❌	✅	Async write Parquet
`await explain()`	❌	❌	✅	Async explain
`await loc`	❌	✅	❌	Async loc indexer
`await iloc`	❌	✅	❌	Async iloc indexer
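
The pattern behind the table above is that building a query stays synchronous while anything that executes it must be awaited. The sketch below imitates that split with a hypothetical `ToyAsyncFrame` class built on `asyncio` and `sqlite3` (requires Python 3.9+ for `asyncio.to_thread`). It is not the Moltres `AsyncDataFrame`, and its constructor and internals are assumptions made for illustration.

```python
import asyncio
import sqlite3

class ToyAsyncFrame:
    """Hypothetical stand-in for an async DataFrame; not the Moltres class."""

    def __init__(self, sql, params=()):
        self.sql = sql
        self.params = tuple(params)

    def limit(self, n):
        # Building the query is synchronous and runs nothing.
        return ToyAsyncFrame(f"{self.sql} LIMIT ?", self.params + (n,))

    def _run(self):
        # Blocking database work, executed in a worker thread.
        conn = sqlite3.connect(":memory:")
        try:
            conn.execute("CREATE TABLE t (x INTEGER)")
            conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(5)])
            return conn.execute(self.sql, self.params).fetchall()
        finally:
            conn.close()

    async def collect(self):
        # Execution is a coroutine, so callers must await it.
        return await asyncio.to_thread(self._run)

async def main():
    frame = ToyAsyncFrame("SELECT x FROM t ORDER BY x").limit(3)  # no await needed
    return await frame.collect()

result = asyncio.run(main())
print(result)  # [(0,), (1,), (2,)]
```

Forgetting `await` on `collect()` would return a coroutine object instead of rows, which is the most common mistake when porting sync code to an async interface.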

Summary

Overall Coverage

PySpark-Style (Sync): ~98% compatibility with the PySpark DataFrame API
PySpark-Style (Async): Full async support for all sync methods
Pandas-Style (Sync): Comprehensive Pandas DataFrame API with SQL pushdown
Pandas-Style (Async): Full async support for all Pandas-style methods
Polars-Style (Sync): Comprehensive Polars LazyFrame API with SQL pushdown
Polars-Style (Async): Full async support for all Polars-style methods

Key Differences

Column Access: PySpark uses df.col attribute access, while Pandas/Polars use df['col'] bracket notation
Filtering: PySpark uses where()/filter(), Pandas uses query(), Polars uses filter()
GroupBy: PySpark uses groupBy(), Pandas uses groupby(), Polars uses group_by()
Sorting: PySpark uses orderBy(), Pandas uses sort_values(), Polars uses sort()
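Null Handling: PySpark-style adds the na property (na.drop(), na.fill()) on top of fillna()/dropna(); Pandas-style uses fillna()/dropna(); Polars-style uses fill_null()/drop_nulls()
File I/O: PySpark uses spark.read.*; Moltres uses db.read.* (PySpark-style) or db.scan_* (Polars-style); the Pandas-style interfaces have no file I/O methods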
Async: All async interfaces require await for execution methods
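
The naming differences above can be captured in a small lookup table, which may help when porting code between styles. The sketch below is plain Python; the method names are copied from the comparison tables in this document, while `METHOD_NAMES` and `equivalent_method` are hypothetical helpers and not part of Moltres.

```python
# Method names taken from the comparison tables in this document.
# None means the table marks the feature as not implemented.
METHOD_NAMES = {
    "where": {"pyspark": "where", "pandas": "query", "polars": "filter"},
    "join": {"pyspark": "join", "pandas": "merge", "polars": "join"},
    "crossJoin": {"pyspark": "crossJoin", "pandas": None, "polars": "cross_join"},
    "groupBy": {"pyspark": "groupBy", "pandas": "groupby", "polars": "group_by"},
    "orderBy": {"pyspark": "orderBy", "pandas": "sort_values", "polars": "sort"},
    "distinct": {"pyspark": "distinct", "pandas": "drop_duplicates", "polars": "distinct"},
    "withColumn": {"pyspark": "withColumn", "pandas": "assign", "polars": "with_column"},
    "withColumnRenamed": {"pyspark": "withColumnRenamed", "pandas": "rename", "polars": "rename"},
    "fillna": {"pyspark": "fillna", "pandas": "fillna", "polars": "fill_null"},
    "dropna": {"pyspark": "dropna", "pandas": "dropna", "polars": "drop_nulls"},
}

def equivalent_method(pyspark_name, style):
    """Return the method name for `style`, or None if not implemented.

    Raises KeyError for a PySpark method or style missing from the table.
    """
    return METHOD_NAMES[pyspark_name][style]

print(equivalent_method("groupBy", "polars"))    # group_by
print(equivalent_method("orderBy", "pandas"))    # sort_values
print(equivalent_method("crossJoin", "pandas"))  # None
```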

Implementation Notes

All interfaces maintain lazy evaluation until execution
SQL pushdown execution for all operations
Type safety with proper type hints
Comprehensive error handling and validation
Support for multiple database dialects (SQLite, PostgreSQL, MySQL, DuckDB)
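
Lazy evaluation combined with SQL pushdown means that chained calls only describe a query, and the database does the work once an execution method runs. The toy builder below sketches that idea with `sqlite3`. `LazyQuery` and the `people` table are hypothetical, and although its `to_sql()` and `collect()` methods borrow names from the tables above, this is not the Moltres implementation.

```python
import sqlite3

class LazyQuery:
    """Hypothetical toy query builder; not the Moltres implementation."""

    def __init__(self, conn, table, columns="*", conditions=(), params=()):
        self.conn = conn
        self.table = table
        self.columns = columns
        self.conditions = tuple(conditions)
        self.params = tuple(params)

    def select(self, *cols):
        # Returns a new description of the query; nothing is executed.
        return LazyQuery(self.conn, self.table, ", ".join(cols),
                         self.conditions, self.params)

    def where(self, condition, *params):
        # Conditions accumulate and are combined with AND at compile time.
        return LazyQuery(self.conn, self.table, self.columns,
                         self.conditions + (condition,), self.params + params)

    def to_sql(self):
        sql = f"SELECT {self.columns} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self.conditions)
        return sql

    def collect(self):
        # The only method that touches the database.
        return self.conn.execute(self.to_sql(), self.params).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (name TEXT, age INTEGER);
    INSERT INTO people VALUES ('Ada', 36), ('Bo', 17), ('Cy', 52);
""")

query = LazyQuery(conn, "people").select("name").where("age > ?", 18)
sql = query.to_sql()
rows = sorted(query.collect())
print(sql)   # SELECT name FROM people WHERE (age > ?)
print(rows)  # [('Ada',), ('Cy',)]
```

Because filtering happens inside the database, only the matching rows are transferred to Python.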