# PySpark Feature Comparison

This document provides a comprehensive comparison of PySpark DataFrame API features against all 6 Moltres interfaces:

1. **PySpark-Style (Sync)** - `DataFrame` - Primary PySpark-compatible API
2. **PySpark-Style (Async)** - `AsyncDataFrame` - Async version of PySpark-style API
3. **Pandas-Style (Sync)** - `PandasDataFrame` - Pandas-compatible API
4. **Pandas-Style (Async)** - `AsyncPandasDataFrame` - Async version of Pandas-style API
5. **Polars-Style (Sync)** - `PolarsDataFrame` - Polars LazyFrame-compatible API
6. **Polars-Style (Async)** - `AsyncPolarsDataFrame` - Async version of Polars-style API

## Status Indicators

- ✅ **Supported** - Fully implemented with full feature parity
- ⚠️ **Partial** - Partially implemented or with limitations
- ❌ **Not Implemented** - Not available in this interface
- 🔄 **Different API** - Available but with different method name/API signature
- 📝 **Notes** - Additional implementation details, differences, or limitations

## Selection & Projection

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `select(*cols)` | ✅ | ✅ | ✅ (`select()`) | ✅ (`select()`) | ✅ | ✅ | All interfaces support column selection |
| `selectExpr(*exprs)` | ✅ | ✅ | ✅ (`select_expr()`) | ✅ (`select_expr()`) | ✅ (`select_expr()`) | ✅ (`select_expr()`) | SQL expression selection |
| Column access `df.col` | ✅ (`__getattr__`) | ✅ (`__getattr__`) | 🔄 (`df['col']`) | 🔄 (`df['col']`) | 🔄 (`df['col']`) | 🔄 (`df['col']`) | PySpark-style supports dot notation; Pandas/Polars use bracket notation `df['col']` instead |
| `df["col"]` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Pandas/Polars-style column access (bracket notation) |
| `df[["col1", "col2"]]` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Multi-column selection (Pandas/Polars) |
| `df[df["col"] > 5]` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Boolean indexing (Pandas/Polars) |

## Filtering

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `where(condition)` | ✅ | ✅ | 🔄 (`query()`) | 🔄 (`query()`) | ✅ (`filter()`) | ✅ (`filter()`) | All support filtering |
| `filter(condition)` | ✅ | ✅ | 🔄 (`query()`) | 🔄 (`query()`) | ✅ | ✅ | Alias for `where()` in PySpark-style |
| `query(expr)` | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | Pandas-style string query syntax |
| `isin(values)` | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | Pandas-style membership check |
| `between(start, end)` | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | Pandas-style range check |

## Joins

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `join(other, on, how)` | ✅ | ✅ | ✅ (`merge()`) | ✅ (`merge()`) | ✅ | ✅ | All support joins |
| `crossJoin(other)` | ✅ | ✅ | ❌ | ❌ | ✅ (`cross_join()`) | ✅ (`cross_join()`) | Cross join support |
| `semi_join(other, on)` | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | Semi-join (filter rows with matches) |
| `anti_join(other, on)` | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | Anti-join (filter rows without matches) |
| Join types: `inner`, `left`, `right`, `outer` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | All join types supported |

## GroupBy & Aggregations

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `groupBy(*cols)` | ✅ | ✅ | ✅ (`groupby()`) | ✅ (`groupby()`) | ✅ (`group_by()`) | ✅ (`group_by()`) | All support grouping |
| `agg(*exprs)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Aggregation expressions |
| `agg({"col": "func"})` | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | Dictionary syntax (PySpark/Pandas) |
| `sum()`, `mean()`, `min()`, `max()`, `count()` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Basic aggregations |
| `first()`, `last()` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | First/last value per group |
| `nunique()` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Count distinct (Pandas/Polars) |
| `std()`, `var()` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Statistical aggregations |
| `pivot(pivot_col)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Pivot operation |

## Sorting

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `orderBy(*cols)` | ✅ | ✅ | ✅ (`sort_values()`) | ✅ (`sort_values()`) | ✅ (`sort()`) | ✅ (`sort()`) | All support sorting |
| `sort(*cols)` | ✅ | ✅ | ✅ (`sort_values()`) | ✅ (`sort_values()`) | ✅ | ✅ | Alias for `orderBy()` |
| `sort_values(by, ascending)` | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | Pandas-style sorting |
| `sort(by, descending)` | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | Polars-style sorting |

## Window Functions

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `over(windowSpec)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Window function support |
| `row_number()` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Row number window function |
| `rank()`, `dense_rank()` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Ranking functions |
| `lead()`, `lag()` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Lead/lag functions |
| `first_value()`, `last_value()` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | First/last value in window |

## Set Operations

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `union(other)` | ✅ | ✅ | ✅ (`concat()`) | ✅ (`concat()`) | ✅ | ✅ | Union operation |
| `unionAll(other)` | ✅ | ✅ | ✅ (`concat()`) | ✅ (`concat()`) | ✅ | ✅ | Union all (same as union) |
| `intersect(other)` | ✅ | ✅ | ✅ (`concat()` + dedup) | ✅ (`concat()` + dedup) | ✅ | ✅ | Intersection |
| `except(other)` | ✅ | ✅ | ✅ (`concat()` + filter) | ✅ (`concat()` + filter) | ✅ (`difference()`) | ✅ (`difference()`) | Set difference |
| `distinct()` | ✅ | ✅ | ✅ (`drop_duplicates()`) | ✅ (`drop_duplicates()`) | ✅ | ✅ | Remove duplicates |
| `dropDuplicates(subset)` | ✅ | ✅ | ✅ | ✅ | ✅ (`unique()`) | ✅ (`unique()`) | Drop duplicates with subset |

## Column Manipulation

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `withColumn(name, expr)` | ✅ | ✅ | ✅ (`assign()`) | ✅ (`assign()`) | ✅ (`with_column()`) | ✅ (`with_column()`) | Add/replace column |
| `withColumns({name: expr})` | ❌ | ❌ | ✅ (`assign()`) | ✅ (`assign()`) | ✅ (`with_columns()`) | ✅ (`with_columns()`) | Multiple columns |
| `drop(*cols)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Drop columns |
| `withColumnRenamed(old, new)` | ✅ | ✅ | ✅ (`rename()`) | ✅ (`rename()`) | ✅ (`rename()`) | ✅ (`rename()`) | Rename column |
| `alias(name)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Column alias |
| `assign(**kwargs)` | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | Pandas-style column assignment |

## Null Handling

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `fillna(value)` | ✅ | ✅ | ✅ | ✅ | ✅ (`fill_null()`) | ✅ (`fill_null()`) | Fill null values |
| `fillna({col: value})` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Column-specific fill |
| `dropna(how, subset)` | ✅ | ✅ | ✅ | ✅ | ✅ (`drop_nulls()`) | ✅ (`drop_nulls()`) | Drop null rows |
| `na.drop()` | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | Null handling property |
| `na.fill(value)` | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | Null handling property |

## String Operations

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `df["col"].str.upper()` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String accessor |
| `df["col"].str.lower()` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String accessor |
| `df["col"].str.contains(pattern)` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String contains |
| `df["col"].str.startswith(pattern)` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String startswith |
| `df["col"].str.endswith(pattern)` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String endswith |
| `df["col"].str.replace(old, new)` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String replace |
| `df["col"].str.split(delimiter)` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String split |
| `df["col"].str.len()` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | String length |
| `upper(col)`, `lower(col)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | String functions (all) |
| `substring(col, pos, len)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Substring function |

## Date/Time Operations

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `df["col"].dt.year` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Date accessor |
| `df["col"].dt.month` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Date accessor |
| `df["col"].dt.day` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Date accessor |
| `df["col"].dt.hour` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Date accessor |
| `year(col)`, `month(col)`, etc. | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Date functions (all) |
| `to_date(col)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Date conversion |
| `to_timestamp(col)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Timestamp conversion |
| `date_add(col, days)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Date arithmetic |
| `date_sub(col, days)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Date arithmetic |

## File I/O

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `spark.read.csv(path)` | ✅ (`db.read.csv()`) | ✅ (`db.read.csv()`) | ❌ | ❌ | ✅ (`db.scan_csv()`) | ✅ (`db.scan_csv()`) | Read CSV |
| `spark.read.json(path)` | ✅ (`db.read.json()`) | ✅ (`db.read.json()`) | ❌ | ❌ | ✅ (`db.scan_json()`) | ✅ (`db.scan_json()`) | Read JSON |
| `spark.read.parquet(path)` | ✅ (`db.read.parquet()`) | ✅ (`db.read.parquet()`) | ❌ | ❌ | ✅ (`db.scan_parquet()`) | ✅ (`db.scan_parquet()`) | Read Parquet |
| `spark.read.text(path)` | ✅ (`db.read.text()`) | ✅ (`db.read.text()`) | ❌ | ❌ | ✅ (`db.scan_text()`) | ✅ (`db.scan_text()`) | Read text |
| `df.write.csv(path)` | ✅ | ✅ | ❌ | ❌ | ✅ (`write_csv()`) | ✅ (`write_csv()`) | Write CSV |
| `df.write.json(path)` | ✅ | ✅ | ❌ | ❌ | ✅ (`write_json()`) | ✅ (`write_json()`) | Write JSON |
| `df.write.parquet(path)` | ✅ | ✅ | ❌ | ❌ | ✅ (`write_parquet()`) | ✅ (`write_parquet()`) | Write Parquet |
| `df.write.saveAsTable(name)` | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | Save to table |
| `df.write.insertInto(table)` | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | Insert into table |
| `df.write.mode("overwrite")` | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | Write mode |

## Schema Operations

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `df.columns` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Column names |
| `df.schema` | ✅ | ✅ | 🔄 (`dtypes`) | 🔄 (`dtypes`) | ✅ | ✅ | Schema information |
| `df.dtypes` | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | Column types (Pandas-style) |
| `df.printSchema()` | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | Print schema tree |
| `df.schema` (Polars) | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | Polars schema format |

## Execution Methods

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `collect()` | ✅ | ✅ (`await collect()`) | ✅ | ✅ (`await collect()`) | ✅ | ✅ (`await collect()`) | Execute and return results |
| `collect()` (streaming) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Streaming execution |
| `show(n, truncate)` | ✅ | ✅ (`await show()`) | ❌ | ❌ | ❌ | ❌ | Display results |
| `take(n)` | ✅ | ✅ (`await take()`) | ❌ | ❌ | ✅ (`fetch()`) | ✅ (`await fetch()`) | Take n rows |
| `first()` | ✅ | ✅ (`await first()`) | ❌ | ❌ | ❌ | ❌ | First row |
| `head(n)` | ✅ | ✅ (`await head()`) | ✅ | ✅ | ✅ | ✅ | First n rows |
| `tail(n)` | ✅ | ✅ (`await tail()`) | ✅ | ✅ | ✅ | ✅ | Last n rows |
| `count()` | ✅ | ✅ (`await count()`) | ❌ | ❌ | ❌ | ❌ | Row count |
| `fetch(n)` (Polars) | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ (`await fetch()`) | Polars-style fetch |

## Statistics & Descriptive

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `describe(*cols)` | ✅ | ✅ (`await describe()`) | ✅ | ✅ (`await describe()`) | ✅ | ✅ (`await describe()`) | Statistical summary |
| `summary(*stats)` | ✅ | ✅ (`await summary()`) | ❌ | ❌ | ❌ | ❌ | Custom statistics |
| `nunique(column)` | ❌ | ❌ | ✅ | ✅ (`await nunique()`) | ✅ | ✅ | Count unique values |
| `value_counts(column)` | ❌ | ❌ | ✅ | ✅ (`await value_counts()`) | ❌ | ❌ | Value frequency (Pandas) |
| `info()` | ❌ | ❌ | ✅ | ✅ (`await info()`) | ❌ | ❌ | DataFrame info (Pandas) |
| `empty` | ❌ | ❌ | ✅ | ✅ (`await empty`) | ❌ | ❌ | Check if empty (Pandas) |
| `shape` | ❌ | ❌ | ✅ | ✅ (`await shape`) | ❌ | ❌ | DataFrame shape (Pandas) |
| `width`, `height` (Polars) | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ (`await height`) | Polars dimensions |

## Data Reshaping

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `pivot(pivot_col, values)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Pivot operation |
| `explode(col)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Explode array/JSON |
| `melt(id_vars, value_vars)` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | Unpivot (Pandas/Polars) |
| `unnest(cols)` | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | Unnest nested structures (Polars) |
| `slice(offset, length)` | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | Slice rows (Polars) |

## Sampling & Limiting

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `sample(fraction, seed)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Random sampling |
| `limit(n)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Limit rows |

## CTEs & SQL

| PySpark Method | PySpark-Style (Sync) | PySpark-Style (Async) | Pandas-Style (Sync) | Pandas-Style (Async) | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------------|---------------------|----------------------|-------------------|---------------------|-------------------|---------------------|-------|
| `cte(name)` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Common Table Expression |
| `with_recursive(name)` | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | Recursive CTE (Polars) |
| `spark.sql(query)` | ✅ (`db.sql()`) | ✅ (`await db.sql()`) | ❌ | ❌ | ❌ | ❌ | Raw SQL execution |
| `to_sql()` | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | Get SQL string |
| `to_sqlalchemy()` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Get SQLAlchemy statement |

## Interface-Specific Features

### Pandas-Style Unique Features

| Feature | Pandas-Style (Sync) | Pandas-Style (Async) | Notes |
|---------|-------------------|---------------------|-------|
| `query(expr)` | ✅ | ✅ | String-based query syntax |
| `loc` indexer | ✅ | ✅ (`await loc`) | Label-based indexing |
| `iloc` indexer | ✅ | ✅ (`await iloc`) | Integer-based indexing |
| `append(other)` | ✅ | ✅ | Append DataFrames |
| `isin(values)` | ✅ | ✅ | Membership check |
| `between(start, end)` | ✅ | ✅ | Range check |
| `assign(**kwargs)` | ✅ | ✅ | Column assignment |
| `sort_values(by, ascending)` | ✅ | ✅ | Sorting with parameters |
| `value_counts(column)` | ✅ | ✅ | Value frequency |
| `info()` | ✅ | ✅ | DataFrame info |
| `melt()` | ✅ | ✅ | Unpivot operation |

### Polars-Style Unique Features

| Feature | Polars-Style (Sync) | Polars-Style (Async) | Notes |
|---------|-------------------|---------------------|-------|
| `lazy()` | ✅ | ✅ | Mark as lazy (no-op in Moltres) |
| `fetch(n)` | ✅ | ✅ (`await fetch()`) | Fetch n rows |
| `with_columns(*exprs)` | ✅ | ✅ | Multiple column operations |
| `with_columns_renamed(mapping)` | ✅ | ✅ | Rename multiple columns |
| `with_row_count(name)` | ✅ | ✅ | Add row number column |
| `with_context(df)` | ✅ | ✅ | Add context DataFrame |
| `with_recursive(name)` | ✅ | ✅ | Recursive CTE |
| `unnest(cols)` | ✅ | ✅ | Unnest nested structures |
| `slice(offset, length)` | ✅ | ✅ | Slice rows |
| `gather_every(n, offset)` | ✅ | ✅ | Sample every nth row |
| `interpolate(method)` | ✅ | ✅ | Interpolate missing values |
| `quantile(quantile)` | ✅ | ✅ | Quantile calculation |
| `hstack(other)` | ✅ | ✅ | Horizontal stack |
| `vstack(other)` | ✅ | ✅ | Vertical stack |
| `difference(other)` | ✅ | ✅ | Set difference |
| `cross_join(other)` | ✅ | ✅ | Cross join |
| `drop_nulls(subset)` | ✅ | ✅ | Drop nulls |
| `fill_null(value)` | ✅ | ✅ | Fill nulls |
| `unique(subset)` | ✅ | ✅ | Unique rows |
| `explain(format)` | ✅ | ✅ (`await explain()`) | Query plan explanation |

### Async-Specific Features

| Feature | PySpark-Style (Async) | Pandas-Style (Async) | Polars-Style (Async) | Notes |
|---------|---------------------|---------------------|---------------------|-------|
| `await collect()` | ✅ | ✅ | ✅ | Async execution |
| `await collect(stream=True)` | ✅ | ✅ | ✅ | Async streaming |
| `await show()` | ✅ | ❌ | ❌ | Async display |
| `await take()` | ✅ | ❌ | ✅ (`await fetch()`) | Async take |
| `await first()` | ✅ | ❌ | ❌ | Async first row |
| `await head()` | ✅ | ✅ | ✅ | Async head |
| `await tail()` | ✅ | ✅ | ✅ | Async tail |
| `await count()` | ✅ | ❌ | ❌ | Async count |
| `await describe()` | ✅ | ✅ | ✅ | Async describe |
| `await summary()` | ✅ | ❌ | ❌ | Async summary |
| `await nunique()` | ❌ | ✅ | ✅ | Async nunique |
| `await value_counts()` | ❌ | ✅ | ❌ | Async value_counts |
| `await info()` | ❌ | ✅ | ❌ | Async info |
| `await shape` | ❌ | ✅ | ❌ | Async shape |
| `await empty` | ❌ | ✅ | ❌ | Async empty |
| `await height` | ❌ | ❌ | ✅ | Async height |
| `await schema` | ❌ | ❌ | ✅ | Async schema |
| `await dtypes` | ❌ | ✅ | ❌ | Async dtypes |
| `await fetch()` | ❌ | ❌ | ✅ | Async fetch |
| `await write_csv()` | ❌ | ❌ | ✅ | Async write CSV |
| `await write_json()` | ❌ | ❌ | ✅ | Async write JSON |
| `await write_parquet()` | ❌ | ❌ | ✅ | Async write Parquet |
| `await explain()` | ❌ | ❌ | ✅ | Async explain |
| `await loc` | ❌ | ✅ | ❌ | Async loc indexer |
| `await iloc` | ❌ | ✅ | ❌ | Async iloc indexer |

## Summary

### Overall Coverage

- **PySpark-Style (Sync)**: ~98% API compatibility with PySpark DataFrame API
- **PySpark-Style (Async)**: Full async support for all sync methods
- **Pandas-Style (Sync)**: Comprehensive Pandas DataFrame API with SQL pushdown
- **Pandas-Style (Async)**: Full async support for all Pandas-style methods
- **Polars-Style (Sync)**: Comprehensive Polars LazyFrame API with SQL pushdown
- **Polars-Style (Async)**: Full async support for all Polars-style methods

### Key Differences

1. **Column Access**: PySpark uses `df.col` attribute access, while Pandas/Polars use `df['col']` bracket notation
2. **Filtering**: PySpark uses `where()`/`filter()`, Pandas uses `query()`, Polars uses `filter()`
3. **GroupBy**: PySpark uses `groupBy()`, Pandas uses `groupby()`, Polars uses `group_by()`
4. **Sorting**: PySpark uses `orderBy()`, Pandas uses `sort_values()`, Polars uses `sort()`
5. **Null Handling**: PySpark uses `na` property, Pandas/Polars use direct methods
6. **File I/O**: PySpark uses `spark.read.*`, Moltres uses `db.read.*` or `db.scan_*` (Polars)
7. **Async**: All async interfaces require `await` for execution methods

### Implementation Notes

- All interfaces maintain lazy evaluation until execution
- SQL pushdown execution for all operations
- Type safety with proper type hints
- Comprehensive error handling and validation
- Support for multiple database dialects (SQLite, PostgreSQL, MySQL, DuckDB)
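
The lazy-evaluation and SQL-pushdown notes above can be illustrated with a toy query builder. This is a minimal sketch over Python's standard `sqlite3` module and is not Moltres's implementation: the `LazyQuery` class, its fields, and the `users` table are hypothetical names invented for this example. Only the method names `select()`, `where()`, `limit()`, `to_sql()`, and `collect()` are borrowed from the tables in this document, and their signatures here are assumptions.

```python
import sqlite3
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class LazyQuery:
    """Hypothetical toy class: records operations, runs nothing until collect()."""

    conn: sqlite3.Connection
    table: str
    columns: tuple = ("*",)
    predicates: tuple = ()
    row_limit: Optional[int] = None

    def select(self, *cols: str) -> "LazyQuery":
        # Returns a new query object; no SQL is executed here.
        return replace(self, columns=cols)

    def where(self, predicate: str) -> "LazyQuery":
        # Predicates are raw SQL fragments in this sketch. A real library
        # would build them from expression objects and bind parameters.
        return replace(self, predicates=self.predicates + (predicate,))

    def limit(self, n: int) -> "LazyQuery":
        return replace(self, row_limit=int(n))

    def to_sql(self) -> str:
        # Compile every recorded operation into one SELECT statement.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        if self.row_limit is not None:
            sql += f" LIMIT {self.row_limit}"
        return sql

    def collect(self) -> list:
        # The only method that touches the database (the "pushdown" step).
        return self.conn.execute(self.to_sql()).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Ada", 36), (2, "Bob", 17), (3, "Cy", 52)],
)

base = LazyQuery(conn, "users")
query = base.select("name", "age").where("age > 18").limit(10)

sql_text = query.to_sql()  # inspect the compiled SQL without running it
rows = query.collect()     # execution happens only here
print(sql_text)
print(rows)
```

Chained calls only accumulate state, and each returns a new immutable object, so `base` can be reused for other queries. The filtering and limiting run inside the database because they are part of the single compiled statement, rather than being applied to rows after fetching them.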