# PySpark to Moltres Migration: Working Around Inconsistencies

This guide helps you work around known API inconsistencies when migrating from PySpark to Moltres. For a complete comparison, see [MOLTRES_VS_PYSPARK_COMPARISON.md](MOLTRES_VS_PYSPARK_COMPARISON.md).

---

## Overview

Moltres achieves **100% API compatibility** with PySpark for core DataFrame operations (v0.16.0+). This guide documents the fixes that were made and any remaining considerations.

---

## High Priority Inconsistencies

### 1. `order_by()` / `orderBy()` / `sort()` - String Parameter Support ✅ **FIXED in v0.16.0**

**Status:** Fixed! Moltres now accepts both strings and `Column` objects, matching PySpark behavior.

#### PySpark Code

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = spark.read.table("orders")

# Both of these work in PySpark
result1 = df.orderBy("amount")  # ✅ String
result2 = df.orderBy(col("amount"))  # ✅ Column object
result3 = df.sort("amount")  # ✅ String
result4 = df.sort(col("amount").desc())  # ✅ Column object
```

#### Moltres Code (v0.16.0+)

```python
from moltres import connect, col

db = connect("sqlite:///example.db")
df = db.table("orders").select()

# Both strings and Column objects work in Moltres (v0.16.0+)
result1 = df.order_by("amount")  # ✅ String works!
result2 = df.order_by(col("amount"))  # ✅ Column object works
result3 = df.order_by(col("amount").desc())  # ✅ Descending works
result4 = df.orderBy("amount")  # ✅ PySpark-style alias with string works
result5 = df.sort("amount")  # ✅ PySpark-style alias with string works
```

#### Migration Pattern

**Before (PySpark):**

```python
df.orderBy("category", "amount")
df.sort("name")
```

**After (Moltres v0.16.0+):**

```python
# Works exactly the same - no changes needed!
df.order_by("category", "amount")
df.sort("name")

# Or use PySpark-style aliases:
df.orderBy("category", "amount")
df.sort("name")
```

**Status:** ✅ Fixed in v0.16.0 - No workaround needed!

---

## Medium Priority Inconsistencies

### 2. Window Functions Usage Pattern ✅ **FIXED in v0.16.0**

**Status:** Fixed! Moltres now supports window functions in `withColumn()`, matching PySpark behavior. The examples below pass the window specification to `over()` as keyword arguments rather than as a `Window` object.

#### PySpark Code

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window = Window.partitionBy("category").orderBy("amount")
df = df.withColumn("row_num", row_number().over(window))
```

#### Moltres Code (v0.16.0+)

```python
from moltres.expressions import functions as F
from moltres import col

# withColumn() is called as in PySpark; the window spec goes to over() as keyword arguments
df = df.withColumn(
    "row_num",
    F.row_number().over(partition_by=col("category"), order_by=col("amount"))
)
```

#### Migration Pattern

**Before (PySpark):**

```python
df = df.withColumn("row_num", row_number().over(window))
df = df.withColumn("rank", rank().over(window))
```

**After (Moltres v0.16.0+):**

```python
# Same withColumn() calls; replace the Window object with over() keyword arguments
df = df.withColumn(
    "row_num",
    F.row_number().over(partition_by=col("category"), order_by=col("amount"))
)
df = df.withColumn(
    "rank",
    F.rank().over(partition_by=col("category"), order_by=col("amount"))
)
```

**Status:** ✅ Fixed in v0.16.0 - No workaround needed!

---

## Low Priority Inconsistencies

### 3. `drop()` - Column Object Support ✅ **FIXED in v0.16.0**

**Status:** Fixed! Moltres now accepts both strings and `Column` objects, matching PySpark behavior.

#### PySpark Code

```python
# Both work in PySpark
df.drop("col1", "col2")  # ✅ Strings
df.drop(col("col1"), col("col2"))  # ✅ Column objects
```

#### Moltres Code (v0.16.0+)

```python
# Both work in Moltres (v0.16.0+)
df.drop("col1", "col2")  # ✅ Strings work
df.drop(col("col1"), col("col2"))  # ✅ Column objects now work!
df.drop("col1", col("col2"))  # ✅ Mixed usage works
```

#### Migration Pattern

**Before (PySpark):**

```python
df.drop(col("col1"), col("col2"))
```

**After (Moltres v0.16.0+):**

```python
# Works exactly the same - no changes needed!
df.drop(col("col1"), col("col2"))

# Or use strings:
df.drop("col1", "col2")
```

**Status:** ✅ Fixed in v0.16.0 - No workaround needed!

---

## General Migration Tips

### 1. Always Import `col` from Moltres

```python
from moltres import col  # Needed for column expressions such as col("age") > 18
```

### 2. Use `col()` Wrapper for Column References

When in doubt, wrap column names with `col()`:

```python
# Safe pattern
df.select(col("name"), col("amount"))
df.filter(col("age") > 18)
df.group_by(col("category"))  # Works, but strings also work
df.order_by(col("name"))  # Works; strings are also accepted as of v0.16.0
```

### 3. Check Method Signatures

If a method doesn't work as expected, check the method signature:

```python
# Check what types are accepted
help(df.order_by)  # As of v0.16.0, accepts both Column objects and strings
help(df.group_by)  # Shows: group_by(*columns: Union[Column, str])
```

### 4. Use PySpark-Style Aliases

Moltres provides PySpark-style camelCase aliases:

```python
# Both work
df.order_by(col("name"))
df.orderBy(col("name"))  # PySpark-style alias

df.group_by("category")
df.groupBy("category")  # PySpark-style alias
```

---

## Complete Migration Example

### PySpark Code

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg

spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = spark.read.table("orders")

result = (
    df.select("category", "amount")
    .filter(col("amount") > 100)
    .groupBy("category")
    .agg(sum("amount").alias("total"), avg("amount").alias("avg"))
    .orderBy("total")  # String column name
)
result.show()
```

### Moltres Code

```python
from moltres import connect, col
from moltres.expressions import functions as F

db = connect("sqlite:///example.db")
df = db.table("orders").select()

result = (
    df.select("category", "amount")
    .where(col("amount") > 100)
    .group_by("category")
    .agg(F.sum(col("amount")).alias("total"), F.avg(col("amount")).alias("avg"))
    .order_by(col("total"))  # col() is optional as of v0.16.0; a string also works
)
result.show()
```

**Key Changes:**

1. `SparkSession` → `connect()`
2. `read.table()` → `table().select()`
3. `filter()` → `where()` (or use `filter()`)
4. `groupBy()` → `group_by()` (or use `groupBy()`)
5. `orderBy("total")` → `order_by(col("total"))` (or keep the string form; `col()` is only required before v0.16.0)

---

## Testing Your Migration

After migrating, test your code to ensure it works:

```python
from moltres import connect, col
from moltres.expressions import functions as F

db = connect("sqlite:///example.db")

# Test basic operations
df = db.table("orders").select()
assert df.count() > 0

# Test sorting (strings and col() both work in v0.16.0+)
sorted_df = df.order_by(col("amount"))
results = sorted_df.collect()
assert len(results) > 0

# Test grouping
grouped = df.group_by("category").agg(F.sum(col("amount")))
results = grouped.collect()
assert len(results) > 0
```

---

## Getting Help

- **Full API Comparison:** See [MOLTRES_VS_PYSPARK_COMPARISON.md](MOLTRES_VS_PYSPARK_COMPARISON.md)
- **Detailed Audit:** See [PYSPARK_INTERFACE_AUDIT.md](PYSPARK_INTERFACE_AUDIT.md)
- **Examples:** See `docs/examples/` directory for working code samples
- **Issues:** Report inconsistencies or request features on GitHub

---

## Recent Improvements (v0.16.0)

Version 0.16.0 fixed the three PySpark compatibility issues covered in this guide:

1. ✅ **String support in `order_by()`** - Fixed in v0.16.0
2. ✅ **Column object support in `drop()`** - Fixed in v0.16.0
3. ✅ **Window functions in `withColumn()`** - Fixed in v0.16.0

**Result:** Moltres now achieves **100% PySpark API compatibility** for core DataFrame operations!

See [PYSPARK_INTERFACE_AUDIT.md](PYSPARK_INTERFACE_AUDIT.md) for the complete roadmap.

---

*Last Updated: 2025*

*Moltres Version: 0.16.0*
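
---

## Appendix: Baseline Fixture for Migration Tests

The checks in "Testing Your Migration" only assert that results are non-empty. A stricter check compares the migrated output against values computed independently. The sketch below uses only the standard-library `sqlite3` module to build a small `orders` table and compute baseline results for the sorting check and for the complete migration example. The schema, the sample rows, and the helper name `build_orders_fixture` are assumptions made for illustration; they are not part of Moltres.

```python
import sqlite3


def build_orders_fixture(path=":memory:"):
    """Create a small `orders` table. Schema and rows are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, category TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (category, amount) VALUES (?, ?)",
        [
            ("books", 120.0),
            ("books", 80.0),
            ("games", 300.0),
            ("games", 150.0),
            ("music", 40.0),
        ],
    )
    conn.commit()
    return conn


conn = build_orders_fixture()

# Baseline for the sorting check: amounts in ascending order.
sorted_amounts = [
    row[0] for row in conn.execute("SELECT amount FROM orders ORDER BY amount")
]

# Baseline for the complete example: filter, group, aggregate, order by total.
totals = conn.execute(
    "SELECT category, SUM(amount) AS total, AVG(amount) AS avg_amount "
    "FROM orders WHERE amount > 100 "
    "GROUP BY category ORDER BY total"
).fetchall()

print(sorted_amounts)
print(totals)
```

Passing a file path such as `"example.db"` instead of `":memory:"` writes the fixture to disk, so the `connect("sqlite:///example.db")` call used throughout this guide could read the same rows and the migrated results could be compared against these baselines.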