PySpark to Moltres Migration: Working Around Inconsistencies
This guide helps you work around known API inconsistencies when migrating from PySpark to Moltres. For a complete comparison, see MOLTRES_VS_PYSPARK_COMPARISON.md.
Overview
Moltres achieves 100% API compatibility with PySpark for core DataFrame operations (v0.16.0+). This guide documents the fixes that were made and any remaining considerations.
High Priority Inconsistencies
1. order_by() / orderBy() / sort() - String Parameter Support ✅ FIXED in v0.16.0
Status: Fixed! Moltres now accepts both strings and Column objects, matching PySpark behavior.
PySpark Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = spark.read.table("orders")
# Both of these work in PySpark
result1 = df.orderBy("amount") # ✅ String
result2 = df.orderBy(col("amount")) # ✅ Column object
result3 = df.sort("amount") # ✅ String
result4 = df.sort(col("amount").desc()) # ✅ Column object
Moltres Code (v0.16.0+)
from moltres import connect, col
db = connect("sqlite:///example.db")
df = db.table("orders").select()
# Both strings and Column objects work in Moltres (v0.16.0+)
result1 = df.order_by("amount") # ✅ String works!
result2 = df.order_by(col("amount")) # ✅ Column object works
result3 = df.order_by(col("amount").desc()) # ✅ Descending works
result4 = df.orderBy("amount") # ✅ PySpark-style alias with string works
result5 = df.sort("amount") # ✅ PySpark-style alias with string works
Migration Pattern
Before (PySpark):
df.orderBy("category", "amount")
df.sort("name")
After (Moltres v0.16.0+):
# Works exactly the same - no changes needed!
df.order_by("category", "amount")
df.sort("name")
# Or use PySpark-style aliases:
df.orderBy("category", "amount")
df.sort("name")
Status: ✅ Fixed in v0.16.0 - No workaround needed!
Medium Priority Inconsistencies
2. Window Functions Usage Pattern ✅ FIXED in v0.16.0
Status: Fixed! Moltres now supports window functions in withColumn(), matching PySpark behavior.
PySpark Code
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
window = Window.partitionBy("category").orderBy("amount")
df = df.withColumn("row_num", row_number().over(window))
Moltres Code (v0.16.0+)
from moltres.expressions import functions as F
from moltres import col
# Works exactly the same - no changes needed!
df = df.withColumn(
"row_num", F.row_number().over(partition_by=col("category"), order_by=col("amount"))
)
Migration Pattern
Before (PySpark):
df = df.withColumn("row_num", row_number().over(window))
df = df.withColumn("rank", rank().over(window))
After (Moltres v0.16.0+):
# Works exactly the same - no changes needed!
df = df.withColumn(
"row_num", F.row_number().over(partition_by=col("category"), order_by=col("amount"))
)
df = df.withColumn(
"rank", F.rank().over(partition_by=col("category"), order_by=col("amount"))
)
Status: ✅ Fixed in v0.16.0 - No workaround needed!
Low Priority Inconsistencies
3. drop() - Column Object Support ✅ FIXED in v0.16.0
Status: Fixed! Moltres now accepts both strings and Column objects, matching PySpark behavior.
PySpark Code
# Both work in PySpark
df.drop("col1", "col2") # ✅ Strings
df.drop(col("col1"), col("col2")) # ✅ Column objects
Moltres Code (v0.16.0+)
# Both work in Moltres (v0.16.0+)
df.drop("col1", "col2") # ✅ Strings work
df.drop(col("col1"), col("col2")) # ✅ Column objects now work!
df.drop("col1", col("col2")) # ✅ Mixed usage works
Migration Pattern
Before (PySpark):
df.drop(col("col1"), col("col2"))
After (Moltres v0.16.0+):
# Works exactly the same - no changes needed!
df.drop(col("col1"), col("col2"))
# Or use strings:
df.drop("col1", "col2")
Status: ✅ Fixed in v0.16.0 - No workaround needed!
General Migration Tips
1. Always Import col from Moltres
from moltres import col # Essential for column operations
2. Use col() Wrapper for Column References
When in doubt, wrap column names with col():
# Safe pattern
df.select(col("name"), col("amount"))
df.filter(col("age") > 18)
df.group_by(col("category")) # Works, but strings also work
df.order_by(col("name")) # Required for order_by
3. Check Method Signatures
If a method doesn’t work as expected, check the method signature:
# Check what types are accepted
help(df.order_by) # Shows: order_by(*columns: Column)
help(df.group_by) # Shows: group_by(*columns: Union[Column, str])
4. Use PySpark-Style Aliases
Moltres provides PySpark-style camelCase aliases:
# Both work
df.order_by(col("name"))
df.orderBy(col("name")) # PySpark-style alias
df.group_by("category")
df.groupBy("category") # PySpark-style alias
Complete Migration Example
PySpark Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg
spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = spark.read.table("orders")
result = (
df.select("category", "amount")
.filter(col("amount") > 100)
.groupBy("category")
.agg(sum("amount").alias("total"), avg("amount").alias("avg"))
.orderBy("total") # String column name
)
result.show()
Moltres Code (With Workarounds)
from moltres import connect, col
from moltres.expressions import functions as F
db = connect("sqlite:///example.db")
df = db.table("orders").select()
result = (
df.select("category", "amount")
.where(col("amount") > 100)
.group_by("category")
.agg(F.sum(col("amount")).alias("total"), F.avg(col("amount")).alias("avg"))
.order_by(col("total")) # Must use col() wrapper
)
result.show()
Key Changes:
SparkSession→connect()read.table()→table().select()filter()→where()(or usefilter())groupBy()→group_by()(or usegroupBy())orderBy("total")→order_by(col("total"))⚠️ Must use col()
Testing Your Migration
After migrating, test your code to ensure it works:
# Test basic operations
df = db.table("orders").select()
assert df.count() > 0
# Test sorting (with col() wrapper)
sorted_df = df.order_by(col("amount"))
results = sorted_df.collect()
assert len(results) > 0
# Test grouping
grouped = df.group_by("category").agg(F.sum(col("amount")))
results = grouped.collect()
assert len(results) > 0
Getting Help
Full API Comparison: See MOLTRES_VS_PYSPARK_COMPARISON.md
Detailed Audit: See PYSPARK_INTERFACE_AUDIT.md
Examples: See
docs/examples/directory for working code samplesIssues: Report inconsistencies or request features on GitHub
Recent Improvements (v0.16.0)
The Moltres team has recently fixed all PySpark compatibility issues:
✅ String support in
order_by()- Fixed in v0.16.0✅ Column object support in
drop()- Fixed in v0.16.0✅ Window functions in
withColumn()- Fixed in v0.16.0
Result: Moltres now achieves 100% PySpark API compatibility for core DataFrame operations!
See PYSPARK_INTERFACE_AUDIT.md for the complete roadmap.
Last Updated: 2025 Moltres Version: 0.16.0