PySpark to Moltres Migration: Working Around Inconsistencies

This guide helps you work around known API inconsistencies when migrating from PySpark to Moltres. For a complete comparison, see MOLTRES_VS_PYSPARK_COMPARISON.md.

Overview

Moltres provides high PySpark API compatibility for core DataFrame operations, with documented exceptions. This guide covers the three inconsistencies fixed in v0.16.0 and the considerations that remain when porting production jobs.

Before migrating: Review union semantics (union vs unionAll), file I/O paths (db.load.* vs deprecated db.read.*), and write execution behavior in the Public API and PySpark migration guide.
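
One way to start the union review is to list every `.union(` call in the PySpark sources, since those are the calls whose meaning changes. The sketch below is a plain-text scan using only the standard library; `find_union_calls` and the sample source are hypothetical names made up for illustration, and the regex will also flag unrelated calls such as `set.union(...)`, so treat the hits as candidates to inspect by hand.

```python
import re

# Matches ".union(" but not ".unionAll(" or ".unionByName(".
UNION_CALL = re.compile(r"\.union\s*\(")


def find_union_calls(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for each line that calls .union()."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if UNION_CALL.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical PySpark snippet used only to exercise the scan.
sample = """\
combined = df1.union(df2)
everything = df1.unionAll(df2)
by_name = df1.unionByName(df2)
"""

for number, line in find_union_calls(sample):
    print(f"line {number}: {line}")
```

Only the first line of the sample is reported; `unionAll` and `unionByName` are left alone.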

High Priority Inconsistencies

1. `order_by()` / `orderBy()` / `sort()` - String Parameter Support ✅ FIXED in v0.16.0

Status: Fixed! Moltres now accepts both strings and Column objects, matching PySpark behavior.

PySpark Code

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = spark.read.table("orders")

# Both of these work in PySpark
result1 = df.orderBy("amount")  # ✅ String
result2 = df.orderBy(col("amount"))  # ✅ Column object
result3 = df.sort("amount")  # ✅ String
result4 = df.sort(col("amount").desc())  # ✅ Column object

Moltres Code (v0.16.0+)

from moltres import connect, col

db = connect("sqlite:///example.db")
df = db.table("orders").select()

# Both strings and Column objects work in Moltres (v0.16.0+)
result1 = df.order_by("amount")  # ✅ String works!
result2 = df.order_by(col("amount"))  # ✅ Column object works
result3 = df.order_by(col("amount").desc())  # ✅ Descending works
result4 = df.orderBy("amount")  # ✅ PySpark-style alias with string works
result5 = df.sort("amount")  # ✅ PySpark-style alias with string works

Migration Pattern

Before (PySpark):

df.orderBy("category", "amount")
df.sort("name")

After (Moltres v0.16.0+):

# Works exactly the same - no changes needed!
df.order_by("category", "amount")
df.sort("name")

# Or use PySpark-style aliases:
df.orderBy("category", "amount")
df.sort("name")

Status: ✅ Fixed in v0.16.0 - No workaround needed!

Medium Priority Inconsistencies

2. Window Functions Usage Pattern ✅ FIXED in v0.16.0

Status: Fixed! Moltres now supports window functions in withColumn(), matching PySpark behavior.

PySpark Code

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window = Window.partitionBy("category").orderBy("amount")
df = df.withColumn("row_num", row_number().over(window))

Moltres Code (v0.16.0+)

from moltres.expressions import functions as F
from moltres import col

# withColumn() is used as in PySpark; the window is passed to over() as keyword arguments
df = df.withColumn(
    "row_num", F.row_number().over(partition_by=col("category"), order_by=col("amount"))
)

Migration Pattern

Before (PySpark):

df = df.withColumn("row_num", row_number().over(window))
df = df.withColumn("rank", rank().over(window))

After (Moltres v0.16.0+):

# withColumn() is used as in PySpark; the window is passed to over() as keyword arguments
df = df.withColumn(
    "row_num", F.row_number().over(partition_by=col("category"), order_by=col("amount"))
)
df = df.withColumn(
    "rank", F.rank().over(partition_by=col("category"), order_by=col("amount"))
)

Status: ✅ Fixed in v0.16.0 - No workaround needed!
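
To sanity-check a migrated window expression, it can help to compute the expected row numbers directly in SQL. The sketch below uses the standard library's `sqlite3` rather than Moltres, and it assumes that the `over(partition_by=..., order_by=...)` call corresponds to a standard `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` clause; this guide does not show the SQL that Moltres generates. The table contents are made-up sample data, and window functions require SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("books", 30.0), ("books", 10.0), ("toys", 25.0), ("books", 20.0)],
)

# Number the rows within each category, smallest amount first.
rows = conn.execute(
    """
    SELECT category, amount,
           ROW_NUMBER() OVER (PARTITION BY category ORDER BY amount) AS row_num
    FROM orders
    ORDER BY category, amount
    """
).fetchall()

for row in rows:
    print(row)
```

Each category restarts its numbering at 1, which is the behavior to expect from the `row_num` column in both the PySpark and the Moltres version.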

Low Priority Inconsistencies

3. `drop()` - Column Object Support ✅ FIXED in v0.16.0

Status: Fixed! Moltres now accepts both strings and Column objects, matching PySpark behavior.

PySpark Code

# Both work in PySpark
df.drop("col1", "col2")  # ✅ Strings
df.drop(col("col1"), col("col2"))  # ✅ Column objects

Moltres Code (v0.16.0+)

# Both work in Moltres (v0.16.0+)
df.drop("col1", "col2")  # ✅ Strings work
df.drop(col("col1"), col("col2"))  # ✅ Column objects now work!
df.drop("col1", col("col2"))  # ✅ Mixed usage works

Migration Pattern

Before (PySpark):

df.drop(col("col1"), col("col2"))

After (Moltres v0.16.0+):

# Works exactly the same - no changes needed!
df.drop(col("col1"), col("col2"))
# Or use strings:
df.drop("col1", "col2")

Status: ✅ Fixed in v0.16.0 - No workaround needed!

General Migration Tips

1. Always Import `col` from Moltres

from moltres import col  # Essential for column operations

2. Column References for Sorting and Grouping

Strings work for order_by(), sort(), and group_by() (v0.16.0+). Use col() when you need expressions (e.g. descending sort or computed columns):

# Safe patterns
df.select(col("name"), col("amount"))
df.filter(col("age") > 18)
df.group_by("category")  # strings work
df.order_by("name")  # strings work
df.order_by(col("amount").desc())  # expressions need col()

3. Check Method Signatures

If a method doesn’t work as expected, check the method signature:

# Check what types are accepted
help(df.order_by)  # Accepts Column expressions and column name strings
help(df.group_by)  # Accepts Column expressions and column name strings
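
When `help()` output is too long to scan, the standard library's `inspect.signature` prints just the parameter list. The sketch below runs against a stand-in class, not the Moltres DataFrame; `FakeFrame`, its `ascending` parameter, and the `describe` helper are all invented for illustration. In a real session you would pass `df.order_by` instead.

```python
import inspect


def describe(method) -> str:
    """Return 'name(signature)' for any Python callable."""
    return f"{method.__name__}{inspect.signature(method)}"


class FakeFrame:
    """Stand-in object for illustration only; NOT the Moltres DataFrame."""

    def order_by(self, *columns, ascending=True):
        return self


# With Moltres installed, this would be: describe(df.order_by)
print(describe(FakeFrame().order_by))
```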

4. Use PySpark-Style Aliases

Moltres provides PySpark-style camelCase aliases:

# Both work
df.order_by(col("name"))
df.orderBy(col("name"))  # PySpark-style alias

df.group_by("category")
df.groupBy("category")  # PySpark-style alias

Complete Migration Example

PySpark Code

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg

spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = spark.read.table("orders")

result = (
    df.select("category", "amount")
    .filter(col("amount") > 100)
    .groupBy("category")
    .agg(sum("amount").alias("total"), avg("amount").alias("avg"))
    .orderBy("total")  # String column name
)
result.show()

Moltres Code

from moltres import connect, col
from moltres.expressions import functions as F

db = connect("sqlite:///example.db")
df = db.table("orders").select()

result = (
    df.select("category", "amount")
    .where(col("amount") > 100)
    .group_by("category")
    .agg(F.sum(col("amount")).alias("total"), F.avg(col("amount")).alias("avg"))
    .order_by("total")  # strings work on output column names
)
result.show()

Key Changes:

SparkSession → connect()
read.table() → table().select()
filter() → where() (or use filter())
groupBy() → group_by() (or use groupBy())
orderBy("total") → order_by("total") (strings supported)
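
Both versions of the example assume an `orders` table with `category` and `amount` columns. The sketch below builds such a table with the standard library's `sqlite3` and runs the SQL that the pipeline is expected to be equivalent to, which gives reference values to compare a migrated job against. The sample rows are made up, and the SQL is a hand-written equivalent, not output captured from Moltres.

```python
import sqlite3

# Use "example.db" instead of ":memory:" to create a file for the Moltres example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("books", 150.0), ("books", 250.0), ("toys", 120.0), ("toys", 80.0)],
)
conn.commit()

# Filter, group, aggregate, and sort: the same steps as the DataFrame pipeline.
expected = conn.execute(
    """
    SELECT category, SUM(amount) AS total, AVG(amount) AS "avg"
    FROM orders
    WHERE amount > 100
    GROUP BY category
    ORDER BY total
    """
).fetchall()

for row in expected:
    print(row)
```

The `toys` order of 80.0 is removed by the filter before aggregation, so `toys` sorts first with a total of 120.0.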

Testing Your Migration

After migrating, test your code to ensure it works:

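from moltres import connect, col
from moltres.expressions import functions as F

db = connect("sqlite:///example.db")

# Test basic operations
df = db.table("orders").select()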
assert df.count() > 0

# Test sorting (strings supported)
sorted_df = df.order_by("amount")
results = sorted_df.collect()
assert len(results) > 0

# Test grouping
grouped = df.group_by("category").agg(F.sum(col("amount")))
results = grouped.collect()
assert len(results) > 0
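
Checking that results are non-empty catches crashes but not wrong answers. A stricter check compares the migrated output against rows exported from the original PySpark job. The sketch below is one way to do that in plain Python; it assumes both sides can be turned into lists of dictionaries with hashable values (whether `collect()` returns that shape directly is an assumption), and `rows_match` is a hypothetical helper, not part of Moltres.

```python
from collections import Counter


def as_multiset(rows):
    """Count rows while ignoring row order and dictionary key order."""
    return Counter(tuple(sorted(row.items())) for row in rows)


def rows_match(expected, actual) -> bool:
    """True when both results hold the same rows with the same multiplicity."""
    return as_multiset(expected) == as_multiset(actual)


# Made-up sample rows standing in for exported PySpark and Moltres results.
pyspark_rows = [
    {"category": "books", "total": 400.0},
    {"category": "toys", "total": 120.0},
]
moltres_rows = [
    {"total": 120.0, "category": "toys"},
    {"total": 400.0, "category": "books"},
]

print(rows_match(pyspark_rows, moltres_rows))
```

Because duplicates are counted, this check also catches a `union()` that silently dropped rows. Floating-point aggregates may need rounding before comparison.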

Getting Help

Full API Comparison: See MOLTRES_VS_PYSPARK_COMPARISON.md
Detailed Audit: See the Moltres vs PySpark comparison
Examples: See docs/examples/ directory for working code samples
Issues: Report inconsistencies or request features on GitHub

Union Semantics (Important Difference from PySpark)

PySpark's union() returns all rows, like SQL UNION ALL. Moltres follows SQL naming instead: union() removes duplicate rows and unionAll() keeps them.

Moltres	PySpark equivalent	Behavior
`df.union(other)`	`union().distinct()`	DISTINCT rows only
`df.unionAll(other)`	`union()`	ALL rows

When migrating PySpark df1.union(df2), use df1.unionAll(df2) in Moltres.
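
The difference between the two SQL operators can be seen with the standard library's `sqlite3`, independent of either DataFrame library. This is a minimal sketch with made-up tables `a` and `b`; it shows the SQL semantics the table above refers to, not Moltres itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER)")
conn.execute("CREATE TABLE b (id INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])

# UNION removes the duplicate id 2; UNION ALL keeps both copies.
distinct_rows = conn.execute(
    "SELECT id FROM a UNION SELECT id FROM b ORDER BY id"
).fetchall()
all_rows = conn.execute(
    "SELECT id FROM a UNION ALL SELECT id FROM b ORDER BY id"
).fetchall()

print(len(distinct_rows), len(all_rows))
```

A PySpark job that relied on `union()` keeping duplicates would lose one row here if it were ported without switching to `unionAll()`.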

Recent Improvements (v0.16.0+)

The Moltres team fixed the three PySpark compatibility issues covered in this guide:

✅ String support in order_by() - Fixed in v0.16.0
✅ Column object support in drop() - Fixed in v0.16.0
✅ Window functions in withColumn() - Fixed in v0.16.0

Result: Moltres provides high PySpark API compatibility for core DataFrame operations, with documented exceptions listed above and in the migration guide.

See the Moltres vs PySpark comparison for the complete roadmap.

Last Updated: 2026-07-07
Moltres Version: 1.1.0
