Frequently Asked Questions (FAQ)

Common questions about Moltres and their answers.

General Questions

What is Moltres?

Moltres is a Python library that provides a PySpark-like DataFrame API for SQL databases. It compiles DataFrame operations to SQL and executes them in your database, providing SQL pushdown without requiring a Spark cluster.

How is Moltres different from PySpark?

  • No cluster required: Moltres works with any SQL database, no Spark cluster needed

  • SQL pushdown: Operations are compiled to SQL and executed in the database

  • Simpler deployment: Just connect to your existing database

  • Lower overhead: No cluster management or resource allocation

How is Moltres different from Pandas?

  • SQL execution: Moltres executes queries in the database, not in Python memory

  • Lazy evaluation: Queries are optimized before execution

  • Scalability: Can handle larger datasets (limited by database, not Python memory)

  • SQL pushdown: Operations compile to SQL for better performance

Which databases are supported?

Moltres supports any database with a SQLAlchemy driver, including:

  • SQLite

  • PostgreSQL

  • MySQL

  • SQL Server

  • Oracle

  • And many others

Usage Questions

How do I connect to a database?

from moltres import connect

# SQLite
db = connect("sqlite:///example.db")

# PostgreSQL
db = connect("postgresql://user:pass@host/dbname")

# MySQL
db = connect("mysql://user:pass@host/dbname")

How do I read data from a table?

# Select all columns
df = db.table("users").select()

# Select specific columns
df = db.table("users").select("id", "name", "email")

# With filtering
from moltres import col
df = db.table("users").select().where(col("active") == True)

How do I read data from files?

# CSV
df = db.load.csv("data.csv")

# JSON
df = db.load.json("data.json")

# Parquet
df = db.load.parquet("data.parquet")

# With options
df = db.load.csv("data.csv", header=True, delimiter=",")

How do I filter rows?

from moltres import col

# Single condition
df = db.table("users").select().where(col("age") > 18)

# Multiple conditions
df = db.table("users").select().where(
    (col("age") > 18) & (col("active") == True)
)

How do I join tables?

customers = db.table("customers").select()
orders = db.table("orders").select()

# Inner join
from moltres import col
df = customers.join(orders, on=[col("customers.id") == col("orders.customer_id")], how="inner")

# Left join
df = customers.join(orders, on=[col("customers.id") == col("orders.customer_id")], how="left")

How do I aggregate data?

from moltres import col
from moltres.expressions.functions import sum, avg, count

df = (
    db.table("orders")
    .select()
    .group_by("category")
    .agg(
        sum(col("amount")).alias("total"),
        avg(col("amount")).alias("average"),
        count("*").alias("count")
    )
)

How do I execute a query and get results?

# Collect all results
results = df.collect()  # Returns list of dicts

# Stream results (for large datasets)
for chunk in df.collect(stream=True):
    process_chunk(chunk)

# Show first few rows
df.show(10)

How do I see the generated SQL?

sql = df.to_sql()
print(sql)

Performance Questions

Why is my query slow?

Common causes:

  1. Missing indexes - Create indexes on filtered/joined columns

  2. Full table scans - Add WHERE clauses to filter early

  3. Large result sets - Use LIMIT or streaming

  4. Inefficient joins - Ensure join columns are indexed

See Performance Guide for details.

How do I optimize queries?

  1. Create indexes on frequently queried columns

  2. Filter early in the query chain

  3. Select only needed columns

  4. Use LIMIT for exploration

  5. Configure connection pooling

See Performance Guide for more tips.

Does Moltres cache results?

No, Moltres doesn’t cache results. Each query executes fresh in the database. For caching, use:

  • Database query cache (if supported)

  • Application-level caching (Redis, Memcached)

  • Materialized views in the database

Error Questions

What does “CompilationError” mean?

A CompilationError means Moltres couldn’t compile your DataFrame operation to SQL. Common causes:

  • Unsupported operation for your SQL dialect

  • Invalid column references

  • Missing join conditions

Check the error message and suggestion for details.

What does “ExecutionError” mean?

An ExecutionError means the SQL query failed to execute. Common causes:

  • Table/column doesn’t exist

  • SQL syntax error

  • Data type mismatch

  • Constraint violation

Check the error message and suggestion for details.

How do I debug query errors?

  1. Use df.to_sql() to see the generated SQL

  2. Check the error message and suggestion

  3. Verify table/column names exist

  4. Test the SQL directly in your database

  5. See Debugging Guide for more help

Configuration Questions

How do I configure connection pooling?

db = connect(
    "postgresql://user:pass@host/dbname",
    pool_size=10,        # Number of connections
    max_overflow=5,      # Additional connections
    pool_timeout=30,     # Timeout in seconds
    pool_pre_ping=True,  # Verify connections
)

How do I set query timeout?

# Via configuration
db = connect(
    "postgresql://user:pass@host/dbname",
    query_timeout=30.0  # 30 seconds
)

# Via environment variable
# MOLTRES_QUERY_TIMEOUT=30.0

How do I enable SQL logging?

db = connect(
    "postgresql://user:pass@host/dbname",
    echo=True  # Enable SQL logging
)

Feature Questions

Does Moltres support UDFs (User-Defined Functions)?

No, Moltres doesn’t support Python UDFs since operations are pushed down to SQL. Instead:

  • Use SQL functions available in your database

  • Use Moltres expression functions (see moltres.expressions.functions)

  • For complex logic, use raw SQL expressions

Does Moltres support window functions?

Yes! Moltres supports window functions:

from moltres.expressions.window import Window
from moltres.expressions.functions import row_number, rank

window = Window.partition_by(col("category")).order_by(col("amount").desc())
df = df.with_column("rank", rank().over(window))

Does Moltres support recursive queries?

Yes! Use recursive CTEs:

initial = db.table("seed").select(...)
recursive = initial.select(...)  # References CTE
df = initial.recursive_cte("name", recursive)

Does Moltres support LATERAL joins?

Yes, for PostgreSQL and MySQL 8.0+:

df = customers.join(
    orders.select().where(col("customer_id") == col("customers.id")),
    how="left",
    lateral=True
)

Can I use raw SQL?

Yes, you can execute raw SQL:

# Execute raw SQL
results = db.execute("SELECT * FROM users WHERE age > 18")

# Use in subqueries
subquery = db.execute("SELECT id FROM active_users")
df = db.table("orders").select().where(col("user_id").isin(subquery))

Migration Questions

How do I migrate from PySpark?

See the Migration Guide for detailed instructions.

Key differences:

  • Replace SparkSession with connect()

  • Update imports (pyspark.sql.functionsmoltres.expressions.functions)

  • Use col() consistently

  • Replace UDFs with SQL functions

How do I migrate from Pandas?

  1. Replace pd.read_csv() with db.load.csv()

  2. Replace pandas operations with Moltres DataFrame operations

  3. Use df.collect() to get results as list of dicts

  4. Convert to pandas if needed: pd.DataFrame(df.collect())

Troubleshooting

My query returns no results

  1. Check if the table has data: db.table("users").select().count()

  2. Verify filter conditions: df.to_sql() to see SQL

  3. Check for NULL values: Use isnull() / isnotnull()

My query is too slow

  1. Check indexes: EXPLAIN the query

  2. Filter early: Apply WHERE clauses before joins

  3. Use LIMIT: Test with small result sets first

  4. See Performance Guide

I get “table does not exist” error

  1. Verify table name spelling

  2. Check database connection

  3. Ensure table is in the correct schema

  4. Use db.execute("SHOW TABLES") to list tables

Column names are case-sensitive

SQL column names may be case-sensitive depending on your database:

  • SQLite: Case-insensitive

  • PostgreSQL: Case-sensitive (use quotes)

  • MySQL: Depends on configuration

Use exact column names as they appear in your database.

Getting Help

  • Documentation: See README and other guides

  • Examples: See Examples Guide

  • Issues: Report on GitHub

  • Questions: Check this FAQ first

For more help: