Performance Tuning Guide

This guide covers performance optimization strategies for Moltres queries and best practices for achieving optimal performance.

Overview

Moltres compiles DataFrame operations to SQL and executes them in your database. Performance depends on:

Database engine - The underlying SQL database’s optimization capabilities
Query structure - How operations are composed and compiled
Indexes - Database indexes on frequently queried columns
Connection pooling - Efficient connection management
Query patterns - How queries are structured and executed

SQL Pushdown

Moltres pushes operations down to SQL whenever possible, which means:

Filtering happens in the database: df.where(col("age") > 18) → WHERE age > 18
Aggregations execute in the database: df.group_by("category").agg(sum("amount")) → GROUP BY category, SUM(amount)
Joins are compiled to SQL JOINs: df1.join(df2, on="id") → INNER JOIN ... ON ...
Sorting uses database ORDER BY: df.order_by(col("name")) → ORDER BY name

This means most operations are executed efficiently in the database, not in Python.

Performance Best Practices

1. Use Indexes

Create indexes on columns used in:

WHERE clauses (filtering)
JOIN conditions
ORDER BY clauses
GROUP BY columns

# Create index for frequently filtered column
db.execute("CREATE INDEX idx_users_age ON users(age)")

# Create index for join column
db.execute("CREATE INDEX idx_orders_customer_id ON orders(customer_id)")

# Composite index for multiple columns
db.execute("CREATE INDEX idx_sales_region_date ON sales(region, date)")

2. Filter Early

Apply filters as early as possible to reduce data scanned:

# Good: Filter before join
df = (
    db.table("orders")
    .select()
    .where(col("status") == "active")  # Filter early
    .join(customers, on=[col("orders.customer_id") == col("customers.id")])
)

# Less efficient: Join before filter
df = (
    db.table("orders")
    .select()
    .join(customers, on=[col("orders.customer_id") == col("customers.id")])
    .where(col("status") == "active")  # Filter after join
)

3. Select Only Needed Columns

Avoid SELECT * when you only need specific columns:

# Good: Select only needed columns
df = db.table("users").select("id", "name", "email")

# Less efficient: Select all columns
df = db.table("users").select()  # SELECT *

4. Use LIMIT for Exploratory Queries

When exploring data, use limit() to avoid fetching large result sets:

# Good: Limit results during exploration
df = db.table("orders").select().limit(100)

# Avoid: Fetching all rows
df = db.table("orders").select()  # May return millions of rows

5. Optimize Joins

Use appropriate join types (INNER vs LEFT)
Ensure join columns are indexed
Consider join order for complex queries

# Ensure join columns are indexed
db.execute("CREATE INDEX idx_orders_customer_id ON orders(customer_id)")
db.execute("CREATE INDEX idx_customers_id ON customers(id)")

# Use appropriate join type
df = customers.join(orders, on=[col("customers.id") == col("orders.customer_id")], how="inner")

6. Use Aggregations Efficiently

Group by indexed columns when possible:

# Create index on grouping column
db.execute("CREATE INDEX idx_orders_category ON orders(category)")

# Group by indexed column
df = (
    db.table("orders")
    .select()
    .group_by("category")
    .agg(sum(col("amount")).alias("total"))
)

7. Connection Pooling

Configure connection pooling for better performance:

db = connect(
    "postgresql://user:pass@host/dbname",
    pool_size=10,        # Number of connections to maintain
    max_overflow=5,      # Additional connections allowed
    pool_timeout=30,     # Timeout for getting connection
    pool_recycle=3600,   # Recycle connections after 1 hour
    pool_pre_ping=True,  # Verify connections before use
)

8. Use Streaming for Large Results

For large result sets, use streaming to avoid loading everything into memory:

# Stream results instead of loading all at once
for chunk in df.collect(stream=True):
    process_chunk(chunk)

9. Batch Operations

Use batch inserts/updates when modifying data:

# Batch insert is more efficient than individual inserts
table.insert_many([
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    # ... many rows
])

10. Query Timeout

Set appropriate query timeouts to prevent long-running queries:

# Set timeout via configuration
db = connect(
    "postgresql://user:pass@host/dbname",
    query_timeout=30.0  # 30 seconds
)

# Or via environment variable
# MOLTRES_QUERY_TIMEOUT=30.0

Performance Monitoring

Moltres provides performance monitoring hooks:

from moltres.engine import register_performance_hook

def log_slow_queries(sql: str, elapsed: float, metadata: dict):
    if elapsed > 1.0:  # Log queries taking more than 1 second
        print(f"Slow query ({elapsed:.2f}s): {sql[:200]}")
        print(f"  Rows: {metadata.get('rowcount', 'N/A')}")

register_performance_hook("query_end", log_slow_queries)

Database-Specific Optimizations

PostgreSQL

Use EXPLAIN ANALYZE to understand query plans
Consider partitioning for large tables
Use materialized views for complex aggregations
Enable query statistics: pg_stat_statements

# View query plan
df = db.table("orders").select().where(col("amount") > 100)
sql = df.to_sql()
db.execute(f"EXPLAIN ANALYZE {sql}")

SQLite

Use WAL mode for better concurrency: PRAGMA journal_mode=WAL
Increase cache size: PRAGMA cache_size = -64000 (64MB)
Use appropriate page size: PRAGMA page_size = 4096

db.execute("PRAGMA journal_mode=WAL")
db.execute("PRAGMA cache_size = -64000")

MySQL

Use InnoDB engine for better performance
Configure buffer pool size appropriately
Use query cache for read-heavy workloads

Common Performance Issues

Issue 1: Full Table Scans

Symptom: Queries are slow even with small result sets

Solution: Add indexes on filtered columns

# Check if index exists
db.execute("CREATE INDEX IF NOT EXISTS idx_users_email ON users(email)")

# Use indexed column in filter
df = db.table("users").select().where(col("email") == "user@example.com")

Issue 2: Large Result Sets

Symptom: Memory usage spikes when collecting results

Solution: Use streaming or add LIMIT

# Use streaming
for chunk in df.collect(stream=True):
    process_chunk(chunk)

# Or add limit
df = df.limit(1000)

Issue 3: N+1 Query Problem

Symptom: Many small queries instead of one large query

Solution: Use joins instead of loops

# Bad: N+1 queries
for customer in customers:
    orders = db.table("orders").select().where(col("customer_id") == customer["id"])

# Good: Single query with join
df = customers.join(orders, on=[col("customers.id") == col("orders.customer_id")])

Issue 4: Inefficient Aggregations

Symptom: Aggregations are slow

Solution: Ensure grouping columns are indexed

# Create index on grouping column
db.execute("CREATE INDEX idx_sales_category ON sales(category)")

# Group by indexed column
df = db.table("sales").select().group_by("category").agg(sum(col("amount")))

Benchmarking

To benchmark Moltres queries:

import time

start = time.perf_counter()
results = df.collect()
elapsed = time.perf_counter() - start

print(f"Query took {elapsed:.3f} seconds")
print(f"Returned {len(results)} rows")

Performance Comparison

Moltres performance characteristics:

SQL Pushdown: Operations execute in the database, not in Python
Lazy Evaluation: Queries are optimized before execution
Connection Pooling: Efficient connection reuse
Streaming: Support for large result sets without memory issues

Compared to:

Pandas: Moltres is faster for large datasets (database handles processing)
PySpark: Moltres has lower overhead (no cluster setup)
Raw SQL: Similar performance (Moltres compiles to SQL)

Summary

Index frequently queried columns
Filter early in the query chain
Select only needed columns
Use LIMIT for exploration
Optimize joins with indexes
Configure connection pooling
Use streaming for large results
Monitor slow queries
Use database-specific optimizations
Benchmark and profile queries

For more information, see: