Performance Tuning Guide

This guide covers performance optimization strategies for Moltres queries and best practices for achieving optimal performance.

Overview

Moltres compiles DataFrame operations to SQL and executes them in your database. Performance depends on:

  1. Database engine - The underlying SQL database’s optimization capabilities

  2. Query structure - How operations are composed and compiled

  3. Indexes - Database indexes on frequently queried columns

  4. Connection pooling - Efficient connection management

  5. Query patterns - How queries are structured and executed

SQL Pushdown

Moltres pushes operations down to SQL whenever possible, which means:

  • Filtering happens in the database: df.where(col("age") > 18)WHERE age > 18

  • Aggregations execute in the database: df.group_by("category").agg(sum("amount"))GROUP BY category, SUM(amount)

  • Joins are compiled to SQL JOINs: df1.join(df2, on="id")INNER JOIN ... ON ...

  • Sorting uses database ORDER BY: df.order_by(col("name"))ORDER BY name

This means most operations are executed efficiently in the database, not in Python.

Performance Best Practices

1. Use Indexes

Create indexes on columns used in:

  • WHERE clauses (filtering)

  • JOIN conditions

  • ORDER BY clauses

  • GROUP BY columns

# Create index for frequently filtered column
db.execute("CREATE INDEX idx_users_age ON users(age)")

# Create index for join column
db.execute("CREATE INDEX idx_orders_customer_id ON orders(customer_id)")

# Composite index for multiple columns
db.execute("CREATE INDEX idx_sales_region_date ON sales(region, date)")

2. Filter Early

Apply filters as early as possible to reduce data scanned:

# Good: Filter before join
df = (
    db.table("orders")
    .select()
    .where(col("status") == "active")  # Filter early
    .join(customers, on=[col("orders.customer_id") == col("customers.id")])
)

# Less efficient: Join before filter
df = (
    db.table("orders")
    .select()
    .join(customers, on=[col("orders.customer_id") == col("customers.id")])
    .where(col("status") == "active")  # Filter after join
)

3. Select Only Needed Columns

Avoid SELECT * when you only need specific columns:

# Good: Select only needed columns
df = db.table("users").select("id", "name", "email")

# Less efficient: Select all columns
df = db.table("users").select()  # SELECT *

4. Use LIMIT for Exploratory Queries

When exploring data, use limit() to avoid fetching large result sets:

# Good: Limit results during exploration
df = db.table("orders").select().limit(100)

# Avoid: Fetching all rows
df = db.table("orders").select()  # May return millions of rows

5. Optimize Joins

  • Use appropriate join types (INNER vs LEFT)

  • Ensure join columns are indexed

  • Consider join order for complex queries

# Ensure join columns are indexed
db.execute("CREATE INDEX idx_orders_customer_id ON orders(customer_id)")
db.execute("CREATE INDEX idx_customers_id ON customers(id)")

# Use appropriate join type
df = customers.join(orders, on=[col("customers.id") == col("orders.customer_id")], how="inner")

6. Use Aggregations Efficiently

Group by indexed columns when possible:

# Create index on grouping column
db.execute("CREATE INDEX idx_orders_category ON orders(category)")

# Group by indexed column
df = (
    db.table("orders")
    .select()
    .group_by("category")
    .agg(sum(col("amount")).alias("total"))
)

7. Connection Pooling

Configure connection pooling for better performance:

db = connect(
    "postgresql://user:pass@host/dbname",
    pool_size=10,        # Number of connections to maintain
    max_overflow=5,      # Additional connections allowed
    pool_timeout=30,     # Timeout for getting connection
    pool_recycle=3600,   # Recycle connections after 1 hour
    pool_pre_ping=True,  # Verify connections before use
)

8. Use Streaming for Large Results

For large result sets, use streaming to avoid loading everything into memory:

# Stream results instead of loading all at once
for chunk in df.collect(stream=True):
    process_chunk(chunk)

9. Batch Operations

Use batch inserts/updates when modifying data:

# Batch insert is more efficient than individual inserts
table.insert_many([
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    # ... many rows
])

10. Query Timeout

Set appropriate query timeouts to prevent long-running queries:

# Set timeout via configuration
db = connect(
    "postgresql://user:pass@host/dbname",
    query_timeout=30.0  # 30 seconds
)

# Or via environment variable
# MOLTRES_QUERY_TIMEOUT=30.0

Performance Monitoring

Moltres provides performance monitoring hooks:

from moltres.engine import register_performance_hook

def log_slow_queries(sql: str, elapsed: float, metadata: dict):
    if elapsed > 1.0:  # Log queries taking more than 1 second
        print(f"Slow query ({elapsed:.2f}s): {sql[:200]}")
        print(f"  Rows: {metadata.get('rowcount', 'N/A')}")

register_performance_hook("query_end", log_slow_queries)

Database-Specific Optimizations

PostgreSQL

  • Use EXPLAIN ANALYZE to understand query plans

  • Consider partitioning for large tables

  • Use materialized views for complex aggregations

  • Enable query statistics: pg_stat_statements

# View query plan
df = db.table("orders").select().where(col("amount") > 100)
sql = df.to_sql()
db.execute(f"EXPLAIN ANALYZE {sql}")

SQLite

  • Use WAL mode for better concurrency: PRAGMA journal_mode=WAL

  • Increase cache size: PRAGMA cache_size = -64000 (64MB)

  • Use appropriate page size: PRAGMA page_size = 4096

db.execute("PRAGMA journal_mode=WAL")
db.execute("PRAGMA cache_size = -64000")

MySQL

  • Use InnoDB engine for better performance

  • Configure buffer pool size appropriately

  • Use query cache for read-heavy workloads

Common Performance Issues

Issue 1: Full Table Scans

Symptom: Queries are slow even with small result sets

Solution: Add indexes on filtered columns

# Check if index exists
db.execute("CREATE INDEX IF NOT EXISTS idx_users_email ON users(email)")

# Use indexed column in filter
df = db.table("users").select().where(col("email") == "user@example.com")

Issue 2: Large Result Sets

Symptom: Memory usage spikes when collecting results

Solution: Use streaming or add LIMIT

# Use streaming
for chunk in df.collect(stream=True):
    process_chunk(chunk)

# Or add limit
df = df.limit(1000)

Issue 3: N+1 Query Problem

Symptom: Many small queries instead of one large query

Solution: Use joins instead of loops

# Bad: N+1 queries
for customer in customers:
    orders = db.table("orders").select().where(col("customer_id") == customer["id"])

# Good: Single query with join
df = customers.join(orders, on=[col("customers.id") == col("orders.customer_id")])

Issue 4: Inefficient Aggregations

Symptom: Aggregations are slow

Solution: Ensure grouping columns are indexed

# Create index on grouping column
db.execute("CREATE INDEX idx_sales_category ON sales(category)")

# Group by indexed column
df = db.table("sales").select().group_by("category").agg(sum(col("amount")))

Benchmarking

To benchmark Moltres queries:

import time

start = time.perf_counter()
results = df.collect()
elapsed = time.perf_counter() - start

print(f"Query took {elapsed:.3f} seconds")
print(f"Returned {len(results)} rows")

Performance Comparison

Moltres performance characteristics:

  • SQL Pushdown: Operations execute in the database, not in Python

  • Lazy Evaluation: Queries are optimized before execution

  • Connection Pooling: Efficient connection reuse

  • Streaming: Support for large result sets without memory issues

Compared to:

  • Pandas: Moltres is faster for large datasets (database handles processing)

  • PySpark: Moltres has lower overhead (no cluster setup)

  • Raw SQL: Similar performance (Moltres compiles to SQL)

Summary

  1. Index frequently queried columns

  2. Filter early in the query chain

  3. Select only needed columns

  4. Use LIMIT for exploration

  5. Optimize joins with indexes

  6. Configure connection pooling

  7. Use streaming for large results

  8. Monitor slow queries

  9. Use database-specific optimizations

  10. Benchmark and profile queries

For more information, see: