Performance Tuning Guide
This guide covers performance optimization strategies for Moltres queries and best practices for achieving optimal performance.
Overview
Moltres compiles DataFrame operations to SQL and executes them in your database. Performance depends on:
Database engine - The underlying SQL database’s optimization capabilities
Query structure - How operations are composed and compiled
Indexes - Database indexes on frequently queried columns
Connection pooling - Efficient connection management
Query patterns - How queries are structured and executed
SQL Pushdown
Moltres pushes operations down to SQL whenever possible, which means:
Filtering happens in the database:
df.where(col("age") > 18)→WHERE age > 18Aggregations execute in the database:
df.group_by("category").agg(sum("amount"))→GROUP BY category, SUM(amount)Joins are compiled to SQL JOINs:
df1.join(df2, on="id")→INNER JOIN ... ON ...Sorting uses database ORDER BY:
df.order_by(col("name"))→ORDER BY name
This means most operations are executed efficiently in the database, not in Python.
Performance Best Practices
1. Use Indexes
Create indexes on columns used in:
WHERE clauses (filtering)
JOIN conditions
ORDER BY clauses
GROUP BY columns
# Create index for frequently filtered column
db.execute("CREATE INDEX idx_users_age ON users(age)")
# Create index for join column
db.execute("CREATE INDEX idx_orders_customer_id ON orders(customer_id)")
# Composite index for multiple columns
db.execute("CREATE INDEX idx_sales_region_date ON sales(region, date)")
2. Filter Early
Apply filters as early as possible to reduce data scanned:
# Good: Filter before join
df = (
db.table("orders")
.select()
.where(col("status") == "active") # Filter early
.join(customers, on=[col("orders.customer_id") == col("customers.id")])
)
# Less efficient: Join before filter
df = (
db.table("orders")
.select()
.join(customers, on=[col("orders.customer_id") == col("customers.id")])
.where(col("status") == "active") # Filter after join
)
3. Select Only Needed Columns
Avoid SELECT * when you only need specific columns:
# Good: Select only needed columns
df = db.table("users").select("id", "name", "email")
# Less efficient: Select all columns
df = db.table("users").select() # SELECT *
4. Use LIMIT for Exploratory Queries
When exploring data, use limit() to avoid fetching large result sets:
# Good: Limit results during exploration
df = db.table("orders").select().limit(100)
# Avoid: Fetching all rows
df = db.table("orders").select() # May return millions of rows
5. Optimize Joins
Use appropriate join types (INNER vs LEFT)
Ensure join columns are indexed
Consider join order for complex queries
# Ensure join columns are indexed
db.execute("CREATE INDEX idx_orders_customer_id ON orders(customer_id)")
db.execute("CREATE INDEX idx_customers_id ON customers(id)")
# Use appropriate join type
df = customers.join(orders, on=[col("customers.id") == col("orders.customer_id")], how="inner")
6. Use Aggregations Efficiently
Group by indexed columns when possible:
# Create index on grouping column
db.execute("CREATE INDEX idx_orders_category ON orders(category)")
# Group by indexed column
df = (
db.table("orders")
.select()
.group_by("category")
.agg(sum(col("amount")).alias("total"))
)
7. Connection Pooling
Configure connection pooling for better performance:
db = connect(
"postgresql://user:pass@host/dbname",
pool_size=10, # Number of connections to maintain
max_overflow=5, # Additional connections allowed
pool_timeout=30, # Timeout for getting connection
pool_recycle=3600, # Recycle connections after 1 hour
pool_pre_ping=True, # Verify connections before use
)
8. Use Streaming for Large Results
For large result sets, use streaming to avoid loading everything into memory:
# Stream results instead of loading all at once
for chunk in df.collect(stream=True):
process_chunk(chunk)
9. Batch Operations
Use batch inserts/updates when modifying data:
# Batch insert is more efficient than individual inserts
table.insert_many([
{"id": 1, "name": "Alice"},
{"id": 2, "name": "Bob"},
# ... many rows
])
10. Query Timeout
Set appropriate query timeouts to prevent long-running queries:
# Set timeout via configuration
db = connect(
"postgresql://user:pass@host/dbname",
query_timeout=30.0 # 30 seconds
)
# Or via environment variable
# MOLTRES_QUERY_TIMEOUT=30.0
Performance Monitoring
Moltres provides performance monitoring hooks:
from moltres.engine import register_performance_hook
def log_slow_queries(sql: str, elapsed: float, metadata: dict):
if elapsed > 1.0: # Log queries taking more than 1 second
print(f"Slow query ({elapsed:.2f}s): {sql[:200]}")
print(f" Rows: {metadata.get('rowcount', 'N/A')}")
register_performance_hook("query_end", log_slow_queries)
Database-Specific Optimizations
PostgreSQL
Use
EXPLAIN ANALYZEto understand query plansConsider partitioning for large tables
Use materialized views for complex aggregations
Enable query statistics:
pg_stat_statements
# View query plan
df = db.table("orders").select().where(col("amount") > 100)
sql = df.to_sql()
db.execute(f"EXPLAIN ANALYZE {sql}")
SQLite
Use WAL mode for better concurrency:
PRAGMA journal_mode=WALIncrease cache size:
PRAGMA cache_size = -64000(64MB)Use appropriate page size:
PRAGMA page_size = 4096
db.execute("PRAGMA journal_mode=WAL")
db.execute("PRAGMA cache_size = -64000")
MySQL
Use InnoDB engine for better performance
Configure buffer pool size appropriately
Use query cache for read-heavy workloads
Common Performance Issues
Issue 1: Full Table Scans
Symptom: Queries are slow even with small result sets
Solution: Add indexes on filtered columns
# Check if index exists
db.execute("CREATE INDEX IF NOT EXISTS idx_users_email ON users(email)")
# Use indexed column in filter
df = db.table("users").select().where(col("email") == "user@example.com")
Issue 2: Large Result Sets
Symptom: Memory usage spikes when collecting results
Solution: Use streaming or add LIMIT
# Use streaming
for chunk in df.collect(stream=True):
process_chunk(chunk)
# Or add limit
df = df.limit(1000)
Issue 3: N+1 Query Problem
Symptom: Many small queries instead of one large query
Solution: Use joins instead of loops
# Bad: N+1 queries
for customer in customers:
orders = db.table("orders").select().where(col("customer_id") == customer["id"])
# Good: Single query with join
df = customers.join(orders, on=[col("customers.id") == col("orders.customer_id")])
Issue 4: Inefficient Aggregations
Symptom: Aggregations are slow
Solution: Ensure grouping columns are indexed
# Create index on grouping column
db.execute("CREATE INDEX idx_sales_category ON sales(category)")
# Group by indexed column
df = db.table("sales").select().group_by("category").agg(sum(col("amount")))
Benchmarking
To benchmark Moltres queries:
import time
start = time.perf_counter()
results = df.collect()
elapsed = time.perf_counter() - start
print(f"Query took {elapsed:.3f} seconds")
print(f"Returned {len(results)} rows")
Performance Comparison
Moltres performance characteristics:
SQL Pushdown: Operations execute in the database, not in Python
Lazy Evaluation: Queries are optimized before execution
Connection Pooling: Efficient connection reuse
Streaming: Support for large result sets without memory issues
Compared to:
Pandas: Moltres is faster for large datasets (database handles processing)
PySpark: Moltres has lower overhead (no cluster setup)
Raw SQL: Similar performance (Moltres compiles to SQL)
Summary
Index frequently queried columns
Filter early in the query chain
Select only needed columns
Use LIMIT for exploration
Optimize joins with indexes
Configure connection pooling
Use streaming for large results
Monitor slow queries
Use database-specific optimizations
Benchmark and profile queries
For more information, see: