# Performance Tuning Guide This guide covers performance optimization strategies for Moltres queries and best practices for achieving optimal performance. ## Overview Moltres compiles DataFrame operations to SQL and executes them in your database. Performance depends on: 1. **Database engine** - The underlying SQL database's optimization capabilities 2. **Query structure** - How operations are composed and compiled 3. **Indexes** - Database indexes on frequently queried columns 4. **Connection pooling** - Efficient connection management 5. **Query patterns** - How queries are structured and executed ## SQL Pushdown Moltres pushes operations down to SQL whenever possible, which means: - **Filtering** happens in the database: `df.where(col("age") > 18)` → `WHERE age > 18` - **Aggregations** execute in the database: `df.group_by("category").agg(sum("amount"))` → `GROUP BY category, SUM(amount)` - **Joins** are compiled to SQL JOINs: `df1.join(df2, on="id")` → `INNER JOIN ... ON ...` - **Sorting** uses database ORDER BY: `df.order_by(col("name"))` → `ORDER BY name` This means most operations are executed efficiently in the database, not in Python. ## Performance Best Practices ### 1. Use Indexes Create indexes on columns used in: - **WHERE clauses** (filtering) - **JOIN conditions** - **ORDER BY** clauses - **GROUP BY** columns ```python # Create index for frequently filtered column db.execute("CREATE INDEX idx_users_age ON users(age)") # Create index for join column db.execute("CREATE INDEX idx_orders_customer_id ON orders(customer_id)") # Composite index for multiple columns db.execute("CREATE INDEX idx_sales_region_date ON sales(region, date)") ``` ### 2. Filter Early Apply filters as early as possible to reduce data scanned: ```python # Good: Filter before join df = ( db.table("orders") .select() .where(col("status") == "active") # Filter early .join(customers, on=[col("orders.customer_id") == col("customers.id")]) ) # Less efficient: Join before filter df = ( db.table("orders") .select() .join(customers, on=[col("orders.customer_id") == col("customers.id")]) .where(col("status") == "active") # Filter after join ) ``` ### 3. Select Only Needed Columns Avoid `SELECT *` when you only need specific columns: ```python # Good: Select only needed columns df = db.table("users").select("id", "name", "email") # Less efficient: Select all columns df = db.table("users").select() # SELECT * ``` ### 4. Use LIMIT for Exploratory Queries When exploring data, use `limit()` to avoid fetching large result sets: ```python # Good: Limit results during exploration df = db.table("orders").select().limit(100) # Avoid: Fetching all rows df = db.table("orders").select() # May return millions of rows ``` ### 5. Optimize Joins - Use appropriate join types (INNER vs LEFT) - Ensure join columns are indexed - Consider join order for complex queries ```python # Ensure join columns are indexed db.execute("CREATE INDEX idx_orders_customer_id ON orders(customer_id)") db.execute("CREATE INDEX idx_customers_id ON customers(id)") # Use appropriate join type df = customers.join(orders, on=[col("customers.id") == col("orders.customer_id")], how="inner") ``` ### 6. Use Aggregations Efficiently Group by indexed columns when possible: ```python # Create index on grouping column db.execute("CREATE INDEX idx_orders_category ON orders(category)") # Group by indexed column df = ( db.table("orders") .select() .group_by("category") .agg(sum(col("amount")).alias("total")) ) ``` ### 7. Connection Pooling Configure connection pooling for better performance: ```python db = connect( "postgresql://user:pass@host/dbname", pool_size=10, # Number of connections to maintain max_overflow=5, # Additional connections allowed pool_timeout=30, # Timeout for getting connection pool_recycle=3600, # Recycle connections after 1 hour pool_pre_ping=True, # Verify connections before use ) ``` ### 8. Use Streaming for Large Results For large result sets, use streaming to avoid loading everything into memory: ```python # Stream results instead of loading all at once for chunk in df.collect(stream=True): process_chunk(chunk) ``` ### 9. Batch Operations Use batch inserts/updates when modifying data: ```python # Batch insert is more efficient than individual inserts table.insert_many([ {"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}, # ... many rows ]) ``` ### 10. Query Timeout Set appropriate query timeouts to prevent long-running queries: ```python # Set timeout via configuration db = connect( "postgresql://user:pass@host/dbname", query_timeout=30.0 # 30 seconds ) # Or via environment variable # MOLTRES_QUERY_TIMEOUT=30.0 ``` ## Performance Monitoring Moltres provides performance monitoring hooks: ```python from moltres.engine import register_performance_hook def log_slow_queries(sql: str, elapsed: float, metadata: dict): if elapsed > 1.0: # Log queries taking more than 1 second print(f"Slow query ({elapsed:.2f}s): {sql[:200]}") print(f" Rows: {metadata.get('rowcount', 'N/A')}") register_performance_hook("query_end", log_slow_queries) ``` ## Database-Specific Optimizations ### PostgreSQL - Use `EXPLAIN ANALYZE` to understand query plans - Consider partitioning for large tables - Use materialized views for complex aggregations - Enable query statistics: `pg_stat_statements` ```python # View query plan df = db.table("orders").select().where(col("amount") > 100) sql = df.to_sql() db.execute(f"EXPLAIN ANALYZE {sql}") ``` ### SQLite - Use WAL mode for better concurrency: `PRAGMA journal_mode=WAL` - Increase cache size: `PRAGMA cache_size = -64000` (64MB) - Use appropriate page size: `PRAGMA page_size = 4096` ```python db.execute("PRAGMA journal_mode=WAL") db.execute("PRAGMA cache_size = -64000") ``` ### MySQL - Use InnoDB engine for better performance - Configure buffer pool size appropriately - Use query cache for read-heavy workloads ## Common Performance Issues ### Issue 1: Full Table Scans **Symptom:** Queries are slow even with small result sets **Solution:** Add indexes on filtered columns ```python # Check if index exists db.execute("CREATE INDEX IF NOT EXISTS idx_users_email ON users(email)") # Use indexed column in filter df = db.table("users").select().where(col("email") == "user@example.com") ``` ### Issue 2: Large Result Sets **Symptom:** Memory usage spikes when collecting results **Solution:** Use streaming or add LIMIT ```python # Use streaming for chunk in df.collect(stream=True): process_chunk(chunk) # Or add limit df = df.limit(1000) ``` ### Issue 3: N+1 Query Problem **Symptom:** Many small queries instead of one large query **Solution:** Use joins instead of loops ```python # Bad: N+1 queries for customer in customers: orders = db.table("orders").select().where(col("customer_id") == customer["id"]) # Good: Single query with join df = customers.join(orders, on=[col("customers.id") == col("orders.customer_id")]) ``` ### Issue 4: Inefficient Aggregations **Symptom:** Aggregations are slow **Solution:** Ensure grouping columns are indexed ```python # Create index on grouping column db.execute("CREATE INDEX idx_sales_category ON sales(category)") # Group by indexed column df = db.table("sales").select().group_by("category").agg(sum(col("amount"))) ``` ## Benchmarking To benchmark Moltres queries: ```python import time start = time.perf_counter() results = df.collect() elapsed = time.perf_counter() - start print(f"Query took {elapsed:.3f} seconds") print(f"Returned {len(results)} rows") ``` ## Performance Comparison Moltres performance characteristics: - **SQL Pushdown:** Operations execute in the database, not in Python - **Lazy Evaluation:** Queries are optimized before execution - **Connection Pooling:** Efficient connection reuse - **Streaming:** Support for large result sets without memory issues Compared to: - **Pandas:** Moltres is faster for large datasets (database handles processing) - **PySpark:** Moltres has lower overhead (no cluster setup) - **Raw SQL:** Similar performance (Moltres compiles to SQL) ## Summary 1. **Index frequently queried columns** 2. **Filter early in the query chain** 3. **Select only needed columns** 4. **Use LIMIT for exploration** 5. **Optimize joins with indexes** 6. **Configure connection pooling** 7. **Use streaming for large results** 8. **Monitor slow queries** 9. **Use database-specific optimizations** 10. **Benchmark and profile queries** For more information, see: - [Best Practices Guide](./BEST_PRACTICES.md) - [Troubleshooting Guide](./TROUBLESHOOTING.md) - [Deployment Guide](./DEPLOYMENT.md)