# Best Practices Guide This guide provides recommendations and patterns for using Moltres effectively in production. ## Code Organization ### 1. Connection Management **Best Practice:** Create database connections at the application level and reuse them. ```python # Good: Single connection per application db = connect("postgresql://user:pass@host/dbname") # Use the same connection throughout def get_users(): return db.table("users").select().collect() def get_orders(): return db.table("orders").select().collect() ``` **Avoid:** Creating new connections for each query ```python # Bad: New connection for each query def get_users(): db = connect("postgresql://user:pass@host/dbname") # Don't do this return db.table("users").select().collect() ``` ### 2. Query Composition **Best Practice:** Compose queries step by step for readability ```python # Good: Clear, readable query composition df = ( db.table("orders") .select("id", "customer_id", "amount", "status") .where(col("status") == "active") .where(col("amount") > 100) .order_by(col("amount").desc()) .limit(100) ) ``` **Avoid:** Overly complex one-liners ```python # Less readable: Everything in one line df = db.table("orders").select("id", "customer_id", "amount", "status").where(col("status") == "active").where(col("amount") > 100).order_by(col("amount").desc()).limit(100) ``` ### 3. Column References **Best Practice:** Use `col()` consistently for column references ```python from moltres import col # Good: Explicit column references df = ( db.table("users") .select(col("id"), col("name"), col("email")) .where(col("active") == True) ) ``` **Avoid:** Mixing string and column references inconsistently ```python # Less clear: Mixed usage df = ( db.table("users") .select("id", "name", col("email")) # Inconsistent .where(col("active") == True) ) ``` ## Error Handling ### 1. Handle Database Errors **Best Practice:** Catch and handle specific exceptions ```python from moltres.utils.exceptions import ( ExecutionError, CompilationError, DatabaseConnectionError, QueryTimeoutError, ) try: results = df.collect() except QueryTimeoutError as e: print(f"Query timed out: {e}") # Handle timeout except ExecutionError as e: print(f"Query failed: {e}") print(f"Suggestion: {e.suggestion}") # Handle execution error except DatabaseConnectionError as e: print(f"Connection failed: {e}") # Handle connection error ``` ### 2. Validate Inputs **Best Practice:** Validate inputs before building queries ```python def get_user_orders(user_id: int, limit: int = 100): if not isinstance(user_id, int) or user_id < 0: raise ValueError("user_id must be a positive integer") if limit < 0 or limit > 1000: raise ValueError("limit must be between 0 and 1000") return ( db.table("orders") .select() .where(col("user_id") == user_id) .limit(limit) .collect() ) ``` ## Security ### 1. SQL Injection Prevention **Best Practice:** Use column expressions, not string concatenation ```python # Good: Safe - uses parameterized queries user_id = 123 df = db.table("users").select().where(col("id") == user_id) # Bad: Vulnerable to SQL injection user_id = "123; DROP TABLE users;" sql = f"SELECT * FROM users WHERE id = {user_id}" # DON'T DO THIS ``` ### 2. Connection String Security **Best Practice:** Use environment variables for credentials ```python import os # Good: Credentials from environment dsn = os.environ.get("MOLTRES_DSN") db = connect(dsn) # Avoid: Hardcoded credentials db = connect("postgresql://user:password@host/dbname") # Don't commit this ``` ## Performance ### 1. Use Indexes **Best Practice:** Create indexes on frequently queried columns ```python # Create indexes for common queries db.execute("CREATE INDEX IF NOT EXISTS idx_users_email ON users(email)") db.execute("CREATE INDEX IF NOT EXISTS idx_orders_customer_id ON orders(customer_id)") ``` ### 2. Filter Early **Best Practice:** Apply filters as early as possible ```python # Good: Filter before join df = ( db.table("orders") .select() .where(col("status") == "active") # Filter early .join(customers, on=[col("orders.customer_id") == col("customers.id")]) ) ``` ### 3. Select Only Needed Columns **Best Practice:** Avoid `SELECT *` when possible ```python # Good: Select only needed columns df = db.table("users").select("id", "name", "email") # Less efficient: Select all columns df = db.table("users").select() # SELECT * ``` ### 4. Use Streaming for Large Results **Best Practice:** Stream large result sets ```python # Good: Stream large results for chunk in df.collect(stream=True): process_chunk(chunk) # Avoid: Loading everything into memory results = df.collect() # May cause memory issues ``` ## Testing ### 1. Test Query Generation **Best Practice:** Verify SQL generation during development ```python # Check generated SQL df = db.table("users").select().where(col("age") > 18) sql = df.to_sql() print(sql) # Verify the SQL is correct ``` ### 2. Use Test Databases **Best Practice:** Use separate test databases ```python # Test database test_db = connect("sqlite:///test.db") # Production database prod_db = connect("postgresql://user:pass@host/prod_db") ``` ### 3. Test Error Handling **Best Practice:** Test error scenarios ```python import pytest from moltres.utils.exceptions import ExecutionError def test_invalid_query(): df = db.table("nonexistent").select() with pytest.raises(ExecutionError): df.collect() ``` ## Configuration ### 1. Environment Variables **Best Practice:** Use environment variables for configuration ```bash # .env file or environment export MOLTRES_DSN="postgresql://user:pass@host/dbname" export MOLTRES_POOL_SIZE=10 export MOLTRES_QUERY_TIMEOUT=30.0 ``` ```python # Application code db = connect() # Uses MOLTRES_DSN from environment ``` ### 2. Connection Pooling **Best Practice:** Configure connection pooling appropriately ```python db = connect( "postgresql://user:pass@host/dbname", pool_size=10, # Match your workload max_overflow=5, # Allow some overflow pool_pre_ping=True, # Verify connections ) ``` ## Code Patterns ### 1. Query Builders **Best Practice:** Create reusable query builders ```python def build_user_query(active_only: bool = False, min_age: int = None): df = db.table("users").select("id", "name", "email") if active_only: df = df.where(col("active") == True) if min_age is not None: df = df.where(col("age") >= min_age) return df # Use the builder active_users = build_user_query(active_only=True, min_age=18).collect() ``` ### 2. CTEs for Complex Queries **Best Practice:** Use CTEs for complex queries ```python # Good: Use CTEs for readability active_orders = ( db.table("orders") .select() .where(col("status") == "active") .cte("active_orders") ) high_value = ( active_orders .select() .where(col("amount") > 1000) .cte("high_value") ) result = high_value.select().collect() ``` ### 3. Type Hints **Best Practice:** Use type hints for better IDE support ```python from typing import List, Dict from moltres import DataFrame def get_active_users() -> List[Dict[str, object]]: df: DataFrame = ( db.table("users") .select() .where(col("active") == True) ) return df.collect() ``` ## Documentation ### 1. Document Complex Queries **Best Practice:** Add comments for complex logic ```python # Calculate monthly revenue by category # Filters active orders, groups by month and category, # and calculates total revenue df = ( db.table("orders") .select() .where(col("status") == "active") .where(col("date") >= "2024-01-01") .group_by( func.date_trunc("month", col("date")).alias("month"), col("category") ) .agg(sum(col("amount")).alias("revenue")) ) ``` ### 2. Document Schema Assumptions **Best Practice:** Document expected schema ```python def get_user_orders(user_id: int): """ Get orders for a user. Assumes: - orders table has columns: id, user_id, amount, status, date - user_id column is indexed for performance """ return ( db.table("orders") .select() .where(col("user_id") == user_id) .collect() ) ``` ## Common Patterns ### Pattern 1: Pagination ```python def get_paginated_results(page: int, page_size: int = 20): offset = page * page_size return ( db.table("items") .select() .order_by(col("id")) .limit(page_size) .offset(offset) # If supported .collect() ) ``` ### Pattern 2: Conditional Filtering ```python def search_users(name: str = None, email: str = None, active: bool = None): df = db.table("users").select() if name: df = df.where(col("name").like(f"%{name}%")) if email: df = df.where(col("email") == email) if active is not None: df = df.where(col("active") == active) return df.collect() ``` ### Pattern 3: Aggregations with Conditions ```python from moltres.expressions.functions import sum, when # Sum with conditions df = ( db.table("orders") .select() .group_by("category") .agg( sum(when(col("status") == "active", col("amount"), 0)).alias("active_revenue"), sum(col("amount")).alias("total_revenue") ) ) ``` ## Summary 1. **Reuse database connections** 2. **Compose queries clearly** 3. **Use `col()` consistently** 4. **Handle errors appropriately** 5. **Validate inputs** 6. **Prevent SQL injection** 7. **Use environment variables for secrets** 8. **Create indexes for performance** 9. **Filter early in queries** 10. **Select only needed columns** 11. **Use streaming for large results** 12. **Test query generation** 13. **Use type hints** 14. **Document complex queries** 15. **Follow common patterns** For more information, see: - [Performance Guide](./PERFORMANCE.md) - [Troubleshooting Guide](./TROUBLESHOOTING.md) - [Migration Guide](./MIGRATION_SPARK.md)