# Frequently Asked Questions (FAQ) Common questions about Moltres and their answers. ## General Questions ### What is Moltres? Moltres is a Python library that provides a PySpark-like DataFrame API for SQL databases. It compiles DataFrame operations to SQL and executes them in your database, providing SQL pushdown without requiring a Spark cluster. ### How is Moltres different from PySpark? - **No cluster required:** Moltres works with any SQL database, no Spark cluster needed - **SQL pushdown:** Operations are compiled to SQL and executed in the database - **Simpler deployment:** Just connect to your existing database - **Lower overhead:** No cluster management or resource allocation ### How is Moltres different from Pandas? - **SQL execution:** Moltres executes queries in the database, not in Python memory - **Lazy evaluation:** Queries are optimized before execution - **Scalability:** Can handle larger datasets (limited by database, not Python memory) - **SQL pushdown:** Operations compile to SQL for better performance ### Which databases are supported? Moltres supports any database with a SQLAlchemy driver, including: - SQLite - PostgreSQL - MySQL - SQL Server - Oracle - And many others ## Usage Questions ### How do I connect to a database? ```python from moltres import connect # SQLite db = connect("sqlite:///example.db") # PostgreSQL db = connect("postgresql://user:pass@host/dbname") # MySQL db = connect("mysql://user:pass@host/dbname") ``` ### How do I read data from a table? ```python # Select all columns df = db.table("users").select() # Select specific columns df = db.table("users").select("id", "name", "email") # With filtering from moltres import col df = db.table("users").select().where(col("active") == True) ``` ### How do I read data from files? ```python # CSV df = db.load.csv("data.csv") # JSON df = db.load.json("data.json") # Parquet df = db.load.parquet("data.parquet") # With options df = db.load.csv("data.csv", header=True, delimiter=",") ``` ### How do I filter rows? ```python from moltres import col # Single condition df = db.table("users").select().where(col("age") > 18) # Multiple conditions df = db.table("users").select().where( (col("age") > 18) & (col("active") == True) ) ``` ### How do I join tables? ```python customers = db.table("customers").select() orders = db.table("orders").select() # Inner join from moltres import col df = customers.join(orders, on=[col("customers.id") == col("orders.customer_id")], how="inner") # Left join df = customers.join(orders, on=[col("customers.id") == col("orders.customer_id")], how="left") ``` ### How do I aggregate data? ```python from moltres import col from moltres.expressions.functions import sum, avg, count df = ( db.table("orders") .select() .group_by("category") .agg( sum(col("amount")).alias("total"), avg(col("amount")).alias("average"), count("*").alias("count") ) ) ``` ### How do I execute a query and get results? ```python # Collect all results results = df.collect() # Returns list of dicts # Stream results (for large datasets) for chunk in df.collect(stream=True): process_chunk(chunk) # Show first few rows df.show(10) ``` ### How do I see the generated SQL? ```python sql = df.to_sql() print(sql) ``` ## Performance Questions ### Why is my query slow? Common causes: 1. **Missing indexes** - Create indexes on filtered/joined columns 2. **Full table scans** - Add WHERE clauses to filter early 3. **Large result sets** - Use LIMIT or streaming 4. **Inefficient joins** - Ensure join columns are indexed See [Performance Guide](./PERFORMANCE.md) for details. ### How do I optimize queries? 1. Create indexes on frequently queried columns 2. Filter early in the query chain 3. Select only needed columns 4. Use LIMIT for exploration 5. Configure connection pooling See [Performance Guide](./PERFORMANCE.md) for more tips. ### Does Moltres cache results? No, Moltres doesn't cache results. Each query executes fresh in the database. For caching, use: - Database query cache (if supported) - Application-level caching (Redis, Memcached) - Materialized views in the database ## Error Questions ### What does "CompilationError" mean? A `CompilationError` means Moltres couldn't compile your DataFrame operation to SQL. Common causes: - Unsupported operation for your SQL dialect - Invalid column references - Missing join conditions Check the error message and suggestion for details. ### What does "ExecutionError" mean? An `ExecutionError` means the SQL query failed to execute. Common causes: - Table/column doesn't exist - SQL syntax error - Data type mismatch - Constraint violation Check the error message and suggestion for details. ### How do I debug query errors? 1. Use `df.to_sql()` to see the generated SQL 2. Check the error message and suggestion 3. Verify table/column names exist 4. Test the SQL directly in your database 5. See [Debugging Guide](./DEBUGGING.md) for more help ## Configuration Questions ### How do I configure connection pooling? ```python db = connect( "postgresql://user:pass@host/dbname", pool_size=10, # Number of connections max_overflow=5, # Additional connections pool_timeout=30, # Timeout in seconds pool_pre_ping=True, # Verify connections ) ``` ### How do I set query timeout? ```python # Via configuration db = connect( "postgresql://user:pass@host/dbname", query_timeout=30.0 # 30 seconds ) # Via environment variable # MOLTRES_QUERY_TIMEOUT=30.0 ``` ### How do I enable SQL logging? ```python db = connect( "postgresql://user:pass@host/dbname", echo=True # Enable SQL logging ) ``` ## Feature Questions ### Does Moltres support UDFs (User-Defined Functions)? No, Moltres doesn't support Python UDFs since operations are pushed down to SQL. Instead: - Use SQL functions available in your database - Use Moltres expression functions (see `moltres.expressions.functions`) - For complex logic, use raw SQL expressions ### Does Moltres support window functions? Yes! Moltres supports window functions: ```python from moltres.expressions.window import Window from moltres.expressions.functions import row_number, rank window = Window.partition_by(col("category")).order_by(col("amount").desc()) df = df.with_column("rank", rank().over(window)) ``` ### Does Moltres support recursive queries? Yes! Use recursive CTEs: ```python initial = db.table("seed").select(...) recursive = initial.select(...) # References CTE df = initial.recursive_cte("name", recursive) ``` ### Does Moltres support LATERAL joins? Yes, for PostgreSQL and MySQL 8.0+: ```python df = customers.join( orders.select().where(col("customer_id") == col("customers.id")), how="left", lateral=True ) ``` ### Can I use raw SQL? Yes, you can execute raw SQL: ```python # Execute raw SQL results = db.execute("SELECT * FROM users WHERE age > 18") # Use in subqueries subquery = db.execute("SELECT id FROM active_users") df = db.table("orders").select().where(col("user_id").isin(subquery)) ``` ## Migration Questions ### How do I migrate from PySpark? See the [Migration Guide](./MIGRATION_SPARK.md) for detailed instructions. Key differences: - Replace `SparkSession` with `connect()` - Update imports (`pyspark.sql.functions` → `moltres.expressions.functions`) - Use `col()` consistently - Replace UDFs with SQL functions ### How do I migrate from Pandas? 1. Replace `pd.read_csv()` with `db.load.csv()` 2. Replace pandas operations with Moltres DataFrame operations 3. Use `df.collect()` to get results as list of dicts 4. Convert to pandas if needed: `pd.DataFrame(df.collect())` ## Troubleshooting ### My query returns no results 1. Check if the table has data: `db.table("users").select().count()` 2. Verify filter conditions: `df.to_sql()` to see SQL 3. Check for NULL values: Use `isnull()` / `isnotnull()` ### My query is too slow 1. Check indexes: `EXPLAIN` the query 2. Filter early: Apply WHERE clauses before joins 3. Use LIMIT: Test with small result sets first 4. See [Performance Guide](./PERFORMANCE.md) ### I get "table does not exist" error 1. Verify table name spelling 2. Check database connection 3. Ensure table is in the correct schema 4. Use `db.execute("SHOW TABLES")` to list tables ### Column names are case-sensitive SQL column names may be case-sensitive depending on your database: - SQLite: Case-insensitive - PostgreSQL: Case-sensitive (use quotes) - MySQL: Depends on configuration Use exact column names as they appear in your database. ## Getting Help - **Documentation:** See [README](../README.md) and other guides - **Examples:** See [Examples Guide](./EXAMPLES.md) - **Issues:** Report on GitHub - **Questions:** Check this FAQ first For more help: - [Performance Guide](./PERFORMANCE.md) - [Best Practices](./BEST_PRACTICES.md) - [Debugging Guide](./DEBUGGING.md) - [Troubleshooting Guide](./TROUBLESHOOTING.md)