# Moltres Package Plan ```{admonition} Archived :class: warning This document outlines an early package and architecture plan for Moltres. It is kept for maintainers and is not part of the primary user docs. ``` ## 1. Name **Moltres** – inspired by the legendary fire Pokémon, evoking speed, power, and a spark-like DataFrame API. ## 2. Project Goal Provide a **PySpark DataFrame API** that executes lazily on **real SQL databases**, supporting **select, joins, aggregations, inserts, updates, deletes**, without needing Spark. ## 3. High-Level Architecture ### Core Layers (Bottom → Top) 1. **Expression System**: Columns, literals, functions → produce a symbolic expression tree (like PySpark Column or Polars Expr) 2. **Logical Plan Builder**: select, join, filter, groupby → produce a logical plan tree (Project, Filter, Join, Aggregate, etc.) 3. **SQL Compiler**: Converts the logical plan into SQL (ANSI + dialect adaptations) 4. **Execution Engine**: Uses SQLAlchemy for DB connections, executes SQL, returns DataFrame (pandas/polars) 5. **Mutation Engine**: Supports `insert`, `update`, `delete` on underlying SQL tables ## 4. Package Directory Structure ```text moltres/ __init__.py config.py engine/ connection.py execution.py dialects.py expressions/ column.py expr.py functions.py logical/ plan.py operators.py dataframe/ dataframe.py groupby.py sql/ compiler.py builders.py io/ read.py write.py table/ table.py mutations.py utils/ exceptions.py typing.py inspector.py tests/ ... ``` ## 5. API Design ### Connect to DB ```python from moltres import connect db = connect("postgresql://user:pass@host/db") ``` ### Select & Filter ```python t = db.table("customers") df = ( t.select("id", "name", (col("spend") * 1.1).alias("adj_spend")) .where(col("active") == True) .order_by(col("created_at").desc()) ) df.show() ``` ### Joins ```python df = orders.join(customers, on="customer_id").select( customers["name"], orders["total"], ) ``` ### GroupBy & Aggregations ```python df = t.groupBy("country").agg( sum(col("spend")).alias("total_spend"), count("*").alias("n"), ) ``` ### Insert ```python t.insert(new_df) ``` ### Update ```python t.update( where=col("status") == "new", set={"status": lit("processed")} ) ``` ### Delete ```python t.delete(col("created_at") < "2024-01-01") ``` ## 6. Internal Components ### Expression System - Column expressions (`col("spend") + 1`, `col("country").like("%US%")`) - Functions (`sum`, `avg`, `upper`, `concat`, etc.) ### Logical Plan Nodes - Project, Filter, Aggregate, Join, Limit, Sort, TableScan ### SQL Compiler - Translates logical plan → SQL - Handles aliases, expressions, joins, groupings - Supports multiple SQL dialects ### Execution Engine - Uses SQLAlchemy for connections - Executes SQL - Returns pandas/polars DataFrame ### Mutation Engine - INSERT, UPDATE, DELETE - Returns row count or status ## 7. Development Roadmap ### Week 1: Foundation - Project structure - Column, Literal, basic expressions - TableScan - Minimal DataFrame wrapper ### Week 2: Logical Plan + Compiler - select / where / limit - Expression → SQL conversion - Execute queries → pandas ### Week 3: Full Query Support - joins - groupBy / aggregates - orderBy - Dialect support (Postgres + SQLite) ### Week 4: Mutation - insert DataFrame - update - delete ### Week 5: Polars Integration - Convert query results to polars - Accept polars DataFrames for insert ### Week 6: Stabilize API + Docs - Documentation (sphinx/mkdocs) - Examples - Benchmark suite - PyPI packaging ## 8. Testing Plan - Unit tests: expressions, logical plan, SQL compiler, mutation queries - Integration tests: SQLite + PostgreSQL - Performance tests: compare Sparklet vs raw SQL ## 9. Value Proposition - PySpark familiarity, SQL execution - No cluster required - Safer than writing dynamic SQL - Composable & testable - Wraps existing SQL databases - Lets analysts/engineers write PySpark-style code anywhere