Moltres Package Plan

Archived

This document outlines an early package and architecture plan for Moltres. It is kept for maintainers and is not part of the primary user docs.

1. Name

Moltres – inspired by the legendary fire Pokémon, evoking speed, power, and a spark-like DataFrame API.

2. Project Goal

Provide a PySpark DataFrame API that executes lazily on real SQL databases, supporting select, joins, aggregations, inserts, updates, deletes, without needing Spark.

3. High-Level Architecture

Core Layers (Bottom → Top)

  1. Expression System: Columns, literals, functions → produce a symbolic expression tree (like PySpark Column or Polars Expr)

  2. Logical Plan Builder: select, join, filter, groupby → produce a logical plan tree (Project, Filter, Join, Aggregate, etc.)

  3. SQL Compiler: Converts the logical plan into SQL (ANSI + dialect adaptations)

  4. Execution Engine: Uses SQLAlchemy for DB connections, executes SQL, returns DataFrame (pandas/polars)

  5. Mutation Engine: Supports insert, update, delete on underlying SQL tables

4. Package Directory Structure

moltres/
    __init__.py
    config.py
    engine/
        connection.py
        execution.py
        dialects.py
    expressions/
        column.py
        expr.py
        functions.py
    logical/
        plan.py
        operators.py
    dataframe/
        dataframe.py
        groupby.py
    sql/
        compiler.py
        builders.py
    io/
        read.py
        write.py
    table/
        table.py
        mutations.py
    utils/
        exceptions.py
        typing.py
        inspector.py
    tests/
        ...

5. API Design

Connect to DB

from moltres import connect

db = connect("postgresql://user:pass@host/db")

Select & Filter

t = db.table("customers")

df = (
    t.select("id", "name", (col("spend") * 1.1).alias("adj_spend"))
     .where(col("active") == True)
     .order_by(col("created_at").desc())
)

df.show()

Joins

df = orders.join(customers, on="customer_id").select(
    customers["name"],
    orders["total"],
)

GroupBy & Aggregations

df = t.groupBy("country").agg(
    sum(col("spend")).alias("total_spend"),
    count("*").alias("n"),
)

Insert

t.insert(new_df)

Update

t.update(
    where=col("status") == "new",
    set={"status": lit("processed")}
)

Delete

t.delete(col("created_at") < "2024-01-01")

6. Internal Components

Expression System

  • Column expressions (col("spend") + 1, col("country").like("%US%"))

  • Functions (sum, avg, upper, concat, etc.)

Logical Plan Nodes

  • Project, Filter, Aggregate, Join, Limit, Sort, TableScan

SQL Compiler

  • Translates logical plan → SQL

  • Handles aliases, expressions, joins, groupings

  • Supports multiple SQL dialects

Execution Engine

  • Uses SQLAlchemy for connections

  • Executes SQL

  • Returns pandas/polars DataFrame

Mutation Engine

  • INSERT, UPDATE, DELETE

  • Returns row count or status

7. Development Roadmap

Week 1: Foundation

  • Project structure

  • Column, Literal, basic expressions

  • TableScan

  • Minimal DataFrame wrapper

Week 2: Logical Plan + Compiler

  • select / where / limit

  • Expression → SQL conversion

  • Execute queries → pandas

Week 3: Full Query Support

  • joins

  • groupBy / aggregates

  • orderBy

  • Dialect support (Postgres + SQLite)

Week 4: Mutation

  • insert DataFrame

  • update

  • delete

Week 5: Polars Integration

  • Convert query results to polars

  • Accept polars DataFrames for insert

Week 6: Stabilize API + Docs

  • Documentation (sphinx/mkdocs)

  • Examples

  • Benchmark suite

  • PyPI packaging

8. Testing Plan

  • Unit tests: expressions, logical plan, SQL compiler, mutation queries

  • Integration tests: SQLite + PostgreSQL

  • Performance tests: compare Sparklet vs raw SQL

9. Value Proposition

  • PySpark familiarity, SQL execution

  • No cluster required

  • Safer than writing dynamic SQL

  • Composable & testable

  • Wraps existing SQL databases

  • Lets analysts/engineers write PySpark-style code anywhere