Moltres Package Plan

Archived

This document outlines an early package and architecture plan for Moltres. It is kept for maintainers and is not part of the primary user docs.

1. Name

Moltres – inspired by the legendary fire Pokémon, evoking speed, power, and a spark-like DataFrame API.

2. Project Goal

Provide a PySpark DataFrame API that executes lazily on real SQL databases, supporting select, joins, aggregations, inserts, updates, deletes, without needing Spark.

3. High-Level Architecture

Core Layers (Bottom → Top)

Expression System: Columns, literals, functions → produce a symbolic expression tree (like PySpark Column or Polars Expr)
Logical Plan Builder: select, join, filter, groupby → produce a logical plan tree (Project, Filter, Join, Aggregate, etc.)
SQL Compiler: Converts the logical plan into SQL (ANSI + dialect adaptations)
Execution Engine: Uses SQLAlchemy for DB connections, executes SQL, returns DataFrame (pandas/polars)
Mutation Engine: Supports insert, update, delete on underlying SQL tables

4. Package Directory Structure

moltres/
    __init__.py
    config.py
    engine/
        connection.py
        execution.py
        dialects.py
    expressions/
        column.py
        expr.py
        functions.py
    logical/
        plan.py
        operators.py
    dataframe/
        dataframe.py
        groupby.py
    sql/
        compiler.py
        builders.py
    io/
        read.py
        write.py
    table/
        table.py
        mutations.py
    utils/
        exceptions.py
        typing.py
        inspector.py
    tests/
        ...

5. API Design

Connect to DB

from moltres import connect

db = connect("postgresql://user:pass@host/db")

Select & Filter

t = db.table("customers")

df = (
    t.select("id", "name", (col("spend") * 1.1).alias("adj_spend"))
     .where(col("active") == True)
     .order_by(col("created_at").desc())
)

df.show()

Joins

df = orders.join(customers, on="customer_id").select(
    customers["name"],
    orders["total"],
)

GroupBy & Aggregations

df = t.groupBy("country").agg(
    sum(col("spend")).alias("total_spend"),
    count("*").alias("n"),
)

Insert

t.insert(new_df)

Update

t.update(
    where=col("status") == "new",
    set={"status": lit("processed")}
)

Delete

t.delete(col("created_at") < "2024-01-01")

6. Internal Components

Expression System

Column expressions (col("spend") + 1, col("country").like("%US%"))
Functions (sum, avg, upper, concat, etc.)

Logical Plan Nodes

Project, Filter, Aggregate, Join, Limit, Sort, TableScan

SQL Compiler

Translates logical plan → SQL
Handles aliases, expressions, joins, groupings
Supports multiple SQL dialects

Execution Engine

Uses SQLAlchemy for connections
Executes SQL
Returns pandas/polars DataFrame

Mutation Engine

INSERT, UPDATE, DELETE
Returns row count or status

7. Development Roadmap

Week 1: Foundation

Project structure
Column, Literal, basic expressions
TableScan
Minimal DataFrame wrapper

Week 2: Logical Plan + Compiler

select / where / limit
Expression → SQL conversion
Execute queries → pandas

Week 3: Full Query Support

joins
groupBy / aggregates
orderBy
Dialect support (Postgres + SQLite)

Week 4: Mutation

insert DataFrame
update
delete

Week 5: Polars Integration

Convert query results to polars
Accept polars DataFrames for insert

Week 6: Stabilize API + Docs

Documentation (sphinx/mkdocs)
Examples
Benchmark suite
PyPI packaging

8. Testing Plan

Unit tests: expressions, logical plan, SQL compiler, mutation queries
Integration tests: SQLite + PostgreSQL
Performance tests: compare Sparklet vs raw SQL

9. Value Proposition

PySpark familiarity, SQL execution
No cluster required
Safer than writing dynamic SQL
Composable & testable
Wraps existing SQL databases
Lets analysts/engineers write PySpark-style code anywhere