Query Optimization Resources

# Query Optimization Resources Last semester, I did a lot of reading on query optimization. These are some of my favorite resources I found in my journey. **Overview:** Query optimization mainly has about 3 algorithms - volcano, cascades, and system R. They were each created in the 90’s and 80’s, and the two former are much more efficient, with the latter still being used due to its simplicity. The two categories of optimizers are cost-based and rule-based. Cost-based generates query plans with costs associated with them. Table cardinality, data distribution, indexes, and join order can all affect how much a plan costs. The database evaluates the cost of various plans to choose a final one to execute. Generating things like data distribution histograms can involve overhead or complexity that some databases don't want. Rule-based optimizers don’t need to gather all these statistics to make plans - they just use heuristics and operate on information that can be gathered statically from the state of the database, and choose a rule based on that. A similar example is how a compiler will reduce expressions or inline expressions. “Heavier” databases will use cost-based, and smaller databases, such as in-memory stores, will use rule-based. There are exceptions of course; SQLite uses a cost based optimizer. ## Resources I would start here, Query Optimizer lecture series from the CMU database group. Andy Pavlo is an amazing lecturer. [https://www.youtube.com/watch?v=YWtH10gfcY0&list=PLSE8ODhjZXjYCZfIbmEWH7f6MnYqyPwCE](https://www.youtube.com/watch?v=YWtH10gfcY0&list=PLSE8ODhjZXjYCZfIbmEWH7f6MnYqyPwCE) Survey paper on the state of query optimizers. TMU is probably the best database research institution along with CMU. If you see “Thomas Neumann” on a paper, it will be a good read. [https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf](https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf) More applied paper on how the MySQL query optimizer is used. [https://openproceedings.org/2022/conf/edbt/paper-87.pdf](https://openproceedings.org/2022/conf/edbt/paper-87.pdf) Query engine that databricks uses for lakehouse data. [https://people.eecs.berkeley.edu/~matei/papers/2022/sigmod_photon.pdf](https://people.eecs.berkeley.edu/~matei/papers/2022/sigmod_photon.pdf) Article on optimizing SQL, not query optimizers [https://jnidzwetzki.github.io/2025/06/03/art-of-query-optimization.html](https://jnidzwetzki.github.io/2025/06/03/art-of-query-optimization.html) Implementation of a minimal query optimizer from DuckDB, great read! The source code of the DuckDB optimizer is also good to read if you want to see the code of an optimizer. [https://duckdb.org/2024/11/14/optimizers#filter-pull-up--filter-pushdown](https://duckdb.org/2024/11/14/optimizers#filter-pull-up--filter-pushdown) More of a spec, SQLite optimizer. [https://sqlite.org/optoverview.html](https://sqlite.org/optoverview.html) ANALYZE is ran to gather database statistics to help with query optimization. This is an article from Alex Miller on how to implement it. This guy has a lot of great articles for databases! [https://transactional.blog/sqlite/analyze](https://transactional.blog/sqlite/analyze) One of the most recent landmark papers in query optimization [https://arxiv.org/pdf/2510.05907](https://arxiv.org/pdf/2510.05907) Query space pruning in large-scale object storage - Snowflake [https://arxiv.org/pdf/2504.11540](https://arxiv.org/pdf/2504.11540) Update from earlier paper on “how good are query optimizers?” [https://www.vldb.org/pvldb/vol18/p5531-viktor.pdf](https://www.vldb.org/pvldb/vol18/p5531-viktor.pdf)