Reduce the pain of code review and knowledge transfer with better SQL format and documentation.
Have you ever come across a complicated SQL query for analysis?
Did you struggle to understand the code itself and the business logic underneath?
I did. And sometimes, it was with my queries in the past!
To save time for everyone who read my code (including myself), I have tried to apply two templates to my query and found that they are very helpful to:
- Increase the quality of the code itself
- Reduce code review time
- And improve knowledge transfer
What you can expect
In this article, I will share with you these two templates that I use.
Template #1: Document context and assumptions in your SQL query
Template #2: Format SQL query
To demonstrate their usage, I will go through an example using MySQL to summarize sales before and after COVID-19.
I hope these templates come in handy! This is especially when remote work is our new normal after COVID-19 started, increasing the importance of over-communication to make sure everyone is on the same page.
Template #1: Document context and assumptions in SQL query
List this information before you write a query:
- Important business context for this query
- Expectations for query result
- Any assumptions made for the business logic and data
/* CONTEXT: - add a brief description of why we need this queryRESULT EXPECTATION - add a brief description of your expectations for the query result ASSUMPTION: - add assumption about business logic- add assumption about data */
2. Template in Action
Apply this template to our sales summary example:
/* CONTEXT: - Our company wants to understand if COVID has any impact on sales in stores around Chicago RESULT EXPECTATION: - This query returns total sales (in USD) for each of our stores in Chicago every month before and after COVID, starting from 2019-03-01 ASSUMPTION: - Dates before 2020-03-01 are considered "Before COVID"- Each transaction has a unique id, so we do not expect duplications in our transaction table - There are some spam transactions we have identified after COVID, so we will filter these out */
Having a brief description with business context, result expectation, and assumptions before our queries has many benefits:
- It keeps us focused on the main goal of the query while writing it
- It helps readers quickly establish a high-level understanding of the value and expectations of our queries
- It assists code reviewers to create initial tests for the queries based on the expectations
Remember, this step is an iterating process. We might have to go back and forth to improve the documentation as we write the query.
Template #2: Format SQL query
1. Template Rules
There are many rules out there to format the SQL query. To keep it simple, these are the main rules I follow:
- Highlight reserved keywords (eg. SELECT, WHERE) using upper-case
- Clearly show where a query or subquery starts and ends using indentation
- For long and complicated query, include a comment before any major sub-queries or joins for context
- Give reference for where columns come from with their source table or descriptive table alias
2. Template in Action
This query is for demonstration only
There are many benefits to readability with this query format. Here are some:
- Understand the overall structure of the query (ie. what columns are chosen, how many joins are there, which filters are applied)
- Save time to identify the start and end of subqueries for testing, because we can see the opening and closing parentheses of a subquery in the same vertical line
- Avoid getting lost in a complicated query with comments throughout the query
Combining Template #1 and Template #2
Combining these two templates in our example, we will have:
/* CONTEXT: - Our company wants to understand if COVID has any impact on sales in stores around Chicago RESULT EXPECTATION - This query returns total sales (in USD) for each of our stores in Chicago every month before and after COVID, starting from 2019-03-01 ASSUMPTION: - Dates before 2020-03-01 are considered "Before COVID" - Each transaction has a unique id, so we do not expect duplications in our transaction table - There are some spam transactions we have identified after COVID, so we will filter these out */ SELECT store_info.id, store_info.name AS store_name, DATE_FORMAT(transactions.date, "%Y-%m") AS transaction_month, SUM(transactions.total_amount) AS total_amount FROM transactions LEFT JOIN -- get all stores in Chicago ( SELECT id, name FROM stores WHERE city = 'Chicago' ) AS store_info ON transactions.branch_id = store_info.id WHERE transactions.date >= '2019-03-01' -- filter spam transactions AND transactions.id NOT IN ( SELECT id FROM spam_transactions ) GROUP BY store_info.id, store_info.name, DATE_FORMAT(transactions.date, "%Y-%m")
These templates are not the only templates out there. Finding out what works best for you and your team is a trial and error process.
After you finalize a formatting style, manually styling every query can be tiring. Many SQL IDE has an option to automate this process for you. However, if it still doesn’t meet your need, there are tools out there. Try searching with these keywords: “SQL formatter”
For Python users, I’ve been playing around with sqlparse. You can get more information here.