This is my note/workaround for the article “25 SQL tips to level up your data engineering skills” by Start Data Engineering, authored by Joseph Machado.

setup

Joseph provided a ready-to-go project regarding this advanced SQL transformation topics in this repo using DuckDB in Jupyter Notebook. We can either run it in the Github codespace or in our local machine with minimal setup.

We use the TPC-H database for demonstration:

The TPC-H Schema, source: TPC Benchmark H Standard Specification

#1 handy functions for data wrangling

#1.1 `QUALIFY`

We use QUALIFY to filter the output column of WINDOW function without creating more CTEs/Subqueries. It’s not available in many traditional/on-prem RDBMS (MySQL, Oracle, MSSQL, PostgreSQL) but available in mordern/cloud-based databases (Snowflake, BigQuery, Databricks, MS Fabric, Teradata).

Show the code

SELECT
    o_orderkey,
    o_totalprice,
    RANK() OVER (ORDER BY o_totalprice DESC) AS price_rank -- rank the order by `o_totalprice` in desc
FROM orders
QUALIFY price_rank <= 10; -- filter top 10
/* would need a CTE or Sub-query without QUALIFY */

#1.2 `DISTINCT ON`

Orginated from PostgreSQL, now support by some cloud-based like Snowflake but not in MySQL, SQL Server, Oracle, DISTINCT ON allows us to get 1 detailed row (first or last) for a particular partition.

Show the code

SELECT DISTINCT ON (o_custkey) -- distinct only `o_custkey`
    o_custkey,
    o_orderdate,
    o_totalprice
FROM orders
ORDER BY o_custkey, o_orderdate DESC; -- make sure to get the latest order by `o_orderdate`

#1.3 This can also be done with `ROW_NUMBER()` + `QUALIFY`:

Show the code

SELECT
    o_custkey,
    o_orderdate,
    o_totalprice,
    ROW_NUMBER() OVER (PARTITION BY o_custkey ORDER BY o_orderdate DESC) AS rn
FROM orders
QUALIFY rn = 1
ORDER BY o_custkey DESC; -- ROW_NUMBER() shuffle the order of the data

#1.4 `STRUCT_PACK()`

STRUCT_PACK() is a function primarily associated with DuckDB:

Creates a compact, binary representation of multiple values;
Allows you to pack different data types into a single column;
Useful for data compression and efficient storage.

Below picture depicts how struct works:

Show the code

WITH order_struct AS (
    SELECT
        o_orderkey,
        STRUCT_PACK(o_orderdate, o_totalprice, o_orderkey) AS order_info
    FROM orders
)
SELECT
    MIN(order_info) AS min_order_date, -- get min of information from left to right
    MAX(order_info) AS max_order_date_price -- get of information from left to right,
    -- if there are many txn in that latest day, get the transaction with max price
FROM order_struct;

#1.5 `BOOL_OR()` & `BOOL_AND()`

BOOL_OR() & BOOL_AND() allows you to check a logical statement along all rows of a columns, supported in PostgreSQL, Snowflake, DuckDB, BigQuery, Databricks:

Show the code

SELECT
    o_custkey,
    BOOL_OR(cast(o_shippriority as boolean)) AS has_atleast_one_priority_order, -- check whether AT LEAST 1 order of that customer has Is Priority = True
    BOOL_AND(cast(o_shippriority as boolean)) AS has_all_priority_order -- check whether ALL orders of that customer has Is Priority = True
FROM orders
GROUP BY o_custkey;

#1.6 `EXCLUDE()`

When you want to select all (*) columns excet few ones:

Show the code

SELECT * EXCLUDE (o_orderdate, o_totalprice)
FROM orders;

#1.7 `GROUP BY ALL` saves the day

Repeating all the columns listed in the SELECT statement in GROUP BY is annoying, just use ALL:

Show the code

SELECT
    o_orderkey,
    o_custkey,
    o_orderstatus,
    SUM(o_totalprice) AS total_price
FROM orders
GROUP BY ALL;

#1.8 `COUNT_IF()`

Filter over rows in specific column:

Show the code

SELECT
    o_custkey,
    COUNT_IF(o_totalprice > 100000) AS high_value_orders,
    -- equivalent to SUM(CASE WHEN o_totalprice > 100000 THEN 1 ELSE 0 END)
    COUNT(o_totalprice) as all_orders
FROM orders
GROUP BY o_custkey;

#1.9 `STRING_AGG()`

Concatenate rows of string in a GROUP BY statement:

Show the code

SELECT STRING_AGG(c_name, ', ') AS customer_names
FROM customer;

#1.10 Null handling with `COALESCE()`

Handling null value in a column with value from another column or default value:

Show the code

WITH fake_orders AS (
    SELECT 1 AS o_orderkey, 100 AS o_totalprice, NULL AS discount
    UNION ALL
    SELECT 2 AS o_orderkey, 200 AS o_totalprice, 20 AS discount
    UNION ALL
    SELECT 3 AS o_orderkey, 300 AS o_totalprice, NULL AS discount
)
SELECT
    o_orderkey,
    o_totalprice,
    discount,
    COALESCE(discount, o_totalprice * 0.10) AS final_discount
FROM fake_orders;

#1.11 `GENERATE_SERIES()`

This helps you generate a sequence/series of data over a range with an interval which facilitating data simulation or joining with other tables.

Show the code

SELECT *
FROM generate_series(1, 10);

Show the code

SELECT *
FROM generate_series('2024-01-01'::DATE, '2024-01-10'::DATE, INTERVAL 1 DAY);

#1.12 `UNNEST()`

Unpacking data wrapped in JSON array or list [...] using UNNEST():

Show the code

WITH nested_data AS (
    SELECT 1 AS id, [10, 20, 30] AS values
    UNION ALL
    SELECT 2 AS id, [40, 50] AS values
)
SELECT
    id,
    UNNEST(values) AS flattened_value
FROM nested_data;

#2 `SET` operations

#2.1 `EXISTS`

EXISTS is used to test for the existence of rows in a subquery, return TRUE if any row in the subquery returned. I think it’s often used for logic check to correlate queries.

It checks the existence like INNER JOIN but does not join full rows or select columns from the subquery.

Show the code

SELECT
    c_custkey,
    c_name
FROM customer
WHERE EXISTS (
    SELECT o_orderkey
    FROM orders
    WHERE 1=1
    -- AND o_totalprice > 5000000 -- Return customer who has at least 1 order with total price > 500k
    AND o_custkey = c_custkey  -- Return customer who has at least 1 order
);

#2.2 `INTERSECT`

Below query return all c_custkey that appears in both customer and orders table.

Show the code

SELECT c_custkey
FROM customer
INTERSECT
SELECT o_custkey
FROM orders;

#2.3 `EXCEPT`

And the below query return all c_custkey that appears customer but not in orders table.

Show the code

SELECT c_custkey
FROM customer
EXCEPT
SELECT o_custkey
FROM orders;

Using this, we can perform delta/data diff checking!

#3 macros

We can create UDF (User Define Function) or Macros in SQL for abstracting complex logic and reusing it in different parts of the query, this improves both readability and maintainability. When executing, the macros are expanded inline, which avoids the overhead of function calls. Below is a simple function/macro:

Show the code

CREATE MACRO percentage(numerator, denominator) AS (
    (CAST(numerator AS DOUBLE) / CAST(denominator AS DOUBLE)) * 100
);

#4 jinja2

Jinja is a fast, expressive, extensible templating engine.

In data engineering, Jinja2 is widely used in:

DBT (Data Build Tool)
- Template SQL transformations
- Parameterize database queries
- Dynamic model generation
Airflow
- Dynamic DAG (Directed Acyclic Graph) generation
- Parameterize task configurations
Great Expectations
- Configure data validation rules
- Generate dynamic expectations

Show the code

from jinja2 import Template

# Define a Jinja2 SQL template with a loop
sql_template = """
SELECT o_orderkey, o_custkey, o_totalprice
FROM orders
WHERE o_totalprice > {{ price_threshold }}
{% if customer_keys %}
  AND o_custkey IN (
    {% for custkey in customer_keys %}
      {{ custkey }}{% if not loop.last %}, {% endif %}
    {% endfor %}
  )
{% endif %}
ORDER BY o_totalprice DESC;
"""

# Render the template with dynamic parameters
template = Template(sql_template)
# Parameters to be passed to the template
params = {
    "price_threshold": 20000,
    "customer_keys": [1001, 1002, 1003]  # A list of customer keys to filter on
}
# Render the SQL query (do not execute, just generate SQL)
rendered_sql = template.render(params)

print(rendered_sql)

And the query be generated:

Show the code

SELECT o_orderkey, o_custkey, o_totalprice
FROM orders
WHERE o_totalprice > 20000

  AND o_custkey IN (

      1001,

      1002,

      1003

  )

ORDER BY o_totalprice DESC;

#5 metadata

The database documentation is normally stored in the information_schema, the first schema that created in DB:

Show the code

-- Information about our tables are stored here
SELECT schema_name,
    view_name
    FROM duckdb_views();

Dive into the table tables:

Show the code

SELECT
table_catalog, table_schema, table_name
from information_schema.tables;

Other useful query for DB admin in DuckDB:

Show the code

-- database level settings
SELECT * FROM duckdb_settings();

-- list of all tables in our DuckDB
SELECT schema_name,
    table_name
    FROM duckdb_tables();

#6 de-duplicate

We can use UPSERT (or MERGE INTO) - INsert new data and UPdate existing data. The syntaxes vary among DBs:

PostgreSQL: INSERT ... ON CONFLICT
MySQL: INSERT ... ON DUPLICATE KEY UPDATE
SQLite: INSERT OR REPLACE
SQL Server: MERGE statement
Oracle: MERGE statement

For eg., in DuckDB:

Show the code

DROP TABLE IF EXISTS dim_customer_scd2;
-- Create a Slowly Changing Dimension (SCD Type 2) table for customer
CREATE TABLE dim_customer_scd2 (
    c_custkey INTEGER PRIMARY KEY,
    c_name VARCHAR,
    c_address VARCHAR,
    c_nationkey INTEGER,
    c_phone VARCHAR,
    c_acctbal DOUBLE,
    c_mktsegment VARCHAR,
    c_comment VARCHAR,
    valid_from DATE,
    valid_to DATE,
    is_current BOOLEAN
);

-- Insert current data from the TPCH customer table into the SCD2 table
INSERT INTO dim_customer_scd2
SELECT
    c_custkey,
    c_name,
    c_address,
    c_nationkey,
    c_phone,
    c_acctbal,
    c_mktsegment,
    c_comment,
    '2024-10-17' AS valid_from,
    NULL AS valid_to,  -- NULL means it's the current active record
    TRUE AS is_current
FROM customer;

This query create a new SCD2 table and UPSERT data into it. When managing a table, we can handle (INSERT or UPDATE) like this:

Show the code

INSERT INTO dim_customer_scd2 (
    c_custkey,
    c_name,
    c_address,
    c_nationkey,
    c_phone,
    c_acctbal,
    c_mktsegment,
    c_comment,
    valid_from,
    valid_to,
    is_current
)
VALUES
    (1, 'Customer#000000001', 'New Address 1', 15, '25-989-741-2988', 711.56, 'BUILDING', 'comment1', '2024-10-18', NULL, TRUE),
    (2, 'Customer#000000002', 'New Address 2', 18, '12-423-790-3665', 879.49, 'FURNITURE', 'comment2', '2024-10-18', NULL, TRUE),
    (1501, 'Customer#000001501', 'New Address 1501', 24, '11-345-678-9012', 500.50, 'MACHINERY', 'comment1501', '2024-10-18', NULL, TRUE),
    (1502, 'Customer#000001502', 'New Address 1502', 21, '22-456-789-0123', 600.75, 'AUTOMOBILE', 'comment1502', '2024-10-18', NULL, TRUE)
ON CONFLICT (c_custkey) DO
-- Handle existing customers (Customer#000000001 and Customer#000000002) for SCD Type 2
UPDATE SET valid_to = EXCLUDED.valid_from, is_current = FALSE
WHERE dim_customer_scd2.c_custkey = EXCLUDED.c_custkey AND dim_customer_scd2.is_current = TRUE;

#7 `JOIN`s that you’ve never ever seen before

#7.1 `ASOF JOIN`

An ASOF JOIN (As-Of Join) is a time-series specific join operation that matches rows based on the closest timestamp, typically used in financial and time-series data analysis.

Key characteristics:
- Matches each row from one table to the most recent row in another table by time
- Useful for tracking historical state changes
- Common in financial databases and time-series analysis
Example use case:
- Joining stock prices with trading orders
- Matching historical prices at the nearest timestamp
- Tracking changes in reference data over time

For eg. “give me the value of the property as of this time”:

Show the code

WITH stock_prices AS (
    SELECT 'APPL' AS ticker, TIMESTAMP '2001-01-01 00:00:00' AS "when", 1 AS price
    UNION ALL
    SELECT 'APPL', TIMESTAMP '2001-01-01 00:01:00', 2
    UNION ALL
    SELECT 'APPL', TIMESTAMP '2001-01-01 00:02:00', 3
    UNION ALL
    SELECT 'MSFT', TIMESTAMP '2001-01-01 00:00:00', 1
    UNION ALL
    SELECT 'MSFT', TIMESTAMP '2001-01-01 00:01:00', 2
    UNION ALL
    SELECT 'MSFT', TIMESTAMP '2001-01-01 00:02:00', 3
    UNION ALL
    SELECT 'GOOG', TIMESTAMP '2001-01-01 00:00:00', 1
    UNION ALL
    SELECT 'GOOG', TIMESTAMP '2001-01-01 00:01:00', 2
    UNION ALL
    SELECT 'GOOG', TIMESTAMP '2001-01-01 00:02:00', 3
),
portfolio_holdings AS (
    SELECT 'APPL' AS ticker, TIMESTAMP '2000-12-31 23:59:30' AS "when", 5.16 AS shares
    UNION ALL
    SELECT 'APPL', TIMESTAMP '2001-01-01 00:00:30', 2.94
    UNION ALL
    SELECT 'APPL', TIMESTAMP '2001-01-01 00:01:30', 24.13
    UNION ALL
    SELECT 'GOOG', TIMESTAMP '2000-12-31 23:59:30', 9.33
    UNION ALL
    SELECT 'GOOG', TIMESTAMP '2001-01-01 00:00:30', 23.45
    UNION ALL
    SELECT 'GOOG', TIMESTAMP '2001-01-01 00:01:30', 10.58
    UNION ALL
    SELECT 'DATA', TIMESTAMP '2000-12-31 23:59:30', 6.65
    UNION ALL
    SELECT 'DATA', TIMESTAMP '2001-01-01 00:00:30', 17.95
    UNION ALL
    SELECT 'DATA', TIMESTAMP '2001-01-01 00:01:30', 18.37
)
SELECT h.ticker,
    h.when,
    p.when as stock_price_ts,
    price,
    shares,
    price * shares AS value
FROM portfolio_holdings h
ASOF JOIN stock_prices p
       ON h.ticker = p.ticker
      AND h.when >= p.when
ORDER BY 1, 2;

Without ASOF JOIN, we must use a WINDOW function to achieve this.

#7.2 `ANTI JOIN`

The pattern is to get rows in one table that are not in another table:

Show the code

SELECT c.c_custkey
FROM customer c
LEFT JOIN orders o
ON c.c_custkey = o.o_custkey
WHERE o.o_custkey IS NULL
ORDER BY c.c_custkey
LIMIT 5;

Some DBs have native support for ANTI JOIN:

Show the code

SELECT c.c_custkey
FROM customer c
ANTI JOIN orders o
ON c.c_custkey = o.o_custkey
ORDER BY c.c_custkey
LIMIT 5;

#7.3 `LATERAL JOIN`

LATERAL JOIN is a powerful SQL join type that allows a subquery in the FROM clause to reference columns from preceding tables in the same FROM clause.

Enables per-row dynamic subqueries
Allows correlated subqueries in the FROM clause
Supported in PostgreSQL, BigQuery, some other modern databases

For e.g, for each row in the orders table (o), the subquery in the LATERAL JOIN selects line items (l) that match certain conditions.

Show the code

SELECT
    o.o_orderkey,
    o.o_totalprice,
    l.l_linenumber,
    l.l_extendedprice
FROM orders o,
LATERAL (
    SELECT l.l_linenumber,
    l_extendedprice
    FROM lineitem l
    WHERE l.l_orderkey = o.o_orderkey
    AND l.l_linenumber <= 2
    AND l.l_extendedprice < (o.o_totalprice / 2)
) AS l
ORDER BY 1, 3;

For each row in the orders table (o), the subquery in the LATERAL JOIN counts the number of line items (lineitem_count) related to that order.

Show the code

SELECT
    o.o_orderkey,
    o.o_totalprice,
    l.lineitem_count
FROM orders o,
LATERAL (
    SELECT COUNT(*) AS lineitem_count
    FROM lineitem l
    WHERE l.l_orderkey = o.o_orderkey
) AS l;

#8 use cases

Some real business use cases:

#8.1 `PIVOT`

We often want to change a value of a column into individual columns, we can use CASE ... WHEN ...:

Show the code

SELECT
    o_custkey,
    SUM(CASE WHEN o_orderstatus = 'F' THEN o_totalprice ELSE 0 END) AS fulfilled_total,
    SUM(CASE WHEN o_orderstatus = 'O' THEN o_totalprice ELSE 0 END) AS open_total,
    SUM(CASE WHEN o_orderstatus = 'P' THEN o_totalprice ELSE 0 END) AS pending_total
FROM orders
GROUP BY o_custkey
ORDER BY o_custkey;

But also leverage PIVOT:

Show the code

FROM orders
PIVOT (
    sum(o_totalprice)
    FOR
        o_orderstatus IN ('F', 'O', 'P')
    GROUP BY o_custkey
)
    ORDER BY o_custkey;

#8.1 `CUBE`

Show the code

SELECT
    o_orderpriority,
    o_orderstatus,
    EXTRACT(YEAR FROM o_orderdate) AS order_year,
    SUM(o_totalprice) AS total_sales
FROM orders
GROUP BY CUBE (o_orderpriority, o_orderstatus, order_year)
ORDER BY 1,2,3;

CUBE (o_custkey, o_orderstatus, order_year): This will group the data and calculate subtotals and grand totals for all possible combinations of o_custkey, o_orderstatus, and order_year.
It will generate all combinations of the grouping columns, including:
- Grouping by just o_custkey
- Grouping by just o_orderstatus
- Grouping by just order_year
- Grouping by all three together
- Grouping by pairs of columns
- A grand total (no grouping by any column)
- SUM(o_totalprice): For each combination of groupings, it sums the total order price.

Use cases:

OLAP Reporting: CUBE is commonly used in OLAP scenarios where you need to analyze data from multiple perspectives. For instance, you may want to generate reports that show total sales by customer, by order status, by year, and all possible combinations of these dimensions.

Sales Analysis: In sales analysis, CUBE can help create pivot-like summaries that show how different attributes (e.g., region, product, time period) contribute to the overall sales.

Financial Reports: Financial departments often use CUBE to calculate totals and subtotals across dimensions like departments, time periods, and account categories, making it easier to prepare comprehensive financial reports.

Happy learning!

resource

Article by Start Data Engineering: https://www.startdataengineering.com/post/n-sql-tips-de/;
Repo and workbook: …concepts/sql_tips/sql_tips.ipynb

Show the code

cat("I added this R code chunk and somehow knitr engine successfully rendered SQL formatting block code.\nWorkaround: https://github.com/quarto-dev/quarto-cli/issues/2137")

I added this R code chunk and somehow knitr engine successfully rendered SQL formatting block code.
Workaround: https://github.com/quarto-dev/quarto-cli/issues/2137

--- title: "Advanced SQL" description: "SQL things that I wish I could know earlier." author: - name: "Tuan Le Khac" url: https://lktuan.github.io/ categories: [til, sql] date: 12-06-2024 date-modified: 12-06-2024 image: duckdb.jpg draft: false format: html: engine: knitr code-overflow: wrap code-tools: true code-fold: show code-annotations: hover execute: eval: false --- This is my note/workaround for the article "[25 SQL tips to level up your data engineering skills](https://www.startdataengineering.com/post/n-sql-tips-de/)" by Start Data Engineering, authored by [Joseph Machado](https://github.com/josephmachado). # setup Joseph provided a ready-to-go project regarding this advanced SQL transformation topics in this [repo](https://github.com/josephmachado/adv_data_transformation_in_sql) using **DuckDB** in **Jupyter Notebook**. We can either run it in the Github codespace or in our local machine with minimal setup. We use the TPC-H database for demonstration: ![The TPC-H Schema, source: [TPC Benchmark H Standard Specification](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf)](sample-data-tpch-schema.png) # #1 handy functions for data wrangling ## #1.1 `QUALIFY` We use `QUALIFY` to filter the output column of `WINDOW` function without creating more CTEs/Subqueries. It's not available in many traditional/on-prem RDBMS (MySQL, Oracle, MSSQL, PostgreSQL) but available in mordern/cloud-based databases (Snowflake, BigQuery, Databricks, MS Fabric, Teradata). ```{sql} SELECT o_orderkey, o_totalprice, RANK() OVER (ORDER BY o_totalprice DESC) AS price_rank -- rank the order by `o_totalprice` in desc FROM orders QUALIFY price_rank <= 10; -- filter top 10 /* would need a CTE or Sub-query without QUALIFY */ ``` ![output - first 10 rows](1_qualify.png) ## #1.2 `DISTINCT ON` Orginated from PostgreSQL, now support by some cloud-based like Snowflake but not in MySQL, SQL Server, Oracle, `DISTINCT ON` allows us to get 1 detailed row (first or last) for a particular partition. ```{sql} SELECT DISTINCT ON (o_custkey) -- distinct only `o_custkey` o_custkey, o_orderdate, o_totalprice FROM orders ORDER BY o_custkey, o_orderdate DESC; -- make sure to get the latest order by `o_orderdate` ``` ![output - first 10 rows](2_distinct_on.png) ## #1.3 This can also be done with `ROW_NUMBER()` + `QUALIFY`: ```{sql} SELECT o_custkey, o_orderdate, o_totalprice, ROW_NUMBER() OVER (PARTITION BY o_custkey ORDER BY o_orderdate DESC) AS rn FROM orders QUALIFY rn = 1 ORDER BY o_custkey DESC; -- ROW_NUMBER() shuffle the order of the data ``` ![output - first 10 rows](3_rn_qualify.png) ## #1.4 `STRUCT_PACK()` `STRUCT_PACK()` is a function primarily associated with DuckDB: - Creates a compact, binary representation of multiple values; - Allows you to pack different data types into a single column; - Useful for data compression and efficient storage. Below picture depicts how struct works: ![struct packing](struck_pack.png) ```{sql} WITH order_struct AS ( SELECT o_orderkey, STRUCT_PACK(o_orderdate, o_totalprice, o_orderkey) AS order_info FROM orders ) SELECT MIN(order_info) AS min_order_date, -- get min of information from left to right MAX(order_info) AS max_order_date_price -- get of information from left to right, -- if there are many txn in that latest day, get the transaction with max price FROM order_struct; ``` ![output](4_struct.png) ## #1.5 `BOOL_OR()` & `BOOL_AND()` `BOOL_OR()` & `BOOL_AND()` allows you to check a logical statement along all rows of a columns, supported in PostgreSQL, Snowflake, DuckDB, BigQuery, Databricks: ```{sql} SELECT o_custkey, BOOL_OR(cast(o_shippriority as boolean)) AS has_atleast_one_priority_order, -- check whether AT LEAST 1 order of that customer has Is Priority = True BOOL_AND(cast(o_shippriority as boolean)) AS has_all_priority_order -- check whether ALL orders of that customer has Is Priority = True FROM orders GROUP BY o_custkey; ``` ![output - first 10 rows](5_bool_or_and.png) ## #1.6 `EXCLUDE()` When you want to select all (`*`) columns excet few ones: ```{sql} SELECT * EXCLUDE (o_orderdate, o_totalprice) FROM orders; ``` ## #1.7 `GROUP BY ALL` saves the day Repeating all the columns listed in the `SELECT` statement in `GROUP BY` is annoying, just use `ALL`: ```{sql} SELECT o_orderkey, o_custkey, o_orderstatus, SUM(o_totalprice) AS total_price FROM orders GROUP BY ALL; ``` ## #1.8 `COUNT_IF()` Filter over rows in specific column: ```{sql} SELECT o_custkey, COUNT_IF(o_totalprice > 100000) AS high_value_orders, -- equivalent to SUM(CASE WHEN o_totalprice > 100000 THEN 1 ELSE 0 END) COUNT(o_totalprice) as all_orders FROM orders GROUP BY o_custkey; ``` ![output - first 10 rows](8_count_if.png) ## #1.9 `STRING_AGG()` Concatenate rows of string in a `GROUP BY` statement: ```{sql} SELECT STRING_AGG(c_name, ', ') AS customer_names FROM customer; ``` ## #1.10 Null handling with `COALESCE()` Handling null value in a column with value from another column or default value: ```{sql} WITH fake_orders AS ( SELECT 1 AS o_orderkey, 100 AS o_totalprice, NULL AS discount UNION ALL SELECT 2 AS o_orderkey, 200 AS o_totalprice, 20 AS discount UNION ALL SELECT 3 AS o_orderkey, 300 AS o_totalprice, NULL AS discount ) SELECT o_orderkey, o_totalprice, discount, COALESCE(discount, o_totalprice * 0.10) AS final_discount FROM fake_orders; ``` ![output](10_coalesce.png) ## #1.11 `GENERATE_SERIES()` This helps you generate a sequence/series of data over a range with an interval which facilitating data simulation or joining with other tables. ```{sql} SELECT * FROM generate_series(1, 10); ``` ![output](11a_gen.png) ```{sql} SELECT * FROM generate_series('2024-01-01'::DATE, '2024-01-10'::DATE, INTERVAL 1 DAY); ``` ![output](11b_gen.png) ## #1.12 `UNNEST()` Unpacking data wrapped in JSON array or list `[...]` using `UNNEST()`: ```{sql} WITH nested_data AS ( SELECT 1 AS id, [10, 20, 30] AS values UNION ALL SELECT 2 AS id, [40, 50] AS values ) SELECT id, UNNEST(values) AS flattened_value FROM nested_data; ``` ![output](12_unnest.png) # #2 `SET` operations ## #2.1 `EXISTS` `EXISTS` is used to test for the existence of rows in a subquery, return TRUE if any row in the subquery returned. I think it's often used for logic check to correlate queries. It checks the existence like `INNER JOIN` but does not join full rows or select columns from the subquery. ```{sql} SELECT c_custkey, c_name FROM customer WHERE EXISTS ( SELECT o_orderkey FROM orders WHERE 1=1 -- AND o_totalprice > 5000000 -- Return customer who has at least 1 order with total price > 500k AND o_custkey = c_custkey -- Return customer who has at least 1 order ); ``` ## #2.2 `INTERSECT` Below query return all `c_custkey` that appears in both `customer` and `orders` table. ```{sql} SELECT c_custkey FROM customer INTERSECT SELECT o_custkey FROM orders; ``` ## #2.3 `EXCEPT` And the below query return all `c_custkey` that appears `customer` **but not in** `orders` table. ```{sql} SELECT c_custkey FROM customer EXCEPT SELECT o_custkey FROM orders; ``` Using this, we can perform delta/data diff checking! # #3 macros We can create UDF (User Define Function) or Macros in SQL for abstracting complex logic and reusing it in different parts of the query, this improves both readability and maintainability. When executing, the macros are expanded inline, which avoids the overhead of function calls. Below is a simple function/macro: ```{sql} CREATE MACRO percentage(numerator, denominator) AS ( (CAST(numerator AS DOUBLE) / CAST(denominator AS DOUBLE)) * 100 ); ``` # #4 jinja2 > [Jinja](https://pypi.org/project/Jinja2/) is a fast, expressive, extensible templating engine. In data engineering, Jinja2 is widely used in: + DBT (Data Build Tool) - Template SQL transformations - Parameterize database queries - Dynamic model generation + Airflow - Dynamic DAG (Directed Acyclic Graph) generation - Parameterize task configurations + Great Expectations - Configure data validation rules - Generate dynamic expectations ```{python} #| eval: false from jinja2 import Template # Define a Jinja2 SQL template with a loop sql_template = """ SELECT o_orderkey, o_custkey, o_totalprice FROM orders WHERE o_totalprice > {{ price_threshold }} {% if customer_keys %} AND o_custkey IN ( {% for custkey in customer_keys %} {{ custkey }}{% if not loop.last %}, {% endif %} {% endfor %} ) {% endif %} ORDER BY o_totalprice DESC; """ # Render the template with dynamic parameters template = Template(sql_template) # Parameters to be passed to the template params = { "price_threshold": 20000, "customer_keys": [1001, 1002, 1003] # A list of customer keys to filter on } # Render the SQL query (do not execute, just generate SQL) rendered_sql = template.render(params) print(rendered_sql) ``` And the query be generated: ```{sql} SELECT o_orderkey, o_custkey, o_totalprice FROM orders WHERE o_totalprice > 20000 AND o_custkey IN ( 1001, 1002, 1003 ) ORDER BY o_totalprice DESC; ``` # #5 metadata The database documentation is normally stored in the `information_schema`, the first schema that created in DB: ```{sql} -- Information about our tables are stored here SELECT schema_name, view_name FROM duckdb_views(); ``` ![output - first 10 rows](information_schema.png) Dive into the table `tables`: ```{sql} SELECT table_catalog, table_schema, table_name from information_schema.tables; ``` ![output - first 10 rows](information_schema_table.png) Other useful query for DB admin in DuckDB: ```{sql} -- database level settings SELECT * FROM duckdb_settings(); -- list of all tables in our DuckDB SELECT schema_name, table_name FROM duckdb_tables(); ``` # #6 de-duplicate We can use UPSERT (or MERGE INTO) - INsert new data and UPdate existing data. The syntaxes vary among DBs: - PostgreSQL: `INSERT ... ON CONFLICT` - MySQL: `INSERT ... ON DUPLICATE KEY UPDATE` - SQLite: `INSERT OR REPLACE` - SQL Server: `MERGE` statement - Oracle: `MERGE` statement For eg., in DuckDB: ```{sql} DROP TABLE IF EXISTS dim_customer_scd2; -- Create a Slowly Changing Dimension (SCD Type 2) table for customer CREATE TABLE dim_customer_scd2 ( c_custkey INTEGER PRIMARY KEY, c_name VARCHAR, c_address VARCHAR, c_nationkey INTEGER, c_phone VARCHAR, c_acctbal DOUBLE, c_mktsegment VARCHAR, c_comment VARCHAR, valid_from DATE, valid_to DATE, is_current BOOLEAN ); -- Insert current data from the TPCH customer table into the SCD2 table INSERT INTO dim_customer_scd2 SELECT c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment, '2024-10-17' AS valid_from, NULL AS valid_to, -- NULL means it's the current active record TRUE AS is_current FROM customer; ``` This query create a new SCD2 table and UPSERT data into it. When managing a table, we can handle (INSERT or UPDATE) like this: ```{sql} INSERT INTO dim_customer_scd2 ( c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment, valid_from, valid_to, is_current ) VALUES (1, 'Customer#000000001', 'New Address 1', 15, '25-989-741-2988', 711.56, 'BUILDING', 'comment1', '2024-10-18', NULL, TRUE), (2, 'Customer#000000002', 'New Address 2', 18, '12-423-790-3665', 879.49, 'FURNITURE', 'comment2', '2024-10-18', NULL, TRUE), (1501, 'Customer#000001501', 'New Address 1501', 24, '11-345-678-9012', 500.50, 'MACHINERY', 'comment1501', '2024-10-18', NULL, TRUE), (1502, 'Customer#000001502', 'New Address 1502', 21, '22-456-789-0123', 600.75, 'AUTOMOBILE', 'comment1502', '2024-10-18', NULL, TRUE) ON CONFLICT (c_custkey) DO -- Handle existing customers (Customer#000000001 and Customer#000000002) for SCD Type 2 UPDATE SET valid_to = EXCLUDED.valid_from, is_current = FALSE WHERE dim_customer_scd2.c_custkey = EXCLUDED.c_custkey AND dim_customer_scd2.is_current = TRUE; ``` # #7 `JOIN`s that you've never ever seen before ## #7.1 `ASOF JOIN` An ASOF JOIN (As-Of Join) is a time-series specific join operation that matches rows based on the closest timestamp, typically used in financial and time-series data analysis. + Key characteristics: - Matches each row from one table to the most recent row in another table by time - Useful for tracking historical state changes - Common in financial databases and time-series analysis + Example use case: - Joining stock prices with trading orders - Matching historical prices at the nearest timestamp - Tracking changes in reference data over time For eg. "**give me the value of the property as of this time**": ```{sql} WITH stock_prices AS ( SELECT 'APPL' AS ticker, TIMESTAMP '2001-01-01 00:00:00' AS "when", 1 AS price UNION ALL SELECT 'APPL', TIMESTAMP '2001-01-01 00:01:00', 2 UNION ALL SELECT 'APPL', TIMESTAMP '2001-01-01 00:02:00', 3 UNION ALL SELECT 'MSFT', TIMESTAMP '2001-01-01 00:00:00', 1 UNION ALL SELECT 'MSFT', TIMESTAMP '2001-01-01 00:01:00', 2 UNION ALL SELECT 'MSFT', TIMESTAMP '2001-01-01 00:02:00', 3 UNION ALL SELECT 'GOOG', TIMESTAMP '2001-01-01 00:00:00', 1 UNION ALL SELECT 'GOOG', TIMESTAMP '2001-01-01 00:01:00', 2 UNION ALL SELECT 'GOOG', TIMESTAMP '2001-01-01 00:02:00', 3 ), portfolio_holdings AS ( SELECT 'APPL' AS ticker, TIMESTAMP '2000-12-31 23:59:30' AS "when", 5.16 AS shares UNION ALL SELECT 'APPL', TIMESTAMP '2001-01-01 00:00:30', 2.94 UNION ALL SELECT 'APPL', TIMESTAMP '2001-01-01 00:01:30', 24.13 UNION ALL SELECT 'GOOG', TIMESTAMP '2000-12-31 23:59:30', 9.33 UNION ALL SELECT 'GOOG', TIMESTAMP '2001-01-01 00:00:30', 23.45 UNION ALL SELECT 'GOOG', TIMESTAMP '2001-01-01 00:01:30', 10.58 UNION ALL SELECT 'DATA', TIMESTAMP '2000-12-31 23:59:30', 6.65 UNION ALL SELECT 'DATA', TIMESTAMP '2001-01-01 00:00:30', 17.95 UNION ALL SELECT 'DATA', TIMESTAMP '2001-01-01 00:01:30', 18.37 ) SELECT h.ticker, h.when, p.when as stock_price_ts, price, shares, price * shares AS value FROM portfolio_holdings h ASOF JOIN stock_prices p ON h.ticker = p.ticker AND h.when >= p.when ORDER BY 1, 2; ``` ![output](asof_join.png) Without `ASOF JOIN`, we must use a WINDOW function to achieve this. ## #7.2 `ANTI JOIN` The pattern is to get rows in one table that are not in another table: ```{sql} SELECT c.c_custkey FROM customer c LEFT JOIN orders o ON c.c_custkey = o.o_custkey WHERE o.o_custkey IS NULL ORDER BY c.c_custkey LIMIT 5; ``` Some DBs have native support for `ANTI JOIN`: ```{sql} SELECT c.c_custkey FROM customer c ANTI JOIN orders o ON c.c_custkey = o.o_custkey ORDER BY c.c_custkey LIMIT 5; ``` ## #7.3 `LATERAL JOIN` LATERAL JOIN is a powerful SQL join type that allows a subquery in the FROM clause to reference columns from preceding tables in the same FROM clause. - Enables per-row dynamic subqueries - Allows correlated subqueries in the FROM clause - Supported in PostgreSQL, BigQuery, some other modern databases For e.g, for each row in the orders table (`o`), the subquery in the LATERAL JOIN selects line items (`l`) that match certain conditions. ```{sql} SELECT o.o_orderkey, o.o_totalprice, l.l_linenumber, l.l_extendedprice FROM orders o, LATERAL ( SELECT l.l_linenumber, l_extendedprice FROM lineitem l WHERE l.l_orderkey = o.o_orderkey AND l.l_linenumber <= 2 AND l.l_extendedprice < (o.o_totalprice / 2) ) AS l ORDER BY 1, 3; ``` ![output - first 10 rows](lateral_join_1.png) For each row in the orders table (`o`), the subquery in the LATERAL JOIN counts the number of line items (`lineitem_count`) related to that order. ```{sql} SELECT o.o_orderkey, o.o_totalprice, l.lineitem_count FROM orders o, LATERAL ( SELECT COUNT(*) AS lineitem_count FROM lineitem l WHERE l.l_orderkey = o.o_orderkey ) AS l; ``` ![output - first 10 rows](lateral_join_2.png) # #8 use cases Some real business use cases: ## #8.1 `PIVOT` We often want to change a value of a column into individual columns, we can use `CASE ... WHEN ...`: ```{sql} SELECT o_custkey, SUM(CASE WHEN o_orderstatus = 'F' THEN o_totalprice ELSE 0 END) AS fulfilled_total, SUM(CASE WHEN o_orderstatus = 'O' THEN o_totalprice ELSE 0 END) AS open_total, SUM(CASE WHEN o_orderstatus = 'P' THEN o_totalprice ELSE 0 END) AS pending_total FROM orders GROUP BY o_custkey ORDER BY o_custkey; ``` But also leverage `PIVOT`: ```{sql} FROM orders PIVOT ( sum(o_totalprice) FOR o_orderstatus IN ('F', 'O', 'P') GROUP BY o_custkey ) ORDER BY o_custkey; ``` ![output - first 10 rows](pivot.png) ## #8.1 `CUBE` ```{sql} SELECT o_orderpriority, o_orderstatus, EXTRACT(YEAR FROM o_orderdate) AS order_year, SUM(o_totalprice) AS total_sales FROM orders GROUP BY CUBE (o_orderpriority, o_orderstatus, order_year) ORDER BY 1,2,3; ``` ![output - first 10 rows](cube.png) * `CUBE (o_custkey, o_orderstatus, order_year)`: This will group the data and calculate subtotals and grand totals for all possible combinations of `o_custkey`, `o_orderstatus`, and `order_year`. * It will generate all combinations of the grouping columns, including: - Grouping by just o_custkey - Grouping by just o_orderstatus - Grouping by just order_year - Grouping by all three together - Grouping by pairs of columns - A grand total (no grouping by any column) - `SUM(o_totalprice)`: For each combination of groupings, it sums the total order price. Use cases: > 1. OLAP Reporting: CUBE is commonly used in OLAP scenarios where you need to analyze data from multiple perspectives. For instance, you may want to generate reports that show total sales by customer, by order status, by year, and all possible combinations of these dimensions. > 2. Sales Analysis: In sales analysis, CUBE can help create pivot-like summaries that show how different attributes (e.g., region, product, time period) contribute to the overall sales. > 3. Financial Reports: Financial departments often use CUBE to calculate totals and subtotals across dimensions like departments, time periods, and account categories, making it easier to prepare comprehensive financial reports. Happy learning! # resource 1. Article by Start Data Engineering: <https://www.startdataengineering.com/post/n-sql-tips-de/>; 2. Repo and workbook: [...concepts/sql_tips/sql_tips.ipynb](https://github.com/josephmachado/adv_data_transformation_in_sql/blob/main/concepts/sql_tips/sql_tips.ipynb) ```{r} #| eval: true #| code-fold: true cat("I added this R code chunk and somehow knitr engine successfully rendered SQL formatting block code.\nWorkaround: https://github.com/quarto-dev/quarto-cli/issues/2137") ```

setup

#1 handy functions for data wrangling

#1.1 QUALIFY

#1.2 DISTINCT ON

#1.3 This can also be done with ROW_NUMBER() + QUALIFY:

#1.4 STRUCT_PACK()

#1.5 BOOL_OR() & BOOL_AND()

#1.6 EXCLUDE()

#1.7 GROUP BY ALL saves the day

#1.8 COUNT_IF()

#1.9 STRING_AGG()

#1.10 Null handling with COALESCE()

#1.11 GENERATE_SERIES()

#1.12 UNNEST()

#2 SET operations

#2.1 EXISTS

#2.2 INTERSECT

#2.3 EXCEPT

#3 macros

#4 jinja2

#5 metadata

#6 de-duplicate

#7 JOINs that you’ve never ever seen before

#7.1 ASOF JOIN

#7.2 ANTI JOIN

#7.3 LATERAL JOIN

#8 use cases

#8.1 PIVOT

#8.1 CUBE

resource

#1.1 `QUALIFY`

#1.2 `DISTINCT ON`

#1.3 This can also be done with `ROW_NUMBER()` + `QUALIFY`:

#1.4 `STRUCT_PACK()`

#1.5 `BOOL_OR()` & `BOOL_AND()`

#1.6 `EXCLUDE()`

#1.7 `GROUP BY ALL` saves the day

#1.8 `COUNT_IF()`

#1.9 `STRING_AGG()`

#1.10 Null handling with `COALESCE()`

#1.11 `GENERATE_SERIES()`

#1.12 `UNNEST()`

#2 `SET` operations

#2.1 `EXISTS`

#2.2 `INTERSECT`

#2.3 `EXCEPT`

#7 `JOIN`s that you’ve never ever seen before

#7.1 `ASOF JOIN`

#7.2 `ANTI JOIN`

#7.3 `LATERAL JOIN`

#8.1 `PIVOT`

#8.1 `CUBE`