Welcome to new things

[Technical] [Electronic work] [Gadget] [Game] memo writing

How to remember SQL window functions (analysis functions)

When I started learning SQL, the window functions (analytic functions) were the most difficult to understand.

Although you do not need to use the window function when using a database simply as a place to store and retrieve data, it is a very useful function when you want to analyze the data.

It is possible to write a similar query without using the window function, but the query can often be written shorter with the window function.

I have not completely mastered it myself, but here I would like to summarize some rough ideas about window functions.

What you can't do with regular SQL

In normal SQL, FROM specifies the target table from which data is to be retrieved, a row of records is extracted, and SELECT performs an operation using the data in that single row of records and outputs the results.

As a sample, let's look at the daily corona new cases data by city.

SELECT
  dt,
  city,
  cases
FROM tbl_covid
dt city cases
2022-01-01 Tokyo 86
2022-01-01 Osaka 70
... ... ...

The data output in each row is only what is contained in the records of one row retrieved, and data from other rows cannot be used.

For example, it is not possible to obtain today's number of people and yesterday's number of people and then produce the difference.

How to use data from other rows

However, there are cases where we want to use data from other rows.

For example, when you want to determine how much more or less than the average of all data.

In such cases, this can be done by calling a further query (subquery) within the query.

SELECT
  tbl_base.dt,
  tbl_base.city,
  tbl_base.cases -
    (
      SELECT
        AVG(tbl_sub.cases)
      FROM tbl AS tbl_sub
    )
  AS diff_avg_cases
FROM tbl AS tbl_base

This can be written using the window function as follows

SELECT
  tbl_base.dt,
  tbl_base.city,
  tbl_base.cases -
    (
      AVG(tbl_base.cases) OVER()
    )
  AS diff_avg_cases
FROM tbl AS tbl_base

Aggregate function" + "OVER()" is a window function.

Window function image

Here is a diagram of how the window function is called.

  1. Target table is specified from FROM
  2. A single row of records is retrieved from the table
  3. The column values are taken from the record, computed with SELECT and output
  4. The window function in SELECT calls the target table again
  5. The data of the target range (window) is retrieved from the table
  6. The data in the range is calculated with the aggregate function, and the resulting value is passed to SELECT.

It is called a window function because it operates on a target range (window) from a target table.

Also, a window function is an aggregate function that performs an operation on a set of rows, and the result of the operation is always a single value.

The first SELECT extracts a single row of records, whereas the window function extracts multiple records as a window, which is very different.

PARTITION BY

The PARTITION BY can be used to specify a range column for the window function.

For example, in the above example, we obtained the average of all data in the table, but to find the difference from the average of all cities on the same day, we can aggregate the data by fixing dt with PARTITION BY.

SELECT
  tbl_base.dt,
  tbl_base.city,
  tbl_base.cases -
    AVG(tbl_base.cases) OVER(
      PARTITION BY tbl_base.dt
    )
  AS diff_avg_cases
FROM tbl AS tbl_base

If you want to find the difference from the average of other days in the city, instead of the average of other cities, you can use PARTITION BY and fix city to compute the difference.

SELECT
  tbl_base.dt,
  tbl_base.city,
  tbl_base.cases -
    AVG(tbl_base.cases) OVER(
      PARTITION BY tbl_base.city
    )
  AS diff_avg_cases
FROM tbl AS tbl_base

Frame and ORDER BY

As mentioned above, PARTITION BY determines the range of the window function's table.

Then, using "frame," you can specify the start and end position within that range, further narrowing down the target range.

For example, the 7-day moving average for each city, often seen in the news, is the query below.

SELECT
  tbl_base.dt,
  tbl_base.city,
  AVG(tbl_base.cases) OVER(
    PARTITION BY tbl_base.city
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS mov_avg_7d
FROM tbl AS tbl_base

Specify ROWS BETWEEN followed by "start line" and "end line".

The 6 PRECEDING indicates six rows before the current row, i.e., six days before the current row, and CURRENT ROW indicates the current row, and the average of these rows is taken, so a 7-day average is tabulated.

Actually, the query is wrong: to get a 7-day moving average on the 6 previous and current rows, the data must be sorted in ascending order by date.

Then, ORDER BY is used to specify the order of the data.

Based on the above, the correct query is as follows

SELECT
  tbl_base.dt,
  tbl_base.city,
  AVG(tbl_base.cases) OVER(
    PARTITION BY tbl_base.city
    ORDER BY tbl_base.dt ASC
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS mov_avg_7d
FROM tbl AS tbl_base

Window Image

The figure below illustrates how the window function is called and the target range is determined.

  1. Groups are fixed with PARTITION BY
  2. The ORDER BY determines the sequence of data.
  3. The ROWS BETWEEN determines the start and end rows of the range.

The structure is as follows: PARTITION BY specifies the column, ROWS BETWEEN specifies the start and end, and ORDER BY is required to specify the start and end of the column.

summary

Again, the window function has the following flow

  1. Call the table again from within SELECT
  2. Retrieve a specified range of data from a table
  3. The retrieved data is passed to an aggregate function for calculation.

If you keep in mind that "the table is called again" and "the aggregate function is executed on the retrieved record group," you should be able to get a general idea of the window function.

application

There are many ways to specify the start and end position of a tally function and row, so it is a good idea to take a quick look at what is available.

For reference, examples of common uses of window functions are listed below.

Cumulative number of cases since the beginning of the month, by each city

SELECT
  tbl_base.*,
  SUM(tbl_base.cases) OVER(
    PARTITION BY tbl_base.city, DATE_TRUNC(tbl_base.dt, MONTH)
    ORDER BY tbl_base.dt ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS val
FROM tbl AS tbl_base
  • The ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW specifies from the beginning of the group to the current line.

line number

SELECT
  tbl_base.*,
  ROW_NUMBER() OVER(
    ORDER BY tbl_base.dt ASC, tbl_base.city ASC
  ) AS val
FROM tbl AS tbl_base

Previous value

SELECT
  tbl_base.*,
  LAG(tbl_base.cases) OVER(
    PARTITION BY tbl_base.city
    ORDER BY tbl_base.dt ASC
  ) AS val
FROM tbl AS tbl_base

End of month line

WITH

tbl_num AS
(
SELECT
  tbl_base.*,
  ROW_NUMBER() OVER(
    PARTITION BY tbl_base.city, DATE_TRUNC(tbl_base.dt, MONTH)
    ORDER BY tbl_base.dt DESC
  ) AS num
FROM tbl AS tbl_base
)

SELECT
  * EXCEPT(num)
FROM tbl_num
WHERE num=1
  • Grouped by month, sorted and numbered in descending order by date, and extracting the first line of the sequence with num=1.

Some functions, such as the ROW_NUMBER() and LAG() functions, do not allow you to specify a start or end position, so please read the documentation to confirm usage each time.

Other Comments, etc.

If you are using a subquery that calls ROWS BETWEEN from SELECT and the query becomes heavy or causes an error, you may be able to lighten the load by rewriting the query using the window function.

If there is a case in which you want to use values from other rows in SELECT, please remind them of the window function.

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com