Welcome to new things

A memo blog on technology, electronics projects, gadgets, and games

Memo on slightly elaborate uses of BigQuery (miscellaneous)

I usually use Google BigQuery only for simple aggregation SQL, but it has many other useful features beyond aggregation.

However, since I use these features only occasionally and end up searching the manual every time I need them, I've put together brief notes on how to use them, mainly for my own reference.

I've grouped the notes into several categories; this post collects the items that don't fit into any of those categories, as "others".

"Memo for Slightly Elaborate Usage of BigQuery" article list

Arrays

Create an array from columns

SELECT
  ARRAY<STRING>[col01, col02] AS array_data
FROM tbl

Retrieve an element from an array by specifying an index

  • Use OFFSET() to specify the index; an out-of-range index causes an error.
  • Use SAFE_OFFSET() to get NULL instead when the element does not exist.
SELECT
  array_data[SAFE_OFFSET(2)] AS array_value
FROM tbl
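As a point of comparison only, the two indexing behaviors can be sketched in Python (hypothetical helper names; this is not how BigQuery implements them):

```python
def offset(arr, i):
    # OFFSET(): an out-of-range index raises an error
    return arr[i]

def safe_offset(arr, i):
    # SAFE_OFFSET(): an out-of-range index yields NULL (None here)
    return arr[i] if 0 <= i < len(arr) else None

print(offset(["a", "b"], 1))       # b
print(safe_offset(["a", "b"], 2))  # None
```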

Create an array with GROUP BY

SELECT
  col01,
  ARRAY_AGG(col02) AS array_data
FROM tbl
GROUP BY col01
  • To specify the order of elements in the array, add ORDER BY inside ARRAY_AGG()
SELECT
  col01,
  ARRAY_AGG(col02 ORDER BY col03, col04) AS array_data
FROM tbl
GROUP BY col01

Expand array to rows

  • Rows whose array has zero elements are not expanded (they disappear from the result).
SELECT
  tt.col01,
  array_value
FROM tbl AS tt
CROSS JOIN UNNEST(tt.array_data) AS array_value
  • To keep rows whose array has zero elements (the expanded value becomes NULL), use LEFT JOIN
SELECT
  tt.col01,
  array_value
FROM tbl AS tt
LEFT JOIN UNNEST(tt.array_data) AS array_value
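The difference between the two joins can be illustrated with an equivalent list expansion in Python (a sketch with made-up sample rows, not BigQuery itself):

```python
rows = [
    {"col01": "x", "array_data": ["p", "q"]},
    {"col01": "y", "array_data": []},  # zero elements
]

# CROSS JOIN UNNEST: rows whose array is empty produce no output rows
cross = [(r["col01"], v) for r in rows for v in r["array_data"]]

# LEFT JOIN UNNEST: rows whose array is empty are kept, with NULL (None)
left = [(r["col01"], v) for r in rows for v in (r["array_data"] or [None])]

print(cross)  # [('x', 'p'), ('x', 'q')]
print(left)   # [('x', 'p'), ('x', 'q'), ('y', None)]
```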

JSON

  • BigQuery has a JSON type in addition to the basic types
  • Fields of a JSON value can be referenced with the usual access operators, as with ordinary JSON
WITH

tbl AS
(
SELECT
  JSON'{"id": 123, "name": "ABC", "items": ["itemA", "itemB"]}' AS json_val
)

SELECT
  json_val.id AS id,
  json_val.name AS name,
  json_val.items[0] AS first_item
FROM tbl

Accessed elements are of type JSON

  • When you access an element, its type is JSON even if the value is a scalar, so it cannot be operated on directly.

The following is an error because it mixes the JSON and INT64 types:

SELECT
  json_val.id * 10 AS id_by_10
FROM tbl
  • To retrieve a scalar value from type JSON, convert it with the function for the target type:

    • BOOL()
    • INT64()
    • FLOAT64()
    • STRING()

The following converts the value from JSON to INT64, so there is no error:

SELECT
  INT64(json_val.id) * 10 AS id_by_10
FROM tbl

Obtaining an Array

  • Accessing an array inside a JSON value yields a value of type JSON, not a BigQuery array
  • Use JSON_QUERY_ARRAY() to extract the elements of a JSON array into a BigQuery array
  • The values stored in the resulting array are still of type JSON

    • i.e. the result is ARRAY<JSON>
SELECT
  JSON_QUERY_ARRAY(json_val.items) AS items_array
FROM tbl

Convert stringified JSON to type JSON

  • Use PARSE_JSON()
WITH

tbl AS
(
SELECT
  '{"id": 123, "name": "ABC", "items": ["itemA", "itemB"]}' AS json_str_val
)

SELECT
  PARSE_JSON(json_str_val).id AS id,
  PARSE_JSON(json_str_val).name AS name,
  PARSE_JSON(json_str_val).items[0] AS first_item
FROM tbl
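For contrast, Python's json.loads() plays the role of PARSE_JSON(), but unlike BigQuery it returns native types directly, so no INT64()-style conversion step is needed (illustrative only):

```python
import json

json_str = '{"id": 123, "name": "ABC", "items": ["itemA", "itemB"]}'
doc = json.loads(json_str)  # analogous to PARSE_JSON()

# In BigQuery the extracted id would still be type JSON and need INT64();
# json.loads() already gives an int, so arithmetic works directly.
print(doc["id"] * 10)   # 1230
print(doc["items"][0])  # itemA
```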

Data import from JSON

  • Newline-delimited JSON data can be imported into a table using the following patterns

    • Import each whole line as a single JSON value
    • Import an individual field as the JSON type
    • Store string-encoded JSON in a STRING column and parse it in SQL

Import the whole as a single JSON

data.json
{"event_date": "2023-01-01", "json_data": {"id": 123, "name": "ABC"}}
{"event_date": "2023-01-02", "json_data": {"id": 456, "name": "DEF"}}
schema
json_data: JSON

Import an element as the JSON type

data.json
{"event_date": "2023-01-01", "json_data": {"id": 123, "name": "ABC"}}
{"event_date": "2023-01-02", "json_data": {"id": 456, "name": "DEF"}}
schema
event_date: DATE,
json_data: JSON

Store string-encoded JSON in a STRING column

  • The JSON data is held as a string and converted to the JSON type with PARSE_JSON() in SQL as needed
data.json
{"event_date": "2023-01-01", "json_str_data": "{\"id\": 123, \"name\": \"ABC\"}"}
{"event_date": "2023-01-02", "json_str_data": "{\"id\": 456, \"name\": \"DEF\"}"}
schema
event_date: DATE,
json_str_data: STRING
SQL
SELECT
  PARSE_JSON(json_str_data) AS json_data
FROM tbl

bq load

  • For normal tables and date-partitioned tables
#!/bin/bash

dataset=dataset_name
table=table_name

bq load \
  --replace=true \
  --source_format=NEWLINE_DELIMITED_JSON \
  ${dataset}.${table} \
    ./data.json
  • For ingestion-time partitioned tables (loading into a specific partition)
#!/bin/bash

dataset=dataset_name
table=table_name
partitiontime=20220101

bq load \
  --replace=true \
  --source_format=NEWLINE_DELIMITED_JSON \
  ${dataset}.${table}"\$"${partitiontime} \
    ./2022-01-01.json

Caution when using bq load with date-partitioned tables

With --replace=true, the whole table is replaced, so all existing data other than the newly loaded data is deleted.

To keep results stable when bq load is run repeatedly against a date-partitioned table, use the following procedure:

  • Restrict the data to be loaded to a single date.
  • Delete that date's data from the table in advance.
  • Run bq load with --replace=false.

Delete data for a specified date

DELETE FROM `project_name.dataset_name`.table_name
WHERE DATE(column_name)="2023-01-01"

bq query

  • Values in a query can be parameterized

    • Reference parameters in the query as @param_name
    • Pass values with --parameter=<param_name>::<value>

sample.sql

SELECT
  *
FROM tbl
WHERE user_id=@param_id

query.sh

#!/bin/bash

# --use_legacy_sql=false: parameterized queries require standard SQL
cat sample.sql | bq query \
  --use_legacy_sql=false \
  --parameter=param_id::123

Sample data creation

  • You can create sample records by simply listing the values of each record side by side
SELECT
  *
FROM UNNEST(
  ARRAY<STRUCT<val01 INT64, val02 STRING>>
  [
    (1, "A"),
    (2, "B"),
    (3, "C")
  ]
)
| val01 | val02 |
| ----- | ----- |
| 1 | A |
| 2 | B |
| 3 | C |

Other useful syntax

EXCEPT()

  • While * outputs all columns, * EXCEPT(...) outputs all columns except the listed ones.
SELECT
  * EXCEPT(user_id, created_at)
FROM tbl

DISTINCT

  • Remove duplicate rows from a table
SELECT
  DISTINCT *
FROM tbl

ANY_VALUE()

  • When a column is known to have the same value within each group after GROUP BY, use ANY_VALUE() to retrieve one of those values.
SELECT
  user_id,
  ANY_VALUE(age) AS age
FROM tbl
GROUP BY user_id

Frequency (mode)

  • Use APPROX_TOP_COUNT(col, x)

    • col is the column to count
    • x is the number of top values to return; set it to 1 here because we only want the mode
  • The result is an array of structs with count (the number of occurrences) and value (the value itself)

    • Take the first element of the array and extract count / value from it
  • Note that NULL values are also counted

Example input:

| user_id | val |
| ------- | --- |
| 1 | A |
| 1 | A |
| 1 | B |
| 2 | A |
| 2 | B |
| 2 | B |

WITH
tbl AS (
SELECT
  *
FROM UNNEST(ARRAY<STRUCT<user_id INT64, val STRING>>
[
  (1, "A"),
  (1, "A"),
  (1, "B"),

  (2, "A"),
  (2, "B"),
  (2, "B")
]))

SELECT
  user_id,
  APPROX_TOP_COUNT(val, 1)[OFFSET(0)].count AS top_count,
  APPROX_TOP_COUNT(val, 1)[OFFSET(0)].value AS top_count_value,
FROM tbl
GROUP BY user_id
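The same mode calculation can be checked locally with Python's collections.Counter; note that Counter is exact, whereas APPROX_TOP_COUNT() is an approximate aggregate (a comparison sketch using the sample data above):

```python
from collections import Counter

rows = [(1, "A"), (1, "A"), (1, "B"), (2, "A"), (2, "B"), (2, "B")]

# Group values per user_id, then take the most frequent value and its count,
# mirroring GROUP BY user_id + APPROX_TOP_COUNT(val, 1)[OFFSET(0)]
by_user = {}
for user_id, val in rows:
    by_user.setdefault(user_id, []).append(val)

top = {u: Counter(vs).most_common(1)[0] for u, vs in by_user.items()}
print(top)  # {1: ('A', 2), 2: ('B', 2)}
```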

Median

  • Use PERCENTILE_CONT(col, x)

    • col is the column to take the percentile of
    • x is the percentile position, expressed as a fraction of 1; 0.5 gives the median
  • It is a window (analytic) function, so every row gets the median value
  • Use GROUP BY together with ANY_VALUE() to reduce to one value per group

Example input:

| user_id | val |
| ------- | --- |
| 1 | 0 |
| 1 | 0 |
| 1 | 7 |
| 1 | 10 |
| 1 | 10 |

WITH
tbl AS (
SELECT
  *
FROM UNNEST(ARRAY<STRUCT<user_id INT64, val INT64>>
[
  (1, 0),
  (1, 0),
  (1, 7),
  (1, 10),
  (1, 10)
]))

,tbl_res AS
(
SELECT
  user_id,
  PERCENTILE_CONT(val, 0.5) OVER(PARTITION BY user_id) AS median
FROM tbl
)

SELECT
  user_id,
  ANY_VALUE(median) AS median
FROM tbl_res
GROUP BY user_id
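The result can be sanity-checked with Python's statistics.median, which, like PERCENTILE_CONT(val, 0.5), linearly interpolates between the two middle values when the count is even (a local check, not BigQuery):

```python
from statistics import median

vals = [0, 0, 7, 10, 10]
print(median(vals))  # 7, matching the SQL above

# With an even number of values, the two middle values are interpolated
print(median([1, 2, 3, 4]))  # 2.5
```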
