I usually use Google BigQuery only for simple aggregate SQL, but it has many other useful features besides aggregation.
Since I don't use them often and end up reaching for the manual every time I do, I'm writing up brief notes on how to use them for myself.
I've grouped the topics into several categories; this article covers the ones that didn't fit into any category, as "etc.".
"Notes on slightly more elaborate BigQuery usage" article list
- Notes on slightly more elaborate BigQuery usage (views, materialized views, table functions)
- Notes on slightly more elaborate BigQuery usage (table creation and data updates)
- Notes on slightly more elaborate BigQuery usage (scheduled queries)
- Notes on slightly more elaborate BigQuery usage (programming)
- Notes on slightly more elaborate BigQuery usage (etc.)
Arrays
Create an array from columns
SELECT ARRAY<STRING>[col01, col02] AS array_data FROM tbl
Retrieve an element from an array by specifying an index
- Use OFFSET() to specify the index
- Using SAFE_OFFSET() instead returns NULL when the element does not exist
SELECT array_data[SAFE_OFFSET(2)] AS array_value FROM tbl
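As a self-contained sketch of the difference (the inline array and aliases are illustrative):

```sql
-- arr has two elements, at indexes 0 and 1
WITH tbl AS (SELECT ["a", "b"] AS arr)
SELECT
  arr[OFFSET(1)]      AS second_value,  -- "b"
  arr[SAFE_OFFSET(2)] AS missing_value  -- NULL; arr[OFFSET(2)] would raise an error
FROM tbl
```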
Create an array with GROUP BY
SELECT col01, ARRAY_AGG(col02) AS array_data FROM tbl GROUP BY col01
- Specify the order of the elements in the array with ORDER BY
SELECT col01, ARRAY_AGG(col02 ORDER BY col03, col04) AS array_data FROM tbl GROUP BY col01
Expand array to rows
- Rows with zero array elements are not expanded.
SELECT tt.col01, array_value FROM tbl AS tt CROSS JOIN UNNEST(tt.array_data) AS array_value
- To expand rows with zero array elements as NULL, use LEFT JOIN
SELECT tt.col01, array_value FROM tbl AS tt LEFT JOIN UNNEST(tt.array_data) AS array_value ON TRUE
The JSON type
- BigQuery has a JSON type in addition to the basic types
- Elements of a JSON value can be referenced with field access operators, just as in regular JSON
WITH tbl AS ( SELECT JSON'{"id": 123, "name": "ABC", "items": ["itemA", "itemB"]}' AS json_val ) SELECT json_val.id AS id, json_val.name AS name, json_val.items[0] AS first_item FROM tbl
Element access yields the JSON type
- When accessing an element, even a scalar value has type JSON, so it cannot be operated on as is.
The following is an error because it mixes the JSON and INT64 types.
SELECT json_val.id * 10 AS id_by_10 FROM tbl
To retrieve a scalar value from the JSON type, convert it with the function for the target type: BOOL(), INT64(), FLOAT64(), STRING().
The following converts from the JSON type to INT64, so there is no error.
SELECT INT64(json_val.id) * 10 AS id_by_10 FROM tbl
Obtaining an array
- Accessing an array inside a JSON value yields a value of type JSON, not a BigQuery array
- Use JSON_QUERY_ARRAY() to extract the elements of a JSON array into a BigQuery array; the stored values are still of type JSON (ARRAY<JSON>)
SELECT JSON_QUERY_ARRAY(json_val.items) AS items_array FROM tbl
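To go one step further and obtain the items as BigQuery STRING values, the extracted ARRAY<JSON> can be unnested and each element converted with STRING() (a sketch reusing the tbl definition from the earlier example):

```sql
WITH tbl AS (
  SELECT JSON'{"id": 123, "name": "ABC", "items": ["itemA", "itemB"]}' AS json_val
)
-- One row per array element, each converted from JSON to STRING
SELECT STRING(item) AS item_name
FROM tbl, UNNEST(JSON_QUERY_ARRAY(json_val.items)) AS item
```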
Convert stringified JSON to the JSON type
- Use PARSE_JSON()
WITH tbl AS ( SELECT '{"id": 123, "name": "ABC", "items": ["itemA", "itemB"]}' AS json_str_val ) SELECT PARSE_JSON(json_str_val).id AS id, PARSE_JSON(json_str_val).name AS name, PARSE_JSON(json_str_val).items[0] AS first_item FROM tbl
Importing JSON data
The following patterns can be used to import newline-delimited JSON data into a table:
- Import each whole line as a single JSON value
- Store an element as the JSON type
- Store stringified JSON data in a STRING element
Import each whole line as a single JSON value
data.json
{"event_date": "2023-01-01", "json_data": {"id": 123, "name": "ABC"}}
{"event_date": "2023-01-02", "json_data": {"id": 456, "name": "DEF"}}
schema
json_data: JSON
Store an element as the JSON type
data.json
{"event_date": "2023-01-01", "json_data": {"id": 123, "name": "ABC"}}
{"event_date": "2023-01-02", "json_data": {"id": 456, "name": "DEF"}}
schema
event_date: DATE, json_data: JSON
Store stringified JSON data in a STRING element
- Hold the JSON data as a string, and convert it to the JSON type with PARSE_JSON() in SQL as needed
data.json
{"event_date": "2023-01-01", "json_str_data": "{\"id\": 123, \"name\": \"ABC\"}"}
{"event_date": "2023-01-02", "json_str_data": "{\"id\": 456, \"name\": \"DEF\"}"}
schema
event_date: DATE, json_str_data: STRING
SQL
SELECT PARSE_JSON(json_str_data) AS json_data FROM tbl
bq load
- Normal tables and date-partitioned tables
#!/bin/bash
dataset=dataset_name
table=table_name
bq load \
  --replace=true \
  --source_format=NEWLINE_DELIMITED_JSON \
  ${dataset}.${table} \
  ./data.json
- Ingestion-time partitioned tables
#!/bin/bash
dataset=dataset_name
table=table_name
partitiontime=20220101
bq load \
  --replace=true \
  --source_format=NEWLINE_DELIMITED_JSON \
  ${dataset}.${table}"\$"${partitiontime} \
  ./2022-01-01.json
Caution when using bq load with date-partitioned tables
When --replace=true is specified, all data other than the newly loaded data is deleted.
To keep the table's contents correct across repeated bq load runs on a date-partitioned table, use the following procedure:
- Restrict the data to be loaded to a single date
- Delete that date's data from the table in advance
- Execute bq load with --replace=false
Delete data for a specified date
DELETE FROM `project_name.dataset_name`.table_name WHERE DATE(column_name)="2023-01-01"
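Putting the steps together, the procedure might look like the following shell sketch (the project, dataset, table, column, and file names are placeholders):

```shell
#!/bin/bash
# Sketch: reload a single date into a date-partitioned table
# without wiping the rest of the table. All names are placeholders.
target_date="2023-01-01"

# 1. Delete the target date's data from the table in advance
bq query --use_legacy_sql=false \
  "DELETE FROM \`project_name.dataset_name\`.table_name WHERE DATE(column_name) = '${target_date}'"

# 2. Load the new data without replacing existing data
bq load \
  --replace=false \
  --source_format=NEWLINE_DELIMITED_JSON \
  dataset_name.table_name \
  ./${target_date}.json
```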
bq query
Values in a query can be parameterized
- Parameterize with @param_name in the query
- Pass values with --parameter=<param_name>::<value>
sample.sql
SELECT * FROM tbl WHERE user_id=@param_id
query.sh
#!/bin/bash
cat sample.sql | bq query \
  --use_legacy_sql=false \
  --parameter=param_id::123
Sample data creation
- How to create each record by writing only the values of each record side by side
SELECT * FROM UNNEST( ARRAY<STRUCT<val01 INT64, val02 STRING>> [ (1, "A"), (2, "B"), (3, "C") ] )
| val01 | val02 |
| --- | --- |
| 1 | A |
| 2 | B |
| 3 | C |
Other useful syntax
EXCEPT()
- While * outputs all columns, * EXCEPT(...) excludes the specified columns from the output.
SELECT * EXCEPT(user_id, created_at) FROM tbl
DISTINCT
- Remove table duplicates
SELECT DISTINCT * FROM tbl
ANY_VALUE()
- When all values in a column are known to be the same within a GROUP BY group, retrieve any one of them.
SELECT user_id, ANY_VALUE(age) AS age FROM tbl GROUP BY user_id
Mode (most frequent value)
Use APPROX_TOP_COUNT(col, x)
- col is the column to aggregate
- x is the number of top values to obtain; here it is set to 1 because we only want the mode
- The result is an array of structs with count (the number of occurrences) and value (the value)
- Take the first element of the array and extract value from it
- Note that NULL is included in the counted values.

| user_id | val |
| -------- | --- |
| 1 | A |
| 1 | A |
| 1 | B |
| 2 | A |
| 2 | B |
| 2 | B |
WITH tbl AS (
  SELECT * FROM UNNEST(ARRAY<STRUCT<user_id INT64, val STRING>>[
    (1, "A"), (1, "A"), (1, "B"), (2, "A"), (2, "B"), (2, "B")
  ])
)
SELECT
  user_id,
  APPROX_TOP_COUNT(val, 1)[OFFSET(0)].count AS top_count,
  APPROX_TOP_COUNT(val, 1)[OFFSET(0)].value AS top_count_value
FROM tbl
GROUP BY user_id
Median
Use PERCENTILE_CONT(col, x)
- col is the column to aggregate
- x is the position to extract when the whole range is taken as 1; the median is 0.5
- It is a window function, so every row gets the median value
- To reduce to one value per group, combine GROUP BY with ANY_VALUE()

| user_id | val |
| -------- | --- |
| 1 | 0 |
| 1 | 0 |
| 1 | 7 |
| 1 | 10 |
| 1 | 10 |
WITH tbl AS (
  SELECT * FROM UNNEST(ARRAY<STRUCT<user_id INT64, val INT64>>[
    (1, 0), (1, 0), (1, 7), (1, 10), (1, 10)
  ])
),
tbl_res AS (
  SELECT user_id, PERCENTILE_CONT(val, 0.5) OVER (PARTITION BY user_id) AS median
  FROM tbl
)
SELECT user_id, ANY_VALUE(median) AS median FROM tbl_res GROUP BY user_id