I usually use Google BigQuery only for simple aggregate SQL, but it has many other useful features beyond aggregation.
Since I don't use them often and end up reaching for the manual every time I do, I have put together brief notes on how to use them for my own reference.
I have grouped the notes into several categories; this post collects the items that do not fit into any of the other categories.
"Memo for a Slightly Elaborate Use of BigQuery" article list
- Memo for a slightly elaborate use of BigQuery (views, materialized views, table functions)
- Memo for a slightly elaborate use of BigQuery (table creation and data update)
- Memo for a slightly elaborate use of BigQuery (scheduled queries)
- Memo for a slightly elaborate use of BigQuery (programming)
- Memo for a slightly elaborate use of BigQuery (etc.)
Arrays
Create an array from columns
```sql
SELECT ARRAY<STRING>[col01, col02] AS array_data FROM tbl
```
Retrieve an element from an array by specifying an index
- Use `OFFSET()` to specify the index
- Using `SAFE_OFFSET()` returns `NULL` when the element does not exist
```sql
SELECT array_data[SAFE_OFFSET(2)] AS array_value FROM tbl
```
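A minimal runnable sketch of the difference (the array literal is made up for illustration):

```sql
WITH t AS (
  SELECT ARRAY<STRING>["a", "b"] AS arr
)
SELECT
  arr[OFFSET(1)]      AS second_element,  -- "b"
  arr[SAFE_OFFSET(5)] AS out_of_range     -- NULL; OFFSET(5) would raise an error
FROM t
```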
Create an array with GROUP BY
```sql
SELECT col01, ARRAY_AGG(col02) AS array_data FROM tbl GROUP BY col01
```
- Specify the order of the elements in the array with `ORDER BY`
```sql
SELECT col01, ARRAY_AGG(col02 ORDER BY col03, col04) AS array_data FROM tbl GROUP BY col01
```
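A runnable sketch of the ordering behavior, using made-up values and the UNNEST sample-data idiom described later in this memo:

```sql
WITH tbl AS (
  SELECT * FROM UNNEST(ARRAY<STRUCT<col01 INT64, col02 STRING, col03 INT64>>[
    (1, "B", 2), (1, "A", 1), (2, "C", 1)
  ])
)
SELECT
  col01,
  ARRAY_AGG(col02 ORDER BY col03) AS array_data  -- col01 = 1 -> ["A", "B"]
FROM tbl
GROUP BY col01
```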
Expand array to rows
- Rows whose array has zero elements are not expanded (the row is dropped)
```sql
SELECT tt.col01, array_value FROM tbl AS tt CROSS JOIN UNNEST(tt.array_data) AS array_value
```
- To also expand rows with zero array elements (as `NULL`), use `LEFT JOIN`
```sql
SELECT tt.col01, array_value FROM tbl AS tt LEFT JOIN UNNEST(tt.array_data) AS array_value
```
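A self-contained sketch of the contrast (the sample rows are made up): with `CROSS JOIN` the row whose array is empty disappears, while `LEFT JOIN` keeps it with `NULL`.

```sql
WITH tbl AS (
  SELECT 1 AS col01, ARRAY<STRING>["a", "b"] AS array_data
  UNION ALL
  SELECT 2 AS col01, ARRAY<STRING>[] AS array_data
)
SELECT tt.col01, array_value
FROM tbl AS tt
LEFT JOIN UNNEST(tt.array_data) AS array_value
-- Returns (1, "a"), (1, "b"), (2, NULL); with CROSS JOIN the col01 = 2 row is dropped
```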
JSON type
- BigQuery has a `JSON` type in addition to the basic types
- Elements of a `JSON` value can be referenced with access operators, in the same way as ordinary JSON
```sql
WITH tbl AS (
  SELECT JSON'{"id": 123, "name": "ABC", "items": ["itemA", "itemB"]}' AS json_val
)
SELECT
  json_val.id AS id,
  json_val.name AS name,
  json_val.items[0] AS first_item
FROM tbl
```
Element type is JSON
- When accessing an element, even if it is a scalar value, its type is `JSON`, so it cannot be operated on as-is
The following is an error because it mixes the `JSON` and `INT64` types.
```sql
SELECT json_val.id * 10 AS id_by_10 FROM tbl
```
To retrieve a scalar value from a `JSON` value, it must be converted with the function for the target type: `BOOL()`, `INT64()`, `FLOAT64()`, `STRING()`
The following converts the `JSON` value to `INT64`, so there is no error.
```sql
SELECT INT64(json_val.id) * 10 AS id_by_10 FROM tbl
```
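The other converters work the same way; a minimal sketch (the `active` field is added here purely for illustration):

```sql
WITH tbl AS (
  SELECT JSON'{"id": 123, "name": "ABC", "active": true}' AS json_val
)
SELECT
  INT64(json_val.id)    AS id,      -- 123 as INT64
  STRING(json_val.name) AS name,    -- "ABC" as STRING
  BOOL(json_val.active) AS active   -- true as BOOL
FROM tbl
```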
Obtaining an Array
- Accessing an array inside a `JSON` value yields an array of type `JSON`, not a BigQuery array
- Use `JSON_QUERY_ARRAY()` to extract the elements of a `JSON` array into a BigQuery array
  - The values stored in the array are still of type `JSON`, i.e. the result is `ARRAY<JSON>`
```sql
SELECT JSON_QUERY_ARRAY(json_val.items) AS items_array FROM tbl
```
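Because the extracted elements are still of type `JSON`, one more step is needed to get, for example, an `ARRAY<STRING>`; a sketch combining `JSON_QUERY_ARRAY()` with `UNNEST` and `STRING()` (names follow the earlier example):

```sql
WITH tbl AS (
  SELECT JSON'{"items": ["itemA", "itemB"]}' AS json_val
)
SELECT
  ARRAY(
    SELECT STRING(item)
    FROM UNNEST(JSON_QUERY_ARRAY(json_val.items)) AS item
  ) AS items_as_strings  -- ARRAY<STRING>: ["itemA", "itemB"]
FROM tbl
```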
Convert stringified JSON to type JSON
- Use `PARSE_JSON()`
```sql
WITH tbl AS (
  SELECT '{"id": 123, "name": "ABC", "items": ["itemA", "itemB"]}' AS json_str_val
)
SELECT
  PARSE_JSON(json_str_val).id AS id,
  PARSE_JSON(json_str_val).name AS name,
  PARSE_JSON(json_str_val).items[0] AS first_item
FROM tbl
```
Data import from JSON
The following patterns can be used to import newline-delimited JSON data into a table:
- Import the whole as a single JSON
- Element as JSON type
- String-format JSON data held in a STRING element
Import the whole as a single JSON
data.json
{"event_date": "2023-01-01", "json_data": {"id": 123, "name": "ABC"}} {"event_date": "2023-01-02", "json_data": {"id": 456, "name": "DEF"}}
schema
```
json_data: JSON
```
Element as JSON type
data.json
{"event_date": "2023-01-01", "json_data": {"id": 123, "name": "ABC"}} {"event_date": "2023-01-02", "json_data": {"id": 456, "name": "DEF"}}
schema
```
event_date: DATE, json_data: JSON
```
String-format JSON data held in a STRING element
- JSON data is held as a string and converted to the `JSON` type with `PARSE_JSON()` in SQL as needed
data.json
{"event_date": "2023-01-01", "json_str_data": "{\"id\": 123, \"name\": \"ABC\"}"} {"event_date": "2023-01-02", "json_str_data": "{\"id\": 456, \"name\": \"DEF\"}"}
schema
```
event_date: DATE, json_str_data: STRING
```
SQL
```sql
SELECT PARSE_JSON(json_str_data) AS json_data FROM tbl
```
bq load
- Normal tables and date-partitioned tables
```bash
#!/bin/bash
dataset=dataset_name
table=table_name

bq load \
  --replace=true \
  --source_format=NEWLINE_DELIMITED_JSON \
  ${dataset}.${table} \
  ./data.json
```
- Ingestion-time partitioned tables
```bash
#!/bin/bash
dataset=dataset_name
table=table_name
partitiontime=20220101

bq load \
  --replace=true \
  --source_format=NEWLINE_DELIMITED_JSON \
  ${dataset}.${table}"\$"${partitiontime} \
  ./2022-01-01.json
```
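The informal schemas noted above can also be passed to `bq load` explicitly as a JSON schema file; a sketch for the "Element as JSON type" pattern (the `schema.json` file name is arbitrary):

```bash
#!/bin/bash
dataset=dataset_name
table=table_name

# Hypothetical schema file matching "event_date: DATE, json_data: JSON"
cat > ./schema.json <<'EOF'
[
  {"name": "event_date", "type": "DATE"},
  {"name": "json_data", "type": "JSON"}
]
EOF

bq load \
  --replace=true \
  --source_format=NEWLINE_DELIMITED_JSON \
  --schema=./schema.json \
  ${dataset}.${table} \
  ./data.json
```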
Caution when using bq load with date-partitioned tables
When `--replace=true` is specified, everything other than the newly loaded data is deleted.
To keep the table contents stable when `bq load` is run repeatedly against a date-partitioned table, use the following procedure.
- Align the data to be loaded to a single date
- Delete the data for that date from the table in advance
- Execute `bq load` with `--replace=false`
Delete data for a specified date
```sql
DELETE FROM `project_name.dataset_name`.table_name WHERE DATE(column_name)="2023-01-01"
```
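Putting the procedure together, a sketch of a repeatable load for a single date (the dataset, table, and column names are placeholders from the examples above):

```bash
#!/bin/bash
dataset=dataset_name
table=table_name
target_date="2023-01-01"

# Delete any previously loaded data for the target date
bq query --use_legacy_sql=false \
  "DELETE FROM ${dataset}.${table} WHERE DATE(column_name) = '${target_date}'"

# Load with --replace=false so the rest of the table is untouched
bq load \
  --replace=false \
  --source_format=NEWLINE_DELIMITED_JSON \
  ${dataset}.${table} \
  ./data.json
```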
bq query
Values in a query can be parameterized
- Parameterize with `@param_name`
- Pass values with `--parameter=<param_name>::<value>`
sample.sql
```sql
SELECT * FROM tbl WHERE user_id=@param_id
```
query.sh
```bash
#!/bin/bash
# Parameterized queries require standard SQL
cat sample.sql | bq query \
  --use_legacy_sql=false \
  --parameter=param_id::123
```
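The full form of the flag is `--parameter=<param_name>:<type>:<value>`; when the type is omitted as above, `STRING` is assumed. If `user_id` were an `INT64` column, an explicitly typed parameter avoids a type mismatch:

```bash
#!/bin/bash
cat sample.sql | bq query \
  --use_legacy_sql=false \
  --parameter=param_id:INT64:123
```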
Sample data creation
- Records can be created by simply writing out the values of each record side by side
```sql
SELECT
  *
FROM UNNEST(
  ARRAY<STRUCT<val01 INT64, val02 STRING>>[
    (1, "A"),
    (2, "B"),
    (3, "C")
  ]
)
```
| val01 | val02 |
|---|---|
| 1 | A |
| 2 | B |
| 3 | C |
Other useful syntax
EXCEPT()
- `*` outputs all columns, while `* EXCEPT(...)` excludes the specified columns
```sql
SELECT * EXCEPT(user_id, created_at) FROM tbl
```
DISTINCT
- Remove duplicate rows from a table
```sql
SELECT DISTINCT * FROM tbl
```
ANY_VALUE()
- When a column is known to have the same value within each group after `GROUP BY`, retrieve any one value from it
```sql
SELECT user_id, ANY_VALUE(age) AS age FROM tbl GROUP BY user_id
```
Mode (most frequent value)
- Use `APPROX_TOP_COUNT(col, x)`
  - `col` is the column to count
  - `x` is the number of top values to obtain; set it to 1 here because we only want the mode
- The result is an array of structs with `count` (the number of occurrences) and `value` (the value)
  - Take the first element of the array and extract `value` from it
- Note that the counted values include `NULL`

| user_id | val |
| -------- | --- |
| 1 | A |
| 1 | A |
| 1 | B |
| 2 | A |
| 2 | B |
| 2 | B |
```sql
WITH tbl AS (
  SELECT * FROM UNNEST(ARRAY<STRUCT<user_id INT64, val STRING>>[
    (1, "A"), (1, "A"), (1, "B"),
    (2, "A"), (2, "B"), (2, "B")
  ])
)
SELECT
  user_id,
  APPROX_TOP_COUNT(val, 1)[OFFSET(0)].count AS top_count,
  APPROX_TOP_COUNT(val, 1)[OFFSET(0)].value AS top_count_value,
FROM tbl
GROUP BY user_id
```
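For this small input the approximation is exact, so the expected result is:

| user_id | top_count | top_count_value |
| -------- | --- | --- |
| 1 | 2 | A |
| 2 | 2 | B |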
Median
- Use `PERCENTILE_CONT(col, x)`
  - `col` is the column to get the percentile of
  - `x` is the position to extract, where the whole range is 1; the median is 0.5
- It is a window function, so every row will have the median value
- Collapse to one value per group with `GROUP BY` and `ANY_VALUE()`

| user_id | val |
| -------- | --- |
| 1 | 0 |
| 1 | 0 |
| 1 | 7 |
| 1 | 10 |
| 1 | 10 |
```sql
WITH tbl AS (
  SELECT * FROM UNNEST(ARRAY<STRUCT<user_id INT64, val INT64>>[
    (1, 0), (1, 0), (1, 7), (1, 10), (1, 10)
  ])
),
tbl_res AS (
  SELECT
    user_id,
    PERCENTILE_CONT(val, 0.5) OVER(PARTITION BY user_id) AS median
  FROM tbl
)
SELECT
  user_id,
  ANY_VALUE(median) AS median
FROM tbl_res
GROUP BY user_id
```
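For the sample data, user 1's sorted values are 0, 0, 7, 10, 10, so the 0.5 point falls exactly on the middle value:

| user_id | median |
| -------- | --- |
| 1 | 7.0 |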