I usually use Google BigQuery only for simple aggregate SQL, but it has many other useful features besides aggregation.
Since I don't use them often and end up reaching for the manual every time I do, I'm writing up brief notes on how to use them for myself.
I've grouped the topics into several categories; this article covers the ones that didn't fit into any category, as "etc.".
"Notes on slightly more elaborate BigQuery usage" article list
- Notes on slightly more elaborate BigQuery usage (views, materialized views, table functions)
- Notes on slightly more elaborate BigQuery usage (table creation and data updates)
- Notes on slightly more elaborate BigQuery usage (scheduled queries)
- Notes on slightly more elaborate BigQuery usage (programming)
- Notes on slightly more elaborate BigQuery usage (etc.)
Arrays
Create an array from columns
SELECT ARRAY<STRING>[col01, col02] AS array_data FROM tbl
Retrieve an element from an array by specifying an index
- Use OFFSET() to specify the index
- Using SAFE_OFFSET() instead returns NULL when the element does not exist
SELECT array_data[SAFE_OFFSET(2)] AS array_value FROM tbl
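As a self-contained sketch of the difference (the inline array and aliases are illustrative):

```sql
-- arr has two elements, at indexes 0 and 1
WITH tbl AS (SELECT ["a", "b"] AS arr)
SELECT
  arr[OFFSET(1)]      AS second_value,  -- "b"
  arr[SAFE_OFFSET(2)] AS missing_value  -- NULL; arr[OFFSET(2)] would raise an error
FROM tbl
```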
Create an array with GROUP BY
SELECT col01, ARRAY_AGG(col02) AS array_data FROM tbl GROUP BY col01
- Specify the order of the elements in the array with ORDER BY
SELECT col01, ARRAY_AGG(col02 ORDER BY col03, col04) AS array_data FROM tbl GROUP BY col01
Expand array to rows
- Rows with zero array elements are not expanded.
SELECT tt.col01, array_value FROM tbl AS tt CROSS JOIN UNNEST(tt.array_data) AS array_value
- To expand rows with zero array elements as NULL, use LEFT JOIN
SELECT tt.col01, array_value FROM tbl AS tt LEFT JOIN UNNEST(tt.array_data) AS array_value ON TRUE
The JSON type
- BigQuery has a JSON type in addition to the basic types
- Elements of a JSON value can be referenced with field access operators, just as in regular JSON
WITH tbl AS ( SELECT JSON'{"id": 123, "name": "ABC", "items": ["itemA", "itemB"]}' AS json_val ) SELECT json_val.id AS id, json_val.name AS name, json_val.items[0] AS first_item FROM tbl
Element access yields the JSON type
- When accessing an element, even a scalar value has type JSON, so it cannot be operated on as is.
The following is an error because it mixes the JSON and INT64 types.
SELECT json_val.id * 10 AS id_by_10 FROM tbl
To retrieve a scalar value from the JSON type, convert it with the function for the target type: BOOL(), INT64(), FLOAT64(), STRING().
The following converts from the JSON type to INT64, so there is no error.
SELECT INT64(json_val.id) * 10 AS id_by_10 FROM tbl
Obtaining an array
- Accessing an array inside a JSON value yields a value of type JSON, not a BigQuery array
- Use JSON_QUERY_ARRAY() to extract the elements of a JSON array into a BigQuery array; the stored values are still of type JSON (ARRAY<JSON>)
SELECT JSON_QUERY_ARRAY(json_val.items) AS items_array FROM tbl
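To go one step further and obtain the items as BigQuery STRING values, the extracted ARRAY<JSON> can be unnested and each element converted with STRING() (a sketch reusing the tbl definition from the earlier example):

```sql
WITH tbl AS (
  SELECT JSON'{"id": 123, "name": "ABC", "items": ["itemA", "itemB"]}' AS json_val
)
-- One row per array element, each converted from JSON to STRING
SELECT STRING(item) AS item_name
FROM tbl, UNNEST(JSON_QUERY_ARRAY(json_val.items)) AS item
```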
Convert stringified JSON to the JSON type
- Use PARSE_JSON()
WITH tbl AS ( SELECT '{"id": 123, "name": "ABC", "items": ["itemA", "itemB"]}' AS json_str_val ) SELECT PARSE_JSON(json_str_val).id AS id, PARSE_JSON(json_str_val).name AS name, PARSE_JSON(json_str_val).items[0] AS first_item FROM tbl
Importing JSON data
The following patterns can be used to import newline-delimited JSON data into a table:
- Import each whole line as a single JSON value
- Store an element as the JSON type
- Store stringified JSON data in a STRING element
Import each whole line as a single JSON value
data.json
{"event_date": "2023-01-01", "json_data": {"id": 123, "name": "ABC"}}
{"event_date": "2023-01-02", "json_data": {"id": 456, "name": "DEF"}}
schema
json_data: JSON
Store an element as the JSON type
data.json
{"event_date": "2023-01-01", "json_data": {"id": 123, "name": "ABC"}}
{"event_date": "2023-01-02", "json_data": {"id": 456, "name": "DEF"}}
schema
event_date: DATE, json_data: JSON
Store stringified JSON data in a STRING element
- Hold the JSON data as a string, and convert it to the JSON type with PARSE_JSON() in SQL as needed
data.json
{"event_date": "2023-01-01", "json_str_data": "{\"id\": 123, \"name\": \"ABC\"}"}
{"event_date": "2023-01-02", "json_str_data": "{\"id\": 456, \"name\": \"DEF\"}"}
schema
event_date: DATE, json_str_data: STRING
SQL
SELECT PARSE_JSON(json_str_data) AS json_data FROM tbl
bq load
- Normal tables and date-partitioned tables
#!/bin/bash
dataset=dataset_name
table=table_name
bq load \
  --replace=true \
  --source_format=NEWLINE_DELIMITED_JSON \
  ${dataset}.${table} \
  ./data.json
- Ingestion-time partitioned tables
#!/bin/bash
dataset=dataset_name
table=table_name
partitiontime=20220101
bq load \
  --replace=true \
  --source_format=NEWLINE_DELIMITED_JSON \
  ${dataset}.${table}"\$"${partitiontime} \
  ./2022-01-01.json
Caution when using bq load with date-partitioned tables
When --replace=true is specified, all data other than the newly loaded data is deleted.
To keep the table's contents correct across repeated bq load runs on a date-partitioned table, use the following procedure:
- Restrict the data to be loaded to a single date
- Delete that date's data from the table in advance
- Execute bq load with --replace=false
Delete data for a specified date
DELETE FROM `project_name.dataset_name`.table_name WHERE DATE(column_name)="2023-01-01"
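Putting the steps together, the procedure might look like the following shell sketch (the project, dataset, table, column, and file names are placeholders):

```shell
#!/bin/bash
# Sketch: reload a single date into a date-partitioned table
# without wiping the rest of the table. All names are placeholders.
target_date="2023-01-01"

# 1. Delete the target date's data from the table in advance
bq query --use_legacy_sql=false \
  "DELETE FROM \`project_name.dataset_name\`.table_name WHERE DATE(column_name) = '${target_date}'"

# 2. Load the new data without replacing existing data
bq load \
  --replace=false \
  --source_format=NEWLINE_DELIMITED_JSON \
  dataset_name.table_name \
  ./${target_date}.json
```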
bq query
Values in a query can be parameterized
- Parameterize with @param_name in the query
- Pass values with --parameter=<param_name>::<value>
sample.sql
SELECT * FROM tbl WHERE user_id=@param_id
query.sh
#!/bin/bash
cat sample.sql | bq query \
  --use_legacy_sql=false \
  --parameter=param_id::123
Sample data creation
- How to create each record by writing only the values of each record side by side
SELECT * FROM UNNEST( ARRAY<STRUCT<val01 INT64, val02 STRING>> [ (1, "A"), (2, "B"), (3, "C") ] )
| val01 | val02 |
| --- | --- |
| 1 | A |
| 2 | B |
| 3 | C |
Other useful syntax
EXCEPT()
- While * outputs all columns, * EXCEPT(...) excludes the specified columns from the output.
SELECT * EXCEPT(user_id, created_at) FROM tbl
DISTINCT
- Remove table duplicates
SELECT DISTINCT * FROM tbl
ANY_VALUE()
- When all values in a column are known to be the same within a GROUP BY group, retrieve any one of them.
SELECT user_id, ANY_VALUE(age) AS age FROM tbl GROUP BY user_id
Mode (most frequent value)
Use APPROX_TOP_COUNT(col, x)
- col is the column to aggregate
- x is the number of top values to obtain; here it is set to 1 because we only want the mode
- The result is an array of structs with count (the number of occurrences) and value (the value)
- Take the first element of the array and extract value from it
- Note that NULL is included in the counted values.

| user_id | val |
| -------- | --- |
| 1 | A |
| 1 | A |
| 1 | B |
| 2 | A |
| 2 | B |
| 2 | B |
WITH tbl AS (
  SELECT * FROM UNNEST(ARRAY<STRUCT<user_id INT64, val STRING>>[
    (1, "A"), (1, "A"), (1, "B"), (2, "A"), (2, "B"), (2, "B")
  ])
)
SELECT
  user_id,
  APPROX_TOP_COUNT(val, 1)[OFFSET(0)].count AS top_count,
  APPROX_TOP_COUNT(val, 1)[OFFSET(0)].value AS top_count_value
FROM tbl
GROUP BY user_id
Median
Use PERCENTILE_CONT(col, x)
- col is the column to aggregate
- x is the position to extract when the whole range is taken as 1; the median is 0.5
- It is a window function, so every row gets the median value
- To reduce to one value per group, combine GROUP BY with ANY_VALUE()

| user_id | val |
| -------- | --- |
| 1 | 0 |
| 1 | 0 |
| 1 | 7 |
| 1 | 10 |
| 1 | 10 |
WITH tbl AS (
  SELECT * FROM UNNEST(ARRAY<STRUCT<user_id INT64, val INT64>>[
    (1, 0), (1, 0), (1, 7), (1, 10), (1, 10)
  ])
),
tbl_res AS (
  SELECT user_id, PERCENTILE_CONT(val, 0.5) OVER (PARTITION BY user_id) AS median
  FROM tbl
)
SELECT user_id, ANY_VALUE(median) AS median FROM tbl_res GROUP BY user_id