I wanted to list the subdirectories of a Google Cloud Storage bucket using the GCP Python library and got stuck, so here's a note on how to do it and why it works the way it does.
The libraries in other languages are wrappers around the same API as the Python library, so the same approach should apply to them as well.
What doesn't work
At first I tried the following, expecting it to return the subdirectories, but to no avail.
```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('xxxx')
dirs = bucket.list_blobs(prefix="a/", delimiter="/")
for x in dirs:
    print(x.name)
# "a/"  <- only the placeholder object comes back, no subdirectories
```
The gotcha
Google Cloud Storage, like AWS S3, is in reality a huge key/value store whose keys are file paths and whose values are file contents.
There is no such thing as a directory; a pseudo-directory is merely expressed by putting "/" inside a path (key).
So simply listing blobs does not give you the directories.
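To make this concrete, here is a minimal sketch (the bucket name 'xxxx' and the object key are placeholders) showing that uploading an object with "/" in its key creates exactly one flat key, and no directory object anywhere:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('xxxx')  # placeholder bucket name

# Uploading to a "/"-separated key creates exactly one object;
# no directory object "a/01/" comes into existence anywhere.
bucket.blob("a/01/hello.txt").upload_from_string("hello")

# Listing without a delimiter returns the flat keys themselves.
for blob in bucket.list_blobs(prefix="a/"):
    print(blob.name)  # e.g. "a/01/hello.txt"
```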
What works
Let's look at the approaches that do work.
gsutil
```
$ gsutil ls gs://xxxx/a/
gs://xxxx/a/
gs://xxxx/a/01/
gs://xxxx/a/02/
```
REST API
In the "Try this API" panel of the objects.list reference page, set the following:
- bucket: xxxx
- delimiter: /
- prefix: a/
Result
```json
{
  "kind": "storage#objects",
  "prefixes": [
    "a/01/",
    "a/02/"
  ],
  "items": [
    {
      "kind": "storage#object",
      "id": "xxxx/a//1234567890",
      ...
    }
  ]
}
```
Why this works, and how to get at it from the Python library
bucket.list_blobs() calls the objects.list REST API shown above.
What we want is the "prefixes" field of the response, but the iterator only yields what is stored in "items", i.e. the single "xxxx/a/" blob matching the specified prefix.
So iterating over the return value of bucket.list_blobs() produces only that "a/" blob, and "prefixes" cannot be obtained that way.
bucket.list_blobs() ultimately returns every matching blob, but when there are many of them the results are split into pages, and the REST API is called once per page. Each call's response is exactly the REST API return value shown above.
The "prefixes" therefore live on each response, i.e. on each page, and are exposed through the page's "prefixes" property.
The pages themselves are available through the "pages" property of the iterator returned by bucket.list_blobs().
"pages" is an iterator of pages, and collecting the "prefixes" of every page gives the list of subdirectories we are after.
Putting this together, listing the subdirectories looks like this.
Procedure
```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('xxxx')
blobs = bucket.list_blobs(prefix="a/", delimiter="/")

dirs = []
for page in blobs.pages:
    dirs.extend(page.prefixes)

for x in dirs:
    print(x)
# "a/01/"
# "a/02/"
```
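As an aside, the GitHub issue and Stack Overflow answer linked below point out that, depending on the library version, the iterator also aggregates the prefixes itself once it has been fully consumed. A sketch under that assumption:

```python
# Version-dependent alternative: after the iterator is exhausted,
# it exposes the aggregated prefixes directly.
blobs = bucket.list_blobs(prefix="a/", delimiter="/")
list(blobs)            # force every page to be fetched
print(blobs.prefixes)  # {'a/01/', 'a/02/'}  (a set, in the versions I checked)
```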
Impressions
As with AWS S3 before, if you assume that storage has something like a directory, you get tripped up.
For example, in Azure Blob Storage, deleting every file in a directory makes the directory disappear as well, which shows there was never really a directory to begin with.
AWS S3", "GCP Cloud Storage", and "Azure Blob" all have similar specifications, so you may want to remember that the storage directory is not a list of blobs, but a list of prefixes when "delimiter" is set to "/".
Reference
- https://github.com/googleapis/google-cloud-python/issues/920#issuecomment-313384847
- https://stackoverflow.com/questions/51379101/how-to-get-list-blobs-to-behave-like-gsutil
Related Articles
[https://www.kwbtblog.com/entry/2020/01/29/173223:embed:cite]
[https://www.kwbtblog.com/entry/2018/12/14/015706:embed:cite]