We originally used AWS Redshift, but it was quite expensive, so we switched to Google BigQuery.
After that, Redshift Spectrum came out, and I was wondering if it might be better than BigQuery, so I touched it and here are my impressions.
Also, once Redshift Spectrum is set up for use, AWS Athena can be used in conjunction with Redshift Spectrum.
What is Redshift Spectrum?
AWS Redshift Spectrum is a part of Redshift's functionality that allows you to use files placed in S3 directly as Redshift tables, without having to go to the trouble of loading data into the tables.
This has the advantage of reducing database fees.
treatment
Just because an S3 file can be read directly as a table does not mean that it can be used immediately by specifying the S3 directory....
Even if the data itself is in S3, the table definition (table name, column name, and data type) must first be prepared.
Table definitions are done automatically using AWS Glue.
Glue can crawl the files under an S3 directory and automatically create table definitions for the data in that directory.
I was so excited that they would do the table definitions automatically. I was so excited, but it was not such a good idea....
In many cases, automatically created table definitions cannot be used as they are, so they must be modified manually. Moreover, even if you want to define the definitions clearly on your side, the automatic identification may change the definitions without your permission, and this may be a hindrance to the process. Furthermore, the table definition technology uses Hive's metastore, and it is not easy to adjust the table definitions as intended.
Once a table is defined, it can be handled by Redshift in the same way as a normal table.
Redshft Spectrum Usage
The speed was about the same as I would expect. It was neither slow nor fast.
What I was wondering about was the fees. Where the data is now in S3 files, Redshift charges the same fees as before.
To begin with, Redshift does not charge high database capacity fees, but rather high clustering fees to calculate, so the fees remained high even when the data source was Redshift Spectrum.
Athena
What is AWS Athena?
Athena, like Redshft Spectrum, is a database service that can go directly to S3 files as tables.
Table definitions are also required as well as Redshft Spectrum.
Table definitions can be created in Athena specifically for Athena, but they can also be created in Glue during Redshft Spectrum; Glue also has data processing capabilities, and I got the sense that AWS seems to have made Glue a data processing hub.
Athena Usage
Athena uses a technology called "Presto," but it was difficult to use because of the quirks in the way the queries were written: queries for Redshft Spectrum did not work with Athena as is, and general queries did not work with Athena as is. In the end, I had to write queries for Athena while looking at Presto's reference.
I understood the atmosphere, so I was not inclined to actively use it at this point, as I would think about it then if I needed it.
Conclusion.
- Redshft Spectrum is still too expensive, so I think BigQuery is the way to go.
- Athena is "table definition preparation" and "dialect strong query creation"... I feel that it is too much trouble.