How to access AWS S3 from Spark (Google Dataproc).
Procedure
Spark Configuration
The following Spark and Hadoop settings allow you to read and write AWS S3 files from Spark.
Load the following AWS-related jar files into Spark:
- aws-java-sdk-bundle-xxxx.jar
- hadoop-aws-xxxx.jar
Set the following parameters in the Hadoop configuration file "core-site.xml":
- fs.s3a.access.key: Access key for AWS S3
- fs.s3a.secret.key: Secret key for AWS S3
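If you manage Spark yourself rather than using Dataproc, the same settings can also be passed at submit time through Spark's "spark.hadoop.*" passthrough instead of editing core-site.xml. A minimal sketch, with placeholder jar paths, credential values, and script name:

spark-submit \
  --jars /path/to/hadoop-aws-xxxx.jar,/path/to/aws-java-sdk-bundle-xxxx.jar \
  --conf spark.hadoop.fs.s3a.access.key=xxxx \
  --conf spark.hadoop.fs.s3a.secret.key=xxxx \
  your_job.py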
Dataproc Configuration
Because Dataproc is a managed service, you cannot modify the Spark installation directly.
Instead, use Dataproc's "Cluster Properties" and "Initialization Actions" to apply Spark and Hadoop settings at cluster creation time.
Create the following initialization action and upload it to Google Cloud Storage.
init_action.sh
#!/bin/bash

# Directory from which Spark will load the additional jars
JAR_PATH=/usr/local/lib/jars
mkdir -p $JAR_PATH

# AWS S3 jars shipped with the Dataproc image
HADOOP_PATH=/usr/lib/hadoop-mapreduce
HADOOP_AWS=hadoop-aws-2.9.2.jar
AWS_JAVA_SDK_BUNDLE=aws-java-sdk-bundle-1.11.199.jar
ln -s $HADOOP_PATH/$HADOOP_AWS $JAR_PATH/hadoop-aws.jar
ln -s $HADOOP_PATH/$AWS_JAVA_SDK_BUNDLE $JAR_PATH/aws-java-sdk-bundle.jar
The initialization action creates a directory for the jar files to be added to Spark and sets up symbolic links there to "aws-java-sdk-bundle-xxxx.jar" and "hadoop-aws-xxxx.jar".
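The script can then be uploaded with gsutil, for example (the bucket name below is a placeholder):

gsutil cp init_action.sh gs://your-bucket/init_action.sh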
Cluster Creation
Image
- The AWS-related jar files are bundled with image version 1.4 and later, so make sure your cluster image is version 1.4 or later.
Initialization Action
- Specify the initialization action file you just uploaded to Google Cloud Storage
- This action is performed when a cluster is created
Cluster Properties
[spark]:[spark.jars]:[/usr/local/lib/jars/*]
- The "spark" prefix configures the Spark configuration file "spark-defaults.conf"
- "spark.jars" makes Spark load the AWS jar files (the symbolic links) in "/usr/local/lib/jars/*"
[core]:[fs.s3a.access.key]:[xxxx], [core]:[fs.s3a.secret.key]:[xxxx]
- The "core" prefix configures the Hadoop configuration file "core-site.xml"
- Set the AWS S3 access key and secret key in "fs.s3a.access.key" and "fs.s3a.secret.key"
The names of the AWS jar files may change, so if you cannot find a file, SSH into the master node of the cluster and check the file names.
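Putting the above together, creating the cluster from the command line might look like the following. This is a sketch with placeholder cluster name, bucket, and region; the same settings can also be entered in the Cloud Console:

gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --image-version=1.4 \
  --initialization-actions=gs://your-bucket/init_action.sh \
  --properties='spark:spark.jars=/usr/local/lib/jars/*,core:fs.s3a.access.key=xxxx,core:fs.s3a.secret.key=xxxx'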
Example of use
df = spark.read.format("json").load("s3a://xxxx/xxxx/test.json")
df.show()
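To run such a script on the cluster, you could submit it as a PySpark job, for example (the script name, cluster name, and region are placeholders):

gcloud dataproc jobs submit pyspark read_s3.py \
  --cluster=my-cluster \
  --region=us-central1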
Other Comments, etc.
Both reading from and writing to S3 are possible.
Note that the S3 path scheme is "s3a://", not "s3://".