How to access AWS S3 from Spark (Google Dataproc)

This memo describes how to read and write AWS S3 files from Spark running on Google Cloud Dataproc.

Procedure

Spark Configuration

The following Spark and Hadoop settings allow Spark to read and write files on AWS S3; a spark-submit sketch of the equivalent settings follows the list.

  • Load the following AWS-related jar files into Spark

    • aws-java-sdk-bundle-xxxx.jar
    • hadoop-aws-xxxx.jar
  • Set the following parameters in the Hadoop configuration file "core-site.xml"

    • fs.s3a.access.key: Access key for AWS S3
    • fs.s3a.secret.key: Secret key for AWS S3
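
On a plain (non-Dataproc) Spark installation, the same settings can also be passed on the command line. The sketch below is only an illustration: the jar locations, bucket, keys, and application file name are placeholders, and the "spark.hadoop." prefix forwards the properties to the Hadoop configuration, which is equivalent to setting them in "core-site.xml".

# Minimal sketch for plain Spark (not Dataproc); paths, keys, and the
# application name are placeholders.
spark-submit \
  --jars /tmp/hadoop-aws-2.9.2.jar,/tmp/aws-java-sdk-bundle-1.11.199.jar \
  --conf spark.hadoop.fs.s3a.access.key=XXXX \
  --conf spark.hadoop.fs.s3a.secret.key=XXXX \
  your_app.py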

Dataproc Configuration

Because Dataproc is a managed service, you cannot modify the Spark installation directly.

Instead, use Dataproc's "Cluster properties" and "Initialization actions" to apply the Spark and Hadoop settings when the cluster is created.

Create the following initialization action and upload it to Google Cloud Storage.

init_action.sh

#!/bin/bash

# Directory from which Spark will load the extra jar files
JAR_PATH=/usr/local/lib/jars
mkdir -p $JAR_PATH

# AWS S3 (S3A) support jars shipped with the Dataproc image
HADOOP_PATH=/usr/lib/hadoop-mapreduce
HADOOP_AWS=hadoop-aws-2.9.2.jar
AWS_JAVA_SDK_BUNDLE=aws-java-sdk-bundle-1.11.199.jar

ln -s $HADOOP_PATH/$HADOOP_AWS           $JAR_PATH/hadoop-aws.jar
ln -s $HADOOP_PATH/$AWS_JAVA_SDK_BUNDLE  $JAR_PATH/aws-java-sdk-bundle.jar

The initialization action creates a directory for the jar files to be added to Spark and sets up symbolic links there to "aws-java-sdk-bundle-xxxx.jar" and "hadoop-aws-xxxx.jar".
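
As an example, the script can be uploaded to Google Cloud Storage with gsutil; the bucket name below is a placeholder.

# Upload the initialization action to Google Cloud Storage
# (replace the bucket name with your own).
gsutil cp init_action.sh gs://your-bucket/init_action.sh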

Cluster Creation


  • Image

    • The AWS-related jar files are included in image version 1.4 and later, so choose a cluster image of version 1.4 or later.
  • Initialization actions

    • Specify the initialization action file you just uploaded to Google Cloud Storage
    • This action runs when the cluster is created
  • Cluster properties

    • [spark]:[spark.jars]:[/usr/local/lib/jars/*]

      • The "spark" prefix targets the Spark configuration file "spark-defaults.conf"
      • Setting "spark.jars" to "/usr/local/lib/jars/*" makes Spark load the jar files linked there
    • [core]:[fs.s3a.access.key]:[xxxx], [core]:[fs.s3a.secret.key]:[xxxx]

      • The "core" prefix targets the Hadoop configuration file "core-site.xml"
      • Set the AWS S3 access key and secret key in "fs.s3a.access.key" and "fs.s3a.secret.key"

The names of the AWS jar files may change, so if they cannot be found, SSH into the cluster's master node and check the actual file names.
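
The same settings can also be applied when creating the cluster from the command line. The sketch below is an assumed gcloud invocation: the cluster name, region, bucket, and keys are placeholders, and the property values match the ones listed above.

# Sketch of creating the cluster with gcloud; cluster name, region,
# bucket, and keys are placeholders.
gcloud dataproc clusters create my-cluster \
  --region=asia-northeast1 \
  --image-version=1.4 \
  --initialization-actions=gs://your-bucket/init_action.sh \
  --properties='spark:spark.jars=/usr/local/lib/jars/*,core:fs.s3a.access.key=XXXX,core:fs.s3a.secret.key=XXXX'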

Example of use

df = spark.read.format("json").load("s3a://xxxx/xxxx/test.json")
df.show()

Other Comments, etc.

Both reading and writing are possible.

Note that the path scheme is "s3a://", not "s3://".
