Welcome to new things

[Technical] [Electronic work] [Gadget] [Game] memo writing

How to access MySQL from Spark (Google Dataproc)

How to access MySQL from Spark (Google Dataproc).

Since it is accessed using JDBC, it can be applied to other RDBs such as PostgreSQL.

procedure

Spark Configuration

The following Spark settings will allow you to read and write MySQL data from Spark.

  • Download the MySQL JDBC jar file

  • Load the JDBC jar file into Spark

    • Set the path to the jar file to "spark.jars" in "spark-defaults.conf
  • Set JDBC jar file to ClassPath in Spark

    • Set the path to the jar file to "spark.driver.extraClassPath" in "spark-defaults.conf

Dataproc Configuration

Because Dataproce is a managed service, it is not possible to tinker with Spark directly.

Instead, using Dataproce's "Cluster Properties" and "Initialization Actions", Spark configuration at the time of cluster creation.

In advance, upload the MySQL JDBC jar file to Google Cloud Storage.

Create the following initialization action and upload it to Google Cloud Storage.

init_action.sh

#!/bin/bash

JAR_PATH=/usr/local/lib/jars
mkdir $JAR_PATH

# JDBC MySQL
JDBC_MYSQL=mysql-connector-java-8.0.16.jar
gsutil cp gs://xxxx/xxxx/$JDBC_MYSQL $JAR_PATH/

The initialization action creates a location for the jar file to be added to Spark and downloads the JDBC jar file that has been uploaded to Storage.

Clustering

How to access MySQL from Spark (Google Dataproc)

  • initializing action

    • Specify the initialization action file you just uploaded to Google Cloud Storage
    • This action is performed when a cluster is created
  • project access

    • check
    • In the initialization action, the JDBC jar file is downloaded from Storage, so the cluster is granted access to Storage.
  • cluster property

    • [spark]:[spark.jars]:[/usr/local/lib/jars/*]

      • Configure the Spark configuration "spark-defaults.conf" in "spark
      • Make Spark load the JDBC jar file in "spark.jars" "/usr/local/lib/jars/*".
    • [spark]:[spark.driver.extraClassPath]:[/usr/local/lib/jars/*]

      • Configure the Spark configuration "spark-defaults.conf" in "spark
      • Make Spark load the JDBC jar file as ClassPass information in "spark.driver.extraClassPath" "/usr/local/lib/jars/*".

Example of use

options = {
    "url":"jdbc:mysql://<server>:3306/<db_name>",
    "driver":"com.mysql.jdbc.Driver",
    "dbtable":"<table_name>",
    "user":"xxxx",
    "password":"xxxx"
}

df = spark.read.format("jdbc").options(**options).load()
df.show()

Other Comments, etc.

Both reading and writing are possible.

I think it is better to set "spark.executor.heartbeatInterval" and "spark.network.timeout" in "spark-defaults.conf" longer because it is easy to time out. I think it is better to set them longer.

If you are using Spark 2.4 or later, it is convenient to use "query" instead of "dbtable" to get data by query, The image should be version 1.4 or higher with Spark 2.4 available.

Reference Articles

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com

www.ekwbtblog.com