How to access MySQL from Spark (Google Dataproc).
Since the connection goes through JDBC, the same approach can be applied to other RDBs such as PostgreSQL.
Procedure
Spark Configuration
The following Spark settings allow you to read and write MySQL data from Spark (a sample spark-defaults.conf is shown after this list).
Download the MySQL JDBC jar file
Load the JDBC jar file into Spark
- Set the path to the jar file in "spark.jars" in "spark-defaults.conf"
Add the JDBC jar file to Spark's classpath
- Set the path to the jar file in "spark.driver.extraClassPath" in "spark-defaults.conf"
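For reference, the resulting spark-defaults.conf would contain lines like the following (the directory is just an example and matches the path used in the Dataproc setup below):
spark-defaults.conf (excerpt)
spark.jars                  /usr/local/lib/jars/*
spark.driver.extraClassPath /usr/local/lib/jars/*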
Dataproc Configuration
Because Dataproc is a managed service, you cannot edit the Spark configuration directly.
Instead, you apply the Spark configuration at cluster creation time using Dataproc's "Cluster Properties" and "Initialization Actions".
In advance, upload the MySQL JDBC jar file to Google Cloud Storage.
Create the following initialization action and upload it to Google Cloud Storage.
init_action.sh
#!/bin/bash
JAR_PATH=/usr/local/lib/jars
mkdir $JAR_PATH
# JDBC MySQL
JDBC_MYSQL=mysql-connector-java-8.0.16.jar
gsutil cp gs://xxxx/xxxx/$JDBC_MYSQL $JAR_PATH/
The initialization action creates a directory for the jar files to be added to Spark and downloads the JDBC jar file that was uploaded to Storage.
Cluster creation
Initialization actions
- Specify the initialization action file you just uploaded to Google Cloud Storage
- The action runs when the cluster is created
Project access
- Check this option
- The initialization action downloads the JDBC jar file from Storage, so the cluster must be granted access to Storage
Cluster properties
[spark]:[spark.jars]:[/usr/local/lib/jars/*]
- The "spark" prefix writes the setting into Spark's "spark-defaults.conf"
- Setting "spark.jars" to "/usr/local/lib/jars/*" makes Spark load the JDBC jar file
[spark]:[spark.driver.extraClassPath]:[/usr/local/lib/jars/*]
- The "spark" prefix writes the setting into Spark's "spark-defaults.conf"
- Setting "spark.driver.extraClassPath" to "/usr/local/lib/jars/*" adds the JDBC jar file to the driver classpath
Example of use
options = {
    "url": "jdbc:mysql://<server>:3306/<db_name>",
    "driver": "com.mysql.jdbc.Driver",
    "dbtable": "<table_name>",
    "user": "xxxx",
    "password": "xxxx"
}
df = spark.read.format("jdbc").options(**options).load()
df.show()
Other Comments, etc.
Both reading and writing are possible.
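For writing, a minimal sketch reusing the options dictionary from the example above (the save mode "append" is just one choice; pick whichever fits your case):
df.write.format("jdbc").options(**options).mode("append").save()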
Since JDBC access tends to time out, I think it is better to set "spark.executor.heartbeatInterval" and "spark.network.timeout" in "spark-defaults.conf" to longer values.
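As Dataproc cluster properties these could look like the following (the values are only illustrative; keep "spark.network.timeout" larger than "spark.executor.heartbeatInterval"):
[spark]:[spark.executor.heartbeatInterval]:[60s]
[spark]:[spark.network.timeout]:[600s]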
If you are using Spark 2.4 or later, it is convenient to use the "query" option instead of "dbtable" to fetch data with an arbitrary query (a sketch follows below). On Dataproc, this means choosing image version 1.4 or higher, which comes with Spark 2.4.
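A minimal sketch of reading with "query" (the SQL is a placeholder; note that "query" and "dbtable" cannot be specified at the same time):
options = {
    "url": "jdbc:mysql://<server>:3306/<db_name>",
    "driver": "com.mysql.jdbc.Driver",
    "query": "SELECT * FROM <table_name> WHERE <condition>",
    "user": "xxxx",
    "password": "xxxx"
}
df = spark.read.format("jdbc").options(**options).load()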