How to access Microsoft SQL Server (Azure SQL Database) from Spark (Google Dataproc).
Procedure
Spark Configuration
The following Spark settings will allow you to read and write SQL Server data from Spark.
Download the MS SQL Server JDBC jar file
Load the JDBC jar file into Spark
- Set the path to the jar file in "spark.jars" in "spark-defaults.conf".
Add the JDBC jar file to Spark's driver classpath
- Set the path to the jar file in "spark.driver.extraClassPath" in "spark-defaults.conf".
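On a plain (non-Dataproc) Spark installation, these two settings in "spark-defaults.conf" would look roughly like the following; the path "/usr/local/lib/jars" is just an example location, and "spark.jars" accepts glob patterns.

spark-defaults.conf
spark.jars                    /usr/local/lib/jars/*
spark.driver.extraClassPath   /usr/local/lib/jars/*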
Dataproc Configuration
Because Dataproc is a managed service, you cannot modify the Spark configuration on the nodes directly.
Instead, apply the Spark configuration at cluster creation time using Dataproc's "Cluster Properties" and "Initialization Actions".
Upload the MS SQL Server JDBC jar file to Google Cloud Storage in advance.
Create the following initialization action and upload it to Google Cloud Storage.
init_action.sh
#!/bin/bash
JAR_PATH=/usr/local/lib/jars
mkdir $JAR_PATH

# JDBC SQL Server
JDBC_MSSQL=mssql-jdbc-7.2.2.jre8.jar
gsutil cp gs://bnobq-olive-dataproc/init-action/jdbc/$JDBC_MSSQL $JAR_PATH/
The initialization action creates a directory for the jar files to be added to Spark and downloads into it the JDBC jar file previously uploaded to Storage.
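The script itself can be uploaded with gsutil as well; the destination path below is just an example that reuses the bucket referenced in the script.

gsutil cp init_action.sh gs://bnobq-olive-dataproc/init-action/init_action.sh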
Cluster Creation
Initialization actions
- Specify the initialization action file you just uploaded to Google Cloud Storage.
- The action runs once, when the cluster is created.
Project access
- Check this box.
- The initialization action downloads the JDBC jar file from Storage, so the cluster must be granted access to Storage.
Cluster properties
[spark]:[spark.jars]:[/usr/local/lib/jars/*]
- The "spark" prefix writes the property into the Spark configuration file "spark-defaults.conf".
- Setting "spark.jars" to "/usr/local/lib/jars/*" makes Spark load the JDBC jar file.
[spark]:[spark.driver.extraClassPath]:[/usr/local/lib/jars/*]
- The "spark" prefix writes the property into the Spark configuration file "spark-defaults.conf".
- Setting "spark.driver.extraClassPath" to "/usr/local/lib/jars/*" adds the JDBC jar file to the driver's classpath.
[dataproc]:[dataproc.conscrypt.provider.enable]:[false]
- Set this to "false"; otherwise SSL-related errors can occur when connecting.
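If you create the cluster from the command line instead of the console, the same configuration can be passed to gcloud roughly as follows; the cluster name, region, and script path are placeholders.

gcloud dataproc clusters create <cluster-name> \
    --region=<region> \
    --scopes=cloud-platform \
    --initialization-actions=gs://bnobq-olive-dataproc/init-action/init_action.sh \
    --properties='spark:spark.jars=/usr/local/lib/jars/*,spark:spark.driver.extraClassPath=/usr/local/lib/jars/*,dataproc:dataproc.conscrypt.provider.enable=false'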
Usage Example
options = {
    "url": "jdbc:sqlserver://<server>:1433;databaseName=<db_name>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "dbtable": "<table_name>",
    "user": "xxxx",
    "password": "xxxx"
}
df = spark.read.format("jdbc").options(**options).load()
df.show()
Other Notes
Both reading and writing are possible.
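For example, a minimal write sketch reusing the options dictionary from the usage example above ("append" is an assumption; it adds rows to an existing table):

df.write.format("jdbc").options(**options).mode("append").save()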
Because JDBC jobs tend to time out, I think it is better to set "spark.executor.heartbeatInterval" and "spark.network.timeout" to longer values in "spark-defaults.conf".
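For example (the values below are assumptions rather than tested recommendations; "spark.network.timeout" must be larger than the heartbeat interval):

spark.executor.heartbeatInterval  60s
spark.network.timeout             800s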
If you are using Spark 2.4 or later, it is convenient to use the "query" option instead of "dbtable" to fetch the result of an arbitrary query. On Dataproc, choose image version 1.4 or higher, which ships with Spark 2.4.
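A sketch of reading with "query"; the SELECT statement is a placeholder, and note that "query" and "dbtable" cannot be specified at the same time.

options = {
    "url": "jdbc:sqlserver://<server>:1433;databaseName=<db_name>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "query": "SELECT * FROM <table_name> WHERE <condition>",
    "user": "xxxx",
    "password": "xxxx"
}
df = spark.read.format("jdbc").options(**options).load()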
Scala has a dedicated SQL Server connector, but Python does not, so we use generic JDBC access instead.
Reference Articles
- https://kontext.tech/docs/DataAndBusinessIntelligence/p/connect-to-sql-server-in-spark-pyspark
- https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/issues/177
- https://stackoverflow.com/questions/32958311/spark-adding-jdbc-driver-jar-to-google-dataproc
- https://stackoverflow.com/questions/52046004/how-do-i-connect-spark-to-jdbc-driver-in-zeppelin
- http://mogile.web.fc2.com/spark/sql-data-sources-jdbc.html