JDBC

JDBC data source
The properties supported by this source are listed below (* indicates a required field). Short PySpark sketches after the table illustrate how the main options map to Spark's built-in JDBC reader.
| Property | Description |
| --- | --- |
| Name * | Name of the data source. |
| Description | Description of the data source. |
| Connection * | Pre-defined JDBC connection to use. |
| Driver Class Name * | Class name of the JDBC driver used to connect to the database. Example: `oracle.jdbc.driver.OracleDriver` |
| Table or Query * | Database table to read, or a query used to read data from the JDBC source. A query cannot be specified together with the Partition Column option. |
| Schema | Source schema to assist during the design of the pipeline. |
| Select Fields / Columns | Comma-separated list of field/column names to select from the source. Default: `*` |
| Filter Expression | SQL `WHERE` clause for filtering records. Examples: `date = '2022-01-01'`, `year = 22 and month = 6 and day = 2` |
| Distinct Values | Select only rows with distinct column values. Default: `false` |
| Normalize Column Names | Normalizes column names by replacing the special characters `,;{}()&/\n\t=` and spaces with the given string. Example: `_` |
| Partition Column | Partition Column, Lower Bound, Upper Bound, and Number of Partitions must all be specified if any one of them is. Together they describe how to partition the table when reading in parallel from multiple workers; see the partitioned-read sketch after this table. Partition Column must be a numeric, date, or timestamp column of the table in question. Example: `city` |
| Lower Bound | Lower bound used, together with Upper Bound, to decide the partition stride. The bounds do not filter rows: all rows in the table are partitioned and returned. Example: `pincode` |
| Upper Bound | Upper bound used, together with Lower Bound, to decide the partition stride. The bounds do not filter rows: all rows in the table are partitioned and returned. Example: `country` |
| Number of Partitions | Maximum number of partitions that can be used for parallelism when reading the table. This also determines the maximum number of concurrent JDBC connections. Example: `6` |
| Query Timeout in Seconds | Number of seconds the driver waits for a Statement object to execute. Zero means there is no limit. On the write path this option depends on how the JDBC driver implements `setQueryTimeout`; the H2 driver, for example, checks the timeout of each query instead of the entire JDBC batch. Default: `0`. Example: `10` |
| Num Rows Fetched per Round Trip | The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers that default to a low fetch size (e.g., Oracle defaults to 10 rows). Example: `4` |
| Optional Session Open SQL | Custom SQL statement (or PL/SQL block) executed after each database session is opened to the remote database and before reading data begins. Use this to run session-initialization code. Example: `option("sessionInitStatement", """BEGIN execute immediate 'alter session set "_serial_direct_read"=true'; END;""")` |
| Custom Schema if Any | Custom schema to use when reading data from the JDBC connector. You can specify a subset of the fields; the remaining fields use the default type mapping. Column names must be identical to the corresponding column names of the database table; the Spark SQL data types you give override the defaults. Example: `id DECIMAL(38, 0), name STRING` |
| Pushdown Predicate | Enables or disables predicate push-down into the JDBC data source. If enabled, Spark pushes filters down to the JDBC data source as far as possible; otherwise no filter is pushed down and all filtering is handled by Spark. Predicate push-down is usually turned off when Spark performs the predicate filtering faster than the JDBC data source. Default: `true` |
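
For orientation, here is a minimal sketch of a plain JDBC read using Spark's built-in `jdbc` format, which exposes the same options this source configures. The URL, driver, table, and credentials are placeholders, not values from this page:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-source-sketch").getOrCreate()

# Plain JDBC read; url, driver, dbtable, user, and password are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/sales")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "public.orders")  # or .option("query", "SELECT ...") -- not with partitioning
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Select Fields, Filter Expression, and Distinct Values behave like ordinary
# projections, WHERE clauses, and DISTINCT on the loaded data.
subset = df.select("id", "date", "amount").where("date = '2022-01-01'").distinct()
```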
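
The four partitioning properties correspond to Spark's `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` options. A sketch of a parallel read, assuming a hypothetical numeric `id` column and placeholder connection details:

```python
# Parallel read; assumes the SparkSession `spark` from the first sketch.
# All four partitioning options must be set together. lowerBound and
# upperBound only decide the partition stride; every row is still read.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "reader")
    .option("password", "secret")
    .option("partitionColumn", "id")  # must be a numeric, date, or timestamp column
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "6")  # also caps concurrent JDBC connections
    .load()
)
```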
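
Query Timeout, Num Rows Fetched per Round Trip, and Optional Session Open SQL correspond to Spark's `queryTimeout`, `fetchsize`, and `sessionInitStatement` options. A sketch assuming an Oracle source (connection details are placeholders; the PL/SQL block is the example from the table above):

```python
# Tuning and session options; assumes the SparkSession `spark` from the
# first sketch. Connection details are placeholders for an Oracle source.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//db.example.com:1521/ORCL")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("dbtable", "ORDERS")
    .option("queryTimeout", "10")  # seconds; 0 means no limit
    .option("fetchsize", "1000")  # rows per round trip; Oracle defaults to 10
    # Runs once per session, before reading starts.
    .option(
        "sessionInitStatement",
        """BEGIN execute immediate 'alter session set "_serial_direct_read"=true'; END;""",
    )
    .load()
)
```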
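
Custom Schema and Pushdown Predicate correspond to Spark's `customSchema` and `pushDownPredicate` options. A sketch, again with placeholder connection details:

```python
# Type-mapping override and push-down control; assumes the SparkSession
# `spark` from the first sketch.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "reader")
    .option("password", "secret")
    .option("customSchema", "id DECIMAL(38, 0), name STRING")  # partial; other columns keep default types
    .option("pushDownPredicate", "false")  # all filtering now happens in Spark
    .load()
)
```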