MongoDB

MongoDB database data source
Properties
Properties supported in this source are shown below (* indicates required fields).
Property
Description
Name * 
Name of the data source
Description
Description of the data source
Connection *
Pre-defined MongoDB connection
Database * 
Database to connect to. Example: customerdb
Collection * 
Collection to fetch data from. Example: products
Schema
Source schema to assist during the design of the pipeline
Select Fields / Columns
Comma-separated list of field / column names to select from the source. Default: *
Filter Expression
SQL WHERE clause for filtering records. Examples: date = '2022-01-01'; year = 22 and month = 6 and day = 2
Distinct Values
Select rows with distinct column values. Default: false
Partitioner
Full class name of the partitioner. Example: com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner; Default: com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner
Keep Alive (ms)
The length of time, in milliseconds, to keep a MongoClient available for sharing. Example: 100,000; Default: 5,000
Sample Size
The number of documents to sample from the collection when inferring the schema. Example: 1,000; Default: 1,000
Normalize Column Names
Normalizes column names by replacing the special characters ,;{}()&/\n\t= and space with the given string. Example: _
Cache
Storage level to use when caching the data:
MEMORY_ONLY: Persist data in memory only, in deserialized format.
MEMORY_AND_DISK: Persist data in memory; if not enough memory is available, evicted blocks are stored on disk.
MEMORY_ONLY_SER: Same as MEMORY_ONLY, but persists data in serialized format. This is generally more space-efficient than the deserialized format, but more CPU-intensive to read.
MEMORY_AND_DISK_SER: Same as MEMORY_AND_DISK, but persists data in serialized format.
DISK_ONLY: Persist the data partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2: Same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP: Similar to MEMORY_ONLY_SER, but stores the data in off-heap memory. This requires off-heap memory to be enabled.
Default: NONE
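
The Normalize Column Names property described above can be illustrated with a minimal Python sketch. It is not the product's actual implementation. It assumes that each listed special character is replaced individually and that \n and \t in the description stand for newline and tab. The function name normalize_column_name is hypothetical.

```python
import re

# Characters listed in the Normalize Column Names description:
# , ; { } ( ) & / newline tab = and space.
# Assumption: "\n" and "\t" in the description mean newline and tab.
_SPECIAL_CHARS = re.compile(r"[,;{}()&/\n\t= ]")


def normalize_column_name(name: str, replacement: str = "_") -> str:
    """Replace each special character in `name` with `replacement`.

    Hypothetical helper for illustration only. Assumes a one-for-one
    replacement, so consecutive special characters are not collapsed.
    """
    # A callable is used so that `replacement` is inserted literally,
    # without regex backslash or group-reference interpretation.
    return _SPECIAL_CHARS.sub(lambda _match: replacement, name)


if __name__ == "__main__":
    print(normalize_column_name("unit price (USD)"))  # unit_price__USD_
```

Under these assumptions, a field named "unit price (USD)" with the replacement string _ would appear as unit_price__USD_ in the resulting columns.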