The properties supported by this source are listed below (* indicates a required field).
Property
Description
Name *
Name of the data source
Description
Description of the data source
Connection *
Pre-defined Cassandra connection.
Table *
Name of the Cassandra table to query. Example: table_test
Schema
Source schema to assist during the design of the pipeline
Keyspace *
Cassandra keyspace to read data from. Example: pass
Select Fields / Columns
Comma-separated list of fields / columns to select from the source. Example: firstName, lastName, address1, address2, city, zipcode. Default: *
Filter Expression
SQL WHERE clause for filtering records. This is also used to load partitions from the source. Example: date=2022-01-01, year = 22 and month = 6 and day = 2
Distinct Values
Select rows with distinct column values. Default: false
Connections Per Executor
Minimum number of remote connections per host set on each executor JVM. Default value is estimated automatically based on the total number of executors in the cluster.
Sets read parallelism for joins with Cassandra tables. Example: 512. Default: 512
Input Fetch Size in Rows
Number of CQL rows fetched per driver request. Example: 1,000. Default: 1,000
Input Reads per Second
Maximum number of requests or pages per core per second; unlimited by default. Example: 10000. Default: None
Input Split Size
Approximate amount of data to be fetched into a Spark partition. The minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism. Example: 1,024. Default: 512
Input Metrics
Sets whether to record connector-specific metrics on write. Default: true
Enable Pushdown
Enables pushing down predicates to Cassandra when applicable. Default: true
Normalize Column Names
Normalizes column names by replacing the special characters ,;{}()&/\n\t= and spaces with the given string. Example: _
Cache
MEMORY_ONLY: Persist data in memory only, in deserialized format.
MEMORY_AND_DISK: Persist data in memory; if not enough memory is available, evicted blocks are stored on disk.
MEMORY_ONLY_SER: Same as MEMORY_ONLY, but persists data in serialized format. This is generally more space-efficient than the deserialized format, but more CPU-intensive to read.
MEMORY_AND_DISK_SER: Same as MEMORY_AND_DISK, but persists data in serialized format.
DISK_ONLY: Persist the data partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2: Same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP: Similar to MEMORY_ONLY_SER, but stores the data in off-heap memory. This requires off-heap memory to be enabled.
Default: NONE
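
For orientation, the sketch below shows roughly how several of the properties above could translate into a plain PySpark read through the Spark Cassandra Connector. It is an illustration under assumptions, not a description of how this source is implemented: the helper names (build_reader_options, normalize_column_name, read_source) are hypothetical, and the mapping from the properties in this table to the connector option keys is assumed rather than confirmed by this page. Whether the connector honors these keys as per-read options may also depend on the connector version.

```python
import re


def build_reader_options(keyspace, table, fetch_size_rows=1000,
                         split_size=512, reads_per_sec=None, pushdown=True):
    """Build reader options (hypothetical helper; the key mapping is an assumption)."""
    options = {
        "keyspace": keyspace,                     # Keyspace *
        "table": table,                           # Table *
        "pushdown": str(pushdown).lower(),        # Enable Pushdown
        # Input Fetch Size in Rows
        "spark.cassandra.input.fetch.sizeInRows": str(fetch_size_rows),
        # Input Split Size
        "spark.cassandra.input.split.sizeInMB": str(split_size),
    }
    if reads_per_sec is not None:
        # Input Reads per Second; leaving it out keeps reads unthrottled
        options["spark.cassandra.input.readsPerSec"] = str(reads_per_sec)
    return options


def normalize_column_name(name, replacement="_"):
    """Replace , ; { } ( ) & / newline tab = and space with the given string."""
    return re.sub(r"[ ,;{}()&/\n\t=]", replacement, name)


def read_source(spark, options, columns=None, filter_expr=None, distinct=False):
    """Read the table with an existing SparkSession (requires pyspark and the connector)."""
    df = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(**options)
          .load())
    if filter_expr:                 # Filter Expression
        df = df.filter(filter_expr)
    if columns:                     # Select Fields / Columns
        df = df.select(*columns)
    if distinct:                    # Distinct Values
        df = df.distinct()
    return df
```

With a running session, a call such as read_source(spark, build_reader_options("pass", "table_test"), columns=["firstName", "city"], filter_expr="year = 22 and month = 6") would return a DataFrame. Caching could then be applied with df.persist(...) using one of the storage levels listed under Cache.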