Cassandra

Cassandra database data source
Properties
Properties supported in this source are shown below (* indicates required fields).
Property
Description
Name *
Name of the data source
Description
Description of the data source
Connection *
Pre-defined Cassandra connection.
Table *
Name of the Cassandra table to query. Example: table_test
Schema
Source schema to assist during the design of the pipeline
Keyspace *
Cassandra keyspace to read data from. Example: pass
Select Fields / Columns
Comma-separated list of fields / columns to select from the source. Example: firstName, lastName, address1, address2, city, zipcode. Default: *
Filter Expression
SQL WHERE clause for filtering records. It is also used to select the partitions to load from the source. Examples: date=2022-01-01; year = 22 and month = 6 and day = 2
Distinct Values
Select rows with distinct column values. Default: false
Connections Per Executor
Minimum number of remote connections per host set on each executor JVM. Default value is estimated automatically based on the total number of executors in the cluster.
Read Consistency Level
Consistency level to use when reading. Refer to https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/dml/dmlConfigConsistency.html#Readconsistencylevels for details.
Concurrent Reads
Sets the read parallelism for joins with Cassandra tables. Example: 512. Default: 512
Input Fetch Size in Rows
Number of CQL rows fetched per driver request. Example: 1,000. Default: 1,000
Input Reads per Second
Sets the maximum number of requests or pages per core per second; unlimited by default. Example: 10000. Default: None
Input Split Size
Approximate amount of data, in MB, to be fetched into a single Spark partition. The minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism. Example: 1,024. Default: 512
Input Metrics
Sets whether to record connector-specific metrics on read. Default: true
Enable Pushdown
Enables pushing down predicates to Cassandra when applicable. Default: true
Normalize Column Names
Normalizes column names by replacing the special characters ,;{}()&/\n\t= and spaces with the given string. Example: _
Cache
MEMORY_ONLY: Persist data in memory only, in deserialized format.
MEMORY_AND_DISK: Persist data in memory; if not enough memory is available, evicted blocks are stored on disk.
MEMORY_ONLY_SER: Same as MEMORY_ONLY, but persists data in serialized format. This is generally more space-efficient than the deserialized format, but more CPU-intensive to read.
MEMORY_AND_DISK_SER: Same as MEMORY_AND_DISK, but persists data in serialized format.
DISK_ONLY: Persist the data partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2: Same as the corresponding levels above, but replicate each partition on two cluster nodes.
OFF_HEAP: Similar to MEMORY_ONLY_SER, but stores the data in off-heap memory. This requires off-heap memory to be enabled.
Default: NONE
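
Example
This page does not describe how the properties are applied internally. The sketch below assumes the source is backed by the open-source Spark Cassandra Connector, whose read settings closely match the descriptions above. The helper names build_cassandra_read_options and read_source, and the mapping from properties to option keys, are illustrative assumptions rather than documented product behavior.

```python
from typing import Dict, Optional


def build_cassandra_read_options(
    table: str,
    keyspace: str,
    consistency_level: Optional[str] = None,
    concurrent_reads: int = 512,
    fetch_size_in_rows: int = 1000,
    reads_per_sec: Optional[int] = None,
    split_size_in_mb: int = 512,
    input_metrics: bool = True,
    pushdown: bool = True,
) -> Dict[str, str]:
    """Hypothetical helper: map the properties on this page to option keys
    of the Spark Cassandra Connector (an assumed backend)."""
    if not table or not keyspace:
        raise ValueError("Table and Keyspace are required properties")

    options = {
        "table": table,                                        # Table *
        "keyspace": keyspace,                                  # Keyspace *
        "pushdown": str(pushdown).lower(),                     # Enable Pushdown
        "spark.cassandra.concurrent.reads": str(concurrent_reads),
        "spark.cassandra.input.fetch.sizeInRows": str(fetch_size_in_rows),
        "spark.cassandra.input.split.sizeInMB": str(split_size_in_mb),
        "spark.cassandra.input.metrics": str(input_metrics).lower(),
    }
    # Optional settings are only sent when the user supplied a value.
    if consistency_level is not None:
        options["spark.cassandra.input.consistency.level"] = consistency_level
    if reads_per_sec is not None:
        options["spark.cassandra.input.readsPerSec"] = str(reads_per_sec)
    return options


def read_source(spark, options, select_fields="*", filter_expression=None,
                distinct=False):
    """Sketch of the read itself; `spark` is an existing SparkSession with
    the connector on its classpath (assumed, not created here)."""
    df = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(**options)
          .load())
    if filter_expression:                      # Filter Expression
        df = df.where(filter_expression)
    if select_fields.strip() != "*":           # Select Fields / Columns
        df = df.select(*[c.strip() for c in select_fields.split(",")])
    if distinct:                               # Distinct Values
        df = df.distinct()
    return df
```

With a running Spark session, a call such as read_source(spark, build_cassandra_read_options("table_test", "pass"), select_fields="firstName, lastName", filter_expression="year = 22 and month = 6") would read the example table used on this page. Option key names can differ between connector versions, so check them against the version in use.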