Redshift

Redshift database data source
Properties
Properties supported in this source are shown below (* indicates required fields).
Property
Description
Name * 
Name of the data source
Description
Description of the data source
Connection * 
Pre-defined Redshift connection
Table or Query
Database table to read, or a query used to read data from the Redshift source. Example: dbtable
Schema
Source schema to assist during the design of the pipeline
Select Fields / Columns
Comma-separated list of field / column names to select from the source. Default: *
Filter Expression
SQL WHERE clause for filtering records. Examples: date = '2022-01-01' (single condition); year = 22 and month = 6 and day = 2 (combined conditions)
Distinct Values
Select rows with distinct column values. Default: false
Distribution Style
Distribution style to be used when creating a table. When using KEY, you must also set a distribution key
Distribution Key
The name of a column in the table to use as the distribution key when creating a table
Sort Key Spec
Sort Keys supported by Redshift
Include Column List
Default: false
Preactions
A semicolon-separated list of SQL commands that are executed before data is transferred between Spark and Redshift
Postactions
A semicolon-separated list of SQL commands that are executed after data is transferred between Spark and Redshift
Extra Copy Options
A list of extra options to append to the Redshift COPY command when loading data, e.g. TRUNCATECOLUMNS or MAXERROR (see the Redshift docs for other options)
Normalize Column Names
Normalizes column names by replacing the special characters ,;{}()&/\n\t= and space with the given string. Example: _
Cache
MEMORY_ONLY: Persist data in memory only, in deserialized format.
MEMORY_AND_DISK: Persist data in memory; if not enough memory is available, evicted blocks are stored on disk.
MEMORY_ONLY_SER: Same as MEMORY_ONLY, but persisted in serialized format. This is generally more space-efficient than the deserialized format, but more CPU-intensive to read.
MEMORY_AND_DISK_SER: Same as MEMORY_AND_DISK, but persisted in serialized format.
DISK_ONLY: Persist the data partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2: Same as the levels above, but each partition is replicated on two cluster nodes.
OFF_HEAP: Similar to MEMORY_ONLY_SER, but the data is stored in off-heap memory. This requires off-heap memory to be enabled.
Default: NONE
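
The rule stated for Distribution Style (KEY requires a Distribution Key) and the list formats described for Preactions, Postactions and Extra Copy Options can be sketched as a small validation helper. This is only an illustration: the function name `build_write_options` is hypothetical, the option keys (`diststyle`, `distkey`, `sortkeyspec`, `preactions`, `postactions`, `extracopyoptions`) are borrowed from the open-source Spark-Redshift connector on the assumption that this source maps onto them, and the accepted styles EVEN, KEY and ALL are likewise an assumption.

```python
from typing import Dict, List, Optional

# Assumption: the styles accepted here mirror the Spark-Redshift connector.
VALID_DIST_STYLES = {"EVEN", "KEY", "ALL"}


def build_write_options(
    dist_style: Optional[str] = None,
    dist_key: Optional[str] = None,
    sort_key_spec: Optional[str] = None,
    preactions: Optional[List[str]] = None,
    postactions: Optional[List[str]] = None,
    extra_copy_options: Optional[List[str]] = None,
) -> Dict[str, str]:
    """Hypothetical helper: turn property values into connector-style options."""
    options: Dict[str, str] = {}

    if dist_style is not None:
        style = dist_style.upper()
        if style not in VALID_DIST_STYLES:
            raise ValueError(f"Unsupported distribution style: {dist_style}")
        # Rule from the property description: KEY needs a distribution key.
        if style == "KEY" and not dist_key:
            raise ValueError("Distribution Style KEY requires a Distribution Key")
        options["diststyle"] = style

    if dist_key:
        options["distkey"] = dist_key
    if sort_key_spec:
        options["sortkeyspec"] = sort_key_spec

    # Preactions / Postactions are semicolon-separated lists of SQL commands.
    if preactions:
        options["preactions"] = ";".join(preactions)
    if postactions:
        options["postactions"] = ";".join(postactions)

    # Assumption: extra COPY options are appended as one space-separated string.
    if extra_copy_options:
        options["extracopyoptions"] = " ".join(extra_copy_options)

    return options


if __name__ == "__main__":
    print(build_write_options(
        dist_style="KEY",
        dist_key="customer_id",
        sort_key_spec="SORTKEY(order_date)",
        preactions=["DELETE FROM staging", "ANALYZE staging"],
        extra_copy_options=["TRUNCATECOLUMNS", "MAXERROR 10"],
    ))
```

The table and column names in the usage example (`staging`, `customer_id`, `order_date`) are placeholders. Checking the KEY rule before a pipeline runs surfaces a missing Distribution Key as a configuration error rather than a failure at table-creation time.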
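
The replacement performed by Normalize Column Names can be illustrated with a short sketch. It assumes that `\n` and `\t` in the property description stand for the newline and tab characters, and that each special character is replaced individually (consecutive specials are not collapsed); the function names are hypothetical and the product's actual behavior may differ.

```python
import re
from typing import Iterable, List

# Characters from the property description: , ; { } ( ) & / newline tab = space
_SPECIAL_CHARS = re.compile(r"[,;{}()&/\n\t= ]")


def normalize_column_name(name: str, replacement: str = "_") -> str:
    """Replace every special character in `name` with `replacement`."""
    # A function is used as the substitution so that `replacement` is taken
    # literally (backslashes in it are not treated as regex escapes).
    return _SPECIAL_CHARS.sub(lambda _match: replacement, name)


def normalize_columns(names: Iterable[str], replacement: str = "_") -> List[str]:
    """Apply normalize_column_name to every column name."""
    return [normalize_column_name(n, replacement) for n in names]


if __name__ == "__main__":
    print(normalize_columns(["order id", "price(usd)", "a/b=c"]))
    # prints ['order_id', 'price_usd_', 'a_b_c']
```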