Json

JSON file data source

Properties

Properties supported by this source are shown below (* indicates a required field)
Property
Description
Name *
Name of the data source
Description
Description of the data source
Processing Mode
Select for batch mode and unselect for streaming mode. Selecting 'Batch' sets the switch to true; selecting 'Streaming' sets it to false. Default: true
Infer Schema
Check if the schema should be inferred from the data. Default: false
Path *
Path to the file location. Example: s3a://[bucketpath]
Schema
Source schema to assist during the design of the pipeline
Filename Column
Adds the absolute path of the file being read as a new column with the provided name. Example: file_name
Select Fields / Columns
Comma-separated list of field/column names to select from the source. Default: *
Filter Expression
SQL WHERE clause for filtering records. Examples: date = '2022-01-01'; year = 22 and month = 6 and day = 2
Distinct Values
Select rows with distinct column values. Default: false
Path Glob Filter
Optional glob pattern to include only files whose paths match the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery.
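Hadoop's GlobFilter semantics are close to ordinary shell globbing, so the effect of a pattern such as `*.json` can be sketched with Python's stdlib `fnmatch` (the paths below are hypothetical, and this is an illustration of the matching behavior, not the connector's implementation):

```python
from fnmatch import fnmatch

# Hypothetical listing of files found under the source path.
paths = [
    "s3a://bucket/data/part-0001.json",
    "s3a://bucket/data/part-0002.json.gz",
    "s3a://bucket/data/_SUCCESS",
]

# A glob filter of "*.json" keeps only files whose filenames match the
# pattern; directory-based partition discovery is unaffected.
matched = [p for p in paths if fnmatch(p.rsplit("/", 1)[-1], "*.json")]
print(matched)  # ['s3a://bucket/data/part-0001.json']
```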
Recursive File Lookup
Recursively load files; this disables partition inference. If your folder structure is partitioned as columnName=value (e.g. processDate=2022-01-26), the recursive option WILL NOT read the partitions correctly. Default: false
Normalize Column Names
Normalizes column names by replacing the special characters ,;{}()&/\n\t= and space with the given string. Example: _
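The normalization rule above can be sketched with a regular expression; this is a plain-Python illustration under the assumption that each listed character is replaced independently (the `normalize` helper is hypothetical, not part of the product):

```python
import re

# Character class covering the special characters listed above:
# , ; { } ( ) & / newline tab = and space.
SPECIAL = r"[,;{}()&/\n\t= ]"

def normalize(name: str, replacement: str = "_") -> str:
    """Replace each special character in a column name with `replacement`."""
    return re.sub(SPECIAL, replacement, name)

print(normalize("order date (utc)"))  # order_date__utc_
```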
Ignore Corrupt Files
If selected, jobs continue to run when they encounter corrupted files, and any contents that were read are still returned
Ignore Missing Files
Select to ignore missing files while reading data from files
Modified Before
An optional timestamp to include only files with modification times occurring before the specified time. The timestamp must be in the form YYYY-MM-DDTHH:mm:ss. Example: 2020-06-01T13:00:00
Modified After
An optional timestamp to include only files with modification times occurring after the specified time. The timestamp must be in the form YYYY-MM-DDTHH:mm:ss. Example: 2020-06-01T13:00:00
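The YYYY-MM-DDTHH:mm:ss form parses cleanly with stdlib `datetime`, so the Modified Before / Modified After cutoffs can be sketched as a filter over file modification times (the file list here is hypothetical):

```python
from datetime import datetime

FORMAT = "%Y-%m-%dT%H:%M:%S"  # matches the YYYY-MM-DDTHH:mm:ss form above

modified_after = datetime.strptime("2020-06-01T13:00:00", FORMAT)

# Hypothetical (path, modification time) pairs for files at the source path.
files = [
    ("a.json", datetime(2020, 5, 30, 9, 0, 0)),
    ("b.json", datetime(2020, 6, 2, 8, 30, 0)),
]

# Keep only files modified after the cutoff.
kept = [path for path, mtime in files if mtime > modified_after]
print(kept)  # ['b.json']
```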
Time Zone
Sets the string that indicates a time zone ID to be used to format timestamps in the JSON data source or partition values. Two formats of timeZone are supported: a region-based zone ID, in the form 'area/city', such as 'America/Los_Angeles'; or a zone offset, in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'.
Primitives As String
Infers all primitive values as a string type. Default: false
Prefers Decimal
Infers all floating-point values as a decimal type. If the values do not fit in decimal, they are inferred as doubles. Default: false
Allow Comments
Ignores Java/C++-style comments in JSON records. A comment enclosed in /* and */ MUST be written on a single line; it SHOULD NOT span multiple lines, even if the Multi-Line property is set to true (Multi-Line applies only to JSON attributes, not to comments). Default: false
Allow Unquoted Field Names
Allows unquoted JSON field names. Default: false
Allow Single Quotes
Allows single quotes in addition to double quotes. Default: true
Allow Numeric Leading Zeros
Allows leading zeros in numbers (e.g. 00012). Default: false
Allow Backslash Escaping Any Character
Allows accepting quoting of all characters using the backslash quoting mechanism. Default: false
Mode
Mode for dealing with corrupt records during parsing.
PERMISSIVE: When a corrupted record is encountered, puts the malformed string into the field configured by Column Name Of Corrupt Record and sets malformed fields to null. To keep corrupt records, a user can add a string-type field with that name to a user-defined schema; if the schema does not have the field, corrupt records are dropped during parsing. When inferring a schema, the parser implicitly adds a Column Name Of Corrupt Record field to the output schema.
DROPMALFORMED: Ignores whole corrupted records.
FAILFAST: Throws an exception when a corrupted record is encountered.
Default: PERMISSIVE
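The three modes can be illustrated with a plain-Python parser over line-delimited JSON; this is a behavioral sketch, not the connector's implementation, and `_corrupt_record` here stands in for whatever Column Name Of Corrupt Record is configured to:

```python
import json

def parse(lines, mode="PERMISSIVE", corrupt_col="_corrupt_record"):
    """Parse JSONL records, handling malformed lines per the given mode."""
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            if mode == "PERMISSIVE":
                # Keep the malformed string in the corrupt-record column.
                rows.append({corrupt_col: line})
            elif mode == "DROPMALFORMED":
                continue  # silently drop the whole record
            elif mode == "FAILFAST":
                raise  # surface the parse error immediately
    return rows

lines = ['{"id": 1}', '{bad json}']
print(parse(lines))                   # [{'id': 1}, {'_corrupt_record': '{bad json}'}]
print(parse(lines, "DROPMALFORMED"))  # [{'id': 1}]
```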
Column Name Of Corrupt Record
Allows renaming the new field containing the malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.
Date Format
String that indicates the date format to use when reading dates or timestamps. Custom date formats follow the patterns at java.text.SimpleDateFormat. This applies to both DateType and TimestampType. Default: yyyy-MM-dd
Timestamp Format
Sets the string that indicates a timestamp format. Custom timestamp formats follow the formats at Datetime Patterns. This applies to the timestamp type. Default: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
Multi-Line
Parse one record, which may span multiple lines per file, or parse multiple records enclosed in a JSON array. Default: false
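The difference between the two layouts can be shown with stdlib `json`: with Multi-Line enabled, one JSON array spanning several lines yields two records; with it disabled, each line must be a complete JSON document (JSONL). A plain-Python sketch:

```python
import json

# Multi-line input: a single JSON array spanning several lines,
# parsed as two records.
multi_line_input = """[
  {"id": 1, "name": "a"},
  {"id": 2, "name": "b"}
]"""
records = json.loads(multi_line_input)
print(len(records))  # 2

# Without Multi-Line, each line is parsed as its own JSON document (JSONL).
jsonl_input = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}'
records_jsonl = [json.loads(line) for line in jsonl_input.splitlines()]
print(len(records_jsonl))  # 2
```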
Allow Unquoted Control Chars
Allows JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line-feed characters). Default: false
Encoding
Allows forcibly setting one of the standard basic or extended encodings for the JSON files, for example UTF-16BE or UTF-32LE. If the encoding is not specified and Multi-Line is set to true, it is detected automatically.
Line Separator
Defines the line separator that should be used for parsing. The default covers \n, \r and \r\n; to use the default, leave this field empty.
Sampling Ratio
Defines the fraction of input JSON objects used for schema inference. Default: 1
Drop Field If All Null
Whether to ignore columns of all-null values or empty arrays/structs during schema inference. Default: false
Locale
Sets a locale as a language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps. Default: en-US
Allow Non-Numeric Numbers
Allows the JSON parser to recognize a set of "not-a-number" (NaN) tokens as legal floating-point values: +INF for positive infinity (with aliases +Infinity and Infinity); -INF for negative infinity (alias -Infinity); NaN for other not-a-numbers, such as the result of division by zero. Default: true
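Python's stdlib `json` parser accepts a subset of these tokens (NaN, Infinity, -Infinity, but not the +INF/+Infinity spellings), which makes it a convenient illustration of the option's effect; the `reject` hook below is a hypothetical stand-in for the disabled behavior:

```python
import json, math

# With the option enabled, non-standard tokens parse as float values.
doc = '{"ratio": NaN, "upper": Infinity, "lower": -Infinity}'
parsed = json.loads(doc)
print(math.isnan(parsed["ratio"]))       # True
print(parsed["upper"], parsed["lower"])  # inf -inf

# With the option disabled, such tokens are rejected; the stdlib
# equivalent is a parse_constant hook that raises on them.
def reject(name):
    raise ValueError(f"non-numeric number not allowed: {name}")

try:
    json.loads(doc, parse_constant=reject)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```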
Watermark Field Name
Field name to be used as the watermark. If unspecified in streaming mode, the default field name is 'tempWatermark'. Example: myConsumerWatermark. Default: tempWatermark
Watermark Value
Watermark value setting. Examples: 10 seconds, 2 minutes
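The watermark value is the allowed lateness: in the usual streaming model, the watermark trails the maximum event time seen so far by that delay, and events older than the watermark are treated as late. A minimal arithmetic sketch under that assumption (the event times below are hypothetical):

```python
from datetime import datetime, timedelta

# A watermark value of "10 seconds" means the watermark trails the maximum
# event time seen so far by ten seconds.
delay = timedelta(seconds=10)
max_event_time = datetime(2022, 1, 1, 12, 0, 30)
watermark = max_event_time - delay  # 2022-01-01 12:00:20

# An event older than the watermark arrives "behind" it and is treated as late.
late_event = datetime(2022, 1, 1, 12, 0, 15)
print(late_event < watermark)  # True
```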