Json

JSON file data source

Properties

Properties supported by this source are shown below (* indicates a required field)
Property
Description
Name *
Name of the data source
Description
Description of the data source
Processing Mode
Select for batch mode and unselect for streaming mode. Selecting 'Batch' sets the switch to true; selecting 'Streaming' sets it to false. Default: true
Infer Schema
Check if the schema should be inferred from the data. Default: false
Path *
Path to the file location. Example: s3a://[bucketpath]
Schema
Source schema to assist during the design of the pipeline
Filename Column
Adds the absolute path of the file being read as a new column with the provided name. Example: file_name
Select Fields / Columns
Comma-separated list of field/column names to select from the source. Default: *
Filter Expression
SQL WHERE clause for filtering records. Examples: date = '2022-01-01'; year = 22 and month = 6 and day = 2
Distinct Values
Select rows with distinct column values. Default: false
Path Glob Filter
Optional glob pattern to include only files whose paths match the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery.
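Hadoop's GlobFilter semantics are close to ordinary shell globbing, so the effect of a pattern such as `*.json` can be sketched with Python's stdlib `fnmatch` (the paths below are hypothetical, and this is an illustration of the matching behavior, not the connector's implementation):

```python
from fnmatch import fnmatch

# Hypothetical listing of files found under the source path.
paths = [
    "s3a://bucket/data/part-0001.json",
    "s3a://bucket/data/part-0002.json.gz",
    "s3a://bucket/data/_SUCCESS",
]

# A glob filter of "*.json" keeps only files whose filenames match the
# pattern; directory-based partition discovery is unaffected.
matched = [p for p in paths if fnmatch(p.rsplit("/", 1)[-1], "*.json")]
print(matched)  # ['s3a://bucket/data/part-0001.json']
```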
Recursive File Lookup
Recursively load files; this disables partition inference. If your folder structure is partitioned as columnName=value (e.g. processDate=2022-01-26), the recursive option WILL NOT read the partitions correctly. Default: false
Normalize Column Names
Normalizes column names by replacing the special characters ,;{}()&/\n\t= and space with the given string. Example: _
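The normalization rule above can be sketched with a regular expression; this is a plain-Python illustration under the assumption that each listed character is replaced independently (the `normalize` helper is hypothetical, not part of the product):

```python
import re

# Character class covering the special characters listed above:
# , ; { } ( ) & / newline tab = and space.
SPECIAL = r"[,;{}()&/\n\t= ]"

def normalize(name: str, replacement: str = "_") -> str:
    """Replace each special character in a column name with `replacement`."""
    return re.sub(SPECIAL, replacement, name)

print(normalize("order date (utc)"))  # order_date__utc_
```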
Ignore Corrupt Files
If selected, jobs continue to run when they encounter corrupted files, and any contents that were read are still returned
Ignore Missing Files
Select to ignore missing files while reading data from files
Modified Before
An optional timestamp to include only files with modification times occurring before the specified time. The timestamp must be in the form YYYY-MM-DDTHH:mm:ss. Example: 2020-06-01T13:00:00
Modified After
An optional timestamp to include only files with modification times occurring after the specified time. The timestamp must be in the form YYYY-MM-DDTHH:mm:ss. Example: 2020-06-01T13:00:00
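The YYYY-MM-DDTHH:mm:ss form parses cleanly with stdlib `datetime`, so the Modified Before / Modified After cutoffs can be sketched as a filter over file modification times (the file list here is hypothetical):

```python
from datetime import datetime

FORMAT = "%Y-%m-%dT%H:%M:%S"  # matches the YYYY-MM-DDTHH:mm:ss form above

modified_after = datetime.strptime("2020-06-01T13:00:00", FORMAT)

# Hypothetical (path, modification time) pairs for files at the source path.
files = [
    ("a.json", datetime(2020, 5, 30, 9, 0, 0)),
    ("b.json", datetime(2020, 6, 2, 8, 30, 0)),
]

# Keep only files modified after the cutoff.
kept = [path for path, mtime in files if mtime > modified_after]
print(kept)  # ['b.json']
```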
Time Zone
Sets the string that indicates a time zone ID to be used to format timestamps in the JSON data source or partition values. Two formats of timeZone are supported: a region-based zone ID, in the form 'area/city', such as 'America/Los_Angeles'; or a zone offset, in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'.
Primitives As String
Infers all primitive values as a string type. Default: false
Prefers Decimal
Infers all floating-point values as a decimal type. If the values do not fit in decimal, they are inferred as doubles. Default: false
Allow Comments
Ignores Java/C++-style comments in JSON records. A comment enclosed in /* and */ MUST be written on a single line; it SHOULD NOT span multiple lines, even if the Multi-Line property is set to true (Multi-Line applies only to JSON attributes, not to comments). Default: false
Allow Unquoted Field Names
Allows unquoted JSON field names. Default: false
Allow Single Quotes
Allows single quotes in addition to double quotes. Default: true
Allow Numeric Leading Zeros
Allows leading zeros in numbers (e.g. 00012). Default: false
Allow Backslash Escaping Any Character
Allows accepting quoting of all characters using the backslash quoting mechanism. Default: false
Mode
Mode for dealing with corrupt records during parsing.
PERMISSIVE: When a corrupted record is encountered, puts the malformed string into the field configured by Column Name Of Corrupt Record and sets malformed fields to null. To keep corrupt records, a user can add a string-type field with that name to a user-defined schema; if the schema does not have the field, corrupt records are dropped during parsing. When inferring a schema, the parser implicitly adds a Column Name Of Corrupt Record field to the output schema.
DROPMALFORMED: Ignores whole corrupted records.
FAILFAST: Throws an exception when a corrupted record is encountered.
Default: PERMISSIVE
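The three modes can be illustrated with a plain-Python parser over line-delimited JSON; this is a behavioral sketch, not the connector's implementation, and `_corrupt_record` here stands in for whatever Column Name Of Corrupt Record is configured to:

```python
import json

def parse(lines, mode="PERMISSIVE", corrupt_col="_corrupt_record"):
    """Parse JSONL records, handling malformed lines per the given mode."""
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            if mode == "PERMISSIVE":
                # Keep the malformed string in the corrupt-record column.
                rows.append({corrupt_col: line})
            elif mode == "DROPMALFORMED":
                continue  # silently drop the whole record
            elif mode == "FAILFAST":
                raise  # surface the parse error immediately
    return rows

lines = ['{"id": 1}', '{bad json}']
print(parse(lines))                   # [{'id': 1}, {'_corrupt_record': '{bad json}'}]
print(parse(lines, "DROPMALFORMED"))  # [{'id': 1}]
```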
Column Name Of Corrupt Record
Allows renaming the new field containing the malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.
Date Format
String that indicates the date format to use when reading dates or timestamps. Custom date formats follow the patterns at java.text.SimpleDateFormat. This applies to both DateType and TimestampType. Default: yyyy-MM-dd
Timestamp Format
Sets the string that indicates a timestamp format. Custom timestamp formats follow the formats at Datetime Patterns. This applies to the timestamp type. Default: yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
Multi-Line
Parse one record, which may span multiple lines per file, or parse multiple records enclosed in a JSON array. Default: false
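The difference between the two layouts can be shown with stdlib `json`: with Multi-Line enabled, one JSON array spanning several lines yields two records; with it disabled, each line must be a complete JSON document (JSONL). A plain-Python sketch:

```python
import json

# Multi-line input: a single JSON array spanning several lines,
# parsed as two records.
multi_line_input = """[
  {"id": 1, "name": "a"},
  {"id": 2, "name": "b"}
]"""
records = json.loads(multi_line_input)
print(len(records))  # 2

# Without Multi-Line, each line is parsed as its own JSON document (JSONL).
jsonl_input = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}'
records_jsonl = [json.loads(line) for line in jsonl_input.splitlines()]
print(len(records_jsonl))  # 2
```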
Allow Unquoted Control Chars
Allows JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line-feed characters). Default: false
Encoding
Allows forcibly setting one of the standard basic or extended encodings for the JSON files, for example UTF-16BE or UTF-32LE. If the encoding is not specified and Multi-Line is set to true, it is detected automatically.
Line Separator
Defines the line separator that should be used for parsing. The default covers \n, \r and \r\n; to use the default, leave this field empty.
Sampling Ratio
Defines the fraction of input JSON objects used for schema inference. Default: 1
Drop Field If All Null
Whether to ignore columns of all-null values or empty arrays/structs during schema inference. Default: false
Locale
Sets a locale as a language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps. Default: en-US
Allow Non-Numeric Numbers
Allows the JSON parser to recognize a set of "not-a-number" (NaN) tokens as legal floating-point values: +INF for positive infinity (with aliases +Infinity and Infinity); -INF for negative infinity (alias -Infinity); NaN for other not-a-numbers, such as the result of division by zero. Default: true
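Python's stdlib `json` parser accepts a subset of these tokens (NaN, Infinity, -Infinity, but not the +INF/+Infinity spellings), which makes it a convenient illustration of the option's effect; the `reject` hook below is a hypothetical stand-in for the disabled behavior:

```python
import json, math

# With the option enabled, non-standard tokens parse as float values.
doc = '{"ratio": NaN, "upper": Infinity, "lower": -Infinity}'
parsed = json.loads(doc)
print(math.isnan(parsed["ratio"]))       # True
print(parsed["upper"], parsed["lower"])  # inf -inf

# With the option disabled, such tokens are rejected; the stdlib
# equivalent is a parse_constant hook that raises on them.
def reject(name):
    raise ValueError(f"non-numeric number not allowed: {name}")

try:
    json.loads(doc, parse_constant=reject)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```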
Watermark Field Name
Field name to be used as the watermark. If unspecified in streaming mode, the default field name is 'tempWatermark'. Example: myConsumerWatermark. Default: tempWatermark
Watermark Value
Watermark value setting. Examples: 10 seconds, 2 minutes
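The watermark value is the allowed lateness: in the usual streaming model, the watermark trails the maximum event time seen so far by that delay, and events older than the watermark are treated as late. A minimal arithmetic sketch under that assumption (the event times below are hypothetical):

```python
from datetime import datetime, timedelta

# A watermark value of "10 seconds" means the watermark trails the maximum
# event time seen so far by ten seconds.
delay = timedelta(seconds=10)
max_event_time = datetime(2022, 1, 1, 12, 0, 30)
watermark = max_event_time - delay  # 2022-01-01 12:00:20

# An event older than the watermark arrives "behind" it and is treated as late.
late_event = datetime(2022, 1, 1, 12, 0, 15)
print(late_event < watermark)  # True
```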