Fixed Width

A fixed-width file can be a multi-segment file containing header and footer rows. This source lets you specify a fixed number of rows as header rows and a fixed number of rows as footer rows. Between the header and footer rows there can be multiple segment definitions.
Each row in a multi-segment file carries a record type as part of the line. The schema definition of the fixed-width file contains a definition "recordType": { "startsAt": 24, "length": 2 } which specifies how to identify the record type. In this case, 2 characters are extracted starting at position 24 to determine the record type.
If the file is a single-segment file, both startsAt and length are set to -1.
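For example, a single-segment file would carry a record-type definition of the form:

```json
"recordType": { "startsAt": -1, "length": -1 }
```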
Next comes the segments definition in the schema. Each segment has a 'name' and a 'recordTypeValue' against which the extracted record type is compared. If the two characters in our example match the 'recordTypeValue' specified in the segment schema, that record is added to the named segment's dataframe.
This is followed by the schema definition of the other columns in the segment, each extracted based on the 'startsAt' position and the 'length' of the characters that make up the column.
Each column also has a 'type' field such as 'int', 'decimal', 'string', or 'timestamp'. The decimal type takes a precision, a scale, and a flag indicating whether the decimal point is present in the data file. The timestamp type takes a date format, which follows the Java date format patterns for extraction.
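Putting these pieces together, here is a minimal sketch of what a two-segment schema might look like. The "recordType" block and the 'name', 'recordTypeValue', 'startsAt', 'length', and 'type' keys are taken directly from the description above; the segment and column names are made up, and the exact spelling of the remaining keys ("segments", "columns", "precision", "scale", "decimalPointPresent", "dateFormat") is an assumption, so verify them against the schema reference for your version.

```json
{
  "recordType": { "startsAt": 24, "length": 2 },
  "segments": [
    {
      "name": "customer",
      "recordTypeValue": "CU",
      "columns": [
        { "name": "customerId",   "type": "int",    "startsAt": 0, "length": 8 },
        { "name": "customerName", "type": "string", "startsAt": 8, "length": 16 }
      ]
    },
    {
      "name": "order",
      "recordTypeValue": "OR",
      "columns": [
        { "name": "orderId",   "type": "int",       "startsAt": 0,  "length": 8 },
        { "name": "amount",    "type": "decimal",   "startsAt": 8,  "length": 8,
          "precision": 8, "scale": 2, "decimalPointPresent": false },
        { "name": "orderDate", "type": "timestamp", "startsAt": 16, "length": 8,
          "dateFormat": "yyyyMMdd" }
      ]
    }
  ]
}
```

With this schema, a row whose characters at positions 24-25 read 'CU' lands in the customer segment's dataframe and a row reading 'OR' in the order segment's; within each row, every column value is cut out of the line by its 'startsAt' and 'length'.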

Properties

Properties supported in this source are shown below (* indicates required fields).
| Property | Description |
| --- | --- |
| Name * | Name of the data source |
| Description | Description of the data source |
| Processing Mode | Select for batch, un-select for streaming. When 'Batch' is selected the switch value is true; when 'Streaming' is selected it is false. Default: true |
| Path * | Path to the file location. Example: s3a://[bucketpath]/load.csv, hdfs://[URL] |
| Schema * | The column layout schema in JSON format, or a JSON file specifying the format |
| # of Header Rows * | Number of rows to be treated as header rows. Example: 0. Default: 0 |
| # of Footer Rows * | Number of rows to be treated as footer rows, e.g. checksum rows. Example: 0. Default: 0 |
| Output To | If there are header rows (# of Header Rows > 0), the first item in this list is ALWAYS the header output. If there are footer rows (# of Footer Rows > 0), the last item is ALWAYS the footer output. All other outputs in between correspond to the segments in the provided schema. E.g. with header rows, two segments in the schema, and footer rows, the list should contain 4 items: the header first, the two segments in the middle, and the footer last. With no header, one segment, and a footer, the list should contain two names: the segment output first and the footer output last. A worked example follows this table. |
| Select Fields / Columns | Comma-separated list of fields/columns to select from the source. Example: firstName, lastName, address1, address2, city, zipcode. Default: * |
| Filter Expression | SQL WHERE clause for filtering records; also used to load partitions from the source. Examples: date=2022-01-01; year = 22 and month = 6 and day = 2 |
| Distinct Values | Select rows with distinct column values. Default: false |
| Path Glob Filter | Optional glob pattern to include only files whose paths match the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery. |
| Recursive File Lookup | Recursively load files; this disables partition inferring. If your folder structure is partitioned as columnName=value (e.g. processDate=2022-01-26), the recursive option WILL NOT read the partitions correctly. Default: false |
| Filename Column | Adds the absolute path of the file being read as a new column with the provided name. Example: file_name |
| Ignore Corrupt Files | If selected, jobs will continue to run when encountering corrupted files, and the contents that have been read will still be returned |
| Ignore Missing Files | Select to ignore missing files while reading data from files |
| Modified Before | An optional timestamp to include only files with modification times occurring before the specified time, in the format YYYY-MM-DDTHH:mm:ss. Example: 2020-06-01T13:00:00 |
| Modified After | An optional timestamp to include only files with modification times occurring after the specified time, in the format YYYY-MM-DDTHH:mm:ss. Example: 2020-06-01T13:00:00 |
| Character Set | Character set of the file. Default: UTF-8 |
| Normalize Column Names | Normalizes column names by replacing the special characters `,;{}()&/\n\t=` and space with the given string. Example: _ |
| Watermark Field Name | Field name to be used as the watermark. If unspecified in streaming mode, the default field name is 'tempWatermark'. Example: myConsumerWatermark. Default: tempWatermark |
| Watermark Value | Watermark value setting. Examples: 10 seconds, 2 minutes |
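To make the Output To ordering concrete, take the hypothetical two-segment schema sketched earlier (segments customer and order) together with one header row and one footer row. The Output To list would then need four names, in this order:

1. header output (present because # of Header Rows > 0)
2. customer segment output (first segment in the schema)
3. order segment output (second segment in the schema)
4. footer output (present because # of Footer Rows > 0)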