PySpark

Properties

Properties supported by this processor are listed below (* indicates a required field).

Property            Description
Name *              Name of the processor
Description         Description of the processor
Input Relations *   Source or processor inputs for this processor; this field is not editable and is auto-populated
Output To           Comma-separated output names that can be referenced in the connected nodes
PySpark Code *      PySpark source code that performs data transformations
Process data from incoming nodes with custom PySpark code and manage DataFrames using the following steps:

1. Create DataFrames for Incoming Nodes:
   Create a DataFrame from the incoming node's data:
       df = spark.sql("SELECT * FROM input_relation")

2. Perform Required Transformations:
   After creating the DataFrame, apply the necessary transformations to the data (see the sketch after these steps).

3. Register the New DataFrame:
   Once the transformations are complete, register the new DataFrame:
       out_df.createOrReplaceTempView("currentnodename")

4. Create Additional Output DataFrames:
   If you need to create additional output DataFrames, register each one with:
       out_df.createOrReplaceTempView("currentnodename_outputrelation")

5. Specify Additional Output Relations:
   List any additional output relations in the "Output To" property so that the data flows to the downstream nodes.
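The following is a minimal sketch of steps 1 through 5 for a node that registers one extra output. All names here are illustrative assumptions, not fixed values: "orders" stands for the auto-populated input relation, "clean_orders" for the current node's name, "rejected" for the additional output listed in "Output To", and the "amount" column is assumed to exist in the input.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    def execute_node(spark: SparkSession):
        # Step 1: read the incoming node's data ("orders" is an assumed input relation)
        df = spark.sql("SELECT * FROM orders")
        # Step 2: example transformations - add a column and split rows by a condition
        enriched = df.withColumn("processed_at", F.current_timestamp())
        valid = enriched.filter(F.col("amount") > 0)
        rejected = enriched.filter(F.col("amount") <= 0)
        # Step 3: register the main output under the current node's name
        valid.createOrReplaceTempView("clean_orders")
        # Step 4: register the additional output as currentnodename_outputrelation
        rejected.createOrReplaceTempView("clean_orders_rejected")
        # Step 5: "rejected" must also be listed in the "Output To" property
        # so that downstream nodes can reference it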
Default:
    import pyspark
    from pyspark.sql.functions import col, lit
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    def execute_node(spark: SparkSession):
        # Read data from the incoming node <input_relation>
        df = spark.sql("SELECT * FROM <input_relation>")
        # Create a new DataFrame
        out_df = df.withColumn("new", F.lit("pw"))
        # Register the outgoing DataFrame
        out_df.createOrReplaceTempView("<output_relation>")
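
A downstream PySpark node reads this registered output back the same way as in step 1, using the relation name from its auto-populated "Input Relations" field:

    df = spark.sql("SELECT * FROM <output_relation>")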