Pipelines


Data pipelines move data between systems: they read data from sources, transform it, and output it in formats that make analysis easy. Pipelines can also perform a variety of other important data-related tasks, such as data quality checks and schema evolution. Key features include:
  • Canvas with drag-and-drop interface for defining pipelines
  • Pre-built connectors to access data from various data sources and output to multiple destinations
  • Use SQL and other provided processors to transform data (see the sketch after this list)
  • Validate and test pipelines locally before deploying to a Spark runtime
  • Support for Databricks, AWS Glue, AWS EMR, Spark on Kubernetes, and other Spark runtimes
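
As an illustration, a SQL processor might filter out bad records and aggregate the remainder before the pipeline writes to a destination. This is a minimal sketch; the raw_orders table and its columns are hypothetical:

    -- Hypothetical transform: keep valid rows and compute a daily
    -- total per customer (table and column names are illustrative).
    SELECT
        customer_id,
        CAST(order_ts AS DATE) AS order_date,
        SUM(amount)            AS daily_total
    FROM raw_orders
    WHERE amount IS NOT NULL AND amount > 0  -- simple data quality check
    GROUP BY customer_id, CAST(order_ts AS DATE)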



Variables

Variables are placeholders with values that you define here and use within the pipeline. When the pipeline runs, each variable is replaced with its value.
When you create a job, you can override the variable values defined in the pipeline. For example, you can define a directory variable and set it to different values depending on which environment the job runs in. To use the directory variable within a pipeline, reference it as ${directory}; you can then supply different values for the development, staging, and production runtimes.
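
For example, a SQL processor could read its input from a path built from the variable, here using Spark SQL's direct file query syntax. The bucket layout below is hypothetical; at run time the variable's value (or the job's override) is substituted before the query executes:

    -- ${directory} is replaced before execution, e.g.:
    --   development: ${directory} -> s3://acme-data/dev
    --   production:  ${directory} -> s3://acme-data/prod
    SELECT *
    FROM parquet.`${directory}/events`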

Validate

Design Document

Versions