Databricks — Workflows/Jobs
Every now and then you need to run a Databricks notebook on a schedule, or you have a few notebooks that you want to run while keeping dependencies between them. You can, of course, use Azure Data Factory. But perhaps, for some reason, you don't need or don't want to use an external tool. In that case, you can use a Databricks Job.
We have a Delta table with the following structure:
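As a rough sketch, assuming the table has an ID column and the [City] column updated later (both the table name demo_people and this schema are my assumptions, not taken from the screenshot):

# Illustrative only: the real table structure is shown above.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_people (
        ID   INT,
        City STRING
    )
    USING DELTA
""")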
An example job has four steps; each step runs one notebook (a sketch of the notebook logic follows the list).
- Delete records.
- Insert the first record.
- Insert the second record.
- Update the [City] column of a previously inserted record.
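Under the same assumptions, the notebook logic behind the four steps could look roughly like this (all literal values are illustrative):

# delete_records_000: empty the table
spark.sql("DELETE FROM demo_people")

# add_record_001 and add_record_002: insert one record each
spark.sql("INSERT INTO demo_people VALUES (1, 'Warsaw')")
spark.sql("INSERT INTO demo_people VALUES (2, 'Gdansk')")

# update_records_003: change the [City] of a previously inserted record
spark.sql("UPDATE demo_people SET City = 'Krakow' WHERE ID = 1")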
Here is a Job:
As you can see, we have set dependencies between the tasks: delete_records_000 runs first. After it completes, add_record_001 and add_record_002 run in parallel. Once both are completed, the update_records_003 task runs.
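The same dependency graph can also be expressed as a Jobs API 2.1 payload, which is handy if you want to create the job from code. This is only a sketch: the workspace URL, token, cluster id, notebook paths, and the placement and values of the ID01/ID02 parameters are placeholders, not the exact ones from this job.

import requests

job = {
    "name": "delta_demo_job",
    "tasks": [
        {
            "task_key": "delete_records_000",
            "existing_cluster_id": "<cluster-id>",
            "notebook_task": {"notebook_path": "/Jobs/delete_records"},
        },
        {
            "task_key": "add_record_001",
            "depends_on": [{"task_key": "delete_records_000"}],
            "existing_cluster_id": "<cluster-id>",
            "notebook_task": {"notebook_path": "/Jobs/add_record_001"},
        },
        {
            "task_key": "add_record_002",
            "depends_on": [{"task_key": "delete_records_000"}],
            "existing_cluster_id": "<cluster-id>",
            "notebook_task": {"notebook_path": "/Jobs/add_record_002"},
        },
        {
            "task_key": "update_records_003",
            "depends_on": [
                {"task_key": "add_record_001"},
                {"task_key": "add_record_002"},
            ],
            "existing_cluster_id": "<cluster-id>",
            "notebook_task": {
                "notebook_path": "/Jobs/update_records",
                "base_parameters": {"ID01": "1", "ID02": "2"},
            },
            "max_retries": 1,          # retry policy, set per task
            "timeout_seconds": 3600,   # timeout, set per task
        },
    ],
}

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=job,
)
print(resp.json())  # returns {"job_id": ...} on success

Note that depends_on is what produces the parallelism: both add_record tasks depend only on delete_records_000, so they can start at the same time.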
Here are the task details:
1. Path to the notebook to run.
2. The cluster that runs the notebook.
3. Parameters we want to use inside the notebook. To read a parameter inside the notebook, we can use the dbutils.widgets.get function (a slightly fuller sketch follows this list):

ID01 = dbutils.widgets.get("ID01")
ID02 = dbutils.widgets.get("ID02")

4. Task dependencies.
5. Retry policy and timeout settings.
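To make the parameter handling a bit more concrete, here is a sketch of how the task parameters reach the notebook; the default values and the cast are my additions:

# Declaring the widgets with default values lets the same notebook also run
# interactively; when the job runs, the task parameters override the defaults.
dbutils.widgets.text("ID01", "1")
dbutils.widgets.text("ID02", "2")

# Widget values always arrive as strings, so cast them where needed.
first_id = int(dbutils.widgets.get("ID01"))
second_id = int(dbutils.widgets.get("ID02"))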
Once we run the job, we can check its status.
And look at the flow details.
If the job fails, you can select a particular job run and check the reason for the failure.
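If you prefer to trigger and monitor the job from outside the UI, the Jobs API covers that too. A minimal sketch, assuming a job id of 123 and placeholder workspace URL and token:

import time
import requests

HOST = "https://<workspace-url>"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# Trigger a run of the job.
run = requests.post(
    f"{HOST}/api/2.1/jobs/run-now",
    headers=HEADERS,
    json={"job_id": 123},
).json()

# Poll the run until it terminates, then print the result state (SUCCESS, FAILED, ...).
while True:
    state = requests.get(
        f"{HOST}/api/2.1/jobs/runs/get",
        headers=HEADERS,
        params={"run_id": run["run_id"]},
    ).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        print(state.get("result_state"), state.get("state_message"))
        break
    time.sleep(30)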
Here, we can schedule the job.
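The schedule is stored on the job definition as a Quartz cron expression, so it can also be set from code. A minimal sketch against the Jobs API, with an illustrative daily 06:00 schedule (job id, URL, token and timezone are placeholders):

import requests

# Attach a daily 06:00 schedule to an existing job.
requests.post(
    "https://<workspace-url>/api/2.1/jobs/update",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "job_id": 123,
        "new_settings": {
            "schedule": {
                "quartz_cron_expression": "0 0 6 * * ?",
                "timezone_id": "Europe/Warsaw",
                "pause_status": "UNPAUSED",
            }
        },
    },
)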
Here are a few examples of how to work with jobs using the Databricks CLI: Databricks CLI — a few examples.