Databricks — structured streaming
In the last post I described how to work with Delta Live Tables. Here is the link: Databricks — Delta Live Tables
Today I want to cover a related topic: Structured Streaming.
Our example shows how to ingest .csv files from Azure Data Lake Storage. Here is a file structure:
The first step is to declare all the needed variables and the schema of the .csv files.
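Here is a minimal sketch of that step. The storage paths, the storage account name and the column names ([City], [County], [Population]) are assumptions for illustration; replace them with your own.

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical ADLS paths -- replace with your own storage account and containers
source_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/iowa_population/"
target_path = "abfss://curated@mystorageaccount.dfs.core.windows.net/iowa_population_delta/"
checkpoint_path = "abfss://curated@mystorageaccount.dfs.core.windows.net/checkpoints/iowa_population/"

# Assumed schema of the incoming .csv files
csv_schema = StructType([
    StructField("City", StringType(), True),
    StructField("County", StringType(), True),
    StructField("Population", IntegerType(), True),
])
```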
The second step is to create a readStream.
In this case we use the cloudFiles Auto Loader, which automatically ingests new .csv files. Additionally, two columns are added, [event_timestamp] and [event_date], which will be useful later.
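A minimal sketch of such a readStream, assuming the variables declared above:

```python
from pyspark.sql.functions import current_timestamp, to_date, col

# Ingest new .csv files incrementally with Auto Loader (cloudFiles)
iowa_population_stream = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .schema(csv_schema)
        .load(source_path)
        # extra columns used later for partitioning and windowing
        .withColumn("event_timestamp", current_timestamp())
        .withColumn("event_date", to_date(col("event_timestamp")))
)
```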
Before we add the next code part, I have created a second data frame. This one will be joined to our streamed data.
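For illustration, here is a hypothetical static data frame; the actual one in the post may look different:

```python
# A small static lookup frame to be joined to the stream
# (column names and values are assumptions for illustration)
county_df = spark.createDataFrame(
    [("Polk", "Central"), ("Linn", "East"), ("Scott", "East")],
    ["County", "Region"],
)
```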
Now it is time to write our streamed data into a Delta location. To do this we use the writeStream function; a sketch of the whole call follows the line notes below.
Line 2 — the static data frame is joined to the streamed data.
Line 6 — a checkpoint location, where the streaming state is saved.
Line 7 — in this case data is appended; the complete and update output modes are also available.
Line 8 — the sink Delta table is partitioned by the [event_date] column.
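Putting those notes together, a sketch of the writeStream could look like this (the line numbers in the original listing may differ slightly; the names come from the sketches above):

```python
# Join the static frame to the stream and write the result to a Delta location
query = (
    iowa_population_stream
        .join(county_df, on="County", how="left")       # stream-static join
        .writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)  # streaming state is kept here
        .outputMode("append")                           # append / complete / update
        .partitionBy("event_date")                      # partition the sink by event_date
        .start(target_path)
)
```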
After new files are uploaded, new records appear in the target location:
The example above shows how to ingest new records into a Delta location. The second example shows how to aggregate data using a window function. This time we won’t persist the ingested data; instead we use the handy memory sink, where the streamed data resides directly in memory.
As you can see in line 10, all events are saved to memory. In line 11 a queryName is set, so we will be able to run SQL queries against this stream.
In lines 1–6 we define a window function. We use the [iowa_population_stream] initiated earlier. Data is grouped by the [event_date] column (line 3) and the [Population] column is summed up (line 5). Line 4 specifies a tumbling window with a 10-second interval. A sketch of this part follows below.
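A sketch of the window aggregation and the memory sink, assuming the stream and columns from the earlier sketches:

```python
from pyspark.sql.functions import window, col, sum as sum_

# Tumbling 10-second window grouped by event_date, summing the Population column
windowed_population = (
    iowa_population_stream
        .groupBy(
            col("event_date"),
            window(col("event_timestamp"), "10 seconds"),  # tumbling window
        )
        .agg(sum_("Population").alias("total_population"))
)

# Keep the aggregated results in memory and register them under a query name
memory_query = (
    windowed_population
        .writeStream
        .format("memory")
        .queryName("iowa_population_window")
        .outputMode("complete")   # complete mode re-emits the full aggregation result
        .start()
)
```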
As I said earlier, we can query the stream through SQL because the queryName has been set.
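For example (the query name is taken from the sketch above):

```python
# Query the in-memory stream via SQL, using the queryName set earlier
spark.sql("""
    SELECT event_date, window, total_population
    FROM iowa_population_window
    ORDER BY event_date, window
""").show(truncate=False)
```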
Here is the code: Structured Streaming