Databricks — structured streaming
In the last post I described how to work with Delta Live Tables. Here is the link: Databricks — Delta Live Tables
Today I want to cover a related topic: Structured Streaming.
Our example shows how to ingest .csv files from Azure Data Lake Storage. Here is a file structure:
The first step is to declare all the needed variables and the schema of the .csv files.
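Here is a minimal sketch of that step. The storage paths, the storage account name and the column names ([City], [County], [Population]) are assumptions for illustration; replace them with your own.

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical ADLS paths -- replace with your own storage account and containers
source_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/iowa_population/"
target_path = "abfss://curated@mystorageaccount.dfs.core.windows.net/iowa_population_delta/"
checkpoint_path = "abfss://curated@mystorageaccount.dfs.core.windows.net/checkpoints/iowa_population/"

# Assumed schema of the incoming .csv files
csv_schema = StructType([
    StructField("City", StringType(), True),
    StructField("County", StringType(), True),
    StructField("Population", IntegerType(), True),
])
```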
The second step is to create a readStream.
In this case we use the cloudFiles Auto Loader, which automatically ingests new .csv files. Additionally, two columns are added, [event_timestamp] and [event_date], which will be useful later.
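A minimal sketch of such a readStream, assuming the variables declared above:

```python
from pyspark.sql.functions import current_timestamp, to_date, col

# Ingest new .csv files incrementally with Auto Loader (cloudFiles)
iowa_population_stream = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .schema(csv_schema)
        .load(source_path)
        # extra columns used later for partitioning and windowing
        .withColumn("event_timestamp", current_timestamp())
        .withColumn("event_date", to_date(col("event_timestamp")))
)
```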
Before we add the next code part, I have created a second data frame. This one will be joined to our streamed data.
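For illustration, here is a hypothetical static data frame; the actual one in the post may look different:

```python
# A small static lookup frame to be joined to the stream
# (column names and values are assumptions for illustration)
county_df = spark.createDataFrame(
    [("Polk", "Central"), ("Linn", "East"), ("Scott", "East")],
    ["County", "Region"],
)
```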
Now it is time to write our streamed data into a Delta location. To do this we use the writeStream function; a sketch of the whole call follows the line notes below.
Line 2 — the static data frame is joined to the streamed data.
Line 6 — a checkpoint location, where the streaming state is saved.
Line 7 — in this case data is appended; the complete and update output modes are also available.
Line 8 — the sink Delta table is partitioned by the [event_date] column.
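Putting those notes together, a sketch of the writeStream could look like this (the line numbers in the original listing may differ slightly; the names come from the sketches above):

```python
# Join the static frame to the stream and write the result to a Delta location
query = (
    iowa_population_stream
        .join(county_df, on="County", how="left")       # stream-static join
        .writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)  # streaming state is kept here
        .outputMode("append")                           # append / complete / update
        .partitionBy("event_date")                      # partition the sink by event_date
        .start(target_path)
)
```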
After new files are uploaded, new records appear in the target location:
The example above shows how to ingest new records into a Delta location. The second example shows how to aggregate data using a window function. This time we won’t persist the ingested data; instead we use the handy memory sink, where the streamed data resides directly in memory.
As you can see in line 10, all events are saved to memory. In line 11 a queryName is set, so we will be able to run SQL queries against this stream.
In lines 1–6 we define a window function. We use the [iowa_population_stream] initiated earlier. Data is grouped by the [event_date] column (line 3) and the [Population] column is summed up (line 5). Line 4 specifies a tumbling window with a 10-second interval. A sketch of this part follows below.
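A sketch of the window aggregation and the memory sink, assuming the stream and columns from the earlier sketches:

```python
from pyspark.sql.functions import window, col, sum as sum_

# Tumbling 10-second window grouped by event_date, summing the Population column
windowed_population = (
    iowa_population_stream
        .groupBy(
            col("event_date"),
            window(col("event_timestamp"), "10 seconds"),  # tumbling window
        )
        .agg(sum_("Population").alias("total_population"))
)

# Keep the aggregated results in memory and register them under a query name
memory_query = (
    windowed_population
        .writeStream
        .format("memory")
        .queryName("iowa_population_window")
        .outputMode("complete")   # complete mode re-emits the full aggregation result
        .start()
)
```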
As I said earlier, we can query the stream through SQL because the queryName has been set.
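For example (the query name is taken from the sketch above):

```python
# Query the in-memory stream via SQL, using the queryName set earlier
spark.sql("""
    SELECT event_date, window, total_population
    FROM iowa_population_window
    ORDER BY event_date, window
""").show(truncate=False)
```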
Here is the code: Structured Streaming