Azure Synapse makes your life easier
If you work with data, as a Data Analyst, a Data Scientist, a Data Engineer, a BI Engineer, etc., there are a lot of situations where you need to work with data that originates from various sources: SQL databases, Databricks, Data Lakes, and so on. You can use Azure Data Factory to move the data to a common location and work with it seamlessly. But in some cases you don't need, or don't want, to create a data movement mechanism just to have the data in one place.
Azure Synapse is a product that can save you time and effort, because you have an Apache Spark cluster (Databricks-like), a SQL database and a Data Lake in one place. Additionally, you can create data movement processes just like in ADF.
Here is an example of how you can use four tables in one query, where every table is queried from a different source.
A link to the code repository can be found at the end of the article.
The first one is a Spark table:
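A minimal PySpark sketch of what that could look like; the demo.customers table is a hypothetical placeholder for your own Spark table:

```python
from pyspark.sql import SparkSession

# In a Synapse or Databricks notebook the session already exists as `spark`.
spark = SparkSession.builder.getOrCreate()

# Hypothetical Spark table; registered under a view name for the final query.
spark.table("demo.customers").createOrReplaceTempView("customers_spark")
```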
The second one, a table from a Dedicated SQL pool:
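Reading it from the notebook can be done over plain JDBC, as sketched below; the endpoint, table, and credentials are placeholders (inside a Synapse Spark pool, the built-in synapsesql connector is an alternative):

```python
# Dedicated SQL pool read over JDBC; `spark` is the session from the
# first snippet, and all connection values are placeholders.
orders_dedicated = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<workspace>.sql.azuresynapse.net:1433;"
                   "database=<dedicated_pool>")
    .option("dbtable", "dbo.orders")
    .option("user", "<sql_admin>")
    .option("password", "<password>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)
orders_dedicated.createOrReplaceTempView("orders_dedicated")
```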
A Data Lake parquet file:
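Reading it is a one-liner; the abfss path is a placeholder:

```python
# Parquet read straight from ADLS Gen2; the path is a placeholder.
products = spark.read.parquet(
    "abfss://data@<storage_account>.dfs.core.windows.net/raw/products.parquet"
)
products.createOrReplaceTempView("products_lake")
```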
And a Serverless SQL Database:
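Again over JDBC, this time against the serverless ("-ondemand") endpoint; every name is a placeholder:

```python
# Serverless SQL uses the "-ondemand" endpoint; names are placeholders.
sales_serverless = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<workspace>-ondemand.sql.azuresynapse.net:1433;"
                   "database=<serverless_db>")
    .option("dbtable", "dbo.sales")
    .option("user", "<sql_user>")
    .option("password", "<password>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)
sales_serverless.createOrReplaceTempView("sales_serverless")
```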
And all four tables in one query.
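With the four temp views registered in the sketches above, a single Spark SQL statement can join them (the join keys here are, of course, hypothetical):

```python
# One Spark SQL statement over all four sources.
result = spark.sql("""
    SELECT c.customer_name,
           o.order_id,
           p.product_name,
           s.sale_amount
    FROM customers_spark  AS c
    JOIN orders_dedicated AS o ON o.customer_id = c.customer_id
    JOIN products_lake    AS p ON p.product_id  = o.product_id
    JOIN sales_serverless AS s ON s.order_id    = o.order_id
""")
result.show()
```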
It is worth mentioning that if you want to query a Serverless SQL table, you need to grant access to the Data Lake where the underlying files are stored. You can do this through a Service Principal created in the Azure portal, and then grant this principal access directly inside the database.
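The database-side grant could look roughly like this, run once against the serverless endpoint while connected as an Azure AD admin; my-synapse-app stands for the Service Principal and every other name is a placeholder:

```python
import pyodbc

# Connect as an Azure AD admin: creating an AAD-backed user requires
# an AAD login, not a plain SQL login.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<workspace>-ondemand.sql.azuresynapse.net;"
    "DATABASE=<serverless_db>;"
    "UID=<aad_admin>@<tenant>.onmicrosoft.com;PWD=<password>;"
    "Authentication=ActiveDirectoryPassword",
    autocommit=True,
)
cur = conn.cursor()

# Map the Service Principal to a database user and let it read data.
cur.execute("CREATE USER [my-synapse-app] FROM EXTERNAL PROVIDER;")
cur.execute("ALTER ROLE db_datareader ADD MEMBER [my-synapse-app];")
conn.close()
```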
Here is the code that lets you configure the connection inside a Databricks notebook:
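A sketch of that configuration; the storage account name, application (client) ID, client secret, and tenant ID are placeholders for your Service Principal's values:

```python
# OAuth access to ADLS Gen2 through the Service Principal;
# all values below are placeholders.
storage = "<storage_account>.dfs.core.windows.net"

spark.conf.set(f"fs.azure.account.auth.type.{storage}", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage}",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage}", "<application_id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage}", "<client_secret>")
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage}",
    "https://login.microsoftonline.com/<tenant_id>/oauth2/token",
)
```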
You can reference Spark tables from a Serverless SQL database, as well as Serverless external tables and files from the Data Lake. As of today (2022-04-13), there is no possibility to query these tables from a Dedicated SQL pool.
But… a Spark table has parquet files underneath it, and a Serverless SQL table uses files from the Data Lake as its source. So we can create external tables inside the Dedicated SQL pool that point at those files.
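Here is a sketch of that workaround, run against the Dedicated SQL pool. It assumes the pool can already reach the storage account (for example through the workspace managed identity); the parquet location and every name below are placeholders (a Spark table's folder can be found in the lake under the Spark warehouse directory):

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<workspace>.sql.azuresynapse.net;"
    "DATABASE=<dedicated_pool>;UID=<sql_admin>;PWD=<password>",
    autocommit=True,
)
cur = conn.cursor()

# Point the pool at the Data Lake and describe the parquet layout.
cur.execute("""
    CREATE EXTERNAL DATA SOURCE lake_ds
    WITH (TYPE = HADOOP,
          LOCATION = 'abfss://data@<storage_account>.dfs.core.windows.net')
""")
cur.execute("CREATE EXTERNAL FILE FORMAT parquet_ff WITH (FORMAT_TYPE = PARQUET)")

# External table over the parquet files that back the Spark table;
# analogous tables can be created for the other lake-backed sources.
cur.execute("""
    CREATE EXTERNAL TABLE dbo.customers_ext (
        customer_id   INT,
        customer_name NVARCHAR(100)
    )
    WITH (LOCATION    = '/warehouse/demo.db/customers',
          DATA_SOURCE = lake_ds,
          FILE_FORMAT = parquet_ff)
""")
conn.close()
```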
Now, we can query all four tables.
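For example, assuming external tables dbo.customers_ext, dbo.products_ext, and dbo.sales_ext were created as above, one T-SQL query inside the Dedicated SQL pool now reaches all four sources:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<workspace>.sql.azuresynapse.net;"
    "DATABASE=<dedicated_pool>;UID=<sql_admin>;PWD=<password>",
    autocommit=True,
)
rows = conn.cursor().execute("""
    SELECT c.customer_name, o.order_id, p.product_name, s.sale_amount
    FROM dbo.orders AS o                -- native Dedicated SQL table
    JOIN dbo.customers_ext AS c         -- files behind the Spark table
      ON c.customer_id = o.customer_id
    JOIN dbo.products_ext AS p          -- parquet file in the Data Lake
      ON p.product_id = o.product_id
    JOIN dbo.sales_ext AS s             -- files behind the serverless table
      ON s.order_id = o.order_id
""").fetchall()
```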
Here are the code files.
And… if you need some data that is outside of this scope, you can always create your own pipeline, just like in Azure Data Factory.