## Introduction

TL;DR: a simple script loads some data from Azure Storage into a disk cache, runs complex queries using DuckDB, and saves the results into a destination bucket, all on a cheap Azure ML notebook. The resulting bucket can then be consumed by Synapse Serverless, PowerBI, notebooks, etc.

Last year, the Azure Synapse team published an excellent article on how to build a lakehouse architecture using Azure Synapse. What I really liked is this diagram: it is very simple and to the point. Notice that I am more interested in the overall Azure ecosystem, so the discussion is vendor neutral. In practical terms, "lakehouse" here means the storage system with an open table format.

Now, if you ask three different people about this diagram, they will probably give you four different answers:

- An old-school dedicated pool professional will argue this is an over-complicated system, and all you need is: source system -> data integration tool -> dedicated pool.
- As I have a soft spot for serverless, ideally I would say: add write capabilities to Serverless and call it a day.
- A Snowflake or Databricks person will argue that one engine should do everything: prepare and serve.
- My colleague, who is an Azure data engineer, will say the whole thing does not make any sense; ADF and SQL Server are all you need.

## Why are you keeping the serverless pool?

The previous diagram assumes a big data workload; Spark has a massive overhead in compute usage and cost, and it does not make much sense for smaller data. What if you have a smaller data size? Can we keep this overall architecture and keep the lower cost? Maybe we can, and I will argue it may even become useful very soon.

I think it is maybe the obvious question: DuckDB is an awesome execution engine, but it is not a client-server database. You can't have a SQL endpoint that you just use to run queries from PowerBI etc.

## PowerBI

There are projects working on that, but they are not ready yet, and even if you find some hacks, implementing governance and access controls on top of them will be non-trivial. The theory is that, because the data is prepared and cleaned at the storage level, PowerBI can just import the Parquet files; a SQL endpoint is not strictly needed.

Ok sweet, so where do you run this Duck thing? Obviously you can use an Azure ML notebook just to do exploratory analysis, but that is not something that makes sense for business intelligence people. Honestly, that was the hardest part: using Synapse compute defeats the purpose, as it is designed for Spark and the minimum is three VMs. It turns out Azure has an amazing Machine Learning service that we can just use for data engineering: you can run a VM with as little as 1 core (8 cents/hour), auto-shutdown is available so you pay only for what you use, and you can schedule jobs. Yes, it is supposed to be for ML, but it works just fine for data engineering jobs.