Support for Synapse Spark workloads? #27
Hi @ernestoongaro, did you have success with this? I'm looking for the same thing: I would like to use DBT on Synapse Spark pools (not Fabric).
I recently prototyped a DBT adapter for the Synapse Spark pool. It is a fork of dbt-spark, but the extended part is written from scratch without reusing Cloudera's code. It currently works with OSS Apache Livy (convenient for local tests) and the Synapse Spark pool, and I think it would not be challenging to adapt to HDInsight and Fabric as well. Unfortunately, I cannot share the source code for now, as this is an internal POC in our company. However, I can share some ideas based on this experience.

My short answer is "not so easy". The main challenge was the slowness of session creation in the Synapse Spark pool. As you also experienced, it is really slow. On the other hand, DBT opens and closes a DB connection for every task. Therefore, we had to find a way to (1) reuse the same Spark session across all tasks in a DBT run and (2) prevent zombie sessions that are not properly closed after the DBT run.

My current strategy is "don't create (if it exists), don't stop." When establishing a DB connection, my prototype lists the available sessions in a specified Spark pool and tries to find an available session with the name defined in profiles.yml. If such a session exists, the adapter uses it; otherwise, it creates a new one with that name. My prototype intentionally does not stop sessions after a DBT run. The remaining session will be reused by the next run or will automatically stop after the idle timeout. (A rough sketch of this lookup is below.)

Another challenge is the latency of the Livy API. Synapse's Livy endpoint imposes several seconds of latency between executing each Livy statement. Due to this latency, the current version takes more than 60 seconds (with 4 threads, not including time for session spin-up) to build the Jaffle Shop example, compared to 20 seconds with local Spark+Livy. I do not know how to reduce this latency (multithreading helped, but not enough). I will test the prototype with bigger use cases and hopefully share more insights.
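Since the source is not shared, here is only a minimal sketch of how the "don't create (if it exists), don't stop" lookup might look against a plain Livy REST endpoint. `LIVY_URL`, `SESSION_NAME`, and the reusable-state set are placeholder assumptions, not the prototype's actual code; Synapse exposes the same API behind its own URL scheme and AAD authentication, which is omitted here:

```python
import requests

LIVY_URL = "http://localhost:8998"    # assumption: OSS Livy for local tests
SESSION_NAME = "dbt_session"          # assumption: name taken from profiles.yml
# States in which an existing session is worth reusing (or waiting for).
REUSABLE_STATES = {"not_started", "starting", "idle", "busy"}

def get_or_create_session() -> int:
    """Return the id of a live session named SESSION_NAME, creating one if needed."""
    sessions = requests.get(f"{LIVY_URL}/sessions").json().get("sessions", [])
    for s in sessions:
        if s.get("name") == SESSION_NAME and s.get("state") in REUSABLE_STATES:
            return s["id"]            # reuse the existing session
    # No usable session: create one. Note there is deliberately no DELETE
    # call anywhere -- the idle timeout is left to reap the session later.
    resp = requests.post(
        f"{LIVY_URL}/sessions",
        json={"kind": "sql", "name": SESSION_NAME},  # "sql" kind needs Livy >= 0.5
    )
    resp.raise_for_status()
    return resp.json()["id"]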
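The latency point also becomes clearer with a sketch of how a single statement runs over Livy: one POST to submit, then repeated GETs to poll. Every round trip contributes to the per-statement overhead described above, which is why tightening the polling interval does not help once the endpoint itself adds seconds per statement. This continues the sketch above (`LIVY_URL` as defined there):

```python
import time
import requests

def run_statement(session_id: int, sql: str) -> dict:
    """Submit one SQL statement to a Livy session and poll until it settles."""
    resp = requests.post(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        json={"kind": "sql", "code": sql},
    )
    resp.raise_for_status()
    stmt_id = resp.json()["id"]
    while True:
        stmt = requests.get(
            f"{LIVY_URL}/sessions/{session_id}/statements/{stmt_id}"
        ).json()
        if stmt["state"] in ("available", "error", "cancelled"):
            return stmt
        time.sleep(1)  # polling interval; irrelevant if the endpoint itself
                       # imposes seconds of latency per statement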
Hello! Does this adapter work only with the Fabric Livy endpoints, or would it also work with Spark Livy endpoints?