
support bulk copy for mssql and synapse #1234

Open
rudolfix opened this issue Apr 17, 2024 · 6 comments
Labels
destination: Issue related to new destinations
support: This issue is monitored by Solution Engineer

Comments

@rudolfix
Collaborator

rudolfix commented Apr 17, 2024

Background
The way we insert rows into mssql is inefficient. We should switch to bulk copy. The Microsoft ODBC drivers ship with the bcp command-line utility that we can use.
https://github.com/yehoshuadimarsky/bcpandas/blob/master/bcpandas/utils.py
is a project that does this quite well

Tasks

    • allow mssql and synapse to handle csv
    • use bcp in copy jobs
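A rough sketch of what a bcp-based copy job could look like (the table name, CSV path, server, and credentials below are made up for illustration; the actual integration with dlt copy jobs would look different):

import subprocess

# hypothetical example: load a CSV file written by the load job into the
# target table using the bcp command-line utility from the MS SQL tools
subprocess.run(
    [
        "bcp", "my_dataset.my_table", "in", "/tmp/my_table.csv",
        "-S", "my-server.database.windows.net",
        "-d", "my_database",
        "-U", "loader", "-P", "secret",
        "-c",            # character mode
        "-t", ",",       # field terminator
        "-F", "2",       # skip the header row, if the CSV has one
        "-b", "50000",   # rows per batch
    ],
    check=True,
)

This is roughly the approach bcpandas takes in the utils module linked above: write the data to a temporary CSV and shell out to bcp.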
@rudolfix rudolfix added the destination Issue related to new destinations label Apr 17, 2024
@jorritsandbrink
Collaborator

@rudolfix Do we also want to do this for synapse? For synapse we have already implemented SynapseCopyFileLoadJob based on COPY INTO, which is the recommended approach:

"While dedicated SQL pools support many loading methods, including popular SQL Server options such as bcp and the SqlBulkCopy API, the fastest and most scalable way to load data is through PolyBase external tables and the COPY statement.

With PolyBase and the COPY statement, you can access external data stored in Azure Blob storage or Azure Data Lake Store via the T-SQL language. For the most flexibility when loading, we recommend using the COPY statement."

Source: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/design-elt-data-loading
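For context, a minimal sketch of the COPY INTO pattern described in that doc, issued from Python via pyodbc (the table, storage URL, SAS token, and connection string are placeholders, not dlt's actual implementation):

import pyodbc

# placeholders only; a real load job would take these from configuration
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=my-workspace.sql.azuresynapse.net;Database=my_pool;UID=loader;PWD=secret"
)
copy_sql = """
COPY INTO my_schema.my_table
FROM 'https://myaccount.blob.core.windows.net/my-container/my_table.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIRSTROW = 2,
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>')
)
"""

with pyodbc.connect(conn_str, autocommit=True) as cnxn:
    cnxn.execute(copy_sql)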

@rudolfix
Collaborator Author

@jorritsandbrink it may just work for synapse if we implement it (i.e. via csv files + bcp), but it is not required there. The reason for the ticket is the abysmal insert performance of SQL Server: it is WAY slower than postgres. Nevertheless, the priority of this ticket is quite low (at least for now).

@cody-scott

Chiming in here; one of the big holdbacks to shifting to dlt is the abysmal ODBC insert performance compared to bcp.

Some of our sources are tall and wide, so bcp helps a ton with performance; if this comes to be, I would be thrilled!

bcp does require more consideration of the nuances of the data, though (datetime/floats/column ordering/etc.), than sqlalchemy.

@bhuesemann

Hi @rudolfix,
for an mssql destination: apart from loading in parallel/batches and the current approach of arrayed inserts, or using bcp for bulk loading, would it be possible to also use single prepared/parameterized inserts together with the fast_executemany driver option? This would leave the handling of arrayed inserts to the driver and may lead to better bulk loading performance.
Something like this:

import pyodbc

# read the source rows (assumes connection_string is defined elsewhere)
cnxn = pyodbc.connect(connection_string, autocommit=True)
crsr = cnxn.cursor()
data = crsr.execute("SELECT * FROM MillionRows").fetchall()
crsr.close()
printmem("data loaded")  # helper that prints memory usage, defined elsewhere

# insert with fast_executemany, which batches parameters into arrayed inserts
crsr = cnxn.cursor()
ins = "INSERT INTO #tmp (ID, TextField, LongIntegerField, DoubleField, varchar_column) VALUES (?,?,?,?,?)"
crsr.fast_executemany = True
crsr.executemany(ins, data)
crsr.close()

I could create an sqlalchemy engine like this:
engine = create_engine(destination_conn, fast_executemany=True)
But how do I get this configuration into a dlt destination? I can't find a way to inject it into a created destination:
destination = dlt.destinations.mssql(credentials)
destination.spec does not seem to be the right way to pass SQLAlchemy engine parameters.

The post here is somewhat old and the current pyodbc version doesn't need to intercept events:
https://medium.com/analytics-vidhya/speed-up-bulk-inserts-to-sql-db-using-pandas-and-python-61707ae41990

But you can see the possible performance gain compared to other approaches. Meanwhile I also understand that using prepared statements alone is no improvement, as the number of network round trips stays the same. So fast_executemany will also batch inserts, but I guess it will use a real arrayed insert, maybe even bypassing transaction management (thus autocommit needs to be disabled) like real bulk loading would.

@burnbay

burnbay commented Feb 4, 2025

Hi @rudolfix
Would really like to see @bhuesemann's suggestion investigated.

@cody-scott

I'll add here that there is still going to be a performance limitation for large datasets with ODBC compared to the bulk copy tool.

It would be interesting to see performance comparisons for a larger insert (1 million rows).
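For anyone who wants to try that comparison, a rough sketch under made-up assumptions (placeholder connection string and credentials, a pre-created bench_table with matching columns, and enough memory to hold the rows):

import csv
import subprocess
import time

import pyodbc

# placeholders, not real infrastructure
CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=my-server;Database=my_db;UID=loader;PWD=secret"
ROWS = [(i, f"text-{i}", i * 0.5) for i in range(1_000_000)]

# 1) fast_executemany: the driver sends the parameters as arrayed inserts
with pyodbc.connect(CONN_STR, autocommit=False) as cnxn:
    crsr = cnxn.cursor()
    crsr.fast_executemany = True
    start = time.monotonic()
    crsr.executemany("INSERT INTO bench_table (id, txt, val) VALUES (?, ?, ?)", ROWS)
    cnxn.commit()
    print("fast_executemany:", time.monotonic() - start, "s")

# 2) bcp: write a CSV (no header) and shell out to the bulk copy tool
with open("/tmp/bench_table.csv", "w", newline="") as f:
    csv.writer(f).writerows(ROWS)
start = time.monotonic()
subprocess.run(
    ["bcp", "my_db.dbo.bench_table", "in", "/tmp/bench_table.csv",
     "-S", "my-server", "-U", "loader", "-P", "secret",
     "-c", "-t", ",", "-b", "100000"],
    check=True,
)
print("bcp:", time.monotonic() - start, "s")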

@VioletM VioletM added the support This issue is monitored by Solution Engineer label Feb 14, 2025