
support bulk copy for mssql and synapse #1234

Open
rudolfix opened this issue Apr 17, 2024 · 6 comments
Labels
destination: Issue related to new destinations
support: This issue is monitored by Solution Engineer

Comments

@rudolfix
Collaborator

rudolfix commented Apr 17, 2024

Background
The way we insert rows into mssql is inefficient. We should switch to bulk copy. The Microsoft ODBC drivers ship with the bcp command-line utility that we can use.
https://github.com/yehoshuadimarsky/bcpandas/blob/master/bcpandas/utils.py
is a project that does this quite well

Tasks

    • allow mssql and synapse to handle csv
    • use bcp in copy jobs
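A rough sketch of what a bcp-based copy job could look like (the table name, CSV path, server, and credentials below are made up for illustration; the actual integration with dlt copy jobs would look different):

import subprocess

# hypothetical example: load a CSV file written by the load job into the
# target table using the bcp command-line utility from the MS SQL tools
subprocess.run(
    [
        "bcp", "my_dataset.my_table", "in", "/tmp/my_table.csv",
        "-S", "my-server.database.windows.net",
        "-d", "my_database",
        "-U", "loader", "-P", "secret",
        "-c",            # character mode
        "-t", ",",       # field terminator
        "-F", "2",       # skip the header row, if the CSV has one
        "-b", "50000",   # rows per batch
    ],
    check=True,
)

This is roughly the approach bcpandas takes in the utils module linked above: write the data to a temporary CSV and shell out to bcp.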
@rudolfix rudolfix added the destination Issue related to new destinations label Apr 17, 2024
@jorritsandbrink
Collaborator

@rudolfix Do we also want to do this for synapse? For synapse we have already implemented SynapseCopyFileLoadJob based on COPY INTO, which is the recommended approach:

"While dedicated SQL pools support many loading methods, including popular SQL Server options such as bcp and the SqlBulkCopy API, the fastest and most scalable way to load data is through PolyBase external tables and the COPY statement.

With PolyBase and the COPY statement, you can access external data stored in Azure Blob storage or Azure Data Lake Store via the T-SQL language. For the most flexibility when loading, we recommend using the COPY statement."

Source: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/design-elt-data-loading
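For context, a minimal sketch of the COPY INTO pattern described in that doc, issued from Python via pyodbc (the table, storage URL, SAS token, and connection string are placeholders, not dlt's actual implementation):

import pyodbc

# placeholders only; a real load job would take these from configuration
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=my-workspace.sql.azuresynapse.net;Database=my_pool;UID=loader;PWD=secret"
)
copy_sql = """
COPY INTO my_schema.my_table
FROM 'https://myaccount.blob.core.windows.net/my-container/my_table.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIRSTROW = 2,
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>')
)
"""

with pyodbc.connect(conn_str, autocommit=True) as cnxn:
    cnxn.execute(copy_sql)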

@rudolfix
Collaborator Author

@jorritsandbrink it may just work for synapse if we implement it (i.e. via csv files + bcp), but it is not required there. The reason for the ticket is the abysmal insert performance of SQL Server: it is WAY slower than postgres. Nevertheless, the priority of this ticket is quite low (at least for now).

@cody-scott

Chiming in here; one of the big holdbacks to shifting to dlt is the abysmal ODBC insert performance compared to bcp.

Some of our sources are tall and wide, so bcp helps a ton with performance; if this comes to be, I would be thrilled!

bcp does require more consideration of the nuances of the data, though (datetime/floats/column ordering/etc.), than sqlalchemy.

@bhuesemann

Hi @rudolfix,
for an mssql destination: apart from loading in parallel/batches and the current approach of arrayed inserts, or using bcp for bulk loading, would it be possible to also use single prepared/parameterized inserts together with the fast_executemany driver option? This would leave the handling of arrayed inserts to the driver and may lead to better bulk loading performance.
Something like this:

import pyodbc

# read the source rows (assumes connection_string is defined elsewhere)
cnxn = pyodbc.connect(connection_string, autocommit=True)
crsr = cnxn.cursor()
data = crsr.execute("SELECT * FROM MillionRows").fetchall()
crsr.close()
printmem("data loaded")  # helper that prints memory usage, defined elsewhere

# insert with fast_executemany, which batches parameters into arrayed inserts
crsr = cnxn.cursor()
ins = "INSERT INTO #tmp (ID, TextField, LongIntegerField, DoubleField, varchar_column) VALUES (?,?,?,?,?)"
crsr.fast_executemany = True
crsr.executemany(ins, data)
crsr.close()

I could create an sqlalchemy engine like this:
engine = create_engine(destination_conn, fast_executemany=True)
But how do I get this configuration into a dlt destination? I can't find a way to inject it into a created destination:
destination = dlt.destinations.mssql(credentials)
destination.spec does not seem to be the right way to pass SQLAlchemy engine parameters.

The post here is somewhat old and the current pyodbc version doesn't need to intercept events:
https://medium.com/analytics-vidhya/speed-up-bulk-inserts-to-sql-db-using-pandas-and-python-61707ae41990

But you can see the possible performance gain compared to other approaches. Meanwhile I also understand that using prepared statements alone is no improvement, as the number of network round trips stays the same. So fast_executemany will also batch inserts, but I guess it will use a real arrayed insert, maybe even bypassing transaction management (thus autocommit needs to be disabled) like real bulk loading would.

@burnbay

burnbay commented Feb 4, 2025

Hi @rudolfix
Would really like to see @bhuesemann's suggestion investigated.

@cody-scott

I'll add here that there is still going to be a performance limitation for large datasets with ODBC compared to the bulk copy tool.

It would be interesting to see performance comparisons for a larger insert (1 million rows).
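For anyone who wants to try that comparison, a rough sketch under made-up assumptions (placeholder connection string and credentials, a pre-created bench_table with matching columns, and enough memory to hold the rows):

import csv
import subprocess
import time

import pyodbc

# placeholders, not real infrastructure
CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=my-server;Database=my_db;UID=loader;PWD=secret"
ROWS = [(i, f"text-{i}", i * 0.5) for i in range(1_000_000)]

# 1) fast_executemany: the driver sends the parameters as arrayed inserts
with pyodbc.connect(CONN_STR, autocommit=False) as cnxn:
    crsr = cnxn.cursor()
    crsr.fast_executemany = True
    start = time.monotonic()
    crsr.executemany("INSERT INTO bench_table (id, txt, val) VALUES (?, ?, ?)", ROWS)
    cnxn.commit()
    print("fast_executemany:", time.monotonic() - start, "s")

# 2) bcp: write a CSV (no header) and shell out to the bulk copy tool
with open("/tmp/bench_table.csv", "w", newline="") as f:
    csv.writer(f).writerows(ROWS)
start = time.monotonic()
subprocess.run(
    ["bcp", "my_db.dbo.bench_table", "in", "/tmp/bench_table.csv",
     "-S", "my-server", "-U", "loader", "-P", "secret",
     "-c", "-t", ",", "-b", "100000"],
    check=True,
)
print("bcp:", time.monotonic() - start, "s")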

@VioletM VioletM added the support This issue is monitored by Solution Engineer label Feb 14, 2025