support bulk copy for mssql and synapse #1234
@rudolfix Do we also want to do this for Synapse? "While dedicated SQL pools support many loading methods, including popular SQL Server options such as bcp and the SqlBulkCopy API, the fastest and most scalable way to load data is through PolyBase external tables and the COPY statement. With PolyBase and the COPY statement, you can access external data stored in Azure Blob storage or Azure Data Lake Store via the T-SQL language. For the most flexibility when loading, we recommend using the COPY statement." Source: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/design-elt-data-loading
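For reference, a minimal sketch of what the COPY-statement route could look like from Python, assuming pyodbc, a csv already staged in Azure Blob storage, and placeholder names throughout (server, database, table, container, and SAS token are all made up):

```python
# Hedged sketch: load a staged csv into a Synapse dedicated SQL pool with the
# COPY statement via pyodbc. All identifiers and credentials are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=mydb;UID=user;PWD=secret"
)
copy_sql = """
COPY INTO dbo.my_table
FROM 'https://myaccount.blob.core.windows.net/staging/my_table.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIRSTROW = 2,  -- skip the header row
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>')
)
"""
cur = conn.cursor()
cur.execute(copy_sql)  # the load runs server-side; no data flows through the client
conn.commit()
```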
@jorritsandbrink It just may work for Synapse if we implement it, i.e. via csv files + bcp, but it is not required. The reason for this ticket is the abysmal insert performance of SQL Server: it is WAY slower than Postgres. Nevertheless, the priority of this ticket is quite low (at least for now).
Chiming in here; one of the big holdbacks for shifting to dlt is the abysmal ODBC performance compared to bcp. Some of our sources are tall and wide, so bcp helps a ton with performance; if this comes to be, I would be thrilled! bcp does require more consideration of the nuances of the data (datetime/floats/column ordering/etc.) than sqlalchemy, though.
Hi @rudolfix,
I could create an sqlalchemy engine like this (see the sketch below). The post here is somewhat old and the current pyodbc version doesn't need to intercept events, but you can see the possible performance gain compared to other approaches. Meanwhile I also understand that using prepared statements alone is no improvement, as the number of network round trips stays the same. fast_executemany will also batch inserts, but I guess it uses a real arrayed insert, maybe even bypassing transaction management (thus autocommit needs to be disabled), like real bulk loading would.
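A minimal sketch of that engine setup, assuming SQLAlchemy 1.4+ with pyodbc; the connection string and the table are placeholders:

```python
# Hedged sketch: pyodbc's fast_executemany through SQLAlchemy. With the flag set,
# executemany-style inserts are sent as an ODBC parameter array instead of one
# round trip per row. Connection details and the table below are placeholders.
import sqlalchemy as sa

engine = sa.create_engine(
    "mssql+pyodbc://user:secret@myserver/mydb"
    "?driver=ODBC+Driver+18+for+SQL+Server",
    fast_executemany=True,  # passed through to the pyodbc cursor
)

metadata = sa.MetaData()
events = sa.Table(
    "events", metadata,
    sa.Column("id", sa.Integer),
    sa.Column("payload", sa.String(100)),
)
metadata.create_all(engine)

rows = [{"id": i, "payload": f"row-{i}"} for i in range(100_000)]

with engine.begin() as conn:  # explicit transaction; autocommit stays off
    conn.execute(events.insert(), rows)  # executed as one batched insert
```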
Hi @rudolfix |
I'll add here that there is still going to be a performance limitation for large datasets with ODBC vs the bulk copy tool. It would be interesting to see performance comparisons for a larger insert (1 million rows).
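If someone wants to run that comparison, a minimal timing harness could look like the sketch below; `load_via_odbc` and `load_via_bcp` are hypothetical callables standing in for the two loading paths discussed in this thread:

```python
# Hedged sketch of a timing harness for the 1M-row comparison. `load_via_odbc`
# and `load_via_bcp` are hypothetical stand-ins for the two loading paths and
# would need to be implemented against a real test table.
import time

def benchmark(name, load_fn, rows):
    start = time.perf_counter()
    load_fn(rows)  # run the loader on the same dataset
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(rows):,} rows in {elapsed:.1f}s "
          f"({len(rows) / elapsed:,.0f} rows/s)")

rows = [(i, f"payload-{i}") for i in range(1_000_000)]
# benchmark("pyodbc fast_executemany", load_via_odbc, rows)
# benchmark("bcp", load_via_bcp, rows)
```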
Background
The way we insert rows into mssql is inefficient. We should switch to bulk copy. MS ODBC drivers come with a bcp command that we can use. https://github.com/yehoshuadimarsky/bcpandas/blob/master/bcpandas/utils.py is a project that does that quite well.
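A rough sketch of that bcpandas-style approach, assuming the bcp utility from mssql-tools is on PATH; the table name, server, and credentials are placeholders:

```python
# Hedged sketch: write rows to a csv and hand the file to the bcp command-line
# utility (similar in spirit to bcpandas' utils). All connection details are
# placeholders; error handling and delimiter quoting are left out.
import csv
import subprocess
import tempfile

rows = [(i, f"payload-{i}") for i in range(1_000_000)]

# 1. spill the rows to a temporary csv file
with tempfile.NamedTemporaryFile(
    "w", newline="", suffix=".csv", delete=False
) as tmp:
    csv.writer(tmp).writerows(rows)
    csv_path = tmp.name

# 2. let bcp stream the file into the target table in character mode (-c),
#    with a comma field terminator (-t) and a batch size of 100k rows (-b)
subprocess.run(
    [
        "bcp", "dbo.events", "in", csv_path,
        "-S", "myserver", "-d", "mydb", "-U", "user", "-P", "secret",
        "-c", "-t", ",", "-b", "100000",
    ],
    check=True,
)
```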
Tasks
- extend mssql and synapse to handle csv
- use bcp in copy jobs