Python — ETL and yield

There are a lot of cases when you need move your databse data from one place to another. When it comes to move small amount of data. This operation is pretty simple. But when you want to copy, for example >10 000 000 rows. You might encounquer problems with RAM capacity.

We will copy data between two SQL Server tables: dbo.Badges_Source and dbo.Badges_Destination. Both tables are based on dbo.Badges table from StackOverflow database. We will use python and pyodbc package. Code on my github repository:

Let’s assume that we want to copy data in a ‘2018–01–01’ and ‘2019–12–01’ [Date] range. There is 8 473 372 records. Copying all the data takes more than 3.7 GB of RAM. 3.7 GB for nearly 10 000 000 with so simple table structure isn’t good score.

In this moment we can use yield function.

range_slicer() function retrieves records from source table in batches. yield returns 1 000 000 row list . As you see above. We iterate through range_slicer() and we retrieve and save one batch at the time. Helpful is T-SQL OFFSET FETCH NEXT clause.

Entire process took 0,5 GB of RAM.

BI Developer (Microsoft Stack)