Python — ETL and yield

There are a lot of cases when you need move your databse data from one place to another. When it comes to move small amount of data. This operation is pretty simple. But when you want to copy, for example >10 000 000 rows. You might encounquer problems with RAM capacity.

We will copy data between two SQL Server tables: dbo.Badges_Source and dbo.Badges_Destination. Both tables are based on dbo.Badges table from StackOverflow database. We will use python and pyodbc package. Code on my github repository:

Let’s assume that we want to copy data in a ‘2018–01–01’ and ‘2019–12–01’ [Date] range. There is 8 473 372 records. Copying all the data takes more than 3.7 GB of RAM. 3.7 GB for nearly 10 000 000 with so simple table structure isn’t good score.

In this moment we can use yield function.

range_slicer() function retrieves records from source table in batches. yield returns 1 000 000 row list . As you see above. We iterate through range_slicer() and we retrieve and save one batch at the time. Helpful is T-SQL OFFSET FETCH NEXT clause.

Entire process took 0,5 GB of RAM.

--

--

--

Data | BI Engineer (Microsoft Stack)

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

An Introduction to Monkey D. App

DMV for Log Statistics in SQL Server

What is Active Directory

Avoid Payday Advances if you can

The mighty “Cytools” error at Web3 smart contract development with Brownie

LeetCode 33. Search in Rotated Sorted Array — Python Solution

[LeetCode]#897. Increasing Order Search Tree

Android Test Automation with Espresso Testing Framework

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Michal Molka

Michal Molka

Data | BI Engineer (Microsoft Stack)

More from Medium

Analyze Airbnb Data With Python

Pandas — Replacing NULL to ‘0’

Combine Multiple Logs Files Using Pandas

Combine Multiple Logs Files Using Pandas

How to Connect Python with SQL Server Database Using Pyodbc