Xtcworld

mssql-python Now Supports Zero-Copy Arrow Data Transfer from SQL Server

mssql-python now supports Apache Arrow, enabling zero-copy data transfer from SQL Server to Arrow-native tools like Polars and Pandas, with speed and memory benefits.

Xtcworld · 2026-05-20 04:52:00 · Data Science

Fetching large datasets from SQL Server has historically meant creating millions of Python objects—one per cell—only to convert them into a columnar DataFrame. The new Apache Arrow support in mssql-python eliminates this overhead by allowing direct transfer of data as Arrow structures. This means faster, more memory-efficient data pipelines for Polars, Pandas, DuckDB, and other Arrow-native tools. Below, we answer common questions about this feature.

What is the key change in mssql-python that improves data fetching?

The primary change is that mssql-python now supports the Apache Arrow C Data Interface. Previously, fetching a million rows from SQL Server required creating a million Python objects, each with garbage-collector overhead. Now, the entire data-fetch loop runs in C++ and writes values directly into Arrow buffers. The result: zero-copy data exchange. A Polars DataFrame, for example, receives a pointer to that memory and can start processing instantly, without any intermediate Python object creation. This architectural shift makes mssql-python a high-throughput bridge between SQL Server and the Arrow ecosystem.

mssql-python Now Supports Zero-Copy Arrow Data Transfer from SQL Server
Source: devblogs.microsoft.com

How does Apache Arrow eliminate Python object creation for each row?

Apache Arrow uses a columnar in-memory format. Instead of storing a table as a list of rows (each row being a collection of Python objects), Arrow stores all values for a column contiguously in a typed buffer. Null values are tracked via a compact bitmap, not per-cell None objects. With mssql-python's new feature, the driver writes directly into these typed buffers while still in C++ code, so no Python objects are ever allocated for individual row values. The DataFrame library then reads the same memory via a pointer. This means that a Polars pipeline reading from mssql-python never needs to materialize intermediate Python objects at any stage—filters, joins, and aggregations all operate in-place on the Arrow buffers.

What is the Arrow C Data Interface and why is it important?

The Arrow C Data Interface is Apache Arrow's cross-language ABI (Application Binary Interface). It defines a stable shared-memory layout that any language can produce or consume by exchanging a single memory pointer—no serialization, no copies, no re-parsing. This is the foundation of zero-copy interoperability. For example, a C++ database driver and a Python DataFrame library can work on the exact same memory without either knowing about the other's internals. By adopting this interface, mssql-python gains the ability to transfer data to any Arrow-native consumer (Polars, DuckDB, Pandas with ArrowDtype, and others) with zero overhead. It's a standard that makes high-performance data pipelines possible across language boundaries.

What are the concrete benefits for users of mssql-python with Arrow support?

Users gain three major benefits: speed, lower memory usage, and seamless interoperability.

mssql-python Now Supports Zero-Copy Arrow Data Transfer from SQL Server
Source: devblogs.microsoft.com
  • Speed: The columnar fetch path avoids Python object creation per row, making fetching noticeably faster for many SQL Server types, especially temporal types like DATETIME and DATETIMEOFFSET, where Python-side per-value conversions are eliminated.
  • Lower memory usage: A column of one million integers is a single contiguous C array, not a million individual Python objects. Null values use a bitmap, not per-cell objects.
  • Seamless interoperability: Polars, Pandas (via ArrowDtype), DuckDB, Hugging Face datasets, and other Arrow-native tools can consume the data without any intermediate format conversion.

How does Arrow's columnar format reduce memory usage?

Traditional row-oriented fetching creates one Python object per cell—for a table with 10 columns and 1 million rows, that's 10 million objects plus the list overhead. Arrow stores data per column as a contiguous typed buffer. For example, an integer column is one C array of 4-byte ints. Nulls are tracked with a compact bitmap (e.g., 1 bit per value). This eliminates the per-object memory overhead (Python objects are typically 28+ bytes each) and the garbage-collector pressure of allocating and collecting millions of small objects. The result is dramatically lower peak memory usage, especially for wide tables or high-cardinality data. The memory savings compound when downstream operations like filtering or aggregation also run in-place on the same Arrow buffers.

Who contributed this feature to mssql-python?

The Apache Arrow support in mssql-python was contributed by community developer Felix Graßl (@ffelixg). The mssql-python team reviewed and integrated his work, and we are thrilled to ship it as part of the official release. Felix's contribution demonstrates how community involvement can bring significant performance improvements to open-source database drivers.

Recommended