Python: polars, a successor to the pandas library

终安和

2023-12-01

polars is yet another DataFrame library for Python. Given how dominant pandas is, displacing it is not easy; a newcomer needs real strengths to do so.

1. Largely the same usage

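At first glance the two look like twins: for most everyday operations the usage and interface are very similar. In many simple cases, porting code is mostly a matter of changing the import line, although some calls are spelled differently.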
Installation:

pip install polars

You can also install from a PyPI mirror, which is usually faster.

2. polars has a clear speed advantage

import time
import polars as pl
import pandas as pd

file = r"C:\Users\songroom\Desktop\test_csv.csv"

# Time reading the same GBK-encoded CSV file with each library.
t0 = time.time()
df_1 = pd.read_csv(file, encoding="gbk")
t1 = time.time()
df_2 = pl.read_csv(file, encoding="gbk")  # the keyword is "encoding"
t2 = time.time()
print(f"pandas read_csv  cost time :{t1-t0}  polars read_csv cost time :{t2-t1}")
print(f"df_1 shape :{df_1.shape} df_2 shape : {df_2.shape}")

# Time a plain row-by-row loop over each DataFrame.
t3 = time.time()
for row in df_1.itertuples():  # row[0] is the pandas index
    v0 = row[1]
    v2 = row[2]
t4 = time.time()
for row in df_2.rows():  # rows() returns a list of plain tuples
    v0 = row[1]
    v2 = row[2]
t5 = time.time()

print(f"pandas iterate  cost time :{t4-t3}  polars iterate cost time :{t5-t4}")

pandas read_csv cost time :1.3020009994506836 polars read_csv cost time :0.10900020599365234
df_1 shape :(589680, 14) df_2 shape : (589680, 14)
pandas iterate cost time :1.0449976921081543 polars iterate cost time :1.1010003089904785

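Overall, polars has a clear advantage in I/O: in this run, read_csv was roughly 12 times faster (about 0.11 s versus 1.30 s). For a plain row-by-row loop, polars' rows() and pandas' itertuples() perform about the same.
polars is written in Rust, and its memory model is based on Apache Arrow; the Python package is just a front-end wrapper.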

For more about polars, see its GitHub repository:

https://github.com/pola-rs/polars

For polars performance benchmarks, see:

https://h2oai.github.io/db-benchmark/

3. The polars ecosystem is still young

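pandas has been polished by thousands of contributors over many years and is a very mature product. polars is still a youngster by comparison, but it already covers the commonly needed functionality, which makes it worth a look, especially if pandas feels too slow for you.
At the time of writing, Polars is built on the original Arrow implementation and is preparing to migrate to arrow2, which is expected to be faster still. There is good reason to look forward to what Polars will become.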
