Python for Data Science

和和煦
2023-12-01

在python环境中实现 R for Data Science
https://r4ds.had.co.nz/ 中的5,12,13三章中的功能

1.Data transformation数据转换

参考python dfply文档https://github.com/kieferk/dfply/blob/master/README.md

!pip install dfply
from dfply import *
import numpy as np
import pandas as pd

加载内置的数据集diamonds数据集,数据集共53940行,有carat、cut、color、clarity、depth、table、price、x、y、z共10列,对应每个钻石的一些参数值。

diamonds.head()
caratcutcolorclaritydepthtablepricexyz
00.23IdealESI261.555.03263.953.982.43
10.21PremiumESI159.861.03263.893.842.31
20.23GoodEVS156.965.03274.054.072.31
30.29PremiumIVS262.458.03344.204.232.63
40.31GoodJSI263.358.03354.344.352.75

1.1 Filter rows过滤筛选

# 筛选cut为Ideal记录的前5行
diamonds >> filter_by(X.cut == 'Ideal') >> head()
caratcutcolorclaritydepthtablepricexyz
00.23IdealESI261.555.03263.953.982.43
110.23IdealJVS162.856.03403.933.902.46
130.31IdealJSI262.254.03444.354.372.71
160.30IdealISI262.054.03484.314.342.68
390.33IdealISI261.855.04034.494.512.78

管道函数是将上一步的结果直接传参给下一步的函数,从而省略了中间的赋值步骤,可以大量减少内存中的对象,节省内存。

Python>>
R%>%
# 筛选cut为Ideal且price小于337的记录	
diamonds >> filter_by(X.cut == 'Ideal', X.price < 337)
caratcutcolorclaritydepthtablepricexyz
00.23IdealESI261.555.03263.953.982.43
diamonds >> mask(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)
caratcutcolorclaritydepthtablepricexyz
266830.33IdealESI262.254.04274.444.462.77
322970.34IdealESI262.454.04544.494.522.81
409280.30IdealESI161.654.04994.324.352.67
506230.30IdealESI262.154.04014.324.352.69
506250.30IdealESI262.054.04014.334.352.69

1.2 Arrange rows排列

diamonds >> arrange(X.table, ascending=False) >> head(5)     # ascending(升序)默认为True
caratcutcolorclaritydepthtablepricexyz
249322.01FairFSI158.695.0133878.328.314.87
507730.81FairFSI268.879.023015.265.203.58
513420.79FairGSI165.376.023625.525.133.35
528600.50FairEVS279.073.025795.215.184.09
493750.70FairHVS162.073.021005.655.543.47

1.3 Select columns选择

# 选择cut和price列,取前3行
diamonds >> select(X.cut, 'price', X.x) >> head(3)
cutpricex
0Ideal3263.95
1Premium3263.89
2Good3274.05
diamonds >> select(1, X.price, ['x', 'y']) >> head(2)   # 从钻石数据集中选择“cut”、“price”、“x”和“y”。
cutpricexy
0Ideal3263.953.98
1Premium3263.893.84

starts_with(‘c’) :查找以字符c前缀开头的列;
ends_with(‘c’):查找以字符c后缀结束的列;
contains(‘c’):筛选出包包字符c的列;
everything():所有列。

# 根据参数快速筛选出相应数据
diamonds >> select(starts_with('c')) >> head(3)
caratcutcolorclarity
00.23IdealESI2
10.21PremiumESI1
20.23GoodEVS1

1.4 Add new variables添加新变量

可以使用mutate()函数创建新变量,在一次调用中可创建多个变量

diamonds >> mutate(x_plus_y=X.x + X.y) >> select(columns_from('x')) >> head(3)
xyzx_plus_y
03.953.982.437.93
13.893.842.317.73
24.054.072.318.12
# transmute()函数是 mutate 和已创建变量的选择的组合
diamonds >> transmute(x_plus_y=X.x + X.y, y_div_z=(X.y / X.z)) >> head(3)
x_plus_yy_div_z
07.931.637860
17.731.662338
28.121.761905

1.5 Grouped summaries分组汇总

diamonds >> summarize(price_mean=X.price.mean(), price_std=X.price.std())
price_meanprice_std
03932.7997223989.439738
diamonds >> group_by('cut') >> summarize(price_mean=X.price.mean(), price_std=X.price.std())
cutprice_meanprice_std
0Fair4358.7577643560.386612
1Good3928.8644523681.589584
2Ideal3457.5419703808.401172
3Premium4584.2577044349.204961
4Very Good3981.7598913935.862161
diamonds >> summarize_each([np.mean, np.var], X.price, 'depth')
price_meanprice_vardepth_meandepth_var
03932.7997221.591533e+0761.7494052.052366
diamonds >> group_by(X.cut) >> summarize_each([np.mean, np.var], X.price, 4)
cutprice_meanprice_vardepth_meandepth_var
0Fair4358.7577641.266848e+0764.04167713.266319
1Good3928.8644521.355134e+0762.3658794.705224
2Ideal3457.5419701.450325e+0761.7094010.516274
3Premium4584.2577041.891421e+0761.2646731.342755
4Very Good3981.7598911.548973e+0761.8182751.900466

2.Tidy data数据整理

2.1Pivoting旋转

import pandas as pd
import numpy as np
mydata=pd.DataFrame({
"Name":["苹果","谷歌","脸书","亚马逊","腾讯"],
"Conpany":["Apple","Google","Facebook","Amozon","Tencent"],
"Sale2013":[5000,3500,2300,2100,3100],
"Sale2014":[5050,3800,2900,2500,3300],
"Sale2015":[5050,3800,2900,2500,3300],
"Sale2016":[5050,3800,2900,2500,3300]
       })
mydata
NameConpanySale2013Sale2014Sale2015Sale2016
0苹果Apple5000505050505050
1谷歌Google3500380038003800
2脸书Facebook2300290029002900
3亚马逊Amozon2100250025002500
4腾讯Tencent3100330033003300

python中melt函数(数据宽转长)与R中pivot_longer函数类似,通过减少列并将它们转换为值来延长数据框

mydata1=mydata.melt(id_vars=["Name","Conpany"], # 要保留的主字段
                    var_name="Year", # 拉长的分类变量
                    value_name="Sale" ) # 拉长的度量值名称
mydata1
NameConpanyYearSale
0苹果AppleSale20135000
1谷歌GoogleSale20133500
2脸书FacebookSale20132300
3亚马逊AmozonSale20132100
4腾讯TencentSale20133100
5苹果AppleSale20145050

R中pivot_wider函数实现的功能可用python中pivot_table函数(数据长转宽)代替,跨多列传播键值对

mydata1.pivot_table(index=["Name","Conpany"],    #行索引(可以使多个类别变量)
                    columns=["Year"],                   #列索引(可以使多个类别变量)
                    values=["Sale"])                       #值(一般是度量指标)
Sale
YearSale2013Sale2014Sale2015Sale2016
NameConpany
亚马逊Amozon2100250025002500
脸书Facebook2300290029002900
腾讯Tencent3100330033003300
苹果Apple5000505050505050
谷歌Google3500380038003800

也可使用plydata库进行数据整理

python中的plydata库基于 R 中的 dplyr、tidyr 和 forcats 包,许多函数名称都是直接借用过来的

https://plydata.readthedocs.io/en/stable/api.html.

2.2 Separating and uniting分离与融合

# separate()通过在出现分隔符的地方进行拆分,将一列拆分为多列
df = pd.DataFrame({'alpha': 1,'x': ['a,1', 'b,2', 'c,3'],'zeta': 6})
print(df)
df >> separate('x', into=['A', 'B'], remove=False)
alphaxzetaAB
01a,16a1
11b,26b2
21c,36c3
# unite()将多列合并为一列
df = pd.DataFrame({'c1': [1, 2, 3, 4, None],'c2': list('abcde'),'c3': list('vwxyz')})
df >> unite('c1c2', 'c1', 'c2', na_rm=True)  

[‘c1’, ‘c2’] _ True maintain

c3c1c2
0v1.0_a
1w2.0_b
2x3.0_c
3y4.0_d
4zNaN

2.3 Missing values缺失值

df
c1c2c3
01.0av
12.0bw
23.0cx
34.0dy
4NaNez
# 查看缺失值
print(df.isnull())
print(df.isnull().sum())

删除缺失值可用df.dropna(),也可以给这个方法传入how="all"参数,只有在整行为空的前提下才删除

df.fillna('*')   # 缺失值填充
c1c2c3
01.0av
12.0bw
23.0cx
34.0dy
4*ez

3.Relational data关系数据

3.1 pd.merge(),pd.concat()

pd.merge()官方文档
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html?highlight=merge#pandas.merge.

df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
print(df1)
print(df2)
ab
0foo1
1bar2
df1.merge(df2, how='inner', on='a')
abc
0foo13
df1.merge(df2, how='cross')
a_xba_yc
0foo1foo3
1foo1baz4
2bar2foo3
3bar2baz4
df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                   columns=['letter', 'number', 'animal'])
pd.concat([df1, df3], join="inner")
letternumber
0a1
1b2
0c3
1d4

3.2运用dfply库中的join系列函数

a = pd.DataFrame({'x1':['A','B','C'],'x2':[1,2,3]})
b = pd.DataFrame({'x1':['A','B','D'],'x3':[True,False,True]})

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-HDlnjDWV-1639040492974)(attachment:image.png)]

a >> inner_join(b, by='x1') # 内连接
a >> outer_join(b, by='x1') # 外连接
a >> left_join(b, by='x1')
a >> right_join(b, by='x1')
a >> semi_join(b, by='x1') # 半连接
a >> anti_join(b, by='x1') # 自然连接
x1x2
2C3
 类似资料:

相关阅读

相关文章

相关问答