Python和NumPy索引运算符"[]“和属性运算符”."可以快速轻松地访问Pandas数据结构。由于要访问的数据类型不是预先知道的,直接使用标准运算符具有一些限制。
序号 | 索引 | 描述 |
---|---|---|
1 | .loc() | 基于标签 |
2 | .iloc() | 基于整数 |
.loc()主要基于标签(label)的,包括行标签(index)和列标签(colums),即行名称和列名称,可以使用def.loc[index_name, col_name]选择指定位置的数据,主要用法有:
1)单个标量标签,如果.loc中只有单个标签,即选择的是一行。例如:df.loc[‘a’]选择的是 index 为’a’的一样。
2) 标签列表,例如:df.loc[[‘a’, ‘b’, ‘c’]],同样只选择行。 切片对象,与通常的 python
3) 切片不同,在最终选择的数据数据中包含切片的 start 和 stop。例如:df.loc[‘a’ : ‘z’]
4)即包含’c’行,也包含’h’行。 布尔数组,用于筛选符合某些条件的行,例如:df.loc[df.A>0.5]
5)筛选出所有’A’列大于0.5的行。
.loc()需要两个单/列表/范围运算符,用","分隔。第一个表示行,第二个表示列。
代码如下:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 5),
index = ['h','e','l','l','o','w','o','r','l','d'], columns = ['H', 'E', 'L', 'L','O'])
print ("df.loc[:,'H']=",df.loc[:,'H'])
# df.loc[:,'H']= h 0.864051
# e 1.515776
# l -1.290521
# l 2.489518
# o 1.144873
# w 0.322974
# o 0.531702
# r -0.304988
# l 0.544344
# d -0.247838
# Name: H, dtype: float64
print("df.loc[:,['H','O']=",df.loc[:,['H','O']])
# df.loc[:,['H','O']= H O
# h 2.141714 1.318901
# e -0.396920 -0.212180
# l 0.417706 0.316227
# l -2.051733 1.599844
# o 0.123853 1.643608
# w 0.078310 -0.686108
# o -0.169935 -0.184962
# r -0.578558 -0.633589
# l 1.322215 -1.424611
# d 1.875035 0.760358
print("df.loc[['h','e','w','d'],['H','E']]=",df.loc[['h','e','w','d'],['H','E']])
# df.loc[['h','e','w','d'],['H','E']]= H E
# h 1.009491 -1.231572
# e -1.973212 -0.114849
# w 1.004347 -1.331358
# d -0.001212 -0.022076
print("df.loc['h':'d']=",df.loc['h':'d'])
# df.loc['h':'d']= H E L L O
# h -0.135251 1.958364 -0.246656 1.338530 0.416838
# e 0.029605 -0.465326 1.904362 0.600459 -0.647927
# l -0.051904 -0.989360 -0.020527 -1.273677 1.065701
# l -1.220899 0.224219 0.614643 0.743743 0.924973
# o 0.658481 0.507983 0.047808 0.329159 0.325655
# w -0.741060 -0.804336 0.705783 0.373557 0.277881
# o 1.349972 0.126765 -0.167115 -0.031806 -0.844133
# r 0.066011 0.711079 0.302742 -1.648539 0.485925
# l -1.534991 -0.055344 1.798097 -1.758417 -0.226042
# d 0.036769 -0.440360 -0.137017 0.482295 0.853643
print("df.loc['h']>0=",df.loc['h']>0)
# df.loc['h']>0= H True
# E True
# L True
# L False
# O False
# Name: h, dtype: bool
.iloc() 是基于整数的索引,利用元素在各个轴上的索引序号进行选择,序号超过范围产生IndexError,切片时允许序号超过范围。各种访问方式如下 :
1)整数,类似于.loc,只使用一个维度,即对行选择,小标默认从 0 开始。例如:df.iloc[5],选择df第 6 行。
2)整数列表或者数组,例如df.iloc[[5, 1, 7]],选择df第 6 行, 第 2 行, 第 8 行。
3) 元素为整数的切片操作,不同于.loc,则下标为 stop 的数据不被选择。如:df.iloc[0:3], 只包含 0,1,2行,不包含第 3 行。 也可以使用布尔数组进行筛选,例如 df.iloc[np.array(df.A>0.5)],df.iloc[list(df.A>0.5)]。
4)注意使用布尔数组进行筛选时,可以使用 list 或者 array,使用 Series会出错,NotImplementedError 和 ValueError,前者是 Series 的 index 与待切片 DataFrame的index 不同时报错,后置 index 相同时报错。与.loc使用布尔数组,可以使用 list, array, 也可以使用Series,使用Series时 index需要一致,否则会报 IndexError。
代码如下:
df1=pd.DataFrame(np.random.randn(10,7),columns=['C','H','I','N','E','S','E'])
print("df1.iloc[:7]=",df1.iloc[:7])
# df1.iloc[:7]= C H I N E S E
# 0 0.954972 -1.452191 0.423801 -0.308185 -0.945486 -0.014652 0.398128
# 1 -0.928292 0.443338 1.447824 0.736785 -0.312469 -0.322120 0.759044
# 2 -1.436116 0.599263 0.616305 -0.180370 0.328911 1.987013 0.594443
# 3 0.375589 -1.775010 2.469267 -1.196811 1.194035 0.606647 -0.529487
# 4 0.535321 -0.909114 1.195933 -0.655170 1.924333 -0.284786 -1.288624
# 5 1.010001 -1.005942 1.276954 -0.691063 0.729140 -0.006609 -1.668707
# 6 1.355503 0.220937 0.038354 0.818567 1.559360 0.691327 -1.175570
print("df1.iloc[1:6,3:5]=",df1.iloc[1:6,3:5])
# df1.iloc[1:6,3:5]= N E
# 1 0.189258 -1.440922
# 2 -2.088273 1.170744
# 3 -0.067145 0.462795
# 4 1.214220 1.733308
# 5 1.733694 -0.914985
print("df1.iloc[[2,3,4],[5,6]]=",df1.iloc[[2,3,4],[5,6]])
# df1.iloc[[2,3,4],[5,6]]= S E
# 2 0.234386 0.317554
# 3 -1.455785 1.178777
# 4 0.492809 1.506377
print("df1.iloc[3:5,:]=",df1.iloc[3:5,:])
# df1.iloc[3:5,:]= C H I N E S E
# 3 0.439285 -0.497918 -0.46887 -1.649200 -0.143219 -1.455785 1.178777
# 4 0.138909 -0.317724 0.39049 -1.103951 1.548416 0.492809 1.506377
print("df1.iloc[:,4:6]=",df1.iloc[:,4:6])
# df1.iloc[:,4:6]= E S
# 0 -0.168126 -0.451168
# 1 -0.982937 0.746423
# 2 -0.708478 0.234386
# 3 -0.143219 -1.455785
# 4 1.548416 0.492809
# 5 -0.231034 -1.039720
# 6 -0.091844 -0.981034
# 7 -0.195216 -0.639616
# 8 -0.032843 0.353545
# 9 0.107125 -2.593816
[]操作只能输入一个维度,不能用逗号隔开输入两个维度,否则程序运行会报错。另外,.loc 和 .iloc只输入一维时选择的行,而[]选择的是列,且必须使用列名。
代码如下:
df2=pd.DataFrame(np.random.randn(5,5),columns=['W','O','R','L','D'])
print("df2['R']=",df2['R'])
# df2['R']= 0 0.173285
# 1 -0.656821
# 2 0.786381
# 3 0.959628
# 4 -0.692132
# Name: R, dtype: float64
print("df2[['R','D']]=",df2[['R','D']])
# df2[['R','D']]= R D
# 0 0.173285 0.701228
# 1 -0.656821 -0.275652
# 2 0.786381 0.614438
# 3 0.959628 -0.040172
# 4 -0.692132 0.598275
print("df[3:3]=",df[3:3])
# df[3:3]= Empty DataFrame
# Columns: [H, E, L, L, O]
# Index: []
可以使用属性运算符.来选择列。
df3=pd.DataFrame(np.random.randn(6,6),columns=['P','R','O','F','I','T'])
print("df3.F=",df3.F)
# df3.F= 0 -1.272033
# 1 -1.163044
# 2 0.722184
# 3 1.175426
# 4 0.814546
# 5 -0.117228
# Name: F, dtype: float64
print("df3.loc[lambda df3:[1,2]]=",df3.loc[lambda df3:[1,2]])# .loc 支持可调用函数
# df3.loc[lambda df3:[1,2]]= P R O F I T
# 1 0.327314 -1.477497 -2.811585 -0.779521 -1.617792 0.199685
# 2 1.706674 -0.804023 -0.992027 3.238219 0.692974 -1.469227
print("df3.iloc[lambda df3:[3,4]]=",df3.loc[lambda df3:[3,4]])# .iloc支持可调用函数
# df3.iloc[lambda df3:[3,4]]= P R O F I T
# 3 0.477647 -0.968227 0.619846 1.379768 -0.689116 -0.671862
# 4 2.347779 -0.514858 -0.529524 0.481264 -0.549950 1.919538
构造多级索引的方法有 from_tuples、from_arrays和 from_product 等三种方法,其都是 pd.MultiIndex 对象下的函数。
import pandas as pd
import numpy as np
df=pd.read_csv('D:/binchen/txzq/pandastest.csv',
usecols=['School', 'Grade', 'Name', 'Gender','Weight', 'Age'])
print("df['Name'].head()=",df['Name'].head())# 取单列
# df['Name'].head()= 0 zhangsan
# 1 lisi
# 2 wanger
# 3 liguang
# 4 xiaoming
# Name: Name, dtype: object
print("df[['Grade','Name']].head()=",df[['Grade','Name']].head())# 取多列
# df[['Grade','Name']].head()= Grade Name
# 0 6 zhangsan
# 1 5 lisi
# 2 3 wanger
# 3 3 liguang
# 4 4 xiaoming
print("df.Name.head()=",df.Name.head())#取出单列,且列名中不包含空格,也可以用 .列名 取出
# df.Name.head()= 0 zhangsan
# 1 lisi
# 2 wanger
# 3 liguang
# 4 xiaoming
np.random.seed(0)
m_index=pd.MultiIndex.from_product([list('ABCD'),
df.Gender.unique()],names=('School','Gender'))
m_column=pd.MultiIndex.from_product([['Height','Weight'],
df.Grade.unique()],names=('Indicator','Grade'))
df_m=pd.DataFrame(np.c_[(np.random.randn(8,4)*5+163).tolist(),
(np.random.randn(8,4)*5+65).tolist()],
index=m_index,
columns=m_column)
print("df_m=",df_m)
# df_m= Indicator Height ... Weight
# Grade 6 5 ... 3 4
# School Gender ...
# A Male 171.820262 165.000786 ... 63.260439 65.781745
# Female 172.337790 158.113611 ... 63.063366 63.488486
# B Male 162.483906 165.052993 ... 56.468649 74.753877
# Female 166.805189 163.608375 ... 58.736023 68.887452
# C Male 170.470395 161.974209 ... 60.522667 66.934512
# Female 150.235051 166.268093 ... 64.859089 67.141659
# D Male 174.348773 155.728172 ... 61.828390 63.186294
# Female 170.663896 170.346794 ... 60.934269 56.368587
DataFrame 的结构与单层索引的表一样,具备元素值、行索引和列索引三个部分,其行索引和列索引都是 MultiIndex 类型,但是索引中的元素是元组,而不是单层索引中的标量。例如,行索引的第四个元素为 (“B”, “Male”) ,列索引的第二个元素为 (“Height”, “5”) ,需要注意,当外层连续出现相同的值时,第一次之后出现的会被隐藏显示,增强了结果的可读性。
MultiIndex 具有名字属性,其与单层索引类似,代码中的 School 和 Gender 分别对应了表的第一层和第二层行索引的名字, Indicator 和 Grade 分别对应了第一层和第二层列索引的名字。
# 索引的名字和值属性分别可以通过 names 和 values 获得
print("df_m.index.names=",df_m.index.names)
# df_m.index.names= ['School', 'Gender']
print("df_m.columns.names=",df_m.columns.names)
# df_m.columns.names= ['Indicator', 'Grade']
print("df_m.index.values=",df_m.index.values)
# df_m.index.values= [('A', 'Male') ('A', 'Female') ('B', 'Male') ('B', 'Female') ('C', 'Male')
# ('C', 'Female') ('D', 'Male') ('D', 'Female')]
print("df_m.columns.values=",df_m.columns.values)
# df_m.columns.values= [('Height', 6) ('Height', 5) ('Height', 3) ('Height', 4) ('Weight', 6)
# ('Weight', 5) ('Weight', 3) ('Weight', 4)]
# 如果想得到某一层的索引,可以通过 get_level_values 获得
print("df_m.index.get_level_values(0)=",df_m.index.get_level_values(0))
# df_m.index.get_level_values(0)= Index(['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'], dtype='object', name='School')
df_m=df.set_index(['School','Grade'])
print("df_m.head()=",df_m.head())
# df_m.head()= Name Gender Weight Age
# School Grade
# zhongxinxiaoxue 6 zhangsan Male 75 12
# 5 lisi Male 100 11
# 3 wanger Male 70 9
# 3 liguang Male 53 7
# 4 xiaoming Male 96 9
IndexSlice 对象可以将切片和布尔列表混合使用, Slice 对象一共有两种形式,第一种为 loc[idx[,]] 型,第二种为 loc[idx[,],idx[,]] 型,
def mklbl(prefix,n):
return ["%s%s" % (prefix,i) for i in range(n)]
mult_index = pd.MultiIndex.from_product([mklbl('A',5),
mklbl('B',3),
mklbl('C',4),
mklbl('D',2)])
mult_columns = pd.MultiIndex.from_tuples([('a','foo'),('a','bar'),
('b','foo'),('b','bah')],
names=['lvl0', 'lvl1'])
df_mul = pd.DataFrame(np.arange(len(mult_index)*len(mult_columns)).reshape((len(mult_index),len(mult_columns))),
index=mult_index,
columns=mult_columns).sort_index().sort_index(axis=1)
print("df_mul.head()=",df_mul.head())
# df_mul.head()= lvl0 a b
# lvl1 bar foo bah foo
# A0 B0 C0 D0 1 0 3 2
# D1 5 4 7 6
# C1 D0 9 8 11 10
# D1 13 12 15 14
# C2 D0 17 16 19 18
print("df_mul.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]=",df_mul.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :])
# df_mul.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]= lvl0 a b
# lvl1 bar foo bah foo
# A1 B0 C1 D0 105 104 107 106
# D1 109 108 111 110
# C3 D0 121 120 123 122
# D1 125 124 127 126
# B1 C1 D0 137 136 139 138
# D1 141 140 143 142
# C3 D0 153 152 155 154
# D1 157 156 159 158
# B2 C1 D0 169 168 171 170
# D1 173 172 175 174
# C3 D0 185 184 187 186
# D1 189 188 191 190
# A2 B0 C1 D0 201 200 203 202
# D1 205 204 207 206
# C3 D0 217 216 219 218
# D1 221 220 223 222
# B1 C1 D0 233 232 235 234
# D1 237 236 239 238
# C3 D0 249 248 251 250
# D1 253 252 255 254
# B2 C1 D0 265 264 267 266
# D1 269 268 271 270
# C3 D0 281 280 283 282
# D1 285 284 287 286
# A3 B0 C1 D0 297 296 299 298
# D1 301 300 303 302
# C3 D0 313 312 315 314
# D1 317 316 319 318
# B1 C1 D0 329 328 331 330
# D1 333 332 335 334
# C3 D0 345 344 347 346
# D1 349 348 351 350
# B2 C1 D0 361 360 363 362
# D1 365 364 367 366
# C3 D0 377 376 379 378
# D1 381 380 383 382
idx = pd.IndexSlice
print("df_mul.loc[idx['A1': 'A3', :, ['C1', 'C3']], :]=",df_mul.loc[idx['A1': 'A3', :, ['C1', 'C3']], :])
# df_mul.loc[idx['A1': 'A3', :, ['C1', 'C3']], :]= lvl0 a b
# lvl1 bar foo bah foo
# A1 B0 C1 D0 105 104 107 106
# D1 109 108 111 110
# C3 D0 121 120 123 122
# D1 125 124 127 126
# B1 C1 D0 137 136 139 138
# D1 141 140 143 142
# C3 D0 153 152 155 154
# D1 157 156 159 158
# B2 C1 D0 169 168 171 170
# D1 173 172 175 174
# C3 D0 185 184 187 186
# D1 189 188 191 190
# A2 B0 C1 D0 201 200 203 202
# D1 205 204 207 206
# C3 D0 217 216 219 218
# D1 221 220 223 222
# B1 C1 D0 233 232 235 234
# D1 237 236 239 238
# C3 D0 249 248 251 250
# D1 253 252 255 254
# B2 C1 D0 265 264 267 266
# D1 269 268 271 270
# C3 D0 281 280 283 282
# D1 285 284 287 286
# A3 B0 C1 D0 297 296 299 298
# D1 301 300 303 302
# C3 D0 313 312 315 314
# D1 317 316 319 318
# B1 C1 D0 329 328 331 330
# D1 333 332 335 334
# C3 D0 345 344 347 346
# D1 349 348 351 350
# B2 C1 D0 361 360 363 362
# D1 365 364 367 366
# C3 D0 377 376 379 378
# D1 381 380 383 382
# 选择出列索引第二层为bar的列
print("df_mul.loc[idx['A1': 'A3', :, ['C1', 'C3']], idx[:, 'foo']]=",df_mul.loc[idx['A1': 'A3', :, ['C1', 'C3']], idx[:, 'foo']])
# df_mul.loc[idx['A1': 'A3', :, ['C1', 'C3']], idx[:, 'foo']]= lvl0 a b
# lvl1 foo foo
# A1 B0 C1 D0 104 106
# D1 108 110
# C3 D0 120 122
# D1 124 126
# B1 C1 D0 136 138
# D1 140 142
# C3 D0 152 154
# D1 156 158
# B2 C1 D0 168 170
# D1 172 174
# C3 D0 184 186
# D1 188 190
# A2 B0 C1 D0 200 202
# D1 204 206
# C3 D0 216 218
# D1 220 222
# B1 C1 D0 232 234
# D1 236 238
# C3 D0 248 250
# D1 252 254
# B2 C1 D0 264 266
# D1 268 270
# C3 D0 280 282
# D1 284 286
# A3 B0 C1 D0 296 298
# D1 300 302
# C3 D0 312 314
# D1 316 318
# B1 C1 D0 328 330
# D1 332 334
# C3 D0 344 346
# D1 348 350
# B2 C1 D0 360 362
# D1 364 366
# C3 D0 376 378
# D1 380 382
# 使用布尔的蒙版来配合IndexSlice选择数据, 选择出foo列的数值小于100的行
mask = (df_mul[('a', 'foo')] < 100) & (df_mul[('b', 'foo')] < 100)
print("df_mul.loc[idx[mask, :, ['C1', 'C2']], idx[:, 'foo']]=",df_mul.loc[idx[mask, :, ['C1', 'C2']], idx[:, 'foo']])
# df_mul.loc[idx[mask, :, ['C1', 'C2']], idx[:, 'foo']]= lvl0 a b
# lvl1 foo foo
# A0 B0 C1 D0 8 10
# D1 12 14
# C2 D0 16 18
# D1 20 22
# B1 C1 D0 40 42
# D1 44 46
# C2 D0 48 50
# D1 52 54
# B2 C1 D0 72 74
# D1 76 78
# C2 D0 80 82
# D1 84 86
np.random.seed(0)
list5,list6,list7=['D','K'],['d','k'],['hello','world']
mult_index1=pd.MultiIndex.from_product([list5,list6,list7],names=('Up','Low','Extra'))
list8,list9,list10=['W','Q'],['w','q'],['pig','dog']
mult_index2=pd.MultiIndex.from_product([list8,list9,list10],names=('Big','Small','Other'))
df_e=pd.DataFrame(np.random.randint(-9,10,(8,8)),
index=mult_index1,
columns=mult_index2)
print("df_e=",df_e)
# df_e= Big W Q
# Small w q w q
# Other pig dog pig dog pig dog pig dog
# Up Low Extra
# D d hello 3 6 -9 -6 -6 -2 0 9
# world -5 -3 3 -8 -3 -2 5 8
# k hello -4 4 -1 0 7 -4 6 6
# world -9 9 -6 8 5 -2 -9 -8
# K d hello 0 -9 1 -6 2 9 -7 -9
# world -9 -5 -4 -3 -1 8 6 -5
# k hello 0 1 -8 -8 -2 0 -6 -3
# world 2 5 9 -9 5 -6 3 1
## 索引层的交换由 swaplevel 和 reorder_levels 完成,前者只能交换两个层,
# 而后者可以交换任意层,两者都可以指定交换的是轴是哪一个,即行索引或列索引
# # 列索引的第二层和第三层交换
print("df_e.swaplevel(1,2,axis=1).head()=",df_e.swaplevel(1,2,axis=1).head())
# df_e.swaplevel(1,2,axis=1).head()= Big W Q
# Other pig dog pig dog pig dog pig dog
# Small w w q q w w q q
# Up Low Extra
# D d hello 3 6 -9 -6 -6 -2 0 9
# world -5 -3 3 -8 -3 -2 5 8
# k hello -4 4 -1 0 7 -4 6 6
# world -9 9 -6 8 5 -2 -9 -8
# K d hello 0 -9 1 -6 2 9 -7 -9
print("df_e.reorder_levels([2,0,1],axis=0).head()",df_e.reorder_levels([2,0,1],axis=0).head()) # 列表数字指代原来索引中的层
# df_e.reorder_levels([2,0,1],axis=0).head() Big W Q
# Small w q w q
# Other pig dog pig dog pig dog pig dog
# Extra Up Low
# hello D d 3 6 -9 -6 -6 -2 0 9
# world D d -5 -3 3 -8 -3 -2 5 8
# hello D k -4 4 -1 0 7 -4 6 6
# world D k -9 9 -6 8 5 -2 -9 -8
# hello K d 0 -9 1 -6 2 9 -7 -9
# 想要删除某一层的索引,可以使用 droplevel 方法
print("df_e.droplevel(1,axis=1)=",df_e.droplevel(1,axis=1))
# df_e.droplevel(1,axis=1)= Big W Q
# Other pig dog pig dog pig dog pig dog
# Up Low Extra
# D d hello 3 6 -9 -6 -6 -2 0 9
# world -5 -3 3 -8 -3 -2 5 8
# k hello -4 4 -1 0 7 -4 6 6
# world -9 9 -6 8 5 -2 -9 -8
# K d hello 0 -9 1 -6 2 9 -7 -9
# world -9 -5 -4 -3 -1 8 6 -5
# k hello 0 1 -8 -8 -2 0 -6 -3
# world 2 5 9 -9 5 -6 3 1
print("df_e.droplevel([0,1],axis=0)",df_e.droplevel([0,1],axis=0))
# df_e.droplevel([0,1],axis=0) Big W Q
# Small w q w q
# Other pig dog pig dog pig dog pig dog
# Extra
# hello 3 6 -9 -6 -6 -2 0 9
# world -5 -3 3 -8 -3 -2 5 8
# hello -4 4 -1 0 7 -4 6 6
# world -9 9 -6 8 5 -2 -9 -8
# hello 0 -9 1 -6 2 9 -7 -9
# world -9 -5 -4 -3 -1 8 6 -5
# hello 0 1 -8 -8 -2 0 -6 -3
# world 2 5 9 -9 5 -6 3 1
print("df_e.rename_axis(index={'Up':'Changed_row'},columns={'Other':'Changed_Col'}).head()=",
df_e.rename_axis(index={'Up':'Changed_row'},columns={'Other':'Changed_Col'}).head())
# df_e.rename_axis(index={'Up':'Changed_row'},columns={'Other':'Changed_Col'}).head()= Big W Q
# Small w q w q
# Changed_Col pig dog pig dog pig dog pig dog
# Changed_row Low Extra
# D d hello 3 6 -9 -6 -6 -2 0 9
# world -5 -3 3 -8 -3 -2 5 8
# k hello -4 4 -1 0 7 -4 6 6
# world -9 9 -6 8 5 -2 -9 -8
# K d hello 0 -9 1 -6 2 9 -7 -9
df_e.rename_axis(index={'Upper':'Changed_row'},columns={'Other':'Changed_Col'}).head()
print("df_e.rename_axis(index={'Up':'Changed_row'},columns={'Other':'Changed_Col'}).head()=",
df_e.rename_axis(index={'Up':'Changed_row'},columns={'Other':'Changed_Col'}).head())
# df_e.rename_axis(index={'Up':'Changed_row'},columns={'Other':'Changed_Col'}).head()= Big W Q
# Small w q w q
# Changed_Col pig dog pig dog pig dog pig dog
# Changed_row Low Extra
# D d hello 3 6 -9 -6 -6 -2 0 9
# world -5 -3 3 -8 -3 -2 5 8
# k hello -4 4 -1 0 7 -4 6 6
# world -9 9 -6 8 5 -2 -9 -8
# K d hello 0 -9 1 -6 2 9 -7 -9
# 通过 rename 可以对索引的值进行修改,如果是多级索引,则需要指定修改的层号 level
print("df_e.rename(columns={'pig':'cat'},level=2).head()=",
df_e.rename(columns={'pig':'cat'},level=2).head())
# df_e.rename(columns={'pig':'cat'},level=2).head()= Big W Q
# Small w q w q
# Other cat dog cat dog cat dog cat dog
# Up Low Extra
# D d hello 3 6 -9 -6 -6 -2 0 9
# world -5 -3 3 -8 -3 -2 5 8
# k hello -4 4 -1 0 7 -4 6 6
# world -9 9 -6 8 5 -2 -9 -8
# K d hello 0 -9 1 -6 2 9 -7 -9
# 传入参数可以是函数,即其输入值是索引元素
print("df_e.rename(index=lambda x:str.upper(x),level=2).head()=",
df_e.rename(index=lambda x:str.upper(x),level=2).head())
# df_e.rename(index=lambda x:str.upper(x),level=2).head()= Big W Q
# Small w q w q
# Other pig dog pig dog pig dog pig dog
# Up Low Extra
# D d HELLO 3 6 -9 -6 -6 -2 0 9
# WORLD -5 -3 3 -8 -3 -2 5 8
# k HELLO -4 4 -1 0 7 -4 6 6
# WORLD -9 9 -6 8 5 -2 -9 -8
# K d HELLO 0 -9 1 -6 2 9 -7 -9
# 对于整个索引的元素替换,可以利用迭代器实现
new_values = iter(list('abcdefgh'))
print("df_e.rename(index=lambda x:next(new_values),level=2)=",
df_e.rename(index=lambda x:next(new_values),level=2))
# df_e.rename(index=lambda x:next(new_values),level=2)= Big W Q
# Small w q w q
# Other pig dog pig dog pig dog pig dog
# Up Low Extra
# D d a 3 6 -9 -6 -6 -2 0 9
# b -5 -3 3 -8 -3 -2 5 8
# k c -4 4 -1 0 7 -4 6 6
# d -9 9 -6 8 5 -2 -9 -8
# K d e 0 -9 1 -6 2 9 -7 -9
# f -9 -5 -4 -3 -1 8 6 -5
# k g 0 1 -8 -8 -2 0 -6 -3
# h 2 5 9 -9 5 -6 3 1
df_temp = df_e.copy()
new_idx = df_temp.index.map(lambda x: (x[0],x[1],str.upper(x[2])))
df_temp.index = new_idx
print("df_temp.head()=",df_temp.head())
# df_temp.head()= Big W Q
# Small w q w q
# Other pig dog pig dog pig dog pig dog
# Up Low Extra
# D d HELLO 3 6 -9 -6 -6 -2 0 9
# WORLD -5 -3 3 -8 -3 -2 5 8
# k HELLO -4 4 -1 0 7 -4 6 6
# WORLD -9 9 -6 8 5 -2 -9 -8
# K d HELLO 0 -9 1 -6 2 9 -7 -9
new_idx1 = df_temp.index.map(lambda x: (x[0]+'-'+x[1]+'-'+x[2]))
df_temp.index = new_idx1
print("df_temp.head()=",df_temp.head())
# df_temp.head()= Big W Q
# Small w q w q
# Other pig dog pig dog pig dog pig dog
# D-d-HELLO 3 6 -9 -6 -6 -2 0 9
# D-d-WORLD -5 -3 3 -8 -3 -2 5 8
# D-k-HELLO -4 4 -1 0 7 -4 6 6
# D-k-WORLD -9 9 -6 8 5 -2 -9 -8
# K-d-HELLO 0 -9 1 -6 2 9 -7 -9
new_idx2 = df_temp.index.map(lambda x:tuple(x.split('-')))
df_temp.index = new_idx2
print("df_temp.head()=",df_temp.head())
# df_temp.head()= Big W Q
# Small w q w q
# Other pig dog pig dog pig dog pig dog
# D d HELLO 3 6 -9 -6 -6 -2 0 9
# WORLD -5 -3 3 -8 -3 -2 5 8
# k HELLO -4 4 -1 0 7 -4 6 6
# WORLD -9 9 -6 8 5 -2 -9 -8
# K d HELLO 0 -9 1 -6 2 9 -7 -9
new_df= pd.DataFrame({'H':list('hello'),'W':list('World'),'S':[1,2,3,4,5]})
print("new_df=",new_df)
# new_df= H W S
# 0 h W 1
# 1 e o 2
# 2 l r 3
# 3 l l 4
# 4 o d 5
# 索引的设置可以使用 set_index 完成,
# 其主要参数是 append ,表示是否来保留原来的索引,直接把新设定的添加到原索引的内层
print("new_df.set_index('W')=",new_df.set_index('W'))
# new_df.set_index('W')= H S
# W
# W h 1
# o e 2
# r l 3
# l l 4
# d o 5
print("new_df.set_index('W',append=True)=",new_df.set_index('W',append=True))
# new_df.set_index('W',append=True)= H S
# W
# 0 W h 1
# 1 o e 2
# 2 r l 3
# 3 l l 4
# 4 d o 5
print("new_df.set_index(['H','W'])=",new_df.set_index(['H','W']))
# new_df.set_index(['H','W'])= S
# H W
# h W 1
# e o 2
# l r 3
# l 4
# o d 5
my_index = pd.Series(list('WXYZA'), name='D')
new_df=new_df.set_index(['H',my_index])
print("new_df=",new_df)
# new_df= W S
# H D
# h W W 1
# e X o 2
# l Y r 3
# Z l 4
# o A d 5
# reset_index 是 set_index 的逆函数,其主要参数是 drop ,
# 表示是否要把去掉的索引层丢弃,而不是添加到列中
print("new_df.reset_index()=",new_df.reset_index())
# new_df.reset_index()= H D W S
# 0 h W W 1
# 1 e X o 2
# 2 l Y r 3
# 3 l Z l 4
# 4 o A d 5
在某些场合下,需要对索引做一些扩充或者剔除,更具体地要求是给定一个新的索引,把原表中相应的索引对应元素填充到新索引构成的表中。
df_reindex = pd.DataFrame({"Weight":[66,72,81],"Height":[176,180,179]},index=['1001','1003','1002'])
print("df_reindex=",df_reindex)
# df_reindex= Weight Height
# 1001 66 176
# 1003 72 180
# 1002 81 179
df_reindex.reindex(index=['1001','1002','1003','1004','1005','1006','1007'],columns=['Weight','Gender','Age'])
print("df_reindex.reindex(index=['1001','1002','1003','1004','1005','1006','1007'],columns=['Weight','Gender','Age'])=",
df_reindex.reindex(index=['1001','1002','1003','1004','1005','1006','1007'],columns=['Weight','Gender','Age']))
# df_reindex.reindex(index=['1001','1002','1003','1004','1005','1006','1007'],columns=['Weight','Gender','Age'])= Weight Gender Age
# 1001 66.0 NaN NaN
# 1002 81.0 NaN NaN
# 1003 72.0 NaN NaN
# 1004 NaN NaN NaN
# 1005 NaN NaN NaN
# 1006 NaN NaN NaN
# 1007 NaN NaN NaN
# 还有一个与 reindex 功能类似的函数是 reindex_like ,其功能是仿照传入的表的索引来进行被调用表索引的变形
df_existed = pd.DataFrame(index=['1001','1002','1003','1004','1005','1006','1007'], columns=['Weight','Gender','Age'])
print("df_reindex.reindex_like(df_existed)=",df_reindex.reindex_like(df_existed))
# df_reindex.reindex_like(df_existed)= Weight Gender Age
# 1001 66.0 NaN NaN
# 1002 81.0 NaN NaN
# 1003 72.0 NaN NaN
# 1004 NaN NaN NaN
# 1005 NaN NaN NaN
# 1006 NaN NaN NaN
# 1007 NaN NaN NaN
df_set1 = pd.DataFrame([[0,1],[1,2],[3,4]],index = pd.Index(['a','b','a'],name='id1'))
df_set2 = pd.DataFrame([[4,5],[2,6],[7,1]],index = pd.Index(['b','b','c'],name='id2'))
id1, id2 = df_set1.index.unique(), df_set2.index.unique()
print("id1.intersection(id2)=",id1.intersection(id2))
#id1.intersection(id2)= Index(['b'], dtype='object')
print("id1.union(id2)=",id1.union(id2))
# id1.union(id2)= Index(['a', 'b', 'c'], dtype='object')
print("id1.difference(id2)=",id1.difference(id2))
# id1.difference(id2)= Index(['a'], dtype='object')
print("id1.symmetric_difference(id2)=",id1.symmetric_difference(id2))
# id1.symmetric_difference(id2)= Index(['a', 'c'], dtype='object')
print("id1 & id2=",id1 & id2)
# id1 & id2= Index(['b'], dtype='object')
print("id1 | id2=",id1 | id2)
# id1 | id2= Index(['a', 'b', 'c'], dtype='object')
print("(id1 ^ id2) & id1=",(id1 ^ id2) & id1)
# (id1 ^ id2) & id1= Index(['a'], dtype='object')
print("id1 ^ id2=",id1 ^ id2)# ^符号即对称差
# id1 ^ id2= Index(['a', 'c'], dtype='object')
import pandas as pd
import numpy as np
df=pd.read_csv('D:/binchen/txzq/data/company.csv')
print("df=",df)
# df= EmployeeID birthdate_key ... job_title gender
# 0 1318 1/3/1954 ... CEO M
# 1 1319 1/3/1957 ... VP Stores F
# 2 1320 1/2/1955 ... Legal Counsel F
# 3 1321 1/2/1959 ... VP Human Resources M
# 4 1322 1/9/1958 ... VP Finance M
# ... ... ... ... ... ...
# 6279 8036 8/9/1992 ... Cashier F
# 6280 8181 9/26/1993 ... Cashier M
# 6281 8223 2/11/1994 ... Cashier M
# 6282 8226 2/16/1994 ... Cashier F
# 6283 8264 6/13/1994 ... Cashier F
# [6284 rows x 7 columns]
print("df.head(10)=",df.head(10))
# df.head(10)= EmployeeID birthdate_key ... job_title gender
# 0 1318 1/3/1954 ... CEO M
# 1 1319 1/3/1957 ... VP Stores F
# 2 1320 1/2/1955 ... Legal Counsel F
# 3 1321 1/2/1959 ... VP Human Resources M
# 4 1322 1/9/1958 ... VP Finance M
# 5 1323 1/9/1962 ... Exec Assistant, VP Stores M
# 6 1325 1/13/1964 ... Exec Assistant, Legal Counsel F
# 7 1328 1/17/1956 ... CHief Information Officer F
# 8 1329 1/23/1967 ... Store Manager F
# 9 1330 1/25/1967 ... Meats Manager F
df1=df.query('age<40')
df2=df1.query('department=="Dairy"')
df3=df1.query('department=="Bakery"')
#print("df3=",df3)
print("pd.concat([df2,df3])=",pd.concat([df2,df3]))
# pd.concat([df2,df3])= EmployeeID birthdate_key age ... department job_title gender
# 3720 5900 1/6/1976 39 ... Dairy Dairy Person F
# 3722 5902 1/12/1976 39 ... Dairy Dairy Person M
# 3723 5903 1/15/1976 39 ... Dairy Dairy Person F
# 3724 5904 1/16/1976 39 ... Dairy Dairy Person M
# 3725 5905 1/19/1976 39 ... Dairy Dairy Person M
# ... ... ... ... ... ... ... ...
# 5453 7644 8/18/1989 26 ... Bakery Baker M
# 5460 7651 8/29/1989 26 ... Bakery Baker M
# 5468 7660 9/28/1989 26 ... Bakery Baker F
# 5565 7758 5/16/1990 25 ... Bakery Baker F
# 6271 6540 2/14/1981 34 ... Bakery Baker M
#
# [804 rows x 7 columns]
print("df1.query('department=='Dairy'')+df1.query('department=='Bakery'')=",
df1.query('department=="Dairy"') + df1.query('department=="Bakery"'))
# df1.query('department=='Dairy'')+df1.query('department=='Bakery'')= EmployeeID birthdate_key age city_name department job_title gender
# 3719 NaN NaN NaN NaN NaN NaN NaN
# 3720 NaN NaN NaN NaN NaN NaN NaN
# 3722 NaN NaN NaN NaN NaN NaN NaN
# 3723 NaN NaN NaN NaN NaN NaN NaN
# 3724 NaN NaN NaN NaN NaN NaN NaN
# ... ... ... ... ... ... ... ...
# 6136 NaN NaN NaN NaN NaN NaN NaN
# 6137 NaN NaN NaN NaN NaN NaN NaN
# 6270 NaN NaN NaN NaN NaN NaN NaN
# 6271 NaN NaN NaN NaN NaN NaN NaN
# 6278 NaN NaN NaN NaN NaN NaN NaN
#
# [804 rows x 7 columns]
df4=df.loc[df['age']<40 ]
print('df4.loc[df4["department"]=="Dairy"] + df4.loc[df4["department"]=="Bakery"]=',
df4.loc[df4["department"]=="Dairy"] + df4.loc[df4["department"]=="Bakery"])
# df4.loc[df4["department"]=="Dairy"] + df4.loc[df4["department"]=="Bakery"]= EmployeeID birthdate_key age city_name department job_title gender
# 3719 NaN NaN NaN NaN NaN NaN NaN
# 3720 NaN NaN NaN NaN NaN NaN NaN
# 3722 NaN NaN NaN NaN NaN NaN NaN
# 3723 NaN NaN NaN NaN NaN NaN NaN
# 3724 NaN NaN NaN NaN NaN NaN NaN
# ... ... ... ... ... ... ... ...
# 6136 NaN NaN NaN NaN NaN NaN NaN
# 6137 NaN NaN NaN NaN NaN NaN NaN
# 6270 NaN NaN NaN NaN NaN NaN NaN
# 6271 NaN NaN NaN NaN NaN NaN NaN
# 6278 NaN NaN NaN NaN NaN NaN NaN
#
# [804 rows x 7 columns]
1、https://datawhalechina.github.io/joyful-pandas/build/html/%E7%9B%AE%E5%BD%95/ch3.html#query
2、https://geek-docs.com/pandas/pandas-tutorials/pandas-indexes-and-select-data.html
3、https://mlln.cn/2019/01/24/pandas%E7%BB%83%E4%B9%A0-%E5%A4%9A%E5%B1%82%E7%B4%A2%E5%BC%95%E7%9A%84%E5%88%9B%E5%BB%BA%E5%92%8C%E5%90%84%E7%A7%8D%E6%93%8D%E4%BD%9C(multiindex)%E7%AC%AC%E4%BA%8C%E9%83%A8%E5%88%86/