当前位置: 首页 > 文档资料 > Pandas 官方教程 >

学习 Pandas - 07 - Lesson

优质
小牛编辑
128浏览
2023-12-01

英文原文: 07 - Lesson

离群值 (Outlier)

  1. import pandas as pd
  2. import sys
  1. print('Python version ' + sys.version)
  2. print('Pandas version ' + pd.__version__)
  1. Python version 3.6.1 | packaged by conda-forge | (default, Mar 23 2017, 21:57:00)
  2. [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
  3. Pandas version 0.19.2
  1. # 创建一个 dataframe,用日期作为索引
  2. States = ['NY', 'NY', 'NY', 'NY', 'FL', 'FL', 'GA', 'GA', 'FL', 'FL']
  3. data = [1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  4. idx = pd.date_range('1/1/2012', periods=10, freq='MS')
  5. df1 = pd.DataFrame(data, index=idx, columns=['Revenue'])
  6. df1['State'] = States
  7. # 创建第二个 dataframe
  8. data2 = [10.0, 10.0, 9, 9, 8, 8, 7, 7, 6, 6]
  9. idx2 = pd.date_range('1/1/2013', periods=10, freq='MS')
  10. df2 = pd.DataFrame(data2, index=idx2, columns=['Revenue'])
  11. df2['State'] = States
  1. # 把两个 dataframe 合并起来
  2. df = pd.concat([df1,df2])
  3. df















































































































RevenueState
2012-01-011.0NY
2012-02-012.0NY
2012-03-013.0NY
2012-04-014.0NY
2012-05-015.0FL
2012-06-016.0FL
2012-07-017.0GA
2012-08-018.0GA
2012-09-019.0FL
2012-10-0110.0FL
2013-01-0110.0NY
2013-02-0110.0NY
2013-03-019.0NY
2013-04-019.0NY
2013-05-018.0FL
2013-06-018.0FL
2013-07-017.0GA
2013-08-017.0GA
2013-09-016.0FL
2013-10-016.0FL

计算离群值的方法

注意: 均值(average)和标准差(Standard Deviation)只对高斯分布(gaussian distribution)有意义。

  1. # 方法 1
  2. # 原始的 df 拷贝一份
  3. newdf = df.copy()
  4. newdf['x-Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())
  5. newdf['1.96*std'] = 1.96*newdf['Revenue'].std()
  6. newdf['Outlier'] = abs(newdf['Revenue'] - newdf['Revenue'].mean()) > 1.96*newdf['Revenue'].std()
  7. newdf














































































































































































RevenueStatex-Mean1.96*stdOutlier
2012-01-011.0NY5.755.200273True
2012-02-012.0NY4.755.200273False
2012-03-013.0NY3.755.200273False
2012-04-014.0NY2.755.200273False
2012-05-015.0FL1.755.200273False
2012-06-016.0FL0.755.200273False
2012-07-017.0GA0.255.200273False
2012-08-018.0GA1.255.200273False
2012-09-019.0FL2.255.200273False
2012-10-0110.0FL3.255.200273False
2013-01-0110.0NY3.255.200273False
2013-02-0110.0NY3.255.200273False
2013-03-019.0NY2.255.200273False
2013-04-019.0NY2.255.200273False
2013-05-018.0FL1.255.200273False
2013-06-018.0FL1.255.200273False
2013-07-017.0GA0.255.200273False
2013-08-017.0GA0.255.200273False
2013-09-016.0FL0.755.200273False
2013-10-016.0FL0.755.200273False

  1. # 方法 2
  2. # 分组的方法
  3. # 原始的 df 拷贝一份
  4. newdf = df.copy()
  5. State = newdf.groupby('State')
  6. newdf['Outlier'] = State.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )
  7. newdf['x-Mean'] = State.transform( lambda x: abs(x-x.mean()) )
  8. newdf['1.96*std'] = State.transform( lambda x: 1.96*x.std() )
  9. newdf














































































































































































RevenueStateOutlierx-Mean1.96*std
2012-01-011.0NYFalse5.007.554813
2012-02-012.0NYFalse4.007.554813
2012-03-013.0NYFalse3.007.554813
2012-04-014.0NYFalse2.007.554813
2012-05-015.0FLFalse2.253.434996
2012-06-016.0FLFalse1.253.434996
2012-07-017.0GAFalse0.250.980000
2012-08-018.0GAFalse0.750.980000
2012-09-019.0FLFalse1.753.434996
2012-10-0110.0FLFalse2.753.434996
2013-01-0110.0NYFalse4.007.554813
2013-02-0110.0NYFalse4.007.554813
2013-03-019.0NYFalse3.007.554813
2013-04-019.0NYFalse3.007.554813
2013-05-018.0FLFalse0.753.434996
2013-06-018.0FLFalse0.753.434996
2013-07-017.0GAFalse0.250.980000
2013-08-017.0GAFalse0.250.980000
2013-09-016.0FLFalse1.253.434996
2013-10-016.0FLFalse1.253.434996

  1. # 方法 2
  2. # 多个条件分组
  3. # 原始 df 拷贝一份
  4. newdf = df.copy()
  5. StateMonth = newdf.groupby(['State', lambda x: x.month])
  6. newdf['Outlier'] = StateMonth.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )
  7. newdf['x-Mean'] = StateMonth.transform( lambda x: abs(x-x.mean()) )
  8. newdf['1.96*std'] = StateMonth.transform( lambda x: 1.96*x.std() )
  9. newdf














































































































































































RevenueStateOutlierx-Mean1.96*std
2012-01-011.0NYFalse4.512.473364
2012-02-012.0NYFalse4.011.087434
2012-03-013.0NYFalse3.08.315576
2012-04-014.0NYFalse2.56.929646
2012-05-015.0FLFalse1.54.157788
2012-06-016.0FLFalse1.02.771859
2012-07-017.0GAFalse0.00.000000
2012-08-018.0GAFalse0.51.385929
2012-09-019.0FLFalse1.54.157788
2012-10-0110.0FLFalse2.05.543717
2013-01-0110.0NYFalse4.512.473364
2013-02-0110.0NYFalse4.011.087434
2013-03-019.0NYFalse3.08.315576
2013-04-019.0NYFalse2.56.929646
2013-05-018.0FLFalse1.54.157788
2013-06-018.0FLFalse1.02.771859
2013-07-017.0GAFalse0.00.000000
2013-08-017.0GAFalse0.51.385929
2013-09-016.0FLFalse1.54.157788
2013-10-016.0FLFalse2.05.543717

  1. # 方法 3
  2. # 分组的方法
  3. # 原始 df 拷贝一份
  4. newdf = df.copy()
  5. State = newdf.groupby('State')
  6. def s(group):
  7. group['x-Mean'] = abs(group['Revenue'] - group['Revenue'].mean())
  8. group['1.96*std'] = 1.96*group['Revenue'].std()
  9. group['Outlier'] = abs(group['Revenue'] - group['Revenue'].mean()) > 1.96*group['Revenue'].std()
  10. return group
  11. Newdf2 = State.apply(s)
  12. Newdf2














































































































































































RevenueStatex-Mean1.96*stdOutlier
2012-01-011.0NY5.007.554813False
2012-02-012.0NY4.007.554813False
2012-03-013.0NY3.007.554813False
2012-04-014.0NY2.007.554813False
2012-05-015.0FL2.253.434996False
2012-06-016.0FL1.253.434996False
2012-07-017.0GA0.250.980000False
2012-08-018.0GA0.750.980000False
2012-09-019.0FL1.753.434996False
2012-10-0110.0FL2.753.434996False
2013-01-0110.0NY4.007.554813False
2013-02-0110.0NY4.007.554813False
2013-03-019.0NY3.007.554813False
2013-04-019.0NY3.007.554813False
2013-05-018.0FL0.753.434996False
2013-06-018.0FL0.753.434996False
2013-07-017.0GA0.250.980000False
2013-08-017.0GA0.250.980000False
2013-09-016.0FL1.253.434996False
2013-10-016.0FL1.253.434996False

  1. # 方法 3
  2. # 多个条件分组
  3. # 原始 df 拷贝一份
  4. newdf = df.copy()
  5. StateMonth = newdf.groupby(['State', lambda x: x.month])
  6. def s(group):
  7. group['x-Mean'] = abs(group['Revenue'] - group['Revenue'].mean())
  8. group['1.96*std'] = 1.96*group['Revenue'].std()
  9. group['Outlier'] = abs(group['Revenue'] - group['Revenue'].mean()) > 1.96*group['Revenue'].std()
  10. return group
  11. Newdf2 = StateMonth.apply(s)
  12. Newdf2














































































































































































RevenueStatex-Mean1.96*stdOutlier
2012-01-011.0NY4.512.473364False
2012-02-012.0NY4.011.087434False
2012-03-013.0NY3.08.315576False
2012-04-014.0NY2.56.929646False
2012-05-015.0FL1.54.157788False
2012-06-016.0FL1.02.771859False
2012-07-017.0GA0.00.000000False
2012-08-018.0GA0.51.385929False
2012-09-019.0FL1.54.157788False
2012-10-0110.0FL2.05.543717False
2013-01-0110.0NY4.512.473364False
2013-02-0110.0NY4.011.087434False
2013-03-019.0NY3.08.315576False
2013-04-019.0NY2.56.929646False
2013-05-018.0FL1.54.157788False
2013-06-018.0FL1.02.771859False
2013-07-017.0GA0.00.000000False
2013-08-017.0GA0.51.385929False
2013-09-016.0FL1.54.157788False
2013-10-016.0FL2.05.543717False

假设是一个非高斯分布 (如果你绘制出图形,看上去不像是一个正态分布)

  1. # 原始的 df 拷贝一份
  2. newdf = df.copy()
  3. State = newdf.groupby('State')
  4. newdf['Lower'] = State['Revenue'].transform( lambda x: x.quantile(q=.25) - (1.5*(x.quantile(q=.75)-x.quantile(q=.25))) )
  5. newdf['Upper'] = State['Revenue'].transform( lambda x: x.quantile(q=.75) + (1.5*(x.quantile(q=.75)-x.quantile(q=.25))) )
  6. newdf['Outlier'] = (newdf['Revenue'] < newdf['Lower']) | (newdf['Revenue'] > newdf['Upper'])
  7. newdf
RevenueStateLowerUpperOutlier
2012-01-011.0NY-7.00019.000False
2012-02-012.0NY-7.00019.000False
2012-03-013.0NY-7.00019.000False
2012-04-014.0NY-7.00019.000False
2012-05-015.0FL2.62511.625False
2012-06-016.0FL2.62511.625False
2012-07-017.0GA6.6257.625False
2012-08-018.0GA6.6257.625True
2012-09-019.0FL2.62511.625False
2012-10-0110.0FL2.62511.625False
2013-01-0110.0NY-7.00019.000False
2013-02-0110.0NY-7.00019.000False
2013-03-019.0NY-7.00019.000False
2013-04-019.0NY-7.00019.000False
2013-05-018.0FL2.62511.625False
2013-06-018.0FL2.62511.625False
2013-07-017.0GA6.6257.625False
2013-08-017.0GA6.6257.625False
2013-09-016.0FL2.62511.625False
2013-10-016.0FL2.62511.625False