当前位置: 首页 > 文档资料 > Pandas 官方教程 >

Pandas 秘籍 - 第七章

优质
小牛编辑
138浏览
2023-12-01
  1. # 通常的开头
  2. %matplotlib inline
  3. import pandas as pd
  4. import matplotlib.pyplot as plt
  5. import numpy as np
  6. # 使图表更大更漂亮
  7. pd.set_option('display.mpl_style', 'default')
  8. plt.rcParams['figure.figsize'] = (15, 5)
  9. plt.rcParams['font.family'] = 'sans-serif'
  10. # 在 Pandas 0.12 中需要展示大量的列
  11. # 在 Pandas 0.13 中不需要
  12. pd.set_option('display.width', 5000)
  13. pd.set_option('display.max_columns', 60)

杂乱数据的主要问题之一是:你怎么知道它是否杂乱呢?

我们将在这里使用 NYC 311 服务请求数据集,因为它很大,有点不方便。

  1. requests = pd.read_csv('../data/311-service-requests.csv')

7.1 我怎么知道它是否杂乱?

我们在这里查看几列。 我知道邮政编码有一些问题,所以让我们先看看它。

要了解列是否有问题,我通常使用.unique()来查看所有的值。 如果它是一列数字,我将绘制一个直方图来获得分布的感觉。

当我们看看Incident Zip中的唯一值时,很快就会清楚这是一个混乱。

一些问题:

  • 一些已经解析为字符串,一些是浮点
  • 存在nan
  • 部分邮政编码为29616-075983
  • 有一些 Pandas 无法识别的 N/A 值 ,如'N/A''NO CLUE'

我们可以做的事情:

  • N/ANO CLUE规格化为nan
  • 看看 83 处发生了什么,并决定做什么
  • 将一切转化为字符串
  1. requests['Incident Zip'].unique()
  1. array([11432.0, 11378.0, 10032.0, 10023.0, 10027.0, 11372.0, 11419.0,
  2. 11417.0, 10011.0, 11225.0, 11218.0, 10003.0, 10029.0, 10466.0,
  3. 11219.0, 10025.0, 10310.0, 11236.0, nan, 10033.0, 11216.0, 10016.0,
  4. 10305.0, 10312.0, 10026.0, 10309.0, 10036.0, 11433.0, 11235.0,
  5. 11213.0, 11379.0, 11101.0, 10014.0, 11231.0, 11234.0, 10457.0,
  6. 10459.0, 10465.0, 11207.0, 10002.0, 10034.0, 11233.0, 10453.0,
  7. 10456.0, 10469.0, 11374.0, 11221.0, 11421.0, 11215.0, 10007.0,
  8. 10019.0, 11205.0, 11418.0, 11369.0, 11249.0, 10005.0, 10009.0,
  9. 11211.0, 11412.0, 10458.0, 11229.0, 10065.0, 10030.0, 11222.0,
  10. 10024.0, 10013.0, 11420.0, 11365.0, 10012.0, 11214.0, 11212.0,
  11. 10022.0, 11232.0, 11040.0, 11226.0, 10281.0, 11102.0, 11208.0,
  12. 10001.0, 10472.0, 11414.0, 11223.0, 10040.0, 11220.0, 11373.0,
  13. 11203.0, 11691.0, 11356.0, 10017.0, 10452.0, 10280.0, 11217.0,
  14. 10031.0, 11201.0, 11358.0, 10128.0, 11423.0, 10039.0, 10010.0,
  15. 11209.0, 10021.0, 10037.0, 11413.0, 11375.0, 11238.0, 10473.0,
  16. 11103.0, 11354.0, 11361.0, 11106.0, 11385.0, 10463.0, 10467.0,
  17. 11204.0, 11237.0, 11377.0, 11364.0, 11434.0, 11435.0, 11210.0,
  18. 11228.0, 11368.0, 11694.0, 10464.0, 11415.0, 10314.0, 10301.0,
  19. 10018.0, 10038.0, 11105.0, 11230.0, 10468.0, 11104.0, 10471.0,
  20. 11416.0, 10075.0, 11422.0, 11355.0, 10028.0, 10462.0, 10306.0,
  21. 10461.0, 11224.0, 11429.0, 10035.0, 11366.0, 11362.0, 11206.0,
  22. 10460.0, 10304.0, 11360.0, 11411.0, 10455.0, 10475.0, 10069.0,
  23. 10303.0, 10308.0, 10302.0, 11357.0, 10470.0, 11367.0, 11370.0,
  24. 10454.0, 10451.0, 11436.0, 11426.0, 10153.0, 11004.0, 11428.0,
  25. 11427.0, 11001.0, 11363.0, 10004.0, 10474.0, 11430.0, 10000.0,
  26. 10307.0, 11239.0, 10119.0, 10006.0, 10048.0, 11697.0, 11692.0,
  27. 11693.0, 10573.0, 83.0, 11559.0, 10020.0, 77056.0, 11776.0, 70711.0,
  28. 10282.0, 11109.0, 10044.0, '10452', '11233', '10468', '10310',
  29. '11105', '10462', '10029', '10301', '10457', '10467', '10469',
  30. '11225', '10035', '10031', '11226', '10454', '11221', '10025',
  31. '11229', '11235', '11422', '10472', '11208', '11102', '10032',
  32. '11216', '10473', '10463', '11213', '10040', '10302', '11231',
  33. '10470', '11204', '11104', '11212', '10466', '11416', '11214',
  34. '10009', '11692', '11385', '11423', '11201', '10024', '11435',
  35. '10312', '10030', '11106', '10033', '10303', '11215', '11222',
  36. '11354', '10016', '10034', '11420', '10304', '10019', '11237',
  37. '11249', '11230', '11372', '11207', '11378', '11419', '11361',
  38. '10011', '11357', '10012', '11358', '10003', '10002', '11374',
  39. '10007', '11234', '10065', '11369', '11434', '11205', '11206',
  40. '11415', '11236', '11218', '11413', '10458', '11101', '10306',
  41. '11355', '10023', '11368', '10314', '11421', '10010', '10018',
  42. '11223', '10455', '11377', '11433', '11375', '10037', '11209',
  43. '10459', '10128', '10014', '10282', '11373', '10451', '11238',
  44. '11211', '10038', '11694', '11203', '11691', '11232', '10305',
  45. '10021', '11228', '10036', '10001', '10017', '11217', '11219',
  46. '10308', '10465', '11379', '11414', '10460', '11417', '11220',
  47. '11366', '10027', '11370', '10309', '11412', '11356', '10456',
  48. '11432', '10022', '10013', '11367', '11040', '10026', '10475',
  49. '11210', '11364', '11426', '10471', '10119', '11224', '11418',
  50. '11429', '11365', '10461', '11239', '10039', '00083', '11411',
  51. '10075', '11004', '11360', '10453', '10028', '11430', '10307',
  52. '11103', '10004', '10069', '10005', '10474', '11428', '11436',
  53. '10020', '11001', '11362', '11693', '10464', '11427', '10044',
  54. '11363', '10006', '10000', '02061', '77092-2016', '10280', '11109',
  55. '14225', '55164-0737', '19711', '07306', '000000', 'NO CLUE',
  56. '90010', '10281', '11747', '23541', '11776', '11697', '11788',
  57. '07604', 10112.0, 11788.0, 11563.0, 11580.0, 7087.0, 11042.0,
  58. 7093.0, 11501.0, 92123.0, 0.0, 11575.0, 7109.0, 11797.0, '10803',
  59. '11716', '11722', '11549-3650', '10162', '92123', '23502', '11518',
  60. '07020', '08807', '11577', '07114', '11003', '07201', '11563',
  61. '61702', '10103', '29616-0759', '35209-3114', '11520', '11735',
  62. '10129', '11005', '41042', '11590', 6901.0, 7208.0, 11530.0,
  63. 13221.0, 10954.0, 11735.0, 10103.0, 7114.0, 11111.0, 10107.0], dtype=object)

7.3 修复nan值和字符串/浮点混淆

我们可以将na_values选项传递到pd.read_csv来清理它们。 我们还可以指定Incident Zip的类型是字符串,而不是浮点。

  1. na_values = ['NO CLUE', 'N/A', '0']
  2. requests = pd.read_csv('../data/311-service-requests.csv', na_values=na_values, dtype={'Incident Zip': str})
  1. requests['Incident Zip'].unique()
  1. array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',
  2. '11417', '10011', '11225', '11218', '10003', '10029', '10466',
  3. '11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',
  4. '10305', '10312', '10026', '10309', '10036', '11433', '11235',
  5. '11213', '11379', '11101', '10014', '11231', '11234', '10457',
  6. '10459', '10465', '11207', '10002', '10034', '11233', '10453',
  7. '10456', '10469', '11374', '11221', '11421', '11215', '10007',
  8. '10019', '11205', '11418', '11369', '11249', '10005', '10009',
  9. '11211', '11412', '10458', '11229', '10065', '10030', '11222',
  10. '10024', '10013', '11420', '11365', '10012', '11214', '11212',
  11. '10022', '11232', '11040', '11226', '10281', '11102', '11208',
  12. '10001', '10472', '11414', '11223', '10040', '11220', '11373',
  13. '11203', '11691', '11356', '10017', '10452', '10280', '11217',
  14. '10031', '11201', '11358', '10128', '11423', '10039', '10010',
  15. '11209', '10021', '10037', '11413', '11375', '11238', '10473',
  16. '11103', '11354', '11361', '11106', '11385', '10463', '10467',
  17. '11204', '11237', '11377', '11364', '11434', '11435', '11210',
  18. '11228', '11368', '11694', '10464', '11415', '10314', '10301',
  19. '10018', '10038', '11105', '11230', '10468', '11104', '10471',
  20. '11416', '10075', '11422', '11355', '10028', '10462', '10306',
  21. '10461', '11224', '11429', '10035', '11366', '11362', '11206',
  22. '10460', '10304', '11360', '11411', '10455', '10475', '10069',
  23. '10303', '10308', '10302', '11357', '10470', '11367', '11370',
  24. '10454', '10451', '11436', '11426', '10153', '11004', '11428',
  25. '11427', '11001', '11363', '10004', '10474', '11430', '10000',
  26. '10307', '11239', '10119', '10006', '10048', '11697', '11692',
  27. '11693', '10573', '00083', '11559', '10020', '77056', '11776',
  28. '70711', '10282', '11109', '10044', '02061', '77092-2016', '14225',
  29. '55164-0737', '19711', '07306', '000000', '90010', '11747', '23541',
  30. '11788', '07604', '10112', '11563', '11580', '07087', '11042',
  31. '07093', '11501', '92123', '00000', '11575', '07109', '11797',
  32. '10803', '11716', '11722', '11549-3650', '10162', '23502', '11518',
  33. '07020', '08807', '11577', '07114', '11003', '07201', '61702',
  34. '10103', '29616-0759', '35209-3114', '11520', '11735', '10129',
  35. '11005', '41042', '11590', '06901', '07208', '11530', '13221',
  36. '10954', '11111', '10107'], dtype=object)

7.4 短横线处发生了什么

  1. rows_with_dashes = requests['Incident Zip'].str.contains('-').fillna(False)
  2. len(requests[rows_with_dashes])
  1. 5
  1. requests[rows_with_dashes]
Unique KeyCreated DateClosed DateAgencyAgency NameComplaint TypeDescriptorLocation TypeIncident ZipIncident AddressStreet NameCross Street 1Cross Street 2Intersection Street 1Intersection Street 2Address TypeCityLandmarkFacility TypeStatusDue DateResolution Action Updated DateCommunity BoardBoroughX Coordinate (State Plane)Y Coordinate (State Plane)Park Facility NamePark BoroughSchool NameSchool NumberSchool RegionSchool CodeSchool Phone NumberSchool AddressSchool CitySchool StateSchool ZipSchool Not FoundSchool or Citywide ComplaintVehicle TypeTaxi Company BoroughTaxi Pick Up LocationBridge Highway NameBridge Highway DirectionRoad RampBridge Highway SegmentGarage Lot NameFerry DirectionFerry Terminal NameLatitudeLongitudeLocation
291362655055110/24/2013 06:16:34 PMNaNDCADepartment of Consumer AffairsConsumer ComplaintFalse AdvertisingNaN77092-20162700 EAST SELTICE WAYEAST SELTICE WAYNaNNaNNaNNaNNaNHOUSTONNaNNaNAssigned11/13/2013 11:15:20 AM10/29/2013 11:16:16 AM0 UnspecifiedUnspecifiedNaNNaNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
309392654883110/24/2013 09:35:10 AMNaNDCADepartment of Consumer AffairsConsumer ComplaintHarassmentNaN55164-0737P.O. BOX 6443764437NaNNaNNaNNaNNaNST. PAULNaNNaNAssigned11/13/2013 02:30:21 PM10/29/2013 02:31:06 PM0 UnspecifiedUnspecifiedNaNNaNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
705392648841710/15/2013 03:40:33 PMNaNTLCTaxi and Limousine CommissionTaxi ComplaintDriver ComplaintStreet11549-3650365 HOFSTRA UNIVERSITYHOFSTRA UNIVERSITYNaNNaNNaNNaNNaNHEMSTEADNaNNaNAssigned11/30/2013 01:20:33 PM10/16/2013 01:21:39 PM0 UnspecifiedUnspecifiedNaNNaNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNLa Guardia AirportNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
858212646829610/10/2013 12:36:43 PM10/26/2013 01:07:07 AMDCADepartment of Consumer AffairsConsumer ComplaintDebt Not OwedNaN29616-0759PO BOX 25759BOX 25759NaNNaNNaNNaNNaNGREENVILLENaNNaNClosed10/26/2013 09:20:28 AM10/26/2013 01:07:07 AM0 UnspecifiedUnspecifiedNaNNaNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
893042646113710/09/2013 05:23:46 PM10/25/2013 01:06:41 AMDCADepartment of Consumer AffairsConsumer ComplaintHarassmentNaN35209-3114600 BEACON PKWYBEACON PKWYNaNNaNNaNNaNNaNBIRMINGHAMNaNNaNClosed10/25/2013 02:43:42 PM10/25/2013 01:06:41 AM0 UnspecifiedUnspecifiedNaNNaNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

我认为这些都是缺失的数据,像这样删除它们:

  1. requests['Incident Zip'][rows_with_dashes] = np.nan

但是我的朋友 Dave 指出,9 位邮政编码是正常的。 让我们看看所有超过 5 位数的邮政编码,确保它们没问题,然后截断它们。

  1. long_zip_codes = requests['Incident Zip'].str.len() > 5
  2. requests['Incident Zip'][long_zip_codes].unique()
  1. array(['77092-2016', '55164-0737', '000000', '11549-3650', '29616-0759',
  2. '35209-3114'], dtype=object)

这些看起来可以截断:

  1. requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)

就可以了。

早些时候我认为 00083 是一个损坏的邮政编码,但事实证明中央公园的邮政编码是 00083! 显示我知道的吧。 我仍然关心 00000 邮政编码,但是:让我们看看。

  1. requests[requests['Incident Zip'] == '00000']
Unique KeyCreated DateClosed DateAgencyAgency NameComplaint TypeDescriptorLocation TypeIncident ZipIncident AddressStreet NameCross Street 1Cross Street 2Intersection Street 1Intersection Street 2Address TypeCityLandmarkFacility TypeStatusDue DateResolution Action Updated DateCommunity BoardBoroughX Coordinate (State Plane)Y Coordinate (State Plane)Park Facility NamePark BoroughSchool NameSchool NumberSchool RegionSchool CodeSchool Phone NumberSchool AddressSchool CitySchool StateSchool ZipSchool Not FoundSchool or Citywide ComplaintVehicle TypeTaxi Company BoroughTaxi Pick Up LocationBridge Highway NameBridge Highway DirectionRoad RampBridge Highway SegmentGarage Lot NameFerry DirectionFerry Terminal NameLatitudeLongitudeLocation
426002652931310/22/2013 02:51:06 PMNaNTLCTaxi and Limousine CommissionTaxi ComplaintDriver ComplaintNaN00000EWR EWREWRNaNNaNNaNNaNNaNNEWARKNaNNaNAssigned12/07/2013 09:53:51 AM10/23/2013 09:54:43 AM0 UnspecifiedUnspecifiedNaNNaNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNOtherNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
608432650738910/17/2013 05:48:44 PMNaNTLCTaxi and Limousine CommissionTaxi ComplaintDriver ComplaintStreet000001 NEWARK AIRPORTNEWARK AIRPORTNaNNaNNaNNaNNaNNEWARKNaNNaNAssigned12/02/2013 11:59:46 AM10/18/2013 12:01:08 PM0 UnspecifiedUnspecifiedNaNNaNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNOtherNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

这看起来对我来说很糟糕,让我将它们设为NaN

  1. zero_zips = requests['Incident Zip'] == '00000'
  2. requests.loc[zero_zips, 'Incident Zip'] = np.nan

太棒了,让我们看看现在在哪里。

  1. unique_zips = requests['Incident Zip'].unique()
  2. unique_zips.sort()
  3. unique_zips
  1. array([nan, '00083', '02061', '06901', '07020', '07087', '07093', '07109',
  2. '07114', '07201', '07208', '07306', '07604', '08807', '10000',
  3. '10001', '10002', '10003', '10004', '10005', '10006', '10007',
  4. '10009', '10010', '10011', '10012', '10013', '10014', '10016',
  5. '10017', '10018', '10019', '10020', '10021', '10022', '10023',
  6. '10024', '10025', '10026', '10027', '10028', '10029', '10030',
  7. '10031', '10032', '10033', '10034', '10035', '10036', '10037',
  8. '10038', '10039', '10040', '10044', '10048', '10065', '10069',
  9. '10075', '10103', '10107', '10112', '10119', '10128', '10129',
  10. '10153', '10162', '10280', '10281', '10282', '10301', '10302',
  11. '10303', '10304', '10305', '10306', '10307', '10308', '10309',
  12. '10310', '10312', '10314', '10451', '10452', '10453', '10454',
  13. '10455', '10456', '10457', '10458', '10459', '10460', '10461',
  14. '10462', '10463', '10464', '10465', '10466', '10467', '10468',
  15. '10469', '10470', '10471', '10472', '10473', '10474', '10475',
  16. '10573', '10803', '10954', '11001', '11003', '11004', '11005',
  17. '11040', '11042', '11101', '11102', '11103', '11104', '11105',
  18. '11106', '11109', '11111', '11201', '11203', '11204', '11205',
  19. '11206', '11207', '11208', '11209', '11210', '11211', '11212',
  20. '11213', '11214', '11215', '11216', '11217', '11218', '11219',
  21. '11220', '11221', '11222', '11223', '11224', '11225', '11226',
  22. '11228', '11229', '11230', '11231', '11232', '11233', '11234',
  23. '11235', '11236', '11237', '11238', '11239', '11249', '11354',
  24. '11355', '11356', '11357', '11358', '11360', '11361', '11362',
  25. '11363', '11364', '11365', '11366', '11367', '11368', '11369',
  26. '11370', '11372', '11373', '11374', '11375', '11377', '11378',
  27. '11379', '11385', '11411', '11412', '11413', '11414', '11415',
  28. '11416', '11417', '11418', '11419', '11420', '11421', '11422',
  29. '11423', '11426', '11427', '11428', '11429', '11430', '11432',
  30. '11433', '11434', '11435', '11436', '11501', '11518', '11520',
  31. '11530', '11549', '11559', '11563', '11575', '11577', '11580',
  32. '11590', '11691', '11692', '11693', '11694', '11697', '11716',
  33. '11722', '11735', '11747', '11776', '11788', '11797', '13221',
  34. '14225', '19711', '23502', '23541', '29616', '35209', '41042',
  35. '55164', '61702', '70711', '77056', '77092', '90010', '92123'], dtype=object)

太棒了! 这更加干净。 虽然这里有一些奇怪的东西 - 我在谷歌地图上查找 77056,这是在德克萨斯州。

让我们仔细看看:

  1. zips = requests['Incident Zip']
  2. # Let's say the zips starting with '0' and '1' are okay, for now. (this isn't actually true -- 13221 is in Syracuse, and why?)
  3. is_close = zips.str.startswith('0') | zips.str.startswith('1')
  4. # There are a bunch of NaNs, but we're not interested in them right now, so we'll say they're False
  5. is_far = ~(is_close) & zips.notnull()
  1. zips[is_far]
  1. 12102 77056
  2. 13450 70711
  3. 29136 77092
  4. 30939 55164
  5. 44008 90010
  6. 47048 23541
  7. 57636 92123
  8. 71001 92123
  9. 71834 23502
  10. 80573 61702
  11. 85821 29616
  12. 89304 35209
  13. 94201 41042
  14. Name: Incident Zip, dtype: object
  1. requests[is_far][['Incident Zip', 'Descriptor', 'City']].sort('Incident Zip')
Incident ZipDescriptorCity
7183423502HarassmentNORFOLK
4704823541HarassmentNORFOLK
8582129616Debt Not OwedGREENVILLE
8930435209HarassmentBIRMINGHAM
9420141042HarassmentFLORENCE
3093955164HarassmentST. PAUL
8057361702Billing DisputeBLOOMIGTON
1345070711Contract DisputeCLIFTON
1210277056Debt Not OwedHOUSTON
2913677092False AdvertisingHOUSTON
4400890010Billing DisputeLOS ANGELES
5763692123HarassmentSAN DIEGO
7100192123Billing DisputeSAN DIEGO

好吧,真的有来自 LA 和休斯敦的请求! 很高兴知道它们。 按邮政编码过滤可能是处理它的一个糟糕的方式 - 我们真的应该看着城市。

  1. requests['City'].str.upper().value_counts()
  1. BROOKLYN 31662
  2. NEW YORK 22664
  3. BRONX 18438
  4. STATEN ISLAND 4766
  5. JAMAICA 2246
  6. FLUSHING 1803
  7. ASTORIA 1568
  8. RIDGEWOOD 1073
  9. CORONA 707
  10. OZONE PARK 693
  11. LONG ISLAND CITY 678
  12. FAR ROCKAWAY 652
  13. ELMHURST 647
  14. WOODSIDE 609
  15. EAST ELMHURST 562
  16. ...
  17. MELVILLE 1
  18. PORT JEFFERSON STATION 1
  19. NORWELL 1
  20. EAST ROCKAWAY 1
  21. BIRMINGHAM 1
  22. ROSLYN 1
  23. LOS ANGELES 1
  24. MINEOLA 1
  25. JERSEY CITY 1
  26. ST. PAUL 1
  27. CLIFTON 1
  28. COL.ANVURES 1
  29. EDGEWATER 1
  30. ROSELYN 1
  31. CENTRAL ISLIP 1
  32. Length: 100, dtype: int64

看起来这些是合法的投诉,所以我们只是把它们放在一边。

7.5 把它们放到一起

这里是我们最后所做的事情,用于清理我们的邮政编码,都在一起:

  1. na_values = ['NO CLUE', 'N/A', '0']
  2. requests = pd.read_csv('../data/311-service-requests.csv',
  3. na_values=na_values,
  4. dtype={'Incident Zip': str})
  1. def fix_zip_codes(zips):
  2. # Truncate everything to length 5
  3. zips = zips.str.slice(0, 5)
  4. # Set 00000 zip codes to nan
  5. zero_zips = zips == '00000'
  6. zips[zero_zips] = np.nan
  7. return zips
  1. requests['Incident Zip'] = fix_zip_codes(requests['Incident Zip'])
  1. requests['Incident Zip'].unique()
  1. array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',
  2. '11417', '10011', '11225', '11218', '10003', '10029', '10466',
  3. '11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',
  4. '10305', '10312', '10026', '10309', '10036', '11433', '11235',
  5. '11213', '11379', '11101', '10014', '11231', '11234', '10457',
  6. '10459', '10465', '11207', '10002', '10034', '11233', '10453',
  7. '10456', '10469', '11374', '11221', '11421', '11215', '10007',
  8. '10019', '11205', '11418', '11369', '11249', '10005', '10009',
  9. '11211', '11412', '10458', '11229', '10065', '10030', '11222',
  10. '10024', '10013', '11420', '11365', '10012', '11214', '11212',
  11. '10022', '11232', '11040', '11226', '10281', '11102', '11208',
  12. '10001', '10472', '11414', '11223', '10040', '11220', '11373',
  13. '11203', '11691', '11356', '10017', '10452', '10280', '11217',
  14. '10031', '11201', '11358', '10128', '11423', '10039', '10010',
  15. '11209', '10021', '10037', '11413', '11375', '11238', '10473',
  16. '11103', '11354', '11361', '11106', '11385', '10463', '10467',
  17. '11204', '11237', '11377', '11364', '11434', '11435', '11210',
  18. '11228', '11368', '11694', '10464', '11415', '10314', '10301',
  19. '10018', '10038', '11105', '11230', '10468', '11104', '10471',
  20. '11416', '10075', '11422', '11355', '10028', '10462', '10306',
  21. '10461', '11224', '11429', '10035', '11366', '11362', '11206',
  22. '10460', '10304', '11360', '11411', '10455', '10475', '10069',
  23. '10303', '10308', '10302', '11357', '10470', '11367', '11370',
  24. '10454', '10451', '11436', '11426', '10153', '11004', '11428',
  25. '11427', '11001', '11363', '10004', '10474', '11430', '10000',
  26. '10307', '11239', '10119', '10006', '10048', '11697', '11692',
  27. '11693', '10573', '00083', '11559', '10020', '77056', '11776',
  28. '70711', '10282', '11109', '10044', '02061', '77092', '14225',
  29. '55164', '19711', '07306', '90010', '11747', '23541', '11788',
  30. '07604', '10112', '11563', '11580', '07087', '11042', '07093',
  31. '11501', '92123', '11575', '07109', '11797', '10803', '11716',
  32. '11722', '11549', '10162', '23502', '11518', '07020', '08807',
  33. '11577', '07114', '11003', '07201', '61702', '10103', '29616',
  34. '35209', '11520', '11735', '10129', '11005', '41042', '11590',
  35. '06901', '07208', '11530', '13221', '10954', '11111', '10107'], dtype=object)