
Pandas Cookbook - Chapter 2

Edited by 小牛编辑
2023-12-01
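    # The usual preamble
    import pandas as pd
    import matplotlib.pyplot as plt

    # Make the graphs a bit bigger and prettier.
    # The 'display.mpl_style' and 'display.line_width' options no longer exist
    # in current pandas; the matplotlib settings below approximate them.
    plt.style.use('ggplot')
    plt.rcParams['figure.figsize'] = (15, 5)
    pd.set_option('display.width', 5000)
    pd.set_option('display.max_columns', 60)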

We're going to use a new dataset here to demonstrate how to deal with larger datasets. It is a subset of the 311 service requests published by NYC Open Data.

    complaints = pd.read_csv('../data/311-service-requests.csv')
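
When a file is large, it can help to read only part of it first. The sketch below is an illustration rather than part of the original walkthrough: it parses a tiny inline CSV (a hypothetical miniature of the 311 file, built from a few values shown later in this chapter) and uses the `usecols`, `nrows`, and `dtype` parameters of `pd.read_csv` to limit what gets loaded.

```python
import io

import pandas as pd

# Hypothetical miniature of the 311 file; the real CSV has 52 columns.
csv_text = """Unique Key,Complaint Type,Borough,Incident Zip
26589651,Noise - Street/Sidewalk,QUEENS,11432
26593698,Illegal Parking,QUEENS,11378
26594139,Noise - Commercial,MANHATTAN,10032
26595721,Noise - Vehicle,MANHATTAN,10023
"""

# usecols limits which columns are parsed, nrows limits how many rows are read,
# and dtype=str keeps ZIP codes as text instead of letting pandas guess a type.
sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Complaint Type", "Borough", "Incident Zip"],
    nrows=3,
    dtype=str,
)

print(sample.shape)
print(sample)
```

With the real file, the same parameters could be passed alongside the path `'../data/311-service-requests.csv'`.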

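2.1 What's even in it? (the summary)

In the older pandas version used here, looking at a large dataframe shows a summary instead of the dataframe's contents. The summary lists all the columns and how many non-null values each one contains. Recent pandas versions print a truncated table instead; calling complaints.info() produces a comparable summary.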

    complaints

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 111069 entries, 0 to 111068
    Data columns (total 52 columns):
    Unique Key                        111069 non-null values
    Created Date                      111069 non-null values
    Closed Date                       60270 non-null values
    Agency                            111069 non-null values
    Agency Name                       111069 non-null values
    Complaint Type                    111069 non-null values
    Descriptor                        111068 non-null values
    Location Type                     79048 non-null values
    Incident Zip                      98813 non-null values
    Incident Address                  84441 non-null values
    Street Name                       84438 non-null values
    Cross Street 1                    84728 non-null values
    Cross Street 2                    84005 non-null values
    Intersection Street 1             19364 non-null values
    Intersection Street 2             19366 non-null values
    Address Type                      102247 non-null values
    City                              98860 non-null values
    Landmark                          95 non-null values
    Facility Type                     110938 non-null values
    Status                            111069 non-null values
    Due Date                          39239 non-null values
    Resolution Action Updated Date    96507 non-null values
    Community Board                   111069 non-null values
    Borough                           111069 non-null values
    X Coordinate (State Plane)        98143 non-null values
    Y Coordinate (State Plane)        98143 non-null values
    Park Facility Name                111069 non-null values
    Park Borough                      111069 non-null values
    School Name                       111069 non-null values
    School Number                     111052 non-null values
    School Region                     110524 non-null values
    School Code                       110524 non-null values
    School Phone Number               111069 non-null values
    School Address                    111069 non-null values
    School City                       111069 non-null values
    School State                      111069 non-null values
    School Zip                        111069 non-null values
    School Not Found                  38984 non-null values
    School or Citywide Complaint      0 non-null values
    Vehicle Type                      99 non-null values
    Taxi Company Borough              117 non-null values
    Taxi Pick Up Location             1059 non-null values
    Bridge Highway Name               185 non-null values
    Bridge Highway Direction          185 non-null values
    Road Ramp                         184 non-null values
    Bridge Highway Segment            223 non-null values
    Garage Lot Name                   49 non-null values
    Ferry Direction                   37 non-null values
    Ferry Terminal Name               336 non-null values
    Latitude                          98143 non-null values
    Longitude                         98143 non-null values
    Location                          98143 non-null values
    dtypes: float64(5), int64(1), object(46)
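
The per-column non-null summary can also be requested explicitly. The following is a minimal sketch on a small made-up dataframe (a hypothetical stand-in for `complaints`, not the real data): `DataFrame.info()` writes the summary, and `DataFrame.count()` returns the non-null counts as a Series.

```python
import io

import pandas as pd

# Hypothetical stand-in for `complaints`, with some missing values.
df = pd.DataFrame({
    "Complaint Type": ["Noise - Commercial", "Rodent", None],
    "Closed Date": [None, "10/31/2013 02:40:32 AM", None],
})

# info() writes the summary to stdout by default; here it goes to a buffer
# so the text can be inspected or logged.
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()
print(summary)

# count() gives the number of non-null values in each column.
non_null = df.count()
print(non_null)
```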

2.2 Selecting columns and rows

To select a column, we index with the name of the column, like this:

    complaints['Complaint Type']

    0         Noise - Street/Sidewalk
    1         Illegal Parking
    2         Noise - Commercial
    3         Noise - Vehicle
    4         Rodent
    5         Noise - Commercial
    6         Blocked Driveway
    7         Noise - Commercial
    8         Noise - Commercial
    9         Noise - Commercial
    10        Noise - House of Worship
    11        Noise - Commercial
    12        Illegal Parking
    13        Noise - Vehicle
    14        Rodent
    ...
    111054    Noise - Street/Sidewalk
    111055    Noise - Commercial
    111056    Street Sign - Missing
    111057    Noise
    111058    Noise - Commercial
    111059    Noise - Street/Sidewalk
    111060    Noise
    111061    Noise - Commercial
    111062    Water System
    111063    Water System
    111064    Maintenance or Facility
    111065    Illegal Parking
    111066    Noise - Street/Sidewalk
    111067    Noise - Commercial
    111068    Blocked Driveway
    Name: Complaint Type, Length: 111069, dtype: object

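To get the first 5 rows of a dataframe, we can use a slice: df[:5].

This is a great way to get a sense of what kind of information is in the dataframe. Take a minute to look at the contents and get a feel for this dataset.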

    complaints[:5]

 | Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | Street Name | Cross Street 1 | Cross Street 2 | Intersection Street 1 | Intersection Street 2 | Address Type | City | Landmark | Facility Type | Status | Due Date | Resolution Action Updated Date | Community Board | Borough | X Coordinate (State Plane) | Y Coordinate (State Plane) | Park Facility Name | Park Borough | School Name | School Number | School Region | School Code | School Phone Number | School Address | School City | School State | School Zip | School Not Found | School or Citywide Complaint | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Garage Lot Name | Ferry Direction | Ferry Terminal Name | Latitude | Longitude | Location
0 | 26589651 | 10/31/2013 02:08:41 AM | NaN | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Talking | Street/Sidewalk | 11432 | 90-03 169 STREET | 169 STREET | 90 AVENUE | 91 AVENUE | NaN | NaN | ADDRESS | JAMAICA | NaN | Precinct | Assigned | 10/31/2013 10:08:41 AM | 10/31/2013 02:35:17 AM | 12 QUEENS | QUEENS | 1042027 | 197389 | Unspecified | QUEENS | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.708275 | -73.791604 | (40.70827532593202, -73.79160395779721)
1 | 26593698 | 10/31/2013 02:01:04 AM | NaN | NYPD | New York City Police Department | Illegal Parking | Commercial Overnight Parking | Street/Sidewalk | 11378 | 58 AVENUE | 58 AVENUE | 58 PLACE | 59 STREET | NaN | NaN | BLOCKFACE | MASPETH | NaN | Precinct | Open | 10/31/2013 10:01:04 AM | NaN | 05 QUEENS | QUEENS | 1009349 | 201984 | Unspecified | QUEENS | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.721041 | -73.909453 | (40.721040535628305, -73.90945306791765)
2 | 26594139 | 10/31/2013 02:00:24 AM | 10/31/2013 02:40:32 AM | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10032 | 4060 BROADWAY | BROADWAY | WEST 171 STREET | WEST 172 STREET | NaN | NaN | ADDRESS | NEW YORK | NaN | Precinct | Closed | 10/31/2013 10:00:24 AM | 10/31/2013 02:39:42 AM | 12 MANHATTAN | MANHATTAN | 1001088 | 246531 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.843330 | -73.939144 | (40.84332975466513, -73.93914371913482)
3 | 26595721 | 10/31/2013 01:56:23 AM | 10/31/2013 02:21:48 AM | NYPD | New York City Police Department | Noise - Vehicle | Car/Truck Horn | Street/Sidewalk | 10023 | WEST 72 STREET | WEST 72 STREET | COLUMBUS AVENUE | AMSTERDAM AVENUE | NaN | NaN | BLOCKFACE | NEW YORK | NaN | Precinct | Closed | 10/31/2013 09:56:23 AM | 10/31/2013 02:21:10 AM | 07 MANHATTAN | MANHATTAN | 989730 | 222727 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.778009 | -73.980213 | (40.7780087446372, -73.98021349023975)
4 | 26590930 | 10/31/2013 01:53:44 AM | NaN | DOHMH | Department of Health and Mental Hygiene | Rodent | Condition Attracting Rodents | Vacant Lot | 10027 | WEST 124 STREET | WEST 124 STREET | LENOX AVENUE | ADAM CLAYTON POWELL JR BOULEVARD | NaN | NaN | BLOCKFACE | NEW YORK | NaN | N/A | Pending | 11/30/2013 01:53:44 AM | 10/31/2013 01:59:54 AM | 10 MANHATTAN | MANHATTAN | 998815 | 233545 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.807691 | -73.947387 | (40.80769092704951,

We can combine these to get the first 5 rows of a column:

    complaints['Complaint Type'][:5]

    0    Noise - Street/Sidewalk
    1    Illegal Parking
    2    Noise - Commercial
    3    Noise - Vehicle
    4    Rodent
    Name: Complaint Type, dtype: object

and the order in which we apply the two selections doesn't matter:

    complaints[:5]['Complaint Type']

    0    Noise - Street/Sidewalk
    1    Illegal Parking
    2    Noise - Commercial
    3    Noise - Vehicle
    4    Rodent
    Name: Complaint Type, dtype: object
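
The same "first five rows of one column" selection can be written in several ways. The sketch below uses a small made-up dataframe (a hypothetical stand-in for `complaints`) and compares chained indexing with `.head()`, `.iloc`, and `.loc`. Note that `.loc` slices by label and includes the end label, so `:4` covers five rows when the index is the default 0, 1, 2, ...

```python
import pandas as pd

# Hypothetical stand-in for `complaints`, using values shown in this chapter.
df = pd.DataFrame({
    "Complaint Type": ["Noise - Street/Sidewalk", "Illegal Parking",
                       "Noise - Commercial", "Noise - Vehicle",
                       "Rodent", "Noise - Commercial"],
    "Borough": ["QUEENS", "QUEENS", "MANHATTAN", "MANHATTAN",
                "MANHATTAN", "QUEENS"],
})

by_slice = df["Complaint Type"][:5]        # column first, then rows
by_rows_first = df[:5]["Complaint Type"]   # rows first, then column
by_head = df["Complaint Type"].head(5)     # head() is a named shortcut for [:5]
by_iloc = df.iloc[:5, df.columns.get_loc("Complaint Type")]  # position-based
by_loc = df.loc[:4, "Complaint Type"]      # label-based, end label included

print(by_slice)
```

Selecting rows and columns in a single `.loc` or `.iloc` call avoids chaining two indexing steps, which matters mostly when assigning values rather than reading them.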

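2.3 Selecting multiple columns

What if we just want to know the complaint type and the borough, but not the rest of the information? Pandas makes it really easy to select a subset of the columns: just index with a list of the columns you want.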

    complaints[['Complaint Type', 'Borough']]

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 111069 entries, 0 to 111068
    Data columns (total 2 columns):
    Complaint Type    111069 non-null values
    Borough           111069 non-null values
    dtypes: object(2)

That shows us a summary; to see actual values, we can take the first 10 rows:

    complaints[['Complaint Type', 'Borough']][:10]

        Complaint Type             Borough
    0   Noise - Street/Sidewalk    QUEENS
    1   Illegal Parking            QUEENS
    2   Noise - Commercial         MANHATTAN
    3   Noise - Vehicle            MANHATTAN
    4   Rodent                     MANHATTAN
    5   Noise - Commercial         QUEENS
    6   Blocked Driveway           QUEENS
    7   Noise - Commercial         QUEENS
    8   Noise - Commercial         MANHATTAN
    9   Noise - Commercial         BROOKLYN
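
One detail worth illustrating: indexing with a single name returns a Series, while indexing with a list of names returns a DataFrame, even when the list has only one entry. This is a minimal sketch on a made-up dataframe (a hypothetical stand-in for `complaints`).

```python
import pandas as pd

# Hypothetical stand-in for `complaints`, using values shown in this chapter.
df = pd.DataFrame({
    "Complaint Type": ["Noise - Street/Sidewalk", "Illegal Parking", "Rodent"],
    "Borough": ["QUEENS", "QUEENS", "MANHATTAN"],
    "Status": ["Assigned", "Open", "Pending"],
})

one_column = df["Borough"]                           # a Series
one_column_frame = df[["Borough"]]                   # a DataFrame with one column
subset = df[["Complaint Type", "Borough"]].head(2)   # two columns, first two rows

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(subset)
```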

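2.4 What's the most common complaint type?

This is a really easy question to answer: there's a .value_counts() method we can use, which counts how often each value occurs and sorts the result from most to least frequent.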

    complaints['Complaint Type'].value_counts()

    HEATING                              14200
    GENERAL CONSTRUCTION                  7471
    Street Light Condition                7117
    DOF Literature Request                5797
    PLUMBING                              5373
    PAINT - PLASTER                       5149
    Blocked Driveway                      4590
    NONCONST                              3998
    Street Condition                      3473
    Illegal Parking                       3343
    Noise                                 3321
    Traffic Signal Condition              3145
    Dirty Conditions                      2653
    Water System                          2636
    Noise - Commercial                    2578
    ...
    Opinion for the Mayor                    2
    Window Guard                             2
    DFTA Literature Request                  2
    Legal Services Provider Complaint        2
    Open Flame Permit                        1
    Snow                                     1
    Municipal Parking Facility               1
    X-Ray Machine/Equipment                  1
    Stalled Sites                            1
    DHS Income Savings Requirement           1
    Tunnel Condition                         1
    Highway Sign - Damaged                   1
    Ferry Permit                             1
    Trans Fat                                1
    DWD                                      1
    Length: 165, dtype: int64

If we just want the 10 most common complaint types, we can slice the result:

    complaint_counts = complaints['Complaint Type'].value_counts()
    complaint_counts[:10]

    HEATING                   14200
    GENERAL CONSTRUCTION       7471
    Street Light Condition     7117
    DOF Literature Request     5797
    PLUMBING                   5373
    PAINT - PLASTER            5149
    Blocked Driveway           4590
    NONCONST                   3998
    Street Condition           3473
    Illegal Parking            3343
    dtype: int64
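
As a small illustration of how this works, the sketch below runs `value_counts()` on a short made-up Series (hypothetical data, not the real counts). It also shows `head(n)` as an alternative to slicing, and `normalize=True`, which returns proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical sample of complaint types.
types = pd.Series(
    ["HEATING", "Noise", "HEATING", "PLUMBING", "Noise", "HEATING"],
    name="Complaint Type",
)

counts = types.value_counts()                 # sorted from most to least frequent
top2 = counts.head(2)                         # same rows as counts[:2]
shares = types.value_counts(normalize=True)   # fraction of all rows per value

print(counts)
print(top2)
print(shares)
```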

But it gets better: we can plot them!

    complaint_counts[:10].plot(kind='bar')

    <matplotlib.axes.AxesSubplot at 0x7ba2290>

Chapter 2 - Figure 1
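
Outside a notebook, the chart has to be saved or shown explicitly. The following is a minimal sketch, assuming matplotlib is installed: it plots the three largest counts listed above as a bar chart and writes the image to an in-memory buffer. The axis labels and the non-interactive `Agg` backend are choices made for this example, not part of the original walkthrough.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd

# The three largest counts from the value_counts() output above.
counts = pd.Series({
    "HEATING": 14200,
    "GENERAL CONSTRUCTION": 7471,
    "Street Light Condition": 7117,
})

# plot() returns a matplotlib Axes object, which can be customized further.
ax = counts.plot(kind="bar", figsize=(15, 5))
ax.set_xlabel("Complaint Type")
ax.set_ylabel("Number of requests")
ax.figure.tight_layout()

# Write the PNG to a buffer; pass a file path instead to save it to disk.
png = io.BytesIO()
ax.figure.savefig(png, format="png")
print(len(ax.patches), "bars drawn")
```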