第01章 Pandas 基础

优质
小牛编辑
130浏览
2023-12-01

相较于《利用Python进行数据分析》,本书最大的特点是所有操作都变成了分解动作,而且每步都有详细讲解。但是,书写的有点啰嗦,而Jupyter Notebook又有些错。我对两者做了整合和总结。

第一遍整理完,还有许多地方不足,还要再弄。


作者 Theodore Petrou,Dunder Data 创始人

公司网址,http://www.dunderdata.com(dunder是蒸馏朗姆酒的残留液体,取这个名字是类比数据分析过程)

GitHub地址:https://github.com/tdpetrou

领英个人页面:https://www.linkedin.com/in/tedpetrou

推特:https://twitter.com/tedpetrou?lang=en

Medium博客:https://medium.com/@petrou.theodore


# 引入pandas和numpy的约定
in[1]: import pandas as pd
       import numpy as np

1. DataFrame的结构

# 设定最大列数和最大行数
 in[2]: pd.set_option('max_columns', 8, 'max_rows', 10)
# 用read_csv()方法读取csv文件
# head()方法可以查看前五行,head(n)可以查看前n行
 in[3]: movie = pd.read_csv('data/movie.csv')
       movie.head()
out[3]:

2. 访问DataFrame的组件

in[4]: # 提取列索引
       columns = movie.columns
       # 提取行索引
       index = movie.index
       # 提取数据
       data = movie.values
 in[5]: columns
out[5]: Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')
 in[6]: index
out[6]: RangeIndex(start=0, stop=4916, step=1)
 in[7]: data
out[7]: array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
       ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
       ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
       ..., 
       ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
       ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
       ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)
 in[8]: # index的类型
        type(index) # pandas.core.indexes.range.RangeIndex
out[8]: pandas.core.indexes.range.RangeIndex
 in[9]: # columns的类型
        type(columns) # pandas.core.indexes.base.Index
out[9]: pandas.core.indexes.base.Index
 in[10]: # data的类型
         type(data) # numpy.ndarray
out[10]: numpy.ndarray
 in[11]: # 判断是不是子类型
         issubclass(pd.RangeIndex, pd.Index) # True
out[11]: True

更多

 in[12]: # 访问index的值
         index.values
         # index的值是个列表,所以可以索引或切片
         index.values[0]
out[12]: array([   0,    1,    2, ..., 4913, 4914, 4915])
 in[13]: # 访问columns的值
         columns.values
out[13]: array(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'], dtype=object)

3. 理解数据类型

 in[14]: movie = pd.read_csv('data/movie.csv')
         # 各列的类型
 in[15]: movie.dtypes
out[15]: color                       object
         director_name               object
         num_critic_for_reviews     float64
         duration                   float64
         director_facebook_likes    float64
                                     ...   
         title_year                 float64
         actor_2_facebook_likes     float64
         imdb_score                 float64
         aspect_ratio               float64
         movie_facebook_likes         int64
         Length: 28, dtype: object
 in[16]: # 显示各类型的数量
         movie.get_dtype_counts()
out[16]: float64    13
         int64       3
         object     12
         dtype: int64

4. 选择一列数据,作为Series

 in[17]: movie = pd.read_csv('data/movie.csv')
 in[18]: # 选择director_name这列
         movie['director_name']
out[18]: 0           James Cameron
         1          Gore Verbinski
         2              Sam Mendes
         3       Christopher Nolan
         4             Doug Walker
                    ...        
         4911          Scott Smith
         4912                  NaN
         4913     Benjamin Roberds
         4914          Daniel Hsia
         4915             Jon Gunn
         Name: director_name, Length: 4916, dtype: object
 in[19]: # 也可以通过属性的方式选取
         movie.director_name
out[19]: 0           James Cameron
         1          Gore Verbinski
         2              Sam Mendes
         3       Christopher Nolan
         4             Doug Walker
                    ...        
         4911          Scott Smith
         4912                  NaN
         4913     Benjamin Roberds
         4914          Daniel Hsia
         4915             Jon Gunn
         Name: director_name, Length: 4916, dtype: object
# 查看类型
 in[20]: type(movie['director_name'])
out[20]: pandas.core.series.Series

更多

 in[21]: director = movie['director_name']
         # 查看选取的列的名字
         director.name
out[21]: 'director_name'
 in[22]: # 单列Series转换为DataFrame
         director.to_frame().head()
out[22]: 
    director_name
0   James Cameron
1   Gore Verbinski
2   Sam Mendes
3   Christopher Nolan
4   Doug Walker

5. 调用Series方法

准备

 in[23]: # 查看Series所有不重复的指令
         s_attr_methods = set(dir(pd.Series))
         # 该集合的大小
         len(s_attr_methods)
out[23]: 442
 in[24]: # 查看DataFrame所有不重复的指令
         df_attr_methods = set(dir(pd.DataFrame))
         len(df_attr_methods)
out[24]: 445
 in[25]: # 这两个集合中有多少共有的指令
         len(s_attr_methods & df_attr_methods)
out[25]: 376

原理

 in[26]: # 选取director和actor_1_fb_likes两列
         movie = pd.read_csv('data/movie.csv')
         director = movie['director_name']
         actor_1_fb_likes = movie['actor_1_facebook_likes']
# 查看头部
 in[27]: director.head()
out[27]: 0        James Cameron
         1       Gore Verbinski
         2           Sam Mendes
         3    Christopher Nolan
         4          Doug Walker
         Name: director_name, dtype: object
 in[28]: actor_1_fb_likes.head()
out[28]: 0     1000.0
         1    40000.0
         2    11000.0
         3    27000.0
         4      131.0
        Name: actor_1_facebook_likes, dtype: float64
 in[29]: # 分别计数
         pd.set_option('max_rows', 8)
         director.value_counts()
out[29]: Steven Spielberg    26
         Woody Allen         22
         Clint Eastwood      20
         Martin Scorsese     20
                              ..
         James Nunn           1
         Gerard Johnstone     1
         Ethan Maniquis       1
         Antony Hoffman       1
         Name: director_name, Length: 2397, dtype: int64
 in[30]: actor_1_fb_likes.value_counts()
out[30]: 1000.0     436
         11000.0    206
         2000.0     189
         3000.0     150
                  ... 
         216.0        1
         859.0        1
         225.0        1
         334.0        1
         Name: actor_1_facebook_likes, Length: 877, dtype: int64
 in[31]: director.size  
out[31]: 4916
 in[32]: director.shape 
out[33]: (4916,)
 in[33]: len(director)  
out[33]: 4916
 in[34]: # director有多少非空值
         director.count() 
out[34]: 4814 # 说明有102个缺失值
 in[35]: # actor_1_fb_likes有多少非空值
         actor_1_fb_likes.count()
out[35]: 4909
 in[36]: # actor_1_fb_likes的中位分位数
         actor_1_fb_likes.quantile()
out[36]: 982.0
 in[37]: # 求最小值、最大值、平均值、中位数、标准差、总和
         actor_1_fb_likes.min(), actor_1_fb_likes.max(), \
         actor_1_fb_likes.mean(), actor_1_fb_likes.median(), \
         actor_1_fb_likes.std(), actor_1_fb_likes.sum()
out[37]: (0.0, 640000.0, 6494.488490527602, 982.0, 15106.986883848309, 31881444.0)
 in[38]: # 打印描述信息
         actor_1_fb_likes.describe()
out[38]: count      4909.000000
         mean       6494.488491
         std       15106.986884
         min           0.000000
         25%         607.000000
         50%         982.000000
         75%       11000.000000
         max      640000.000000
         Name: actor_1_facebook_likes, dtype: float64
 in[39]: director.describe()
out[39]: count                 4814
         unique                2397
         top       Steven Spielberg
         freq                    26
         Name: director_name, dtype: object
 in[40]: actor_1_fb_likes.quantile(.2)
out[41]: 510.0
 in[41]: # 各个十分之一分位数
         actor_1_fb_likes.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])
out[41]: 0.1      240.0
         0.2      510.0
         0.3      694.0
         0.4      854.0
               ...   
         0.6     1000.0
         0.7     8000.0
         0.8    13000.0
         0.9    18000.0
         Name: actor_1_facebook_likes, Length: 9, dtype: float64
# 非空值
 In[42]: director.isnull()
Out[42]: 0       False
         1       False
         2       False
         3       False
                 ...  
         4912     True
         4913    False
         4914    False
         4915    False
         Name: director_name, Length: 4916, dtype: bool
# 填充缺失值
 In[43]: actor_1_fb_likes_filled = actor_1_fb_likes.fillna(0)
         actor_1_fb_likes_filled.count()
Out[43]: 4916
# 删除缺失值
 In[44]: actor_1_fb_likes_dropped = actor_1_fb_likes.dropna()
         actor_1_fb_likes_dropped.size
Out[44]: 4909

更多

# value_counts(normalize=True) 可以返回频率
 In[45]: director.value_counts(normalize=True)
Out[45]: Steven Spielberg    0.005401
         Woody Allen         0.004570
         Clint Eastwood      0.004155
         Martin Scorsese     0.004155
                         ...   
         James Nunn          0.000208
         Gerard Johnstone    0.000208
         Ethan Maniquis      0.000208
         Antony Hoffman      0.000208
         Name: director_name, Length: 2397, dtype: float64
# 判断是否有缺失值
 In[46]: director.hasnans
Out[46]: True
# 判断是否是非缺失值
 In[47]: director.notnull()
Out[47]: 0        True
         1        True
         2        True
         3        True
                 ...  
         4912    False
         4913     True
         4914     True
         4915     True
         Name: director_name, Length: 4916, dtype: bool

6. 在Series上使用运算符

 In[48]: pd.options.display.max_rows = 6
 In[49]: 5 + 9    # 加法
Out[49]: 14
 In[50]: 4 ** 2   # 幂运算
Out[50]: 16
 In[51]: a = 10   # 赋值
 In[52]: 5 <= 9 # 小于等于
Out[52]: True
 In[53]: 'abcde' + 'fg' # 字符串拼接
Out[53]: 'abcdefg'
 In[54]: not (5 <= 9)  # 非运算符
Out[54]: False
 In[55]: 7 in [1, 2, 6]  # in运算符
Out[55]: False
 In[56]: set([1,2,3]) & set([2,3,4])  # 求交集
Out[56]: {2, 3}
# 不支持列表和整数间的运算
 In[57]: [1, 2, 3] - 3 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-57-7ca967348b32> in <module>()
----> 1 [1, 2, 3] - 3

TypeError: unsupported operand type(s) for -: 'list' and 'int'
 In[58]: a = set([1,2,3])     
         a[0]                 # 集合不支持索引

准备

# 选取imdb_score这列
 In[59]: movie = pd.read_csv('data/movie.csv')
         imdb_score = movie['imdb_score']
         imdb_score
Out[59]: 0       7.9
         1       7.1
         2       6.8
             ... 
         4913    6.3
         4914    6.3
         4915    6.6
         Name: imdb_score, Length: 4916, dtype: float64
# 每列值加1
 In[60]: imdb_score + 1
Out[60]: 0       8.9
         1       8.1
         2       7.8
             ... 
         4913    7.3
         4914    7.3
         4915    7.6
         Name: imdb_score, Length: 4916, dtype: float64
# 每列值乘以2.5
 In[61]: imdb_score * 2.5
Out[61]: 0       19.75
         1       17.75
         2       17.00
             ...  
         4913    15.75
         4914    15.75
         4915    16.50
         Name: imdb_score, Length: 4916, dtype: float64
# 每列值除以7的余数
 In[62]: imdb_score // 7
Out[62]: 0       1.0
         1       1.0
         2       0.0
              ... 
         4913    0.0
         4914    0.0
         4915    0.0
         Name: imdb_score, Length: 4916, dtype: float64
# 判断是否大于7
 In[63]: imdb_score > 7
Out[63]: 0        True
         1        True
         2       False
               ...  
         4913    False
         4914    False
         4915    False
         Name: imdb_score, Length: 4916, dtype: bool
# 判断是否等于字符串
 In[64]: director = movie['director_name']
 In[65]: director == 'James Cameron'
Out[65]: 0        True
         1       False
         2       False
               ...  
         4913    False
         4914    False
         4915    False
         Name: director_name, Length: 4916, dtype: bool

更多

# 利用通用函数实现加法
 In[66]: imdb_score.add(1)              # imdb_score + 1
Out[66]: 0       8.9
         1       8.1
         2       7.8
              ... 
         4913    7.3
         4914    7.3
         4915    7.6
Name: imdb_score, Length: 4916, dtype: float64
# 利用通用函数实现乘法
 In[67]: imdb_score.mul(2.5)            # imdb_score * 2.5
Out[67]: 0       19.75
         1       17.75
         2       17.00
               ...  
         4913    15.75
         4914    15.75
         4915    16.50
         Name: imdb_score, Length: 4916, dtype: float64
# 利用通用函数实现底除
 In[68]: imdb_score.floordiv(7)         # imdb_score // 7
Out[68]: 0       1.0
         1       1.0
         2       0.0
              ... 
         4913    0.0
         4914    0.0
         4915    0.0
         Name: imdb_score, Length: 4916, dtype: float64
# 利用通用函数实现大于
 In[69]: imdb_score.gt(7)               # imdb_score > 7
Out[69]: 0        True
         1        True
         2       False
               ...  
        4913    False
        4914    False
        4915    False
        Name: imdb_score, Length: 4916, dtype: bool
# 利用通用函数实现等于
 In[70]: director.eq('James Cameron')   # director == 'James Cameron'
Out[70]: 0        True
         1       False
         2       False
               ...  
         4913    False
         4914    False
         4915    False
         Name: director_name, Length: 4916, dtype: bool
# 利用通用函数实现取模
 In[71]: imdb_score.astype(int).mod(5)
Out[71]: 0       2
         1       2
         2       1
              ..
         4913    1
         4914    1
         4915    1
         Name: imdb_score, Length: 4916, dtype: int64
# a是int对象
 In[72]: a = type(1)
 In[73]: type(a)
Out[73]: type
# a是pandas.core.series.Series对象
 In[74]: a = type(imdb_score)
 In[75]: a([1,2,3])
Out[75]: 0    1
         1    2
         2    3
         dtype: int64

7. 串联Series方法

# value_counts().head(3),计数,查看前三
 In[76]: movie = pd.read_csv('data/movie.csv')
         actor_1_fb_likes = movie['actor_1_facebook_likes']
         director = movie['director_name']
 In[77]: director.value_counts().head(3)
Out[77]: Steven Spielberg    26
         Woody Allen         22
         Clint Eastwood      20
         Name: director_name, dtype: int64
# 统计缺失值的数量
 In[78]: actor_1_fb_likes.isnull().sum()
Out[78]: 7
# actor_1_fb_likes的数据类型
 In[79]: actor_1_fb_likes.dtype
Out[79]: dtype('float64')
# 缺失值填充为0、转换为整型、查看前五
 In[80]: actor_1_fb_likes.fillna(0)\
                .astype(int)\
                .head()
Out[80]: 0     1000
         1    40000
         2    11000
         3    27000
         4      131
         Name: actor_1_facebook_likes, dtype: int64

更多

# 缺失值的比例
 In[81]: actor_1_fb_likes.isnull().mean()
Out[81]: 0.0014239218877135883
# 使用括号串联
 In[82]: (actor_1_fb_likes.fillna(0)
                 .astype(int)
                 .head())
Out[82]: 0     1000
         1    40000
         2    11000
         3    27000
         4      131
         Name: actor_1_facebook_likes, dtype: int64

8. 使索引有意义

# set_index()给行索引命名
 In[83]: movie = pd.read_csv('data/movie.csv')
 In[84]: movie.shape
Out[84]: (4916, 28)
 In[85]: movie2 = movie.set_index('movie_title')
         movie2
Out[85]:

# 通过index_col参数命名
 In[86]: pd.read_csv('data/movie.csv', index_col='movie_title')
Out[86]:

更多

# 复原索引
 In[87]: movie2.reset_index()

9. 重命名行名和列名

# 通过rename()重命名
 In[88]: movie = pd.read_csv('data/movie.csv', index_col='movie_title')
 In[89]: idx_rename = {'Avatar':'Ratava', 'Spectre': 'Ertceps'} 
         col_rename = {'director_name':'Director Name', 
                       'num_critic_for_reviews': 'Critical Reviews'} 
 In[90]: movie.rename(index=idx_rename, 
                      columns=col_rename).head()
Out[90]:

更多

 In[91]: movie = pd.read_csv('data/movie.csv', index_col='movie_title')
         index = movie.index
         columns = movie.columns

         index_list = index.tolist()
         column_list = columns.tolist()

         index_list[0] = 'Ratava'
         index_list[2] = 'Ertceps'
         column_list[1] = 'Director Name'
         column_list[2] = 'Critical Reviews'

 In[92]: print(index_list[:5])
         ['Ratava', "Pirates of the Caribbean: At World's End", 'Ertceps', 'The Dark Knight Rises', 'Star Wars: Episode VII - The Force Awakens']

 In[93]: print(column_list)
         ['color', 'Director Name', 'Critical Reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes']

 In[94]: movie.index = index_list
         movie.columns = column_list

 In[95]: movie.head()
Out[95]:

10. 创建、删除列

# 通过[列名]添加新列
 In[96]: movie = pd.read_csv('data/movie.csv')
 In[97]: movie['has_seen'] = 0
 In[98]: movie.columns
Out[98]: Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
                'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
                'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
                'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
                'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
                'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
                'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
                'imdb_score', 'aspect_ratio', 'movie_facebook_likes', 'has_seen'],
               dtype='object')
# 给新列赋值
 In[99]: movie['actor_director_facebook_likes'] = (movie['actor_1_facebook_likes'] + 
                                              movie['actor_2_facebook_likes'] + 
                                              movie['actor_3_facebook_likes'] + 
                                              movie['director_facebook_likes'])
 In[100]: movie['actor_director_facebook_likes'].isnull().sum()
Out[100]: 122
# 用all()检查是否所有的布尔值都为True
 In[101]: movie['actor_director_facebook_likes'] = movie['actor_director_facebook_likes'].fillna(0)
 In[102]: movie['is_cast_likes_more'] = (movie['cast_total_facebook_likes'] >= 
                                  movie['actor_director_facebook_likes'])
 In[103]: movie['is_cast_likes_more'].all()
Out[103]: False

 In[104]: movie = movie.drop('actor_director_facebook_likes', axis='columns')
 In[105]: movie['actor_total_facebook_likes'] = (movie['actor_1_facebook_likes'] + 
                                       movie['actor_2_facebook_likes'] + 
                                       movie['actor_3_facebook_likes'])

          movie['actor_total_facebook_likes'] = movie['actor_total_facebook_likes'].fillna(0)
 In[106]: movie['is_cast_likes_more'] = movie['cast_total_facebook_likes'] >= \
                                  movie['actor_total_facebook_likes']

          movie['is_cast_likes_more'].all()
Out[106]: True
 In[107]: movie['pct_actor_cast_like'] = (movie['actor_total_facebook_likes'] / 
                                movie['cast_total_facebook_likes'])
 In[108]: movie['pct_actor_cast_like'].min(), movie['pct_actor_cast_like'].max() 
Out[108]: (0.0, 1.0)

 In[109]: movie.set_index('movie_title')['pct_actor_cast_like'].head()
Out[109]: movie_title
          Avatar                                        0.577369
          Pirates of the Caribbean: At World's End      0.951396
          Spectre                                       0.987521
          The Dark Knight Rises                         0.683783
          Star Wars: Episode VII - The Force Awakens    0.000000
          Name: pct_actor_cast_like, dtype: float64

更多

# 用insert()方法原地插入列
 In[110]: profit_index = movie.columns.get_loc('gross') + 1
          profit_index
 In[111]: movie.insert(loc=profit_index,
                 column='profit',
                 value=movie['gross'] - movie['budget'])
 In[112]: movie.head()
Out[112]: