import pandas as pd
students_classes = {'Alice': 'Physics',
                    'Jack': 'Chemistry',
                    'Molly': 'English',
                    'Sam': 'History'}
s = pd.Series(students_classes)
s
Alice Physics
Jack Chemistry
Molly English
Sam History
dtype: object
s.iloc[3] #'History'; positional indexing begins at 0
s.loc['Molly'] #'English'
import numpy as np
numbers = pd.Series(np.random.randint(0,1000,10000))
%%timeit -n 100
total = 0
for number in numbers:
    total += number
total / len(numbers)
4.12 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)
214 µs ± 80.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
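A quick sanity sketch (toy values in place of the 10,000 random integers) showing that the loop and the vectorized path compute the same mean:

```python
import numpy as np
import pandas as pd

# Toy Series standing in for the random integers above
numbers = pd.Series([3, 1, 4, 1, 5, 9, 2, 6])

# Loop-based mean (the slow path)
total = 0
for number in numbers:
    total += number
loop_mean = total / len(numbers)

# Vectorized mean (the fast path)
vec_mean = np.sum(numbers) / len(numbers)

assert loop_mean == vec_mean  # both give 3.875
```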
2.1 read_csv
import pandas as pd
df = pd.read_csv('datasets/Admission_Predict.csv',index_col=0)
## important note: index_col=0 uses column 0 as the index, so pandas does not add its own.
2.2 check the column names
First, make sure the column names are correct:
new_df.columns
2.3 strip and rename
new_df = df.rename(mapper=str.strip, axis='columns').rename(columns = {'SOP':'Statement of Purpose', 'LOR': 'Letter of Recommendation'})
the first rename cleans the format by stripping whitespace from the column names; the second changes the names
note that rename does not modify the original df's column names unless you assign the result back: df = df.rename(…)
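The two-step rename can be sketched on a toy frame (the whitespace in the column names here is made up to mimic the CSV):

```python
import pandas as pd

# Hypothetical frame with trailing spaces in column names,
# mimicking the Admission_Predict.csv issue
df = pd.DataFrame({'GRE Score': [320], 'SOP ': [4.5], 'LOR ': [4.0]})

new_df = (df.rename(mapper=str.strip, axis='columns')
            .rename(columns={'SOP': 'Statement of Purpose',
                             'LOR': 'Letter of Recommendation'}))

list(new_df.columns)
# ['GRE Score', 'Statement of Purpose', 'Letter of Recommendation']
```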
df = pd.read_csv('Admission_Predict.csv', index_col=0)
df = df.set_index('Chance of Admit ') #note the trailing space in the raw column name
df = df.reset_index()
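A round-trip sketch on a toy frame (column names assumed from the dataset above); note that set_index discards the current index, so reset_index first if you want to keep it:

```python
import pandas as pd

df = pd.DataFrame({'Serial No.': [1, 2],
                   'Chance of Admit': [0.92, 0.76]}).set_index('Serial No.')

indexed = df.set_index('Chance of Admit')  # promotes the column; the old index is discarded
restored = indexed.reset_index()           # demotes the index back to a regular column

assert list(indexed.index) == [0.92, 0.76]
assert list(restored.columns) == ['Chance of Admit']
```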
a multi-level index is often found when dealing with geographical data sorted by region
df = pd.read_csv('census.csv')
df['SUMLEV'].unique()
df = df[ df['SUMLEV'] == 50 ]
columns_to_keep = ['STNAME', 'CTYNAME', ...]
df = df[columns_to_keep]
df = df.set_index(['STNAME', 'CTYNAME']) #multi-level index
loc to query the dataframe
df.loc['Michigan', 'Washtenaw County']
df.loc[ [('Michigan', 'Washtenaw County'),
('Michigan', 'Wayne County')] ]
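The two loc query styles above can be sketched on a minimal multi-level index (the population values are made up):

```python
import pandas as pd

# Hypothetical county rows; POP values are invented for illustration
df = pd.DataFrame({'STNAME': ['Michigan', 'Michigan', 'Ohio'],
                   'CTYNAME': ['Washtenaw County', 'Wayne County', 'Franklin County'],
                   'POP': [368, 1749, 1323]})
df = df.set_index(['STNAME', 'CTYNAME'])  # multi-level index

# Single row: pass one value per index level as a tuple
df.loc[('Michigan', 'Washtenaw County'), 'POP']  # 368

# Multiple rows: pass a list of tuples
rows = df.loc[[('Michigan', 'Washtenaw County'), ('Michigan', 'Wayne County')]]
assert len(rows) == 2
```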
e.g. the None type or NumPy NaN values
values can be missing for a few reasons: Missing at Random (MAR), Missing Completely at Random (MCAR)
df = pd.read_csv('class_grades.csv')
mask = df.isnull() #create a boolean mask, broadcast isnull() function to every cell of df.
df = df.dropna() #another option: drop rows containing missing values
or fill missing values with 0, use fillna()
df.fillna(0, inplace=True) #modify the dataframe in place; do not also assign, since inplace=True returns None
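The mask/drop/fill trio can be sketched on a toy frame (the grade values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'grade': [90.0, np.nan, 85.0]})

mask = df.isnull()                 # boolean mask, True where values are missing
assert mask['grade'].tolist() == [False, True, False]

dropped = df.dropna()              # drop rows containing missing values
assert len(dropped) == 2

filled = df.fillna(0)              # or replace missing values with 0
assert filled['grade'].tolist() == [90.0, 0.0, 85.0]
```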
sometimes it is more meaningful to fill the NA not with 0 but with the forward or backward value: method='ffill' or 'bfill'
df = df.set_index(['time', 'user'])
df = df.sort_index()
df = df.fillna(method='ffill')
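A minimal forward-fill sketch on made-up readings; newer pandas spells this s.ffill(), while fillna(method='ffill') is the older form:

```python
import numpy as np
import pandas as pd

# Toy sensor log: gaps are carried forward from the last seen reading
s = pd.Series([10.0, np.nan, np.nan, 7.0, np.nan])
filled = s.ffill()

assert filled.tolist() == [10.0, 10.0, 10.0, 7.0, 7.0]
```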
customize fill-in values with the replace() function, which supports several approaches: value-to-value, list, dictionary, regex
df.replace([1, 3], [100, 300])
df = pd.read_csv('log.csv')
df.replace(to_replace=r'.*\.html$', value='webpage', regex=True)
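Both replace() styles can be sketched on a toy frame (column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3],
                   'page': ['index.html', 'about.html', 'img.png']})

# value-to-value via parallel lists: 1 -> 100, 3 -> 300
df2 = df.replace([1, 3], [100, 300])
assert df2['value'].tolist() == [100, 2, 300]

# regex: anything ending in .html becomes 'webpage'
df3 = df.replace(to_replace=r'.*\.html$', value='webpage', regex=True)
assert df3['page'].tolist() == ['webpage', 'webpage', 'img.png']
```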
replace()
df = pd.read_csv('presidents.csv')
#to clean up 'President' into firstname and lastname. create two new columns and apply regex
df['First'] = df['President'] #add a new column. use del(df['First']) to delete
df['First'] = df['First'].replace('[ ].*', '', regex=True)
#it worked, but is gross and slow. there are a few other ways, such as apply()
apply()
#apply take any function you have written, so we define one:
def splitname(row):
    #row is a Series object: a single row indexed by the column names
    row['First'] = row['President'].split(" ")[0]
    row['Last'] = row['President'].split(" ")[-1]
    return row
df = df.apply(splitname, axis='columns')
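The same splitname applied end-to-end on a toy frame (the names are made up):

```python
import pandas as pd

# Hypothetical presidents frame with a 'President' column of full names
df = pd.DataFrame({'President': ['George Washington', 'John Adams']})

def splitname(row):
    # row is a Series indexed by the column names
    row['First'] = row['President'].split(" ")[0]
    row['Last'] = row['President'].split(" ")[-1]
    return row

df = df.apply(splitname, axis='columns')

assert df['First'].tolist() == ['George', 'John']
assert df['Last'].tolist() == ['Washington', 'Adams']
```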
extract()
pattern = r"(^[\w]*)(?:.* )([\w]*$)"
df['President'].str.extract(pattern)
but the column names are 0 and 1, so name the groups in the pattern
pattern = r"(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)"
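A sketch of the named-group pattern on made-up names, showing that the groups become the column names:

```python
import pandas as pd

s = pd.Series(['George Washington', 'John Quincy Adams'])
pattern = r"(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)"

out = s.str.extract(pattern)  # columns are now 'First' and 'Last'

assert out.loc[1, 'First'] == 'John'
assert out.loc[1, 'Last'] == 'Adams'
```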
now to clean up the 'Born' column
df['Born'] = df['Born'].str.extract(r'([\w]{3} [\w]{1,2}, [\w]{4})')
0 Feb 22, 1732
1 Oct 30, 1735
use pandas' date/time features to deal with dates such as financial transactions; this makes subsequent processing much easier
pd.to_datetime(df['Born'])
0 1732-02-22
1 1735-10-30
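The conversion above, sketched on the two sample dates:

```python
import pandas as pd

born = pd.Series(['Feb 22, 1732', 'Oct 30, 1735'])
dates = pd.to_datetime(born)  # parses the strings into Timestamps

assert str(dates[0].date()) == '1732-02-22'
assert str(dates[1].date()) == '1735-10-30'
```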
The .str accessor has a number of important functions for cleaning pandas DataFrames, though you don't have to use them. For now, the str functions are incredibly useful: they build on your existing knowledge of regular expressions, and because they are vectorized they are efficient to use as well.
compute the proportion of children whose mother's education level ('EDUC1') equals 1, 2, 3, or 4
educ1_list = df.groupby("EDUC1")['SEQNUMC'].count()
educ1_list = educ1_list.rename({1:'less than high school',2:'high school',3:'more than high school but not college',4:'college'})
educ1_list = educ1_list / educ1_list.sum()
educ1_dict = educ1_list.to_dict()
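The same proportion computation on a hypothetical toy frame (the column names come from the survey, but the values here are made up):

```python
import pandas as pd

# Made-up survey rows: EDUC1 codes 1-4, SEQNUMC as a row id
df = pd.DataFrame({'SEQNUMC': range(10),
                   'EDUC1': [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]})

props = df.groupby('EDUC1')['SEQNUMC'].count()
props = props / props.sum()  # counts -> proportions
props = props.rename({1: 'less than high school', 2: 'high school',
                      3: 'more than high school but not college', 4: 'college'})

assert props['college'] == 0.4
assert abs(props.sum() - 1.0) < 1e-9
```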
vclist = df.groupby(['CBF_01'])['P_NUMFLU'].agg(np.mean)
vctup = (vclist[1],vclist[2])
Calculate the ratio of the number of children who contracted chickenpox but were vaccinated against it versus those who were vaccinated but did not contract chickenpox
vac_df = df[df['P_NUMVRC'] >=1].groupby(['SEX','HAD_CPOX'])['SEQNUMC'].count()
vac_dic = {'male': vac_df[1][2] / (vac_df[1][1] + vac_df[1][2]),
           'female': vac_df[2][2] / (vac_df[2][1] + vac_df[2][2])}
#not elegant
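An arguably cleaner sketch of the same computation using unstack(), on made-up rows; the codes SEX 1=male/2=female and HAD_CPOX 1=yes/2=no are assumptions:

```python
import pandas as pd

# Hypothetical vaccinated-children rows (values invented for illustration)
df = pd.DataFrame({'SEX':      [1, 1, 1, 2, 2, 2, 2],
                   'HAD_CPOX': [1, 2, 2, 1, 1, 2, 2],
                   'P_NUMVRC': [1, 2, 1, 1, 1, 3, 1],
                   'SEQNUMC':  range(7)})

counts = df[df['P_NUMVRC'] >= 1].groupby(['SEX', 'HAD_CPOX'])['SEQNUMC'].count()
table = counts.unstack()  # HAD_CPOX becomes columns, so the ratio reads cleanly

ratio = {'male':   table.loc[1, 2] / (table.loc[1, 1] + table.loc[1, 2]),
         'female': table.loc[2, 2] / (table.loc[2, 1] + table.loc[2, 2])}
```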