import pandas as pd
students_classes = {'Alice': 'Physics',
                    'Jack': 'Chemistry',
                    'Molly': 'English',
                    'Sam': 'History'}
s = pd.Series(students_classes)
s
Alice Physics
Jack Chemistry
Molly English
Sam History
dtype: object
s.iloc[3] #'History'; positional indexing begins at 0
s.loc['Molly'] #'English'
import numpy as np
numbers = pd.Series(np.random.randint(0,1000,10000))
%%timeit -n 100
total = 0
for number in numbers:
    total += number
total / len(numbers)
4.12 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)
214 µs ± 80.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
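A quick sanity sketch (toy values in place of the 10,000 random integers) showing that the loop and the vectorized path compute the same mean:

```python
import numpy as np
import pandas as pd

# Toy Series standing in for the random integers above
numbers = pd.Series([3, 1, 4, 1, 5, 9, 2, 6])

# Loop-based mean (the slow path)
total = 0
for number in numbers:
    total += number
loop_mean = total / len(numbers)

# Vectorized mean (the fast path)
vec_mean = np.sum(numbers) / len(numbers)

assert loop_mean == vec_mean  # both give 3.875
```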
2.1 read_csv
import pandas as pd
df = pd.read_csv('datasets/Admission_Predict.csv',index_col=0)
## important note: index_col=0 uses column 0 as the index, so pandas does not add its own.
2.2 check the column names
First, make sure the column names are correct:
new_df.columns
2.3 strip and rename
new_df = df.rename(mapper=str.strip, axis='columns').rename(columns = {'SOP':'Statement of Purpose', 'LOR': 'Letter of Recommendation'})
the first rename cleans the format by stripping whitespace from the column names; the second changes the names
note that rename does not modify the original df's column names unless you assign the result back: df = df.rename(…)
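The two-step rename can be sketched on a toy frame (the whitespace in the column names here is made up to mimic the CSV):

```python
import pandas as pd

# Hypothetical frame with trailing spaces in column names,
# mimicking the Admission_Predict.csv issue
df = pd.DataFrame({'GRE Score': [320], 'SOP ': [4.5], 'LOR ': [4.0]})

new_df = (df.rename(mapper=str.strip, axis='columns')
            .rename(columns={'SOP': 'Statement of Purpose',
                             'LOR': 'Letter of Recommendation'}))

list(new_df.columns)
# ['GRE Score', 'Statement of Purpose', 'Letter of Recommendation']
```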
df = pd.read_csv('Admission_Predict.csv', index_col=0)
df = df.set_index('Chance of Admit ') #note the trailing space in the raw column name
df = df.reset_index()
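A round-trip sketch on a toy frame (column names assumed from the dataset above); note that set_index discards the current index, so reset_index first if you want to keep it:

```python
import pandas as pd

df = pd.DataFrame({'Serial No.': [1, 2],
                   'Chance of Admit': [0.92, 0.76]}).set_index('Serial No.')

indexed = df.set_index('Chance of Admit')  # promotes the column; the old index is discarded
restored = indexed.reset_index()           # demotes the index back to a regular column

assert list(indexed.index) == [0.92, 0.76]
assert list(restored.columns) == ['Chance of Admit']
```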
a multi-level index is often found when dealing with geographical data sorted by region
df = pd.read_csv('census.csv')
df['SUMLEV'].unique()
df = df[ df['SUMLEV'] == 50 ]
columns_to_keep = ['STNAME', 'CTYNAME', ...]
df = df[columns_to_keep]
df = df.set_index(['STNAME', 'CTYNAME']) #multi-level index
loc to query the dataframe
df.loc['Michigan', 'Washtenaw County']
df.loc[ [('Michigan', 'Washtenaw County'),
('Michigan', 'Wayne County')] ]
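The two loc query styles above can be sketched on a minimal multi-level index (the population values are made up):

```python
import pandas as pd

# Hypothetical county rows; POP values are invented for illustration
df = pd.DataFrame({'STNAME': ['Michigan', 'Michigan', 'Ohio'],
                   'CTYNAME': ['Washtenaw County', 'Wayne County', 'Franklin County'],
                   'POP': [368, 1749, 1323]})
df = df.set_index(['STNAME', 'CTYNAME'])  # multi-level index

# Single row: pass one value per index level as a tuple
df.loc[('Michigan', 'Washtenaw County'), 'POP']  # 368

# Multiple rows: pass a list of tuples
rows = df.loc[[('Michigan', 'Washtenaw County'), ('Michigan', 'Wayne County')]]
assert len(rows) == 2
```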
e.g. the None type or NumPy NaN values
values can be missing for a few reasons: Missing at Random (MAR), Missing Completely at Random (MCAR)
df = pd.read_csv('class_grades.csv')
mask = df.isnull() #create a boolean mask, broadcast isnull() function to every cell of df.
df = df.dropna() #another option: drop rows containing missing values
or fill missing values with 0, use fillna()
df.fillna(0, inplace=True) #modify the dataframe in place; do not also assign, since inplace=True returns None
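The mask/drop/fill trio can be sketched on a toy frame (the grade values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'grade': [90.0, np.nan, 85.0]})

mask = df.isnull()                 # boolean mask, True where values are missing
assert mask['grade'].tolist() == [False, True, False]

dropped = df.dropna()              # drop rows containing missing values
assert len(dropped) == 2

filled = df.fillna(0)              # or replace missing values with 0
assert filled['grade'].tolist() == [90.0, 0.0, 85.0]
```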
sometimes it is more meaningful to fill the NA not with 0 but with the forward or backward value: method='ffill' or 'bfill'
df = df.set_index(['time', 'user'])
df = df.sort_index()
df = df.fillna(method='ffill')
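A minimal forward-fill sketch on made-up readings; newer pandas spells this s.ffill(), while fillna(method='ffill') is the older form:

```python
import numpy as np
import pandas as pd

# Toy sensor log: gaps are carried forward from the last seen reading
s = pd.Series([10.0, np.nan, np.nan, 7.0, np.nan])
filled = s.ffill()

assert filled.tolist() == [10.0, 10.0, 10.0, 7.0, 7.0]
```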
customize fill-in values with the replace() function, which supports several approaches: value-to-value, list, dictionary, regex
df.replace([1, 3], [100, 300])
df = pd.read_csv('log.csv')
df.replace(to_replace=r'.*\.html$', value='webpage', regex=True)
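Both replace() styles can be sketched on a toy frame (column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3],
                   'page': ['index.html', 'about.html', 'img.png']})

# value-to-value via parallel lists: 1 -> 100, 3 -> 300
df2 = df.replace([1, 3], [100, 300])
assert df2['value'].tolist() == [100, 2, 300]

# regex: anything ending in .html becomes 'webpage'
df3 = df.replace(to_replace=r'.*\.html$', value='webpage', regex=True)
assert df3['page'].tolist() == ['webpage', 'webpage', 'img.png']
```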
replace()
df = pd.read_csv('presidents.csv')
#to clean up 'President' into firstname and lastname. create two new columns and apply regex
df['First'] = df['President'] #add a new column. use del(df['First']) to delete
df['First'] = df['First'].replace('[ ].*', '', regex=True)
#it worked, but is gross and slow. there are a few other ways, such as apply()
apply()
#apply take any function you have written, so we define one:
def splitname(row):
    #row is a Series object: a single row indexed by the column names
    row['First'] = row['President'].split(" ")[0]
    row['Last'] = row['President'].split(" ")[-1]
    return row
df = df.apply(splitname, axis='columns')
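The same splitname applied end-to-end on a toy frame (the names are made up):

```python
import pandas as pd

# Hypothetical presidents frame with a 'President' column of full names
df = pd.DataFrame({'President': ['George Washington', 'John Adams']})

def splitname(row):
    # row is a Series indexed by the column names
    row['First'] = row['President'].split(" ")[0]
    row['Last'] = row['President'].split(" ")[-1]
    return row

df = df.apply(splitname, axis='columns')

assert df['First'].tolist() == ['George', 'John']
assert df['Last'].tolist() == ['Washington', 'Adams']
```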
extract()
pattern = r"(^[\w]*)(?:.* )([\w]*$)"
df['President'].str.extract(pattern)
but the column names are 0 and 1, so name the groups in the pattern
pattern = r"(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)"
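A sketch of the named-group pattern on made-up names, showing that the groups become the column names:

```python
import pandas as pd

s = pd.Series(['George Washington', 'John Quincy Adams'])
pattern = r"(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)"

out = s.str.extract(pattern)  # columns are now 'First' and 'Last'

assert out.loc[1, 'First'] == 'John'
assert out.loc[1, 'Last'] == 'Adams'
```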
now to clean up the 'Born' column
df['Born'] = df['Born'].str.extract(r'([\w]{3} [\w]{1,2}, [\w]{4})')
0 Feb 22, 1732
1 Oct 30, 1735
use pandas' date/time features to deal with dates such as financial transactions; this makes subsequent processing much easier
pd.to_datetime(df['Born'])
0 1732-02-22
1 1735-10-30
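The conversion above, sketched on the two sample dates:

```python
import pandas as pd

born = pd.Series(['Feb 22, 1732', 'Oct 30, 1735'])
dates = pd.to_datetime(born)  # parses the strings into Timestamps

assert str(dates[0].date()) == '1732-02-22'
assert str(dates[1].date()) == '1735-10-30'
```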
The .str accessor has a number of important functions for cleaning pandas DataFrames, though you don't have to use them. For now, the str functions are incredibly useful: they build on your existing knowledge of regular expressions, and because they are vectorized they are efficient to use as well.
compute the proportion of children whose mother's education level ('EDUC1') equals 1, 2, 3, or 4
educ1_list = df.groupby("EDUC1")['SEQNUMC'].count()
educ1_list = educ1_list.rename({1:'less than high school',2:'high school',3:'more than high school but not college',4:'college'})
educ1_list = educ1_list / educ1_list.sum()
educ1_dict = educ1_list.to_dict()
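The same proportion computation on a hypothetical toy frame (the column names come from the survey, but the values here are made up):

```python
import pandas as pd

# Made-up survey rows: EDUC1 codes 1-4, SEQNUMC as a row id
df = pd.DataFrame({'SEQNUMC': range(10),
                   'EDUC1': [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]})

props = df.groupby('EDUC1')['SEQNUMC'].count()
props = props / props.sum()  # counts -> proportions
props = props.rename({1: 'less than high school', 2: 'high school',
                      3: 'more than high school but not college', 4: 'college'})

assert props['college'] == 0.4
assert abs(props.sum() - 1.0) < 1e-9
```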
vclist = df.groupby(['CBF_01'])['P_NUMFLU'].agg(np.mean)
vctup = (vclist[1],vclist[2])
Calculate the ratio of the number of children who contracted chickenpox but were vaccinated against it versus those who were vaccinated but did not contract chickenpox
vac_df = df[df['P_NUMVRC'] >=1].groupby(['SEX','HAD_CPOX'])['SEQNUMC'].count()
vac_dic = {'male': vac_df[1][2] / (vac_df[1][1] + vac_df[1][2]),
           'female': vac_df[2][2] / (vac_df[2][1] + vac_df[2][2])}
#not elegant
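An arguably cleaner sketch of the same computation using unstack(), on made-up rows; the codes SEX 1=male/2=female and HAD_CPOX 1=yes/2=no are assumptions:

```python
import pandas as pd

# Hypothetical vaccinated-children rows (values invented for illustration)
df = pd.DataFrame({'SEX':      [1, 1, 1, 2, 2, 2, 2],
                   'HAD_CPOX': [1, 2, 2, 1, 1, 2, 2],
                   'P_NUMVRC': [1, 2, 1, 1, 1, 3, 1],
                   'SEQNUMC':  range(7)})

counts = df[df['P_NUMVRC'] >= 1].groupby(['SEX', 'HAD_CPOX'])['SEQNUMC'].count()
table = counts.unstack()  # HAD_CPOX becomes columns, so the ratio reads cleanly

ratio = {'male':   table.loc[1, 2] / (table.loc[1, 1] + table.loc[1, 2]),
         'female': table.loc[2, 2] / (table.loc[2, 1] + table.loc[2, 2])}
```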