Exploratory-Data-Analysis-in-Python

诸超
2023-12-01

1. Read, clean, and validate

1.1 DataFrames and Series

1.2 Read the codebook

1.3 Exploring the NSFG data

To get the number of rows and columns in a DataFrame, you can read its shape attribute.

To get the column names, you can read the columns attribute. The result is an Index, which is a Pandas data structure that is similar to a list. Let’s begin exploring the NSFG data! It has been pre-loaded for you into a DataFrame called nsfg.

Instruction

  • Calculate the number of rows and columns in the DataFrame nsfg.
  • Display the names of the columns in nsfg.
  • Select the column 'birthwgt_oz1' and assign it to a new variable called ounces
  • Display the first 5 elements of ounces
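
The four steps above can be sketched as follows. The small DataFrame here is a made-up stand-in for the pre-loaded nsfg (the 'caseid' column and all values are assumptions for illustration), so only the attribute and method calls carry over to the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the pre-loaded NSFG DataFrame (made-up values)
nsfg = pd.DataFrame({
    'caseid': [60418, 60418, 60419, 60420, 60420, 60423],
    'birthwgt_lb1': [5.0, 4.0, 5.0, np.nan, 8.0, 7.0],
    'birthwgt_oz1': [4.0, 12.0, 4.0, np.nan, 13.0, 0.0],
})

# Number of rows and columns, as a tuple
print(nsfg.shape)

# Column names, as an Index
print(nsfg.columns)

# Select one column (a Series) and display its first 5 elements
ounces = nsfg['birthwgt_oz1']
print(ounces.head())
```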

1.4 Clean and Validate

1.5 Validate a variable

In the NSFG dataset, the variable 'outcome' encodes the outcome of each pregnancy as shown below:

value  label
1      Live birth
2      Induced abortion
3      Stillbirth
4      Miscarriage
5      Ectopic pregnancy
6      Current pregnancy

How many pregnancies in this dataset ended with a live birth?

■ 6489

□ 9538

□ 1469

□ 6

1.6 Clean a variable

In the NSFG dataset, the variable 'nbrnaliv' records the number of babies born alive at the end of a pregnancy.

If you use .value_counts() to view the responses, you’ll see that the value 8 appears once, and if you consult the codebook, you’ll see that this value indicates that the respondent refused to answer the question.

Your job in this exercise is to replace this value with np.nan. Recall from the video how Allen replaced the values 98 and 99 in the ounces column using the .replace() method:

ounces.replace([98, 99], np.nan, inplace=True)

Instruction

  • In the 'nbrnaliv' column, replace the value 8, in place, with the special value NaN.
  • Confirm that the value 8 no longer appears in this column by printing the values and their frequencies.
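
A minimal sketch of this cleaning step, using a made-up 'nbrnaliv' column in place of the real one. It assigns the result of .replace() back to the column instead of passing inplace=True; that is a substitution on my part, chosen because the assignment form should behave the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for nsfg; 8 is the "refused" code
nsfg = pd.DataFrame({'nbrnaliv': [1.0, 1.0, 2.0, 1.0, 8.0, 3.0, 1.0]})

# Replace the special code 8 with NaN
nsfg['nbrnaliv'] = nsfg['nbrnaliv'].replace(8, np.nan)

# Print the values and their frequencies (NaN is excluded by default)
print(nsfg['nbrnaliv'].value_counts())
```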

1.7 Compute a variable

For each pregnancy in the NSFG dataset, the variable 'agecon' encodes the respondent’s age at conception, and 'agepreg' encodes the respondent’s age at the end of the pregnancy.

Both variables are recorded as integers with two implicit decimal places, so the value 2575 means that the respondent’s age was 25.75.

Instruction 1
Select 'agecon' and 'agepreg', divide them by 100, and assign them to the local variables agecon and agepreg.

Instruction 2
Compute the difference, which is an estimate of the duration of the pregnancy. Keep in mind that for each pregnancy, agepreg will be larger than agecon.

Instruction 3
Use .describe() to compute the mean duration and other summary statistics.
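
The three instructions can be sketched together like this. The four rows are invented for illustration; the name preg_length is also an assumption, since the text does not name the difference.

```python
import pandas as pd

# Hypothetical stand-in for nsfg: ages stored as integers
# with two implicit decimal places (2575 means 25.75 years)
nsfg = pd.DataFrame({
    'agecon': [2575, 3100, 1983, 2241],
    'agepreg': [2650, 3175, 2058, 2291],
})

# Instruction 1: convert to years
agecon = nsfg['agecon'] / 100
agepreg = nsfg['agepreg'] / 100

# Instruction 2: estimated duration of each pregnancy, in years
preg_length = agepreg - agecon

# Instruction 3: mean duration and other summary statistics
print(preg_length.describe())
```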

1.8 Filter and visualize

1.9 Make a histogram

Histograms are one of the most useful tools in exploratory data analysis. They quickly give you an overview of the distribution of a variable, that is, what values the variable can have, and how many times each value appears.

As we saw in a previous exercise, the NSFG dataset includes a variable 'agecon' that records age at conception for each pregnancy. Here, you’re going to plot a histogram of this variable. You’ll use the bins parameter that you saw in the video, and also a new parameter, histtype, which you can read more about in the matplotlib documentation. Learning how to read documentation is an essential skill. If you want to learn more about matplotlib, you can check out DataCamp’s Introduction to Matplotlib course.

Instruction 1
Plot a histogram of agecon with 20 bins.

Instruction 2
Adapt your code to make an unfilled histogram by setting the parameter histtype to be 'step'.

1.10 Compute birth weight

Now let’s pull together the steps in this chapter to compute the average birth weight for full-term babies.

I’ve provided a function, resample_rows_weighted, that takes the NSFG data and resamples it using the sampling weights in wgt2013_2015. The result is a sample that is representative of the U.S. population.

Then I extract birthwgt_lb1 and birthwgt_oz1, replace special codes with NaN, and compute total birth weight in pounds, birth_weight.

# Resample the data
nsfg = resample_rows_weighted(nsfg, 'wgt2013_2015')

# Clean the weight variables
pounds = nsfg['birthwgt_lb1'].replace([98, 99], np.nan)
ounces = nsfg['birthwgt_oz1'].replace([98, 99], np.nan)

# Compute total birth weight
birth_weight = pounds + ounces/16

Instruction

  • Make a Boolean Series called full_term that is True for babies with 'prglngth' greater than or equal to 37 weeks.
  • Use full_term and birth_weight to select birth weight in pounds for full-term babies. Store the result in full_term_weight.
  • Compute the mean weight of full-term babies.
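
A minimal sketch of the filtering step, on made-up rows. The resampling with resample_rows_weighted is left out because that function is provided by the course and is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for nsfg; 98 and 99 are special codes
nsfg = pd.DataFrame({
    'prglngth': [39, 40, 35, 37, 30, 41],
    'birthwgt_lb1': [7, 8, 5, 6, 3, 99],
    'birthwgt_oz1': [8, 0, 4, 12, 2, 98],
})

# Clean the weight variables and compute total birth weight in pounds
pounds = nsfg['birthwgt_lb1'].replace([98, 99], np.nan)
ounces = nsfg['birthwgt_oz1'].replace([98, 99], np.nan)
birth_weight = pounds + ounces / 16

# Boolean Series: True for pregnancies of 37 weeks or more
full_term = nsfg['prglngth'] >= 37

# Select the weights of full-term babies
full_term_weight = birth_weight[full_term]

# Mean weight; NaN values are skipped by default
print(full_term_weight.mean())
```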

1.11 Filter

In the previous exercise, you computed the mean birth weight for full-term babies; you filtered out preterm babies because their distribution of weight is different.

The distribution of weight is also different for multiple births, like twins and triplets. In this exercise, you’ll filter them out, too, and see what effect it has on the mean.

Instruction

  • Use the variable 'nbrnaliv' to make a Boolean Series that is True for single births (where 'nbrnaliv' equals 1) and False otherwise.
  • Use Boolean Series and logical operators to select single, full-term babies and compute their mean birth weight.
  • For comparison, select multiple, full-term babies and compute their mean birth weight.
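
Combining two Boolean Series might look like the sketch below. The data are invented, and the result names single_full_term_weight and mult_full_term_weight are assumptions. The operators & (and) and ~ (not) work element by element.

```python
import pandas as pd

# Hypothetical stand-in data
nsfg = pd.DataFrame({
    'prglngth': [39, 40, 35, 37, 38, 36],
    'nbrnaliv': [1, 1, 1, 2, 2, 1],
})
birth_weight = pd.Series([7.5, 8.0, 5.25, 5.5, 6.0, 6.5])

# Filter full-term babies
full_term = nsfg['prglngth'] >= 37

# Filter single births
single = nsfg['nbrnaliv'] == 1

# Mean weight of single, full-term babies
single_full_term_weight = birth_weight[single & full_term]
print('Single full-term mean:', single_full_term_weight.mean())

# Mean weight of multiple, full-term babies
mult_full_term_weight = birth_weight[~single & full_term]
print('Multiple full-term mean:', mult_full_term_weight.mean())
```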

2. Distributions

2.1 Probability mass functions

2.2 Make a PMF

The GSS dataset has been pre-loaded for you into a DataFrame called gss. You can explore it in the IPython Shell to get familiar with it.

In this exercise, you’ll focus on one variable in this dataset, 'year', which represents the year each respondent was interviewed.

The Pmf class you saw in the video has already been created for you. You can access it outside of DataCamp via the empiricaldist library.

Instruction 1
Make a PMF for year with normalize=False and display the result.
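
To see what an unnormalized PMF contains, here is a sketch that uses pandas value_counts() as a stand-in for the course's Pmf class (a substitution on my part, with made-up data). With normalize=False the values are counts; with normalize=True they are fractions that sum to 1.

```python
import pandas as pd

# Hypothetical stand-in for gss
gss = pd.DataFrame({'year': [1972, 1972, 1990, 2016, 2016, 2016]})

# Unnormalized: how many respondents were interviewed each year
pmf_counts = gss['year'].value_counts().sort_index()
print(pmf_counts)

# Normalized: fraction of respondents interviewed each year
pmf_probs = gss['year'].value_counts(normalize=True).sort_index()

# Look up a single year
print(pmf_counts[2016])
```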

Instruction 2
How many respondents were interviewed in 2016?

■ 2867

□ 1613

□ 2538

□ 0.045897

2.3 Plot a PMF

Now let’s plot a PMF for the age of the respondents in the GSS dataset. The variable 'age' contains respondents’ age in years.

Instruction 1
Select the 'age' column from the gss DataFrame and store the result in age

Instruction 2
Make a normalized PMF of age. Store the result in pmf_age

Instruction 3
Plot pmf_age as a bar chart

2.4 Cumulative distribution functions

2.5 Make a CDF

In this exercise, you’ll make a CDF and use it to determine the fraction of respondents in the GSS dataset who are OLDER than 30.

The GSS dataset has been preloaded for you into a DataFrame called gss.

As with the Pmf class from the previous lesson, the Cdf class you just saw in the video has been created for you, and you can access it outside of DataCamp via the empiricaldist library.

Instruction 1
Select the 'age' column. Store the result in age.

Instruction 2
Compute the CDF of age. Store the result in cdf_age.

Instruction 3
Calculate the CDF of 30.

Instruction 4
What fraction of the respondents in the GSS dataset are OLDER than 30?

■ Approximately 75%

□ Approximately 65%

□ Approximately 45%

□ Approximately 25%
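
The reasoning behind this exercise is that the CDF evaluated at 30 is the fraction of respondents aged 30 or younger, so the fraction older than 30 is one minus that value. The sketch below shows this with a hand-written helper, ecdf (a hypothetical function, not the course's Cdf class), on twelve made-up ages.

```python
import numpy as np

# Hypothetical ages; not the GSS data
age = np.array([18, 22, 25, 30, 30, 34, 41, 47, 55, 63, 70, 89])

def ecdf(values, x):
    """Fraction of non-missing values less than or equal to x."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    return np.mean(values <= x)

# Fraction aged 30 or younger
p_30_or_younger = ecdf(age, 30)

# Fraction older than 30 is the complement
p_older = 1 - p_30_or_younger
print(p_30_or_younger, p_older)
```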

2.6 Compute IQR

Recall from the video that the interquartile range (IQR) is the difference between the 75th and 25th percentiles. It is a measure of variability that is robust in the presence of errors or extreme values.

In this exercise, you’ll compute the interquartile range of income in the GSS dataset. Income is stored in the 'realinc' column, and the CDF of income has already been computed and stored in cdf_income.

Instruction 1
Calculate the 75th percentile of income and store it in percentile_75th.

Instruction 2
Calculate the 25th percentile of income and store it in percentile_25th.

Instruction 3
Calculate the interquartile range of income. Store the result in iqr.

Instruction 4
What is the interquartile range (IQR) of income in the GSS dataset?

■ Approximately 29676

□ Approximately 26015

□ Approximately 34702

□ Approximately 30655
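
The arithmetic of the IQR can be sketched with NumPy. Here np.percentile is an assumed stand-in for looking up percentiles in cdf_income; it interpolates between data points, so on real data it may differ slightly from a value read off an empirical CDF. The incomes are made up.

```python
import numpy as np

# Hypothetical incomes; not the GSS data
income = np.array([10000, 20000, 30000, 40000, 50000], dtype=float)

# 75th and 25th percentiles
percentile_75th = np.percentile(income, 75)
percentile_25th = np.percentile(income, 25)

# Interquartile range: the spread of the middle half of the data
iqr = percentile_75th - percentile_25th
print(iqr)
```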

2.7 Plot a CDF

The distribution of income in almost every country is long-tailed; that is, there are a small number of people with very high incomes.

In the GSS dataset, the variable 'realinc' represents total household income, converted to 1986 dollars. We can get a sense of the shape of this distribution by plotting the CDF.

Instruction

  • Select 'realinc' from the gss dataset.
  • Make a Cdf object called cdf_income.
  • Create a plot of cdf_income using .plot().

2.8 Comparing distributions

2.9 Distribution of education

Let’s begin comparing incomes for different levels of education in the GSS dataset, which has been pre-loaded for you into a DataFrame called gss. The variable educ represents the respondent’s years of education.

What fraction of respondents report that they have 12 years of education or fewer?

□ Approximately 22%

□ Approximately 31%

□ Approximately 47%

■ Approximately 53%

2.10 Extract education levels

Let’s create Boolean Series to identify respondents with different levels of education.

In the U.S., 12 years of education usually means the respondent has completed high school (secondary education). A respondent with 14 years of education has probably completed an associate degree (two years of college); someone with 16 years has probably completed a bachelor’s degree (four years of college).

Instruction

  • Complete the line that identifies respondents with associate degrees, that is, people with 14 or more years of education but less than 16.
  • Complete the line that identifies respondents with 12 or fewer years of education.
  • Confirm that the mean of high is the fraction we computed in the previous exercise, about 53%.
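
A sketch of the three Boolean Series on made-up data. The names bach, assc, and high follow the next exercise; the definition of bach as 16 or more years is my assumption. The mean of a Boolean Series is the fraction of True values, which is why high.mean() gives the fraction with 12 or fewer years.

```python
import pandas as pd

# Hypothetical stand-in for gss
gss = pd.DataFrame({'educ': [8, 12, 12, 14, 15, 16, 18, 11, 12, 13]})
educ = gss['educ']

# Bachelor's degree (assumed here to mean 16 or more years)
bach = (educ >= 16)

# Associate degree: 14 or more years but less than 16
assc = (educ >= 14) & (educ < 16)

# High school: 12 or fewer years
high = (educ <= 12)

# Fraction of respondents with 12 or fewer years of education
print(high.mean())
```

Note the parentheses around each comparison in assc: & binds more tightly than >= and <, so they are required.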

2.11 Plot income CDFs

Let’s now see what the distribution of income looks like for people with different education levels. You can do this by plotting the CDFs. Recall how Allen plotted the income CDFs of respondents interviewed before and after 1995:

Cdf(income[pre95]).plot(label='Before 1995')
Cdf(income[~pre95]).plot(label='After 1995')

You can assume that Boolean Series have been defined, as in the previous exercise, to identify respondents with different education levels: high, assc, and bach.

Instruction
Fill in the missing lines of code to plot the CDFs.

2.12 Modeling distributions

2.13 Distribution of income

In many datasets, the distribution of income is approximately lognormal, which means that the logarithms of the incomes fit a normal distribution. We’ll see whether that’s true for the GSS data. As a first step, you’ll compute the mean and standard deviation of the log of incomes using NumPy’s np.log10() function.

Then, you’ll use the computed mean and standard deviation to make a norm object using the scipy.stats.norm() function.

Instruction

  • Extract 'realinc' from gss and compute its logarithm using np.log10().
  • Compute the mean and standard deviation of the result.
  • Make a norm object by passing the computed mean and standard deviation to norm().
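
These steps can be sketched on synthetic data. The incomes below are drawn from a lognormal distribution with arbitrary parameters (an assumption for illustration, not the GSS values), so the fitted normal model should describe their logarithms well by construction.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Hypothetical stand-in for gss with synthetic lognormal incomes
rng = np.random.default_rng(0)
gss = pd.DataFrame({'realinc': rng.lognormal(mean=10, sigma=0.8, size=1000)})

# Extract realinc and compute its base-10 logarithm
log_income = np.log10(gss['realinc'])

# Compute mean and standard deviation
mean, std = log_income.mean(), log_income.std()

# Make a norm object with the same mean and standard deviation
dist = norm(mean, std)

# The model's CDF at its own mean is 0.5
print(dist.cdf(mean))
```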

2.14 Comparing CDFs

To see whether the distribution of income is well modeled by a lognormal distribution, we’ll compare the CDF of the logarithm of the data to a normal distribution with the same mean and standard deviation. These variables from the previous exercise are available for use:

# Extract realinc and compute its log
log_income = np.log10(gss['realinc'])

# Compute mean and standard deviation
mean, std = log_income.mean(), log_income.std()

# Make a norm object
from scipy.stats import norm
dist = norm(mean, std)

dist is a scipy.stats.norm object with the same mean and standard deviation as the data. It provides .cdf(), which evaluates the normal cumulative distribution function.
Be careful with capitalization: Cdf(), with an uppercase C, creates Cdf objects. dist.cdf(), with a lowercase c, evaluates the normal cumulative distribution function.

2.15 Comparing PDFs

In the previous exercise, we used CDFs to see if the distribution of income is lognormal. We can make the same comparison using a PDF and KDE. That’s what you’ll do in this exercise!

As before, the norm object dist is available in your workspace:

from scipy.stats import norm
dist = norm(mean, std)

Just as all norm objects have a .cdf() method, they also have a .pdf() method.

To create a KDE plot, you can use Seaborn’s kdeplot() function.

Instruction

  • Evaluate the normal PDF using dist, which is a norm object with the same mean and standard deviation as the data.
  • Make a KDE plot of the logarithms of the incomes, using log_income, which is a Series object.

3. Relationships

3.1 Exploring relationships

3.2 PMF of age

Do people tend to gain weight as they get older? We can answer this question by visualizing the relationship between weight and age. But before we make a scatter plot, it is a good idea to visualize distributions one variable at a time. Here, you’ll visualize age using a bar chart first. Recall that all PMF objects have a .bar() method to make a bar chart.

The BRFSS dataset includes a variable, 'AGE' (note the capitalization!), which represents each respondent’s age. To protect respondents’ privacy, ages are rounded off into 5-year bins. 'AGE' contains the midpoint of the bins.

Instruction

  • Extract the variable 'AGE' from the DataFrame brfss and assign it to age.
  • Plot the PMF of age as a bar chart.

3.3 Scatter plot

Now let’s make a scatterplot of weight versus age. To make the code run faster, I’ve selected only the first 1000 rows from the brfss DataFrame.

weight and age have already been extracted for you. Your job is to use plt.plot() to make a scatter plot.

Instruction
Make a scatter plot of weight and age with format string 'o' and alpha=0.1.

3.4 Jittering

In the previous exercise, the ages fall in columns because they’ve been rounded into 5-year bins. If we jitter them, the scatter plot will show the relationship more clearly. Recall how Allen jittered height and weight in the video:

height_jitter = height + np.random.normal(0, 2, size=len(brfss))
weight_jitter = weight + np.random.normal(0, 2, size=len(brfss))

Instruction

  • Add random noise to age with mean 0 and standard deviation 2.5.
  • Make a scatter plot between weight and age with marker size 5 and alpha=0.2. Be sure to also specify 'o'.
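
A sketch of jittering on synthetic data. The ages and weights are generated at random (not BRFSS values), and the noise comes from a seeded NumPy Generator in place of np.random.normal so the result is reproducible. The 'Agg' backend is an assumption that lets the script run without a display; it is not needed in a notebook.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend (assumption for scripts)
import matplotlib.pyplot as plt

# Hypothetical data: ages at the midpoints of 5-year bins, weights in kg
rng = np.random.default_rng(1)
age = rng.choice(np.arange(22, 83, 5), size=1000).astype(float)
weight = rng.normal(80, 15, size=1000)

# Add random noise with mean 0 and standard deviation 2.5
age_jitter = age + rng.normal(0, 2.5, size=len(age))

# Scatter plot with small, translucent markers
plt.plot(age_jitter, weight, 'o', markersize=5, alpha=0.2)
plt.xlabel('Age in years')
plt.ylabel('Weight in kg')
```

A standard deviation of 2.5 is about half the bin width, which spreads the columns out without blurring neighboring bins into each other too much.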

3.5 Visualizing relationships

3.6 Height and weight

Previously we looked at a scatter plot of height and weight, and saw that taller people tend to be heavier. Now let’s take a closer look using a box plot. The brfss DataFrame contains a variable '_HTMG10' that represents height in centimeters, binned into 10 cm groups.

Recall how Allen created the box plot of 'AGE' and 'WTKG3' in the video, with the y-axis on a logarithmic scale:

sns.boxplot(x='AGE', y='WTKG3', data=data, whis=10)
plt.yscale('log')

3.7 Distribution of income

In the next two exercises we’ll look at relationships between income and other variables. In the BRFSS, income is represented as a categorical variable; that is, respondents are assigned to one of 8 income categories. The variable name is 'INCOME2'. Before we connect income with anything else, let’s look at the distribution by computing the PMF. Recall that all Pmf objects have a .bar() method.

Instruction

  • Extract 'INCOME2' from the brfss DataFrame and assign it to income.
  • Plot the PMF of income as a bar chart.

# Extract income
income = brfss['INCOME2']

# Plot the PMF (Pmf is the class provided by the course)
Pmf(income).bar()

# Label the axes
plt.xlabel('Income level')
plt.ylabel('PMF')
plt.show()

3.8 Income and height

Let’s now use a violin plot to visualize the relationship between income and height.

Instruction

  • Create a violin plot to plot the distribution of height ('HTM4') in each income ('INCOME2') group. Specify inner=None to simplify the plot.

3.9 Correlation

3.10 Computing correlations

The purpose of the BRFSS is to explore health risk factors, so it includes questions about diet. The variable '_VEGESU1' represents the number of servings of vegetables respondents reported eating per day.

Let’s see how this variable relates to age and income.

Instruction

  • From the brfss DataFrame, select the columns 'AGE', 'INCOME2', and '_VEGESU1'.
  • Compute the correlation matrix for these variables.
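
Selecting several columns and computing their correlation matrix can be sketched as below. The DataFrame is synthetic, with a weak positive dependence of vegetable servings on income built in, so the numbers will not match the real BRFSS results.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for brfss (synthetic values)
rng = np.random.default_rng(2)
n = 500
brfss = pd.DataFrame({
    'AGE': rng.choice(np.arange(22, 83, 5), size=n),
    'INCOME2': rng.integers(1, 9, size=n),
})
brfss['_VEGESU1'] = 1.5 + 0.2 * brfss['INCOME2'] + rng.normal(0, 1, size=n)

# Select the columns with a list of names
columns = ['AGE', 'INCOME2', '_VEGESU1']
subset = brfss[columns]

# Pairwise Pearson correlations; the diagonal is always 1
corr = subset.corr()
print(corr)
```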

3.11 Interpreting correlations

In the previous exercise, the correlation between income and vegetable consumption is about 0.12. The correlation between age and vegetable consumption is about -0.01.

Which of the following are correct interpretations of these results:

  • A: People with higher incomes eat more vegetables.
  • B: The relationship between income and vegetable consumption is linear.
  • C: Older people eat more vegetables.
  • D: There could be a strong nonlinear relationship between age and vegetable consumption.

□ A and C only.

□ B and D only.

□ B and C only.

■ A and D only.
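
Statement D holds because correlation only measures linear relationships. As a small illustration with made-up numbers, y below is completely determined by x, yet the correlation is zero up to rounding error because the relationship is symmetric, not linear.

```python
import numpy as np

# x values symmetric around 0; y depends on x exactly
xs = np.linspace(-1, 1, 201)
ys = xs ** 2

# Pearson correlation between xs and ys
r = np.corrcoef(xs, ys)[0, 1]
print(r)
```

A correlation of about -0.01 between age and vegetable consumption therefore says there is little or no linear relationship, not that there is no relationship.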

3.12 Simple regression

3.13 Income and vegetables

As we saw in a previous exercise, the variable '_VEGESU1' represents the number of vegetable servings respondents reported eating per day.

Let’s estimate the slope of the relationship between vegetable consumption and income.

Instruction

  • Extract the columns 'INCOME2' and '_VEGESU1' from subset into xs and ys respectively.
  • Compute the simple linear regression of these variables.
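
A sketch of this regression with SciPy's linregress on synthetic data. The text does not say how subset was built, so the dropna() step is an assumption: linregress cannot handle NaN, so rows with missing values have to be removed first.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Hypothetical stand-in for brfss with one missing income
rng = np.random.default_rng(3)
n = 500
income = rng.integers(1, 9, size=n).astype(float)
veg = 1.5 + 0.07 * income + rng.normal(0, 0.5, size=n)
brfss = pd.DataFrame({'INCOME2': income, '_VEGESU1': veg})
brfss.loc[0, 'INCOME2'] = np.nan

# Drop rows with missing values in either column (assumed step)
subset = brfss.dropna(subset=['INCOME2', '_VEGESU1'])

# Extract income and vegetable servings
xs = subset['INCOME2']
ys = subset['_VEGESU1']

# Simple linear regression of ys onto xs
res = linregress(xs, ys)
print(res.slope, res.intercept)
```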

3.14 Fit a line

Continuing from the previous exercise:

  • Assume that xs and ys contain income codes and daily vegetable consumption, respectively, and
  • res contains the results of a simple linear regression of ys onto xs.

Instruction

  • Set fx to the minimum and maximum of xs, stored in a NumPy array.
  • Set fy to the points on the fitted line that correspond to the fx.
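
Because a line is determined by two points, it is enough to evaluate the fitted line at the smallest and largest x. A minimal sketch with made-up xs and ys:

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical income codes and daily vegetable servings
xs = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
ys = np.array([1.6, 1.7, 1.7, 1.9, 2.0, 2.0, 2.2, 2.3])
res = linregress(xs, ys)

# x-coordinates of the two endpoints of the line
fx = np.array([xs.min(), xs.max()])

# Corresponding points on the fitted line
fy = res.intercept + res.slope * fx
print(fx, fy)
```

Passing fx and fy to plt.plot() with a '-' format string would then draw the line over a scatter plot of the data.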

4. Multivariate Thinking

4.1 Limits of simple regression

4.2 Regression and causation

In the BRFSS dataset, there is a strong relationship between vegetable consumption and income. The income of people who eat 8 servings of vegetables per day is double the income of people who eat none, on average.

Which of the following conclusions can we draw from this data?
A. Eating a good diet leads to better health and higher income.
B. People with higher income can afford a better diet.
C. People with high income are more likely to be vegetarians.

□ A only.

□ B only.

□ B and C.

■ None of them.

4.3 Using StatsModels

Let’s run the same regression using SciPy and StatsModels, and confirm we get the same results.

Instruction

  • Compute the regression of '_VEGESU1' as a function of 'INCOME2' using SciPy’s linregress().
  • Compute the regression of '_VEGESU1' as a function of 'INCOME2' using StatsModels’ smf.ols().
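
The comparison can be sketched on synthetic data as below. The statsmodels formula 'y ~ x' names the dependent variable on the left and the explanatory variable on the right. The dropna() step before linregress is my assumption; to my understanding the formula interface drops rows with missing values by default, so none is needed for smf.ols().

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress
import statsmodels.formula.api as smf

# Hypothetical stand-in for brfss (synthetic values)
rng = np.random.default_rng(4)
n = 300
brfss = pd.DataFrame({'INCOME2': rng.integers(1, 9, size=n).astype(float)})
brfss['_VEGESU1'] = 1.5 + 0.07 * brfss['INCOME2'] + rng.normal(0, 0.5, size=n)

# Run the regression with SciPy
subset = brfss.dropna(subset=['INCOME2', '_VEGESU1'])
res = linregress(subset['INCOME2'], subset['_VEGESU1'])
print(res.slope, res.intercept)

# Run the same regression with StatsModels
results = smf.ols('_VEGESU1 ~ INCOME2', data=brfss).fit()
print(results.params)
```

The two approaches should agree: results.params['INCOME2'] corresponds to res.slope and results.params['Intercept'] to res.intercept.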

4.4 Multiple regression

4.5 Plot income and education

To get a closer look at the relationship between income and education, let’s use the variable 'educ' to group the data, then plot mean income in each group.

Instruction

  • Group gss by 'educ'. Store the result in grouped.
  • From grouped, extract 'realinc' and compute the mean.
  • Plot mean_income_by_educ as a scatter plot. Specify 'o' and alpha=0.5.
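
Grouping and averaging can be sketched as follows, on six made-up rows. The result is a Series indexed by years of education, which is what makes it convenient to plot.

```python
import pandas as pd

# Hypothetical stand-in for gss
gss = pd.DataFrame({
    'educ': [12, 12, 14, 16, 16, 16],
    'realinc': [20000, 30000, 35000, 50000, 60000, 70000],
})

# Group by years of education
grouped = gss.groupby('educ')

# Mean income in each group
mean_income_by_educ = grouped['realinc'].mean()
print(mean_income_by_educ)

# To plot: plt.plot(mean_income_by_educ, 'o', alpha=0.5)
```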

4.6 Non-linear model of education

The graph in the previous exercise suggests that the relationship between income and education is non-linear. So let’s try fitting a non-linear model.

Instruction

  • Add a column named 'educ2' to the gss DataFrame; it should contain the values from 'educ' squared.
  • Run a regression model that uses 'educ', 'educ2', 'age', and 'age2' to predict 'realinc'.

4.7 Visualizing regression results

4.8 Making predictions

At this point, we have a model that predicts income using age, education, and sex.

Let’s see what it predicts for different levels of education, holding age constant.

Instruction

  • Using np.linspace(), add a variable named 'educ' to df with a range of values from 0 to 20.
  • Add a variable named 'age' with the constant value 30.
  • Use df to generate predicted income as a function of education.
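
A sketch of generating predictions. It assumes the model is the quadratic one in 'educ', 'educ2', 'age', and 'age2' (the sex term mentioned above is left out for brevity), and it fits that model to synthetic data first so the example is self-contained. The new DataFrame must contain every variable the formula uses, so educ2 and age2 have to be computed for df as well.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for gss (synthetic values)
rng = np.random.default_rng(5)
n = 1000
gss = pd.DataFrame({
    'educ': rng.integers(0, 21, size=n).astype(float),
    'age': rng.integers(18, 90, size=n).astype(float),
})
gss['educ2'] = gss['educ'] ** 2
gss['age2'] = gss['age'] ** 2
gss['realinc'] = (5000 + 60 * gss['educ2'] + 900 * gss['age']
                  - 9 * gss['age2'] + rng.normal(0, 3000, size=n))

# Fit the model (assumed form)
results = smf.ols('realinc ~ educ + educ2 + age + age2', data=gss).fit()

# Make the DataFrame of values to predict for
df = pd.DataFrame()
df['educ'] = np.linspace(0, 20)  # 50 evenly spaced values by default
df['age'] = 30                   # hold age constant
df['educ2'] = df['educ'] ** 2
df['age2'] = df['age'] ** 2

# Predicted income as a function of education
pred = results.predict(df)
print(pred.head())
```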

4.9 Visualizing predictions

Now let’s visualize the results from the previous exercise!

Instruction

  • Plot mean_income_by_educ using circles ('o'). Specify an alpha of 0.5.
  • Plot the prediction results with a line, with df['educ'] on the x-axis and pred on the y-axis.

4.10 Logistic regression

4.11 Predicting a binary variable

Let’s use logistic regression to predict a binary variable. Specifically, we’ll use age, sex, and education level to predict support for legalizing cannabis (marijuana) in the U.S.

In the GSS dataset, the variable grass records the answer to the question “Do you think the use of marijuana should be made legal or not?”

Instruction 1
Fill in the parameters of smf.logit() to predict grass using the variables age, age2, educ, and educ2, along with sex as a categorical variable.

Instruction 2
Add a column called educ and set it to 12 years; then compute a second column, educ2, which is the square of educ.

Instruction 3
Generate separate predictions for men and women.

Instruction 4
Fill in the missing code to compute the mean of 'grass' for each age group, and then the arguments of plt.plot() to plot pred2 versus df['age'] with the label 'Female'.
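
The modeling and prediction steps can be sketched on synthetic data as below. Several things here are assumptions for illustration: the generated responses, the coding of grass as 0 or 1 (which smf.logit() expects), the coding of sex as 1 and 2, and the names pred1 and df for the first set of predictions. The grouping and plotting step is left out.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for gss (synthetic values)
rng = np.random.default_rng(6)
n = 2000
gss = pd.DataFrame({
    'age': rng.integers(18, 90, size=n).astype(float),
    'educ': rng.integers(8, 21, size=n).astype(float),
    'sex': rng.integers(1, 3, size=n),
})
gss['age2'] = gss['age'] ** 2
gss['educ2'] = gss['educ'] ** 2
log_odds = 1.0 - 0.03 * gss['age'] + 0.05 * gss['educ'] - 0.8 * (gss['sex'] == 2)
gss['grass'] = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# Instruction 1: logistic regression with sex as a categorical variable
formula = 'grass ~ age + age2 + educ + educ2 + C(sex)'
results = smf.logit(formula, data=gss).fit(disp=False)

# Instruction 2: a range of ages with education held at 12 years
df = pd.DataFrame()
df['age'] = np.linspace(18, 89)
df['age2'] = df['age'] ** 2
df['educ'] = 12
df['educ2'] = df['educ'] ** 2

# Instruction 3: separate predictions for each value of sex
df['sex'] = 1
pred1 = results.predict(df)
df['sex'] = 2
pred2 = results.predict(df)
```

For a logistic model, predict() returns probabilities between 0 and 1, so pred1 and pred2 can be plotted directly against df['age'].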

4.12 Next steps
