To get the number of rows and columns in a DataFrame, you can read its shape attribute.
To get the column names, you can read the columns attribute. The result is an Index, which is a Pandas data structure that is similar to a list. Let’s begin exploring the NSFG data! It has been pre-loaded for you into a DataFrame called nsfg.
Instruction
- Get the number of rows and columns in nsfg.
- Get the column names of nsfg.
- Select the column 'birthwgt_oz1' and assign it to a new variable called ounces.
- Display ounces.
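A minimal sketch of these steps, assuming a small made-up DataFrame in place of the real NSFG data (the values below are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the NSFG DataFrame; the values are made up.
nsfg = pd.DataFrame({
    'birthwgt_lb1': [7, 6, 8, 99],
    'birthwgt_oz1': [4, 12, 0, 98],
    'outcome': [1, 1, 4, 1],
})

# shape is a (rows, columns) tuple
print(nsfg.shape)

# columns is a pandas Index of column names
print(nsfg.columns)

# Selecting one column returns a Series
ounces = nsfg['birthwgt_oz1']
print(ounces.head())
```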
In the NSFG dataset, the variable 'outcome' encodes the outcome of each pregnancy as shown below:
value | label |
---|---|
1 | Live birth |
2 | Induced abortion |
3 | Stillbirth |
4 | Miscarriage |
5 | Ectopic pregnancy |
6 | Current pregnancy |
How many pregnancies in this dataset ended with a live birth?
■ 6489
□ 9538
□ 1469
□ 6
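One way to answer a question like this is to count how often each code appears. The sketch below uses a short made-up Series in place of nsfg['outcome'], so its counts are not the real ones:

```python
import pandas as pd

# Made-up outcome codes standing in for nsfg['outcome'].
outcome = pd.Series([1, 1, 4, 2, 1, 4, 6, 1])

# Count each code, ordered by code rather than by frequency.
counts = outcome.value_counts().sort_index()
print(counts)

# Code 1 means "Live birth" in the table above.
live_births = counts.loc[1]
print(live_births)
```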
In the NSFG dataset, the variable 'nbrnaliv' records the number of babies born alive at the end of a pregnancy.
If you use .value_counts() to view the responses, you’ll see that the value 8 appears once, and if you consult the codebook, you’ll see that this value indicates that the respondent refused to answer the question.
Your job in this exercise is to replace this value with np.nan. Recall from the video how Allen replaced the values 98 and 99 in the ounces column using the .replace() method:
ounces.replace([98, 99], np.nan, inplace=True)
Instruction
- In the 'nbrnaliv' column, replace the value 8, in place, with the special value NaN.
- Confirm that the value 8 no longer appears in this column by printing the values and their frequencies.
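A possible sketch of the replacement, using made-up data. The exercise asks for an in-place replacement; this sketch assigns the result back to the column instead, on the assumption that a recent pandas version is in use, where inplace=True on a selected column may not update the parent DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the NSFG DataFrame; 8 is the "refused" code.
nsfg = pd.DataFrame({'nbrnaliv': [1, 1, 2, 8, 1, 3]})

# Assign the result back rather than relying on inplace=True.
nsfg['nbrnaliv'] = nsfg['nbrnaliv'].replace(8, np.nan)

# value_counts() skips NaN by default, so 8 should be gone from the index.
counts = nsfg['nbrnaliv'].value_counts()
print(counts)
```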
For each pregnancy in the NSFG dataset, the variable 'agecon' encodes the respondent’s age at conception, and 'agepreg' the respondent’s age at the end of the pregnancy.
Both variables are recorded as integers with two implicit decimal places, so the value 2575 means that the respondent’s age was 25.75.
Instruction 1
Select 'agecon' and 'agepreg', divide them by 100, and assign them to the local variables agecon and agepreg.
Instruction 2
Compute the difference, which is an estimate of the duration of the pregnancy. Keep in mind that for each pregnancy, agepreg will be larger than agecon.
Instruction 3
Use .describe() to compute the mean duration and other summary statistics.
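The three instructions could be sketched as follows. The data is made up, and the name preg_length is a hypothetical choice, not one given in the exercise.

```python
import pandas as pd

# Made-up ages in hundredths of a year, standing in for the NSFG columns.
nsfg = pd.DataFrame({
    'agecon': [2575, 3000, 1900],
    'agepreg': [2650, 3075, 1950],
})

# Convert from hundredths of a year to years.
agecon = nsfg['agecon'] / 100
agepreg = nsfg['agepreg'] / 100

# Estimated duration of each pregnancy, in years (hypothetical name).
preg_length = agepreg - agecon

# Count, mean, standard deviation, min, quartiles, and max.
print(preg_length.describe())
```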
Histograms are one of the most useful tools in exploratory data analysis. They quickly give you an overview of the distribution of a variable, that is, what values the variable can have, and how many times each value appears.
As we saw in a previous exercise, the NSFG dataset includes a variable 'agecon' that records age at conception for each pregnancy. Here, you’re going to plot a histogram of this variable. You’ll use the bins parameter that you saw in the video, and also a new parameter, histtype, which you can read more about here in the matplotlib documentation. Learning how to read documentation is an essential skill. If you want to learn more about matplotlib, you can check out DataCamp’s Introduction to Matplotlib course.
Instruction 1
Plot a histogram of agecon with 20 bins.
Instruction 2
Adapt your code to make an unfilled histogram by setting the parameter histtype to be 'step'.
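A rough sketch of both histograms, assuming synthetic ages in place of the real agecon values. The Agg backend and the axis labels are choices made here so the sketch runs without a display; they are not part of the exercise.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic ages standing in for nsfg['agecon'] / 100.
rng = np.random.default_rng(0)
agecon = rng.normal(25, 5, size=1000)

# Filled histogram with 20 bins.
plt.hist(agecon, bins=20)

# Same data drawn as an unfilled outline on a new figure.
plt.figure()
counts, edges, _ = plt.hist(agecon, bins=20, histtype='step')

plt.xlabel('Age at conception')
plt.ylabel('Number of pregnancies')
```

In an interactive session, drop the Agg line and call plt.show() to display the figures.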
Now let’s pull together the steps in this chapter to compute the average birth weight for full-term babies.
I’ve provided a function, resample_rows_weighted, that takes the NSFG data and resamples it using the sampling weights in wgt2013_2015. The result is a sample that is representative of the U.S. population.
Then I extract birthwgt_lb1 and birthwgt_oz1, replace special codes with NaN, and compute total birth weight in pounds, birth_weight.
# Resample the data
nsfg = resample_rows_weighted(nsfg, 'wgt2013_2015')
# Clean the weight variables
pounds = nsfg['birthwgt_lb1'].replace([98, 99], np.nan)
ounces = nsfg['birthwgt_oz1'].replace([98, 99], np.nan)
# Compute total birth weight
birth_weight = pounds + ounces/16
Instruction
- Create a Boolean Series called full_term that is true for babies with 'prglngth' greater than or equal to 37 weeks.
- Use full_term and birth_weight to select birth weight in pounds for full-term babies. Store the result in full_term_weight.
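A minimal sketch of the filtering step, assuming a few made-up rows in place of the resampled NSFG data:

```python
import pandas as pd

# Made-up rows standing in for the resampled NSFG data.
nsfg = pd.DataFrame({
    'prglngth': [39, 35, 40, 37],
    'birthwgt_lb1': [7, 5, 8, 6],
    'birthwgt_oz1': [8, 0, 0, 8],
})
birth_weight = nsfg['birthwgt_lb1'] + nsfg['birthwgt_oz1'] / 16

# Boolean Series: True for pregnancies of 37 weeks or more.
full_term = nsfg['prglngth'] >= 37

# Boolean indexing keeps only the rows where full_term is True.
full_term_weight = birth_weight[full_term]
print(full_term_weight.mean())
```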
In the previous exercise, you computed the mean birth weight for full-term babies; you filtered out preterm babies because their distribution of weight is different.
The distribution of weight is also different for multiple births, like twins and triplets. In this exercise, you’ll filter them out, too, and see what effect it has on the mean.
Instruction
- Use the variable 'nbrnaliv' to make a Boolean Series that is True for single births (where 'nbrnaliv' equals 1) and False otherwise.
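Combining the two filters might look like the sketch below. The data is made up, and the names single_full_term_weight and mult_full_term_weight are hypothetical.

```python
import pandas as pd

# Made-up rows; birth_weight is given directly in pounds.
nsfg = pd.DataFrame({
    'prglngth': [39, 35, 40, 37],
    'nbrnaliv': [1, 1, 2, 1],
})
birth_weight = pd.Series([7.5, 5.0, 6.0, 6.5])

full_term = nsfg['prglngth'] >= 37
single = nsfg['nbrnaliv'] == 1

# & combines conditions element-wise; ~ negates a Boolean Series.
single_full_term_weight = birth_weight[full_term & single]
mult_full_term_weight = birth_weight[full_term & ~single]

print(single_full_term_weight.mean())
print(mult_full_term_weight.mean())
```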
The GSS dataset has been pre-loaded for you into a DataFrame called gss. You can explore it in the IPython Shell to get familiar with it.
In this exercise, you’ll focus on one variable in this dataset, 'year', which represents the year each respondent was interviewed.
The Pmf class you saw in the video has already been created for you. You can access it outside of DataCamp via the empiricaldist library.
Instruction 1
Make a PMF for year with normalize=False and display the result.
Instruction 2
How many respondents were interviewed in 2016?
■ 2867
□ 1613
□ 2538
□ 0.045897
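The difference between an unnormalized and a normalized PMF can be sketched with plain pandas. This is a stand-in for the course's Pmf class, not its actual interface, and the years below are made up.

```python
import pandas as pd

# Made-up interview years standing in for gss['year'].
year = pd.Series([1972, 1972, 2016, 2016, 2016, 1990])

# Unnormalized: number of respondents per year.
pmf_counts = year.value_counts().sort_index()

# Normalized: fraction of respondents per year, summing to 1.
pmf_probs = year.value_counts(normalize=True).sort_index()

print(pmf_counts.loc[2016])
print(pmf_probs.loc[2016])
```

With normalize=False the lookup returns a count, which is why the answer to the question above is a whole number rather than a fraction.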
Now let’s plot a PMF for the age of the respondents in the GSS dataset. The variable 'age' contains respondents’ age in years.
Instruction 1
Select the 'age' column from the gss DataFrame and store the result in age.
Instruction 2
Make a normalized PMF of age. Store the result in pmf_age.
Instruction 3
Plot pmf_age as a bar chart.
In this exercise, you’ll make a CDF and use it to determine the fraction of respondents in the GSS dataset who are OLDER than 30.
The GSS dataset has been preloaded for you into a DataFrame called gss.
As with the Pmf class from the previous lesson, the Cdf class you just saw in the video has been created for you, and you can access it outside of DataCamp via the empiricaldist library.
Instruction 1
Select the 'age' column. Store the result in age.
Instruction 2
Compute the CDF of age. Store the result in cdf_age.
Instruction 3
Calculate the CDF of 30.
Instruction 4
What fraction of the respondents in the GSS dataset are OLDER than 30?
■ Approximately 75%
□ Approximately 65%
□ Approximately 45%
□ Approximately 25%
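The reasoning behind the answer is that the CDF at 30 gives the fraction aged 30 or younger, so the fraction older than 30 is its complement. A small sketch with NumPy, using made-up ages and a hypothetical helper empirical_cdf in place of the Cdf class:

```python
import numpy as np

# Made-up ages standing in for gss['age'] (missing values already dropped).
age = np.array([22, 25, 30, 34, 41, 47, 53, 60])

def empirical_cdf(values, x):
    """Fraction of values less than or equal to x."""
    return np.mean(values <= x)

# Fraction aged 30 or younger, and its complement.
p_at_most_30 = empirical_cdf(age, 30)
p_older_than_30 = 1 - p_at_most_30

print(p_at_most_30, p_older_than_30)
```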
Recall from the video that the interquartile range (IQR) is the difference between the 75th and 25th percentiles. It is a measure of variability that is robust in the presence of errors or extreme values.
In this exercise, you’ll compute the interquartile range of income in the GSS dataset. Income is stored in the 'realinc' column, and the CDF of income has already been computed and stored in cdf_income.
Instruction 1
Calculate the 75th percentile of income and store it in percentile_75th.
Instruction 2
Calculate the 25th percentile of income and store it in percentile_25th.
Instruction 3
Calculate the interquartile range of income. Store the result in iqr.
Instruction 4
What is the interquartile range (IQR) of income in the GSS dataset?
■ Approximately 29676
□ Approximately 26015
□ Approximately 34702
□ Approximately 30655
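As a rough illustration of the calculation, the sketch below uses the pandas quantile method on made-up incomes instead of the pre-computed cdf_income. Because quantile interpolates between data points by default, its results may differ slightly from what a Cdf object returns.

```python
import pandas as pd

# Made-up incomes standing in for gss['realinc'].
income = pd.Series([10_000, 20_000, 30_000, 40_000, 50_000])

# 75th and 25th percentiles of the data.
percentile_75th = income.quantile(0.75)
percentile_25th = income.quantile(0.25)

# The IQR is the distance between them.
iqr = percentile_75th - percentile_25th
print(iqr)
```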
The distribution of income in almost every country is long-tailed; that is, there are a small number of people with very high incomes.
In the GSS dataset, the variable 'realinc' represents total household income, converted to 1986 dollars. We can get a sense of the shape of this distribution by plotting the CDF.
Instruction
- Select 'realinc' from the gss dataset.
- Make a Cdf object called cdf_income.
- Create a plot of cdf_income using .plot().
Let’s begin comparing incomes for different levels of education in the GSS dataset, which has been pre-loaded for you into a DataFrame called gss. The variable educ represents the respondent’s years of education.
What fraction of respondents report that they have 12 years of education or fewer?
□ Approximately 22%
□ Approximately 31%
□ Approximately 47%
■ Approximately 53%
Let’s create Boolean Series to identify respondents with different levels of education.
In the U.S., 12 years of education usually means the respondent has completed high school (secondary education). A respondent with 14 years of education has probably completed an associate degree (two years of college); someone with 16 years has probably completed a bachelor’s degree (four years of college).
Instruction
Let’s now see what the distribution of income looks like for people with different education levels. You can do this by plotting the CDFs. Recall how Allen plotted the income CDFs of respondents interviewed before and after 1995:
Cdf(income[pre95]).plot(label='Before 1995')
Cdf(income[~pre95]).plot(label='After 1995')
You can assume that Boolean Series have been defined, as in the previous exercise, to identify respondents with different education levels: high, assc, and bach.
Instruction
Fill in the missing lines of code to plot the CDFs.
In many datasets, the distribution of income is approximately lognormal, which means that the logarithms of the incomes fit a normal distribution. We’ll see whether that’s true for the GSS data. As a first step, you’ll compute the mean and standard deviation of the log of incomes using NumPy’s np.log10() function.
Then, you’ll use the computed mean and standard deviation to make a norm object using the scipy.stats.norm() function.
Instruction
- Extract 'realinc' from gss and compute its logarithm using np.log10().
- Make a norm object by passing the computed mean and standard deviation to norm().
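A small sketch of these steps, assuming three made-up incomes chosen so that their base-10 logarithms are 3, 4, and 5:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Made-up incomes standing in for gss['realinc'].
income = pd.Series([1_000, 10_000, 100_000])

# Base-10 logarithm of each income.
log_income = np.log10(income)

# Sample mean and standard deviation of the logs.
mean, std = log_income.mean(), log_income.std()

# Normal distribution with the same mean and standard deviation.
dist = norm(mean, std)
print(mean, std)
```

The resulting dist object can then be evaluated at any point, for example dist.cdf(mean), which is 0.5 for any normal distribution.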
To see whether the distribution of income is well modeled by a lognormal distribution, we’ll compare the CDF of the logarithm of the data to a normal distribution with the same mean and standard deviation. These variables from the previous exercise are available for use:
# Extract realinc and compute its log
log_income = np.log10(gss['realinc'])
# Compute mean and standard deviation
mean, std = log_income.mean(), log_income.std()
# Make a norm object
from scipy.stats import norm
dist = norm(mean, std)
dist is a scipy.stats.norm object with the same mean and standard deviation as the data. It provides .cdf(), which evaluates the normal cumulative distribution function.
Be careful with capitalization: Cdf(), with an uppercase C, creates Cdf objects. dist.cdf(), with a lowercase c, evaluates the normal cumulative distribution function.
In the previous exercise, we used CDFs to see if the distribution of income is lognormal. We can make the same comparison using a PDF and KDE. That’s what you’ll do in this exercise!
As before, the norm object dist is available in your workspace:
from scipy.stats import norm
dist = norm(mean, std)
Just as all norm objects have a .cdf() method, they also have a .pdf() method.
To create a KDE plot, you can use Seaborn’s kdeplot() function.
Instruction
- Evaluate the normal PDF using dist, which is a norm object with the same mean and standard deviation as the data.
- Make a KDE plot of the logarithms of the incomes, using log_income, which is a Series object.
PMF of age
Do people tend to gain weight as they get older? We can answer this question by visualizing the relationship between weight and age. But before we make a scatter plot, it is a good idea to visualize distributions one variable at a time. Here, you’ll visualize age using a bar chart first. Recall that all PMF objects have a .bar() method to make a bar chart.
The BRFSS dataset includes a variable, 'AGE' (note the capitalization!), which represents each respondent’s age. To protect respondents’ privacy, ages are rounded off into 5-year bins. 'AGE' contains the midpoint of the bins.
Instruction
- Extract the variable 'AGE' from the DataFrame brfss and assign it to age.
- Plot the PMF of age as a bar chart.
Now let’s make a scatterplot of weight versus age. To make the code run faster, I’ve selected only the first 1000 rows from the brfss DataFrame.
weight and age have already been extracted for you. Your job is to use plt.plot() to make a scatter plot.
Instruction
Make a scatter plot of weight and age with format string 'o' and alpha=0.1.
In the previous exercise, the ages fall in columns because they’ve been rounded into 5-year bins. If we jitter them, the scatter plot will show the relationship more clearly. Recall how Allen jittered height and weight in the video:
height_jitter = height + np.random.normal(0, 2, size=len(brfss))
weight_jitter = weight + np.random.normal(0, 2, size=len(brfss))
Instruction
- Add random noise to age with mean 0 and standard deviation 2.5.
- Make a scatter plot of weight and age with marker size 5 and alpha=0.2. Be sure to also specify 'o'.
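The jittering step can be sketched with NumPy alone. The ages below are made up, and the seed is a hypothetical addition so the sketch is repeatable; the exercise itself does not set one.

```python
import numpy as np

np.random.seed(0)  # hypothetical seed, only so the sketch is repeatable

# Made-up binned ages standing in for brfss['AGE'].
age = np.array([22.0, 27.0, 27.0, 32.0, 32.0, 32.0])

# One independent noise value per observation, mean 0 and std 2.5.
age_jitter = age + np.random.normal(0, 2.5, size=len(age))
print(age_jitter)
```

With a matching weight array, the scatter plot would then be drawn with plt.plot(age_jitter, weight, 'o', markersize=5, alpha=0.2).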
Previously we looked at a scatter plot of height and weight, and saw that taller people tend to be heavier. Now let’s take a closer look using a box plot. The brfss DataFrame contains a variable '_HTMG10' that represents height in centimeters, binned into 10 cm groups.
Recall how Allen created the box plot of 'AGE' and 'WTKG3' in the video, with the y-axis on a logarithmic scale:
sns.boxplot(x='AGE', y='WTKG3', data=data, whis=10)
plt.yscale('log')
In the next two exercises we’ll look at relationships between income and other variables. In the BRFSS, income is represented as a categorical variable; that is, respondents are assigned to one of 8 income categories. The variable name is 'INCOME2'. Before we connect income with anything else, let’s look at the distribution by computing the PMF. Recall that all Pmf objects have a .bar() method.
Instruction
- Extract 'INCOME2' from the brfss DataFrame and assign it to income.
- Plot the PMF of income as a bar chart.
# Extract income
income = brfss['INCOME2']
# Plot the PMF
Pmf(income).bar()
# Label the axes
plt.xlabel('Income level')
plt.ylabel('PMF')
plt.show()
Let’s now use a violin plot to visualize the relationship between income and height.
Instruction
- Create a violin plot to plot the distribution of height ('HTM4') in each income ('INCOME2') group. Specify inner=None to simplify the plot.
The purpose of the BRFSS is to explore health risk factors, so it includes questions about diet. The variable '_VEGESU1' represents the number of servings of vegetables respondents reported eating per day.
Let’s see how this variable relates to age and income.
Instruction
- From the brfss DataFrame, select the columns 'AGE', 'INCOME2', and '_VEGESU1'.
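Selecting the columns and computing their pairwise correlations might look like this sketch, which assumes a tiny made-up brfss DataFrame:

```python
import pandas as pd

# Made-up rows standing in for the BRFSS data.
brfss = pd.DataFrame({
    'AGE': [22, 37, 52, 67, 42],
    'INCOME2': [3, 5, 8, 6, 4],
    '_VEGESU1': [1.0, 2.5, 3.0, 2.0, 1.5],
    'HTM4': [170, 165, 180, 175, 160],
})

# Passing a list of names selects several columns at once.
columns = ['AGE', 'INCOME2', '_VEGESU1']
subset = brfss[columns]

# Pairwise Pearson correlations between the selected columns.
corr = subset.corr()
print(corr)
```

The result is a symmetric matrix with ones on the diagonal, since every variable is perfectly correlated with itself.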
In the previous exercise, the correlation between income and vegetable consumption is about 0.12. The correlation between age and vegetable consumption is about -0.01.
Which of the following are correct interpretations of these results:
■ A and C only.
□ B and D only.
□ B and C only.
□ A and D only.
As we saw in a previous exercise, the variable '_VEGESU1' represents the number of vegetable servings respondents reported eating per day.
Let’s estimate the slope of the relationship between vegetable consumption and income.
Instruction
- Extract the columns 'INCOME2' and '_VEGESU1' from subset into xs and ys respectively.
Continuing from the previous exercise: xs and ys contain income codes and daily vegetable consumption, respectively, and res contains the results of a simple linear regression of ys onto xs.
Instruction
- Set fx to the minimum and maximum of xs, stored in a NumPy array.
- Set fy to the points on the fitted line that correspond to the fx.
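The regression and the fitted line from these two exercises can be sketched together. The data here is made up and lies exactly on a line, so the recovered slope and intercept are easy to check.

```python
import numpy as np
from scipy.stats import linregress

# Made-up data on an exact line: ys = 1 + 0.5 * xs.
xs = np.array([1.0, 2.0, 4.0, 8.0])
ys = 1 + 0.5 * xs

# Simple linear regression of ys onto xs.
res = linregress(xs, ys)

# Two x values are enough to draw a straight line.
fx = np.array([xs.min(), xs.max()])
fy = res.intercept + res.slope * fx

print(res.slope, res.intercept)
print(fx, fy)
```

Note that linregress does not skip missing values, so rows with NaN in either column need to be dropped before calling it on real data.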
In the BRFSS dataset, there is a strong relationship between vegetable consumption and income. The income of people who eat 8 servings of vegetables per day is double the income of people who eat none, on average.
Which of the following conclusions can we draw from this data?
A. Eating a good diet leads to better health and higher income.
B. People with higher income can afford a better diet.
C. People with high income are more likely to be vegetarians.
□ A only.
□ B only.
□ B and C.
■ None of them.
Let’s run the same regression using SciPy and StatsModels, and confirm we get the same results.
Instruction
- Compute the regression of '_VEGESU1' as a function of 'INCOME2' using SciPy’s linregress().
- Compute the regression of '_VEGESU1' as a function of 'INCOME2' using StatsModels’ smf.ols().
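A sketch of the comparison, assuming synthetic data with invented coefficients in place of the real BRFSS rows. Both libraries fit the same least-squares line, so the slope and intercept should agree up to floating-point error.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import linregress

# Synthetic data; the relationship and noise level are invented.
rng = np.random.default_rng(1)
brfss = pd.DataFrame({'INCOME2': rng.integers(1, 9, size=200).astype(float)})
brfss['_VEGESU1'] = 1.5 + 0.07 * brfss['INCOME2'] + rng.normal(0, 0.5, size=200)

# SciPy: takes the x and y sequences directly.
res = linregress(brfss['INCOME2'], brfss['_VEGESU1'])

# StatsModels: takes a formula string and a DataFrame.
results = smf.ols('_VEGESU1 ~ INCOME2', data=brfss).fit()

print(res.slope, res.intercept)
print(results.params)
```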
To get a closer look at the relationship between income and education, let’s use the variable 'educ' to group the data, then plot mean income in each group.
Instruction
- Group gss by 'educ'. Store the result in grouped.
- From grouped, extract 'realinc' and compute the mean.
- Plot mean_income_by_educ as a scatter plot. Specify 'o' and alpha=0.5.
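The grouping steps could be sketched as follows, using a handful of made-up respondents:

```python
import pandas as pd

# Made-up rows standing in for the GSS data.
gss = pd.DataFrame({
    'educ': [12, 12, 16, 16, 16, 20],
    'realinc': [20_000, 30_000, 40_000, 50_000, 60_000, 90_000],
})

# Group rows by years of education.
grouped = gss.groupby('educ')

# Mean income within each group; the index holds the educ values.
mean_income_by_educ = grouped['realinc'].mean()
print(mean_income_by_educ)
```

Because the result is a Series indexed by education, plt.plot(mean_income_by_educ, 'o', alpha=0.5) would put education on the x-axis and mean income on the y-axis.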
The graph in the previous exercise suggests that the relationship between income and education is non-linear. So let’s try fitting a non-linear model.
Instruction
- Add a column named 'educ2' to the gss DataFrame; it should contain the values from 'educ' squared.
- Run a regression model that uses 'educ', 'educ2', 'age', and 'age2' to predict 'realinc'.
At this point, we have a model that predicts income using age, education, and sex.
Let’s see what it predicts for different levels of education, holding age constant.
Instruction
- Using np.linspace(), add a variable named 'educ' to df with a range of values from 0 to 20.
- Add a variable named 'age' with the constant value 30.
- Use df to generate predicted income as a function of education.
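A rough sketch of fitting a quadratic model and generating predictions, assuming synthetic data with invented coefficients. For simplicity the model here omits sex and uses only the education and age terms.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data standing in for gss; the coefficients are invented.
rng = np.random.default_rng(2)
n = 500
gss = pd.DataFrame({
    'educ': rng.integers(0, 21, size=n).astype(float),
    'age': rng.integers(18, 90, size=n).astype(float),
})
gss['educ2'] = gss['educ'] ** 2
gss['age2'] = gss['age'] ** 2
gss['realinc'] = (
    5000 + 200 * gss['educ2'] + 1500 * gss['age'] - 15 * gss['age2']
    + rng.normal(0, 5000, size=n)
)

results = smf.ols('realinc ~ educ + educ2 + age + age2', data=gss).fit()

# Prediction grid: education varies, age is held at 30.
df = pd.DataFrame()
df['educ'] = np.linspace(0, 20)  # 50 evenly spaced values by default
df['age'] = 30
df['educ2'] = df['educ'] ** 2
df['age2'] = df['age'] ** 2

# One predicted income per row of df.
pred = results.predict(df)
print(pred.head())
```

The squared columns have to be added to df as well, because the fitted model expects every variable named in its formula.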
Now let’s visualize the results from the previous exercise!
Instruction
- Plot mean_income_by_educ using circles ('o'). Specify an alpha of 0.5.
- Plot the prediction results with a line, with df['educ'] on the x-axis and pred on the y-axis.
Let’s use logistic regression to predict a binary variable. Specifically, we’ll use age, sex, and education level to predict support for legalizing cannabis (marijuana) in the U.S.
In the GSS dataset, the variable grass records the answer to the question “Do you think the use of marijuana should be made legal or not?”
Instruction 1
Fill in the parameters of smf.logit() to predict grass using the variables age, age2, educ, and educ2, along with sex as a categorical variable.
Instruction 2
Add a column called educ and set it to 12 years; then compute a second column, educ2, which is the square of educ.
Instruction 3
Generate separate predictions for men and women.
Instruction 4
Fill in the missing code to compute the mean of 'grass' for each age group, and then the arguments of plt.plot() to plot pred2 versus df['age'] with the label 'Female'.