Assignment 1
MAST90083 Computational Statistics and Data Mining
Due time: 5PM, Monday September 16th
You must submit your report via LMS
1 Data Analysis
Gross domestic product is a standard measure of the size of an economy; it’s the total value
of all goods and services bought and sold in a country over the course of a year. It’s not a
perfect measure of prosperity, but it is a very common one, and many important questions
in economics turn on what leads GDP to grow faster or slower. One common idea is that
poorer economies, those with lower initial GDPs, should grow faster than richer ones.
The reasoning behind this catching up is that poor economies can copy technologies and
procedures from richer ones, but already-developed countries can only grow as technology
advances. A second, separate idea is that countries can boost their growth rate by undervaluing
their currency, making the goods and services they export cheaper. Our dataset
“uval.csv” contains the following variables:
• Country, in a three-letter code.
• Year (in five-year increments).
• Per-capita GDP, in dollars per person per year
• Average percentage growth rate in GDP over the next five years.
• An index of currency under-valuation. The index is 0 if the currency is neither over- nor
under-valued, positive if it is under-valued, and negative if it is over-valued.
Note that not all countries have data for all years. However, there are no missing values in
the data table.
1. Linearly regress the growth rate on the under-valuation index and the log of GDP.
Report the coefficients and their standard errors. Do the coefficients support the
idea of catching up? Do they support the idea that under-valuing a currency boosts
economic growth?
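As a rough illustration of the mechanics only, the sketch below fits this kind of regression on simulated data rather than on uval.csv. The column names gdp, underval and growth, and the coefficients used to simulate, are assumptions made for the example; the real file's column names may differ.

```r
# Simulated stand-in for the real data; all names and values here are hypothetical.
set.seed(6)
n   <- 120
sim <- data.frame(gdp = exp(rnorm(n, 8, 1)), underval = rnorm(n, 0, 0.4))
sim$growth <- 2 - 0.1 * log(sim$gdp) + 1.5 * sim$underval + rnorm(n)

# Regress growth on the under-valuation index and log GDP.
fit <- lm(growth ~ underval + log(gdp), data = sim)

# Coefficient estimates with their standard errors.
coef_table <- coef(summary(fit))[, c("Estimate", "Std. Error")]
print(round(coef_table, 3))
```

The sign of each estimate, read together with its standard error, is what the two questions above ask you to interpret.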
2. Repeat the linear regression but add as covariates the country, and the year. Use
factor(year), not year, in the regression formula.
(a) Report the coefficients for log GDP and undervaluation, and their standard errors.
(b) Explain why it is more appropriate to use factor(year) in the formula than just
year.
(c) Plot the coefficients on year versus time.
(d) Does this expanded model support the idea of catching up? Of undervaluation
boosting growth?
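The difference between year and factor(year) can be seen in a small sketch on simulated data. This is not uval.csv: the period effects and the column names year and growth are invented for the illustration.

```r
# Simulated panel: six five-year periods whose effects are NOT linear in time.
set.seed(1)
years  <- seq(1980, 2005, by = 5)
effect <- c(0, 2, -1, 3, -2, 1)            # hypothetical period effects
df <- data.frame(year = rep(years, each = 50))
df$growth <- effect[match(df$year, years)] + rnorm(nrow(df), sd = 0.5)

fit_num <- lm(growth ~ year, data = df)          # one slope: a straight-line trend
fit_fac <- lm(growth ~ factor(year), data = df)  # one contrast per period (baseline 1980)

n_coef_num <- length(coef(fit_num))   # intercept + 1 slope
n_coef_fac <- length(coef(fit_fac))   # intercept + 5 period contrasts

# Period coefficients plotted against time, in the spirit of part (c).
year_coefs <- coef(fit_fac)[-1]
plot(years[-1], year_coefs, type = "b",
     xlab = "Year", ylab = "Coefficient relative to 1980")
```

In this simulated setting the numeric version can only describe a steady drift over time, while the factor version estimates a separate shift for each period.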
3. Does adding in year and country as covariates improve the predictive ability of a linear
model which includes log GDP and under-valuation?
(a) What are the R^2 and the adjusted R^2 of the two models?
(b) Use leave-one-out cross-validation to find the mean squared errors of the two
models. Which one actually predicts better, and by how much?
(c) Explain why using 5-fold cross-validation would be hard here.
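For a linear model fitted by least squares, leave-one-out cross-validation does not require refitting n times: the left-out residual for point i equals e_i / (1 - h_ii), where h_ii is the leverage. The sketch below checks this shortcut against a brute-force loop on simulated data. The variables x1, x2, g and y and the helper names loocv_mse and loocv_brute are hypothetical, and the brute-force helper assumes the response column is called y.

```r
# Simulated data with two numeric predictors and one five-level factor.
set.seed(2)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                 g  = factor(sample(letters[1:5], n, replace = TRUE)))
df$y <- 1 - 0.5 * df$x1 + 0.3 * df$x2 + rnorm(n)

# LOOCV mean squared error via the leverage identity e_i / (1 - h_ii).
loocv_mse <- function(fit) {
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

# Brute-force check: refit n times, leaving out one row each time.
# Assumes the response column is named y.
loocv_brute <- function(formula, data) {
  errs <- sapply(seq_len(nrow(data)), function(i) {
    fit <- lm(formula, data = data[-i, ])
    data$y[i] - predict(fit, newdata = data[i, ])
  })
  mean(errs^2)
}

fit_small <- lm(y ~ x1 + x2, data = df)
fit_big   <- lm(y ~ x1 + x2 + g, data = df)
print(c(small = loocv_mse(fit_small), big = loocv_mse(fit_big)))
```

The shortcut is exact only for least-squares linear fits; it does not carry over unchanged to other model classes.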
4. Kernel regression. Use kernel regression, as implemented in the np package, to nonparametrically
regress growth on log GDP, under-valuation, country, and year (treating
year as a categorical variable). Hint: read chapter four of Shalizi carefully. In particular,
try setting tol to about 10^-3 and ftol to about 10^-4 in the npreg command,
and allow several minutes for it to run.
(a) Give the coefficients of the kernel regression, or explain why you cannot.
(b) Plot the predicted values of the kernel regression, for each country and year,
against the predicted values of the linear model.
(c) Plot the residuals of the kernel regression against its predicted values. Should
these points be scattered around a flat line, if the model is right? Are they?
(d) The npreg function reports a cross-validated estimate of the mean squared error
for the model it fits. What is that? Does the kernel regression predict better or
worse than the linear model with the same variables?
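The general shape of an np workflow might look like the sketch below, run on a small simulated data set rather than uval.csv. The variables x, g and y are invented, and the code is skipped if the np package is not installed. It assumes that the fval component of the bandwidth object holds the least-squares cross-validation score at the selected bandwidths; check this against the np documentation and Shalizi's chapter before relying on it.

```r
# Simulated data: one continuous predictor and one categorical predictor (hypothetical).
set.seed(3)
n  <- 100
df <- data.frame(x = runif(n, 0, 3),
                 g = factor(sample(c("a", "b"), n, replace = TRUE)))
df$y <- sin(2 * df$x) + ifelse(df$g == "a", 0.5, -0.5) + rnorm(n, sd = 0.2)

if (requireNamespace("np", quietly = TRUE)) {
  library(np)
  # tol and ftol loosen the optimiser's convergence criteria to speed up the search.
  bw  <- npregbw(y ~ x + g, data = df, tol = 1e-3, ftol = 1e-4)
  fit <- npreg(bw)

  cv_mse    <- bw$fval              # assumed: cross-validation score at the chosen bandwidths
  fit_vals  <- fitted(fit)
  resid_val <- df$y - fit_vals      # residuals computed by hand from the fitted values

  plot(fit_vals, resid_val, xlab = "Fitted value", ylab = "Residual")
  abline(h = 0, lty = 2)
}
```

Because the factor g enters through a categorical kernel, there is one bandwidth per variable rather than one coefficient per variable.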
2 Kernel regression and varying smoothness
Starter code for this problem is in starter.R. That code will generate a data set to be used
for this problem, and will also provide a true mean function µ(x). The resulting data frame
has an x column (your predictor) and a y column (your response).
1. Plot y versus x. Overlay the true mean function µ(x) using the curve function in R.
What do you notice for x < 4π and x > 4π?
2. Using the np library in R, fit a kernel regression on each of the following datasets:
(a) Only those data points with x < 4π.
(b) Only those data points with x > 4π.
(c) All the data points.
For each of these regressions, what is the optimal bandwidth? How does the optimal
bandwidth for the overall data set compare to the optimal bandwidth for each of the
halves?
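To see how bandwidth selection behaves on subsets, the following sketch uses base R only, with a hand-written Gaussian-kernel Nadaraya-Watson smoother and a grid search over leave-one-out error. This is not the optimiser used by np. The mean function mu below is a hypothetical stand-in for the one in starter.R, and the helper names loo_mse and pick_h are invented.

```r
set.seed(5)
n  <- 300
x  <- runif(n, 0, 8 * pi)
# Hypothetical mean function: smooth on the left half, wiggly on the right half.
mu <- function(x) ifelse(x < 4 * pi, sin(x / 4), sin(4 * x))
y  <- mu(x) + rnorm(n, sd = 0.3)

# Leave-one-out MSE of a Gaussian-kernel Nadaraya-Watson fit at bandwidth h.
loo_mse <- function(h, x, y) {
  W <- dnorm(outer(x, x, "-") / h)
  diag(W) <- 0                          # exclude each point from its own prediction
  pred <- as.vector(W %*% y) / rowSums(W)
  mean((y - pred)^2)
}

# Pick the bandwidth on a grid that minimises the leave-one-out MSE.
pick_h <- function(x, y, grid = seq(0.05, 2, by = 0.05)) {
  grid[which.min(sapply(grid, loo_mse, x = x, y = y))]
}

left    <- x < 4 * pi
h_left  <- pick_h(x[left],  y[left])
h_right <- pick_h(x[!left], y[!left])
h_all   <- pick_h(x, y)
print(c(left = h_left, right = h_right, all = h_all))
```

A grid search is cruder than a numerical optimiser, but it makes the dependence of the selected bandwidth on the subset easy to inspect.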
3. For each of the three selected bandwidths, make a plot showing:
• The true mean µ(x).
• The data points.
• The kernel regression predictions, with the bandwidth specified to be the selected
bandwidth.
• The 95% confidence band for the regression curve µ using resampling of residuals.
• The 95% confidence band for the regression curve µ using resampling of cases.
The result should be three plots, each tuned to one of the selected bandwidths. Give
these plots clear titles to distinguish them.
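The two resampling schemes can be sketched as follows, again in base R with a hand-written Nadaraya-Watson smoother on simulated data (not the starter.R data). The bandwidth h is taken as given, the helper name nw is invented, and the bands are simple pointwise percentile bands; Shalizi's text describes other constructions, such as pivotal intervals, that you may prefer.

```r
set.seed(4)
n <- 150
x <- sort(runif(n, 0, 2 * pi))
y <- sin(x) + rnorm(n, sd = 0.3)       # simulated data with true mean sin(x)
h <- 0.3                               # bandwidth assumed to be already selected
grid <- seq(0.2, 2 * pi - 0.2, length.out = 50)

# Nadaraya-Watson estimate with a Gaussian kernel, evaluated at the points x0.
nw <- function(x0, x, y, h) {
  sapply(x0, function(z) {
    w <- dnorm((x - z) / h)
    sum(w * y) / sum(w)
  })
}

fit_grid <- nw(grid, x, y, h)
fit_obs  <- nw(x, x, y, h)
res      <- y - fit_obs

B <- 200
# Resampling residuals: keep x fixed, add resampled residuals to the fitted curve.
boot_resid <- replicate(B, {
  y_star <- fit_obs + sample(res, n, replace = TRUE)
  nw(grid, x, y_star, h)
})
# Resampling cases: resample (x, y) pairs together.
boot_cases <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)
  nw(grid, x[idx], y[idx], h)
})

# Pointwise 95% percentile bands (rows: 2.5% and 97.5%; columns: grid points).
band_resid <- apply(boot_resid, 1, quantile, probs = c(0.025, 0.975))
band_cases <- apply(boot_cases, 1, quantile, probs = c(0.025, 0.975))

plot(x, y, col = "grey", main = sprintf("Bandwidth h = %.2f", h))
curve(sin(x), add = TRUE, lwd = 2)                    # true mean
lines(grid, fit_grid, col = "blue")                   # kernel regression fit
matlines(grid, t(band_resid), col = "red", lty = 2)   # residual-resampling band
matlines(grid, t(band_cases), col = "darkgreen", lty = 3)  # case-resampling band
```

Both schemes hold the bandwidth fixed across bootstrap replicates here, which matches the requirement that each plot be tuned to one selected bandwidth.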
4. How do these three plots differ? In particular, how well do the regressions trained on
the left and right halves do on each half of the data set? How well does the bandwidth
selected on the overall data set do on each half? (Be specific about the types of problems
that occur.) What lesson, if any, does this suggest about kernel regression and functions
of varying smoothness?
3 Theoretical questions
1. Exercise 1.2 in Shalizi
2. Exercise 1.4 in Shalizi
3. Exercise 7.4 in ESL