# Regression Assignment, Spring 2020

Dr. Satish Nargundkar

Part A: Regression Interpretation

1. A zoologist builds a regression model to predict the height (in feet) at which birds of a certain type build a nest on oak trees in a forest, based on the type of oak tree that the nest is on and the amount of rainfall (in inches) in the month before nest building starts. There are three types of oak tree the zoologist is interested in: Bur Oak, Pin Oak, and White Oak. She creates dummies for Pin Oak and White Oak. Regression coefficients are shown below. Assume all are significant.

R-Square = 0.75

| Term | Coefficient |
| --- | --- |
| Intercept | 50 |
| Rainfall (in) | -1.5 |
| Pin Oak | -7 |
| White Oak | 9 |

a. Interpret the R-Square value of 0.75 in the context of this problem.

b. What is the predicted height of the nest on a Pin Oak if there were 4 inches of rain?

c. Interpret the coefficient -7.0 for Pin Oak.

d. According to the model, on what type of oak tree would the nests be at the lowest height, assuming rainfall is held constant?

e. If Pin Oak were made the baseline and Bur Oak were a dummy variable, how would the intercept and coefficients change, all else being the same?

New Intercept = ______

Coefficient for Bur Oak = _______

Coefficient for White Oak = _______
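The mechanics behind parts (b) through (e) can be sketched in a few lines of Python. This is only an illustration, assuming the coefficients from the table above; the function names `predict_height` and `rebaseline` are made up for this sketch, and the usage lines deliberately use a case the questions do not ask about (White Oak, 2 inches of rain).

```python
# Coefficients copied from the Question 1 table (Bur Oak is the baseline,
# so its dummy contribution is 0).
INTERCEPT = 50.0
RAINFALL = -1.5
DUMMIES = {"Bur Oak": 0.0, "Pin Oak": -7.0, "White Oak": 9.0}


def predict_height(tree, rainfall_in):
    """Predicted nest height (feet) for a tree type and rainfall (inches)."""
    return INTERCEPT + RAINFALL * rainfall_in + DUMMIES[tree]


def rebaseline(new_base):
    """Re-express the same model with a different tree type as the baseline.

    The new baseline's effect is absorbed into the intercept, and every
    dummy coefficient is measured relative to the new baseline instead.
    """
    shift = DUMMIES[new_base]
    new_intercept = INTERCEPT + shift
    new_dummies = {t: c - shift for t, c in DUMMIES.items() if t != new_base}
    return new_intercept, new_dummies


# Illustration with White Oak (hypothetical inputs, not one of the questions).
print(predict_height("White Oak", 2.0))
print(rebaseline("White Oak"))
```

Changing the baseline does not change any prediction; it only changes which tree type the other coefficients are compared against.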

2. Consider the partial data from a regression output below. The regression has 3 independent variables. Fill in the shaded blanks in the table.

| Statistic | Value |
| --- | --- |
| R-squared | 0.75 |
| Standard Error | 10 |
| Observations | ____ |

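| | df | SS | MS | F | Sig-F |
| --- | --- | --- | --- | --- | --- |
| Regression | ____ | ____ | ____ | ____ | 1.07E-07 |
| Residual (Error) | ____ | ____ | ____ | | |
| Total | 28 | 10,000 | | | |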
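The blanks in an ANOVA table are linked by a handful of identities (R-squared = SS regression / SS total, MS = SS / df, F = MS regression / MS residual, and standard error = square root of MS residual). A minimal sketch of those relationships follows; the function name `anova_table` and the numbers in the usage line are hypothetical and are not the values from this question.

```python
import math


def anova_table(r_squared, ss_total, df_total, n_predictors):
    """Derive the remaining ANOVA entries from R-squared, total SS, and total df."""
    df_reg = n_predictors                 # one df per independent variable
    df_res = df_total - df_reg
    ss_reg = r_squared * ss_total         # R^2 = SS_reg / SS_total
    ss_res = ss_total - ss_reg
    ms_reg = ss_reg / df_reg
    ms_res = ss_res / df_res
    return {
        "observations": df_total + 1,     # total df = n - 1
        "df_regression": df_reg,
        "df_residual": df_res,
        "ss_regression": ss_reg,
        "ss_residual": ss_res,
        "ms_regression": ms_reg,
        "ms_residual": ms_res,
        "f": ms_reg / ms_res,
        "standard_error": math.sqrt(ms_res),
    }


# Hypothetical inputs chosen only to show the arithmetic.
for name, value in anova_table(0.5, 400.0, 10, 2).items():
    print(name, value)
```

Sig-F is the upper-tail probability of the F statistic with the regression and residual degrees of freedom; it is left out here because computing it needs an F-distribution routine outside the standard library.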

3. What are the null and alternative hypotheses in a simple regression? State them in plain language and mathematically.

4. You have a categorical variable in your data about undergraduate students that identifies their enrollment status as Freshman, Sophomore, Junior, and Senior. Consider the values for the first 6 records in your sample shown below in the table. Fill in as many of the blank columns below as needed (and no more) to code the enrollment Status variable as dummy variables to use in regression. Label the columns and fill in the numeric values for each dummy variable.

| Obs | Status |
| --- | --- |
| 1 | Freshman |
| 2 | Senior |
| 3 | Senior |
| 4 | Sophomore |
| 5 | Junior |
| 6 | Sophomore |
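The coding asked for here follows the same pattern as the Pin Oak and White Oak dummies in Question 1: a variable with k categories needs k - 1 dummy columns, with the omitted category acting as the baseline. A small sketch of that pattern, using the oak types rather than the Status values; the helper `dummy_code` and the four sample observations are hypothetical.

```python
def dummy_code(values, levels, baseline):
    """Return one 0/1 column per non-baseline level (k levels -> k - 1 dummies)."""
    kept = [lv for lv in levels if lv != baseline]
    return {lv: [1 if v == lv else 0 for v in values] for lv in kept}


levels = ["Bur Oak", "Pin Oak", "White Oak"]

# Hypothetical observations, made up for illustration.
trees = ["Pin Oak", "Bur Oak", "White Oak", "Pin Oak"]

# Bur Oak is the baseline, as in Question 1, so it gets no column of its own.
dummies = dummy_code(trees, levels, baseline="Bur Oak")
for name, column in dummies.items():
    print(name, column)
```

A baseline row is identified by having 0 in every dummy column, which is why a separate column for it would be redundant.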

5. What are the assumptions of Linear Regression? For each one, write what one can do if the assumption is violated.

Part B: Working with the Data

Download the BDNF Dataset [You may work either in Excel or SPSS].

A researcher is trying to determine whether exercise affects different people differently. Specifically, when people exercise, there is an increase in a protein called BDNF (brain-derived neurotrophic factor). Is this Increase in BDNF different between males and females, and is it different across ethnicities?

The dataset provided shows data on Increase in BDNF in the body due to Exercise, for people in the age range of 18-25, along with their Gender and Ethnicity.

1. Create a Histogram of the variable Exercise to look at its distribution. Compute the Mean and Standard Deviation of the variable as well.

2. Create a Pivot table in Excel to analyze the Increase in BDNF by Ethnicity and Gender simultaneously. The table should look roughly as follows, with the Average, Standard Deviation, and the Count values for Increase in BDNF in each of the cells:

| Row Labels / Column Labels | Female | Male | Grand Total |
| --- | --- | --- | --- |
| African | | | |
| Asian | | | |
| Caucasian | | | |
| Grand Total | | | |

Interpret the table – what can you say about the impact of Gender and Ethnicity overall? Does there seem to be an interaction between Gender and Ethnicity? In other words, is the impact of Gender on Increase in BDNF different for different Ethnicities?
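For readers who want to cross-check the pivot table outside Excel, the same cell statistics can be sketched in Python. The rows below are invented for illustration and are not taken from the BDNF dataset; `pivot_summary` and `gender_gap` are hypothetical helper names.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical rows: (ethnicity, gender, increase in BDNF). Not real data.
rows = [
    ("African", "Female", 10.0), ("African", "Female", 14.0),
    ("African", "Male", 20.0), ("African", "Male", 22.0),
    ("Asian", "Female", 8.0), ("Asian", "Female", 12.0),
    ("Asian", "Male", 9.0), ("Asian", "Male", 13.0),
]


def pivot_summary(rows):
    """Average, sample standard deviation, and count for each (ethnicity, gender) cell."""
    groups = defaultdict(list)
    for ethnicity, gender, value in rows:
        groups[(ethnicity, gender)].append(value)
    # stdev needs at least two values per cell.
    return {key: {"mean": mean(vals), "sd": stdev(vals), "count": len(vals)}
            for key, vals in groups.items()}


def gender_gap(summary):
    """Male minus Female mean within each ethnicity."""
    ethnicities = sorted({eth for eth, _ in summary})
    return {eth: summary[(eth, "Male")]["mean"] - summary[(eth, "Female")]["mean"]
            for eth in ethnicities}


summary = pivot_summary(rows)
gaps = gender_gap(summary)
print(gaps)
```

If the Male minus Female gap is about the same in every ethnicity, the cell means give no sign of an interaction; clearly different gaps, as in these made-up rows, are what an interaction looks like in the table.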

3. Create a scatterplot of Increase in BDNF and Exercise. Make sure Increase in BDNF is on the Y axis. Interpret.

4. Perform a multiple regression analysis using the dataset provided (use all the Xs available for the regression), to predict Increase in BDNF in the body. For Gender, code Male as 1 and Female as 0 (baseline). For Ethnicity, create two dummies, one for Asian and one for African (keep Caucasian as the baseline).

Note that in Excel, all the Xs that you will use must be in contiguous columns.
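The coding scheme in step 4 can also be sketched outside Excel or SPSS. The example below is a rough illustration only: the records and the "true" coefficients are invented so that the fit can be checked, the helper names `encode` and `ols` are hypothetical, and the solver is a plain normal-equations routine rather than what Excel or SPSS runs internally.

```python
def encode(exercise_min, gender, ethnicity):
    """Design row: intercept, Exercise, Male, Asian, African.

    Female and Caucasian are the baselines, so they are coded as all zeros.
    """
    return [
        1.0,
        float(exercise_min),
        1.0 if gender == "Male" else 0.0,
        1.0 if ethnicity == "Asian" else 0.0,
        1.0 if ethnicity == "African" else 0.0,
    ]


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical records (Exercise minutes, Gender, Ethnicity); not the BDNF data.
records = [
    (10, "Female", "Caucasian"), (20, "Male", "Caucasian"),
    (30, "Female", "Asian"), (15, "Male", "Asian"),
    (25, "Female", "African"), (35, "Male", "African"),
    (40, "Female", "Caucasian"), (5, "Male", "Asian"),
]
# Made-up coefficients used to generate a noise-free response for checking.
true_b = [5.0, 0.5, 2.0, -1.0, 3.0]

X = [encode(*rec) for rec in records]
y = [sum(b * x for b, x in zip(true_b, row)) for row in X]
coefficients = ols(X, y)
print([round(b, 6) for b in coefficients])
```

Because the response here is generated without noise, the fitted coefficients should match the made-up ones up to rounding; with real data they would not, and the significance tests in the questions that follow become relevant.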

5. Is the regression significant overall? What does that mean?

6. Is each of the variables in the model significant at the 5% level? Interpret.

7. What is the R-squared value? What does it mean?

Now run the regression again after removing the variables that are not significant at the 5% level.

(Remember that you may have to move data in the worksheet so that the variables left in the regression are all in contiguous columns.)

8. What is the final equation to predict BDNF increase? What would your model predict as the increase in BDNF for a person who exercises for 20 minutes and is a Caucasian female?

9. Interpret the coefficients for Asian and African.

10. If you want to improve the R-squared value of the model, is increasing the sample size the best way? Why or why not?

Part C: Your group research project

As you complete this assignment, remember to think about how this can help you with your group research project idea that you need to present in April. You need a research idea, a brief literature review, a model you wish to test, and a methods section that says how you plan to collect the data and analyze it quantitatively. You can use the same idea that you use for the qualitative methods class if you wish. However, it must be reasonably modified in approach so you can discuss how the same question may be answered with quantitative analysis.