Module 4: Correlation and Regression
Introduction
In the previous modules, you have learned to create and test hypotheses for
relationships where both variables are categorical (chi-square) or where the dependent
variable is interval level and the independent is categorical (t-test). The last relationship
that you need to learn to describe involves two interval level variables. There are two
very closely related methods that allow researchers to study these relationships:
correlation and regression.
Correlation is a test that produces a measure of association, Pearson’s correlation
coefficient (Pearson’s r), that tells you both the direction and strength of a relationship
between two interval level variables. Pearson’s r ranges from –1 to +1, so the sign on the
coefficient tells you the direction of the variables’ relationship; a positive coefficient (0 to
+1) signifies a positive relationship while a negative coefficient (-1 to 0) signifies a
negative relationship. A positive relationship means that as one variable increases, the
other increases too. A negative (or inverse) relationship means that as one variable
increases, the other decreases. The relationship gets stronger as the coefficient gets
nearer to one of the “poles” (-1 or +1) and weaker as the coefficient approaches 0.
Therefore, a Pearson’s r of +1 signifies a perfect positive relationship and –1 signifies a
perfect negative relationship while a coefficient of 0 means there is no relationship
between the variables. It is important to note that a strong correlation between two
variables does not mean that the relationship is causal; that is, a change in one variable
does not necessarily cause a change in the other. For this reason, researchers usually do
not identify one variable as independent and the other as dependent in a correlation analysis.
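Although this module performs its analyses through Stata's menus, Pearson's r is also
available from the command line with the correlate command. As a preview, a minimal
sketch, assuming the STATES.dta dataset used later in this module has been opened:

    . correlate voted96 blkpct

The output is a small correlation matrix; the off-diagonal entry is Pearson's r for the
two variables.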
Regression analysis differs from correlation in that it does distinguish one
variable as independent and the other as dependent. Regression analysis produces a
regression coefficient that measures the average change in the dependent variable for
every one-unit change in the independent variable. If the variables were plotted on a
scatter plot, with the independent variable on the x-axis and the dependent variable on
the y-axis, the regression coefficient would be the slope of the line that best models the
data points. The equation for the line of best fit (also called a regression line) is:
Y = a + b(X)
In this equation, Y is the value of the dependent variable, X is the value of the
independent variable, and b is the value of the regression coefficient. Also, a signifies the
Y-intercept of the line; this is the value the dependent variable (Y) would take if the
independent variable (X) were equal to 0.
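For example (with made-up numbers, purely for illustration): if a = 50 and b = -0.3, then a
case with X = 20 would be predicted to have

    Y = 50 + (-0.3)(20) = 50 - 6 = 44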
Scatter Plots
The ideas of correlation and regression can be hard to grasp without visualizing them.
Let's consider an example using STATES.dta. This dataset contains "voted96," a variable
that records the percentage of each state's residents who voted in 1996. This will serve as
our dependent variable and will be plotted on the y-axis. The independent variable (x-axis)
will be "blkpct," which is the percentage of each state's residents who are black.

To create a scatter plot in Stata 8, go to Graphics>Easy graphs>Scatter plot. Enter the
appropriate variables in the "X variable" and "Y variable" fields (Figure 4.1) and click OK.
After looking at the output (Figure 4.2), you should be able to see a negative relationship
in the data. As "blkpct" increases, "voted96" decreases.

[Figure 4.1: the Scatter plot dialog]
[Figure 4.2: scatter plot of percent voting in 1996 (y-axis) against percent black (x-axis)]
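The same graph can also be produced from the Stata command line; a minimal sketch,
assuming STATES.dta is already open:

    . scatter voted96 blkpct

Here scatter is shorthand for the twoway scatter command, with the y-axis variable listed
first.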

β = b + (standard error)

This says that the true coefficient (β) is equal to the sample coefficient (b) plus the
amount of error associated with b (sampling error, etc.). The test then measures how
accurately b mirrors β, and whether or not β is statistically significant. The null
hypothesis for all regression analyses is that β = 0. In words, the null hypothesis claims
that the values of the dependent variable will not change as the value of the independent
variable increases; in other words, the true regression line has no slope. A regression
analysis then tests this hypothesis (it uses a t-distribution, much like a t-test) and
produces a P-value for the coefficient. The P-value tells us the percentage of the time
that we would observe a b as large as we did if the true value of β really were 0. If the
P-value is .05 or less, we can reject the null hypothesis and accept the alternative. The
alternative hypothesis is that β is not equal to 0: the regression line does have a slope,
signifying either a positive or negative relationship.

In our example, the alternative hypothesis is that β < 0; that is, the line has a negative
slope. In words, the hypothesis says that as the percentage of black residents in a state
increases, the percentage of residents who voted in 1996 will decrease. The null hypothesis
is that β = 0, or the line has no slope, meaning that the percentage of black residents has
no effect on voter turnout in a state.

Regression Analysis

To test these hypotheses, we need to run a regression analysis and interpret the
coefficients. To run this procedure, go to Statistics>Linear regression and related>Linear
regression. Enter the independent and dependent variables (Figure 4.5).

[Figure 4.5: the Linear regression dialog]
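The command-line equivalent is the regress command, with the dependent variable listed
first; a minimal sketch, again assuming STATES.dta is open:

    . regress voted96 blkpct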

The output for a regression analysis (Figure 4.6) is very detailed and complex.

[Figure 4.6: regression output for voted96 on blkpct]

The values we are interested in are found in the bottom table. The tables for all
regressions are set up the same way. The dependent variable is always named in the top row
of the table. The table contains a row for each independent variable (we only have one in
this example) and a row for the constant term of the equation (labeled "_cons"). In the
rows, we find the coefficient, standard error, t-score (regression is based on comparison
of means and uses the t-distribution, but you shouldn't concern yourself with the actual
calculation methods), P-value, and confidence interval for each piece of the equation.

From our output, we know to reject the null hypothesis. The coefficient on "blkpct" (the b
in our hypothesis) is less than 0 (-.3358) and its P-value is 0.000, so we have strong
evidence that the negative relationship isn't due to "chance." The constant term of this
analysis (54.9717) is also statistically significant (its P-value is 0.000).

These coefficients reflect our sample, but when the analysis is extended to the population,
they could vary. The confidence intervals give us a range in which the true population
coefficient would fall 95 out of 100 times. So, if we did this analysis 100 times with
different samples, 95% of the intervals would contain the true population coefficient of
"blkpct" and 95% would contain the true population constant coefficient. There is a 95%
chance that the true population coefficient on "blkpct" falls between -0.5159 and -0.1557,
and a 95% chance that the true population constant coefficient falls between 52.4683 and
57.4752. Thus, we can be confident that the true coefficients fall in those ranges.

The coefficients in the output can be inserted into the equation Y = a + b(X), where Y is
the value of the dependent variable, X is the value of the independent variable, a is the
constant term, and b is the value of the coefficient on the independent variable. Thus, our
model is:

"voted96" = 54.9717 - 0.3358*("blkpct")
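The model can now be used as a prediction formula. A quick check using Stata's display
command as a calculator (the value 20 here is just an example "blkpct"):

    . display 54.9717 - 0.3358*20
    48.2557

So for a state that is 20% black, the model predicts voter turnout of about 48.26%; this
prediction reappears in the R-square discussion below.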

The resulting graph (Figure 4.8) displays the scatter plot and the regression line (along
with a confidence interval for the line, but we won't concern ourselves with that). It is
easy to see how the data follow the general relationship the regression line implies.

[Figure 4.8: scatter plot of voted96 against blkpct with the fitted regression line]

Strength of a Relationship

The regression coefficient for the independent variable gives us the average change in the
dependent variable as we increase one unit on the independent variable. In this example, as
we increase 1 percentage point on "blkpct," "voted96" decreases by .3358. However, since
the coefficient measures the average change, most of the data do not fall exactly on the
line. Some of the data fall above the line, and some fall below it. Therefore, we need to
determine how completely the regression line models the data. That is, how closely do the
data fit the regression line? Using some statistical equations that we do not need to
concern ourselves with, regression analysis answers this question with the R² (R-square)
statistic. This statistic expresses the extent to which knowing the value of the
independent variable improves our prediction of the dependent variable. R² ranges from 0
(the data are completely random and do not resemble a line) to 1 (the data form a perfect
line) and can be expressed as a percentage. Thus, if R² were .5, we could say that knowing
the independent variable improves our prediction by 50% over trying to predict the
dependent variable without knowing the independent variable.

The R-square value is given in the upper right of the regression output. The R-square for
our example was 0.2265. By knowing the value of "blkpct," we can improve our prediction of
"voted96" by 22.65%. R-square is simply the correlation coefficient (Pearson's r) squared.
However, Pearson's r is not expressed as a percentage and does not directly measure our
ability to predict, while R-square does.

The theory behind improving predictions is as follows: If we didn't know the value of the
independent variable, the best prediction for the dependent variable would be its mean. If
we know the independent variable, we can plug it into the equation and get a prediction.
R-square is then figured by comparing the two predictions and determining how much better
the prediction is when we know the independent variable. Let's look at our prediction of
"voted96" for a state whose "blkpct" is 20. Earlier we determined this prediction to be
48.2557% voter turnout for that state. If we didn't know the state was 20% black, our best
prediction of "voted96" would be its mean. Thus, our prediction would be 51.5576% voter
turnout for that state (51.5576 is the mean of "voted96" from the variable's summary
statistics; see Module 1). The R-square is telling us that 48.2557 is a 22.65% better
prediction than 51.5576.
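Because R-square is simply Pearson's r squared, you can check this relationship yourself.
A minimal sketch, assuming STATES.dta is open; from the output above we would expect r to
be about -0.476 (since (-0.476)² ≈ 0.2265) and the mean of "voted96" to be 51.5576:

    . summarize voted96
    . correlate voted96 blkpct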

EXERCISES FOR MODULE 4

1. When performing correlation or regression, the dependent variable must be measured on
   the _____________ level.

2. The regression coefficient tells us the average change in the dependent variable for
   every ____-________ change in the ______________.

1. Fill in the following correlation matrix. Use World.dta. (A command-line sketch for
   these exercises appears after the list.)

               wealth10     hindu    legdom
     wealth10    1.
     hindu
     legdom

2. Using States.dta, perform a regression analysis with "womleg" as your dependent
   variable and "blkpct" as the independent variable. Your alternative hypothesis is that
   β < 0, while the null hypothesis is that β = 0.
   a. Report the coefficients, P-values, and confidence intervals for the constant and
      the independent variable.
   b. What conclusion should you reach on the hypotheses?
   c. Give the equation for the regression line in this analysis. If a state is 15% black,
      what would be your prediction for its percentage of "womleg"?
   d. Make a scatter plot of the data that includes the regression line. Place your name
      in the graph's title.
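If you prefer typing commands to using the menus, one possible command-line sketch for
these exercises is below ("Your Name" is a placeholder for your own name):

    . use World.dta, clear
    . correlate wealth10 hindu legdom

    . use States.dta, clear
    . regress womleg blkpct
    . twoway (scatter womleg blkpct) (lfit womleg blkpct), title("Your Name")

Here lfit overlays the linear fit on the scatter plot, producing a graph like Figure 4.8.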