Chapter 23 Lecture Note

Authors: Barry J. Babin, Jon C. Carr, Mitch Griffin, William G. Zikmund

Chapter 23
Bivariate Statistical Analysis:
Measures of Association
AT-A-GLANCE
I. The Basics
II. Simple Correlation Coefficient
A. An example
B. Correlation, covariance and causation
C. Coefficient of determination
D. Correlation matrix
III. Regression Analysis
A. The regression equation
B. Parameter estimate choices
Raw regression estimates (b1)
Standardized regression estimates (β)
C. Visual estimation of a simple regression model
Errors in prediction
D. Ordinary least-squares (OLS) method of regression analysis
Statistical significance of regression model
R2
Interpreting regression output
Plotting the OLS regression line
Simple regression and hypothesis testing
IV. Appendix 23A: Arithmetic Behind OLS
LEARNING OUTCOMES
1. Apply and interpret simple bivariate correlations
2. Interpret a correlation matrix
3. Understand simple (bivariate) regression
4. Understand the least-squares estimation technique
5. Interpret regression output including the tests of hypotheses tied to specific parameter coefficients
CHAPTER VIGNETTE: Bringing Your Work to Your Home (and
Bringing Your Home to Work)
Our understanding of the work and family interface has changed substantially in recent years. The idea
that work roles and family roles could be at odds with one another is nowadays referred to as work-family
conflict (WFC)—conflict that results when the demands and responsibilities of one role “spill over” into
the other role. Researchers have begun to examine and explore the many different work and family
characteristics (i.e., independent variables) that can predict WFC (a dependent variable), with the goal of
providing insights into the causes and consequences of this phenomenon.
SURVEY THIS!
Based on the variables list, do the following:
1. Choose 3 variables (independent variables) that you think would predict satisfaction
(dependent variable).
2. Conduct a bivariate correlation analysis for all of your selected variables—do they show the
correct sign? Are they significantly related?
3. Using those same independent and dependent variables, conduct a simple regression analysis.
What do you find?
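For those working through steps 2 and 3 in software, a minimal Python sketch is shown below; the data frame, variable names, and values are invented placeholders, and the actual survey variables would be substituted.

```python
# A sketch (not from the survey) of running a bivariate correlation and a
# simple regression for each chosen independent variable; all names and
# numbers below are hypothetical placeholders.
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "work_hours":   [35, 40, 45, 50, 55, 60, 38, 48],   # hypothetical IVs
    "flexibility":  [6, 5, 4, 3, 2, 1, 5, 3],
    "support":      [5, 6, 4, 3, 3, 2, 6, 4],
    "satisfaction": [6, 6, 5, 4, 3, 2, 6, 4],            # hypothetical DV
})

dependent = "satisfaction"
for iv in ["work_hours", "flexibility", "support"]:
    r, p = stats.pearsonr(df[iv], df[dependent])     # step 2: correlation and its p-value
    res = stats.linregress(df[iv], df[dependent])    # step 3: simple regression
    print(f"{iv}: r = {r:.3f} (p = {p:.3f}); slope = {res.slope:.3f} (p = {res.pvalue:.3f})")
```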
RESEARCH SNAPSHOTS
What Makes Attractiveness?
What are the things that make someone attractive? Companies that hire people to sell fashion are
interested in this question, and a correlation matrix is given that shows how different
characteristics relate to each other. Variables include a measure of fit (i.e., how well the person
matches a fashion retail concept), attractiveness, weight, age, manner of dress, and hairstyle. A
sample of consumers rated a model shown in a photograph on those characteristics. The results
suggest that if the model seems to “fit” the store concept, she seems attractive. If she is too big,
she is less attractive. Age is unrelated to attractiveness or fit, and modernness and coldness are
associated with lower attractiveness. The steps for using SPSS to find correlations are given.
Size and Weight
America seems obsessed with weight control. The previous research snapshot gave correlations
between factors related to attractiveness. What if the following hypothesis were tested: H1:
Perceptions that a female model is overweight are related negatively to perceptions of
attractiveness. This can be tested with a simple regression, and the results support the hypothesis.
The β = -.275 is both in the expected direction (negative) and significant (p < .05). Therefore, a
person perceived as “too fat” is seen as less attractive.
OUTLINE
I. THE BASICS
The mathematical symbol X is commonly used for an independent variable, and Y typically
denotes a dependent variable.
The chi-square (χ²) test provides information about whether two or more less-than-interval
variables are interrelated.
Measurement characteristics influence which measure of association is most appropriate
(see Exhibit 23.1).
II. SIMPLE CORRELATION COEFFICIENT
The most popular technique for indicating the relationship of one variable to another is
correlation.
A correlation coefficient is a statistical measure of the covariation or association between
two variables.
Covariance is the extent to which a change in one variable corresponds systematically to a
change in another; the correlation coefficient can be thought of as a standardized covariance.
When correlations estimate relationships between continuous variables, the Pearson
product-moment correlation is appropriate.
The correlation coefficient, r, ranges from +1.0 to -1.0.
If the value of r equals +1.0, a perfect positive relationship exists.
If the value of r equals -1.0, a perfect negative relationship exists.
No correlation is indicated if r equals 0.
A correlation coefficient indicates both the magnitude of the linear relationship and the
direction of that relationship.
The formula for calculating the correlation coefficient for two variables X and Y is as
follows:
rxy = ryx = Σ(Xi − X̄)(Yi − Ȳ) / √[Σ(Xi − X̄)² Σ(Yi − Ȳ)²]
where the symbols X̄ and Ȳ represent the sample averages of X and Y,
respectively.
An alternative way to express the correlation formula is:
rxy = ryx = σxy / √(σx² σy²)
where
σx² = variance of X
σy² = variance of Y
σxy = covariance of X and Y
with
σxy = Σ(Xi − X̄)(Yi − Ȳ) / N
If associated values Xi and Yi differ from their means in the same direction, their covariance
will be positive; covariance will be negative if the values of Xi and Yi have a tendency to
deviate in opposite directions.
The Pearson correlation coefficient is a standardized measure of covariance, and researchers
find it useful because they can compare two correlations without regard for the amount of
variance exhibited by each variable separately.
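Although the text computes correlation by hand, a short sketch can make the deviation-score formula concrete. The example below uses invented numbers (not the chapter's data) and checks the hand calculation against NumPy's built-in routine.

```python
# A minimal sketch (numbers invented) computing the correlation coefficient
# directly from the deviation-score formula, then verifying it with NumPy.
import numpy as np

x = np.array([40.6, 40.7, 40.4, 39.8, 39.2, 38.9])  # hypothetical X values
y = np.array([3.1, 3.5, 4.4, 5.5, 6.7, 7.0])        # hypothetical Y values

dev_x = x - x.mean()   # (Xi - Xbar)
dev_y = y - y.mean()   # (Yi - Ybar)

# r = sum of cross-products of deviations divided by the square root of the
# product of the summed squared deviations.
r = (dev_x * dev_y).sum() / np.sqrt((dev_x ** 2).sum() * (dev_y ** 2).sum())

print(round(r, 3))
print(round(np.corrcoef(x, y)[0, 1], 3))   # built-in result matches the hand calculation
```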
An Example
While researchers do not need to calculate correlation manually, the calculation process
helps illustrate exactly what is meant by correlation and covariance.
Consider an investigation made to determine whether the average number of hours
worked in manufacturing industries is related to unemployment.
Exhibit 23.3 shows the correlation between the two variables is -.635, indicating an
inverse (negative) relationship (i.e., when number of hours goes up, unemployment
comes down).
Correlation, Covariance and Causation
Recall that concomitant variation is one condition needed to establish a causal
relationship between two variables.
When two variables covary, they display concomitant variation.
This systematic covariation does not in and of itself establish causality; the relationship
would also need to be nonspurious, and any hypothesized “cause” would have to
occur before any subsequent effect.
Coefficient of Determination
If we wish to know the proportion of variance in Y explained by X (or vice versa), we
can calculate the coefficient of determination (R2) by squaring the correlation
coefficient:
R² = Explained variance / Total variance
The coefficient of determination, R2, measures that part of the total variance of Y that
is accounted for by knowing the value of X.
R-squared really is just r squared.
Correlation Matrix
A correlation matrix is the standard form of reporting observed correlations among
multiple variables.
Each entry represents the bivariate relationship between a pair of variables.
Exhibit 23.4 shows a correlation matrix.
Note that the main diagonal consists of correlations of 1.00, which will always be the
case when a variable is correlated with itself.
Had this been a covariance matrix, the diagonal would display the variance for any
given variable.
The procedure for determining statistical significance is the t-test of the significance
of a correlation coefficient.
Typically it is hypothesized that r = 0, and then a t-test is performed.
Statistical programs usually indicate the p-value associated with each correlation
and/or star significant correlations using asterisks.
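As a rough illustration of how such a matrix and its significance tests might be produced outside SPSS, the sketch below uses Python with invented variable names and ratings.

```python
# A hedged sketch of a correlation matrix with significance tests; the
# variables echo the attractiveness snapshot but the ratings are invented.
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "fit":            [5, 4, 6, 7, 3, 5, 6, 4],
    "attractiveness": [6, 4, 6, 7, 2, 5, 7, 3],
    "weight":         [3, 5, 2, 1, 6, 4, 2, 5],
})

print(df.corr().round(3))   # correlation matrix; the main diagonal is always 1.00

# Test each correlation against the hypothesis that it equals zero,
# starring those significant at the .05 level.
cols = list(df.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        r, p = stats.pearsonr(df[a], df[b])
        flag = "*" if p < 0.05 else ""
        print(f"{a} vs {b}: r = {r:.3f}, p = {p:.3f}{flag}")
```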
III. REGRESSION ANALYSIS
Regression is another technique for measuring the linear association between a dependent and
an independent variable.
Although simple regression and correlation are mathematically equivalent in most respects,
regression is a dependence technique whereas correlation is an interdependence technique.
A dependence technique makes a distinction between dependent and independent
variables.
An interdependence technique does not make this distinction and simply is concerned
with how variables relate to one another.
Simple regression links a dependent (or criterion) variable, Y, to an independent (or predictor)
variable, X.
Regression analysis attempts to predict the values of a continuous, interval-scaled dependent
variable from the specific values of the independent variable.
The Regression Equation
Simple (bivariate) linear regression investigates a straight-line relationship of the type:
Y = α + βX
where Y is a continuous dependent variable, X is an independent variable that is usually
continuous, although dichotomous nominal or ordinal variables can be included in the
form of a dummy variable.
Alpha (α) and beta (β) are two parameters that must be estimated so that the equation
best represents a given set of data.
Together they determine the height of the regression line and the angle of the line relative to
horizontal.
Regression techniques have the job of estimating values for these parameters that
make the line fit the observations the best.
α represents the Y intercept (where the line crosses the y-axis).
β is the slope coefficient.
Parameter Estimate Choices
The estimates for α and β are the key to regression analysis.
In most business research, the estimate of β is most important, because the explanatory
power of regression rests with this coefficient: it captures the direction and
strength of the relationship between the independent and dependent variable.
The Y-intercept term is sometimes referred to as a constant because α represents a fixed
point.
An estimated slope coefficient (β) is sometimes referred to as a regression weight,
regression coefficient, parameter estimate, or sometimes even as a path estimate because
of the way hypothesized causal relationships are often represented in diagrams.
These terms are used interchangeably.
Parameter estimates can be presented in either raw or standardized form.
A potential problem with raw parameter estimates is that they reflect
the measurement scale range.
A standardized regression coefficient (β) provides a common metric allowing
regression results to be compared to one another no matter what the original scale
range may have been.
The standardized y-intercept term is always 0.
The most common short-hand is as follows:
B0 or b0 = raw (unstandardized) y-intercept term (what is referred to as α above).
B1 or b1 = raw regression coefficient or estimate.
β1 = standardized regression coefficient.
Raw Regression Estimates (b1)
Have the advantage of retaining the scale metric, which is also their key
disadvantage.
Should the standardized or unstandardized coefficients be interpreted?
If the purpose of the regression analysis is forecasting, then raw parameter
estimates must be used; that is, when the researcher is interested only in
prediction.
Standardized Regression Estimates (β)
Have the advantage of a constant scale.
When should standardized regression estimates be used?
When the researcher is testing explanatory hypotheses; that is, when the
purpose of the research is more explanation than prediction.
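To make the raw-versus-standardized distinction concrete, the sketch below (invented data) computes both forms of the slope for a simple regression; the variable names and values are hypothetical.

```python
# A sketch (data invented) contrasting the raw slope b1 with the standardized
# coefficient; in simple regression the standardized beta equals Pearson's r.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])   # note the wide scale range
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # raw slope, in Y-units per X-unit
b0 = y.mean() - b1 * x.mean()                          # raw intercept

beta = b1 * np.std(x, ddof=1) / np.std(y, ddof=1)      # standardized slope (unit-free)

print(round(b1, 4), round(b0, 3))         # raw estimates, useful for forecasting
print(round(beta, 3))                     # standardized estimate, useful for explanation
print(round(np.corrcoef(x, y)[0, 1], 3))  # equals beta in the bivariate case
```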
Visual Estimation of a Simple Regression Model
Simple regression involves finding a best-fit line given a set of observations plotted in
two-dimensional space.
Many ways exist to estimate where this line should go: instrumental variables, maximum
likelihood, visual estimation, and ordinary least squares (OLS).
This book focuses on the latter two.
Exhibit 23.7 plots data in a scatter diagram.
The vertical axis indicates the value of the dependent variable, Y.
The horizontal axis indicates the value of the independent variable, X.
Each single point in the diagram represents an observation of X and Y at a given point
in time.
The values are simply points in a Cartesian plane.
One way to determine the relationship between X and Y is to simply visually draw the
best fit straight line through the points in the figure.
That is, try to draw a line that goes through the center of the plot of points.
The better one can estimate where the best fit line should be, the less will be the error in
prediction.
Errors in Prediction
The goal of regression analysis is an estimation technique that places the line
so that the total of all errors in prediction over all observations is minimized.
Ordinary Least-Squares (OLS) Method of Regression Analysis
OLS is a relatively straight forward mathematical technique that guarantees that the
resulting straight line will produce the least possible total error in using X to predict Y.
The logic is based on how much better a regression line can predict values of Y compared
to simply using the mean as a prediction for all observations.
Unless the dependent and independent variables are perfectly related, no straight line can
connect all observations.
The procedure used in the least-squares method generates a straight line that minimizes
the sum of squared deviations of the actual values from this predicted regression line.
No other line can produce less error.
Using the symbol e to represent the deviations of the observations from the regression
line, the least-squares criterion is as follows:
Σ (i = 1 to n) ei² is minimum
where
ei = Yi − Ŷi (the residual)
Yi = actual value of the dependent variable
Ŷi = estimated value of the dependent variable (“Y hat”)
n = number of observations
i = number of the particular observation
The general equation for a straight line is Y = b0 + b1X, but a more appropriate
estimating equation includes an allowance for error:
Ŷi = b0 + b1Xi + ei
The equation means that the predicted value for any value of X (Xi) is determined as a
function of the estimated slope coefficient, plus the estimated intercept coefficient + some
error.
The raw parameter estimates can be found using the following formulas:
b1 = [n(ΣXiYi) − (ΣXi)(ΣYi)] / [n(ΣXi²) − (ΣXi)²]
and
b0 = Ȳ − b1X̄
where
Yi = ith observed value of the dependent variable
Xi = ith observed value of the independent variable
Ȳ = mean of the dependent variable
X̄ = mean of the independent variable
n = number of observations
b0 = intercept estimate
b1 = slope estimate (regression weight)
The standardized regression coefficient from a simple regression equals the Pearson
correlation coefficient for the two variables.
See Appendix 23A for the arithmetic necessary to calculate the parameter estimates.
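As a quick illustration of these formulas (separate from the Appendix 23A arithmetic), the sketch below applies them to a small invented data set.

```python
# A sketch (data invented) applying the closed-form OLS formulas for b1 and b0.
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y = np.array([3.0, 6.0, 7.0, 10.0, 12.0, 15.0])
n = len(x)

# b1 = [n(sum XiYi) - (sum Xi)(sum Yi)] / [n(sum Xi^2) - (sum Xi)^2]
b1 = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)

# b0 = Ybar - b1 * Xbar
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x      # predicted values
e = y - y_hat            # residuals, ei = Yi - Yhat_i

print(round(b0, 3), round(b1, 3))
print(round((e ** 2).sum(), 4))   # no other straight line gives a smaller sum of squared errors
```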
Statistical Significance of Regression Model
Like ANOVA, an F-test provides a way of testing the statistical significance of the
regression model.
The overall F-test for regression is illustrated in Exhibit 23.7.
1. The total line (the blue and red portions together) represents the total deviation of the
observation from the mean:
Yi − Ȳ
2. The blue portion represents how much of the total deviation is explained by the
regression line:
Ŷi − Ȳ
3. The red portion represents how much of the total deviation is not explained by the
regression line (also equal to ei):
Yi − Ŷi
These three components are mathematically related because the total deviation is a sum
of what is explained by the regression line and what is not explained by the regression
line:
(Yi − Ȳ) = (Ŷi − Ȳ) + (Yi − Ŷi)
Total deviation (SST) = Deviation explained by the regression (SSR) + Deviation unexplained by the regression (SSE)
Just as in ANOVA, the total deviation represents the total variation to be explained.
Partitioning of the variation into components allows us to form a ratio of the explained
variation versus the unexplained variation:
SST = SSR + SSE
An F-test, or an analysis of variance, can be applied to a regression to test the relative
magnitude of the SSR (Sums of Squares Regression) and SSE (Sums of Squared
Errors) with their appropriate degrees of freedom.
The equation for the F-test is:
F(k, n − k − 1) = (SSR / k) / (SSE / (n − k − 1)) = MSR / MSE
Where,
MSR is an abbreviation for Mean Squared Regression
MSE is an abbreviation for Mean Squared Error
k is the number of independent variables (always 1 for simple regression)
n is the sample size
Again, researchers need not calculate this by hand as regression programs will produce an
“ANOVA” table which will provide the F-value, a p-value (significance level) and
generally show the partitioned variation in some form.
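The partitioning and the F statistic can be reproduced in a few lines. The sketch below uses invented data and scipy's F distribution for the p-value, mirroring what a regression program's ANOVA table reports.

```python
# A sketch (invented data) of the variance partition SST = SSR + SSE and the
# overall F-test for a simple regression.
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y = np.array([3.0, 5.5, 7.5, 9.0, 12.5, 14.0])
n, k = len(x), 1                                   # k = 1 for simple regression

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = ((y - y.mean()) ** 2).sum()        # total deviation
ssr = ((y_hat - y.mean()) ** 2).sum()    # explained by the regression
sse = ((y - y_hat) ** 2).sum()           # unexplained (residual)

f_value = (ssr / k) / (sse / (n - k - 1))
p_value = stats.f.sf(f_value, k, n - k - 1)        # upper-tail probability

print(round(sst, 3), round(ssr + sse, 3))          # SST equals SSR + SSE
print(round(f_value, 2), round(p_value, 4))
print(round(ssr / sst, 3))                         # R-squared = SSR / SST
```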
R2
The coefficient of determination, R2, reflects the proportion of variance explained by
the regression line. It can be found with this formula:
R² = SSR / SST
For example, a coefficient of determination of .875 may be interpreted to mean that
87.5 percent of the variation in the dependent variable was explained by associating
the variable with the independent variable (however, in practice, do not expect to
often see a simple regression result with an R2 as high as this example).
What is an “acceptable” R2 value?
Depends on so many factors that a single precise guideline is inappropriate.
The focus should be on the F-test.
Interpreting Regression Output
Exhibit 23.8 provides a typical output for regression analysis.
Interpreting simple regression output is a simple two-step process:
1. Interpret the overall significance of the model.
a. The output will include a “model F” and a significance value.
b. The coefficient of determination or R2 can be interpreted.
2. The individual parameter coefficient is interpreted.
a. The t-value associated with the slope coefficient can be interpreted. For
simple regression, the p-value for the model F and for the t-test of the
individual regression weight will be the same.
b. A t-test for the intercept term (constant) is also provided; however, it is
seldom of interest because the explanatory power rests in the slope
coefficient.
c. If a need to forecast sales exists, the estimated regression equation is
needed.
Plotting the OLS Regression Line
To plot a regression line on the scatter diagram, only two predicted values of Y need to be
plotted; the line is then drawn through those two points.
To determine the error (residual) of any observation, the predicted value of Y is first
calculated. The predicted value is then subtracted from the actual value.
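A brief sketch of both points follows, using invented data; matplotlib is assumed to be available for the plot.

```python
# A sketch (invented data) of drawing the OLS line from just two predicted
# values of Y and computing one observation's residual.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 2.8, 4.1, 4.7, 6.2, 6.6])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

# Two predicted values are enough to plot the straight line.
x_line = np.array([x.min(), x.max()])
plt.scatter(x, y, label="observations")
plt.plot(x_line, b0 + b1 * x_line, label="OLS line")
plt.legend()
plt.show()

# Residual of the third observation: actual value minus predicted value.
y_hat_3 = b0 + b1 * x[2]
print(round(y[2] - y_hat_3, 3))
```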
Simple Regression and Hypothesis Testing
The explanatory power of regression lies in hypothesis testing.
Regression is often used to test relational hypotheses.
The outcome of the hypothesis test involves two conditions that must be satisfied:
1. The regression weight must be in the hypothesized direction.
2. The t-test associated with the regression weight must be significant.
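The two conditions can be checked directly from regression output. The sketch below uses invented ratings loosely patterned on the Size and Weight snapshot, with scipy's linregress supplying the slope and its two-tailed p-value.

```python
# A sketch (data invented) of checking the two conditions for a relational
# hypothesis such as H1: X is negatively related to Y.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])   # e.g., perceived heaviness ratings
y = np.array([6.5, 6.0, 5.2, 4.8, 3.9, 3.1, 2.6])   # e.g., attractiveness ratings

res = stats.linregress(x, y)     # slope, intercept, r, two-tailed p-value, std. error

right_direction = res.slope < 0  # condition 1: weight in the hypothesized direction
significant = res.pvalue < 0.05  # condition 2: t-test of the regression weight

print(round(res.slope, 3), round(res.pvalue, 4))
print("H1 supported" if right_direction and significant else "H1 not supported")
```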
IV. APPENDIX 23A: ARITHMETIC BEHIND OLS
Data from Exhibit 23.6 are used to solve for the parameter estimates using the OLS
equations.