correction term is 0.5 + 0.5 + 2 = 3
Problem 9 : Resolving ties in ranks
The following are the details of ratings scored by two popular
insurance schemes. Determine the rank correlation coefficient between
them.
Scheme I     80   80   83   84   87   87   89   90
Scheme II    55   56   57   57   57   58   59   60
Solution:
From the given values, we have to determine the ranks.
Step 1.
Arrange the scores for Insurance Scheme I in descending order and
rank them as 1,2,3,…,8.
Scheme I Score    90   89   87   87   84   83   80   80
Rank               1    2    3    4    5    6    7    8
The score 87 appears twice. The corresponding ranks are 3, 4.
Their average is (3 + 4) / 2 = 3.5. Assign this rank to the two equal scores
in Scheme I.
The score 80 appears twice. The corresponding ranks are 7, 8.
Their average is (7 + 8) / 2 = 7.5. Assign this rank to the two equal scores
in Scheme I.
The revised ranks for Insurance Scheme I are as follows:
Scheme I Score    90   89   87    87    84   83   80    80
Rank               1    2   3.5   3.5    5    6   7.5   7.5
Step 2.
Arrange the scores for Insurance Scheme II in descending order
and rank them as 1,2,3,…,8.
Scheme II Score   60   59   58   57   57   57   56   55
Rank               1    2    3    4    5    6    7    8
The score 57 appears thrice. The corresponding ranks are 4, 5, 6.
Their average is (4 + 5 + 6) / 3 = 15 / 3 = 5. Assign this rank to the three
equal scores in Scheme II.
The revised ranks for Insurance Scheme II are as follows:
Scheme II Score   60   59   58   57   57   57   56   55
Rank               1    2    3    5    5    5    7    8
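The averaging of tied ranks in Steps 1 and 2 can be sketched in Python as follows. This is only an illustrative sketch: the function name average_ranks is an assumption introduced here, and the two lists hold the scores in the order given in the problem.

```python
def average_ranks(scores):
    """Rank scores in descending order (highest score gets rank 1).

    Tied scores receive the average of the ranks they jointly occupy.
    """
    # Indices of the scores, ordered from highest to lowest score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of scores equal to the one at 'pos'
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        # The run occupies ranks pos+1 .. end+1; use their average
        avg = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


scheme_1 = [80, 80, 83, 84, 87, 87, 89, 90]
scheme_2 = [55, 56, 57, 57, 57, 58, 59, 60]
print(average_ranks(scheme_1))
print(average_ranks(scheme_2))
```

The returned ranks are listed in the original order of the scores, so they can be paired directly when computing the rank differences.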
Step 3.
Calculation of $D^2$: Assign the revised ranks to the given pairs of
values and calculate $D^2$ as follows:

Scheme I   Scheme II   Scheme I      Scheme II     $D = R_1 - R_2$   $D^2$
Score      Score       Rank: $R_1$   Rank: $R_2$
80         55          7.5           8             -0.5              0.25
80         56          7.5           7              0.5              0.25
83         57          6             5              1                1
84         57          5             5              0                0
87         57          3.5           5             -1.5              2.25
87         58          3.5           3              0.5              0.25
89         59          2             2              0                0
90         60          1             1              0                0
Total                                                                4
Step 4.
Calculation of ρ:
We have N = 8.
Since there are 2 ties with 2 items each and another tie with 3 items,
and a tie of m items contributes $m(m^2 - 1)/12$, the correction term is
0.5 + 0.5 + 2 = 3.
The rank correlation coefficient is
$\rho = 1 - \dfrac{6\left(\sum D^2 + \frac{1}{2} + \frac{1}{2} + 2\right)}{N^3 - N}$
$= 1 - \dfrac{6\,(4 + 0.5 + 0.5 + 2)}{512 - 8} = 1 - \dfrac{6 \times 7}{504} = 1 - \dfrac{42}{504}$
$= 1 - 0.083 = 0.917$
Inference:
It is inferred that the two insurance schemes are highly, positively
correlated.
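The calculation in Step 4 can be checked with a short Python sketch. The function name spearman_with_ties and its parameters are assumptions made for illustration. The rank lists are the revised ranks from Step 3, and tie_sizes lists the number of items in each tie.

```python
def spearman_with_ties(ranks_1, ranks_2, tie_sizes):
    """Rank correlation with the tie correction m(m^2 - 1)/12 per tie."""
    n = len(ranks_1)
    # Sum of squared rank differences, i.e. the total of the D^2 column
    sum_d2 = sum((r1 - r2) ** 2 for r1, r2 in zip(ranks_1, ranks_2))
    # Each tie of m items adds m(m^2 - 1)/12 to the sum of D^2
    correction = sum(m * (m * m - 1) / 12 for m in tie_sizes)
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)


ranks_scheme_1 = [7.5, 7.5, 6, 5, 3.5, 3.5, 2, 1]
ranks_scheme_2 = [8, 7, 5, 5, 5, 3, 2, 1]
rho = spearman_with_ties(ranks_scheme_1, ranks_scheme_2, tie_sizes=[2, 2, 3])
print(round(rho, 3))  # 0.917
```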
REGRESSION
In the pairs of observations, if there is a cause and effect relationship
between the variables X and Y, then the average relationship between
these two variables is called regression, which means “stepping back” or
“return to the average”. The linear relationship giving the best mean value
of a variable corresponding to the other variable is called a regression
line or line of the best fit. The regression of X on Y is different from the
regression of Y on X. Thus, there are two equations of regression and the
two regression lines are given as follows:
Regression of Y on X: $Y - \bar{Y} = b_{yx}\,(X - \bar{X})$
Regression of X on Y: $X - \bar{X} = b_{xy}\,(Y - \bar{Y})$
where $\bar{X}$, $\bar{Y}$ are the means of X, Y respectively.
Result:
Let $\sigma_X$, $\sigma_Y$ denote the standard deviations of X, Y respectively. We
have the following result.
$b_{yx} = r\,\dfrac{\sigma_Y}{\sigma_X}$ and $b_{xy} = r\,\dfrac{\sigma_X}{\sigma_Y}$
$\therefore\ r^2 = b_{yx}\,b_{xy}$ and so $r = \sqrt{b_{yx}\,b_{xy}}$
Result:
The coefficient of correlation r between X and Y is the square root
of the product of the b values in the two regression equations, taken with
the common sign of the two b values. This gives another way to find r.
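As a hedged illustration of this result, the sketch below computes both regression coefficients from sums of deviations about the means and compares the square root of their product with r computed directly. The helper name deviation_sums is an assumption, and the sample data are the sales and profit figures of Problem 10 in this text.

```python
import math


def deviation_sums(xs, ys):
    """Return (Sxy, Sxx, Syy): sums of products and squares of deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy


sales = [5, 6, 7, 8, 9, 10, 11]
profit = [2, 4, 5, 5, 3, 8, 7]
sxy, sxx, syy = deviation_sums(sales, profit)

b_yx = sxy / sxx  # regression coefficient of Y on X
b_xy = sxy / syy  # regression coefficient of X on Y

# Both coefficients share the sign of Sxy, and r takes that same sign
r_from_b = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
r_direct = sxy / math.sqrt(sxx * syy)
print(b_yx, b_xy, r_from_b, r_direct)
```

The two values of r agree up to floating-point rounding, which is what the relation $r^2 = b_{yx} b_{xy}$ predicts.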
Application
The method of regression is very useful for business
forecasting.
PRINCIPLE OF LEAST SQUARES
Let x, y be two variables under consideration. Out of them, let x
be an independent variable and let y be a dependent variable, depending
on x. We desire to build a functional relationship between them. For this
purpose, the first and foremost requirement is that x, y have a high degree
of correlation. If the correlation coefficient between x and y is moderate or
less, we shall not go ahead with the task of fitting a functional relationship
between them.
Suppose there is a high degree of correlation (positive or negative)
between x and y. Suppose it is required to build a linear relationship
between them i.e., we want a regression of y on x.
Geometrically speaking, if the relationship were exactly linear, the
points obtained by plotting the corresponding values of x and y in a
2-dimensional plane would all lie on a straight line. However, we can
hardly expect all the pairs (x, y) to lie on a straight line. We can
consider several straight lines which are, to some extent, near all the
points (x, y). Consider one such line. An observation $(x_1, y_1)$ may be
either above the line under consideration or below it. Project this point
on the x-axis. The projection meets the straight line at the point
$(x_1, y_1^e)$. Here the theoretical value (or the expected value) of the
variable is $y_1^e$ while the observed value is $y_1$. When there is a
difference between the expected and observed values, there appears an
error. This error is $E_1 = y_1 - y_1^e$. This is positive if $(x_1, y_1)$
is a point above the line and negative if $(x_1, y_1)$ is a point below
the line. For the n pairs of observations, we have the following n
quantities of error:
$E_1 = y_1 - y_1^e,$
$E_2 = y_2 - y_2^e,$
$\ldots$
$E_n = y_n - y_n^e.$
Some of these quantities are positive while the remaining ones are
negative. However, the squares of all these quantities are non-negative.
[Figure: scatter of observed points (X1, Y1), (X2, Y2) with errors e1, e2; axes X and Y with origin O.]
i.e.,
$E_1^2 = (y_1 - y_1^e)^2 \ge 0,\ E_2^2 = (y_2 - y_2^e)^2 \ge 0,\ \ldots,\ E_n^2 = (y_n - y_n^e)^2 \ge 0.$
Hence the sum of squares of errors (SSE) $= E_1^2 + E_2^2 + \ldots + E_n^2$
$= (y_1 - y_1^e)^2 + (y_2 - y_2^e)^2 + \ldots + (y_n - y_n^e)^2 \ge 0.$
Among all those straight lines which are somewhat near to the
given observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, we consider
that straight line as the ideal one for which the SSE is the least. Since
the ideal straight line giving regression of y on x is based on this
concept, we call this principle the Principle of least squares.
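The principle can be illustrated by computing the SSE of a few candidate lines and picking the one with the least value. This is only a sketch: the function name sse and the first two candidate lines are hypothetical choices made for illustration, the data are the sales and profit figures of Problem 10 in this text, and the third candidate is the line obtained there (a = -224/196, b = 0.75).

```python
def sse(a, b, xs, ys):
    """Sum of squared errors of the line y = a + b*x over the observations."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


xs = [5, 6, 7, 8, 9, 10, 11]
ys = [2, 4, 5, 5, 3, 8, 7]

# Candidate lines as (a, b); the first two are arbitrary illustrations
candidates = [(0.0, 0.6), (-1.0, 0.7), (-224 / 196, 0.75)]
errors = {line: sse(line[0], line[1], xs, ys) for line in candidates}

best = min(errors, key=errors.get)
for line, e in errors.items():
    print(line, round(e, 3))
print("line with least SSE:", best)
```

Any other straight line tried in place of the first two candidates would also give a larger SSE than the least-squares line.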
Normal equations
Suppose we have to fit a straight line to the n pairs of observations
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Suppose the equation of the
straight line finally comes as
Y = a + b X (1)
where a, b are constants to be determined. Mathematically speaking, when
we require finding the equation of a straight line, two distinct points on
the straight line are sufficient. However, a different approach is followed
here. We want to include all the observations in our attempt to build a
straight line. Then all the n observed points (x, y) are required to satisfy
the relation (1). Consider the summation of all such terms. We get
$\sum y = \sum (a + bx) = \sum (a \cdot 1 + bx) = \left(\sum a \cdot 1\right) + \left(\sum bx\right) = a\left(\sum 1\right) + b\left(\sum x\right),$
i.e.,
$\sum y = an + b \sum x$ (2)
To find two quantities a and b, we require two equations. We have
obtained one equation i.e., (2). We need one more equation. For this
purpose, multiply both sides of (1) by x. We obtain
$xy = ax + bx^2.$
Consider the summation of all such terms. We get
$\sum xy = \sum (ax + bx^2) = \left(\sum ax\right) + \left(\sum bx^2\right),$
i.e.,
$\sum xy = a \sum x + b \sum x^2$ (3)
Equations (2) and (3) are referred to as the normal equations associated
with the regression of y on x. Solving these two equations, we obtain
$a = \dfrac{\sum X^2 \sum Y - \sum X \sum XY}{n \sum X^2 - (\sum X)^2}$
and
$b = \dfrac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}$
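The text states the solution without the intermediate steps. One way to reach it, sketched here by elimination under the same notation, is to multiply (3) by n and (2) by $\sum x$, subtract to remove a, and then substitute b back into (2):

```latex
% Normal equations (2) and (3):
%   \sum y  = a n + b \sum x
%   \sum xy = a \sum x + b \sum x^2
\begin{align*}
n \sum xy - \sum x \sum y
  &= b \left( n \sum x^2 - \Bigl(\sum x\Bigr)^2 \right)
  && \text{($n \times (3)$ minus $\textstyle\sum x \times (2)$)} \\
b &= \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} \\
a &= \frac{\sum y - b \sum x}{n}
   = \frac{\sum x^2 \sum y - \sum x \sum xy}{n \sum x^2 - (\sum x)^2}
  && \text{(from (2))}
\end{align*}
```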
Note:
For calculating the coefficient of correlation,
we require $\sum X$, $\sum Y$, $\sum XY$, $\sum X^2$, $\sum Y^2$.
For calculating the regression of y on x, we require $\sum X$, $\sum Y$,
$\sum XY$, $\sum X^2$. Thus, the tabular column is the same in both cases,
with the difference that $\sum Y^2$ is also required for the coefficient of
correlation.
Next, if we consider the regression line of x on y, we get the equation
X = a + b Y. The expressions for the coefficients can be got by interchanging
the roles of X and Y in the previous discussion. Thus, we obtain
$a = \dfrac{\sum Y^2 \sum X - \sum Y \sum XY}{n \sum Y^2 - (\sum Y)^2}$
and
$b = \dfrac{n \sum XY - \sum X \sum Y}{n \sum Y^2 - (\sum Y)^2}$
Problem 10
Consider the following data on sales and profit.
X    5    6    7    8    9   10   11
Y    2    4    5    5    3    8    7
Determine the regression of profit on sales.
Solution:
We have N = 7. Take X = Sales, Y = Profit.
Calculate $\sum X$, $\sum Y$, $\sum XY$, $\sum X^2$ as follows:

          X     Y     XY    $X^2$
          5     2     10     25
          6     4     24     36
          7     5     35     49
          8     5     40     64
          9     3     27     81
         10     8     80    100
         11     7     77    121
Total:   56    34    293    476
$a = \dfrac{(\sum x^2)(\sum y) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$
$= \dfrac{476 \times 34 - 56 \times 293}{7 \times 476 - 56^2}$
$= \dfrac{16184 - 16408}{3332 - 3136}$
$= -\dfrac{224}{196}$
$= -1.1429$
$b = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$
$= \dfrac{7 \times 293 - 56 \times 34}{196} = \dfrac{2051 - 1904}{196}$
$= \dfrac{147}{196}$
$= 0.75$
The regression of Y on X is given by the equation
Y = a + b X
i.e.,
Y = – 1.14 + 0.75 X
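The arithmetic of Problem 10 can be cross-checked with a small Python sketch that evaluates the formulas for a and b from the column totals. The function name fit_line is an assumption made for illustration.

```python
def fit_line(xs, ys):
    """Solve the normal equations for the line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2  # common denominator of a and b
    a = (sum_x2 * sum_y - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b


sales = [5, 6, 7, 8, 9, 10, 11]
profit = [2, 4, 5, 5, 3, 8, 7]
a, b = fit_line(sales, profit)
print(round(a, 4), round(b, 2))  # -1.1429 0.75
```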
Problem 11
The following are the details of income and expenditure of 10
households.
Income        40   70   50   60   80   50   90   40   60   60
Expenditure   25   60   45   50   45   20   55   30   35   30
Determine the regression of expenditure on income and estimate the
expenditure when the income is 65.
Solution:
We have N = 10. Take X = Income, Y = Expenditure
Calculate $\sum X$, $\sum Y$, $\sum XY$, $\sum X^2$ as follows:

           X      Y       XY     $X^2$
          40     25     1000     1600
          70     60     4200     4900
          50     45     2250     2500
          60     50     3000     3600
          80     45     3600     6400
          50     20     1000     2500
          90     55     4950     8100
          40     30     1200     1600
          60     35     2100     3600
          60     30     1800     3600
Total:   600    395    25100    38400
$a = \dfrac{(\sum x^2)(\sum y) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$
$= \dfrac{38400 \times 395 - 600 \times 25100}{10 \times 38400 - 600^2}$
$= \dfrac{15168000 - 15060000}{384000 - 360000}$
$= \dfrac{108000}{24000}$
$= 4.5$
$b = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$
$= \dfrac{10 \times 25100 - 600 \times 395}{24000}$
$= \dfrac{251000 - 237000}{24000}$
$= \dfrac{14000}{24000}$
$= 0.583$
The regression of y on x is given by the equation
Y = a + b X
i.e.,
Y = 4.5 + 0.583 X
To estimate the expenditure when income is 65:
Take X = 65 in the above equation. Then we get
Y = 4.5 + 0.583 x 65
= 4.5 + 37.895
= 42.395
= 42 (approximately).
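As a cross-check of Problem 11, the sketch below fits the same line through the mean form of the regression equation, taking b as the ratio of the sum of products of deviations to the sum of squared deviations of X, and a as the mean of Y minus b times the mean of X. This is an equivalent route rather than the tabular method used above, and the function name fit_by_deviations is an assumption.

```python
def fit_by_deviations(xs, ys):
    """Fit y = a + b*x using deviations of x and y from their means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x  # the fitted line passes through the means
    return a, b


income = [40, 70, 50, 60, 80, 50, 90, 40, 60, 60]
expenditure = [25, 60, 45, 50, 45, 20, 55, 30, 35, 30]
a, b = fit_by_deviations(income, expenditure)

estimate = a + b * 65  # estimated expenditure when income is 65
print(round(a, 3), round(b, 3), round(estimate, 2))
```

With the unrounded slope the estimate is about 42.42 rather than 42.395; both round to 42.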
Problem 12
Consider the following data on occupancy rate and profit of a hotel.
Occupancy rate   40   45   70   60   70   75    70    80    95    90
Profit           50   55   65   70   90   95   105   110   120   125
Determine the regressions of
(i) profit on occupancy rate and
(ii) occupancy rate on profit.
Solution:
We have N = 10. Take X = Occupancy Rate, Y = Profit.
Note that in Problems 10 and 11, we wanted only one regression
line and so we did not take $\sum Y^2$. Now we require two regression lines.
Therefore, calculate $\sum X$, $\sum Y$, $\sum XY$, $\sum X^2$, $\sum Y^2$.

           X      Y       XY     $X^2$    $Y^2$
          40     50     2000     1600     2500
          45     55     2475     2025     3025
          70     65     4550     4900     4225
          60     70     4200     3600     4900
          70     90     6300     4900     8100
          75     95     7125     5625     9025
          70    105     7350     4900    11025
          80    110     8800     6400    12100
          95    120    11400     9025    14400
          90    125    11250     8100    15625
Total:   695    885    65450    51075    84925
The regression line of Y on X:
Y = a + b X
where
$a = \dfrac{(\sum x^2)(\sum y) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$
and
$b = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$