Solutions for Homework sheet 6

Problem 1

(a)

The alternatives for the test are

$H_0: \rho_{12}=0$ i.e., $Y_1$ and $Y_2$ are independent

$H_a: \rho_{12} \neq 0$ i.e., $Y_1$ and $Y_2$ are not independent

The test statistic is

$t^*=\frac{r_{12}\sqrt{n-2}}{\sqrt{1-r^2_{12}}}$

Decision rules are

If $|t^*| \leq t(1-\alpha/2;n-2)$, conclude $H_0$

If $|t^*| > t(1-\alpha/2;n-2)$, conclude $H_a$

In [1]:
import numpy as np
from scipy import stats


r12=0.61
n=84

tstar=(r12*np.sqrt(n-2))/np.sqrt(1-r12**2)
alpha=0.05
t=stats.t.ppf(1-alpha/2,n-2)  

# formatted display of the results
tstar0=format(tstar,'.3f');t0=format(t,'.3f');

from IPython.display import display, Math, Latex
display(Math(r't^*='+str(tstar0)))
display(Math(r't='+str(t0)))
$$t^*=6.971$$
$$t=1.989$$

We observe $|t^*| > t(1-\alpha/2;n-2)$, therefore we conclude $H_a$, i.e., $Y_1$ and $Y_2$ are not independent.
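As a quick cross-check (an aside, not part of the required decision rule), the two-sided $p$-value of the test can be computed directly from the $t$ distribution:

In [ ]:
# two-sided p-value for the observed t* (cross-check)
p_value = 2*stats.t.sf(abs(tstar), n-2)
print('p-value = '+format(p_value,'.2e'))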

(b)

Interval estimation of $\rho_{12}$ using the $z'$ transformation

Fisher transformation:

$z' = \frac{1}{2}\ln\frac{1+r_{12}}{1-r_{12}} = \tanh^{-1}(r_{12})$

$E(z') \approx \zeta = \frac{1}{2}\ln\frac{1+\rho_{12}}{1-\rho_{12}} = \tanh^{-1}(\rho_{12})$

$\sigma(z') = \frac{1}{\sqrt{n-3}}$

The $1-\alpha$ confidence interval for $\zeta$ is:

$z' \pm z(1-\alpha/2)\sigma(z')$

To get the confidence limits for $\rho_{12}$ we back-transform using

$\rho_{12}=\tanh{\zeta}$ and $r_{12}=\tanh{z'}$

In [2]:
z_prime = np.arctanh(r12)
z = stats.norm.ppf(1 - (alpha/2))
sigma_of_z = 1/np.sqrt(n - 3)

ub0=z_prime + z*sigma_of_z;lb0=z_prime - z*sigma_of_z; 
r12_uplim = np.tanh(ub0);r12_lowlim = np.tanh(lb0)

r12_uplim0 = format(r12_uplim,'.3f'); r12_lowlim0=format(r12_lowlim, '.3f')
from IPython.display import display, Math, Latex
display(Latex(r'The $95\%$ confidence interval for $\rho_{12}$ is  ' ))
display(Math( '['+str(r12_lowlim0)+r' ~,~ '+str(r12_uplim0)+']'))
The $95\%$ confidence interval for $\rho_{12}$ is
$$[0.455 ~,~ 0.729]$$

(c)

In [3]:
# both confidence limits for rho_12 are positive, so squaring them gives a
# valid interval for rho^2_12
r12_uplim0 = format(r12_uplim**2,'.3f'); r12_lowlim0=format(r12_lowlim**2, '.3f')

from IPython.display import display, Math, Latex
display(Latex(r'The $95\%$ confidence interval for $\rho^2_{12}$ is  ' ))
display(Math( '['+str(r12_lowlim0)+r' ~,~ '+str(r12_uplim0)+']'))
The $95\%$ confidence interval for $\rho^2_{12}$ is
$$[0.207 ~,~ 0.532]$$

$\rho^2_{12}$ may be interpreted as the proportionate reduction in the total variance of one variable associated with use of the other variable. Here, if we use the variable $Y_1$ (value of contract), then with 95% confidence we can explain between 20.7% and 53.2% of the variability in $Y_2$ (profit contribution generated by the contract). Since the correlation model treats both variables symmetrically, the statement remains true if we interchange the variables.

Problem 2

In [4]:
from read_data_fx import read_tb_data
data=read_tb_data(1,28)
# Y1: percentage of individuals with at least a high school diploma
Y1=data[:,1]
# Y2: crime rate (both variables are treated as random in the correlation model)
Y2=data[:,0]

(a)

In [5]:
# Pearson product-moment correlation coefficient function

def pearson_corrX(Y1,Y2):
    barY1=np.mean(Y1); barY2=np.mean(Y2)
    Y1minusbarY1=Y1-barY1; Y2minusbarY2=Y2-barY2
    r12=np.sum(Y1minusbarY1*Y2minusbarY2)/np.sqrt(np.sum(Y1minusbarY1**2)*np.sum(Y2minusbarY2**2))
    return r12
In [6]:
r12=pearson_corrX(Y1,Y2)
r12_0 = format(r12,'.4f');

display(Latex(r'Pearson product-moment correlation coefficient'))
display(Math('r_{12}= '+str(r12_0)))
Pearson product-moment correlation coefficient
$$r_{12}= -0.4127$$
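As a sanity check (an aside, not part of the solution), NumPy's built-in np.corrcoef returns the same coefficient as the hand-rolled function:

In [ ]:
# np.corrcoef returns the 2x2 correlation matrix; entry [0,1] is r12
print('np.corrcoef: '+format(np.corrcoef(Y1,Y2)[0,1],'.4f'))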

(b)

The alternatives for the test are

$H_0: \rho_{12} = 0$ (crime rate and percentage of individuals with high school diploma are independent)

$H_a: \rho_{12} \neq 0$ (crime rate and percentage of individuals with high school diploma are not independent)

The test statistic

$t^* = \frac{r_{12}\sqrt{n-2}}{\sqrt{1-r^2_{12}}}$

Decision rules are

If $|t^*| \leq t(1-\alpha/2;n-2)$, conclude $H_0$

If $|t^*| > t(1-\alpha/2;n-2)$, conclude $H_a$

In [7]:
n=len(Y1)
tstar=(r12*np.sqrt(n-2))/np.sqrt(1-r12**2)
alpha=0.01
t=stats.t.ppf(1-alpha/2,n-2)  

# formatted display of the results
tstar0=format(tstar,'.3f');t0=format(t,'.3f');

from IPython.display import display, Math, Latex
display(Math(r't^*='+str(tstar0)))
display(Math(r't='+str(t0)))
$$t^*=-4.103$$
$$t=2.637$$

We observe $|t^*| > t(1-\alpha/2;n-2)$, therefore we conclude $H_a$, i.e., crime rate and percentage of individuals with high school diploma are not independent.

Problem 3

In [8]:
# read the data 
from read_data_fx import read_tb_data
data=read_tb_data(1,19)

#ACT test score (X)
X=data[:,1]
#GPA at the end of the freshman year (Y)
Y=data[:,0]; 

(a)

In [9]:
import matplotlib.pyplot as plt 
%matplotlib inline 

plt.figure(figsize=(6,6))
plt.boxplot(X, sym = 'ko',whis = 1.5)
plt.xticks([1],['ACT scores'],fontsize=18)
plt.yticks(range(10,40,5),fontsize=18)
plt.ylim([10,37]);

There are no noteworthy features in the box plot of ACT scores (predictor variable). It appears symmetric with no outliers.

(b)

In [10]:
# linear regression function 
def lin_reg(X,Y):
    barX=np.mean(X); barY=np.mean(Y)
    XminusbarX=X-barX; YminusbarY=Y-barY
    b1=sum(XminusbarX*YminusbarY)/sum(XminusbarX**2)
    b0=barY-b1*barX
    Yhat=b0+b1*X
    e_i=Y-Yhat
    sse=np.sum(e_i**2)
    ssr=np.sum((Yhat-barY )**2)
    
    return {'Yhat':Yhat,'b0':b0,'b1':b1,'e_i': e_i,'sse':sse,'ssr':ssr}
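As a sanity check of lin_reg (an aside, not part of the solution), np.polyfit with degree 1 should return the same slope and intercept:

In [ ]:
# np.polyfit returns the coefficients highest degree first: [slope, intercept]
b1_np, b0_np = np.polyfit(X, Y, 1)
fit = lin_reg(X, Y)
print('polyfit: b1 = '+format(b1_np,'.4f')+', b0 = '+format(b0_np,'.4f'))
print('lin_reg: b1 = '+format(fit['b1'],'.4f')+', b0 = '+format(fit['b0'],'.4f'))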
In [11]:
#residuals
e_i = lin_reg(X,Y)['e_i'] 

# bin the residuals ('counts' avoids shadowing the sample size n)
counts, bins = np.histogram(e_i, 25)

# bin centers
bincenters = 0.5*(bins[1:]+bins[:-1])


# figure size
plt.figure(figsize=(8,8))

# dot plot: stack one marker per observation above its bin center
for j, count in enumerate(counts):
    y=range(0,count); x=[bincenters[j]]*count
    plt.plot(x,y,'ro',ms=12)

plt.ylim([-2,16])

plt.xticks(np.arange(-3.0,1.5,1.0),fontsize=18) ;
plt.yticks(np.arange(-2.0,16,2.0),fontsize=18) ;


plt.xlabel('$e_i$',fontsize=28);

Most residuals are small and their distribution resembles a normal distribution. There are only a few large residuals, i.e., only a few points in the data show larger departures from the linear regression model.

(c)

In [12]:
Yhat = lin_reg(X,Y)['Yhat']


plt.figure(figsize=(10,10))

plt.axhline(y = 0, color = 'r', linewidth=5)
plt.plot(Yhat,e_i,'co',ms=12);

plt.xlabel(r'$\hat{Y}_i$',fontsize=28);
plt.ylabel('$e_i$',fontsize=28);
plt.yticks(np.arange(-3.0,3.0,1.0),fontsize=18) ;
plt.xticks(np.arange(2.7,3.5,0.3),fontsize=18) ;
plt.xlim([2.6,3.6])
plt.ylim([-3,3]);

The majority of the residuals are evenly scattered around zero, indicating that the linear model is appropriate for the given data. We also observe a few outliers.

(d)

In [13]:
def normal_prob_plot(e_i):
    # expected residuals under normality: sqrt(MSE) * z((k - 0.375)/(n + 0.25)),
    # where k is the rank of each residual
    n=len(e_i); sse=np.sum(e_i**2)
    mse_sqrt=np.sqrt(sse/(n-2.0))
    kr=stats.rankdata(e_i)
    e_ik=mse_sqrt*stats.norm.ppf((kr-0.375)/(n+0.25))
    return e_ik
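A comparable plot can be produced with scipy's built-in stats.probplot (an aside; it uses its own plotting positions and leaves the theoretical quantiles unscaled, so its x-axis differs from the hand-rolled version by the factor $\sqrt{MSE}$):

In [ ]:
# scipy's normal probability plot of the residuals, for comparison
stats.probplot(e_i, dist='norm', plot=plt);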
In [14]:
e_ik=normal_prob_plot(e_i)
plt.plot(e_ik,e_i,'ko',ms=10) 

Yhat_e=lin_reg(e_ik,e_i)['Yhat']
plt.plot(e_ik,Yhat_e,'r-',lw=1); 

plt.title('Normal probability plot',fontsize=15)  
plt.xlabel('Expected',fontsize=15);
plt.ylabel('Residual',fontsize=15);
In [15]:
# correlation between the residuals and their expected values under
# normality (pearson_corrX was defined above, in Problem 2)
r12=pearson_corrX(e_i,e_ik)
r12_0=format(r12,'.4f')
display(Math(r'\rho='+str(r12_0)))
print('n = '+str(len(X)))
$$\rho=0.9737$$
n = 120

$H_0:$ Normal

$H_a:$ Not Normal

If $\rho \geq \rho_c$, conclude $H_0$; otherwise conclude $H_a$.

$\rho_c$ is the critical value for the coefficient of correlation between the ordered residuals and their expected values under normality when the distribution of the error terms is normal; see Table B.6 in the textbook, or "Use of the Correlation Coefficient with Normal Probability Plots," The American Statistician, 75-79.

$\rho_c>0.987$ for $n=120$ at $\alpha=0.05$

Here $\rho = 0.974 < \rho_c$, therefore we conclude $H_a$, i.e., the residuals are not normally distributed at the $5\%$ significance level.
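As a complementary check (an aside, not part of the required correlation test), scipy's Shapiro-Wilk test addresses the same hypotheses:

In [ ]:
# Shapiro-Wilk normality test on the residuals; a small p-value rejects normality
W, p_sw = stats.shapiro(e_i)
display(Math(r'W='+format(W,'.4f')+r',~ p='+format(p_sw,'.4f')))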

(e)

Decision rule for Brown-Forsythe Test:

If $|t^*_{BF}| \leq t(1-\alpha/2;n-2)$, conclude the error variance is constant

If $|t^*_{BF}| > t(1-\alpha/2;n-2)$, conclude the error variance is not constant

where the test statistic is

$$t^*_{BF} = \frac{\bar{d}_1-\bar{d}_2}{s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$
In [16]:
# split the observations at ACT score 26 into two groups
idxn1=np.nonzero(X < 26)[0]; idxn2=np.nonzero(X >= 26)[0]


n1=len(idxn1); n2=len(idxn2); n=len(X)  



X1=X[idxn1]; X2=X[idxn2]
Y1=Y[idxn1]; Y2=Y[idxn2]
e_i1=e_i[idxn1]; e_i2=e_i[idxn2]

# absolute deviations from the group medians
di1=abs(e_i1-np.median(e_i1))
di2=abs(e_i2-np.median(e_i2))
di1minusd1bar2=(di1-np.mean(di1))**2; di2minusd2bar2=(di2-np.mean(di2))**2

# pooled estimate s of the standard deviation of the absolute deviations
s1=np.sum(di1minusd1bar2)+np.sum(di2minusd2bar2)
s=np.sqrt(s1/(n-2.0))

d12=np.mean(di1)-np.mean(di2)
tbfstar=d12/(s*np.sqrt(1.0/n1+1.0/n2))
In [17]:
alpha=0.01; t=stats.t.ppf(1-alpha/2.0,n-2);

t0=format(t,'.3f');

tbfstar0 = format(abs(tbfstar),'.4f')
display(Math(r'|t^*_{BF}|='+str(tbfstar0))) 
display(Math(r't='+str(t0)))
$$|t^*_{BF}|=0.8967$$
$$t=2.618$$

We observe that $|t^*_{BF}| \leq t(1-\alpha/2;n-2)$, therefore we conclude that the error variance is constant.
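As a cross-check (an aside), scipy's Levene test with center='median' is exactly the Brown-Forsythe test; with two groups its $F$ statistic equals $(t^*_{BF})^2$:

In [ ]:
# Brown-Forsythe via scipy: median-centered Levene test on the two groups
F_bf, p_bf = stats.levene(e_i1, e_i2, center='median')
display(Math(r'F_{BF}='+format(F_bf,'.4f')+r',~ (t^*_{BF})^2='+format(tbfstar**2,'.4f')))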