The alternatives for the test are
$H_0: \rho_{12}=0$, i.e., $Y_1$ and $Y_2$ are independent
$H_a: \rho_{12} \neq 0$, i.e., $Y_1$ and $Y_2$ are not independent
The test statistic is
$t^*=\frac{r_{12}\sqrt{n-2}}{\sqrt{1-r^2_{12}}}$
Decision rules are
If $|t^*| \leq t(1-\alpha/2;n-2)$, conclude $H_0$
If $|t^*| \gt t(1-\alpha/2;n-2)$, conclude $H_a$
import numpy as np
from scipy import stats
r12=0.61
n=84
tstar=(r12*np.sqrt(n-2))/np.sqrt(1-r12**2)
alpha=0.05
t=stats.t.ppf(1-alpha/2,n-2)
# formatted display of the computed values
tstar0=format(tstar,'.3f');t0=format(t,'.3f');
from IPython.display import display, Math, Latex
display(Math(r't^*='+str(tstar0)))
display(Math(r't='+str(t0)))
We observe $|t^*| \gt t(1-\alpha/2;n-2)$, therefore we conclude $H_a$, i.e., $Y_1$ and $Y_2$ are not independent.
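The same decision can be reached through the two-sided $p$-value of $t^*$; the following is a quick cross-check using the same $r_{12}=0.61$ and $n=84$ as above:

```python
import numpy as np
from scipy import stats

r12, n, alpha = 0.61, 84, 0.05
tstar = (r12 * np.sqrt(n - 2)) / np.sqrt(1 - r12**2)
# two-sided p-value from the t distribution with n-2 degrees of freedom
p_value = 2 * stats.t.sf(abs(tstar), n - 2)
# reject H0 when the p-value falls below alpha
print(p_value < alpha)  # → True
```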
Interval estimation of $\rho_{12}$ using the $z'$ transformation
Fisher transformation:
$z' = \frac{1}{2}\ln\frac{1+r_{12}}{1-r_{12}} = \tanh^{-1}(r_{12})$
$E(z') = \zeta = \frac{1}{2}\ln\frac{1+\rho_{12}}{1-\rho_{12}} = \tanh^{-1}(\rho_{12})$
$\sigma(z') = \frac{1}{\sqrt{n-3}}$
And the $1-\alpha$ interval for $\zeta$ is:
$z' \pm z(1-\alpha/2)\sigma(z')$
To get the intervals for $\rho_{12}$ we use the transformation
$\rho_{12}=\tanh{\zeta}$ and $r_{12}=\tanh{z'}$
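As a quick numerical check, the logarithmic form of the Fisher transformation and the `np.arctanh` form agree:

```python
import numpy as np

r12 = 0.61
z_log = 0.5 * np.log((1 + r12) / (1 - r12))  # logarithmic form
z_atanh = np.arctanh(r12)                    # inverse hyperbolic tangent form
print(np.isclose(z_log, z_atanh))  # → True
```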
z_prime = np.arctanh(r12)
z = stats.norm.ppf(1 - (alpha/2))
sigma_of_z = 1/np.sqrt(n - 3)
ub0=z_prime + z*sigma_of_z;lb0=z_prime - z*sigma_of_z;
r12_uplim = np.tanh(ub0);r12_lowlim = np.tanh(lb0)
r12_uplim0 = format(r12_uplim,'.3f'); r12_lowlim0=format(r12_lowlim, '.3f')
from IPython.display import display, Math, Latex
display(Latex(r'The $95\%$ confidence interval for $\rho_{12}$ is ' ))
display(Math( '['+str(r12_lowlim0)+r' ~,~ '+str(r12_uplim0)+']'))
r12_uplim0 = format(r12_uplim**2,'.3f'); r12_lowlim0=format(r12_lowlim**2, '.3f')
display(Latex(r'The $95\%$ confidence interval for $\rho^2_{12}$ is ' ))
display(Math( '['+str(r12_lowlim0)+r' ~,~ '+str(r12_uplim0)+']'))
$\rho^2_{12}$ may be interpreted as the proportionate reduction in the total variance of one variable associated with use of the other variable. Here, if we use the variable $Y_1$ (value of contract), then with 95% confidence we can explain between 20.7% and 53.2% of the variability in $Y_2$ (profit contribution generated by the contract). Because the correlation model treats both variables symmetrically, this statement remains true if the variables are interchanged.
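The interval computation above can be collected into a small reusable helper; this is a minimal sketch (the name `rho_confint` is ours, not from the text):

```python
import numpy as np
from scipy import stats

def rho_confint(r, n, alpha=0.05):
    """Fisher-z confidence interval for a correlation coefficient."""
    z_prime = np.arctanh(r)              # Fisher transformation of r
    z = stats.norm.ppf(1 - alpha / 2)    # standard normal quantile
    sigma = 1 / np.sqrt(n - 3)           # standard deviation of z'
    lo, hi = z_prime - z * sigma, z_prime + z * sigma
    return np.tanh(lo), np.tanh(hi)      # back-transform to the rho scale

lo, hi = rho_confint(0.61, 84)
print(round(lo, 3), round(hi, 3))  # → 0.455 0.729
```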
from read_data_fx import read_tb_data
data=read_tb_data(1,28)
# percentage of individuals with a high school diploma is Y1
Y1=data[:,1]
# crime rate is Y2
Y2=data[:,0]
# Pearson product-moment correlation coefficient function
def pearson_corrX(Y1,Y2):
    barY1=np.mean(Y1); barY2=np.mean(Y2)
    Y1minusbarY1=Y1-barY1; Y2minusbarY2=Y2-barY2
    r12=np.sum(Y1minusbarY1*Y2minusbarY2)/np.sqrt(np.sum(Y1minusbarY1**2)*np.sum(Y2minusbarY2**2))
    return r12
r12=pearson_corrX(Y1,Y2)
r12_0 = format(r12,'.4f');
display(Latex(r'Pearson product-moment correlation coefficient'))
display(Math('r_{12}= '+str(r12_0)))
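As a sanity check, the hand-rolled estimator can be compared against NumPy's built-in `np.corrcoef` on synthetic data (the arrays below are illustrative only, not the textbook data):

```python
import numpy as np

def pearson_corrX(Y1, Y2):
    # same closed-form estimator as above
    d1 = Y1 - np.mean(Y1)
    d2 = Y2 - np.mean(Y2)
    return np.sum(d1 * d2) / np.sqrt(np.sum(d1**2) * np.sum(d2**2))

rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = 0.5 * a + rng.normal(size=50)
# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r12
print(np.isclose(pearson_corrX(a, b), np.corrcoef(a, b)[0, 1]))  # → True
```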
The alternatives for the test are
$H_0: \rho_{12} = 0$ (crime rate and percentage of individuals with a high school diploma are independent)
$H_a: \rho_{12} \neq 0$ (crime rate and percentage of individuals with a high school diploma are not independent)
The test statistic
$t^* = \frac{r_{12}\sqrt{n-2}}{\sqrt{1-r^2_{12}}}$
Decision rules are
If $|t^*| \leq t(1-\alpha/2;n-2)$, conclude $H_0$
If $|t^*| > t(1-\alpha/2;n-2)$, conclude $H_a$
n=len(Y1)
tstar=(r12*np.sqrt(n-2))/np.sqrt(1-r12**2)
alpha=0.01
t=stats.t.ppf(1-alpha/2,n-2)
# formatted display of the computed values
tstar0=format(tstar,'.3f');t0=format(t,'.3f');
from IPython.display import display, Math, Latex
display(Math(r't^*='+str(tstar0)))
display(Math(r't='+str(t0)))
We observe $|t^*| \gt t(1-\alpha/2;n-2)$, therefore we conclude $H_a$, i.e., crime rate and percentage of individuals with a high school diploma are not independent.
# read the data
from read_data_fx import read_tb_data
data=read_tb_data(1,19)
#ACT test score (X)
X=data[:,1]
#GPA at the end of the freshman year (Y)
Y=data[:,0];
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(6,6))
plt.boxplot(X, sym = 'ko',whis = 1.5)
plt.xticks([1],['ACT scores'],fontsize=18)
plt.yticks(range(10,40,5),fontsize=18)
plt.ylim([10,37]);
There are no noteworthy features in the box plot of ACT scores (predictor variable). It appears symmetric with no outliers.
# linear regression function
def lin_reg(X,Y):
    barX=np.mean(X); barY=np.mean(Y)
    XminusbarX=X-barX; YminusbarY=Y-barY
    # slope and intercept from the normal equations
    b1=np.sum(XminusbarX*YminusbarY)/np.sum(XminusbarX**2)
    b0=barY-b1*barX
    Yhat=b0+b1*X
    e_i=Y-Yhat
    sse=np.sum(e_i**2)
    ssr=np.sum((Yhat-barY)**2)
    return {'Yhat':Yhat,'b0':b0,'b1':b1,'e_i':e_i,'sse':sse,'ssr':ssr}
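The closed-form estimates in `lin_reg` come from the normal equations; they can be cross-checked against `np.polyfit` on synthetic data (the `x_demo`/`y_demo` arrays are illustrative, not the textbook data):

```python
import numpy as np

rng = np.random.default_rng(1)
x_demo = rng.uniform(15, 35, size=40)  # ACT-like scores
y_demo = 2.1 + 0.04 * x_demo + rng.normal(scale=0.3, size=40)

# slope and intercept from the normal equations, as in lin_reg
b1 = np.sum((x_demo - x_demo.mean()) * (y_demo - y_demo.mean())) / np.sum((x_demo - x_demo.mean())**2)
b0 = y_demo.mean() - b1 * x_demo.mean()

# np.polyfit with degree 1 returns [slope, intercept]
slope, intercept = np.polyfit(x_demo, y_demo, 1)
print(np.isclose(b1, slope), np.isclose(b0, intercept))  # → True True
```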
#residuals
e_i = lin_reg(X,Y)['e_i']
# bin the residuals; "counts" avoids shadowing the sample size n
counts, bins = np.histogram(e_i, 25)
# bin centers
bincenters = 0.5*(bins[1:]+bins[:-1])
# figure size
plt.figure(figsize=(8,8))
j=0
for i in counts:
    y=range(0,int(i)); x=[bincenters[j]]*len(y); j=j+1
    plt.plot(x,y,'ro',ms=12)
plt.ylim([-2,16])
plt.xticks(np.arange(-3.0,1.5,1.0),fontsize=18) ;
plt.yticks(np.arange(-2.0,16,2.0),fontsize=18) ;
plt.xlabel('$e_i$',fontsize=28);
Most residuals are small and their distribution resembles a normal distribution. There are only a few large residuals, i.e., only a few points in the data show larger departures from the linear regression model.
Yhat = lin_reg(X,Y)['Yhat']
plt.figure(figsize=(10,10))
plt.axhline(y = 0, color = 'r', linewidth=5)
plt.plot(Yhat,e_i,'co',ms=12);
plt.xlabel(r'$\hat{Y_i}$',fontsize=28);
plt.ylabel('$e_i$',fontsize=28);
plt.yticks(np.arange(-3.0,3.0,1.0),fontsize=18) ;
plt.xticks(np.arange(2.7,3.5,0.3),fontsize=18) ;
plt.xlim([2.6,3.6])
plt.ylim([-3,3])
The majority of the residuals are evenly scattered around zero, indicating that the linear model is indeed appropriate for the given data. We also observe a few outliers.
def normal_prob_plot(e_i):
    n=len(e_i); sse=np.sum(e_i**2)
    mse_sqrt=np.sqrt(sse/(n-2.0))
    kr=stats.rankdata(e_i)
    # expected values under normality, plotting positions (k - 0.375)/(n + 0.25)
    e_ik=mse_sqrt*stats.norm.ppf((kr-0.375)/(n+0.25))
    return e_ik
e_ik=normal_prob_plot(e_i)
plt.plot(e_ik,e_i,'ko',ms=10)
Yhat_e=lin_reg(e_ik,e_i)['Yhat']
plt.plot(e_ik,Yhat_e,'r-',lw=1);
plt.title('Normal probability plot',fontsize=15)
plt.xlabel('Expected',fontsize=15);
plt.ylabel('Residual',fontsize=15);
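For genuinely normal residuals, the correlation between the residuals and their expected values under normality should be close to 1 (well above the critical values of Table B.6); a small sketch on synthetic data, not the textbook residuals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
e_sim = rng.normal(size=100)  # synthetic "residuals"

n = len(e_sim)
kr = stats.rankdata(e_sim)
# expected values under normality, plotting positions (k - 0.375)/(n + 0.25)
e_exp = stats.norm.ppf((kr - 0.375) / (n + 0.25))

# correlation between residuals and their expected values under normality
r = np.corrcoef(e_sim, e_exp)[0, 1]
print(round(r, 3))
```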
def pearson_corrX(Y1,Y2):
    barY1=np.mean(Y1); barY2=np.mean(Y2)
    Y1minusbarY1=Y1-barY1; Y2minusbarY2=Y2-barY2
    r12=np.sum(Y1minusbarY1*Y2minusbarY2)/np.sqrt(np.sum(Y1minusbarY1**2)*np.sum(Y2minusbarY2**2))
    return r12
r12=pearson_corrX(e_i,e_ik);
display(Math(r'\rho='+format(r12,'.3f')))
print('n =', len(X))
$H_0:$ Normal
$H_a:$ Not Normal
If $\rho \geq \rho_c$, conclude $H_0$; otherwise conclude $H_a$
$\rho_c$ is the critical value for the coefficient of correlation between ordered residuals and expected values under normality when the distribution of the error terms is normal (Table B.6 in the textbook).
$\rho_c = 0.987$ for $n=120$ at $\alpha=0.05$
Here $\rho = 0.974$ i.e., $\rho \lt \rho_c$ therefore we conclude $H_a$ i.e., residuals are not normally distributed at $5\%$ significance level.
Decision rule for Brown-Forsythe Test:
If $|t^*_{BF}| \leq t(1-\alpha/2;n-2)$, conclude the error variance is constant
If $|t^*_{BF}| > t(1-\alpha/2;n-2)$, conclude the error variance is not constant
where the test statistic is given by
$$t^*_{BF} = \frac{\bar{d_1}-\bar{d_2}}{s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$
idxn1=np.nonzero(X < 26)[0]; idxn2=np.nonzero(X >= 26)[0]
n1=len(idxn1); n2=len(idxn2); n=len(X)
X1=X[idxn1];X2=X[idxn2]
Y1=Y[idxn1];Y2=Y[idxn2]
e_i1=e_i[idxn1];e_i2=e_i[idxn2]
di1=abs(e_i1-np.median(e_i1))
di2=abs(e_i2-np.median(e_i2))
di1minusd1bar2=(di1-np.mean(di1))**2.0; di2minusd2bar2=(di2-np.mean(di2))**2;
s1=np.sum(di1minusd1bar2)+np.sum(di2minusd2bar2)
s=np.sqrt(s1/(n-2.0))
d12=np.mean(di1)-np.mean(di2);
tbfstar=d12/(s*np.sqrt(1.0/n1+1.0/n2))
alpha=0.01; t=stats.t.ppf(1-alpha/2.0,n-2);
t0=format(t,'.3f');
tbfstar0 = format(abs(tbfstar),'.4f')
display(Math(r'|t^*_{BF}|='+str(tbfstar0)))
display(Math(r't='+str(t0)))
We observe that $|t^*_{BF}| \leq t(1-\alpha/2;n-2)$, therefore we conclude that the error variance is constant.
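The steps above can be wrapped in a reusable function; this is a minimal sketch (the name `brown_forsythe` is ours). For two groups with identical residual spreads the statistic is exactly zero:

```python
import numpy as np

def brown_forsythe(e1, e2):
    """Brown-Forsythe t* statistic from residuals of two groups."""
    d1 = np.abs(e1 - np.median(e1))  # absolute deviations from group medians
    d2 = np.abs(e2 - np.median(e2))
    n1, n2 = len(d1), len(d2)
    # pooled variance of the deviations, with n - 2 degrees of freedom
    s2 = (np.sum((d1 - d1.mean())**2) + np.sum((d2 - d2.mean())**2)) / (n1 + n2 - 2.0)
    return (d1.mean() - d2.mean()) / np.sqrt(s2 * (1.0/n1 + 1.0/n2))

e = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(brown_forsythe(e, e))  # → 0.0
```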