4(1) Example of Pearson Correlation

Research Question: What is the pattern of relationships between eleven scientific fields of India's cooperation links with foreign countries?

Methodology: Pearson correlation

Dataset: COOP.DAT

SYNTAX
$RUN PEARSON
$FILES
PRINT = PEARSON.LST
DICTIN = COOP.DIC
DATAIN = COOP.DAT
$SETUP
PROTOTYPE FOR PEARSON PROGRAM
BADDATA=MD1 -
MDHANDLING=CASE -
ROWVARS=(V1-V11) -
PRINT=(DICT,COVA,PAIR,XPRODUCTS)
WRITE=(CORR)
Extract from Computer Output

After filtering, 105 cases were read from the input data file.
1 case contained illegal characters and was treated according to the BADDATA specification.
Number of processed cases: 104

 Variable  Adjusted    Mean     S.D.     Mean      S.D.    T-test   Correlation coeff.
 pair      Wt. sum       X        X        Y         Y        T          R(i,j)
 1 - 2     104.        2.365   10.650   32.769   102.245   17.469        .8657
 1 - 3     104.        2.365   10.650    8.317    32.037   20.092        .8935

Unpaired means and standard deviations ***

 Variable  Variable        Adjusted                                  Mean      S.D.
 name      no.      N      Wt. sum   Sum X          Sum X2             X         X
 v1        1        104    104       2.4600000E+02  1.2264000E+04    2.365    10.650
 v2        2        104    104       3.4080000E+03  1.1884340E+06   32.769   102.245

Correlation matrix ***

 VAR     1      2      3      4      5      6      7      8      9     10
 v2   .8657
 v3   .8935  .9560
 v4   .8613  .9239  .9596
 v5   .8978  .9533  .9773  .9673
 v6   .7940  .8523  .8718  .9149  .8862
 v7   .8513  .9252  .9430  .9552  .9536  .9034
 v8   .9084  .9465  .9865  .9585  .9767  .8836  .9600
 v9   .9565  .9361  .9637  .9402  .9596  .8738  .9436  .9794
 v10  .9594  .8534  .9163  .8551  .9033  .7828  .8574  .9360  .9520
 v11  .8910  .9501  .9649  .9410  .9658  .8727  .9608  .9728  .9543  .9157
Cross Products Matrix ***

 VAR     1      2      3      4      5      6      7      8      9     10
 v2   .8657
 v3   .8935  .9560
 v4   .8613  .9239  .9596
 v5   .8978  .9533  .9773  .9673
 v6   .7940  .8523  .8718  .9149  .8862
 v7   .8513  .9252  .9430  .9552  .9536  .9034
 v8   .9084  .9465  .9865  .9585  .9767  .8836  .9600
 v9   .9565  .9361  .9637  .9402  .9596  .8738  .9436  .9794
 v10  .9594  .8534  .9163  .8551  .9033  .7828  .8574  .9360  .9520
 v11  .8910  .9501  .9649  .9410  .9658  .8727  .9608  .9728  .9543  .9157

Covariance Matrix (with diagonal) ***

 VAR      1        2        3        4        5        6        7        8        9       10       11
 v1   112.328
 v2   933.613 10353.430
 v3   301.913 3101.362 1016.505
 v4   134.171 1381.845  449.703  216.055
 v5   143.408 1461.800  469.609  214.278  227.124
 v6    63.132  650.611  208.540  100.898  100.205   56.288
 v7   321.283 3352.261 1070.613  499.984  511.739  241.349 1268.047
 v8   322.184 3222.930 1052.538  471.488  492.570  221.830 1144.002 1119.842
 v9   203.893 1915.839  617.986  277.958  290.881  131.858  675.861  659.229  404.559
 v10   50.534  431.523  145.185   62.463   67.656   29.186  151.738  155.664   95.163   24.697
 v11   82.038  839.904  267.279  120.173  126.452   56.885  297.240  282.837  166.760   39.538   75.481
INTERPRETATION

IDAMS reports that 105 cases were read; one case which had illegal characters was treated as bad data.

 

Descriptive statistics are given for all pairs of variables, with a pairwise comparison of means by t-test, followed by descriptive statistics of the single variables.

  Correlation matrix

Matrix of cross products

The elements of the matrix are computed by the following formula:

Cross product (X,Y) = Σ (X − X̄) × (Y − Ȳ)

Covariance Matrix

The elements of the matrix are computed by the following formula:

Covariance (X,Y) = Σ (X − X̄) × (Y − Ȳ) / (n − 1)

The Pearson, Covariance and Cross products measures are related. If each entry of the Cross product matrix is divided by n – 1, the result is a Covariance matrix. If each entry of the Covariance matrix is divided by the product of the standard deviations of the two variables, the result is a Correlation matrix.
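The chain of relationships described above can be checked numerically. The sketch below uses synthetic data (the COOP.DAT values themselves are not reproduced here), so the vectors and figures are illustrative only:

```python
import numpy as np

# Synthetic stand-in for two of the eleven variables; the actual COOP.DAT
# values are not reproduced here, so all numbers below are illustrative.
rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, size=104)
y = 2.5 * x + rng.normal(0.0, 4.0, size=104)
n = len(x)

# Cross product: sum of products of deviations from the means.
cross = float(np.sum((x - x.mean()) * (y - y.mean())))

# Dividing each cross product by n - 1 gives the covariance.
cov = cross / (n - 1)

# Dividing the covariance by the product of the two standard deviations
# gives the Pearson correlation coefficient.
r = cov / (x.std(ddof=1) * y.std(ddof=1))

# The chain agrees with NumPy's own covariance and correlation routines.
assert np.isclose(cov, np.cov(x, y)[0, 1])
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```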

4(2) Example of Chi-square

Research Question: Are there any differences in the distribution of academics of different ranks in different types of institutions? In other words, is there any association between the rank of an academic and the type of institution?

Methodology: Chi-square

Dataset: ANJU.DAT

SYNTAX
$RUN TABLES
$FILES
PRINT = MYTAB.LST
DICTIN = ANJU.DIC
DATAIN = ANJU.DAT
$SETUP
EXAMPLES OF TABLES
PRINT=DICT
TABLE
SR=V14 C=V9 CELL=(FREQ,ROWP,COLP,TOTP)-
STAT=(CHI,CV) MDHANDL=ALL
EXTRACT FROM COMPUTER OUTPUT

The data matrix has 2 variables and 1073 cases.

Row Variable number: 14

Column Variable number: 9

sv:inst type

v204:rank

 
         |       1|       2|       3|       9|
         |prof    |reader  |lecturer|        |    Total  Revised
 ________|________|________|________|________|
        1|        |        |        |        |
 type1   |      97|     126|     151|       3|   377         374
 Row    %|   25.94|   33.69|   40.37|     .00|   100.00
 Col    %|   26.43|   33.51|   48.40|     .00|    35.45
 Tot    %|    9.19|   11.94|   14.31|     .00|    35.45
 ________|________|________|________|________|
        2|        |        |        |        |
 type2   |     147|      96|      42|      12|      297      285
 Row    %|   51.58|   33.68|   14.74|     .00|   100.00
 Col    %|   40.05|   25.53|   13.46|     .00|    27.01
 Tot    %|   13.93|    9.10|    3.98|     .00|    27.01
 ________|________|________|________|________|
        3|        |        |        |        |
 type3   |      41|      17|       2|       2|       62       60
 Row    %|   68.33|   28.33|    3.33|     .00|   100.00
 Col    %|   11.17|    4.52|     .64|     .00|     5.69
 Tot    %|    3.89|    1.61|     .19|     .00|     5.69
 ________|________|________|________|________|
        4|        |        |        |        |
 type4   |      82|     137|     117|       1|      337      336
 Row    %|   24.40|   40.77|   34.82|     .00|   100.00
 Col    %|   22.34|   36.44|   37.50|     .00|    31.85
 Tot    %|    7.77|   12.99|   11.09|     .00|    31.85
 ________|________|________|________|________|
Totals        367      376      312       18      1073
 Col    %   100.00   100.00   100.00      .00
 Tot    %    34.79    35.64    29.57      .00    100.00
Revised       367      376      312        0               1055

Column          9 is missing data and was deleted
  Chi square 118.50 Cramer's V .24 Contingency coefficient .32 Degrees of freedom 6 Adjusted n 1055
INTERPRETATION

IDAMS reports that there are two variables and 1073 cases.

Row variable is type of institution and column variable is rank.

  Cross tabulation of ranks of academics and types of institutions.
  The value of Chi-square is statistically highly significant (p < .001), which means that the association between categories of rank and type of institution is not random.
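The printed statistics can be reproduced from the frequency table itself. A sketch, assuming SciPy is available, using the frequencies with the missing-data column (code 9) already deleted:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Frequencies from the cross-tabulation above (missing-data column dropped);
# rows are institution types 1-4, columns are prof / reader / lecturer.
observed = np.array([
    [ 97, 126, 151],
    [147,  96,  42],
    [ 41,  17,   2],
    [ 82, 137, 117],
])

chi2, p, dof, expected = chi2_contingency(observed)

n = observed.sum()                                   # adjusted n = 1055
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

# chi2 comes out near 118.5 with 6 degrees of freedom and p far below .001;
# Cramer's V is near .24, in line with the printed output.
```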

4(3) Example of Oneway Analysis of Variance

Research Question: How does the (time) involvement of an academic scientist in teaching vary with his rank?

Methodology: Oneway Analysis of Variance

Dataset: ANJU.DAT

SYNTAX
$RUN ONEWAY
$FILES
PRINT = ONE_WAY.LST
DICTIN = ANJU.DIC
DATAIN = ANJU.DAT
$SETUP
EFFECT OF RANK ON INVOLVEMENT IN TEACHING
BADDATA=MD1 -
PRINT=CDICT
DEPVARS=(V2) CONVARS=(V9)
Extract from Computer Output

After filtering, 1073 cases were read from the input data file.
3 cases contained illegal characters and were treated according to the BADDATA specification.

 

Control variable = var 9 v204:rank

Depend. variable = var 2 v262:teaching

 
 Code  Label     N     Weight-sum    %     Mean    S.D.(estim.)  Sum of X        %     Sum of X-square
 1     prof      363   363          35.0   34.824  16.076        .1264100E+05   29.0   .5337650E+06
 2     reader    366   366          35.3   42.440  16.957        .1553300E+05   35.7   .7641770E+06
 3     lecturer  309   309          29.8   49.693  18.412        .1535500E+05   35.3   .8674430E+06
 Total           1038  1038        100.0   41.935  18.107        .4352900E+05  100.0   .2165385E+07
 
Total sum of squares = .3399767E+06
For 3 groups , Eta = .3301004E+00
For 3 groups , Etasq = .1089663E+00
For 3 groups , Eta(adj) = .3274820E+00
For 3 groups , Etasq(adj) = .1072445E+00
Between means sum of squares = .3704599E+05
Within groups sum of squares = .3029307E+06
F( 2,1035) = 63.286
INTERPRETATION
IDAMS reports that 1073 cases were read, out of which 1038 were used in the analysis (3 cases with illegal characters and 32 cases with missing data were treated as bad data).

  Specification: Dependent variable = Time spent on teaching; Control variable = Rank (3 categories: PROFESSOR, READER, LECTURER)
  Descriptive statistics
 

Eta indicates the strength of relationship between the dependent variable and the control variable (Eta=1 signifies perfect relationship and Eta=0 signifies no relationship).

Eta adjusted: Eta adjusted for degrees of freedom.

F ratio is statistically highly significant (p < .001), so we can conclude that the involvement of an academic scientist in teaching varies with his rank.
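The sums of squares, eta squared, and the F ratio can be reconstructed from the per-group sums printed in the table above; a sketch of the arithmetic, not a rerun of ONEWAY:

```python
# Group summaries for prof, reader and lecturer from the output above.
n   = [363, 366, 309]
sx  = [12641.0, 15533.0, 15355.0]       # Sum of X per group
sx2 = [533765.0, 764177.0, 867443.0]    # Sum of X-square per group

N  = sum(n)
cf = sum(sx) ** 2 / N                   # correction for the grand mean

total_ss   = sum(sx2) - cf
between_ss = sum(s * s / k for s, k in zip(sx, n)) - cf
within_ss  = total_ss - between_ss

eta_sq = between_ss / total_ss          # strength of the relationship
f = (between_ss / (len(n) - 1)) / (within_ss / (N - len(n)))

# total_ss comes out near .3399767E+06, eta_sq near .1090, and
# F(2, 1035) near 63.3, in line with the printed output.
```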

4(4) Example of Simple Linear Regression

Research Question: How does the involvement of an academic scientist in teaching affect his involvement in research?

Methodology: Simple linear regression

Dataset: ANJU.DAT

SYNTAX
$RUN REGRESSN
$FILES
PRINT = ANJU.LST
DICTIN = ANJU.DIC
DATAIN = ANJU.DAT
$SETUP
REGRESSN OF ACADEMIC INVOLVEMENT
BADDATA=MD1 -
  MDHANDLING=50 -
  PRINT=(DICT,MATRIX)
  DEPVAR=V3 -
  VARS=(V2)
EXTRACT FROM COMPUTER OUTPUT

After filtering, 1073 cases were read from the input data file.
3 cases contained illegal characters and were treated according to the BADDATA specification.
Number of variables = 2
Number of cases = 1055
 

General statistics

 Variable                              Standard        Range
 number    Sum           Average       deviation     Max       Min     Variable name
 2         44103.00000   41.80379      18.08305      90.0000   .0000   v262:teaching
 3         23662.00000   22.42844      12.45037     100.0000   .0000   v263:research
 

Total correlation matrix, R(i,j)

 Variable      2         3
 2          1.00000
 3          -.34391   1.00000
 

Dependent variable is V3 v263:research

 Standard error of estimate                 11.70
 F ratio for the regression                141.253
 Multiple correlation coefficient           .34391
 Fraction of explained variance (RSQD)      .11828   adjusted .11744
 Determinant of the correlation matrix     1.0000
 Residual degrees of freedom (N-K-1)       1053
 Constant term                             32.327
 
 Var. no.  B      Sigma(B)  Beta    Sigma(Beta)  Partial RSQD  Marg RSQD  T-ratio   Cov. ratio  Variable name
 2         .2368  .0199     .3439   .0289        .1183         .1183      11.8850   .0000       v262:teaching
INTERPRETATION
IDAMS reports that 1073 cases were read, out of which 1055 were used in the analysis; 3 cases with illegal characters and 15 cases with missing data were excluded.

 

Specification: Number of variables = 2; Dependent variable = Time on research; Independent variable = Time on teaching

  Descriptive statistics of both dependent and independent variables.
  Correlation matrix shows that the two variables are correlated negatively.
 

Standard error of the estimate is a measure of the reliability of the estimating equation, indicating the variability of the observed points around the regression line; in other words, the extent to which the observed values differ from their predicted values on the regression line.

F ratio in the analysis of variance table is used to test the hypothesis that the slope (b) of the regression line is 0. The F ratio is large when the independent variable explains the variation in the dependent variable. There is a significant negative linear relationship between time spent on research and time spent on teaching (F ratio = 141.253; degrees of freedom = 1, 1053; p < .001).

Multiple correlation coefficient (Multiple R) is the correlation between the dependent variable (time spent on research) and the predicted value. The greater the value of Multiple R, the greater the agreement between the predicted and observed values.

Fraction of explained variance (RSQD) can be interpreted as the proportion of the variation in the dependent variable explained by the regression line. It is also called the coefficient of determination. Both Multiple R and the coefficient of determination are indicators of the overall effectiveness of the linear regression. If R2 = 1, the regression line is a perfect estimator; if R2 = 0, there is no linear relationship between X and Y.

Determinant of the correlation matrix is the determinant of the correlation matrix of the predictors. It represents as a single number the generalized variance in a set of variables, and varies from 0 to 1. However, it has no meaning in the case of simple linear regression.

Residual degrees of freedom: If the constant is not constrained to be zero, df = N - p - 1, where N is the total number of observations and p is the number of predictors.

Constant term: This is the constant in the regression equation.

 

B is the regression coefficient, i.e. the slope of the regression line.

Sigma B is the standard error of the regression coefficient, which is a measure of the variability of the sample regression coefficient around the population regression coefficient. It is an indicator of the reliability of the coefficient; smaller values indicate greater reliability.

Beta is the standardized regression coefficient, which is independent of the scale of measurement. In the case of simple regression, Beta is equal to Multiple R. Sigma Beta is the standard error of Beta.

RSQD is the fraction of the explained variance. Marginal RSQD: Since there is only one predictor, Marginal RSQD ( .1183) is equal to RSQD (.1183).

T ratio is used to test the hypothesis that B = 0. T ratio = B / Sigma B. Its significance can be tested from the table of t with N-p-1 degrees of freedom. Here, t = 11.885, df = 1053, which is highly significant (p < .0001).

Covariance ratio of a variable is equal to the square of Multiple correlation coefficient with other independent variables in the regression equation. It has no meaning in the case of simple linear regression.
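The printed quantities are tied together by a few identities. The sketch below recomputes B, the constant term, and RSQD from the descriptive statistics and the correlation above; note that the slope is actually negative (the correlation is -.34391), which the sign-less coefficient table does not show:

```python
# Descriptive statistics and correlation taken from the output above.
r = -0.34391                          # correlation of teaching (X) with research (Y)
mean_x, sd_x = 41.80379, 18.08305     # v262:teaching
mean_y, sd_y = 22.42844, 12.45037     # v263:research

b = r * sd_y / sd_x                   # slope of the regression line, near -.2368
constant = mean_y - b * mean_x        # constant term, near 32.327
rsqd = r ** 2                         # fraction of explained variance, near .1183

sigma_b = 0.0199                      # standard error of B from the output
t_ratio = b / sigma_b                 # near -11.9; its square is close to the F ratio
```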

4(5) Example of Simple Linear Regression with Dummy Variables

Research Question: What is the effect of the status of a scientist on the time devoted to administration?

Methodology: Simple linear regression

Dataset: ICSOPRU (R2CM.DAT)

SYNTAX
$RUN REGRESSN
$FILES
PRINT = DUM.LST
DICTIN = R2R3CM.DIC
DATAIN = R2CM.DAT
$SETUP
INCLUDE V1=360
DUMMY REGRESSION ICSOPRU DATA
BADDATA=MD1 -
MDHANDLING=50 -
CATE -
PRINT=(DICT,MATRIX)
V201(1,2)
DEPVAR=V222 -
VARS=(V201)
EXTRACT FROM COMPUTER OUTPUT

After filtering, 1151 cases were read from the input data file.
Number of variables = 3
Number of cases = 1149

 

General statistics

 Variable                       Standard       Range
 number    Sum      Average     deviation    Max       Min     Variable name
 201-1      239.    .20801      .40606        1.0000   .0000   Rank_1
 201-2      605.    .52654      .49951        1.0000   .0000   Rank_2
 222       7837.    6.82071     9.27874      75.0000   .0000   % Administrative work
 

Total correlation matrix, R(i,j)

 Variable    201-1     201-2     222
 201-1     1.00000
 201-2     -.54045   1.00000
 222        .52016   -.23127   1.00000
 

Standard regression

Dependent variable is V222 J1C: % ADMINISTRATIVE WK

 Standard error of estimate                  7.913
 F ratio for the regression                216.336
 Multiple correlation coefficient           .52352   adjusted .52231
 Fraction of explained variance (RSQD)      .27407   adjusted .27281
 Determinant of the correlation matrix      .70791
 Residual degrees of freedom (N-K-1)       1146
 Constant term                             3.4787
 
 
Var.no.
B
Sigma
Beta
Sigma
p
Marg
T-ratio
Cov.
Variable name
   
(B)
 
(Beta)
RSQD
RSQD
 
Ratio
 
201- 1
12.7556
.6835
.5582
.0299
.2331
.2206
18.6611
.2921
CM POSITION IN UNIT
201- 2
1.3081
.5557
.0704
.0299
.0048
.0035
2.3541
.2921
CM POSITION IN UNIT
INTERPRETATION

IDAMS reports that 1151 cases were read after filtering, out of which 1149 were used in the regression analysis; two cases with missing data were excluded.

The independent variable Rank is categorized into two dummy variables (Rank_1 = Head, Rank_2 = Scientist). Thus, the total number of variables = 3 (one dependent and two independent dummy variables).

Dependent variable is the percentage of work time spent on administrative work.

  Descriptive statistics of both dependent and independent variables.
  Total Correlation Matrix: The elements of this matrix are computed directly from the matrix of residual sums of squares and cross products.
 

Standard error of estimate is the standard deviation of the residuals.

F ratio in the Analysis of Variance table is used to test the hypothesis that b1 = b2 = 0. The F ratio is large when the independent variables explain the variation in the dependent variable. There is a significant linear relationship between the rank of a scientist and the time devoted to administrative work (F ratio = 216.336; degrees of freedom = 2, 1146; p < .001). This implies that rank does affect the time a scientist devotes to administrative work.

Multiple correlation coefficient (Multiple R) is the correlation between the dependent variable (time spent on administrative work) and the predicted value. The greater the value of Multiple R, the greater the agreement between the predicted and observed values. Here, the value of Multiple R (.52352) is fairly large.

Fraction of explained variance (RSQD) can be interpreted as the proportion of the variation in the dependent variable explained by the predictor variables. It is also called the coefficient of determination and is equal to the square of Multiple R. The adjusted fraction of explained variance is Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of cases and p is the number of predictors. Both Multiple R and the coefficient of determination are indicators of the overall effectiveness of the linear regression.
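As a quick check, the adjusted fraction of explained variance can be recomputed from the printed RSQD with the usual degrees-of-freedom correction; a sketch, with n = 1149 cases and p = 2 predictors taken from the output above:

```python
n, p = 1149, 2          # cases and dummy predictors from the output above
rsqd = 0.27407          # printed fraction of explained variance

# Degrees-of-freedom correction for the number of predictors.
adj_rsqd = 1 - (1 - rsqd) * (n - 1) / (n - p - 1)

# adj_rsqd comes out near .2728, matching the printed adjusted value .27281.
```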

Determinant of the correlation matrix is the determinant of the correlation matrix of the predictors. It represents as a single number the generalized variance in a set of variables, and varies from 0 to 1. Determinants near zero indicate that some or all of the predictors are highly correlated. Here, the determinant of the correlation matrix (.70791) is quite large, which indicates that the predictor variables (i.e. the categories of Rank) are not highly correlated. Note that high correlation among the predictors can threaten computational accuracy, since it inflates the standard errors of the regression coefficients, which in turn attenuates the associated F statistics.

Residual degrees of freedom: If the constant is not constrained to be zero, df = N - p - 1, where N is the total number of observations and p is the number of predictors.

 

Regression coefficient for Rank_1 is statistically highly significant (t = 18.66, df = 1146, p < .001).

Partial R squared (RSQD): This is the squared partial correlation between the predictor (Rank_1) and the dependent variable, with the influence of the other variable (Rank_2) eliminated. The squared partial correlation coefficient measures that part of the variance in the dependent variable that is not explained by the other predictors. Here, 23.31% of the variance in the dependent variable is explained by the dummy variable Rank_1.

Regression coefficient for the dummy variable Rank_2 is also statistically significant (t = 2.354, df = 1146, p < .05). The value of the squared partial correlation indicates that the dummy variable Rank_2 explains only 0.48% of the variance.
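With dummy coding, the fitted equation simply reproduces the group means of the dependent variable. The sketch below plugs the printed coefficients back into the equation; the values come from the coefficient table above:

```python
constant = 3.4787        # predicted % admin time for the reference category
b_rank1  = 12.7556       # coefficient of Rank_1 (Head)
b_rank2  = 1.3081        # coefficient of Rank_2 (Scientist)

def predicted_admin_time(rank1: int, rank2: int) -> float:
    """Predicted % time on administration for a given dummy pattern."""
    return constant + b_rank1 * rank1 + b_rank2 * rank2

head      = predicted_admin_time(1, 0)   # 16.2343
scientist = predicted_admin_time(0, 1)   #  4.7868
reference = predicted_admin_time(0, 0)   #  3.4787
```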