### 1.4 Statistical Techniques available in IDAMS

Cluster Analysis

A procedure for partitioning a set of objects into groups or clusters in such a way that profiles of objects in the same cluster are very similar, whereas the profiles of objects in different clusters are quite distinct. The number and characteristics of clusters are not known a priori and are derived from the data.

(Module: CLUSFIND)

Correlation Analysis

Correlation is a measure of the relationship between two or more variables. The most commonly used type of correlation coefficient is Pearson's r, also called linear or product moment correlation. It is essential that the variables are measured on at least interval scales.

(Module: PEARSON)
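
As an illustration of the product moment formula, the following minimal Python sketch computes Pearson's r for two interval-scaled variables. It is not the PEARSON module's implementation; the function name `pearson_r` and the sample data are assumptions made for this example only.

```python
import math

def pearson_r(x, y):
    """Pearson product moment correlation for two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: heights (cm) and weights (kg) of five individuals.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 85]
r = pearson_r(heights, weights)
print(round(r, 3))
```

A value of r close to +1 or -1 indicates a strong linear relationship; a value close to 0 indicates little or no linear relationship.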

Discriminant Analysis

A technique for classifying objects into one of two or more alternative groups (or populations) on the basis of a set of measurements (i.e. variables). The populations are known to be distinct and an object can belong to only one of them. The technique can also be used to identify which variables contribute to making the classification. Thus, the technique can be used for description as well as for prediction.

(Module: DISCRAN)

Principal Components Analysis

Principal components analysis (PCA) is performed to simplify the description of a set of interrelated variables in a data matrix. PCA transforms the original variables into new uncorrelated variables, called principal components. Each principal component is a linear combination of the original variables. The amount of information conveyed by a principal component is its variance. The principal components are derived in decreasing order of variance. Thus, the most informative principal component is the first, and the least informative is the last.
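
For two variables, the ordering of components by variance can be checked by hand. The sketch below is a minimal Python illustration, not the FACTOR module's implementation; the function name `pca_2d` and the sample data are assumptions. It obtains the variances of the two principal components as the eigenvalues of the sample covariance matrix.

```python
import math

def pca_2d(x, y):
    """Return the variances of the two principal components, largest first."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample covariance matrix [[a, b], [b, c]] of the two variables.
    a = sum((u - mean_x) ** 2 for u in x) / (n - 1)
    c = sum((v - mean_y) ** 2 for v in y) / (n - 1)
    b = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return half_trace + radius, half_trace - radius

# Hypothetical, strongly interrelated measurements.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.5, 4.5, 7.0]
first, second = pca_2d(x, y)
# Share of the total variance conveyed by the first component.
share_first = first / (first + second)
print(round(share_first, 3))
```

The two component variances add up to the total variance of the original variables, so the share computed at the end shows how much of the information the first component alone conveys.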

Factor analysis is similar to principal components analysis in that it is a technique for examining the interrelationships among a set of variables, but its objective is somewhat different. Classical factor analysis is viewed as a technique for clarifying the underlying dimensions or factors that explain the pattern of correlations among a much larger set of variables. This technique is applied to (i) reduce the number of variables, and (ii) detect the structure in the relationships among variables.

Modern factor analysis, implemented in IDAMS, aims to represent geometrically the information in a data matrix in a low-dimensional Euclidean space and to provide related statistics. The fundamental goal is to highlight relations among elements (variables/individuals), which are represented by points in graphical displays (called factorial maps), and to reveal the structural features of the data matrix. In these maps, both variables and individuals can be displayed. Since the number of individuals is often very large, they are represented by the centers of gravity of their categories.

Correspondence analysis (CA) is a multivariate technique for exploring cross-tabular data by converting them into graphical displays, called factorial maps, and related numerical statistics. CA is primarily intended to reveal features in the data rather than to test hypotheses about the underlying processes that generate the data. Correspondence analysis and principal components analysis are used under different circumstances. PCA uses covariances or correlations (Euclidean metrics) for data reduction and is therefore applicable to continuous measurements. CA, on the other hand, uses chi-square metrics and is therefore applicable to contingency tables (cross-tabulations). By extension, correspondence analysis can also be applied to tables with binary coding.

The module can handle active as well as passive variables. Active variables are those that participate in the determination of the factorial axes. Passive variables do not participate in the determination of the factorial axes, but are projected onto them.

(Module: FACTOR)

Multidimensional Scaling (MDS)

Multidimensional scaling is an exploratory data analysis technique that transforms the proximities (or distances) between each pair of objects (or variables) in a given data set into comparable Euclidean distances. MDS produces a spatial representation of the objects (usually a two-dimensional map) that maximizes the fit between the proximities for each pair of objects and the Euclidean distances between them in the spatial representation. The greater the proximity between two objects, the closer together they are placed on the map. Like factor analysis, the main concern of MDS is to reveal the structure of relationships among the objects.

(Module: MDSCAL)

Multiple Classification Analysis (MCA)

MCA is a technique for examining the interrelationships between several predictor variables and a dependent variable. The technique can handle predictors measured at no better than the nominal level, and interrelationships of any kind among predictors or between a predictor and the dependent variable. The dependent variable may be interval-scaled or dichotomous.

(Module: MCA)

Analysis of Variance

This statistical technique assesses the effect of an independent or 'control' categorical variable (factor) upon a continuous dependent variable.

(Module: ONEWAY)
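
The effect of the factor is usually judged by comparing the variation between groups with the variation within groups. The following minimal Python sketch computes the resulting F ratio; it is an illustration under simple assumptions, not the ONEWAY module's implementation, and the function name `one_way_f` and the data are hypothetical.

```python
def one_way_f(groups):
    """Return the F ratio for a one-way layout.

    `groups` is a list of lists, one list of dependent-variable values
    per category of the control variable.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the individual values around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical scores for three categories of a control variable.
scores = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio = one_way_f(scores)
print(f_ratio)
```

A large F ratio, judged against the F distribution with k - 1 and n - k degrees of freedom, suggests that the group means differ.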

POSCOR (Ranking program based on partially ordered sets)

POSCOR is a procedure for ranking objects when more than one variable is considered simultaneously in the rank-ordering. The procedure offers the possibility to give each object belonging to a given set its relative position in probabilistic terms vis-à-vis the other objects in the same set. The position of each object is measured by a score, called the POSCOR score.

(Module: POSCOR)

Rank

The procedure allows the aggregation of individual opinions, expressing the choice of priorities, ranking of alternatives or selection of preferences. It determines a reasonable rank order of alternatives, using preference data as input and three different ranking procedures – two based on fuzzy logic and one based on classical logic.

(Module: RANK)

Regression Analysis

A technique for exploring the relationship between a dependent variable and one or more independent variables. Linear regression explores the relationship that can be described by straight lines or their generalization to many dimensions.

(Module: REGRESSN)
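
For a single independent variable, the straight line is found by least squares. The sketch below is a minimal Python illustration of that case only, not the REGRESSN module's implementation; the function name `fit_line` and the data are assumptions.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [0, 1, 2, 3]
y = [1, 3, 5, 7]
intercept, slope = fit_line(x, y)
print(intercept, slope)
```

With several independent variables the same principle applies, but the coefficients are obtained by solving a system of equations rather than from a single ratio.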

Search

Search is a binary segmentation procedure for developing a predictive model for dependent variable(s). It divides the sample through a series of binary splits into a mutually exclusive series of subgroups such that, at each binary split, the two new subgroups reduce the predictive error more than a split into any other pair of subgroups.

(Module: SEARCH)
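
A single binary split can be sketched as follows. This minimal Python example is not the SEARCH module's implementation: it assumes one ordered predictor, one interval-scaled dependent variable, and the within-group sum of squares as the measure of predictive error. The function names and data are hypothetical.

```python
def sum_of_squares(values):
    """Sum of squared deviations of the values from their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_binary_split(x, y):
    """Return (threshold, error) for the split x <= threshold / x > threshold
    that leaves the smallest within-group sum of squares of y.
    Returns None when x takes only one distinct value."""
    best = None
    # Every distinct value except the largest is a candidate threshold.
    for threshold in sorted(set(x))[:-1]:
        left = [b for a, b in zip(x, y) if a <= threshold]
        right = [b for a, b in zip(x, y) if a > threshold]
        error = sum_of_squares(left) + sum_of_squares(right)
        if best is None or error < best[1]:
            best = (threshold, error)
    return best

# Hypothetical data: the dependent variable jumps between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [10, 11, 9, 30, 31, 29]
split = best_binary_split(x, y)
print(split)
```

Repeating the same search within each resulting subgroup yields the series of binary splits described above.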

Typology

A clustering procedure for large data sets that can handle nominal, ordinal and interval-scaled variables simultaneously. The procedure can handle active and passive variables. Active variables are those that take part in the construction of the typology. Passive variables do not take part in the construction of the typology, but their average statistics are computed for each typology group.

(Module: TYPOL)

Non-parametric Statistics

Non-parametric statistics allow the testing of hypotheses even when certain classical assumptions, such as interval-scale measurement or normal distribution, are not met. In research practice, these classical assumptions are often strained. There is at least one non-parametric equivalent for each general type of parametric test. Non-parametric tests generally fall into the following groups:

• Tests of differences between groups
• Tests of differences between variables
• Tests of relationships between variables

Tests of Differences between Groups

Mann-Whitney U-Test: A non-parametric test equivalent to the t-test. It tests whether two independent samples are from the same population and requires an ordinal level of measurement. U is the number of times a value in the first group precedes a value in the second group when the values are arranged in ascending order.
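
The definition of U can be turned directly into a count over all pairs of values. The following minimal Python sketch does this; it is not the TABLES module's implementation. Counting a tied pair as one half is a common convention and an assumption here, as are the function name and the data.

```python
def mann_whitney_u(first, second):
    """Number of (first, second) pairs in which the first-group value
    precedes the second-group value; a tied pair counts as one half."""
    u = 0.0
    for a in first:
        for b in second:
            if a < b:
                u += 1.0
            elif a == b:
                u += 0.5  # assumption: ties are split evenly
    return u

# Hypothetical ordinal scores for two independent samples.
group_1 = [1, 3, 5]
group_2 = [2, 4, 6]
u_12 = mann_whitney_u(group_1, group_2)
u_21 = mann_whitney_u(group_2, group_1)
print(u_12, u_21)
```

The two counts always add up to the product of the two sample sizes, so either one determines the other.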

Relationships between Variables

Non-parametric equivalents of the correlation coefficient are Spearman's correlation coefficient Rho, Kendall's Tau and Gamma.

Spearman's Correlation Coefficient is a commonly used non-parametric measure of correlation between two ordinal variables. It can be thought of as the regular product moment correlation coefficient computed on ranks, and it can be interpreted in the same way, in terms of the proportion of variability accounted for.
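
Spearman's Rho can be obtained by replacing each value with its rank and applying the product moment formula to the ranks. The sketch below is a minimal Python illustration, not the TABLES module's implementation; assigning tied values the average of their ranks is a common convention and an assumption here, as are the function names and data.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Product moment correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y increases with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
rho = spearman_rho(x, y)
print(rho)
```

Because only the ordering of the values matters, any strictly increasing relationship gives a Rho of 1, whether or not it is linear.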

Kendall's Tau is a non-parametric measure of association for ordinal or ranked variables. It is equivalent to Spearman's Rho with regard to the underlying assumptions. However, Spearman's Rho and Kendall's Tau are not identical in magnitude, since their underlying logic and computational formulae are quite different. Two different variants of Tau are computed: Tau b and Tau c. These measures differ only as to how tied ranks are handled. In most cases, these values are very similar, and when discrepancies occur, it is probably safer to interpret the lower value.

Another non-parametric measure of correlation is Gamma. In terms of the underlying assumptions, Gamma is equivalent to Spearman's Rho or Kendall's Tau. In terms of interpretation and computation, it is more similar to Kendall's Tau than to Spearman's Rho. The Gamma statistic is, however, preferable to Spearman's Rho and Kendall's Tau when the data contain many tied observations.
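
Both Kendall's Tau and Gamma are built from counts of concordant and discordant pairs of observations, and they differ in how tied pairs enter the denominator. The following minimal Python sketch computes Tau b and Gamma from these counts; it is not the TABLES module's implementation, and the function names and data are assumptions. Tau c is not shown.

```python
import math
from itertools import combinations

def pair_counts(x, y):
    """Count concordant pairs, discordant pairs, and pairs tied on one variable only."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: not counted in any of the four totals
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    return concordant, discordant, tied_x_only, tied_y_only

def gamma(x, y):
    """Gamma: tied pairs are left out of the denominator altogether.
    Undefined (division by zero) when there are no untied pairs."""
    c, d, _, _ = pair_counts(x, y)
    return (c - d) / (c + d)

def kendall_tau_b(x, y):
    """Tau b: pairs tied on one variable enlarge the denominator."""
    c, d, tx, ty = pair_counts(x, y)
    return (c - d) / math.sqrt((c + d + tx) * (c + d + ty))

# Hypothetical ordinal data with several tied observations.
x = [1, 1, 2, 2]
y = [1, 2, 2, 3]
print(gamma(x, y), round(kendall_tau_b(x, y), 4))
```

In this example every untied pair is concordant, so Gamma equals 1, while Tau b is smaller because the tied pairs are counted in its denominator.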

Chi-square test: This goodness-of-fit test compares the observed and expected frequencies in each category to test whether all the categories contain the same proportion of values.

(Module: TABLES)
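
The chi-square goodness-of-fit statistic described above can be sketched as follows for the case where all categories are expected to contain the same proportion of values. This minimal Python example is not the TABLES module's implementation; the function name and the frequencies are assumptions.

```python
def chi_square_equal_proportions(observed):
    """Chi-square statistic for observed category frequencies, with the
    expected frequency taken as the same in every category."""
    total = sum(observed)
    expected = total / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

# Hypothetical frequencies in four categories (80 values in all).
observed = [18, 22, 20, 20]
statistic = chi_square_equal_proportions(observed)
degrees_of_freedom = len(observed) - 1
print(round(statistic, 3), degrees_of_freedom)
```

The statistic is compared with the chi-square distribution with one degree of freedom fewer than the number of categories; a small value, as here, gives no reason to doubt that the proportions are equal.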