45 Distribution and Lorenz Functions
Notation
i = subscript for break point
45.1 Formula for Break Points
The number of break points is one less than the number of requested
subintervals; e.g. the median implies two subintervals and one break point.
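\[ p_i = V(a) + b \, [ V(a+1) - V(a) ] \]
where V is an ordered data vector (e.g. V(3) is the third item in the
vector), N is the number of items in V, s is the number of subintervals,
\[ a = \operatorname{entier}\left[ \frac{i (N+1)}{s} \right], \qquad b = \frac{i (N+1)}{s} - a, \]
and entier(x) is the greatest integer not exceeding x.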
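The formula can be illustrated with a minimal Python sketch. The function name `break_points` is hypothetical, the weight b is taken to be the fractional part of i(N + 1)/s (the usual reading of linear interpolation), and clamping at the ends of the vector is an assumption, since the text does not say what happens when a or a + 1 falls outside 1..N. Ties are not treated here.

```python
import math

def break_points(values, s):
    """Return the s - 1 break points of a non-empty data vector."""
    v = sorted(values)                 # ordered data vector V
    n = len(v)                         # N
    points = []
    for i in range(1, s):              # i = 1 .. s - 1
        pos = i * (n + 1) / s          # i(N + 1)/s
        a = math.floor(pos)            # entier(pos)
        b = pos - a                    # fractional part, assumed weight
        # Assumption: clamp to the ends of V when a or a + 1 is out of range.
        lo = v[min(max(a, 1), n) - 1]      # V(a), 1-based indexing
        hi = v[min(max(a + 1, 1), n) - 1]  # V(a + 1)
        points.append(lo + b * (hi - lo))
    return points

print(break_points([1, 2, 3, 4], 2))   # median of four items
```

With four items and s = 2, the single break point lies halfway between the second and third items.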
45.2 Distribution Function Break Points
There are four possible situations:
- If a break point falls exactly on a value and the value is not
tied with any other value, then the value itself is the break point.
- If a break point falls between two values and the two values are not
the same, then the break point is determined using ordinary linear
interpolation.
- If a break point falls exactly on a value and the value is tied
with one or more other values, then the procedure involves computing new
midpoints.
Let k be the value, m be the frequency with which it occurs
and d be the minimum distance between items in the vector V.
The interval k±min(d,1)/2 is divided into m parts and
midpoints are computed for these new intervals. The break point is then
the appropriate midpoint.
- If a break point falls between two values which are identical,
the procedure involves both the calculation of new midpoints and
ordinary linear interpolation.
Let k be the value, m be the frequency with which it occurs
and d be the minimum distance between items in the vector V.
The interval k±min(d,1)/2 is divided into m parts and
midpoints are computed for these new intervals. Then linear interpolation is
performed between the two appropriate new midpoints.
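The midpoint construction used for tied values can be sketched as follows. This is one reading of the description, not the program's actual code: the names `tie_midpoints` and `spread_ties` are hypothetical, d is taken to be the smallest non-zero gap between items, and d = 1 is assumed when all items are identical. Each group of m tied values k is replaced by the m midpoints of the equal parts of the interval k ± min(d,1)/2; break points can then be read off or interpolated from the resulting vector.

```python
from collections import Counter

def tie_midpoints(k, m, d):
    """Midpoints of the m equal parts of the interval k +/- min(d, 1)/2."""
    width = min(d, 1)
    left = k - width / 2
    part = width / m
    return [left + (j + 0.5) * part for j in range(m)]

def spread_ties(values):
    """Replace each run of tied values by its midpoints (sorted output)."""
    v = sorted(values)
    # Assumption: d is the smallest non-zero distance between items.
    gaps = [y - x for x, y in zip(v, v[1:]) if y > x]
    d = min(gaps) if gaps else 1       # assumed fallback when all items tie
    counts = Counter(v)
    out = []
    for k in sorted(counts):
        m = counts[k]
        out.extend([k] if m == 1 else tie_midpoints(k, m, d))
    return out

print(spread_ties([1, 5, 5, 9]))
```

Here the two tied values 5 are spread over the interval 4.5 to 5.5, which is split into two parts with midpoints 4.75 and 5.25.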
45.3 Lorenz Function Break Points
To determine Lorenz function break points, the ordered data vector
is cumulated, and at each step the cumulated total is divided by the
grand total. Then the break points are found the same way as described
above.
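The cumulation step can be sketched in Python as below. The function name `lorenz_values` is hypothetical, and the sketch assumes a non-empty vector with a positive grand total.

```python
from itertools import accumulate

def lorenz_values(values):
    """Cumulated totals of the ordered vector, divided by the grand total."""
    v = sorted(values)                 # ordered data vector
    total = sum(v)                     # grand total, assumed positive
    return [c / total for c in accumulate(v)]

print(lorenz_values([1, 1, 2, 4]))
```

The break point formula is then applied to the returned vector instead of the original data.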
45.4 Lorenz Curve
The Lorenz function plotted against the proportion of the ordered
population gives a Lorenz curve, which is always contained in the
lower triangle of the unit square. The QUANTILE program uses ten
subintervals for the Lorenz curve.
Note that Lorenz function values are called "Fraction of wealth"
on the printout.
45.5 The Gini Coefficient
The Gini coefficient represents twice the area between the Lorenz
function and the diagonal plotted in the unit square. It takes on
values between 0 and 1.
Zero (0) indicates "perfect equality" - all data values are equal.
One (1) indicates "perfect inequality" - there is one non-zero data value.
The program uses an approximation:
\[ \text{Gini coefficient} = 1 - \frac{1}{s} - \frac{2}{s} \sum_{i=1}^{s-1} l_i \]
where l_i is the ith Lorenz function break point and s is the number of
subintervals.
This approximation becomes more accurate as the number of break points
is increased; it is recommended that at least ten be used.
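The approximation 1 - 1/s - (2/s) Σ l_i is what the trapezoidal rule gives for twice the area between the diagonal and a Lorenz curve sampled at s equally spaced points, with end values 0 and 1. A minimal sketch, with the hypothetical function name `gini_approx`, taking the s - 1 Lorenz function break points as input:

```python
def gini_approx(lorenz_break_points):
    """Approximate Gini coefficient from the s - 1 Lorenz break points."""
    s = len(lorenz_break_points) + 1   # number of subintervals
    return 1 - 1 / s - (2 / s) * sum(lorenz_break_points)

# Perfect equality with s = 4: the break points lie on the diagonal.
print(gini_approx([0.25, 0.5, 0.75]))
# All break points at zero with s = 4: the result is 1 - 1/s, not 1.
print(gini_approx([0.0, 0.0, 0.0]))
```

The second call shows why more break points help: with every break point at zero the approximation can reach at most 1 - 1/s.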
45.6 Kolmogorov-Smirnov D Statistic
The Kolmogorov-Smirnov test is concerned with the agreement between
two cumulative distributions.
If two sample cumulative distributions are too far apart at any point,
it suggests that the samples come from different populations. The test
focuses on the largest difference between the two distributions.
Let V1 and V2 be the ordered data vectors for the first and the
second variable respectively, and X the vector of codes which appear
in either distribution. The program creates the two cumulative step
functions F1(x) and F2(x) respectively. Then it looks for the
maximum absolute difference between the distributions,
\[ D = \max_{x} \, | F_1(x) - F_2(x) | \]
and prints:
- the value of x where the first maximum absolute difference occurs
- the value of F1 associated with that x
- the value of F2 associated with that x.
If the N's for V1 and V2 are equal and less than 40,
the program prints the K statistic, which is equal to the difference in
frequencies associated with the maximum difference. A table of critical
values of the K statistic, denoted KD, can be consulted to determine the
significance of the observed difference.
If the N's for V1 and V2 are unequal or larger than 40,
the program prints the following statistics:
\[ \text{Unadjusted deviation} = D = | f_1 - f_2 | \]
\[ \text{Adjusted deviation} = D \sqrt{\frac{N_1 N_2}{N_1 + N_2}} \]
where N1 and N2 are equal to the number of cases in V1 and V2 respectively.
\[ \text{Chi-squared approximation} = 4 D^2 \times \frac{N_1 N_2}{N_1 + N_2} \]
Note: The significance of the maximum directional deviation can be found
by referring this chi-square value to a chi-square distribution with
two degrees of freedom.
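The computation of D and the derived statistics can be sketched as follows. The function name `ks_two_sample` and the returned dictionary keys are hypothetical. The adjusted deviation is taken here as D times the square root of N1 N2 / (N1 + N2), the usual two-sample form, which is an assumption chosen to agree with the chi-squared approximation. The sketch computes these values for any sample sizes and does not reproduce the program's switch to the K statistic for equal N's below 40.

```python
import math
from bisect import bisect_right

def ks_two_sample(v1, v2):
    """Maximum absolute difference between two cumulative step functions."""
    a, b = sorted(v1), sorted(v2)
    n1, n2 = len(a), len(b)
    best = None
    for x in sorted(set(a) | set(b)):      # codes appearing in either vector
        f1 = bisect_right(a, x) / n1       # F1(x): share of V1 not above x
        f2 = bisect_right(b, x) / n2       # F2(x): share of V2 not above x
        diff = abs(f1 - f2)
        if best is None or diff > best["D"]:   # strict: keep first maximum
            best = {"x": x, "F1": f1, "F2": f2, "D": diff}
    scale = n1 * n2 / (n1 + n2)
    best["adjusted"] = best["D"] * math.sqrt(scale)  # assumed form
    best["chi2"] = 4 * best["D"] ** 2 * scale
    return best

print(ks_two_sample([1, 2, 3, 4], [3, 4, 5, 6]))
```

In this example the difference 0.5 is reached at x = 2, 3 and 4; the strict comparison keeps the first of these, matching the "first maximum" that the program reports.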
45.7 Note on Weights
For distribution function break points, Lorenz function break points, and
the Gini coefficient, data may be weighted by an integer.
If a weight is specified, each case is implicitly counted as "w" cases,
where "w" is the weight value for the case. The Kolmogorov-Smirnov test
is always performed on unweighted data.
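One way to picture integer weighting is to expand the data vector so that each case appears w times before any break point is computed. The sketch below does exactly that; the function name `expand_weighted` is hypothetical, and a real implementation would more likely accumulate weights than build the expanded vector.

```python
def expand_weighted(values, weights):
    """Repeat each value w times, where w is its non-negative integer weight."""
    expanded = []
    for x, w in zip(values, weights):
        expanded.extend([x] * w)       # the case counts as w cases
    return sorted(expanded)            # ordered data vector

print(expand_weighted([3, 1], [2, 1]))
```

The value 3 with weight 2 appears twice in the ordered vector, so it behaves like a tied value in the break point calculations.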