Distribution and Lorenz Functions (statistics)

45  Distribution and Lorenz Functions

Notation


$p_i$ = value of the ith break point
$i$ = subscript for break point
$s$ = number of subintervals
$N$ = total number of cases.

45.1  Formula for Break Points

The number of break points is one less than the number of requested subintervals, e.g. medians imply two subintervals and one break point.


$$p_i = V(a) + b \, [\, V(a+1) - V(a) \,]$$

where V is an ordered data vector, e.g. V (3) is the third item in the vector,


$$a = \operatorname{entier}\left[ \frac{i \, (N+1)}{s} \right]$$

$$b = \frac{i \, (N+1)}{s} - a$$

and entier(x) is the greatest integer not exceeding x.
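
The formula above can be sketched in Python as follows. This is a minimal illustration, not the program's own code: the function name `break_points` is hypothetical, and the clamping at the two ends of the vector (where V(a) or V(a+1) does not exist) is an assumption, since the text does not say how those cases are handled.

```python
import math

def break_points(values, s):
    """Return the s - 1 break points of `values` for s subintervals."""
    v = sorted(values)              # ordered data vector V (1-based in the text)
    n = len(v)                      # N, total number of cases
    points = []
    for i in range(1, s):
        pos = i * (n + 1) / s       # i(N + 1)/s
        a = math.floor(pos)         # entier: greatest integer not exceeding pos
        b = pos - a                 # fractional part
        # Assumption: clamp indices to 1..N when V(a) or V(a + 1) is undefined.
        lo = v[min(max(a, 1), n) - 1]
        hi = v[min(max(a + 1, 1), n) - 1]
        points.append(lo + b * (hi - lo))
    return points
```

With this sketch, `break_points([1, 2, 3, 4], 2)` gives the single break point 2.5 (the median), because i(N+1)/s = 2.5, so a = 2 and b = 0.5.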

45.2  Distribution Function Break Points

There are four possible situations:

45.3  Lorenz Function Break Points

To determine Lorenz function break points, the ordered data vector is cumulated, and at each step the cumulated total is divided by the grand total. The break points are then found from this vector of cumulative fractions in the same way as described in section 45.1.
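
A minimal sketch of this procedure, under the assumption that the break-point formula of section 45.1 is applied directly to the vector of cumulative fractions. The name `lorenz_break_points` is hypothetical, the end-of-vector clamping is an assumption, and the sketch assumes a positive grand total.

```python
import math
from itertools import accumulate

def lorenz_break_points(values, s):
    """Return s - 1 Lorenz function break points for s subintervals."""
    v = sorted(values)
    total = sum(v)                                # grand total (assumed > 0)
    shares = [c / total for c in accumulate(v)]   # cumulated total / grand total
    n = len(shares)
    points = []
    for i in range(1, s):
        pos = i * (n + 1) / s
        a = math.floor(pos)
        b = pos - a
        # Assumption: clamp indices to 1..N at the ends of the vector.
        lo = shares[min(max(a, 1), n) - 1]
        hi = shares[min(max(a + 1, 1), n) - 1]
        points.append(lo + b * (hi - lo))
    return points
```

For example, `lorenz_break_points([1, 1, 1, 1], 2)` interpolates halfway between the cumulative fractions 0.5 and 0.75 and returns 0.625.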

45.4  Lorenz Curve

The Lorenz function plotted against the proportion of the ordered population gives a Lorenz curve, which is always contained in the lower triangle of the unit square. The QUANTILE program uses ten subintervals for the Lorenz curve.

Note that Lorenz function values are called "Fraction of wealth" on the printout.

45.5  The Gini Coefficient

The Gini coefficient represents twice the area between the Lorenz function and the diagonal plotted in the unit square. It takes on values between 0 and 1. Zero (0) indicates "perfect equality" - all data values are equal. One (1) indicates "perfect inequality" - there is one non-zero data value.

The program uses an approximation:


$$\text{Gini coefficient} = 1 - \frac{1}{s} - \frac{2}{s} \sum_{i=1}^{s-1} l_i$$

where $l_i$ is the ith Lorenz function break point.

This approximation becomes more accurate as the number of break points is increased; it is recommended that at least ten be used.
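
The approximation is simple to evaluate once the Lorenz function break points are known. The sketch below uses a hypothetical function name, `gini_approx`, and takes the s - 1 break points as a list; it is an illustration of the formula, not the program's implementation.

```python
def gini_approx(lorenz_points):
    """Approximate the Gini coefficient from s - 1 Lorenz break points."""
    s = len(lorenz_points) + 1          # number of subintervals
    return 1 - 1 / s - (2 / s) * sum(lorenz_points)
```

If the break points lie on the diagonal (0.1, 0.2, ..., 0.9 for ten subintervals), the result is 0 up to rounding. If every break point is 0, the result is 1 - 1/s, that is 0.9 for ten subintervals, which shows why more break points are needed for the approximation to approach 1.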

45.6  Kolmogorov-Smirnov D Statistic

The Kolmogorov-Smirnov test is concerned with the agreement between two cumulative distributions. If two sample cumulative distributions are too far apart at any point, it suggests that the samples come from different populations. The test focuses on the largest difference between the two distributions.

Let V1 and V2 be the ordered data vectors for the first and the second variable respectively, and X the vector of codes which appear in either distribution. The program creates the two cumulative step functions F1 (x) and F2 (x), one for each variable. It then looks for the maximum absolute difference between the distributions,


$$D = \max_{x} \, | F_1(x) - F_2(x) |$$

and prints:


x : the value where the first maximum absolute difference occurs
f1 : the value of F1 at that x
f2 : the value of F2 at that x.
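
The search for D described above can be sketched as follows. The function name `ks_max_difference` is hypothetical, and the step functions are evaluated naively at each code for clarity rather than speed.

```python
def ks_max_difference(v1, v2):
    """Return (D, x, f1, f2) for the first maximum of |F1(x) - F2(x)|."""
    n1, n2 = len(v1), len(v2)
    best = (0.0, None, 0.0, 0.0)
    # X: the codes which appear in either distribution, in ascending order.
    for x in sorted(set(v1) | set(v2)):
        f1 = sum(1 for value in v1 if value <= x) / n1   # F1(x)
        f2 = sum(1 for value in v2 if value <= x) / n2   # F2(x)
        d = abs(f1 - f2)
        if d > best[0]:          # strict comparison keeps the first maximum
            best = (d, x, f1, f2)
    return best
```

For `v1 = [1, 2, 3, 4]` and `v2 = [3, 4, 5, 6]` the difference 0.5 is reached at x = 2, 3 and 4; the sketch reports the first of these, x = 2, with f1 = 0.5 and f2 = 0.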

If the N's for V1 and V2 are equal and less than 40, the program prints the K statistic, which is equal to the difference in frequencies associated with the maximum difference. A table of critical values of the K statistic, denoted KD, can be consulted to determine the significance of the observed difference.

If the N's for V1 and V2 are unequal or larger than 40, the program prints the following statistics:


Unadjusted deviation = D = | f1 - f2 |


$$\text{Adjusted deviation} = D \sqrt{ \frac{N_1 N_2}{N_1 + N_2} }$$

where N1 and N2 are equal to the number of cases in V1 and V2 respectively.


$$\text{Chi-squared approximation} = 4 D^2 \times \frac{N_1 N_2}{N_1 + N_2}$$

Note: The significance of the maximum directional deviation can be found by referring this chi-square value to a chi-square distribution with two degrees of freedom.
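
The two large-sample statistics follow directly from D, N1 and N2. The sketch below uses a hypothetical function name, `ks_large_sample_stats`. The tail probability it returns is an addition for illustration: it uses the closed form exp(-x/2) for a chi-square variable with two degrees of freedom, and the text does not say that the program prints it.

```python
import math

def ks_large_sample_stats(d, n1, n2):
    """Return (adjusted deviation, chi-squared approximation, tail probability)."""
    ratio = n1 * n2 / (n1 + n2)          # N1 N2 / (N1 + N2)
    adjusted = d * math.sqrt(ratio)
    chi_squared = 4 * d ** 2 * ratio
    # Upper-tail probability of a chi-square variable with 2 degrees of freedom.
    p_value = math.exp(-chi_squared / 2)
    return adjusted, chi_squared, p_value
```

With D = 0.5 and N1 = N2 = 50, the ratio is 25, so the adjusted deviation is 2.5 and the chi-squared approximation is 25.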

45.7  Note on Weights

For distribution function break points, Lorenz function break points, and the Gini coefficients, data may be weighted by an integer. If a weight is specified, each case is implicitly counted as "w" cases, where "w" is the weight value for the case. The Kolmogorov-Smirnov test is always performed on unweighted data.
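
One way to picture the effect of an integer weight is to repeat each case w times before ordering the data. The sketch below does this literally; the name `expand_by_weight` is hypothetical, weights are assumed to be non-negative integers, and the program itself may well count cases without building the expanded vector.

```python
def expand_by_weight(values, weights):
    """Return the ordered data vector with each case repeated w times."""
    expanded = []
    for value, w in zip(values, weights):
        expanded.extend([value] * w)     # the case counts as w cases
    return sorted(expanded)
```

For example, values 5, 1, 3 with weights 2, 1, 3 give the ordered vector 1, 3, 3, 3, 5, 5, so N becomes 6.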