Linguistic Diversity of Internet Information Sources

A few studies undertake large-scale quantitative analysis of the languages used on the Internet. Generally, these focus on the World Wide Web, to the exclusion of other communication modes such as email and chat, because the Web is more directly observable and easier to survey than other forms of Internet communication.

The Online Computer Library Center (OCLC) studies (Lavoie and O’Neill, 1999; O’Neill, Lavoie and Bennett, 2003) used a random sample of the websites available on the Internet. They obtained this sample by generating random IP addresses and attempting to connect to a website at each address. If a Web server answered, they downloaded its home page and ran an automated language classification system on it. This method has the advantage of being unbiased, whereas all other methods of sampling rely directly or indirectly on search engines or “Web spiders”, programs that discover new Web pages by following all the links in a known set of Web pages.
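The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the OCLC implementation: the function names, the choice of IPv4, and the two callables `fetch_home_page` and `classify_language` are assumptions introduced here, and the network access and the language classifier are deliberately left to the caller.

```python
import ipaddress
import random
from typing import Callable, Dict, Iterator, Optional


def random_ipv4_addresses(count: int, rng: random.Random) -> Iterator[str]:
    """Yield `count` uniformly random IPv4 addresses in dotted-quad form."""
    for _ in range(count):
        # A 32-bit random integer covers the whole IPv4 address space.
        yield str(ipaddress.IPv4Address(rng.getrandbits(32)))


def sample_websites(
    count: int,
    fetch_home_page: Callable[[str], Optional[str]],
    classify_language: Callable[[str], str],
    rng: random.Random,
) -> Dict[str, str]:
    """Probe `count` random addresses and classify each home page found.

    `fetch_home_page` (hypothetical) returns the page text, or None when
    no Web server answers. `classify_language` (hypothetical) maps page
    text to a language label.
    """
    languages: Dict[str, str] = {}
    for address in random_ipv4_addresses(count, rng):
        page = fetch_home_page(address)
        if page is None:
            continue  # no Web server at this address; skip it
        languages[address] = classify_language(page)
    return languages
```

Because every address is equally likely to be drawn, a site's chance of entering the sample does not depend on whether any other page links to it, which is the property the paragraph above contrasts with spider-based sampling. A real survey would presumably also need to handle reserved address ranges, timeouts, and servers that host many sites, none of which is modelled here.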

The 1998-1999 survey suggested that some international expansion of the Web was taking place, and that the use of different languages was closely correlated with the domain in which each website originated. The 1999 sample of 2229 random websites, for example, contained 29 identifiable languages, with the distribution presented in Figure 3.

As might be expected, English is clearly dominant, accounting for 72% of the websites surveyed. The diversity index for this sample of Web pages is 2.47, less than that of a typical Southeast Asian country and more than that of a typical country in South Central Asia. It is also hundreds of times smaller than global linguistic diversity. Hence, the linguistic diversity of the World Wide Web, while it approaches that of many multilingual countries, is a poor representation of linguistic diversity worldwide.
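To make the effect of a dominant language on a diversity measure concrete, the sketch below computes Shannon entropy over language shares. This section does not state the formula behind the diversity index, so the entropy measure is an assumption chosen for illustration and will not necessarily reproduce the figure of 2.47. The shares are also hypothetical: only the 72% for English and the count of 29 languages come from the text, and the even split among the remaining 28 languages is invented.

```python
import math
from typing import Mapping


def shannon_entropy(shares: Mapping[str, float]) -> float:
    """Shannon entropy (in nats) of a language distribution.

    Shares are normalised first, so raw counts work as well as fractions.
    """
    total = sum(shares.values())
    return -sum(
        (s / total) * math.log(s / total) for s in shares.values() if s > 0
    )


def effective_number_of_languages(shares: Mapping[str, float]) -> float:
    """Number of equally used languages that would give the same entropy."""
    return math.exp(shannon_entropy(shares))


# Hypothetical distribution: English at 72%, the other 28 languages
# sharing the remaining 28% equally (an assumption, not survey data).
dominated = {"English": 0.72}
dominated.update({f"language_{i}": 0.01 for i in range(1, 29)})

# The same 29 languages used equally often, for comparison.
uniform = {name: 1.0 for name in dominated}

print(effective_number_of_languages(dominated))
print(effective_number_of_languages(uniform))
```

Under this measure, 29 equally used languages count as 29 effective languages, while the English-dominated distribution counts as far fewer, which illustrates why a sample containing many identifiable languages can still score low on diversity.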

