
36 Million language pairs

4 July 2012


Introduction

The foundational structure of the Internet as well as the World Wide Web was anything but multilingual in the beginning. In fact, it was monolingual or, to be precise, mono-dialectal. Conceived as a distributed architecture of computers for the US military, the Internet was invented to provide an infrastructure that would remain robust in case of attacks or emergencies. It was later developed into a mechanism and communication infrastructure for sharing information amongst scientists. At that time the Internet spoke American English. The invention and rise of the World Wide Web (WWW) is closely related to the invention of the Internet, also in linguistic terms. When I asked Sir Timothy Berners-Lee, inventor of the World Wide Web, whether it was a conscious decision to give his hypertext project the English name World Wide Web, he said:

I am English. The working languages at CERN, an international lab, were English and French. The dominant language on the Internet was American English at the time. (Berners-Lee 2010)

Berners-Lee's statement reflects the linguistic origin of the World Wide Web. Although invented and proposed in the French-speaking part of multilingual Switzerland, at the Conseil Européen pour la Recherche Nucléaire (CERN), it was American English that was dominant in the most common codes of the computer and inter-computer (Internet) environments at that time. This is the cultural and historical context out of which the WWW emerged.

The architecture of a World Wide Web in which content in different languages co-exists and co-evolves (I call it the lateral web) did not just magically appear, and neither did web content in various languages. For it to emerge, decisions about rules and standards had to be taken, some of which precede the invention of the World Wide Web (e.g. ISO standards). The rules and standards that were chosen have helped remake the World Wide Web. Much of that remaking has made many people around the world (wide web) happier, while other aspects of it have been rather problematic. The rules and standards that helped shape the lateral web are its codes (e.g. character encodings). These codes are one foundational process of the lateral web, but to be meaningful they need to be built into the actual processing layers. This is the implementation into software (e.g. operating systems, web browsers, fonts). This second foundational process is, on the one hand, largely organized around Westernised, English-based categories and language concepts, most notably exemplified by programming and scripting languages (Golumbia 2009), and, on the other hand, often relies on political and cultural support that helps enforce the implementation of specific measures. If both these conditions are met, a third fundamental process comes into play: the generation of content in the preferred language of users.
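
To make the point about character encodings concrete, the following minimal Python sketch (an illustration added here, not part of the original standards discussion) shows that the same piece of non-Latin text cannot be represented in the ASCII encoding of the early Internet, whereas a Unicode encoding such as UTF-8 handles it without loss:

# Why character encodings are a foundational 'code' of the lateral web:
# whether text can be represented at all depends on the encoding a
# processing layer implements.
text = "草泥马"  # a Chinese string, also used in the Wikipedia example further below
try:
    text.encode("ascii")  # the de facto default of the early Internet
except UnicodeEncodeError as err:
    print("ASCII cannot represent this text:", err)
utf8_bytes = text.encode("utf-8")  # a Unicode encoding implemented across modern software
print("UTF-8 bytes:", utf8_bytes)
print("Round trip:", utf8_bytes.decode("utf-8"))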

Being able to use one's preferred language is a necessary requirement in any environment, including the Internet. Today, almost two billion people around the world use the Internet environment, most of whom do not speak English as a native language. (Internet World Stats 2010) While access to the Internet environment remains an issue in some parts of the world, it is a matter of fact that an increasing number of users populate the web, an additional 100 million per year. (Yunker 2010) These two billion worldwide users speak at least 1,500 languages, according to an estimate of language presences on the web. (Crystal 2004) Language and content availability have a decisive impact on the diffusion of the Internet. In fact, the last mile for the Internet's diffusion is not just physical infrastructure, but language and content availability. (United Nations/ITU 1999; Viard and Economides 2010) Language availability requires the technical implementation of characters, scripts and writing systems in major operating systems, web standards, devices such as mobile phones, and so on. This is aided by political and cultural support, which may evolve from public pressure by governments and interest groups, or from the use of popular culture, for example. Once language availability is provided, content follows. Thus, a thorough understanding is required of the regimes that regulate how languages (and scripts) come into existence at critical sites of the Internet environment.

Governance of languages in the Internet environment

Regulatory regimes that determine the scale and scope of linguistic pluralism in the Internet environment can be subsumed under the larger framework of governance. Governance describes changes in policy making and, depending upon the observer's perspective, may mean authority, control, or governmental power, to name but a few. (cf. for example Benz 2004; Bevir 2009; Kersbergen & Waarden 2004; Kooiman 2003; Mayntz 2004; Rhodes 1996; Schuppert 2005; Treib, Bähr & Falkner 2007) This includes the prescription of principles such as enactments in favour of or against particular interests of others, as well as an understanding of dialogue or consensus that rests upon a confined number of participants. Media and communication scholars have used the notion of 'media governance' to describe new polities, politics, and policies in the media environment. (Bardoel & d'Haenens 2004; Hans Bredow Institut & Institute of European Media Law 2006; Jarren et al. 2002; Latzer, Just, Saurwein & Slominski 2002; Schulz & Held 2004; Tambini, Leonardi & Marsden 2008) Covering both collective and organisational governance, media governance has been defined as 'the regulatory structure as a whole, i.e. the entirety of forms of rules that aim to organize media systems.' (Puppis 2010: 138) Such an approach to media governance identifies shortcomings of existing regulatory regimes, thereby assisting in regulatory innovation, and it argues that governance is not necessarily in the public interest as it may 'assist the actors involved in realizing other aims like gaining or maintaining power.' (145) The relationship between language and governance, on the other hand, developed out of 'a need to move beyond treating language rights as sui generis and for integrating language-related needs into the fabric of government decision-making at all levels where appropriate.' (Williams 2007: 39) In this context, it is suggested that language governance provides an institutional bridge to a more democratic, inclusive and purposeful treatment of 'long-beleaguered language minorities.' (40)

This article draws on these accounts, which combine governance with language and media, but places its focus differently. It analyses the governance of languages in the Internet environment. In the early 2000s, Viktor Mayer-Schönberger and Deborah Hurley pointed to the potential fusion of governance, languages and the Internet environment:

People desiring to access the net also need to know how to navigate and explore a still largely English, text-oriented web regardless of how easy the actual information access appliances will have become. … Overcoming the challenges implicit in such an analysis of a two-tier society will pose another serious governance issue. (Mayer-Schönberger & Hurley 2000: 149)

The governance of languages in the Internet environment varies depending on the site under observation. Following Jeanette Hofmann's definition of Internet governance I understand the governance of languages in the Internet environment as a 'regulative idea in flux' (Hofmann 2007) which is closely related to the changing sites of investigation. In other words, different sites of investigation adhere to different principles and paradigms pertaining to the governance of languages. An analysis of the governance of languages asks the following questions:

Who decides upon the participation of languages, characters and scripts in the Internet environment?

How and in what ways are these decisions being formed?

And, what are the consequences that those processes bear for individual users and human societies that participate or aim at participating in the Internet environment?

In other words, which languages and language communities participate depends on how agreements, procedures, conventions and policies are negotiated and communicated, and how decision makers are being held accountable. The following examples of critical sites in the Internet environment at this point in time are indicative of the variety of approaches to the governance of languages.

Regulatory regimes of three linguistic network hubs

The first example is the largest Internet governing body to date, the Internet Corporation for Assigned Names and Numbers (ICANN). ICANN was created under the auspices of the US Department of Commerce as a non-profit organisation in 1998. Most of its work concerning the governance of languages plays out under the hierarchy of top-level domain names (TLDs), which consist of generic TLDs (EDU, COM, NET, ORG, GOV, and so on) and two-letter country codes (ccTLDs, as defined by the International Organization for Standardization (ISO), for example DE or CN). A more recent field of activity is internationalized country code top-level domains (IDN ccTLDs). IDN ccTLDs pertain to the display of domain names in language-native characters or scripts in end user applications such as web browsers. For example, ICANN announced in 2009 that non-Latin characters in top-level domains would be introduced. Some described this decision as the biggest change to the way the Internet functions since its inception 40 years earlier. (BBC 2009) Others criticized that the decision had been delayed for half a decade and that it still had its limitations. As Paul Hoffman, one of the authors behind the original development of the technology for non-Latin characters, noted:

ICANN's announcement 'only' covers IDNs in country names, not in new TLDs. The latter will (or won't) happen a few years from now, and the political discussions there will be even more difficult than it was for the country names. (Hoffman 2009)

The governance of languages and scripts within ICANN’s domain of responsibility is directly dependent on the strategic and operational aims of its members. In the case of IDN ccTLDs, for example, paying members (annual fee: US$185,000) determine the process of evaluation as both decision-makers and financial contributors. It is under such organizational, political and corporate constraints that decisions about the governance of languages are being made by ICANN.
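
To illustrate the technology at stake, the following minimal Python sketch (an illustration using Python's built-in IDNA codec, not ICANN documentation) shows how a non-Latin domain label is converted to the ASCII-compatible "xn--" form (Punycode) that the domain name system actually resolves:

# An internationalized label is carried in the DNS in an ASCII-compatible
# "xn--" form; end user applications convert between the two representations.
label = "пример"  # a Cyrillic label meaning "example", used in ICANN's early IDN tests
ascii_form = label.encode("idna")
print(ascii_form)                 # expected: b'xn--e1afmkfd'
print(ascii_form.decode("idna"))  # back to the Unicode form shown to end users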

A second example is Wikimedia. Wikimedia's governance of languages is in many respects antithetical to that of ICANN. Open source software and peer-governance are at the core of Wikimedia's operational structure. This is reflected in the Wikimedia Foundation's governance of languages. A multi-step process determines the addition or rejection of a new language version. The application procedure is extensively documented.

For Wikimedia’s language proposal policy see http://meta.wikimedia.org/wiki/Meta:Language_proposal_policy (last accessed February 15, 2010)

The specific requirements Wikimedia has for a new language proposal to be approved are:

A new language edition must not already exist on any project of Wikimedia.

The language must have a valid ISO 639-1, 639-2 or 639-3 code.

This means it must be listed in an ISO 639 database, or standards organizations must be convinced to create an ISO 639 code for a 'new' language (see the sketch following these requirements for a simple way of checking a code against the registry).

The language must be sufficiently unique that it could not coexist on a more general wiki.

This, in most cases, excludes regional dialects and different written forms of the same language.

The language must have a sufficient number of living native speakers to form a viable community and audience.

This last requirement, which must be met for final approval, is assessed in an open discussion. To this end, a test project is initiated in which interest by individual speakers or supporters of the language is registered and arguments for and against the admission of the new language are gathered. The language committee then makes a decision.
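
As a simple illustration of the ISO 639 requirement above, the following Python sketch (a hypothetical helper, not part of Wikimedia's actual tooling) checks a proposed code or language name against the ISO 639 registry using the third-party pycountry package:

# Check whether a proposed language has a registered ISO 639 code.
# Requires the third-party package pycountry (pip install pycountry).
import pycountry

def has_iso_639_code(code_or_name: str) -> bool:
    """Return True if the code or name is found in the ISO 639 registry."""
    try:
        pycountry.languages.lookup(code_or_name)
        return True
    except LookupError:
        return False

print(has_iso_639_code("mi"))   # Maori, ISO 639-1 code -> True
print(has_iso_639_code("yue"))  # Cantonese, ISO 639-3 code -> True
print(has_iso_639_code("xx"))   # not a registered code -> False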

Wikimedia's governance of languages has adopted a traceable adoption procedure so as to involve the community and concerned individuals. This has allowed Wikipedia, Wikimedia's most prominent project, to become the most linguistically diverse project in the Internet environment at this point in time.

Cf. http://stats.wikimedia.org/wikimedia/animations/growth/index.html (last accessed February 15, 2010)

Wikipedia is available in more than 270 languages and dialects

For more updated numbers of supported languages and dialects as well as their key parameters such as articles, users and depth see http://en.wikipedia.org/wiki/Lists_of_languages (last accessed February 1, 2011)

which make it accessible, at least in principle, to more than five billion people, that is, 75 per cent of the world’s population.

The third example is Google's language service Google Translate. It is the most widely used, and equally contested, instant translation site, currently featuring an array of more than 50 wider and lesser-used languages that cover approximately 50 per cent of the world's population. Google Translate is attractive for all featured languages for two reasons: its ability to inter-translate between featured languages, and its seamless integration into web browsers, which allows for automatic web page translation. Google Translate's major currency is not languages but language pairs. Supporting 58 mutually inter-translatable languages, for example, actually amounts to 3,306 language pairs, that is, 58 x 57 directed source-to-target combinations. By being embedded into central focal points of attention such as web browsers, search engines or social network services, Google Translate can provide automatic translation of information from and into its featured languages. This way, language boundaries, a limitation to socio-cultural, political and economic evolution, are being bypassed. A relevant question thus becomes what the requirements are for new languages to be added to Google Translate. This depends on one decisive aspect: sufficient amounts of parallel text corpora, described by Google in its New Languages Support section as follows:

We're working to support other languages and will introduce them as soon as the automatic translation meets our standards. It's difficult to project how long this will take, as the problem is complex and each language presents its own unique challenges. In order to develop new systems, we need large amounts of bilingual texts.

At first sight, the request for and adoption of a new language might not seem as transparent as in the case of Wikimedia. However, this is largely due to the individual techno-linguistic threshold that determines when languages can be included on Google's translation service. Whilst the general procedure of adoption is clear (sufficient parallel text corpora and Google's standards of automatic translation), it is harder to define what the required parallel text corpora thresholds are. Once a new language has been included on Google Translate it forms part of a network of language pairs. The next section discusses the significance of language pairs as a specific feature for language and content availability in the Internet environment.
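
To give a rough sense of what such parallel text corpora feed into, the following deliberately naive Python sketch (a toy illustration, not Google's actual statistical machine translation pipeline) counts word co-occurrences across a couple of aligned sentence pairs; real systems do this over millions of sentences with far more sophisticated alignment models:

# Co-occurrence counts over aligned sentence pairs are the raw material from
# which word and phrase translation probabilities are estimated.
from collections import Counter
from itertools import product

# A tiny, hypothetical English-Dutch parallel corpus.
parallel_corpus = [
    ("the house is small", "het huis is klein"),
    ("the house is big", "het huis is groot"),
]

cooccurrence = Counter()
for source_sentence, target_sentence in parallel_corpus:
    for source_word, target_word in product(source_sentence.split(), target_sentence.split()):
        cooccurrence[(source_word, target_word)] += 1

print(cooccurrence[("house", "huis")])   # 2: co-occurs in both pairs, a likely translation
print(cooccurrence[("house", "groot")])  # 1: co-occurs only once, a weaker signal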

It’s the language pairs, stupid!

In the late 1990s, the International Telecommunication Union noted that language availability is paramount to the emergence and use of content in the Internet environment. (United Nations/ITU 1999) In a first phase, content becomes available in a specific language with the implementation of scripts, characters, writing systems and encoding standards in major operating systems, web standards, and devices such as mobile phones. The task of safeguarding linguistic pluralism in such a first phase is to ensure that most of the world's 6,000 languages gain a language presence as a prerequisite for content availability. Once a language can be used, content will be created. In this first phase, the size of available content is correlated with the size of the language, i.e. its speakers, participating users, existing range of content, etcetera. In a second phase, one that is gaining momentum now, the size of available content is becoming distinctly less dependent on the size of the language. The main parameter for this fundamental change is 'language pairs.'
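
The arithmetic behind the notion of language pairs is simple, as the following short Python lines (an illustration, not drawn from any of the cited sources) show: every added language pairs with every other supported language, so the number of directed source-to-target combinations grows roughly with the square of the number of languages.

# Directed language pairs for n mutually inter-translatable languages.
def language_pairs(n: int) -> int:
    return n * (n - 1)

print(language_pairs(58))     # 3,306 pairs for the 58 languages mentioned above
print(language_pairs(6000))   # 35,994,000, i.e. roughly 36 million pairs for ~6,000 languages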

Inter-language links in Wikipedia

Language pairs allow for information exchange and knowledge transfer from source languages to target languages. Wikipedia's inter-language links are a light or preliminary version of language pairs. To Wikipedia readers, the inter-language links show up in a box with the title "languages" in the left column of almost every Wikipedia entry. This box contains a set of links that lead readers to other language versions of the same (or a nearly equivalent) entry. Figure 1 shows one example where five other language versions (French, Malay, Japanese, Cantonese and Chinese) exist for the English entry "Grass Mud Horse." Such inter-language links can be edited as part of the entry, with a straightforward syntax of the target language code followed by the title of the entry in that target language, as shown below:

Figure 1

A typical Wikipedia entry page showing the location of the inter-language links

[[fr:Cheval de l'herbe et de la boue]]

[[ms:Kuda Lumpur Rumput]]

[[ja:草泥馬]]

[[zh-yue:草泥馬]]

[[zh:草泥马]]

Certain popular (or universal) entries are destined to appear in almost all language versions, such as the entry for Wikipedia itself. Some regional or parochial entries are expected to remain bound within certain language versions. It is worth mentioning that some entries in Wikipedia start as translations of articles that exist in other language versions. (Huvila 2010) This is a process that the inter-language links help to facilitate. Hence, it is possible for some parochial entries to spread across other languages. The number of inter-language links for a given entry can grow from zero (meaning the entry does not link to any other language version) all the way to the sum of all language versions minus one (meaning the entry links to every other language version that exists). At any given moment, it is expected that some entries have more inter-language links than others. Those with fewer inter-language links cluster languages into meaningful groups, whereas those with more inter-language links carry universally popular content. Wikipedia thus provides a premier observation site for highlighting important ties between language versions and for analysing patterns of spread, diffusion, and distribution of content. (cf. Petzold & Liao 2010) The environment in which Wikipedia operates is a largely self-organising system that creates and generates independently of its system operator. The growing nature of inter-language links becomes possible because of such a self-generating and self-creating environment. Here, any individual may decide to add another language version to an article, thereby creating a new language pair.
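
As a small illustration of how such links can be processed, the following Python sketch (a simple regular expression over raw wikitext; an illustration only, not the parser used by Wikipedia's own software) extracts the inter-language links shown in the example above:

# Extract inter-language links of the form [[xx:Title]] from an article's wikitext.
import re

wikitext = """
[[fr:Cheval de l'herbe et de la boue]]
[[ms:Kuda Lumpur Rumput]]
[[ja:草泥馬]]
[[zh-yue:草泥馬]]
[[zh:草泥马]]
"""

# Language codes are short and may contain a hyphen (e.g. zh-yue).
links = re.findall(r"\[\[([a-z]{2,3}(?:-[a-z]+)?):([^\]]+)\]\]", wikitext)
for language_code, title in links:
    print(language_code, "->", title)
print("inter-language links for this entry:", len(links))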

The Wikipedia inter-language links highlight how information and knowledge can be linked from source languages to target languages. They are a light version of language pairs insofar as they are still largely dependent on the size of the participating user base, its interests and levels of motivation. The larger the participating user base, the more diverse the range of topics likely to be covered in any one language. The size of available content is therefore still correlated, to some extent, with the size of the language. However, Wikipedia is increasingly being used for large-scale knowledge transfer (instead of linking) from data-rich source languages to target languages where less data exists. In a joint venture between Google and Wikipedia in 2010, for example, Google provided its translation software to help a team of volunteers, translators and Wikipedians across India, the Middle East and Africa translate several million words for Wikipedia into languages such as Arabic, Gujarati, Hindi, Kannada, Swahili, Tamil and Telugu. In a competing approach, Microsoft's translation engine is being used to create a system for multilingual Wikipedia content. (Kumaran et al. 2010)

Language Pairs in Google Translate

It was stated earlier that the major currency of Google's translation service is language pairs. This provides the distinctive advantage that once a language is implemented, it can be translated from and into all other Google Translate languages. This opens up a plethora of newly available content. In fact, it changes the correlation between the size of available content and the size of a language. Consider the following example: by the end of October 2010, the size of the World Wide Web indexed by Google's search engine was estimated at around 20 billion web pages. In comparison, the size of the Dutch World Wide Web indexed by Google comprised around 100 million web pages.

For more updated numbers, a larger set of search engine providers, and detailed information about the method see http://www.worldwidewebsize.com/ (last accessed 31 October, 2010)

In other words, users whose preferred language is Dutch were able to search 0.5 per cent of the World Wide Web's information indexed by Google. The distinctive advantage of Google Translate (but also of other translation tools) is that it is becoming more and more seamlessly embedded in the web environment. Google Translate, for example, was integrated into the standard Google Chrome browser to allow for automatic web content translation. This means that whenever a web page is accessed in another Google Translate language, it can be translated into any one of the 50+ Google Translate languages, for example Dutch. This way, more content is de facto available in the preferred language of users, who become more independent of their original demographic affordances (number of speakers, participating users, etcetera).

The specific feature of language pairs makes it pivotal for any one language to be implemented in Google Translate. However, in some cases it is problematic to acquire sufficient corpora of high quality parallel texts. As a result, Google extended the source of parallel text acquisition from institutional resources (such as UNESCO, the EU and others) to individual contributions. It started to harness user contributions more efficiently in 2009 when it introduced a feature that allowed users to improve translations. (cf. Figure 2)

Figure 2

Google Translate’s Contribute a better translation feature

This way, the translation model was supplied not with ever larger amounts of parallel texts, but with more nuanced translations derived from individual contributions. The feature 'Contribute a better translation' is now defunct. However, in 2009 Google also released the Translator Toolkit, which uses crowd-sourced human translation. If measured by the standards of 'crowd-sourcing' initiatives in other fields, providing such a toolkit scales up the amount of contributed parallel texts exponentially. Opening up its general method of parallel text acquisition to the wider public allows direct user input for existing languages and for several hundred other languages for which parallel text corpora are often much more difficult to acquire. This helps, for example, to preserve the future of some of the most vulnerable languages. Research on the Google Translator Toolkit by Māori language specialist Te Taka Keegan, for example, suggests that it can help smaller languages by providing shared translation resources that help unify a language's written form or increase the translation speed and quality of documents published in that language. (Helft 2010) While the Translator Toolkit was initially designed to attract collaborative translation efforts, it is now increasingly also used in commercial translation activities:

The significance of the Google Translator Toolkit is its position as a fully online software-as-a-service (SaaS) that mainstreams some backend enterprise features and hitherto fringe innovations, presaging a radical change in how and by whom translation is performed. (Garcia and Stevenson 2009)

The languages covered by the Google Translator Toolkit (some 340+) exceed the languages implemented in Google Translate (50+).

For a comparative overview of supported languages see Appendices A (Google Translate) & B (Google Translator Toolkit)

Once most or all languages covered by the Translator Toolkit become fully operational, that is, implemented into Google Translate, Google Translate is likely to supersede Wikipedia in terms of the number of languages it covers. In fact, the more language pairs there are on Google Translate, the more content becomes available to speakers of supported languages, ultimately without the user even being consciously aware of the act of translation and its impact on the original text (which, in turn, raises interesting new questions and concerns).

Conclusion

This article has provided a tour d'horizon of some of the key aspects relating to the co-evolution of languages and content in the Internet environment. It has done so by discussing two related aspects. First, it outlined the governance of languages in the Internet environment by means of three selected linguistic network hubs: the Internet Corporation for Assigned Names and Numbers (ICANN), the online encyclopaedia Wikipedia and Google's translation service Google Translate. It was noted that these sites differ in their regulatory regimes of language request and adoption. The main aspect of ICANN's decision-making process is the negotiation of organizational, political and corporate interests and constraints by its stakeholders. Wikipedia, on the other hand, pursues a strategy that is based on a language request and adoption process that can be monitored at all times and relies on peer-governance. Google's approach to the governance of languages, finally, largely revolves around a linguistic aspect, that is, sufficient parallel text corpora.

Following on from the governance discussion, the second part of this article presented new dynamics in the co-evolution of languages and discussed possible policy implications related to linguistic pluralism. It argued that policies which centre on language availability in the Internet environment must shift their focus to actual content availability. As social media researcher danah boyd (2009) noted:

What people give their attention to depends on a whole set of factors that have nothing to do with what's best. At the most simplistic level, consider the role of language. People will pay attention to content that is in their language, even if they can get access to content in any language.

This article has argued that new regimes of intersection amongst languages, here discussed under the notion of language pairs, help overcome the correlation impasse whereby content availability depends on the individual user's demographic affordances. The ramification of this for policy making is that the existence of language pairs extends the traditional idea of linguistic pluralism, that is, to make as many of the 6,000 languages available in the Internet environment as possible. This continues to be important; however, the ultimate extrapolation of an enhanced understanding of linguistic pluralism is not 6,000 languages but 36 million language pairs (6,000 x 5,999 ~ 36 million language pairs). With the 3,306 language pairs Google Translate offers (and the more than 100,000 language pairs supported by the Translator Toolkit), and the current status of Wikipedia's inter-language links in terms of inter-translatability, it is reasonable to conclude that we are only beginning to see this new kind of content globalisation evolving.