The largest annual conference in computational linguistics is in Beijing this year. Over the next few days, the world’s top researchers will present their latest research. Here are the languages they are studying:
Calculating the languages studied
We took the languages mentioned in the ~300 abstracts, counting each one named. When multiple were named, we divided the count among them. We employed crowdsourced workers for the task, asking them to read each abstract, report the number of languages studied, and list them when named. We took the mode response of four workers per abstract, and escalated ~10 papers to an NLP PhD when the crowdsourced workers disagreed.
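The counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the even split between languages, and the toy annotations are invented for the example and are not the actual scripts or data behind this post.

```python
from collections import defaultdict
from statistics import multimode


def consensus(responses):
    """Return the mode of the workers' responses, or None if there is no single mode."""
    modes = multimode(responses)
    return modes[0] if len(modes) == 1 else None


def fractional_counts(papers):
    """Give each paper one unit of credit, split across its named languages."""
    totals = defaultdict(float)
    for languages in papers:
        if not languages:
            continue  # no language was named in this abstract
        share = 1.0 / len(languages)  # assumption: the count is split evenly
        for language in languages:
            totals[language] += share
    return dict(totals)


# Toy annotations (invented for illustration): four worker responses per abstract.
annotations = {
    "abstract_1": [frozenset({"English"})] * 4,
    "abstract_2": [frozenset({"English", "Chinese"})] * 3 + [frozenset({"English"})],
    "abstract_3": [frozenset({"Spanish"})] * 2 + [frozenset({"English"})] * 2,
}

agreed = {}
needs_expert = []
for abstract_id, responses in annotations.items():
    answer = consensus(responses)
    if answer is None:
        needs_expert.append(abstract_id)  # no clear mode: escalate to an expert
    else:
        agreed[abstract_id] = answer

counts = fractional_counts(agreed.values())
print(counts)
print(needs_expert)
```

In this toy run, the third abstract has a two-against-two split, so it has no single mode and would be passed to the expert reviewer rather than counted automatically.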
The over-represented languages
Only English is over-represented among widely spoken languages. Despite Chinese (Mandarin) being the second most common language and the conference being hosted in China, there are still fewer papers on the Chinese language at ACL 2015 than we would expect given the overall number of speakers worldwide.
The other over-represented languages are over-represented by virtue of having few speakers. Akawaio has about 5,350 speakers and Aukan around 25,000 speakers, so any mention is enough to make them seem over-represented. The Braille system has about 100,000 speakers, but could also be thought of as an alternate writing system for English in this case and not a separate language. The single paper on Tibetan (or more accurately, languages in the Tibetic continuum) makes it twice as prevalent in ACL 2015 as in the wider world.
It’s safe to say that these languages are not represented at other computational linguistics conferences this year, so in general all languages other than English are under-represented.
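One way to make "over-represented" concrete is to compare a language's share of the papers with its share of the world's speakers. The sketch below assumes that simple ratio; the function name and the round numbers in the example are hypothetical and are not figures from ACL 2015.

```python
def representation_ratio(paper_count, total_papers, speakers, total_speakers):
    """Share of papers divided by share of speakers.

    A value above 1 means the language is over-represented,
    and a value below 1 means it is under-represented.
    """
    paper_share = paper_count / total_papers
    speaker_share = speakers / total_speakers
    return paper_share / speaker_share


# Hypothetical round numbers, chosen only to illustrate the calculation:
# a language with 2% of the papers but 1% of the speakers.
ratio = representation_ratio(
    paper_count=2, total_papers=100, speakers=1, total_speakers=100
)
print(ratio)
```

Because the number of speakers is in the denominator, a language with very few speakers gets a high ratio from even a fraction of one paper, which is why the small languages above appear over-represented.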
The least represented languages
The most under-represented named language is Cakchiquel (a dated spelling; the language is more commonly called Kaqchikel or Kaqchiquel today). By virtue of being part of a paper looking at 100 different languages, it had the least representation of any named language in the proceedings relative to its 4.5 million speakers.
The next most under-represented named language is Spanish. Despite the large number of resources for Spanish, it simply isn’t well-represented relative to its number of speakers. Kinyarwanda (literally, the language of Rwanda) is one of four languages in a paper, but this wasn’t enough to give it a share at ACL 2015 equal to its share of actual conversations.
Why it’s important to study more languages
Most advances in NLP are improvements of 2-3%. Most improvements of 2-3% are specific to the problem and language at hand, so they do not carry over. In my 2012 PhD (at Stanford under Chris Manning and Dan Jurafsky, with a committee including Tapan Parikh at UC Berkeley), I showed that methods that were significant in typical languages were not significant in English.
In order to understand how computational linguistics applies to the full breadth of human communication, we need to test the technology across a representative diversity of languages. For vocabulary, word order, morphology, standardization of spelling, and more, English is an outlier, telling us little about how well a result applies to the 95% of the world’s communications that are in other languages.
Greater scientific diversity
Researchers don’t need to tie their goals to what industry is demanding (it would all be domain adaptation if so), but science does need to be tied to the world. I hope there is greater linguistic diversity present at future ACL conferences.