July 27th, 2015 by Rob

The largest annual conference in computational linguistics is in Beijing this year. Over the next few days, the world’s top researchers will present their latest research at ACL 2015. Here are the languages they are studying:

Languages at ACL 2015. Comparing each language's prevalence (actual speakers) with the number of ACL 2015 papers that focused on it shows that almost every major world language is under-represented except English. The languages are ordered from fewest to most speakers. The shading from green to red indicates representativeness, with deep green indicating the most papers per actual speaker and deep red indicating the fewest papers per actual speaker.

Calculating the languages studied

We took the languages mentioned in the ~300 abstracts, counting each one named. When an abstract named multiple languages, we divided that paper's count equally between them. We employed crowdsourced workers for the task, asking them to read each abstract, report the number of languages studied, and list them when named. We took the mode response of four workers per abstract, and escalated ~10 papers to an NLP PhD when the crowdsourced workers disagreed.
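The counting scheme above can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: the function names are hypothetical, and treating a tie for the most common answer as the "disagreement" that triggers expert review is an assumption.

```python
from collections import Counter, defaultdict


def mode_or_escalate(responses):
    """Return the most common worker response, or None when the top answers tie.

    Each response is a tuple of language names so that it is hashable.
    Assumption: a tie for first place is what counts as disagreement.
    """
    ranked = Counter(responses).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear mode: send the abstract to an expert
    return ranked[0][0]


def fractional_counts(papers):
    """Give each paper a weight of 1, split evenly among the languages it names."""
    totals = defaultdict(float)
    for languages in papers:
        for language in languages:
            totals[language] += 1.0 / len(languages)
    return dict(totals)


# Hypothetical worker answers for one abstract: three of four agree.
agreed = mode_or_escalate(
    [("English",), ("English",), ("English", "Spanish"), ("English",)]
)

# Hypothetical papers: one on English, one on two languages, one on four.
counts = fractional_counts(
    [["English"], ["English", "Chinese"], ["Kinyarwanda", "A", "B", "C"]]
)
```

Under this scheme, a language that appears in a paper covering many languages receives only a small fraction of that paper's weight.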

The over-represented languages

Only English is over-represented among widely-spoken languages. Despite the conference being hosted in China, there are still fewer papers on the Chinese language at ACL 2015 than we would expect given the overall number of speakers worldwide: 5% of presented research vs 15% of speakers globally.

The other over-represented languages are over-represented by virtue of having few speakers, not by having too much research. Akawaio has about 5,350 speakers and Aukan around 25,000 speakers, so any mention is enough to make them seem over-represented. The Braille system has about 100,000 users, but in this case it could also be thought of as an alternate writing system for English rather than a separate language. The single paper on Tibetan (or more accurately, languages in the Tibetic continuum) makes it twice as prevalent in ACL 2015 as in the wider world.
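The comparison behind these figures is a simple ratio of a language's share of papers to its share of speakers. The sketch below is illustrative only: the function name is hypothetical, and the only inputs taken from this post are the Chinese figures (5% of research, 15% of speakers).

```python
def representation_ratio(paper_share, speaker_share):
    """Share of papers divided by share of speakers.

    A value of 1.0 means proportional representation, above 1.0 means
    over-represented, and below 1.0 means under-represented.
    """
    if speaker_share <= 0:
        raise ValueError("speaker_share must be positive")
    return paper_share / speaker_share


# Chinese at ACL 2015: 5% of presented research vs 15% of speakers globally.
chinese = representation_ratio(0.05, 0.15)  # roughly one third
```

A ratio of 2.0 would correspond to a language being twice as prevalent at the conference as in the wider world, as with Tibetan above.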

It’s safe to say that these languages are not represented at other computational linguistics conferences this year, so in general all languages other than English are under-represented. In fact, 81% of the papers didn’t even state what language they studied, and we had to infer from the content that they only looked at English.

The least represented languages

The most under-represented named language is Cakchiquel (a dated spelling; the language is more commonly called Kaqchikel or Kaqchiquel today). By virtue of being part of a paper looking at 100 different languages, it had the least representation of any named language in the proceedings relative to its 4.5 million speakers.

The next most under-represented named language is Spanish. Despite the large number of resources available for Spanish, it simply isn’t well-represented relative to its number of speakers. Kinyarwanda (literally, the language of Rwanda) is one of four languages in a paper, but this wasn’t enough to give it a share at ACL 2015 equal to its share of actual conversations.

Why it’s important to study more languages

Most advances in NLP are improvements of 2-3%. Most improvements of that size are specific to the problem and language at hand, so they do not carry over. In my 2012 PhD (at Stanford under Chris Manning and Dan Jurafsky, with a committee including Tapan Parikh at UC Berkeley), I showed that methods that were significant in typical languages were not significant in English.

In order to understand how computational linguistics applies to the full breadth of human communication, we need to test the technology across a representative diversity of languages. For vocabulary, word order, morphology, standardization of spelling, and more, English is an outlier, telling us little about how well a result applies to the 95% of the world’s communications that are in other languages.

Great scientific diversity

Researchers don’t need to tie their goals to what industry is demanding (it would all be domain adaptation if so), but science does need to be tied to the world. I hope there is greater linguistic diversity present at future ACL conferences.

Robert Munro
July, 2015
