The Language Bank of Finland serves digital humanities and social sciences

The Language Bank of Finland is a service built by Finnish universities and research institutes. It is coordinated by the University of Helsinki, and CSC provides its technical solutions. Together, they form the FIN-CLARIN consortium, which in turn is the Finnish part of the European CLARIN infrastructure.

The Language Bank began as CSC’s science support branch for language research, which went on to build close collaboration first with the University of Helsinki and later with other Finnish universities and research institutes. The collaboration has since gone international via CLARIN, multiplying the language resources (corpora and tools) at researchers’ disposal. In the last few years, the Language Bank has also actively reached out to other fields in the digital humanities and social sciences.

– The Language Bank accepts text, audio and video corpora. Compiling and preprocessing data requires a lot of work, and one corpus can be used in many kinds of research, which is why depositing data makes a lot of sense, says Mietta Lennes, project planning officer at the University of Helsinki.

In addition to research, the Language Bank’s resources are widely used in teaching. Mietta Lennes is one of the instructors on several annual online courses, open to all fields of the humanities and social sciences, that introduce the Language Bank’s services and teach how to use and process corpora.

– The students are delighted to discover what a wide variety of phenomena can be studied in text and speech samples, as long as you know which methods to use, Lennes says.

At present, the Language Bank contains approximately 20 billion written words and over 10,000 hours of audio and video recordings. The text corpora consist mainly of newspapers and magazines, literature and social media discussions. The speech data includes the plenary sessions of the Parliament of Finland as well as samples of Finnish dialects and of the spoken language of Helsinki.

The Language Bank also has corpora in sign language. In addition, the Language Bank grants access to various tools that facilitate studying and processing text or speech data.

The Language Bank offers instructions and support for the different phases of producing and compiling corpora. To maximize usability, it is important to consider publication matters as early in the process as possible. That way, the published version of the corpus can carry an end-user license that allows the intended kind of use, whether the corpus is freely available, protected by an individual access application procedure, or anything in between.

Every language resource deposited in the Language Bank is assigned at least one persistent identifier (PID). The PID ensures that the corpus or tool (and a specific version of it) and its metadata can still be located years later, even as systems and websites evolve. The Language Bank primarily uses URNs (Uniform Resource Names) as identifiers. Providing a PID also makes it easier to reproduce any given study that used the language resource.
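As a rough illustration of why a URN stays stable while web addresses change, the following minimal Python sketch splits a URN into its namespace identifier and namespace-specific string, then prepends a resolver address to obtain a clickable link. The example identifier and the resolver address `https://urn.fi/` are assumptions made for illustration only; the identifier does not refer to any actual resource, and the function names are hypothetical.

```python
def parse_urn(urn: str) -> tuple[str, str]:
    """Split 'urn:<NID>:<NSS>' into (namespace identifier, namespace-specific string)."""
    scheme, _, rest = urn.partition(":")
    nid, _, nss = rest.partition(":")
    if scheme.lower() != "urn" or not nid or not nss:
        raise ValueError(f"not a URN: {urn!r}")
    # The namespace identifier is case-insensitive, so normalize it.
    return nid.lower(), nss


def resolver_url(urn: str, base: str = "https://urn.fi/") -> str:
    """Build a link by prepending a resolver address (assumed here) to the URN."""
    parse_urn(urn)  # raises ValueError if the string is not a URN
    return base + urn


# Hypothetical identifier, for illustration only.
EXAMPLE_URN = "urn:nbn:fi:lb-0000000000"

if __name__ == "__main__":
    print(parse_urn(EXAMPLE_URN))
    print(resolver_url(EXAMPLE_URN))
```

The identifier itself never changes; only the resolver decides where it currently points, which is what allows a citation to remain valid after a website is reorganized.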

It is also essential for language resources (and their producers) that they get cited. Following a uniform citation pattern directs the credit for creating and curating the corpus where it is due. The Language Bank facilitates this by offering automatically formatted citations for each corpus, in three different formats. A Google Scholar query to find out how many times a language resource has been cited can be run directly via the corpus list.


Tero Aalto

The author is a language technologist and works with the Language Bank of Finland.