High-Performance Digitisation

Funding source: INEA

Project duration: 1.9.2018 – 31.8.2020


Efficient use of the growing volume of digital data is seriously hampered by poor search functions and findability, which stem from deficits in data quality and missing metadata. Metadata has traditionally been added manually, and errors or ambiguities in the data, such as those that occur very frequently in digitisation processes (OCR), have made full-text search and automatic annotation difficult. The rapidly growing data resources, combined with advances in computational power, can however be managed better and turned into a real advantage by harnessing new technologies for artificial intelligence and machine learning. This offers an opportunity to substantially enhance data quality and thereby boost the value of the data.

The main objective of the High-Performance Digitisation initiative is to create an intelligent annotation pipeline and a corresponding service for processing and enriching archived material, such as scanned newspapers, books and official documents. This material is digitised on a large scale, but the usability of the archives could and should be much higher. The new pipeline runs in a supercomputing environment and uses high-performance, GPU-accelerated machine learning methods for computer vision and AI-based annotation. The High-Performance Digitisation initiative collects the required data from collaborating archives, trains advanced machine learning models, implements a production-quality software pipeline and service, and finally integrates the results back into the data sources.

The pipeline will be developed into a 24/7 service that will be offered to memory organisations such as libraries and archives. The data and metadata standards used by the pipeline will be designed in collaboration with domain experts at the National Library of Finland and the National Archives of Finland. The delivered pipeline will be available as open source software, making it possible to provide supplementary and interoperable services across Europe. The pipeline will be taken into sustainable production use on CSC's cloud computing platform. In the High-Performance Digitisation initiative, CSC works with the National Library of Finland and the National Archives of Finland and extends its European network for wider collaboration.

The end users of the solution are citizens, researchers, businesses and public administration, who are served via the national portals of the National Library of Finland and the National Archives of Finland. The output is also available for harvesting by the European Data Portal and the Europeana portal. The added value to the end users lies in the annotated documents and especially images: the vast masses of scanned archive images can be considered inaccessible in their current state, because the human labour required to organise and discover content is very high. The High-Performance Digitisation initiative provides a solution to this problem and opens up unique datasets for public use and refining.