What a researcher should know about persistent identifiers

Even the most useful information has no practical value if nobody can access it. Digitalization only aggravates the problem. While it is rare for a library or a book to suddenly disappear or be replaced by something completely different, websites do that all the time. Easier and faster searching only goes halfway to solving the problem.

Persistent identifiers (PIDs) are the internet’s way of pointing at documents and other items that should remain findable for as long as possible. As the name suggests, the idea is that once a persistent identifier has been assigned, neither the passing of years, changes in personnel, nor evolving systems will stop a user from finding the identifier’s intended target. Conventional website addresses are not very persistent. Web infrastructures evolve constantly, locations change, and even the contents of a particular page may end up wildly different despite the page itself still being there.

Persistence is especially important for scientific knowledge. Reliability is the most obvious factor: readers must be able to trust that research results are what they claim to be. This already raises a number of questions, because reliably discovering a certain publication is not enough on its own. The reader needs to be able to trust the paper’s sources and other references as well.

Repeatability is another factor that contributes to reliable research and for which persistence is at least as important. There is no way to test a given result if the source material is not available. Repeating a study or a part thereof requires knowing the original data as well as its version. The same applies to the tools used in the study, as they most likely also come in a variety of versions, in the worst case producing significantly different results.

For the same reason, it is a good idea to provide persistent identifiers for one’s own publications and published research data, so that others can refer to them in turn. Identifiers not only increase visibility and findability; they also make it easier to track, and compile statistics on, how many times and in which contexts a document or body of data has been referred to.

What, then, are potential recipients of persistent identifiers? Publications and data are easy answers but not the only ones. Sometimes parts of a dataset, or the location where they are stored, also deserve to be reliably pointed at. One very relevant target is metadata that describes, for example, the type and amount of the data and the license under which it can be used.

There are many different persistent identifier systems, the oldest of which are almost as old as the web itself. A URN (Uniform Resource Name) works similarly to a traditional URL (Uniform Resource Locator), except that a URN always belongs to a namespace and is globally unique. For example, an ISBN (International Standard Book Number) that identifies a book can be a part of a URN. In Finland, URNs are administered by the National Library.
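The namespace structure of a URN can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `split_urn` and the regular expression are assumptions made for illustration, and the pattern is a rough approximation of the URN syntax rather than a full validator. The example value is this publication’s own URN.

```python
import re

# Simplified URN pattern: "urn:" + namespace identifier + ":" + namespace-specific string.
# An approximation for illustration only, not a complete syntax check.
URN_PATTERN = re.compile(r"^urn:([a-z0-9][a-z0-9-]{0,30}[a-z0-9]):(.+)$", re.IGNORECASE)


def split_urn(urn: str) -> tuple[str, str]:
    """Return (namespace, namespace_specific_string) for a URN."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a URN: {urn!r}")
    # The namespace identifier is case-insensitive, so normalise it to lower case.
    return match.group(1).lower(), match.group(2)


namespace, rest = split_urn("urn:nbn:fi:lb-202004215")
print(namespace)  # nbn
print(rest)       # fi:lb-202004215
```

Here `nbn` is the National Bibliography Number namespace, and everything after it is interpreted according to that namespace’s own rules.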

Another widely used system is Handle, developed by CNRI (Corporation for National Research Initiatives). CSC supports both URN and Handle. DOI (Digital Object Identifier) is an implementation of Handle that also carries metadata, and it has been standardized by ISO (International Organization for Standardization). DOIs granted by DataCite include essential metadata and are suitable for research data. DOIs of this kind are provided by EUDAT B2Share and soon also by Fairdata.fi.
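As a rough illustration of how a DOI is put together, the sketch below splits a DOI name into its prefix and suffix and builds a link through the public doi.org resolver. The helper names `split_doi` and `doi_url` are hypothetical, and the DOI used is made up for the example; it does not point at a real object.

```python
from urllib.parse import quote


def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI name into (prefix, suffix) at the first slash."""
    prefix, sep, suffix = doi.partition("/")
    # DOI prefixes start with "10."; both parts must be non-empty.
    if not (prefix.startswith("10.") and sep and suffix):
        raise ValueError(f"not a DOI name: {doi!r}")
    return prefix, suffix


def doi_url(doi: str) -> str:
    """Build a resolver link for a DOI name via the doi.org proxy."""
    split_doi(doi)  # raises ValueError if the name is malformed
    # Percent-encode characters that are not safe in a URL path.
    return "https://doi.org/" + quote(doi, safe="/")


# "10.1234/example-dataset" is a made-up DOI used only for illustration.
print(split_doi("10.1234/example-dataset"))  # ('10.1234', 'example-dataset')
print(doi_url("10.1234/example-dataset"))    # https://doi.org/10.1234/example-dataset
```

The prefix identifies who registered the identifier, while the suffix is chosen by the registrant. The resolver, not the reader, is responsible for knowing where the target currently lives.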

Identifiers are also assigned to persons and other entities. Identifying and finding researchers is vital and can be done using an ORCID (Open Researcher and Contributor ID). For organizations, there are, for example, ROR (Research Organization Registry) identifiers. Persons and organizations should be properly identified because both may move from one place to another or change their name over time.
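An identifier for a person is only useful if it is typed correctly, which is why an ORCID iD ends in a check character computed from the preceding digits (ISO 7064 MOD 11-2). The following is a minimal sketch of that check. The function names are assumptions made for illustration, and the iD in the example is the fictitious sample researcher used in ORCID’s own documentation.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    # A result of 10 is written as the letter X.
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, the digits, and the final check character of an ORCID iD."""
    compact = orcid.replace("-", "")
    base = compact[:15]
    if len(compact) != 16 or not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_character(base) == compact[15].upper()


print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False
```

A passing check only shows that the iD is well formed. It does not prove that the iD has been registered or that it belongs to the intended person.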

However, not everything should have a persistent identifier. Data that resides in a permanent download location can be referred to, but pointing at a live data stream is not helpful. At the end of the day, even the best of identifiers is only as good as its maintenance. An out-of-date identifier is worse than no identifier at all. Every persistent identifier is a promise. Should the target move, the identifier must be updated to point at the new location. If the target is deleted for good, the identifier can be pointed at a tombstone (gravestone) page informing users about the target’s fate and what to do in its absence. An identifier should never be completely removed. If one cannot commit to PID maintenance (or delegate it to somebody else), perhaps assigning persistent identifiers was not a good idea in the first place.
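The maintenance promise described above can be sketched as a toy in-memory registry. This is not how any real PID service is implemented; the class names, method names, identifier, and URLs are all hypothetical and serve only to show the three cases: the target moves, the target is deleted, and the identifier itself is never removed.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PidRecord:
    target_url: Optional[str]  # None once the target has been deleted
    tombstone_note: str = ""   # shown to users when the target is gone


class PidRegistry:
    """Toy registry: identifiers can be updated but never removed."""

    def __init__(self) -> None:
        self._records: Dict[str, PidRecord] = {}

    def assign(self, pid: str, target_url: str) -> None:
        if pid in self._records:
            raise ValueError(f"{pid} is already assigned")
        self._records[pid] = PidRecord(target_url)

    def move(self, pid: str, new_url: str) -> None:
        # The target moved: keep the identifier, update where it points.
        self._records[pid].target_url = new_url

    def retire(self, pid: str, note: str) -> None:
        # The target was deleted: keep the identifier, point it at a tombstone.
        record = self._records[pid]
        record.target_url = None
        record.tombstone_note = note

    def resolve(self, pid: str) -> str:
        record = self._records[pid]
        if record.target_url is None:
            return f"TOMBSTONE: {record.tombstone_note}"
        return record.target_url


registry = PidRegistry()
registry.assign("pid:example-1", "https://old.example.org/data")
registry.move("pid:example-1", "https://new.example.org/data")
print(registry.resolve("pid:example-1"))  # https://new.example.org/data
registry.retire("pid:example-1", "Dataset withdrawn; contact the data owner.")
print(registry.resolve("pid:example-1"))  # TOMBSTONE: Dataset withdrawn; contact the data owner.
```

Note that the registry deliberately has no delete operation: once assigned, an identifier always resolves to something, even if that something is only an explanation.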

Another question is when the identifier’s target has changed enough to require a new identifier and when the original one can still be used. There is no universal answer; it ultimately depends on what the resource is intended for. Fixing a single character in a text corpus is usually not worth creating a new version for. On the other hand, even a minor change in source data may cause algorithms based on the data to start producing significantly different results. This is a decision for the data owner to make. One example of an approach modeled for a certain context is the Life cycle and metadata model of language resources used by the Language Bank of Finland, CSC’s service for language research and other digital humanities and social sciences.

Persistent identifiers benefit all parties when used properly. Data, publications and people will still be found years from now and can receive the attention and appreciation they deserve.

This publication’s persistent identifier: http://urn.fi/urn:nbn:fi:lb-202004215




Tero Aalto

The author is a language technologist and works with the Language Bank of Finland.