Thanks to advanced measurement methods, collecting data keeps getting easier and quicker. Computational speed and performance have also grown exponentially, which allows scientists to use more interesting and demanding computing methods than before.
The speed of development is illustrated by the fact that the first human genome was revealed in 2001, as the culmination of years of effort, whereas today a corresponding result could be reached within a few days. In the future, interesting scenarios for DNA technologies are expected in, for example, the development of precision drugs and, perhaps, criminal investigation.
The downside of this development is the huge growth of data volumes, to the point that the compiled data cannot always be processed. The information flood is also seen in academic research, where ever more powerful measuring methods produce an enormous flow of bits, and the most interesting items of information have to be extracted from the mass.
Searching for the most refined algorithm
Hence, ALGODAN, the Finnish Centre of Excellence for Algorithmic Data Analysis Research, headed by Professor Esko Ukkonen, is busy: the group concentrates on computational data analysis and data mining. The joint group of the University of Helsinki and Helsinki University of Technology is dedicated to developing new, functional solution methods, algorithms, to support new applications. Computer programs are created based on these algorithms, and the programs are able to harvest data from quite massive data volumes.
“We develop methods working together with researchers specializing in molecular biology, medicine, environmental science, linguistics, and information technology, which means that the end result is always a sum total of cross-cultural and close cooperation,” Ukkonen emphasizes.
The development work starts when the researchers and the application specialist define the boundaries of a computationally important and interesting problem. The research problem could be, for example, how genes regulate each other.
The problem is formulated mathematically by defining its concepts and a mathematical model, and the next job is to develop a solution method that is as efficient as possible. Finally, the algorithm is converted into a computer program, which is fine-tuned by running it on real data.
“Usually a computer analysis suggests an interesting finding that needs to be verified through empirical work. A bit of luck is also needed, because usually a choice has to be made from several things that might be suggested,” says Ukkonen.
Often, from the application specialist's point of view, simply visualizing and illustrating the data is sufficient. Algorithmically, visualization is rather simple, but it helps the application specialist to move on.
“Data integration is another important issue during the process. Often the analysis contains several different types of data. For example, when the behavior of the genome is being studied, the actual DNA sequence must be taken into account, and in addition to that, gene-specific activity can be measured separately,” Ukkonen explains.
The human genome is a sensitive entity
Esko Ukkonen’s own specialty is cross-disciplinary bioinformatics, which brings together not only biosciences and medical sciences but also data processing, mathematics, and statistics. The field gained popularity in the late 1970s and early 1980s, when DNA sequencing methods started to develop.
Today, the DNA sequences of several living organisms have been determined and hence, there is plenty of data available to serve as source material for new analyses.
“The starting ground is particularly well prepared for DNA handling, because it is in character string format, just like text in a natural language. This is the type of data that computer scientists are used to processing,” says Ukkonen.
Although it seems that there are no limits to the possibilities in bioinformatics, Ukkonen emphasizes that the foundation for all development is still basic research. Since the human genome was sequenced, research in bioinformatics has focused especially on how the genome works and how genes regulate one another.
The genome may be thought of as a string of roughly three billion characters, each of which is A, C, T, or G. Many problems of biology can be solved by studying the properties of these character sequences.
When one of the characters in the DNA string is replaced by another one, we are dealing with a genetic error, a mutation. This type of change can have a strong effect on the properties or morbidity of the human carrier. Lactose intolerance, for example, is a result of just one character being changed in the string: if there is a C instead of a T at a particular position, the person has a reduced tolerance for milk.
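To make the idea of a single-character substitution concrete, here is a minimal Python sketch that compares two aligned strings and reports every position where they differ. The function name and both short sequences are invented for illustration; they are not real genomic data, and real analyses must first align the sequences.

```python
def point_mutations(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_char, sample_char) for every mismatch.

    Assumes the two strings are already aligned and equally long.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


# Hypothetical toy sequences, not taken from any real genome.
reference = "GATTACATTG"
sample = "GATTACACTG"

for position, ref_char, new_char in point_mutations(reference, sample):
    print(f"position {position}: {ref_char} -> {new_char}")
```

At genome scale the comparison itself stays this simple; the hard part is lining up three billion characters so that corresponding positions can be compared at all.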
Research has revealed predisposition to cancer
Ukkonen’s research group has also participated in mapping corresponding genetic anomalies.
“Our most recent finding is associated with predisposition to cancer of the large intestine. The work conducted jointly by the groups of Academy Professors Lauri Aaltonen and Jussi Taipale showed that people with a certain type of mutation in their DNA sequence are more susceptible to cancer of the large intestine,” Ukkonen explains.
In this case it is also a matter of a single character substitution. When T is mutated into G, predisposition to cancer is increased by 50%. The result is interesting, because cancer of the large intestine is one of the most common cancers in the world; approximately 75% of Europeans and almost all Africans carry this risk-increasing form in their genome.
A program developed by Ukkonen’s group, EEL (enhancer element locator), was used in this research. The program scans DNA sequences looking for regions where a large number of proteins are bound side by side and regulate gene expression.
“Since the program is very computationally intensive, a large part of the computation has been performed on CSC’s supercomputers,” says Ukkonen.
Putting the pieces together requires robust machinery
The DNA character strings are always assembled from shorter strings, which are then analyzed using various algorithmic techniques.
“The DNA is written into chromosomes, which have to be opened before we can read the short DNA strings. This leads to a jigsaw puzzle: from the randomly chosen short strings we have to be able to reassemble a complete strand, which is our actual target,” Ukkonen explains.
For example, when the human genome was being assembled for the first time, there were 30 million strings that had to be combined into a strand that contained roughly three billion characters. Thanks to powerful algorithms, it was possible to assemble the puzzle from such massive data.
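The jigsaw puzzle can be illustrated with a toy greedy strategy: repeatedly merge the two fragments that overlap the most until only one string remains. This is only a sketch under simplifying assumptions (error-free fragments, no repeated regions, a handful of reads), and it is not a description of the method actually used to assemble the human genome. The function names and the example reads are invented.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str]) -> str:
    """Merge reads greedily by largest overlap; returns one combined string."""
    # Drop exact duplicates and reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]

    while len(reads) > 1:
        best_k, best_i, best_j = -1, 0, 0
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Glue the best pair together, keeping the shared part only once.
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [
            r for idx, r in enumerate(reads) if idx not in (best_i, best_j)
        ] + [merged]

    return reads[0] if reads else ""


# Hypothetical short reads covering the toy strand "ACGTTGCAAGGC".
reads = ["CAAGGC", "ACGTTGC", "TTGCAAG"]
print(greedy_assemble(reads))
```

The sketch compares every pair of fragments in every round, which is hopeless for tens of millions of strings; this is exactly where more refined algorithms and powerful machines come in.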
At least for the moment, this is how things are done. Ukkonen believes that one day it will be possible to read the string of characters directly from one end to the other, determining the characters one by one, without having to assemble the pieces. Even then, there will still be the challenge of understanding what is written on the DNA character strand.
The sequence makes, as Ukkonen puts it, miraculous things happen in the cell. For instance, if you take a long enough section of DNA, you will find that it contains sections that repeat as exact copies.
“One study comprised a human DNA character string containing roughly 50 million characters, and a string of approximately 2500 characters occurred twice in it. It is extremely unlikely that a string of this length would recur by chance, so it indicates some sort of copying,” says Ukkonen.
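Finding such a repeat amounts to finding the longest substring that occurs at least twice. The Python sketch below does this naively by sorting all suffixes of the string and comparing neighbours. It only shows the idea on a toy input: this approach needs far too much memory for 50 million characters, where index structures such as suffix trees or suffix arrays are used instead. The function name and the example string are invented.

```python
def longest_repeated_substring(text: str) -> str:
    """Return the longest substring that occurs at least twice in `text`.

    Occurrences may overlap. Naive approach: after sorting all suffixes,
    the longest repeat is the longest common prefix of two neighbours.
    """
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best


# Hypothetical toy sequence in which "ACGGAT" appears twice.
sequence = "TTACGGATCCAACGGATG"
print(longest_repeated_substring(sequence))
```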
Although better and better algorithms are continuously being developed, genome-wide analyses cannot be performed without robust machines.
“Finding such repeating strings, for example, represents a demanding algorithmic problem, so besides state-of-the-art methods we also need powerful supercomputers,” says Ukkonen.
From fossil findings to automatic language translation
The Centre of Excellence headed by Ukkonen does not concentrate exclusively on molecular biology research. They have investigated, for example, paleontological data, such as fossil findings, which have been studied using algorithmic techniques to allocate them to the correct time scale.
The approach applied by the unit can also be applied in other areas, for example, in analyzing music. Without the name of the artist or piece of music, it is difficult to locate the desired piece in a digital music database. Since music notation is text, just like DNA, it can be processed by using suitable algorithms, and the desired piece of music can be found just by humming or whistling a tune.
“Algorithms make it possible to distinguish between different genres of music. By listening to, say, Mozart or Bach, it is usually easy to notice that the styles are different, but what exactly is it that the composer has done in order to attain his characteristic unique sound?”
Today, classification of news is also increasingly automatic. There is a constant flow of short, up-to-date news articles appearing on the internet; we need to know whether a particular item is new or already in the database. Automation makes it easier to pick up the signal from the ocean of data. A similar, already widely used application is junk mail filtering, which means that only a small number of undesired messages get through to the recipient.
Automatic language translation is also being developed. Immense databases have been compiled for different language pairs; a good example is the document collection of the EU. We can teach the computer how different linguistic patterns in each language are translated. Here, too, the massive amount of data is the key.
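One way to picture tune search as a string problem is the Python sketch below, which is an illustrative assumption and not necessarily the unit's own method. It turns each melody into the sequence of pitch steps between consecutive notes, so the key in which the tune is hummed does not matter, and then computes the smallest edit distance between the hummed query and any stretch of the stored melody. The function names and the note numbers (MIDI-style pitches) are invented for the example.

```python
def intervals(pitches: list[int]) -> list[int]:
    """Pitch steps between consecutive notes (unchanged by transposition)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def best_match_distance(query: list[int], target: list[int]) -> int:
    """Smallest edit distance between `query` and any substring of `target`.

    Standard dynamic programming for approximate matching: the first row
    is all zeros, so a match may start anywhere in the target.
    """
    prev = [0] * (len(target) + 1)
    for i, q in enumerate(query, 1):
        cur = [i]
        for j, t in enumerate(target, 1):
            cur.append(min(
                prev[j] + 1,             # skip a query symbol
                cur[j - 1] + 1,          # skip a target symbol
                prev[j - 1] + (q != t),  # match or substitute
            ))
        prev = cur
    return min(prev)


# Hypothetical data: a stored tune, the same tune hummed two semitones
# higher, and an unrelated rising scale.
stored = [60, 60, 67, 67, 69, 69, 67]
hummed = [62, 62, 69, 69, 71, 71, 69]
scale = [60, 62, 64, 65, 67, 69, 71]

print(best_match_distance(intervals(hummed), intervals(stored)))
print(best_match_distance(intervals(hummed), intervals(scale)))
```

A distance of zero means the hummed tune occurs exactly somewhere in the stored melody; small non-zero distances tolerate a few wrong or missing notes.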
Katja Liesilinna
Image © Yhtyneet kuvalehdet/Sanakunta/Jorma Marstio
Additional information
The Finnish Centre of Excellence for Algorithmic Data Analysis Research
The Academy of Finland's page on CoE Algodan
Tuupanen S. et al. The common colorectal predisposition SNP rs6983267 at chromosome 8q24 confers potential to enhanced Wnt signaling. Nature Genetics, Vol. 41, number 8, August 2009.