Using R to process large data files
R is said to be slow, memory hungry and only capable of handling small datasets. Indeed, R is not very effective in importing large amounts of raw data. Calculations are typically rather efficient, but data loading can cause problems. An example of handling a large dataset is presented with some solutions to data handling. This example is probably not even close to being optimal, and is only meant to outline the methods for processing large data sets using R.
Example data
The example dataset, downloaded from ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/TABM/E-TABM-185/hgu133a_gcrma_reduced.txt, was about 2.2 GBs in size. In contains numerical data in 22284 rows and 5897 columns, including row and column names. All the analyses were carried out using the Murska server.
How to find out the dimensions of a file?
The dimensions of the example file were given, but how can the dimensions be found out, if the file is previously unknown? This knowledge is needed, since the example R scripts to be introduced assume that the dimensions of the file are known. The number of rows in the file can be found out using a few UNIX tools. Assuming the file is tab-delimited (as the example file is), extracting the first column using an awk command, and then redirecting the output to wc (word count) would do the trick:
awk '{print $1}' < hgu133a_gcrma_reduced.txt | wc –l Or the same from inside R:
system(“awk '{print $1}' < hgu133a_gcrma_reduced.txt | wc –l”) The number of columns is slightly harder to resolve. One possibility is to read only the first line of the data file into R, and count the number of elements separated by tabs:
length(scan("hgu133a_gcrma_reduced.txt", nlines=1, sep="\t", what="character")) Reading the whole file of just a line at a time?
The R version 2.5.1 on Murska is a 64-bit version, so the whole data would fit simultaneously into the memory. However, processing a file one line at a time is more memory and time efficient than reading the whole file at once. There are at least two possible methods to read the file one line at a time. Those are scan( ) and readLines( ).
Using scan
The first solution that comes to mind is to use command scan( ) that has a parameter skip. With skip user can define how many lines from the beginning of the file to skip before reading n lines. Iterating over the number of skipped lines would then result into reading the file one line at a time. This is a memory efficient, but not time efficient way to read the data. Before reading n lines, scan needs first to read the number of lines specified by skip. If the file contains 22284 lines, this results into scan having to read nearly 250 million lines! Clearly, using scan is not the way to go.
Using readLines and connections
A better approach to scan is to use readLines and open a connection to the text file. This method is outlined below.
Here the aim is to read the data into a matrix. Let’s first create the matrix:
m<-matrix(nrow=22283, ncol=5896)
Note that the number of rows and columns are both one smaller than the dimensions of the input file, since the script used for reading the data will drop both row and column names, and there’s no need to reserve space to them in the matrix.
Now we can start reading the data. First we need to open a connection to the text file so that it does not need to be opened again for every single line:
filecon<-file("hgu133a_gcrma_reduced.txt", open="r") We can then find out the position of the first line:
pos<-seek(filecon, rw="r")
Next, knowing the number of rows, we can iterate over those in a loop. We probably want to get some indication of the run, so the first if-clause inside the loop just prints the iteration number in hundreds. Next we read one line into a variable tt. This variable is then further parsed into a variable tt2. Variable tt2 is a vector, and to put it into a certain row of the matrix m, it needs to be transposed. As the last step, we seek for the position of the next line. This loop gets rid of the row and column names, so if it is necessary to save them, parsing commands and the size of the matrix storing the parsed values need to be slightly modified (not shown).
for(i in 1:22284) {
if(i %% 100 == 0) {
print(i)
}
tt<-readLines(filecon, n=1)
tt2<-na.omit(as.numeric(unlist(strsplit(tt, "\t"))))
if(i!=1) {
m[(i-1),]<-t(tt2)
}
pos<-seek(filecon, rw="r")
}
After executing the loop, the file connection need to be closed:
close.connection(filecon)
Now the matrix m (and all other object in the memory) can be saved in R binary format with the command:
save.image()
or saving only the matrix m:
save(m, file=“.Rdata”)
This creates a file .Rdata that can be loaded into R memory just by starting R in the same directory where .Rdata resides. This binary file is loaded into R memory very fast compared to reading and parsing the original text file. As reading text files is slow, parsing them more than once should be avoided, if at all possible. The file size is slightly more compact than a database (see below), taking only 1.15 GBs of disk space.
Note that in the example above the data was imported into a matrix. A matrix was used instead of a data frame, because assigning new data to a data frame is significantly slower than assigning to a matrix.
There are also two if-statements inside the loop that slow down the execution, but their contribution to the execution time is very small compared to the reading and parsing operations.
Parsing the data into a database
As an alternative to an R binary file, the data can be stored directly into a database. The following method is a slight modification from the previous method that read the file one line at a time to an R object. Here we additionally open a connection (dbcon) to an SQLite database. Every line read and parsed from the text file is immediately saved in the SQLite database table results using the command dbWriteTable. Here, only 737 first values are saved in the database. Note that the new line is appended to the existing table in the database.
library(RSQLite)
filecon<-file("hgu133a_gcrma_reduced.txt", open="r")
dbcon<- dbConnect(dbDriver("SQLite"), dbname="E-TABM-185.dbms")
pos<-seek(filecon, rw="r")
for(i in 1:22284) {
if(i %% 100 == 0) {
print(i)
}
tt<-readLines(filecon, n=1)
tt2<-na.omit(as.numeric(unlist(strsplit(tt, "\t"))))
if(i!=1) {
dbWriteTable(conn=dbcon, name="results",
value=as.data.frame(t(tt2[1:737])), append=T)
}
pos<-c(pos, seek(filecon, rw="r"))
}
dbDisconnect(dbcon)
close.connection(filecon)
Reading into a database seem to be slightly faster than reading into an R object. However, the database consumes a little bit more space, about 1.3 GBs.
The idea behind reading the data into a database is, of course, letting the database do some of the work. Database management systems are typically very efficient in, for example, filtering the data. Of course, the filtering steps might be combined into the data reading and parsing inside the R loop, if possible, but they will slow down the parsing process.
Summary
It took less than 2 hours to read in the example file one line at a time, and to save it as an R binary file. This binary file took about 1.15 GBs of disk space, and it was possible to load the binary file into R memory in about 18 minutes. Thus, if possible, the large files that are to be processed in R, should preferentially be read line-by-line. The processed data may be stored in a database, but for certain purposes it is also possible to save the data as an R binary file.