The lost archives of the world wide web
10.05.2000
Nedlib archiver robot coded by CSC
Nedlib (Networked European Deposit Library) is a collaborative project of European national libraries. It aims to develop a common architectural framework and basic tools for building a deposit library for electronic publications. CSC, the Finnish national center for scientific computing, has been responsible for designing and building the web archiver software, i.e. a harvester that retrieves, packs, and archives documents from Web servers.
The archiver is primarily intended for national libraries that collect and store Web documents as part of their (legal) deposit activities. It may also be used by any other organisation that wants to archive its own Web documents.
The Nedlib harvester has not yet been in production use. The biggest test so far was a round in which about 500 000 URLs were collected in 20 hours. The peak performance of the robot was 35 000 URLs per hour, with 10 harvester processes running. The software is available in the public domain. The harvester consists of about 10 000 lines of C code.
Web harvesting is a simple process in theory. You start with a set of URLs. You pop one URL from this set and retrieve the related document from the web. From this document you extract the URLs it contains and discard duplicates and URLs that have already been fetched in this harvesting round. New, unfetched URLs are appended to the set. Then you pop the next URL and repeat the process until the set is empty.
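The loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Nedlib harvester itself (which is written in C). The names harvest, LinkParser and fetch are made up for this example, the fetch function is supplied by the caller, and the small in-memory SITE stands in for real web servers.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href values of <a> tags from one HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def harvest(seed_urls, fetch):
    """Run one harvesting round (illustrative sketch).

    fetch(url) is a caller-supplied function that returns the document
    text, or None when the document cannot be retrieved.
    Returns (fetched, broken): a dict of url -> text, and a list of URLs
    that could not be retrieved.
    """
    pending = deque()  # URLs waiting to be fetched
    seen = set()       # every URL queued so far in this round

    def enqueue(url):
        if url not in seen:  # drop duplicates and already-queued URLs
            seen.add(url)
            pending.append(url)

    for url in seed_urls:
        enqueue(url)

    fetched, broken = {}, []
    while pending:  # repeat until the set of URLs is empty
        url = pending.popleft()
        text = fetch(url)
        if text is None:
            broken.append(url)
            continue
        fetched[url] = text

        parser = LinkParser()
        parser.feed(text)
        for href in parser.hrefs:
            # Resolve relative links and strip any "#fragment" part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            enqueue(absolute)

    return fetched, broken


# A tiny made-up site, so the sketch runs without network access.
SITE = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="/missing.html">x</a>',
    "http://example.org/b.html": '<a href="a.html#top">A again</a>',
}

fetched, broken = harvest(["http://example.org/"], SITE.get)
```

Here the three existing pages are each fetched once, even though they link to one another, and the missing page ends up in the list of broken links. A real harvester would also have to deal with things this sketch ignores, such as robots.txt, politeness delays, non-HTML content and running several processes in parallel.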
Although the program was developed purely for archiving, the robot can be applied to other tasks, too. You can verify the validity of links on web servers, because the program keeps track of broken links and inserts them into the log table. Or you can analyze the content of web servers by exploring the collected metadata. You can easily find the number of URLs, how many HTML pages or GIF images there were, and so on.
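As a rough illustration of this kind of analysis, the Python sketch below counts documents by type and lists broken links. The record layout is an assumption made for this example: the fields url, status and content_type are hypothetical and do not describe the harvester's actual metadata or log table.

```python
from collections import Counter

# Hypothetical metadata records; the field names are illustrative only.
RECORDS = [
    {"url": "http://example.org/", "status": 200, "content_type": "text/html"},
    {"url": "http://example.org/a.html", "status": 200, "content_type": "text/html"},
    {"url": "http://example.org/logo.gif", "status": 200, "content_type": "image/gif"},
    {"url": "http://example.org/missing.html", "status": 404, "content_type": None},
]


def summarize(records):
    """Count harvested documents by type and list the broken links."""
    retrieved = [r for r in records if r["status"] == 200]
    by_type = Counter(r["content_type"] for r in retrieved)
    broken = [r["url"] for r in records if r["status"] >= 400]
    return {"total": len(records), "by_type": by_type, "broken": broken}


summary = summarize(RECORDS)
```

With the sample records, summary["by_type"] shows how many HTML pages and GIF images were collected, and summary["broken"] holds the links that could not be retrieved.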
The project was launched on 1st January 1998 with funding from the European Commission's Telematics Applications Programme. The project will continue to the end of 2000 and is coordinated by the Koninklijke Bibliotheek, the National Library of the Netherlands. The project partner here in Finland is the Finnish national library, Helsinki University Library (HUL). CSC has worked as a subcontractor for HUL.