hcs

The Helsinki Corpus of Swahili

Description

The Helsinki Corpus of Swahili (HCS) is an annotated corpus of Standard Swahili text. It contains news texts from several current Swahili newspapers as well as from the news site of Deutsche Welle. It also contains extracts from a number of books containing prose text, including fiction, education and sciences.

HCS has been annotated with SALAMA (Swahili Language Manager), a multi-purpose language management environment developed at the University of Helsinki by Arvi Hurskainen, Professor of African languages. The annotation covers features such as the base form of the word (lemma), part of speech, and morphology, including noun class affiliation and verb morphology. It also includes the etymology of loan words and glosses in English.

Home Page: http://www.aakkl.helsinki.fi/cameel/corpus/intro.htm

Version and Size

Version: The corpus has no version information.

Size: The total size of the corpus is 12.5 million words.

Content and Structure

Subcollection title   Directory               Documents     Tokens
Alasiri               articles/alasiri/            2779    1125958
An-nuur               articles/annuur/              659     837990
Books                 books/                         72    1055425
Dwelle                articles/dwelle/             9831    2479606
Kasheshe              articles/kasheshe/              7      15388
Kiongozi              articles/kiongozi/             70     828256
Komesha               articles/komesha/               1       8948
Lengo                 articles/lengo/                 6       2347
Majira                articles/majira/             6992    3309197
Mfanyakazi            articles/mfanyakazi/           19       7503
Mzalendo              articles/mzalendo/            451     366307
Nipashe               articles/nipashe/            4086    2019471
Rai                   articles/rai/                  33     242281
Uhuru                 articles/uhuru/               816     311581
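
As a quick cross-check of the stated size, the rows of the table can be summed. The sketch below copies the figures from the table; the names SUBCOLLECTIONS, totals and token_share are illustrative only and are not part of any HCS tooling. With these figures the token column adds up to roughly 12.6 million, in line with the stated size of 12.5 million words.

```python
# Rows copied from the table above: (title, directory, documents, tokens).
SUBCOLLECTIONS = [
    ("Alasiri", "articles/alasiri/", 2779, 1125958),
    ("An-nuur", "articles/annuur/", 659, 837990),
    ("Books", "books/", 72, 1055425),
    ("Dwelle", "articles/dwelle/", 9831, 2479606),
    ("Kasheshe", "articles/kasheshe/", 7, 15388),
    ("Kiongozi", "articles/kiongozi/", 70, 828256),
    ("Komesha", "articles/komesha/", 1, 8948),
    ("Lengo", "articles/lengo/", 6, 2347),
    ("Majira", "articles/majira/", 6992, 3309197),
    ("Mfanyakazi", "articles/mfanyakazi/", 19, 7503),
    ("Mzalendo", "articles/mzalendo/", 451, 366307),
    ("Nipashe", "articles/nipashe/", 4086, 2019471),
    ("Rai", "articles/rai/", 33, 242281),
    ("Uhuru", "articles/uhuru/", 816, 311581),
]


def totals(rows):
    """Return (total documents, total tokens) over all rows."""
    return sum(r[2] for r in rows), sum(r[3] for r in rows)


def token_share(rows, title):
    """Return the fraction of all tokens that belongs to one subcollection."""
    _, all_tokens = totals(rows)
    for name, _directory, _documents, tokens in rows:
        if name == title:
            return tokens / all_tokens
    raise KeyError(title)


if __name__ == "__main__":
    documents, tokens = totals(SUBCOLLECTIONS)
    print(f"{documents} documents, {tokens} tokens")
    print(f"Majira share of tokens: {token_share(SUBCOLLECTIONS, 'Majira'):.1%}")
```

The same list can be sorted or filtered to compare subcollections, for example to see which newspapers contribute most of the material.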

Directory in the Corpus Server

/kielipankki/hcs/

Directory Listing

/kielipankki/hcs/articles/
/kielipankki/hcs/articles/alasiri
/kielipankki/hcs/articles/an-nuur
/kielipankki/hcs/articles/dwelle
/kielipankki/hcs/articles/kasheshe
/kielipankki/hcs/articles/kiongozi
/kielipankki/hcs/articles/komesha
/kielipankki/hcs/articles/lengo
/kielipankki/hcs/articles/majira
/kielipankki/hcs/articles/mfanyakazi
/kielipankki/hcs/articles/mzalendo
/kielipankki/hcs/articles/nipashe
/kielipankki/hcs/articles/rai
/kielipankki/hcs/articles/uhuru
/kielipankki/hcs/books/
/kielipankki/hcs/teksti
/kielipankki/hcs/teksti/swa
/kielipankki/hcs/teksti/swa/books
/kielipankki/hcs/teksti/swa/books/other
/kielipankki/hcs/teksti/swa/DE
/kielipankki/hcs/teksti/swa/DE/dwelle
/kielipankki/hcs/teksti/swa/KE
/kielipankki/hcs/teksti/swa/KE/lengo
/kielipankki/hcs/teksti/swa/TZ
/kielipankki/hcs/teksti/swa/TZ/alasiri
/kielipankki/hcs/teksti/swa/TZ/an-nuur
/kielipankki/hcs/teksti/swa/TZ/kasheshe
/kielipankki/hcs/teksti/swa/TZ/kiongozi
/kielipankki/hcs/teksti/swa/TZ/komesha
/kielipankki/hcs/teksti/swa/TZ/majira
/kielipankki/hcs/teksti/swa/TZ/mfanyakazi
/kielipankki/hcs/teksti/swa/TZ/mzalendo
/kielipankki/hcs/teksti/swa/TZ/nipashe
/kielipankki/hcs/teksti/swa/TZ/rai
/kielipankki/hcs/teksti/swa/TZ/uhuru
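
The listing shows two parallel trees: articles/ with one directory per source, and teksti/swa/ where the sources are grouped under DE, KE and TZ. A minimal sketch of building these paths follows. It assumes that DE, KE and TZ are country codes (Germany, Kenya, Tanzania), which this page does not state, and the names HCS_ROOT, COUNTRY_DIR, articles_dir and teksti_dir are hypothetical helpers, not existing tools on the server.

```python
from pathlib import PurePosixPath

# Root of the corpus on the server, as given in this document.
HCS_ROOT = PurePosixPath("/kielipankki/hcs")

# Grouping directory of each source under teksti/swa/, read off the listing above.
# Reading these as country codes is an assumption.
COUNTRY_DIR = {
    "dwelle": "DE",
    "lengo": "KE",
    "alasiri": "TZ",
    "an-nuur": "TZ",
    "kasheshe": "TZ",
    "kiongozi": "TZ",
    "komesha": "TZ",
    "majira": "TZ",
    "mfanyakazi": "TZ",
    "mzalendo": "TZ",
    "nipashe": "TZ",
    "rai": "TZ",
    "uhuru": "TZ",
}


def articles_dir(source):
    """Return the directory of a news source in the articles/ tree."""
    return HCS_ROOT / "articles" / source


def teksti_dir(source):
    """Return the directory of a news source in the teksti/swa/ tree."""
    return HCS_ROOT / "teksti" / "swa" / COUNTRY_DIR[source] / source


if __name__ == "__main__":
    for name in sorted(COUNTRY_DIR):
        print(articles_dir(name), teksti_dir(name))
```

PurePosixPath is used so that the paths can be composed on any machine without touching the file system; on the server itself, pathlib.Path could be substituted to list the actual files.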

Access Rights and Conditions

HCS: Conditions of Use

General Conditions Of Use for the CSC Services
  1. The user account on the corpus.csc.fi server is valid for a maximum period of two (2) years at a time, starting from the day of admission. CSC will notify the user well before the expiration date of the user account. User information will be updated in connection with the renewal of the license. Unless otherwise agreed, unused accounts and the corresponding files will be removed at the latest one year after the last use. The user account will expire immediately when a task or a study has been completed or the user has left the university or polytechnic.
  2. Each user account is personal. No user shall pass the password on to a third party.
  3. If there is reason to suspect that unauthorized people have used or tried to use the resources, CSC must be notified immediately. Storing sensitive or confidential information or sending it over the network should be negotiated in advance with CSC.
  4. The resources obtained must be used for the proposed task only.
  5. Due to licence conditions, foreign user accounts and the use of the resources from abroad should be negotiated separately.
  6. The user id must be secured by a password that is difficult to guess.
  7. Some software can be used by academic users only. Other users must settle the matter with software contact persons.
  8. CSC stores customer files for a maximum time of two years after the user account has expired.
  9. CSC takes back-up copies of the customers' files regularly. However, CSC declines any responsibility for files lost due to system failure.
  10. This agreement will be dissolved immediately if the licensee breaks the rules and regulations stipulated in this agreement.
  11. Neither contracting party is liable to compensate the other party for damage caused by force majeure that prevents the fulfillment of this agreement.

Additional conditions for the use of the Helsinki Corpus of Swahili

Usage of the Helsinki Corpus of Swahili is limited to authorized users only. Permission to use HCS for research purposes will be granted on the following conditions:

General

The right to use the Helsinki Corpus of Swahili (HCS) consists of permission to utilize the corpus texts

  • in accordance with Finnish and international copyright laws and regulations
  • solely for academic research purposes
  • for extracting linguistic features such as statistical frequency information, grammar rules and semantic descriptions
  • for extracting short excerpts

Publishing

HCS or any part of it may not be published in print or in electronic form. Corpus texts may be cited as needed in research reporting. When HCS is used in research, due reference to the corpus must be made, e.g.:

HCS 2004. Compilers: Institute for Asian and African Studies (University of Helsinki) and CSC – IT Center for Science.

Copying

HCS may be copied only to the extent required by research or teaching. Those who use copies must be informed about the conditions of use of the texts. Copies may not be redistributed or made available in any form, e.g. through the Internet. Copies must be destroyed after use.

Mistakes

The service providers are not responsible for mistakes found in the corpus or its annotation. All texts have been manually edited and typing errors in the original texts corrected. However, a small number of mistakes remain in the texts. Also, because the annotation was carried out without human intervention or checking, some mistakes inevitably remain in the annotation as well.

The Group of Unix Users Having Access to the Resource: swahili

References

Making Bibliographical Reference to the Material:

Corpus texts may be cited as needed in research reporting. When HCS is used in research, due reference to the corpus must be made, e.g.:

HCS 2004. Compilers: Institute for Asian and African Studies (University of Helsinki) and CSC – IT Center for Science.

Other References

Bugs

Known Bugs

All texts have been manually edited and typing errors in the original texts corrected. However, a small number of mistakes remain in the texts. Also, because the annotation was carried out without human intervention or checking, some mistakes inevitably remain in the annotation as well.

Field of science:
Language research
Available:
  • hippu
License:
A