CD-HIT at CSC

A program for creating sequence clusters and non-redundant sequence sets

Versions

CD-HIT version 4.5.4 and CD-HIT-454 are available on Hippu, Vuori and Murska.


Description


CD-HIT can be used for clustering large sequence sets or for removing identical or highly similar sequences from a sequence set. It is often used to produce a non-redundant sequence set as the starting point for further analysis of a large sequence set. CD-HIT recognizes the FASTA and FASTQ sequence formats.



CD-HIT programs


Program        Description
cdhit          Clustering and redundancy removal tool for protein sequences
cdhit-est      Clustering and redundancy removal tool for nucleic acid sequences (only for sequences that do not contain introns)
cdhit-2d       Tool to compare two protein sequence sets
cdhit-est-2d   Tool to compare two nucleic acid sequence sets
cdhit-454      Program to identify artificial duplicates in raw 454 sequencing reads


Using CD-HIT

The setup command for CD-HIT is:

module load cd-hit

After the setup command, the server recognizes CD-HIT commands.

You can list the command line options of a CD-HIT program with the -help option. For example:

cdhit -help

A simple analysis of a protein sequence set can be run, for example, with the command:

cdhit -i my_proteins.fasta -o reduced_set.fasta -c 0.95


The sample command above produces two result files:

1.  reduced_set.fasta contains the pruned sequence set. In this case, if two sequences are more than 95% identical, only the longer one is kept in the results.
2.  reduced_set.fasta.clstr contains information about the clustering of the sequences that are more similar to each other than the given threshold value (in this case 95%).
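
The two result files can be inspected with standard shell tools. The following is a minimal sketch, not part of CD-HIT itself: the function names count_fasta and clstr_summary are hypothetical helpers, and the code assumes that every FASTA record starts with a ">" line and that the .clstr file lists each cluster as a ">Cluster N" header line followed by one line per member sequence.

```shell
# Hypothetical helper: count the sequences in a FASTA file.
# Assumes every record starts with a header line beginning with ">".
count_fasta() {
    grep -c '^>' "$1"
}

# Hypothetical helper: print "cluster_id<TAB>member_count" for each cluster.
# Assumes the .clstr layout is a ">Cluster N" header line followed by
# one non-empty line per member sequence.
clstr_summary() {
    awk '
        /^>Cluster/ { if (seen) print id "\t" n; id = $2; n = 0; seen = 1; next }
        NF          { n++ }
        END         { if (seen) print id "\t" n }
    ' "$1"
}

# Example usage after running the sample cdhit command:
#   count_fasta my_proteins.fasta
#   count_fasta reduced_set.fasta
#   clstr_summary reduced_set.fasta.clstr
```

Comparing the two count_fasta results shows how many sequences were removed. In the clstr_summary output, clusters with more than one member are the ones where redundant sequences were collapsed into a single representative.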


More information


Mattila Kimmo Kimmo.Mattila at csc.fi
Saren Ari-Matti Ari-Matti.Saren at csc.fi