Versions
CD-HIT version 4.0 and CD-HIT-454 are available in Hippu, Vuori and Murska.
Description
CD-HIT can be used for clustering large sequence sets or for removing identical or highly similar sequences from a sequence set. It is often used to produce a non-redundant sequence set for further analysis of a large sequence set.
CD-HIT programs
| Program | Description |
|---|---|
| cdhit | Clustering and redundancy removal tool for protein sequences |
| cdhit-est | Clustering and redundancy removal tool for nucleic acid sequences (only for sequences that do not contain introns) |
| cdhit-2d | Tool to compare two protein sequence sets |
| cdhit-est-2d | Tool to compare two nucleic acid sequence sets |
| cdhit-454 | A program to identify artificial duplicates from raw 454 sequencing reads |
Using CD-HIT
The setup command for CD-HIT is:
module load cd-hit
After the setup command, the server recognizes CD-HIT commands.
You can list the command line options of the CD-HIT programs with the option -help. For example:
cdhit -help
A simple analysis of a protein sequence set can be done, for example, with the command:
cdhit -i my_proteins.fasta -o reduced_set.fasta -c 0.95
The sample command above produces two result files.
1. reduced_set.fasta contains a pruned sequence set. In this case, if two sequences are more than 95% identical, only the longer one is included in the results.
2. reduced_set.fasta.clstr contains information about the clustering of the sequences that share higher similarity than the given threshold value (in this case 95%).
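To get a quick idea of how much the sequence set was reduced, the two result files can be summarised with standard shell tools. The following is only a minimal sketch: the helper names count_fasta and count_clusters are hypothetical, and the cluster count assumes that each cluster in the .clstr file begins with a line starting with ">Cluster".

```shell
# Count FASTA records by counting header lines (lines starting with ">").
count_fasta() {
    grep -c '^>' "$1"
}

# Count clusters in a .clstr file.
# Assumption: every cluster starts with a line beginning with ">Cluster".
count_clusters() {
    grep -c '^>Cluster' "$1"
}

# Print a summary only if the files of the sample command are present.
if [ -f my_proteins.fasta ] && [ -f reduced_set.fasta ] && [ -f reduced_set.fasta.clstr ]; then
    echo "input sequences:  $(count_fasta my_proteins.fasta)"
    echo "output sequences: $(count_fasta reduced_set.fasta)"
    echo "clusters:         $(count_clusters reduced_set.fasta.clstr)"
fi
```

If the assumption about the file layout holds, the number of clusters should match the number of sequences in reduced_set.fasta, as one representative sequence is kept for each cluster.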
More information
| Mattila Kimmo | Kimmo.Mattila at csc.fi |
| Saren Ari-Matti | Ari-Matti.Saren at csc.fi |