Bioinformatics with large data-sets - Training
Date: 14.02.2012 9:00 - 24.06.2015 17:00
Lecturers: Kimmo Mattila
Modern research methods in biosciences can rapidly produce very large data-sets. Although the analysis methods themselves typically do not change, handling large data-sets gives rise to new challenges and scalability issues. For example, while it is quite feasible to run a BLAST analysis on tens of thousands of sequences, it is not feasible to do so by cutting and pasting them one by one into a web form.
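As an illustration of what "not via a web form" means in practice, a large multi-FASTA file can be split into fixed-size chunks and each chunk searched with command-line BLAST. The helper function below is a sketch, not a CSC-provided tool, and the blastn invocation is illustrative:

```shell
# split_fasta: split a multi-FASTA file into chunks of N sequences each,
# so that each chunk can be run as its own command-line BLAST job.
# (Illustrative helper -- not part of the course material.)
split_fasta () {
    local input=$1 size=$2
    awk -v size="$size" '
        /^>/ && n % size == 0 { f = sprintf("chunk_%04d.fasta", n / size) }
        /^>/                  { n++ }
                              { print > f }
    ' "$input"
}

# Each chunk could then be searched with command-line BLAST, e.g.:
# blastn -query chunk_0000.fasta -db nt -out chunk_0000.out
```

Splitting the input this way is also what makes array jobs (discussed later in the course) possible, since each chunk becomes an independent unit of work.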
The CSC computing resources are well suited to performing such large-scale analyses. There can, however, be huge differences in the time and resources required, depending on how the analysis is set up.
This course offers an introduction to how best to utilize the CSC computing environment to move, store, and analyze large data-sets. The examples used in the course are mostly from sequence analysis, but the methods presented are easily adapted to any kind of computational analysis.
The course is intended as an introduction to the subject. No previous familiarity with large-scale computation is required, but basic familiarity with command-line systems (such as the CSC application servers) is helpful.
Topics covered in this course include:
- How to move large data-sets between CSC and your own system
- How and where to store the data at CSC
- Using the IDA service to store and share your data
- How to best utilize the CSC computing resources for your data
  - Which machine is best suited to your job
  - What are your options for running your job
  - How to run batch jobs and array jobs
  - How to optimize your batch script
- How to use grid resources (grid sample case: mapping NGS data with BWA)
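On moving large data-sets (the first topic above), packing many small files into a single compressed archive and checksumming it is usually far faster and safer than copying files one by one. A minimal sketch, with placeholder directory and file names:

```shell
# Pack a directory of many small files into one compressed archive before
# transfer: a single large file moves much faster over the network than
# thousands of small ones, and a checksum lets you verify the copy on the
# receiving end. ("mydata" is a placeholder for your real data directory.)
mkdir -p mydata                 # stands in for an existing data directory
tar czf mydata.tar.gz mydata/
md5sum mydata.tar.gz > mydata.tar.gz.md5

# After transferring both files (e.g. with scp or rsync), verify at the
# receiving end with:
# md5sum -c mydata.tar.gz.md5
```

The checksum step matters most for multi-gigabyte transfers, where a silently truncated copy is easy to miss.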
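As a sketch of what a batch array job might look like, the script below targets a SLURM-style scheduler and assumes a data set pre-split into numbered files (chunk_0000.fasta, chunk_0001.fasta, ...). The scheduler, resource limits, and the blastn database are assumptions; check the documentation of the actual CSC system before use. The script is written to a file here so it can be inspected before submission:

```shell
# Sketch of a SLURM-style array job: 100 tasks, each processing one
# numbered chunk of a pre-split data set. The scheduler, resource limits
# and the blastn invocation are assumptions -- adapt them to the system
# you actually use.
cat > blast_array.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=blast_array
#SBATCH --array=0-99          # one task per chunk file
#SBATCH --time=02:00:00
#SBATCH --mem=4G

# Each array task picks its own input chunk based on its task id.
CHUNK=$(printf 'chunk_%04d.fasta' "$SLURM_ARRAY_TASK_ID")
blastn -query "$CHUNK" -db nt -out "$CHUNK.out"
EOF

# The job would then be submitted with: sbatch blast_array.sh
```

The point of the array construct is that one submission produces many independent tasks, which is exactly the shape of a chunked BLAST analysis.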
9.00 Registration and coffee
9.15 Course starts
14.00 Coffee
17.00 Course ends