Platforms and Technology

Computing Resources:

The Frazer lab has a high-performance compute cluster consisting of 17 Supermicro compute nodes. Each node has two Intel E5-2640 CPUs (32 logical cores) and a high-speed local scratch disk. Fourteen of the nodes have 128 GB of RAM, two have 256 GB, and one has 512 GB. Two login nodes provide interactive sessions and have 256 GB of RAM and 64 virtual cores. These systems also serve Internet-based applications such as IPython notebooks and are connected at 10 Gbit/sec to the high-speed SDSC network to facilitate rapid download of common genomic data sets such as CGHub and the GATK Bundle. The cluster runs the widely used Grid Engine scheduler (SGE), configured to make job submission easy for users while enforcing reasonable resource controls to prevent conflicts. All 544 cores on the 17 compute nodes are controlled by the scheduler. A legacy collection of systems and our previous cluster provide additional compute options, along with a set of departmental filesystems that provide backups and shared project storage with the desktops. These older nodes provide an additional 544 cores for lower-priority efforts and have access to the same data and networks. The entire cluster environment is connected via high-speed 10 Gbit/sec Ethernet to an Intel Enterprise Lustre parallel filesystem, which is currently about half a petabyte in size (524 TB). Importantly, this filesystem can scale as new storage requirements arise.
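As a quick consistency check on the figures above, the following minimal sketch tallies the compute nodes and their aggregate memory, and derives the per-node core count from the stated scheduler total of 544 cores. The variable names are illustrative only, and the aggregate RAM figure is computed here rather than quoted from the lab.

```python
# Node groups as stated in the text: (number of nodes, RAM per node in GB).
ram_groups = [(14, 128), (2, 256), (1, 512)]

# Total compute nodes and aggregate RAM across the cluster.
total_nodes = sum(count for count, _ in ram_groups)
total_ram_gb = sum(count * ram for count, ram in ram_groups)

# The text states that 544 cores are under scheduler control.
scheduled_cores = 544
cores_per_node = scheduled_cores // total_nodes

print(f"compute nodes:          {total_nodes}")
print(f"logical cores per node: {cores_per_node}")
print(f"aggregate RAM (GB):     {total_ram_gb}")
```

Under these figures, the 17 nodes work out to 32 logical cores each and roughly 2.8 TB of aggregate memory.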

This computing, storage, and archiving equipment is housed in the colocation facilities of the San Diego Supercomputer Center. It enables the lab to manage its data sets and run all next-generation sequencing tools, such as read aligners (BWA, MAQ, GSNAP), post-alignment tools for SNP calling or indel mapping (MAQ, SAMTools, GATK, BEDTools), somatic mutation detection (MuTect, Strelka), SNP annotation (SnpEff, VEP, Gemini), RNA isoform identification (TopHat, Cufflinks, STAR, ExPress, etc.), and ChIP-seq and ATAC-seq peak identification and analysis (Homer), as well as custom tools for quality control and other data-processing needs. We have an integrated next-generation sequence analysis pipeline with version tracking, which ensures reproducibility, data tracking, and timely quality control. A Laboratory Information Management System and sequencing tracking systems are implemented on this infrastructure, leveraging MySQL, Postgres, and SQL Server; application software includes JBoss, Apache Tomcat, and Apache Httpd, as well as more specialized development tools.