De Novo Short Read Assembly

Assembler can assemble short read data of the type generated by next generation sequencers. It uses Phrap to create de novo assemblies and Bowtie to create reference assemblies. This page covers de novo assemblies.

Next generation sequencers produce far more data than traditional Sanger sequencers. Assembler can generate de novo assemblies of up to 1.5 million short reads using Phrap. However, such assemblies require very large amounts of RAM, more than is commonly found on even very recent desktop and laptop Macs. The size of assembly that you can realistically perform on an average iMac or MacBook Pro is therefore far smaller.

The following list shows the approximate maximum number of reads that can be assembled with varying amounts of RAM. These are not absolute limits, and if you have many other applications open, your assembly will take far longer. Assemblies larger than these limits can also be attempted, but their run time will be orders of magnitude longer and will likely be days rather than hours. Please also note that on lower end Macs (especially 32 bit G4 models) excessively large jobs may silently fail within a few minutes of starting.

- Less than 1GB of RAM = 50,000 reads.

- Over 1GB = 100,000 reads.

- Over 2GB = 200,000 reads.

- Over 4GB = 500,000 reads.

Again, please note that these limits are for Macs with no running applications other than the OS.
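The limits above can be expressed as a simple lookup. The following is only a sketch of the list, not part of Assembler: the function names `approx_max_reads` and `fits_in_ram` are hypothetical, and the handling of a Mac with exactly 1, 2 or 4 GB (treated here as the lower tier) is an assumption, since the list only says "over".

```python
# Approximate read limits from the list above: (RAM threshold in GB, max reads).
# Assumption: a tier applies only when installed RAM is strictly over its threshold.
READ_LIMITS = [(4, 500_000), (2, 200_000), (1, 100_000)]
DEFAULT_LIMIT = 50_000  # the lowest tier in the list


def approx_max_reads(ram_gb: float) -> int:
    """Return the approximate maximum number of reads for a given amount of RAM."""
    for threshold_gb, max_reads in READ_LIMITS:
        if ram_gb > threshold_gb:
            return max_reads
    return DEFAULT_LIMIT


def fits_in_ram(num_reads: int, ram_gb: float) -> bool:
    """True if the job is within the approximate limit for this amount of RAM."""
    return num_reads <= approx_max_reads(ram_gb)


if __name__ == "__main__":
    for ram in (0.5, 1.8, 3, 8):
        print(f"{ram} GB RAM: about {approx_max_reads(ram):,} reads")
```

A job that is over the limit may still run, as noted above, but it is likely to take very much longer.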

Phrap is also very CPU intensive, so Assembler submits jobs with reduced priority. This allows you to continue using your Mac and MacVector for normal work, as Phrap will take less of the CPU. However, if you run any other CPU intensive jobs, the assembly will take much longer.
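Assembler handles job priority for you, so nothing needs to be configured. For readers curious about what reduced priority means on macOS and other Unix-like systems, here is a minimal sketch using the standard `nice` utility. The helper `run_with_reduced_priority` is hypothetical and the command it runs is a placeholder. This is not necessarily how Assembler launches Phrap.

```python
import subprocess
import sys


def run_with_reduced_priority(cmd, niceness=10):
    """Run cmd through the Unix `nice` utility so it yields CPU to other work.

    A higher niceness value means a lower scheduling priority.
    """
    niced_cmd = ["nice", "-n", str(niceness)] + list(cmd)
    return subprocess.run(niced_cmd, capture_output=True, text=True, check=True)


if __name__ == "__main__":
    # Placeholder job: a child Python process that reports its own niceness.
    result = run_with_reduced_priority(
        [sys.executable, "-c", "import os; print(os.nice(0))"]
    )
    print("child niceness:", result.stdout.strip())
```

A reduced-priority job still uses all the CPU time that is otherwise idle. It only gives way when other processes want the CPU, which is why other CPU intensive jobs slow the assembly down.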

Here are some examples of the time taken to assemble short read data.

A sample of 262,000 reads with an average length of 200bp. This has the accession number SRR015579, taken from the NCBI's Short Read Archive.

Assembler produced 408 contigs with an average length of 16,396bp.

- Less than 30 minutes on a Mac Pro with two dual core CPUs and 8GB of RAM.

- 6 hours on a MacBook Pro with 3GB of RAM.

- 8 hours on a dual G4 MDD with 1.8GB of RAM.

- More than 8 days on an iMac with 1GB of RAM.

A sample of 233,000 reads with an average length of 200bp. This has the accession number SRR015575, taken from the NCBI's Short Read Archive.

Assembler produced 48 contigs with an average length of 100,396bp.

- Less than 30 minutes on a Mac Pro with two dual core CPUs and 8GB of RAM.

- 12 hours on a MacBook Pro with 3GB of RAM.

- Fails to start on a dual G4 MDD with 1.8GB of RAM.
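The contig figures quoted for the two samples above can be turned into a rough total assembled length and read depth. This is a back-of-the-envelope sketch only: the function names are hypothetical, and the depth figure assumes that every read ended up in a contig, which is unlikely to be exactly true.

```python
def total_contig_length(num_contigs: int, avg_contig_len_bp: int) -> int:
    """Total assembled length implied by a contig count and average length."""
    return num_contigs * avg_contig_len_bp


def approx_depth(num_reads: int, avg_read_len_bp: int, assembled_bp: int) -> float:
    """Rough read depth, assuming all reads were placed in the assembly."""
    return num_reads * avg_read_len_bp / assembled_bp


if __name__ == "__main__":
    # (label, reads, average read length, contigs, average contig length)
    examples = [
        ("SRR015579", 262_000, 200, 408, 16_396),
        ("SRR015575", 233_000, 200, 48, 100_396),
    ]
    for label, reads, read_len, contigs, contig_len in examples:
        assembled = total_contig_length(contigs, contig_len)
        depth = approx_depth(reads, read_len, assembled)
        print(f"{label}: {assembled:,} bp assembled, about {depth:.1f}x depth")
```

On these figures the first sample assembles to about 6.7 million bp at roughly 7.8x depth, and the second to about 4.8 million bp at roughly 9.7x depth.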

A sample of 200,000 reads with an average length of 44bp. This is a subset of ERR00092.

Assembler supports short read data in the Fastq format. This is a widely accepted format that contains only the basecalled sequence and quality data for each read.
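To check how many reads a Fastq file contains before assembling it, the file can be scanned with a few lines of code. The sketch below assumes the common four-line record layout (an `@` header, the sequence, a `+` separator, and a quality string of the same length as the sequence). Files that wrap sequences over several lines are not handled. The names `read_fastq` and `summarize` are hypothetical and not part of Assembler.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line Fastq records."""
    it = iter(lines)
    while True:
        header = next(it, None)
        if header is None:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").rstrip("\r\n")
        plus = next(it, "").rstrip("\r\n")
        qual = next(it, "").rstrip("\r\n")
        if (not header.startswith("@") or not plus.startswith("+")
                or len(seq) != len(qual)):
            raise ValueError(f"malformed Fastq record near {header!r}")
        yield header[1:], seq, qual


def summarize(lines: Iterable[str]) -> Tuple[int, float]:
    """Return (number of reads, average read length in bases)."""
    count = 0
    total_bases = 0
    for _name, seq, _qual in read_fastq(lines):
        count += 1
        total_bases += len(seq)
    return count, (total_bases / count if count else 0.0)


if __name__ == "__main__":
    sample = "@read1\nACGT\n+\nIIII\n@read2\nACGTAC\n+\nIIIIII\n"
    reads, avg_len = summarize(sample.splitlines())
    print(f"{reads} reads, average length {avg_len:.1f} bases")
```

For a real file, pass an open file object instead of the sample string, for example `summarize(open("reads.fastq"))`, where `reads.fastq` is a placeholder file name.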

Related Topics.

Assembler

Importing Fastq data

Assembling sequences

Base calling

Bowtie

Assembling