New to MacVector 18.6 is the ability to sort and assemble reads from different datasets into individual sub-projects. This functionality is located in the phrap parameters dialog. When enabled and configured appropriately for your dataset, it automatically splits the input reads into sub-projects that are assembled separately.
A simple pattern-matching text box lets you define which characters in the input filenames should be treated as project names and which should be treated as read names. After assembly, contigs can be exported to a variety of file formats, including FASTA and FASTQ, with the project name retained in the contig names.
This function can be a great time saver if you run a lot of small, related sequencing projects, as long as you use a well-defined naming convention.
The reads in your datasets must follow a defined naming standard. You need to construct a pattern that identifies which part of each filename is the project name and which part is the read name, using a small set of pattern characters. As an aid to constructing a pattern, the sub-project name is updated dynamically as you type in the dialog, showing what the sub-projects will be named. These characters are:
This is best demonstrated with an example. Here we have a sequencing dataset called BASENAME. Each sequenced sample was numbered from 1000 to 1100. Typical read names are:
BASENAME-1001g07_0x00.s01_1.scf
BASENAME-1001g07_0x00.s02_1.scf
BASENAME-1003g07_1a03.m22_2.scf
BASENAME-1003g07_1b06.m23_1.scf
BASENAME-1005g07_2c07.s01_1.scf
BASENAME-1005g07_0x00.s01_1.scf
Your pattern for this could be:
PPPPPPPP-PPPPxxxx
We can break this down as follows for the first read name:
BASENAME-1001g07_0x00.s01_1.scf
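To make the grouping idea concrete, here is a minimal Python sketch of how a positional pattern like the one above could sort filenames into sub-projects. It is an illustration only, not MacVector's implementation. It assumes that P marks a project-name character, that x marks a character to skip, and that any other pattern character (here the hyphen) is a literal that must match and is kept in the name. The function names project_name and group_reads are hypothetical.

```python
from collections import defaultdict

# Example read filenames and pattern taken from the text above.
READS = [
    "BASENAME-1001g07_0x00.s01_1.scf",
    "BASENAME-1001g07_0x00.s02_1.scf",
    "BASENAME-1003g07_1a03.m22_2.scf",
    "BASENAME-1003g07_1b06.m23_1.scf",
    "BASENAME-1005g07_2c07.s01_1.scf",
    "BASENAME-1005g07_0x00.s01_1.scf",
]
PATTERN = "PPPPPPPP-PPPPxxxx"


def project_name(filename, pattern):
    """Extract a project name by comparing filename and pattern position by position.

    Assumptions (not confirmed by the source): 'P' keeps the character,
    'x' skips it, and any other pattern character is a literal that must
    match the filename and is kept in the project name.
    """
    if len(filename) < len(pattern):
        raise ValueError(f"{filename!r} is shorter than the pattern")
    kept = []
    for char, marker in zip(filename, pattern):
        if marker == "P":
            kept.append(char)
        elif marker == "x":
            continue
        elif char == marker:
            kept.append(char)
        else:
            raise ValueError(f"{filename!r} does not match literal {marker!r}")
    return "".join(kept)


def group_reads(filenames, pattern):
    """Group filenames into sub-projects keyed by their extracted project name."""
    groups = defaultdict(list)
    for name in filenames:
        groups[project_name(name, pattern)].append(name)
    return dict(groups)


if __name__ == "__main__":
    for project, members in group_reads(READS, PATTERN).items():
        print(project, len(members))
```

Under these assumptions, the six example reads fall into three groups of two reads each, one per sample number.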
The above set of reads would produce the following three sub-assemblies: