seqsplit

Split sequences into separate files based on barcodes

Description

seqsplit(fastqFile,barcodeFile) splits the sequences in fastqFile according to the barcodes in barcodeFile and saves them in separate files. By default, each output file name consists of the input file name followed by the output suffix ('_split') and the barcode identifier. Sequences that match none of the provided barcodes, or that match multiple barcodes ambiguously, are saved in a file whose name ends in '_unmatched' instead of the barcode identifier.

seqsplit(___,Name,Value) uses additional options specified by one or more Name,Value pair arguments.

[outFiles,N] = seqsplit(___) returns the names of the output files in a cell array outFiles and the number of sequences saved in each output file in a vector N.

Examples

Create a tab-delimited file with barcode IDs and barcode sequences.

 barcodeInfo = {'ID1', 'AAAAC'; 'ID2', 'AGATT'; 'ID3', 'GACTT'};
 writetable(cell2table(barcodeInfo), 'barcodeExample.txt', ...
        'Delimiter', '\t', 'WriteVariableNames', false);

Split sequences into separate output files based on the barcode sequences. By default, the function assumes that the barcode is located at the 5' end of each sequence, and no mismatches are allowed during barcode matching.

[outFiles, N] = seqsplit('SRR005164_1_50.fastq', 'barcodeExample.txt');

Check the number of sequences in each output file after splitting.

N
N = 3×1

     2
     1
     1

Allow up to two mismatches during the barcode matching.

[outFiles, N] = seqsplit('SRR005164_1_50.fastq', 'barcodeExample.txt', ...
        'MaxMismatches',2,'OutputSuffix','_MM2_split');
N
N = 3×1

     5
     9
     5

Input Arguments

fastqFile: Names of FASTQ-formatted files with sequence and quality information, specified as a character vector, string, string vector, or cell array of character vectors.

Example: 'SRR005164_1_50.fastq'

barcodeFile: Name of the file with barcode information, specified as a character vector or string. The file must be tab-delimited, with one barcode per line: a barcode ID followed by its barcode sequence. All barcode sequences must have the same length.
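The format requirements above can be checked before calling seqsplit. The following is a minimal sketch that uses only base MATLAB functions; the file name barcodeCheck.txt and the variable names are hypothetical, and the check is an illustration rather than the validation that seqsplit itself performs.

```matlab
% Write a small tab-delimited barcode file (same layout as in the example).
barcodeInfo = {'ID1', 'AAAAC'; 'ID2', 'AGATT'; 'ID3', 'GACTT'};
barcodeFile = fullfile(tempdir, 'barcodeCheck.txt');   % hypothetical file name
writetable(cell2table(barcodeInfo), barcodeFile, ...
    'Delimiter', '\t', 'WriteVariableNames', false);

% Read the file back as text: column 1 holds IDs, column 2 holds barcodes.
T = readtable(barcodeFile, 'FileType', 'text', 'Delimiter', '\t', ...
    'ReadVariableNames', false);
ids      = T{:, 1};
barcodes = T{:, 2};

% Every ID must be followed by a barcode, and all barcodes must share one length.
noneEmpty      = ~any(cellfun(@isempty, barcodes));
barcodeLengths = cellfun(@length, barcodes);
sameLength     = all(barcodeLengths == barcodeLengths(1));
```

If noneEmpty or sameLength is false, the file does not meet the requirements described above and should be corrected first.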

Example: 'barcodeExample.txt'

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'MaxMismatches',2 specifies to allow up to 2 mismatches during barcode matching.

Maximum number of mismatches allowed during barcode matching, specified as a nonnegative integer. The default is 0, that is, no mismatches are allowed.
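To illustrate what a mismatch limit means, the sketch below counts position-by-position differences between the 5' end of a read and each barcode, then accepts the read only if exactly one barcode is within the limit. The read, the variable names, and the rule for ambiguous hits are assumptions made for illustration; this is not the internal implementation of seqsplit.

```matlab
barcodes      = {'AAAAC', 'AGATT', 'GACTT'};  % barcode sequences from the example file
read          = 'AGTTTCCGGA';                 % hypothetical read, barcode at the 5' end
maxMismatches = 1;                            % plays the role of 'MaxMismatches'

k      = length(barcodes{1});                 % all barcodes share one length
prefix = read(1:k);                           % 5' end of the read

% Count the positions where the read prefix differs from each barcode.
numMismatches = cellfun(@(b) sum(prefix ~= b), barcodes);

% Assumption: a read is assigned only if exactly one barcode is within the limit.
hits = find(numMismatches <= maxMismatches);
if isscalar(hits)
    assigned = hits;    % index of the matching barcode
else
    assigned = 0;       % no match, or more than one candidate
end
```

With maxMismatches set to 0, the same read would be left unassigned, because its prefix differs from every barcode in at least one position.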

Type of barcode to match, specified as 3 or 5. A value of 5 means that the barcode is located at the 5' end of each sequence, and 3 means that it is located at the 3' end. The default is 5.

Example:

Whether to remove the barcode and corresponding quality information from the matched sequences, specified as true or false. The default is true.
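As a rough picture of what barcode removal involves, the sketch below trims the barcode and the matching quality characters from one read, at either end. The read, the quality string, and the variable names are hypothetical; the code only illustrates the idea and is not taken from seqsplit.

```matlab
seq        = 'AGATTCCGGATTAC';   % hypothetical read that starts with barcode AGATT
qual       = 'IIIIIHHHHGGGFF';   % hypothetical quality string, one character per base
k          = 5;                  % barcode length
barcodeEnd = 5;                  % 5 for the 5' end, 3 for the 3' end

% Select the positions that remain after the barcode is removed.
if barcodeEnd == 5
    keep = (k + 1):length(seq);      % drop the first k bases
else
    keep = 1:(length(seq) - k);      % drop the last k bases
end

% Apply the same positions to the sequence and to its quality string.
trimmedSeq  = seq(keep);
trimmedQual = qual(keep);
```

Trimming both strings with the same index vector keeps the sequence and its quality information aligned.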

Whether to save unmatched sequences and corresponding quality information in a separate output file, specified as true or false. The output file name has the suffix '_unmatched' instead of the barcode ID.

Relative or absolute path to the output file directory, specified as a character vector or string. The default is the current directory.

Example: 'OutputDir','F:\results'

Suffix to use in the output file name, specified as a character vector or string. It is inserted after the input file name and before the barcode ID. The default is '_split'.

Option to perform computations in parallel using a parallel pool of workers, specified as one of these values:

  • "off" — Run in serial on the MATLAB® client.

  • "auto" — Use a parallel pool if one is open or if MATLAB can automatically create one. If a parallel pool is not available, run in serial on the MATLAB client.

  • "on" — Use a parallel pool if one is open or if MATLAB can automatically create one. If a parallel pool is not available, throw an error.

If you do not have a parallel pool open and automatic pool creation is enabled, MATLAB opens a pool using the default cluster profile. To use a parallel pool to run computations in MATLAB, you must have Parallel Computing Toolbox™.

Before R2026a: You can specify this argument as true or false only. The default value is false. To run computations in parallel, set this argument to true.

Note

There is a cost associated with sharing large input files across workers in a distributed environment. In some cases, running in parallel may not be beneficial in terms of performance.

Output Arguments

outFiles: Output file names, returned as a cell array of character vectors. By default, the name of each output file consists of the input file name followed by the output suffix ('_split') and the barcode identifier.

N: Number of sequences saved in each output file, returned as a scalar or an n-by-1 vector, where n is the number of output files. If there are multiple output files, the order of the elements in N corresponds to the order of the files in outFiles.
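One convenient way to inspect both outputs together is to place them side by side in a table. The sketch below assumes that Bioinformatics Toolbox is installed; the scratch folder, the file names, and the three made-up reads are hypothetical and exist only to make the example self-contained.

```matlab
% Work in a scratch folder so that the current directory is left untouched.
workDir = tempname;                      % hypothetical scratch location
mkdir(workDir);

% Tab-delimited barcode file: barcode ID, then barcode sequence.
barcodeFile = fullfile(workDir, 'barcodes.txt');
barcodeInfo = {'ID1', 'AAAAC'; 'ID2', 'AGATT'};
writetable(cell2table(barcodeInfo), barcodeFile, ...
    'Delimiter', '\t', 'WriteVariableNames', false);

% Tiny made-up FASTQ file with three reads of 12 bases each.
fastqFile = fullfile(workDir, 'reads.fastq');
fid = fopen(fastqFile, 'w');
fprintf(fid, '@read1\nAAAACGGTTCCA\n+\nIIIIIIIIIIII\n');
fprintf(fid, '@read2\nAGATTCCGGATT\n+\nIIIIIIIIIIII\n');
fprintf(fid, '@read3\nTTTTTGGGCCCA\n+\nIIIIIIIIIIII\n');
fclose(fid);

% Split the reads and write the output files to the scratch folder.
[outFiles, N] = seqsplit(fastqFile, barcodeFile, 'OutputDir', workDir);

% Pair each output file name with its sequence count.
summary = table(outFiles(:), N(:), 'VariableNames', {'File', 'NumSequences'});
disp(summary)
```

The table relies on N having one element per entry of outFiles, as described above.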

Version History

Introduced in R2016b
