seqsplit
Split sequences into separate files based on barcodes
Description
seqsplit(fastqFile,barcodeFile) splits sequences in fastqFile according to the
barcodes in barcodeFile and saves the sequences in separate
files. By default, the output file name consists of the input file
name followed by the barcode identifier. Sequences that do not match
any provided barcodes, or that match multiple barcodes ambiguously,
are saved in a file with the suffix '_unmatched' instead
of the barcode identifier.
seqsplit(___,Name,Value) uses
additional options specified by one or more Name,Value pair
arguments.
Examples
Create a tab-delimited file with barcode IDs and barcode sequences.
barcodeInfo = {'ID1', 'AAAAC'; 'ID2', 'AGATT'; 'ID3', 'GACTT'};
writetable(cell2table(barcodeInfo), 'barcodeExample.txt', ...
    'Delimiter', '\t', 'WriteVariableNames', false);
Split sequences into separate output files based on the barcode sequences. By default, the function assumes that the barcode is located at the 5' end of each sequence, and no mismatches are allowed during barcode matching.
[outFiles, N] = seqsplit('SRR005164_1_50.fastq', 'barcodeExample.txt');
Check the number of sequences in each output file after splitting.
N
N = 3×1
2
1
1
Allow up to two mismatches during the barcode matching.
[outFiles, N] = seqsplit('SRR005164_1_50.fastq', 'barcodeExample.txt', ...
    'MaxMismatches', 2, 'OutputSuffix', '_MM2_split');
N
N = 3×1
5
9
5
Input Arguments
fastqFile: Names of FASTQ-formatted files with sequence and quality information, specified as a character vector, string, string vector, or cell array of character vectors.
Example: 'SRR005164_1_50.fastq'
barcodeFile: Name of the barcode file, specified as a character vector or string. The file must be tab-delimited and contain barcode IDs and barcode sequences. Each ID must be followed by a barcode sequence, and all barcode sequences must have the same length.
Example: 'barcodeExample.txt'
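The barcode file requirements above can be checked before splitting. The following is a minimal sketch, not part of seqsplit itself: it writes the same three barcodes used in the example on this page, reads the file back with readtable, and checks that all barcode sequences have the same length. The variable names (ids, seqs, sameLength) are illustrative.

```matlab
% Write a small tab-delimited barcode file (same contents as the example above).
barcodeInfo = {'ID1', 'AAAAC'; 'ID2', 'AGATT'; 'ID3', 'GACTT'};
writetable(cell2table(barcodeInfo), 'barcodeExample.txt', ...
    'Delimiter', '\t', 'WriteVariableNames', false);

% Read it back as two text columns: barcode ID and barcode sequence.
barcodes = readtable('barcodeExample.txt', 'FileType', 'text', ...
    'Delimiter', '\t', 'ReadVariableNames', false);
ids  = string(barcodes{:, 1});
seqs = string(barcodes{:, 2});

% All barcode sequences must have the same length.
barcodeLengths = strlength(seqs);
sameLength = all(barcodeLengths == barcodeLengths(1));
```

If sameLength is false, the file does not meet the stated requirement and should be corrected before it is passed to seqsplit.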
Name-Value Arguments
Specify optional pairs of arguments as
Name1=Value1,...,NameN=ValueN, where Name is
the argument name and Value is the corresponding value.
Name-value arguments must appear after other arguments, but the order of the
pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose
Name in quotes.
Example: 'MaxMismatches',2 specifies to allow
up to 2 mismatches during barcode matching.
MaxMismatches: Maximum number of mismatches allowed during barcode matching, specified as a nonnegative integer. The default is 0, meaning that no mismatches are allowed.
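To make the idea of a mismatch threshold concrete, here is a conceptual sketch in base MATLAB. It assumes that mismatches are counted as position-by-position differences between the barcode and the first bases of the read (a 5' barcode); this page does not describe how seqsplit counts mismatches internally, so treat this as an illustration only. The read and barcode below are made up.

```matlab
% Hypothetical read and barcode, for illustration only.
read    = 'AGATTCCGTAGGCTA';
barcode = 'AGTTT';
maxMismatches = 2;

% Compare the barcode with the first bases of the read (5' end)
% and count the positions where they differ.
prefix = read(1:numel(barcode));
numMismatches = sum(prefix ~= barcode);

% The read is assigned to this barcode only if the count is within the limit.
isMatch = numMismatches <= maxMismatches;
```

With the default limit of 0, this read would not match the barcode, because the two differ at one position.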
Type of barcode to match, specified as 3 or 5. A value of 5 corresponds to the barcode located at the 5' end of each sequence, and 3 corresponds to the 3' end.
Whether to remove the barcode and corresponding quality information from the matched sequences, specified as true or false. The default is true.
Whether to save unmatched sequences and corresponding quality information in a separate output file, specified as true or false. The output file name has the suffix '_unmatched' instead of the barcode ID.
OutputDir: Relative or absolute path to the output file directory, specified as a character vector or string. The default is the current directory.
Example: 'OutputDir','F:\results'
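As a sketch of how the output directory option might be used, the following code writes the split files to a subfolder of the current directory. The folder name splitResults is hypothetical, and creating the folder beforehand is a precaution rather than a documented requirement. The input files are the ones used in the example on this page.

```matlab
% Hypothetical output folder under the current directory.
outDir = fullfile(pwd, 'splitResults');
if ~isfolder(outDir)
    mkdir(outDir);   % create the folder up front (may not be required)
end

% Split the example FASTQ file and write the results to outDir.
[outFiles, N] = seqsplit('SRR005164_1_50.fastq', 'barcodeExample.txt', ...
    'OutputDir', outDir);
```

Building the path with fullfile keeps the code portable across operating systems. Note that isfolder requires R2017b or later.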
OutputSuffix: Suffix to use in the output file name, specified as a character vector or string. The suffix is inserted after the input file name and before the barcode ID. The default is '_split'.
UseParallel: Option to perform computations in parallel using a parallel pool of workers, specified as one of these values:
- "off": Run in serial on the MATLAB® client.
- "auto": Use a parallel pool if one is open or if MATLAB can automatically create one. If a parallel pool is not available, run in serial on the MATLAB client.
- "on": Use a parallel pool if one is open or if MATLAB can automatically create one. If a parallel pool is not available, throw an error.
If you do not have a parallel pool open and automatic pool creation is enabled, MATLAB opens a pool using the default cluster profile. To use a parallel pool to run computations in MATLAB, you must have Parallel Computing Toolbox™.
Before R2026a: You can specify this argument as
true or false only. The default
value is false. To run computations in parallel, set this
argument to true.
Note
There is a cost associated with sharing large input files across workers in a distributed environment. In some cases, running in parallel may not be beneficial in terms of performance.
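Because the accepted UseParallel values differ between releases, code that must run on several releases could select the value at run time. This is a minimal sketch, assuming that isMATLABReleaseOlderThan (available in R2020b and later) accepts the release name "R2026a" on the installed release. The variable name useParallel is illustrative.

```matlab
% Choose a UseParallel value that matches the installed release.
if isMATLABReleaseOlderThan("R2026a")
    useParallel = true;      % older releases accept only true or false
else
    useParallel = "auto";    % falls back to serial if no pool is available
end

[outFiles, N] = seqsplit('SRR005164_1_50.fastq', 'barcodeExample.txt', ...
    'UseParallel', useParallel);
```

For small inputs such as this 50-read example, the cost described in the note may outweigh any gain, so it is worth timing both the serial and parallel runs on representative data.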
Output Arguments
outFiles: Output file names, returned as a cell array of character vectors.
By default, the name of each output file consists of the input file
name followed by the output suffix ('_split') and
the barcode identifier.
N: Numbers of sequences saved in each output file, returned as
a scalar or an n-by-1 vector,
where n is the number of output files. If there
are multiple output files, the order within N corresponds
to the order of the output files.
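Since the order within N corresponds to the order of the output files, the two outputs can be combined into one table for inspection. This is a small sketch using the example files from this page; the variable names splitSummary and totalWritten are illustrative.

```matlab
[outFiles, N] = seqsplit('SRR005164_1_50.fastq', 'barcodeExample.txt');

% Pair each output file name with its sequence count.
splitSummary = table(string(outFiles(:)), N(:), ...
    'VariableNames', {'OutputFile', 'NumSequences'});
disp(splitSummary)

% Total number of sequences written across all output files.
totalWritten = sum(N);
```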
Extended Capabilities
seqsplit has automatic parallel support.
To run computations in parallel, set the UseParallel argument to
"on" or "auto".
Version History
Introduced in R2016b
Starting in R2026a, the UseParallel name-value argument accepts
"off", "auto", or "on"
instead of true or false. This change gives
you more control over when to use a parallel pool for parallel execution.
Specifying the UseParallel argument as
true or false is not recommended.
This table shows how to update your code depending on your goal.
| Goal | Not recommended | Recommended |
|---|---|---|
| Write code that runs on the MATLAB client. | seqsplit(fastqFile,barcodeFile,UseParallel=false) | seqsplit(fastqFile,barcodeFile,UseParallel="off") (default) |
| Write portable code that runs on a parallel pool and, if a pool is not available, runs on the MATLAB client. | seqsplit(fastqFile,barcodeFile,UseParallel=true) | seqsplit(fastqFile,barcodeFile,UseParallel="auto") |
| Write code that runs on a parallel pool and errors if a pool is not available. | N/A | seqsplit(fastqFile,barcodeFile,UseParallel="on") |
There are no plans to remove support for true or
false values.