OpenMS
Class for the enzymatic digestion of sequences. More...
#include <OpenMS/CHEMISTRY/EnzymaticDigestion.h>
Public Types | |
enum | Specificity { SPEC_NONE = 0 , SPEC_SEMI = 1 , SPEC_FULL = 2 , SPEC_UNKNOWN = 3 , SPEC_NOCTERM = 8 , SPEC_NONTERM = 9 , SIZE_OF_SPECIFICITY = 10 } |
Determines, when querying for valid digestion products, whether the specificity of the two peptide ends is taken into account. More... | |
Public Member Functions | |
EnzymaticDigestion () | |
Default constructor. More... | |
EnzymaticDigestion (const EnzymaticDigestion &rhs) | |
Copy constructor. More... | |
EnzymaticDigestion & | operator= (const EnzymaticDigestion &rhs) |
Assignment operator. More... | |
virtual | ~EnzymaticDigestion () |
Destructor. More... | |
Size | getMissedCleavages () const |
Returns the number of missed cleavages for the digestion. More... | |
void | setMissedCleavages (Size missed_cleavages) |
Sets the number of missed cleavages for the digestion (default is 0). This setting is ignored when the log model is used. More... | |
String | getEnzymeName () const |
Returns the enzyme for the digestion. More... | |
virtual void | setEnzyme (const DigestionEnzyme *enzyme) |
Sets the enzyme for the digestion. More... | |
Specificity | getSpecificity () const |
Returns the specificity for the digestion. More... | |
void | setSpecificity (Specificity spec) |
Sets the specificity for the digestion (default is SPEC_FULL). More... | |
Size | digestUnmodified (const StringView &sequence, std::vector< StringView > &output, Size min_length=1, Size max_length=0) const |
Performs the enzymatic digestion of an unmodified sequence. More... | |
Size | digestUnmodified (const StringView &sequence, std::vector< std::pair< Size, Size >> &output, Size min_length=1, Size max_length=0) const |
Performs the enzymatic digestion of an unmodified sequence. More... | |
bool | isValidProduct (const String &protein, int pep_pos, int pep_length, bool ignore_missed_cleavages=true) const |
Is the peptide fragment starting at position pep_pos with length pep_length within the sequence protein generated by the current enzyme? More... | |
Size | countInternalCleavageSites (const String &sequence) const |
Counts the number of internal cleavage sites (missed cleavages) in a protein sequence. More... | |
bool | filterByMissedCleavages (const String &sequence, const std::function< bool(const Int)> &filter) const |
Filter based on the number of missed cleavages. More... | |
Static Public Member Functions | |
static Specificity | getSpecificityByName (const String &name) |
Static Public Attributes | |
static const std::string | NamesOfSpecificity [SIZE_OF_SPECIFICITY] |
Names of the Specificity. More... | |
static const std::string | NoCleavage |
Name for no cleavage. More... | |
static const std::string | UnspecificCleavage |
Name for unspecific cleavage. More... | |
Protected Member Functions | |
bool | isValidProduct_ (const String &sequence, int pos, int length, bool ignore_missed_cleavages, bool allow_nterm_protein_cleavage, bool allow_random_asp_pro_cleavage) const |
Also supports the functionality needed by ProteaseDigestion, which is deeply woven into this function. To avoid code duplication, the logic is stored here and called by wrappers. Do not duplicate the code just for the sake of semantics (unless a clean separation can be found). Note: the overhead of allow_nterm_protein_cleavage and allow_random_asp_pro_cleavage is marginal; most of the runtime is spent in tokenize_(). More... | |
std::vector< int > | tokenize_ (const String &sequence, int start=0, int end=-1) const |
Digests the sequence using the enzyme's regular expression. More... | |
Size | digestAfterTokenize_ (const std::vector< int > &fragment_positions, const StringView &sequence, std::vector< StringView > &output, Size min_length=0, Size max_length=-1) const |
Helper function for digestUnmodified() More... | |
Size | digestAfterTokenize_ (const std::vector< int > &fragment_positions, const StringView &sequence, std::vector< std::pair< Size, Size >> &output, Size min_length=0, Size max_length=-1) const |
Size | countMissedCleavages_ (const std::vector< int > &cleavage_positions, Size seq_start, Size seq_end) const |
Counts the number of missed cleavages in a sequence fragment. More... | |
Protected Attributes | |
Size | missed_cleavages_ |
Number of missed cleavages. More... | |
const DigestionEnzyme * | enzyme_ |
Used enzyme. More... | |
std::unique_ptr< boost::regex > | re_ |
Regex for tokenizing (huge speedup by making this a member instead of stack object in tokenize_()) More... | |
Specificity | specificity_ |
Specificity of the enzyme. More... | |
Class for the enzymatic digestion of sequences.
Digestion can be performed using simple regular expressions, e.g. [KR] | [^P] for trypsin. Missed cleavages can also be modeled, i.e. adjacent peptides that remain joined because the enzyme failed to cut or could not access the site. If n missed cleavages are allowed, all possible resulting peptides (cleaved and uncleaved) with up to n missed cleavages are returned; no random selection of n specific missed cleavage sites is performed.
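The enumeration of peptides with up to n missed cleavages can be sketched without OpenMS. This is a minimal illustration under stated assumptions, not the library's implementation: the names `cleavageSites` and `digestWithMissedCleavages` are hypothetical, and a trypsin-like rule (cut after K or R unless the next residue is P) is hard-coded instead of being taken from an enzyme definition.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: fragment boundaries for a trypsin-like rule
// (cut after K or R unless the next residue is P).
// The result always starts with 0 and, for non-empty input, ends with seq.size().
std::vector<std::size_t> cleavageSites(const std::string& seq)
{
  std::vector<std::size_t> sites{0};
  for (std::size_t i = 0; i + 1 < seq.size(); ++i)
  {
    const bool cuts = (seq[i] == 'K' || seq[i] == 'R') && seq[i + 1] != 'P';
    if (cuts) sites.push_back(i + 1);
  }
  if (!seq.empty()) sites.push_back(seq.size());
  return sites;
}

// Hypothetical helper: every peptide that spans at most `max_missed`
// uncleaved sites, i.e. fully cleaved products plus all longer combinations.
std::vector<std::string> digestWithMissedCleavages(const std::string& seq, std::size_t max_missed)
{
  const std::vector<std::size_t> sites = cleavageSites(seq);
  assert(!sites.empty() && sites.front() == 0);  // invariant of cleavageSites()
  std::vector<std::string> peptides;
  for (std::size_t i = 0; i + 1 < sites.size(); ++i)
  {
    for (std::size_t mc = 0; mc <= max_missed && i + 1 + mc < sites.size(); ++mc)
    {
      peptides.push_back(seq.substr(sites[i], sites[i + 1 + mc] - sites[i]));
    }
  }
  return peptides;
}
```

For the sequence AKRPGKC this rule gives the fragments AK, RPGK and C; allowing one missed cleavage adds AKRPGK and RPGKC.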
enum Specificity |
Determines, when querying for valid digestion products, whether the specificity of the two peptide ends is taken into account.
EnzymaticDigestion | ( | ) |
Default constructor.
EnzymaticDigestion | ( | const EnzymaticDigestion & | rhs | ) |
Copy constructor.
virtual ~EnzymaticDigestion () | virtual |
Destructor.
Size countInternalCleavageSites | ( | const String & | sequence | ) | const
Counts the number of internal cleavage sites (missed cleavages) in a protein sequence.
sequence | Sequence |
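Size countMissedCleavages_ (const std::vector< int > &cleavage_positions, Size seq_start, Size seq_end) const | protected |
Counts the number of missed cleavages in a sequence fragment.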
cleavage_positions | Positions of cleavage in protein as obtained from tokenize_() |
seq_start | Index into sequence |
seq_end | Past-the-end index into sequence |
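Counting missed cleavages in a fragment amounts to counting the cleavage positions that fall strictly inside it. The following is a sketch under that assumption; `countMissedCleavagesSketch` is a hypothetical name, and the input is assumed to be sorted in ascending order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: number of cleavage positions p with
// seq_start < p < seq_end. `cleavage_positions` must be sorted ascending.
std::size_t countMissedCleavagesSketch(const std::vector<int>& cleavage_positions,
                                       std::size_t seq_start, std::size_t seq_end)
{
  assert(std::is_sorted(cleavage_positions.begin(), cleavage_positions.end()));
  if (seq_end <= seq_start) return 0;
  // First position strictly greater than seq_start ...
  const auto first = std::upper_bound(cleavage_positions.begin(), cleavage_positions.end(),
                                      static_cast<int>(seq_start));
  // ... up to (excluding) the first position that is >= seq_end.
  const auto last = std::lower_bound(first, cleavage_positions.end(),
                                     static_cast<int>(seq_end));
  return static_cast<std::size_t>(last - first);
}
```

Because the positions are sorted, two binary searches suffice and the cost stays logarithmic in the number of cleavage sites.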
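Size digestAfterTokenize_ (const std::vector< int > &fragment_positions, const StringView &sequence, std::vector< std::pair< Size, Size >> &output, Size min_length=0, Size max_length=-1) const | protected |
Size digestAfterTokenize_ (const std::vector< int > &fragment_positions, const StringView &sequence, std::vector< StringView > &output, Size min_length=0, Size max_length=-1) const | protected |
Helper function for digestUnmodified()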
This function implements digestUnmodified() starting from the result of tokenize_(). The separation enables derived classes to modify the result of tokenize_() during the in-silico digestion.
Size digestUnmodified | ( | const StringView & | sequence, |
std::vector< std::pair< Size, Size >> & | output, | ||
Size | min_length = 1 , |
||
Size | max_length = 0 |
||
) | const |
Performs the enzymatic digestion of an unmodified sequence.
Because only positions into the original string are returned, this is very fast and, unlike the StringView output version of this function, the result is independent of the original sequence. It can be used for matching products, e.g. to determine missing ones.
sequence | Sequence to digest |
output | Digestion products as vector of pairs of start and end positions |
min_length | Minimal length of reported products |
max_length | Maximal length of reported products (0 = no restriction) |
Size digestUnmodified | ( | const StringView & | sequence, |
std::vector< StringView > & | output, | ||
Size | min_length = 1 , |
||
Size | max_length = 0 |
||
) | const |
Performs the enzymatic digestion of an unmodified sequence.
By returning only references into the original string this is very fast.
sequence | Sequence to digest |
output | Digestion products |
min_length | Minimal length of reported products |
max_length | Maximal length of reported products (0 = no restriction) |
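The idea of returning references into the original string, together with the length filter, can be illustrated with `std::string_view` from the standard library. This is only a sketch: `viewsFromBoundaries` is a hypothetical function, the fragment boundaries are passed in rather than computed from an enzyme, and returning the number of discarded products is a choice made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical sketch: cuts `sequence` at the given boundaries (sorted,
// first = 0, last = sequence.size()) and appends non-owning views to `output`.
// max_length == 0 means "no upper limit". Returns how many products were
// dropped by the length filter (an assumption of this sketch).
std::size_t viewsFromBoundaries(std::string_view sequence,
                                const std::vector<std::size_t>& boundaries,
                                std::vector<std::string_view>& output,
                                std::size_t min_length = 1,
                                std::size_t max_length = 0)
{
  assert(!boundaries.empty() && boundaries.back() <= sequence.size());
  if (max_length == 0) max_length = sequence.size();
  std::size_t dropped = 0;
  for (std::size_t i = 0; i + 1 < boundaries.size(); ++i)
  {
    const std::size_t len = boundaries[i + 1] - boundaries[i];
    if (len >= min_length && len <= max_length)
    {
      output.push_back(sequence.substr(boundaries[i], len));  // no copy of characters
    }
    else
    {
      ++dropped;
    }
  }
  return dropped;
}
```

No characters are copied, which is where the speed comes from. The views are only valid while the underlying sequence is alive and unmodified, so the caller must keep the original string around for as long as the output is used.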
bool filterByMissedCleavages | ( | const String & | sequence, |
const std::function< bool(const Int)> & | filter | ||
) | const |
Filter based on the number of missed cleavages.
sequence | Unmodified (!) amino acid sequence to check. |
filter | A predicate that takes as parameter the number of missed cleavages in the sequence and returns true if the sequence should be filtered out. |
Referenced by IDFilter::PeptideDigestionFilter::operator()().
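The predicate-based filtering can be sketched as follows. The names `countInternalSites` and `filterByMissedCleavagesSketch` are hypothetical, and a hard-coded trypsin-like rule (cut after K or R unless followed by P) stands in for the configured enzyme.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: counts internal cleavage sites for a trypsin-like rule.
// The last residue is never counted because a site there is not internal.
int countInternalSites(const std::string& seq)
{
  int n = 0;
  for (std::size_t i = 0; i + 1 < seq.size(); ++i)
  {
    if ((seq[i] == 'K' || seq[i] == 'R') && seq[i + 1] != 'P') ++n;
  }
  return n;
}

// Hypothetical sketch: returns true if `filter` decides that the sequence
// should be filtered out, given its number of missed cleavages.
bool filterByMissedCleavagesSketch(const std::string& seq,
                                   const std::function<bool(int)>& filter)
{
  assert(static_cast<bool>(filter));  // an empty std::function would throw when called
  return filter(countInternalSites(seq));
}
```

A caller might pass a lambda such as `[](int mc) { return mc > 1; }` to discard every sequence with more than one missed cleavage. Note the polarity: `true` means "remove", not "keep".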
String getEnzymeName | ( | ) | const |
Returns the enzyme for the digestion.
Size getMissedCleavages | ( | ) | const |
Returns the number of missed cleavages for the digestion.
Specificity getSpecificity | ( | ) | const |
Returns the specificity for the digestion.
static Specificity getSpecificityByName (const String &name) | static |
Converts a specificity name to the corresponding enum value. Returns SPEC_UNKNOWN if name is not valid.
bool isValidProduct | ( | const String & | protein, |
int | pep_pos, | ||
int | pep_length, | ||
bool | ignore_missed_cleavages = true |
||
) | const |
Is the peptide fragment starting at position pep_pos with length pep_length within the sequence protein generated by the current enzyme?
Checks whether the peptide is a valid digestion product of the enzyme, taking into account the specificity and the missed-cleavage (MC) flag provided here.
protein | Protein sequence |
pep_pos | Starting index of potential peptide |
pep_length | Length of potential peptide |
ignore_missed_cleavages | Do not compare the number of missed cleavages (MCs) of the potential peptide to the maximum allowed number |
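How specificity could enter such a check is sketched below. This is not the library's code: `Spec` and `isValidProductSketch` are hypothetical, missed cleavages are ignored entirely, and the interpretation used here (full = both peptide ends lie on cleavage sites, semi = at least one end does, none = no requirement) is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins for SPEC_NONE / SPEC_SEMI / SPEC_FULL (hypothetical type).
enum class Spec { None, Semi, Full };

// `boundaries` is sorted and holds every fragment boundary of the protein,
// including 0 and the protein length.
bool isValidProductSketch(const std::vector<std::size_t>& boundaries,
                          std::size_t pep_pos, std::size_t pep_length, Spec spec)
{
  assert(std::is_sorted(boundaries.begin(), boundaries.end()));
  if (pep_length == 0 || boundaries.empty()) return false;
  const std::size_t pep_end = pep_pos + pep_length;
  if (pep_end > boundaries.back()) return false;  // peptide runs past the protein
  const bool left = std::binary_search(boundaries.begin(), boundaries.end(), pep_pos);
  const bool right = std::binary_search(boundaries.begin(), boundaries.end(), pep_end);
  switch (spec)
  {
    case Spec::Full: return left && right;  // both ends must be cleavage sites
    case Spec::Semi: return left || right;  // one matching end is enough
    case Spec::None: return true;           // any in-range substring is accepted
  }
  return false;
}
```

With boundaries {0, 2, 6, 7}, the peptide at position 2 with length 4 passes under every setting, while the peptide at position 3 with length 3 passes only under the semi and none settings because just its right end coincides with a boundary.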
bool isValidProduct_ (const String &sequence, int pos, int length, bool ignore_missed_cleavages, bool allow_nterm_protein_cleavage, bool allow_random_asp_pro_cleavage) const | protected |
Also supports the functionality needed by ProteaseDigestion, which is deeply woven into this function. To avoid code duplication, the logic is stored here and called by wrappers. Do not duplicate the code just for the sake of semantics (unless a clean separation can be found). Note: the overhead of allow_nterm_protein_cleavage and allow_random_asp_pro_cleavage is marginal; most of the runtime is spent in tokenize_().
EnzymaticDigestion& operator= | ( | const EnzymaticDigestion & | rhs | ) |
Assignment operator.
virtual void setEnzyme (const DigestionEnzyme *enzyme) | virtual |
Sets the enzyme for the digestion.
Reimplemented in RNaseDigestion.
void setMissedCleavages | ( | Size | missed_cleavages | ) |
Sets the number of missed cleavages for the digestion (default is 0). This setting is ignored when the log model is used.
Referenced by NucleicAcidSearchEngine::main_().
void setSpecificity | ( | Specificity | spec | ) |
Sets the specificity for the digestion (default is SPEC_FULL).
std::vector< int > tokenize_ (const String &sequence, int start=0, int end=-1) const | protected |
Digests the sequence using the enzyme's regular expression.
The resulting split positions include start as the first position, but not end. If start is negative, it is reset to zero. If end is negative or beyond the sequence's size(), it is set to size(). All returned positions are relative to the full sequence.
Returned positions include start and any positions between start and end that match the regex.
sequence | ... |
start | Start digestion after this point |
end | Past-the-end index into sequence |
Returns the split positions (including start, but not end).
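The clamping of start and end and the convention that positions are relative to the full sequence can be sketched with `std::regex`. This is an illustration only: `TokenizerSketch` is a hypothetical type, the pattern `[KR](?!P)` is a trypsin-like stand-in for the enzyme's expression, the library itself uses boost::regex, and returning an empty vector for an empty range is a choice made for this sketch.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical sketch of regex-based tokenization.
struct TokenizerSketch
{
  // Compiled once and reused by every call, instead of being rebuilt per call.
  std::regex re = std::regex("[KR](?!P)");

  std::vector<int> tokenize(const std::string& sequence, int start = 0, int end = -1) const
  {
    const int size = static_cast<int>(sequence.size());
    if (start < 0) start = 0;
    if (end < 0 || end > size) end = size;
    std::vector<int> positions;
    if (start >= end) return positions;  // empty range: a choice of this sketch

    positions.push_back(start);
    // Search up to the end of the full sequence so the lookahead can still
    // see the residue that follows a site near `end`.
    auto it = std::sregex_iterator(sequence.begin() + start, sequence.end(), re);
    for (; it != std::sregex_iterator(); ++it)
    {
      // The cut lies directly after the matched residue, in full-sequence coordinates.
      const int cut = static_cast<int>((*it)[0].second - sequence.begin());
      if (cut >= end) break;  // `end` itself is never reported
      positions.push_back(cut);
    }
    assert(positions.front() == start);
    return positions;
  }
};
```

For AKRPGKC with default arguments the sketch yields the positions 0, 2 and 6; restricting the range to start 3 and end 6 yields only 3, because the cut after the final K would coincide with end.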
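const DigestionEnzyme * enzyme_ | protected |
Used enzyme.
Size missed_cleavages_ | protected |
Number of missed cleavages.
const std::string NamesOfSpecificity [SIZE_OF_SPECIFICITY] | static |
Names of the Specificity.
const std::string NoCleavage | static |
Name for no cleavage.
std::unique_ptr< boost::regex > re_ | protected |
Regex for tokenizing (huge speedup by making this a member instead of a stack object in tokenize_())
Specificity specificity_ | protected |
Specificity of the enzyme.
const std::string UnspecificCleavage | static |
Name for unspecific cleavage.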