This class holds the functionality of calculating the database suitability. More...

#include <OpenMS/QC/DBSuitability.h>

Inheritance diagram for DBSuitability:

Collaboration diagram for DBSuitability:

Classes
struct	SuitabilityData
	struct to store results More...

Public Member Functions
	DBSuitability ()

	~DBSuitability () override=default
	Destructor. More...

void	compute (std::vector< PeptideIdentification > &&pep_ids, const MSExperiment &exp, const std::vector< FASTAFile::FASTAEntry > &original_fasta, const std::vector< FASTAFile::FASTAEntry > &novo_fasta, const ProteinIdentification::SearchParameters &search_params)
	Computes suitability of a database used to search a mzML. More...

const std::vector< SuitabilityData > &	getResults () const
	Returns results calculated by this metric. More...

Public Member Functions inherited from DefaultParamHandler
	DefaultParamHandler (const String &name)
	Constructor with name that is displayed in error messages. More...

	DefaultParamHandler (const DefaultParamHandler &rhs)
	Copy constructor. More...

virtual	~DefaultParamHandler ()
	Destructor. More...

DefaultParamHandler &	operator= (const DefaultParamHandler &rhs)
	Assignment operator. More...

virtual bool	operator== (const DefaultParamHandler &rhs) const
	Equality operator. More...

void	setParameters (const Param &param)
	Sets the parameters. More...

const Param &	getParameters () const
	Non-mutable access to the parameters. More...

const Param &	getDefaults () const
	Non-mutable access to the default parameters. More...

const String &	getName () const
	Non-mutable access to the name. More...

void	setName (const String &name)
	Mutable access to the name. More...

const std::vector< String > &	getSubsections () const
	Non-mutable access to the registered subsections. More...

Private Member Functions
double	getDecoyDiff_ (const PeptideIdentification &pep_id) const
	Calculates the xcorr difference between the top two hits marked as decoy. More...

double	getDecoyCutOff_ (const std::vector< PeptideIdentification > &pep_ids, double reranking_cutoff_percentile) const
	Calculates a xcorr cut-off based on decoy hits. More...

bool	isNovoHit_ (const PeptideHit &hit) const
	Tests if a PeptideHit is considered a deNovo hit. More...

bool	checkScoreBetterThanThreshold_ (const PeptideHit &hit, double threshold, bool higher_score_better) const
	Tests if a PeptideHit has a score better than the given threshold. More...

std::pair< String, Param >	extractSearchAdapterInfoFromMetaValues_ (const ProteinIdentification::SearchParameters &meta_values) const
	Looks through meta values of SearchParameters to find out which search adapter was used. More...

void	writeIniFile_ (const Param &parameters, const String &filename) const
	Writes parameters into a given file. More...

std::vector< PeptideIdentification >	runIdentificationSearch_ (const MSExperiment &exp, const std::vector< FASTAFile::FASTAEntry > &fasta_data, const String &adapter_name, Param &parameters) const
	Executes the workflow from search adapter, followed by PeptideIndexer and finishes with FDR. More...

std::vector< FASTAFile::FASTAEntry >	getSubsampledFasta_ (const std::vector< FASTAFile::FASTAEntry > &fasta_data, double subsampling_rate) const
	Creates a subsampled fasta with the given subsampling rate. More...

void	calculateSuitability_ (const std::vector< PeptideIdentification > &pep_ids, SuitabilityData &data) const
	Calculates all suitability data from a combined deNovo+database search. More...

void	appendDecoys_ (std::vector< FASTAFile::FASTAEntry > &fasta) const
	Calculates and appends decoys to a given vector of FASTAEntry. More...

double	extractScore_ (const PeptideHit &pep_hit) const
	Returns the cross correlation score normalized by MW (if existing), else if the 'force' flag is set the current main score is returned. More...

double	calculateCorrectionFactor_ (const SuitabilityData &data, const SuitabilityData &data_sampled, double sampling_rate) const
	Calculates the correction factor from two suitability calculations. More...

UInt	numberOfUniqueProteins_ (const std::vector< PeptideIdentification > &peps, UInt number_of_hits=1) const
	Determines the number of unique proteins found in the protein accessions of PeptideIdentifications. More...

Size	getIndexWithMedianNovoHits_ (const std::vector< SuitabilityData > &data) const
	Finds the SuitabilityData object with the median number of de novo hits. More...

double	getScoreMatchingFDR_ (const std::vector< PeptideIdentification > &pep_ids, double FDR, const String &score_name, bool higher_score_better) const
	Extracts the worst score that still passes a FDR (q-value) threshold. More...

Private Attributes
std::vector< SuitabilityData >	results_
	result vector More...

const boost::regex	decoy_pattern_
	pattern for finding a decoy string More...

Friends
class	DBSuitability_friend
	To test private member functions. More...

Additional Inherited Members
Static Public Member Functions inherited from DefaultParamHandler
static void	writeParametersToMetaValues (const Param &write_this, MetaInfoInterface &write_here, const String &key_prefix="")
	Writes all parameters to meta values. More...

Protected Member Functions inherited from DefaultParamHandler
virtual void	updateMembers_ ()
	This method is used to update extra member variables at the end of the setParameters() method. More...

void	defaultsToParam_ ()
	Updates the parameters after the defaults have been set in the constructor. More...

Protected Attributes inherited from DefaultParamHandler
Param	param_
	Container for current parameters. More...

Param	defaults_
	Container for default parameters. This member should be filled in the constructor of derived classes! More...

std::vector< String >	subsections_
	Container for registered subsections. This member should be filled in the constructor of derived classes! More...

String	error_name_
	Name that is displayed in error messages during the parameter checking. More...

bool	check_defaults_
	If this member is set to false no checking if parameters in done;. More...

bool	warn_empty_defaults_
	If this member is set to false no warning is emitted when defaults are empty;. More...

Detailed Description

This class holds the functionality of calculating the database suitability.

To calculate the suitability of a database for a specific mzML for identification search, it is vital to perform a combined deNovo+database identification search. Meaning that the database should be appended with an additional entry derived from concatenated deNovo sequences from said mzML. Currently only Comet search is supported.

This class will calculate q-values by itself and will throw an error if any q-value calculation was done beforehand.

The algorithm parameters can be set using setParams().

Allows for multiple usage of the compute function. The result of each call is stored internally in a vector. Therefore old results will not be overridden by a new call. This vector then can be returned using getResults().

This class serves as the library representation of DatabaseSuitability

Constructor & Destructor Documentation

◆ DBSuitability()

DBSuitability ( )

Constructor Settings are initialized with their default values: no_rerank = false, reranking_cutoff_percentile = 1, FDR = 0.01

◆ ~DBSuitability()

~DBSuitability ( )

overridedefault

Destructor.

Member Function Documentation

◆ appendDecoys_()

void appendDecoys_ ( std::vector< FASTAFile::FASTAEntry > & fasta ) const

private

Calculates and appends decoys to a given vector of FASTAEntry.

Each sequence is digested with Trypsin. The resulting peptides are reversed and appended to one another. This results in the decoy sequences. The identifier is given a 'DECOY_' prefix.

Parameters

fasta reference to fasta vector where the decoys are needed

Referenced by DBSuitability_friend::appendDecoys().

◆ calculateCorrectionFactor_()

double calculateCorrectionFactor_	(	const SuitabilityData &	data,
		const SuitabilityData &	data_sampled,
		double	sampling_rate
	)		const

private

Calculates the correction factor from two suitability calculations.

Two suitability calculations need to be done for this. One with the original data and one with data from a search with a sampled database. The number of db hits and deNovo hits behaves linear. The two searches can than be used to calculate the corresponding linear functions. The factor is calculated with the negative ratio of the db slope and the deNovo slope.

Parameters

data	suitability data from the original search
data_sampled	vector of suitability data from the sampled search(s)
sampling_rate	the sampling rate used for sampled db [0,1)

Returns: correction factor

Referenced by DBSuitability_friend::calculateCorrectionFactor().

◆ calculateSuitability_()

void calculateSuitability_	(	const std::vector< PeptideIdentification > &	pep_ids,
		SuitabilityData &	data
	)		const

private

Calculates all suitability data from a combined deNovo+database search.

Counts top database and top deNovo hits.

Calculates a decoy score cut-off to compare high scoring deNovo hits with lower scoring database hits. If the score difference is smaller than the cut-off the database hit is counted and the deNovo hit ignored.

Suitability is calculated: # database hits / # all hits

Parameters

pep_ids	peptide identifications coming from the combined search, each peptide identification should be sorted
data	SuitabilityData object where the result should be written into

Exceptions

MissingInformation	if no target/decoy annotation is found on `pep_ids`
MissingInformation	if no xcorr is found, this happens when another adapter than CometAdapter was used

◆ checkScoreBetterThanThreshold_()

bool checkScoreBetterThanThreshold_	(	const PeptideHit &	hit,
		double	threshold,
		bool	higher_score_better
	)		const

private

Tests if a PeptideHit has a score better than the given threshold.

Parameters

hit	PepHit in question
threshold	threshold to check against
higher_score_better	true/false depending if a higher or a lower score is better

Returns: true/false

◆ compute()

void compute	(	std::vector< PeptideIdentification > &&	pep_ids,
		const MSExperiment &	exp,
		const std::vector< FASTAFile::FASTAEntry > &	original_fasta,
		const std::vector< FASTAFile::FASTAEntry > &	novo_fasta,
		const ProteinIdentification::SearchParameters &	search_params
	)

Computes suitability of a database used to search a mzML.

Top deNovo and top database hits from a combined deNovo+database search are counted. The ratio of db hits vs all hits yields the suitability. To re-rank cases, where a de novo peptide scores just higher than the database peptide, a decoy cut-off is calculated. This functionality can be turned off. This will result in an underestimated suitability, but it can solve problems like different search engines or to few decoy hits.

Parameters can be set using the functionality of DefaultParamHandler. Parameters are: no_rerank - re-ranking can be turned off with this (will be set automatically if no cross correlation score is found) reranking_cutoff_percentile - percentile that determines which cut-off will be returned FDR - q-value that should be filtered for Preliminary tests have shown that database suitability is rather stable across common FDR thresholds from 0 - 5 % keep_search_files - temporary files created for and by the internal ID search are kept disable_correction - disables corrected suitability calculations force - forces re-ranking to be done even without a cross correlation score, in which case the default main score is used

The calculated suitability is then tried to be corrected. For this a correction factor for the number of found top deNovo hits is calculated. This is done by perfoming an additional combined identification search with a smaller sample of the database. It was observed that the number of top deNovo and db hits behave linear according to the sampling ratio of the database. This can be used to extrapolate the number of database hits that would be needed to get a suitability of 1. This number in combination with the maximum number of deNovo hits (found with an identification search where only deNovo is used as a database) can be used to calculate a correction factor like this: #database hits for suitability of 1 / #maximum deNovo hits This formula can be simplified in a way that the maximum number of deNovo hits isn't needed:

(database hits slope) / deNovo hits slope Both of these values can easily be calculated with the original suitability data in conjunction with the one sampled search.

Correcting the number of found top deNovo hits with this factor results in them being more comparable to the top database hits. This in return results in a more linear behaviour of the suitability according to the sampling ratio. The corrected suitability reflects what sampling ratio your database represents regarding to the theoretical 'perfect' database. Or in other words: Your database needs to be (1 - corrected suitability) bigger to get a suitability of 1.

Both the original suitability as well as the corrected one are reported in the result.

Since q-values need to be calculated the identifications are taken by copy. Since decoys need to be calculated for the fasta input those are taken by copy as well.

Result is appended to the result member. This allows for multiple usage.

Parameters

pep_ids	vector containing pepIDs with target/decoy annotation coming from a deNovo+database identification search without FDR (Comet is recommended - to use other search engines either disable reranking or set the '-force' flag) vector is modified internally, and is thus copied
exp	MSExperiment that was searched to produce the identifications given in `pep_ids`
original_fasta	FASTAEntries of the database used for the ID search (without decoys)
novo_fasta	FASTAEntry derived from deNovo peptides
search_params	SearchParameters object containing information which adapter was used with which settings for the identification search that resulted in `pep_ids`

Exceptions

MissingInformation	if no target/decoy annotation is found on `pep_ids`
MissingInformation	if no xcorr is found, this happens when another adapter than CometAdapter was used
Precondition	if a q-value is found in `pep_ids`

◆ extractScore_()

double extractScore_ ( const PeptideHit & pep_hit ) const

private

Returns the cross correlation score normalized by MW (if existing), else if the 'force' flag is set the current main score is returned.

Parameters

pep_hit PeptideHit of which the score is needed

Returns: cross correlation score normalized by MW or current score

Exceptions

MissingInformation if no xcorr is found and 'force' flag isn't set

◆ extractSearchAdapterInfoFromMetaValues_()

std::pair<String, Param> extractSearchAdapterInfoFromMetaValues_ ( const ProteinIdentification::SearchParameters & meta_values ) const

private

Looks through meta values of SearchParameters to find out which search adapter was used.

Checks for the following adapters: CometAdapter, MSGFPlusAdapter, MSFraggerAdapter, MyriMatchAdapter, OMSSAAdapter and XTandemAdapter

Parameters

meta_values SearchParameters object, since the adapters write their parameters here

Returns: A pair containing the name of the adapter and the parameters used to run it

Exceptions

MissingInformation if none of the adapters above is found in the meta values

◆ getDecoyCutOff_()

double getDecoyCutOff_	(	const std::vector< PeptideIdentification > &	pep_ids,
		double	reranking_cutoff_percentile
	)		const

private

Calculates a xcorr cut-off based on decoy hits.

Decoy differences of all N pepIDs are calculated. The (1-reranking_cutoff_percentile)*N highest one is returned. It is assumed that this difference accounts for 'reranking_cutoff_percentile' of the re-ranking cases.

Parameters

pep_ids	vector containing the pepIDs
reranking_cutoff_percentile	percentile that determines which cut-off will be returned

Returns: xcorr cut-off

Exceptions

IllegalArgument	if reranking_cutoff_percentile isn't in range [0,1]
IllegalArgument	if reranking_cutoff_percentile is too low for a decoy cut-off to be calculated
MissingInformation	if no more than 20 % of the peptide IDs have two decoys in their top ten peptide hits

◆ getDecoyDiff_()

double getDecoyDiff_ ( const PeptideIdentification & pep_id ) const

private

Calculates the xcorr difference between the top two hits marked as decoy.

Searches for the top two decoys hits and returns their score difference. By default the xcorr from Comet is used. If no xcorr can be found and the 'force' flag is set the main score from the peptide hit is used, else an error is thrown.

If there aren't two decoys, DBL_MAX is returned.

Parameters

pep_id pepID from where the decoy difference will be calculated

Returns: xcorr difference

Exceptions

MissingInformation	if no target/decoy annotation is found
MissingInformation	if no xcorr is found

◆ getIndexWithMedianNovoHits_()

Size getIndexWithMedianNovoHits_ ( const std::vector< SuitabilityData > & data ) const

private

Finds the SuitabilityData object with the median number of de novo hits.

If the median isn't distinct (e.g. two entries could be considered median) the upper one is chosen.

Parameters

data	vector of SuitabilityData objects

Returns: index to object with median number of de novo hits

Referenced by DBSuitability_friend::getIndexWithMedianNovoHits().

◆ getResults()

const std::vector<SuitabilityData>& getResults ( ) const

Returns results calculated by this metric.

The returned vector contains one DBSuitabilityData object for each time compute was called. Each of these objects contains the suitability information that was extracted from the identifications used for the corresponding call of compute.

Returns: DBSuitabilityData objects in a vector

◆ getScoreMatchingFDR_()

double getScoreMatchingFDR_	(	const std::vector< PeptideIdentification > &	pep_ids,
		double	FDR,
		const String &	score_name,
		bool	higher_score_better
	)		const

private

Extracts the worst score that still passes a FDR (q-value) threshold.

This can be used to 'convert' a FDR threshold to a threshold for the desired score (score and FDR need to be dependent)

Parameters

pep_ids	vector of PeptideIdentifications
FDR	FDR threshold, hits with a worse q-value score aren't looked at
score_name	name of the score to search for The score name doesn't need to be the exact metavalue name, but a metavalue key should contain it. i.e. "e-value" as metavalue "e-value_score"
higher_score_better	true/false depending if a higher or lower score (`score_name`) is better

Returns: the worst score that is still in the FDR threshold

Exceptions

IllegalArgument	if `score_name` isn't found in the metavalues
Precondition	if main score of `pep_ids` isn't 'q-value'

Referenced by DBSuitability_friend::getScoreMatchingFDR().

◆ getSubsampledFasta_()

std::vector<FASTAFile::FASTAEntry> getSubsampledFasta_	(	const std::vector< FASTAFile::FASTAEntry > &	fasta_data,
		double	subsampling_rate
	)		const

private

Creates a subsampled fasta with the given subsampling rate.

The subsampling is based on the number of amino acides and not on the number of fasta entries.

Parameters

fasta_data	fasta of which the subsampling should be done
subsampling_rate	subsampling rate to be used [0,1]

Returns: fasta entries with total number of AA = original number of AA * subsampling_rate

Exceptions

IllegalArgument if subsampling rate is not between 0 and 1

Referenced by DBSuitability_friend::getSubsampledFasta().

◆ isNovoHit_()

bool isNovoHit_ ( const PeptideHit & hit ) const

private

Tests if a PeptideHit is considered a deNovo hit.

To test this the function looks into the protein accessions. If only the deNovo protein is found, 'true' is returned. If at least one database protein is found, 'false' is returned.

This function also uses boost::regex_search to make sure the deNovo accession doesn't contain a decoy string. This is needed for 'target+decoy' hits.

Parameters

hit	PepHit in question

Returns: true/false

◆ numberOfUniqueProteins_()

UInt numberOfUniqueProteins_	(	const std::vector< PeptideIdentification > &	peps,
		UInt	number_of_hits = `1`
	)		const

private

Determines the number of unique proteins found in the protein accessions of PeptideIdentifications.

Parameters

peps	vector of PeptideIdentifications
number_of_hits	the number of hits to search in (if this is bigger than the actual number of hits all hits are looked at)

Returns: number of unique protein accessions

Exceptions

MissingInformation if no target/decoy annotation is found on peps

Referenced by DBSuitability_friend::numberOfUniqueProteins().

◆ runIdentificationSearch_()

std::vector<PeptideIdentification> runIdentificationSearch_	(	const MSExperiment &	exp,
		const std::vector< FASTAFile::FASTAEntry > &	fasta_data,
		const String &	adapter_name,
		Param &	parameters
	)		const

private

Executes the workflow from search adapter, followed by PeptideIndexer and finishes with FDR.

Which adapter should run with which parameters can be controlled. Make sure the search adapter you wish to use is built on your system and the executable is on your PATH variable.

Indexing and FDR are always done the same way.

The inputs are stored in temporary files to execute the Adapter. (MSExperiment -> .mzML, vector<FASTAEntry> -> .fasta, Param -> .INI)

Parameters

exp	MSExperiment that will be searched
fasta_data	represents the database that should be used to search
adapter_name	name of the adapter to search with
parameters	parameters for the adapter

Returns: peptide identifications with annotated q-values

Exceptions

MissingInformation	if no adapter name is given
InvalidParameter	if a not supported adapter name is given
InternalToolError	if any error occures while running the adapter
InternalToolError	if any error occures while running PeptideIndexer functionalities
InvalidParameter	if the needed FDR parameters are not found

◆ writeIniFile_()

void writeIniFile_	(	const Param &	parameters,
		const String &	filename
	)		const

private

Writes parameters into a given file.

Parameters

parameters	parameters to write
filename	name of the file where the parameters should be written to

Exceptions

UnableToCreateFile if filename isn't writable

Friends And Related Function Documentation

◆ DBSuitability_friend

friend class DBSuitability_friend

friend

To test private member functions.

Member Data Documentation

◆ decoy_pattern_

const boost::regex decoy_pattern_

private

pattern for finding a decoy string

◆ results_

std::vector<SuitabilityData> results_

private

result vector

Classes

Public Member Functions

Private Member Functions

Private Attributes

Friends

Additional Inherited Members

Detailed Description

Constructor & Destructor Documentation

◆ DBSuitability()

◆ ~DBSuitability()

Member Function Documentation

◆ appendDecoys_()

◆ calculateCorrectionFactor_()

◆ calculateSuitability_()

◆ checkScoreBetterThanThreshold_()

◆ compute()

◆ extractScore_()

◆ extractSearchAdapterInfoFromMetaValues_()

◆ getDecoyCutOff_()

◆ getDecoyDiff_()

◆ getIndexWithMedianNovoHits_()

◆ getResults()

◆ getScoreMatchingFDR_()

◆ getSubsampledFasta_()

◆ isNovoHit_()

◆ numberOfUniqueProteins_()

◆ runIdentificationSearch_()

◆ writeIniFile_()

Friends And Related Function Documentation

◆ DBSuitability_friend

Member Data Documentation

◆ decoy_pattern_

◆ results_