OpenMS
GridBasedClustering< Metric > Class Template Reference

2D hierarchical clustering implementation optimized for large data sets containing many small clusters i.e. dimensions of clusters << dimension of entire dataset More...

#include <OpenMS/ML/CLUSTERING/GridBasedClustering.h>

Inheritance diagram for GridBasedClustering< Metric >:
[legend]
Collaboration diagram for GridBasedClustering< Metric >:
[legend]

Public Types

typedef GridBasedCluster::Point Point
 cluster centre, cluster bounding box, grid index More...
 
typedef GridBasedCluster::Rectangle Rectangle
 
typedef ClusteringGrid::CellIndex CellIndex
 
typedef std::multiset< MinimumDistance >::const_iterator MultisetIterator
 
typedef std::unordered_multimap< int, MultisetIterator >::const_iterator NNIterator
 
- Public Types inherited from ProgressLogger
enum  LogType { CMD , GUI , NONE }
 Possible log types. More...
 

Public Member Functions

 GridBasedClustering (Metric metric, const std::vector< double > &data_x, const std::vector< double > &data_y, const std::vector< int > &properties_A, const std::vector< int > &properties_B, std::vector< double > grid_spacing_x, std::vector< double > grid_spacing_y)
 initialises all data structures More...
 
 GridBasedClustering (Metric metric, const std::vector< double > &data_x, const std::vector< double > &data_y, std::vector< double > grid_spacing_x, std::vector< double > grid_spacing_y)
 initialises all data structures More...
 
void cluster ()
 performs the hierarchical clustering (merges clusters until their dimension exceeds that of cell) More...
 
void extendClustersY ()
 extends clusters in y-direction if possible (merges clusters further in y-direction, i.e. clusters can now span multiple cells) More...
 
void removeSmallClustersY (double threshold_y)
 removes clusters with bounding box dimension in y-direction below certain threshold More...
 
std::map< int, GridBasedClustergetResults () const
 returns final results (mapping of cluster indices to clusters) More...
 
- Public Member Functions inherited from ProgressLogger
 ProgressLogger ()
 Constructor. More...
 
virtual ~ProgressLogger ()
 Destructor. More...
 
 ProgressLogger (const ProgressLogger &other)
 Copy constructor. More...
 
ProgressLoggeroperator= (const ProgressLogger &other)
 Assignment Operator. More...
 
void setLogType (LogType type) const
 Sets the progress log that should be used. The default type is NONE! More...
 
LogType getLogType () const
 Returns the type of progress log being used. More...
 
void setLogger (ProgressLoggerImpl *logger)
 Sets the logger to be used for progress logging. More...
 
void startProgress (SignedSize begin, SignedSize end, const String &label) const
 Initializes the progress display. More...
 
void setProgress (SignedSize value) const
 Sets the current progress. More...
 
void endProgress (UInt64 bytes_processed=0) const
 
void nextProgress () const
 increment progress by 1 (according to range begin-end) More...
 

Private Member Functions

void init_ (const std::vector< double > &data_x, const std::vector< double > &data_y, const std::vector< int > &properties_A, const std::vector< int > &properties_B)
 initialises all data structures More...
 
bool mergeVeto_ (const GridBasedCluster &c1, const GridBasedCluster &c2) const
 checks if two clusters can be merged Each point in a cluster can (optionally) have two properties A and B. Properties A need to be the same, properties B need to differ in each cluster. Methods checks if this is violated in the merged cluster. More...
 
bool findNearestNeighbour_ (const GridBasedCluster &cluster, int cluster_index)
 determines the nearest neighbour for each cluster More...
 
void eraseMinDistance_ (const std::multiset< MinimumDistance >::const_iterator it)
 remove minimum distance object and its related data More...
 

Private Attributes

Metric metric_
 metric for measuring the distance between points in the 2D plane More...
 
ClusteringGrid grid_
 grid on which the position of the clusters are registered used in cluster method More...
 
std::map< int, GridBasedClusterclusters_
 list of clusters maps cluster indices to clusters More...
 
std::map< int, GridBasedClusterclusters_final_
 list of final clusters i.e. clusters that are no longer merged More...
 
std::multiset< MinimumDistancedistances_
 list of minimum distances stores the smallest of the distances in the head More...
 
std::unordered_multimap< int, std::multiset< MinimumDistance >::const_iterator > reverse_nns_
 reverse nearest neighbor lookup table for finding out which clusters need to be updated faster More...
 
std::unordered_map< int, std::multiset< MinimumDistance >::const_iterator > distance_it_for_cluster_idx_
 cluster index to distance iterator lookup table for finding out which clusters need to be updated faster More...
 

Additional Inherited Members

- Protected Attributes inherited from ProgressLogger
LogType type_
 
time_t last_invoke_
 
ProgressLoggerImplcurrent_logger_
 
- Static Protected Attributes inherited from ProgressLogger
static int recursion_depth_
 

Detailed Description

template<typename Metric>
class OpenMS::GridBasedClustering< Metric >

2D hierarchical clustering implementation optimized for large data sets containing many small clusters i.e. dimensions of clusters << dimension of entire dataset

The clustering problem therefore simplifies to a number of local clustering problems. Each of the local problems can be solved on a couple of adjacent cells on a larger grid. The grid spacing is determined by the expected typical cluster size in this region.

Each data point can have two additional properties A and B. In each cluster all properties A need to be the same, all properties B different.

Member Typedef Documentation

◆ CellIndex

◆ MultisetIterator

typedef std::multiset<MinimumDistance>::const_iterator MultisetIterator

◆ NNIterator

typedef std::unordered_multimap<int, MultisetIterator>::const_iterator NNIterator

◆ Point

cluster centre, cluster bounding box, grid index

◆ Rectangle

Constructor & Destructor Documentation

◆ GridBasedClustering() [1/2]

GridBasedClustering ( Metric  metric,
const std::vector< double > &  data_x,
const std::vector< double > &  data_y,
const std::vector< int > &  properties_A,
const std::vector< int > &  properties_B,
std::vector< double >  grid_spacing_x,
std::vector< double >  grid_spacing_y 
)
inline

initialises all data structures

Parameters
metricMetric for measuring the distance between points in the 2D plane
data_xx-coordinates of points to be clustered
data_yy-coordinates of points to be clustered
properties_Aproperty A of points (same in each cluster)
properties_Bproperty B of points (different in each cluster)
grid_spacing_xgrid spacing in x-direction
grid_spacing_ygrid spacing in y-direction

References GridBasedClustering< Metric >::init_().

◆ GridBasedClustering() [2/2]

GridBasedClustering ( Metric  metric,
const std::vector< double > &  data_x,
const std::vector< double > &  data_y,
std::vector< double >  grid_spacing_x,
std::vector< double >  grid_spacing_y 
)
inline

initialises all data structures

Parameters
metricMetric for measuring the distance between points in the 2D plane
data_xx-coordinates of points to be clustered
data_yy-coordinates of points to be clustered
grid_spacing_xgrid spacing in x-direction
grid_spacing_ygrid spacing in y-direction

References GridBasedClustering< Metric >::init_().

Member Function Documentation

◆ cluster()

◆ eraseMinDistance_()

void eraseMinDistance_ ( const std::multiset< MinimumDistance >::const_iterator  it)
inlineprivate

remove minimum distance object and its related data

Remove the distance object behind it from distances_ and remove all corresponding data from the auxiliary data structures reverse_nns_ and distance_it_for_cluster_idx_.

Parameters
itIterator of distance to be removed from distances_

References GridBasedClustering< Metric >::distance_it_for_cluster_idx_, GridBasedClustering< Metric >::distances_, and GridBasedClustering< Metric >::reverse_nns_.

Referenced by GridBasedClustering< Metric >::cluster().

◆ extendClustersY()

◆ findNearestNeighbour_()

bool findNearestNeighbour_ ( const GridBasedCluster cluster,
int  cluster_index 
)
inlineprivate

determines the nearest neighbour for each cluster

Note
If no nearest neighbour exists, the cluster is removed from the list. The deletion is done outside of the method, see return boolean.
If two clusters cannot be merged (merge veto), they are no viable nearest neighbours.
Parameters
clustercluster for which the nearest neighbour should be found
cluster_indexindex of cluster
Returns
Should the cluster be removed from the cluster list?

References GridBasedClustering< Metric >::cluster(), GridBasedClustering< Metric >::clusters_, GridBasedClustering< Metric >::clusters_final_, GridBasedClustering< Metric >::distance_it_for_cluster_idx_, GridBasedClustering< Metric >::distances_, GridBasedCluster::getCentre(), ClusteringGrid::getClusters(), ClusteringGrid::getIndex(), GridBasedClustering< Metric >::grid_, ClusteringGrid::isNonEmptyCell(), GridBasedClustering< Metric >::mergeVeto_(), GridBasedClustering< Metric >::metric_, and GridBasedClustering< Metric >::reverse_nns_.

Referenced by GridBasedClustering< Metric >::cluster(), and GridBasedClustering< Metric >::init_().

◆ getResults()

std::map<int, GridBasedCluster> getResults ( ) const
inline

returns final results (mapping of cluster indices to clusters)

References GridBasedClustering< Metric >::clusters_final_.

◆ init_()

void init_ ( const std::vector< double > &  data_x,
const std::vector< double > &  data_y,
const std::vector< int > &  properties_A,
const std::vector< int > &  properties_B 
)
inlineprivate

initialises all data structures

Parameters
data_xx-coordinates of points to be clustered
data_yy-coordinates of points to be clustered
properties_Aproperty A of points (same in each cluster)
properties_Bproperty B of points (different in each cluster)

References ClusteringGrid::addCluster(), GridBasedClustering< Metric >::cluster(), GridBasedClustering< Metric >::clusters_, GridBasedClustering< Metric >::findNearestNeighbour_(), ClusteringGrid::getIndex(), GridBasedClustering< Metric >::grid_, and ClusteringGrid::removeCluster().

Referenced by GridBasedClustering< Metric >::GridBasedClustering().

◆ mergeVeto_()

bool mergeVeto_ ( const GridBasedCluster c1,
const GridBasedCluster c2 
) const
inlineprivate

checks if two clusters can be merged Each point in a cluster can (optionally) have two properties A and B. Properties A need to be the same, properties B need to differ in each cluster. Methods checks if this is violated in the merged cluster.

Parameters
c1cluster 1
c2cluster 2
Returns
veto for merging clusters true -> clusters can be merged false -> clusters cannot be merged

References GridBasedCluster::getPropertiesB(), and GridBasedCluster::getPropertyA().

Referenced by GridBasedClustering< Metric >::findNearestNeighbour_().

◆ removeSmallClustersY()

void removeSmallClustersY ( double  threshold_y)
inline

removes clusters with bounding box dimension in y-direction below certain threshold

Parameters
threshold_yminimal dimension of the cluster bounding box

References GridBasedClustering< Metric >::clusters_final_, DIntervalBase< D >::maxY(), and DIntervalBase< D >::minY().

Member Data Documentation

◆ clusters_

std::map<int, GridBasedCluster> clusters_
private

◆ clusters_final_

◆ distance_it_for_cluster_idx_

std::unordered_map<int, std::multiset<MinimumDistance>::const_iterator> distance_it_for_cluster_idx_
private

cluster index to distance iterator lookup table for finding out which clusters need to be updated faster

Referenced by GridBasedClustering< Metric >::cluster(), GridBasedClustering< Metric >::eraseMinDistance_(), and GridBasedClustering< Metric >::findNearestNeighbour_().

◆ distances_

std::multiset<MinimumDistance> distances_
private

list of minimum distances stores the smallest of the distances in the head

Referenced by GridBasedClustering< Metric >::cluster(), GridBasedClustering< Metric >::eraseMinDistance_(), and GridBasedClustering< Metric >::findNearestNeighbour_().

◆ grid_

◆ metric_

Metric metric_
private

metric for measuring the distance between points in the 2D plane

Referenced by GridBasedClustering< Metric >::findNearestNeighbour_().

◆ reverse_nns_

std::unordered_multimap<int, std::multiset<MinimumDistance>::const_iterator> reverse_nns_
private

reverse nearest neighbor lookup table for finding out which clusters need to be updated faster

Referenced by GridBasedClustering< Metric >::cluster(), GridBasedClustering< Metric >::eraseMinDistance_(), and GridBasedClustering< Metric >::findNearestNeighbour_().