Prototype-based learning for large and multimodal data sets

Research Areas: 

The goal of the project is to extend intuitive prototype- or exemplar-based machine learning methods to streaming and multimodal data sets and to test their suitability for data inspection and visualization. A particular focus will be on kernelized or relational variants, which offer a very general interface to the data in terms of dissimilarities that can be chosen according to the specific data characteristics at hand. In this context, important questions are how prototypes can be represented intuitively to the user, how structural constituents can be combined and weighted accordingly, and how sparse models can be derived efficiently for complex structural objects.


Methods and Research Questions: 

Prototype- and exemplar-based machine learning constitutes a very intuitive way to deal with data, since the derived models represent their decisions in terms of relevant data points which can directly be inspected by experts in the field. Due to their efficient and intuitive training, excellent generalization ability, and flexibility in dealing with missing values or streaming data, these methods have been applied successfully in diverse areas such as robotics and bioinformatics. One problem of the techniques when facing modern data sets, however, is their restriction to Euclidean settings; hence they cannot adequately deal with modern, complex, and inherently non-Euclidean data sets. In the project, extensions to general dissimilarity data by means of relational extensions will be considered, and it will be investigated how the beneficial aspects of prototype-based learning, such as interpretability, efficiency, and the capability of dealing with multimodal or very large data sets, can be transferred to this setting.

The project will focus on important representative tools from prototype-/exemplar-based learning, such as GLVQ and SRLVQ in the supervised domain and NG, GTM, and Affinity Propagation in the unsupervised domain. Relational extensions are based on an implicit embedding of general dissimilarity data in pseudo-Euclidean space and a corresponding implicit adaptation of prototypes, which can be computed from the given pairwise dissimilarities alone. A number of problems arise in this context, such as the following: How can powerful prototype-based techniques be transferred to the relational setting? How can implicitly represented 'relational' prototypes be presented to humans in an intuitive way? How can missing values be dealt with, such as missing dissimilarities or even missing parts of structures? Can we infer and adapt the relevance of structural parts for dissimilarity data? How can we deal with large data sets? How can linear or sub-linear methods be derived from the general mathematical framework?
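To illustrate how a prototype can be handled purely through pairwise dissimilarities, here is a minimal sketch in plain Python. It assumes the formulation commonly used in the relational clustering literature: a prototype is represented implicitly as a combination w = sum_i alpha_i x_i with coefficients summing to one, and its squared distance to data point i is (D alpha)_i - 1/2 alpha^T D alpha for a symmetric dissimilarity matrix D. The function names and the toy data are illustrative assumptions, not the project's code. The helper nearest_exemplar shows one simple possibility for presenting an implicit prototype to a user, namely by its closest data point.

```python
def relational_distances(D, alpha):
    """Squared distances from every data point to one implicit prototype.

    D     : symmetric n x n list of lists of pairwise dissimilarities
    alpha : n coefficients summing to 1 (prototype = sum_i alpha[i] * x_i)
    Returns d with d[i] = (D alpha)_i - 0.5 * alpha^T D alpha.
    """
    n = len(D)
    D_alpha = [sum(D[i][l] * alpha[l] for l in range(n)) for i in range(n)]
    self_term = 0.5 * sum(alpha[i] * D_alpha[i] for i in range(n))
    return [D_alpha[i] - self_term for i in range(n)]


def nearest_exemplar(D, alpha):
    """Index of the data point closest to the implicit prototype."""
    d = relational_distances(D, alpha)
    return min(range(len(d)), key=d.__getitem__)


# Toy data (assumption): squared Euclidean distances of the 1-D points 0, 1, 2, 3.
# The points themselves are used only to build D and to check the result.
points = [0.0, 1.0, 2.0, 3.0]
D = [[(a - b) ** 2 for b in points] for a in points]

# Prototype at the mean of the first two points, i.e. at position 0.5.
alpha = [0.5, 0.5, 0.0, 0.0]
distances = relational_distances(D, alpha)
```

For Euclidean toy data such as this, the values in distances coincide with the explicit squared distances to the point 0.5. For general dissimilarities the same expression is evaluated in the implicit pseudo-Euclidean embedding, where individual values may become negative.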

Promising ways to tackle these challenges focus on intrinsic properties of prototype-based techniques, such as their reference to prototypes which can be approximated by explicit exemplars of the data sets. These usually offer an intuitive interface to human observers, as well as a powerful compression technique suitable for overcoming the stability/plasticity dilemma, for example. Various intuitive paradigms, such as relevance learning, some of which have already been investigated in the context of the classical models, will be transferred to the demanding setting of relational data.



In a first step, several classical unsupervised techniques, such as NG and GTM, have been transferred to relational data. These have been compared extensively to alternative classical kernel-based clustering and classification algorithms and achieve comparable performance. First techniques to arrive at efficient linear-time methods have been implemented based on patch processing and the Nyström approximation, respectively. The resulting methods have linear rather than quadratic time complexity and, depending on the characteristics of the data, provide very accurate approximations. Empirical tests in biomedical domains have been conducted to confirm this behavior. Currently, first experiments with supervised techniques show similar and very promising behavior.
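The idea behind the Nyström approximation can be sketched as follows. This is a hedged illustration in plain Python, not the implementation used in the project. It assumes a symmetric dissimilarity function, a hypothetical hand-picked set of m landmark points, and an invertible landmark matrix. Only n * m dissimilarities are evaluated, and the full matrix is approximated as D ≈ D_nm D_mm^{-1} D_mn, so the cost grows linearly in n for fixed m.

```python
def solve(A, B):
    """Solve A X = B for a small square A (Gauss-Jordan, partial pivoting)."""
    m = len(A)
    # Augmented matrix [A | B]; rows are copied so the inputs stay untouched.
    M = [list(A[i]) + list(B[i]) for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("landmark matrix is singular; choose other landmarks")
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(m):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[m:] for row in M]


def nystroem_factors(dissim, n, landmarks):
    """Return (D_nm, W) with W = D_mm^{-1} D_mn, so that D is approximated by D_nm W."""
    D_nm = [[dissim(i, j) for j in landmarks] for i in range(n)]  # n * m evaluations
    D_mm = [D_nm[i] for i in landmarks]                           # m x m landmark block
    # D_mn is the transpose of D_nm; this relies on dissim being symmetric.
    D_mn = [[D_nm[i][k] for i in range(n)] for k in range(len(landmarks))]
    return D_nm, solve(D_mm, D_mn)


def approx_entry(D_nm, W, i, j):
    """Approximate dissimilarity between points i and j from the two factors."""
    return sum(D_nm[i][k] * W[k][j] for k in range(len(W)))


# Toy data (assumption): squared Euclidean distances of six 1-D points.
points = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

def dissim(i, j):
    return (points[i] - points[j]) ** 2

D_nm, W = nystroem_factors(dissim, len(points), landmarks=[0, 2, 5])
```

In this toy case the dissimilarity matrix has rank three, so three suitable landmarks reproduce it up to rounding error. For real data the quality of the approximation depends on how well the landmarks capture the matrix, which matches the remark above that accuracy depends on the characteristics of the data.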