Relevance learning for temporal neural maps

Acronym: 
RLNM
Research Areas: 
D
Abstract: 

A vast amount of data that engineers and scientists face today is temporal in nature, and its sheer volume means that only a small fraction can ever be inspected manually. Thus, automated tools for the exploration and visualization of temporal data are strongly required. Popular, highly sensitive technologies from chemistry, medical science, and biology such as mass spectrometry lead to extremely high-dimensional and often non-linear time series, but bio-hazard detectors and motion capture systems also generate data of this type.

These time series are, at the same time, often extremely short and multi-dimensional. Current approaches are largely insufficient to fully cope with high-dimensional and extremely short temporal sequences, and there is a need for new data mining tools able to handle such data sets.

The aim of this project is the development of data mining methods for unsupervised and partially supervised scenarios for short, high-dimensional, nonlinear temporal sequences which allow relevance determination, visualization, and inspection of data as they occur in biomedical applications. The models will be based on neural maps and recent extensions thereof, combined with the principle of learning metrics and the recursive processing of temporal data.

This work was supported by the DFG project HA2719/4-1.

Methods and Research Questions: 

Short, high-dimensional time series constitute a common data structure in many measurement systems. Classical time series analysis is not applicable, nor does a direct application of standard machine learning techniques lead to satisfactory results, since important information would be neglected by investigating single events separately. The goal of the project is to offer efficient and reliable inspection techniques which are applicable in these scenarios.


The ever-increasing amount of electronic data in medicine, business, the web, etc. creates a need for efficient automatic data inspection and visualization techniques.
The project will focus on high-dimensional temporal data, where strong regularization techniques are needed. Its goal is to transfer powerful techniques from prototype-based learning, known to be effective for high-dimensional data analysis, to the scenario of temporal dependencies.

Short, high-dimensional time series constitute a common data structure in many applications:

  • therapies or vaccines are accompanied by a series of consecutive measurements, e.g. from blood samples
  • progressive questionnaires constitute a common evaluation tool in psychotherapeutic treatment
  • mass spectra reliably characterize a bacterium and its specific state, e.g. with respect to age, sporulation, or autolysis effects
  • microarray data measure gene expression over time, e.g. in multiple sclerosis studies
  • motion data are captured over time at multiple body angles of insects to analyze locomotion

These time series are short, often containing fewer than 100 time points, because human intervention is needed to obtain the measurements at every time point, data generation is costly, or short measurement frames are desirable when applying a model.
At the same time, the data are high-dimensional, reflecting the variety of highly sensitive sensor technologies available today.
The goal of the project is to offer efficient and reliable inspection techniques which are applicable in these scenarios and which help to gain insight into the respective application domain.

One key issue in unsupervised or semi-supervised data analysis is the identification of the relevant factors of the system. On the one hand, relevance determination allows the underlying metric of prototype-based methods to be shaped according to the needs at hand, such that only the information relevant to the respective application is represented in the model rather than the underlying noise. On the other hand, human inspection, interpretation, and intervention become possible, since the user can focus on the explicit patterns given by the factors that the system indicates as relevant.
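For illustration, a standard way to realize such relevance determination in prototype-based models, which the project builds on but need not adopt in exactly this form, is a diagonally weighted squared Euclidean distance with adaptive relevance factors

    d_\lambda(x, w) = \sum_{j=1}^{n} \lambda_j \, (x_j - w_j)^2, \qquad \lambda_j \ge 0, \quad \sum_{j=1}^{n} \lambda_j = 1,

where the trained values \lambda_j can be read directly as a relevance profile over the data dimensions.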
A major goal of the project is to integrate relevance determination into the models such that the form of the underlying metric and explicit weighting parameters, which directly indicate the relevance of certain factors (data dimensions, temporal differences, etc.), can be determined. Reasonable sources of auxiliary information are a (partial) classification of the data as well as more complex information. The project will integrate relevance learning by means of semi-supervised learning into the investigation of temporal data. Unlike in classical vectorial scenarios, several different aspects of relevance determination are of particular interest in the temporal domain:

  • What are relevant time-invariant factors which allow an identification of the data independent of the point in time (e.g. the identification of bacteria independent of sporulation or autolysis effects)?
  • What are minimal relevant time-dependent factors which allow a differential characterization of the data where single time points lack this information (e.g. characterizing the success of a vaccination)? What are relevant factors for predicting the temporal development of a time series (e.g. determining the success of a therapy)?

The project will investigate methods to determine and visualize these complex forms of relevance profiles within a temporal context by means of relevance learning and metric adaptation in recurrent prototype-based techniques.

 The project will rely on prototype-based methods which will possess the following important properties:

  • Interpretability: Interpretability refers to the fact that the inspection techniques provide a visualization of high-dimensional temporal data entries in 2D and that prototypes reveal insight into typical, representative regions of the data space. This will be enriched by explicit, human-understandable relevance profiles and the possibility to track the typical temporal aspects of the system and to browse within the temporal context.
  • Flexibility: Flexibility refers to the fact that the models are directly suited for different tasks including visualization, clustering, classification, function approximation, extraction of typical temporal sequences, and temporal prediction. For this purpose, various neural-map techniques can be transferred to the given setting, such as prototype labels, U-matrix methods, or an extraction of the dynamics defined by the trained context.
  • Reliability: An essential issue is the ability to judge the reliability of phenomena observed with the unsupervised or semi-supervised models. It is possible to transfer large parts of the theory of neural maps to the models, including concepts such as cost functions, the magnification factor, and topology preservation. Depending on the precise model, these aspects can be enriched by a characterization of the temporal dynamics of the models in classical terms, using e.g. the Chomsky hierarchy for discrete settings.
  • Applicability: The models will be directly applicable in different domains without the need for extensive parameter and model tuning.

The approach will be based on prototype-based learning methods such as the Self-Organizing Map, Learning Vector Quantization, and extensions thereof. These methods are known to be efficient and robust even when few data points are available, and the obtained models are comparatively easy for human experts to interpret. In addition, novel metric adaptation approaches such as relevance and matrix learning are used.
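As a concrete illustration of relevance learning, the following minimal Python sketch shows a GRLVQ-style online update with adaptive per-dimension relevance weights; the function and parameter names are illustrative and are not taken from the project's implementation:

import numpy as np

def grlvq_step(x, y, protos, proto_labels, lam, lr_p=0.05, lr_l=0.005):
    """One GRLVQ-style online update: adapt the closest correct and the closest
    wrong prototype as well as the per-dimension relevance weights lam."""
    d = np.sum(lam * (x - protos) ** 2, axis=1)        # relevance-weighted distances
    correct = proto_labels == y
    j = np.where(correct)[0][np.argmin(d[correct])]    # closest prototype with correct label
    k = np.where(~correct)[0][np.argmin(d[~correct])]  # closest prototype with wrong label
    diff_j, diff_k = x - protos[j], x - protos[k]
    dj, dk, denom = d[j], d[k], (d[j] + d[k]) ** 2
    # gradients of the GLVQ cost term (dj - dk) / (dj + dk)
    protos[j] += lr_p * (4 * dk / denom) * lam * diff_j
    protos[k] -= lr_p * (4 * dj / denom) * lam * diff_k
    lam -= lr_l * ((2 * dk / denom) * diff_j ** 2 - (2 * dj / denom) * diff_k ** 2)
    lam = np.clip(lam, 0.0, None)
    return protos, lam / lam.sum()                     # normalized relevance profile

Matrix learning (as in GMLVQ) generalizes the diagonal weights to a full positive semi-definite matrix Lambda = Omega^T Omega in the distance (x - w)^T Lambda (x - w), which additionally captures pairwise correlations between dimensions.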

Outcomes: 

The main outcome of the DFG-funded RLNM project is a prototype-based learning method for the analysis and modeling of short time series. The approach learns a compact model of the data characteristics, taking the temporal structure of the data into account. The obtained model can subsequently be used for visualization and analysis of, e.g., the temporal trajectories of single time series, clusters thereof, and the relevance of individual input dimensions for the model. The labels of new time series can be predicted by the model.

The figure shows a schema of the developed prototype model for analyzing short time series. It is based on an extension of generative topographic mapping through time (GTM-TT). Our approach permits the identification of the dimensions relevant for the separation of different groups over time.
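As background, standard GTM-TT can be summarized as a hidden Markov model whose hidden states correspond to points u_k on a regular grid in a low-dimensional latent space, mapped into data space by a generalized linear map. The relevance-weighted emission written below is a plausible sketch of how relevance determination enters such a model and need not coincide exactly with the SGTM-TT formulation:

    y_k = W \phi(u_k), \qquad p(x_t \mid k) = \mathcal{N}(x_t \mid y_k, \beta^{-1} I),

together with transition probabilities between the latent states. Introducing adaptive relevances replaces the isotropic term by a weighted distance \sum_j \lambda_j (x_{t,j} - y_{k,j})^2, so that, after supervised training, large \lambda_j mark the dimensions that discriminate between the labeled groups over time.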

The subsequent figure shows a so-called relevance profile of a temporal microarray study. The identified gene expressions (those with high relevances) discriminate between the groups over time and agree favorably with results reported in the literature.

A toolbox implementing the basic concepts of Supervised Generative Topographic Mapping Through Time (SGTM-TT) is available as the SGTM-TT Toolbox.
 


Additional publications under submission:

Odor recognition in robotics applications by discriminative time series modeling,
F.-M. Schleif, B. Hammer, J.G. Monroy, J.G.-Jimenez, J.L. Blanco, M. Biehl, N. Petkov,
IEEE TNNLS, submitted, 2013

Discriminative learning of life-science time series,
F.-M. Schleif, B. Mokbel, L. Theunissen, V. Dürr, B. Hammer,
Bioinformatics, submitted, 2013

 

Publications: