Dealing with imbalanced and weakly labeled data in machine learning using fuzzy and rough set methods

Duration: 2014 - 2018
Funding organization: Ghent University - Special Research Fund
Summary: Over the past decades, machine learning (ML) has made headway in the development of accurate, robust and efficient algorithms for data classification and prediction, and has been successfully applied in various domains. However, many current applications of real-life data analysis present characteristics for which the classical branch of algorithms performs sub-optimally.

In this project, I focus on learning from imbalanced and weakly labeled data. Imbalanced data are data in which one or more decision classes are underrepresented, as happens in domains such as medical applications and microarray data analysis. On two-class data sets, classical ML algorithms typically obtain high accuracy on the majority class but low accuracy on the minority class.

Weakly labeled data arise because obtaining precise decision labels for all data samples is often costly or difficult, so learning algorithms need to deal with partially or implicitly labeled data. In many fields, ranging from bioinformatics to web mining and image processing, obtaining completely labeled data is tedious and sometimes prohibitively expensive. Therefore, many scientists focus on exploiting weakly labeled training data, which are cheaper and easier to generate, to help improve performance and discover the underlying structure of the data.

Currently, two important areas of research can be discerned: semi-supervised learning (SSL) and multi-instance learning (MIL). In SSL, labels are known for only a minority of the training samples, and the unlabeled instances are used to improve generalization by modifying or reprioritizing the learning hypothesis obtained from the labeled samples alone. In MIL, a data sample has multiple forms of representation or, alternatively, consists of multiple parts or represents multiple samples from a stochastic process. As such, a sample is described not by a single feature vector but by a bag of feature vectors, called instances. The class labels of the instances are not known, only those of the bags. The goal is to classify new bags based on the descriptions of their instances.
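To make the imbalance problem described above concrete, the following minimal Python sketch computes overall accuracy and per-class accuracy for a trivial classifier that always predicts the majority class. The data (95 majority and 5 minority samples), the labels 0 and 1, and the helper name `per_class_accuracy` are hypothetical and chosen purely for illustration; they are not taken from the project.

```python
def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of samples of that class predicted correctly}."""
    result = {}
    for label in set(y_true):
        indices = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(1 for i in indices if y_pred[i] == label)
        result[label] = correct / len(indices)
    return result


# Hypothetical two-class data set: 95 majority samples (0), 5 minority samples (1).
y_true = [0] * 95 + [1] * 5

# A trivial classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

# Overall accuracy looks good, yet no minority sample is recognized.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
class_accuracy = per_class_accuracy(y_true, y_pred)

print("overall accuracy:", accuracy)
print("per-class accuracy:", class_accuracy)
```

The overall accuracy is high only because the majority class dominates the count, which is why per-class measures are usually reported for imbalanced problems.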

We propose to tackle these problems using fuzzy set and rough set approaches. Fuzzy sets generalize classical sets by considering partial membership degrees of objects to a set or relation, while rough sets characterize a set of objects by means of a lower and an upper approximation, taking into account the indiscernibility between objects. Together, they provide an attractive framework for modeling and processing gradual and incomplete information, and their hybridization into fuzzy rough sets (FRS) has proven its worth in data analysis.
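As a rough illustration of the lower and upper approximations mentioned above, the sketch below computes fuzzy rough memberships for a tiny one-dimensional data set. It uses one common formulation (lower approximation as the minimum over all samples of an implicator applied to similarity and class membership, upper approximation as the maximum of a t-norm) with the Kleene-Dienes implicator and the minimum t-norm. These operator choices, the similarity function, and the toy data are assumptions made for illustration, not necessarily the ones used in the project.

```python
def similarity(x, y, value_range=1.0):
    """Fuzzy similarity in [0, 1]: 1 for identical values, 0 at maximal distance.

    Assumed form: 1 - |x - y| / value_range, clipped at 0.
    """
    return max(0.0, 1.0 - abs(x - y) / value_range)


def lower_approximation(x, samples, membership):
    """Degree to which x certainly belongs to the class.

    Uses the Kleene-Dienes implicator I(a, b) = max(1 - a, b) (an assumption).
    """
    return min(max(1.0 - similarity(x, y), a) for y, a in zip(samples, membership))


def upper_approximation(x, samples, membership):
    """Degree to which x possibly belongs to the class.

    Uses the minimum t-norm T(a, b) = min(a, b) (an assumption).
    """
    return max(min(similarity(x, y), a) for y, a in zip(samples, membership))


# Hypothetical data: four samples on [0, 1]; the first two belong to the class.
samples = [0.0, 0.1, 0.9, 1.0]
membership = [1.0, 1.0, 0.0, 0.0]

lower = [lower_approximation(x, samples, membership) for x in samples]
upper = [upper_approximation(x, samples, membership) for x in samples]

for x, lo, up in zip(samples, lower, upper):
    print(f"x={x}: lower={lo:.2f}, upper={up:.2f}")
```

Because every sample is fully similar to itself, the lower approximation never exceeds the class membership and the upper approximation never falls below it. Samples far from the other class (such as the first one here) get a lower approximation close to 1.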
There is supplementary material available for some research papers.