Improving semi-supervised self-training with embedded manifold transduction
Transactions of the Institute of Measurement and Control
Published online: July 27, 2016
Abstract
Semi-supervised learning aims to utilize both labelled and unlabelled data to improve learning performance. This paper presents a distinct way to exploit unlabelled data in traditional semi-supervised learning methods such as self-training. Self-training is a well-known semi-supervised learning algorithm that iteratively trains a classifier by bootstrapping from unlabelled data. Standard self-training merely selects unlabelled examples for training set augmentation according to the current classifier model, which is initially trained only on the labelled data. This can be problematic because the underlying classifier may not be strong enough, especially when the initial labelled data are sparse. Consequently, self-training suffers from classification noise accumulating in the training set. In this paper, we propose a novel self-training style algorithm that exploits a manifold assumption to optimize the self-labelling process. Unlike standard self-training, our algorithm utilizes the labelled and unlabelled data as a whole to label and select unlabelled examples for training set augmentation. Specifically, two measures are employed to minimize the effect of noise introduced into the labelled training set: first, a transductive method based on a controlled graph random walk is incorporated to generate reliable predictions on the unlabelled data; second, a sequential mechanism is adopted to augment the training set incrementally. Empirical results suggest that the proposed method can effectively improve classification performance.
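The following is a minimal sketch of the two-step idea the abstract describes: transductive labelling of unlabelled points by a random walk with restart on a k-nearest-neighbour graph, followed by sequential augmentation of the training set with the most confidently labelled examples. The random-walk-with-restart propagation, the kNN graph construction, the confidence-based selection rule, and all names and parameters (`k`, `alpha`, `n_iter`, `batch`) are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import kneighbors_graph

def transductive_scores(X, y_labelled, labelled_idx, n_classes,
                        k=10, alpha=0.9, n_iter=50):
    """Label propagation by a random walk with restart on a kNN graph
    (an assumed stand-in for the paper's controlled graph random walk).

    Returns an (n_samples, n_classes) score matrix F; rows of unlabelled
    points carry the transductive class predictions.
    """
    W = kneighbors_graph(X, n_neighbors=k, mode='connectivity')
    W = 0.5 * (W + W.T)                        # symmetrise the graph
    d = np.asarray(W.sum(axis=1)).ravel()
    P = W.multiply(1.0 / d[:, None]).tocsr()   # row-stochastic transitions
    Y = np.zeros((X.shape[0], n_classes))
    Y[labelled_idx, y_labelled] = 1.0          # clamp the labelled seeds
    F = Y.copy()
    for _ in range(n_iter):
        # walk one step, then restart to the labelled seeds with prob. 1-alpha
        F = alpha * (P @ F) + (1.0 - alpha) * Y
    return F

def self_train(X, y, labelled_idx, n_classes, n_rounds=10, batch=20):
    """Self-training that labels points transductively on the whole dataset
    instead of with the base classifier, then augments the training set
    sequentially with the highest-confidence predictions."""
    labelled = list(labelled_idx)
    labels = {i: int(y[i]) for i in labelled}  # y is trusted only on labelled_idx
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        F = transductive_scores(X,
                                np.array([labels[i] for i in labelled]),
                                np.array(labelled), n_classes)
        unlab = [i for i in range(X.shape[0]) if i not in labels]
        if not unlab:
            break
        conf = F[unlab].max(axis=1)
        # sequential augmentation: take the most confident examples first
        for i in np.array(unlab)[np.argsort(-conf)[:batch]]:
            labels[int(i)] = int(F[i].argmax())
            labelled.append(int(i))
        clf.fit(X[labelled], [labels[i] for i in labelled])
    return clf
```

In this sketch, clamping the currently labelled points as restart states (the `(1.0 - alpha) * Y` term) is what lets the labelled and unlabelled data act as a whole: predictions on unlabelled points are shaped by the graph structure of the entire dataset rather than by the base classifier alone, which is one plausible reading of how the method limits the noise that plain self-training accumulates.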