When I was applying for a Ph.D. abroad, I wrote a research proposal (RP). Since the topic was not my master's research direction, the proposal is fairly introductory. I am posting it here for reference; it may be helpful to some readers.
Why I Want to Pursue a Ph.D.
About Myself
Background
My current research direction is multimodal language generation. The goal is to integrate the multimodal information acquired by a robot into a complete, well-formed sentence. So far I have designed a basic framework for this research by drawing on methods from image captioning and machine translation. I plan to pursue graduate studies towards a Ph.D. degree at your school of computer science from the fall of 20XX. In the future, I am determined to devote myself to research on object tracking.
Research Motivation
Object tracking is the process of locating an object of interest in a sequence of images so as to reconstruct the object's trajectory over time. Given a bounding box defining the object of interest in a single frame, the goal of tracking is to automatically determine the object's bounding box in every frame that follows, or to indicate that the object is not visible.
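For concreteness, the input/output contract of a single-object tracker can be sketched as below. This is only an illustration of the task definition; the `SingleObjectTracker` class and its method names are hypothetical, not taken from any particular library.

```python
from typing import Optional, Tuple
import numpy as np

# (x, y, width, height) of the object in image coordinates
BoundingBox = Tuple[int, int, int, int]

class SingleObjectTracker:
    """Illustrative interface only: names and signatures are hypothetical."""

    def init(self, frame: np.ndarray, box: BoundingBox) -> None:
        """Remember the appearance of the object defined by `box` in the first frame."""
        raise NotImplementedError

    def update(self, frame: np.ndarray) -> Optional[BoundingBox]:
        """Return the object's bounding box in `frame`, or None if it is not visible."""
        raise NotImplementedError
```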
As a mid-level task in computer vision, object tracking underpins high-level tasks such as pose estimation, action recognition, and behavior analysis. It has numerous practical applications, including visual surveillance, human-computer interaction, and virtual reality. Although object tracking has been studied for several decades, it remains challenging due to factors such as abrupt appearance changes and severe occlusions. Beyond these practical applications, which appeal to me deeply, I am curious about how these open issues can be addressed.
Object Tracking
Introduction
There are two main forms of object tracking: single-object tracking (SOT) and multi-object tracking (MOT). Single-object tracking primarily focuses on designing sophisticated appearance and/or motion models to handle challenging factors such as object deformation, occlusion, illumination changes, motion blur, and background clutter. Multi-object tracking additionally requires solving two further tasks: determining the number of objects, which typically varies over time, and maintaining their identities. Apart from the challenges common to both SOT and MOT, key issues that further complicate MOT include, among others: 1) frequent occlusions, 2) initialization and termination of tracks, 3) similar appearance among targets, and 4) interactions among multiple objects. A simplified sketch of this track-management step is given below.
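The following minimal sketch illustrates what "determining the number of objects and maintaining their identities" means in practice: detections in each frame are associated with existing tracks, unmatched detections start new tracks, and unmatched tracks are terminated. The greedy IoU matching used here is a deliberately simple assumption, not the method of any cited tracker.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

BoundingBox = Tuple[int, int, int, int]  # (x, y, width, height)

@dataclass
class Track:
    track_id: int                                    # identity maintained over time
    boxes: List[BoundingBox] = field(default_factory=list)  # one box per frame the track is alive
    active: bool = True

def iou(a: BoundingBox, b: BoundingBox) -> float:
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / float(aw * ah + bw * bh - inter)

def mot_step(tracks: Dict[int, Track], detections: List[BoundingBox],
             next_id: int, iou_threshold: float = 0.3) -> int:
    """One MOT frame: greedily match detections to active tracks; unmatched
    detections start new tracks, unmatched tracks are terminated."""
    unmatched = list(detections)
    for track in tracks.values():
        if not track.active:
            continue
        last = track.boxes[-1]
        best = max(unmatched, key=lambda d: iou(last, d), default=None)
        if best is not None and iou(last, best) >= iou_threshold:
            track.boxes.append(best)      # identity maintained
            unmatched.remove(best)
        else:
            track.active = False          # track termination
    for det in unmatched:                 # track initialization
        tracks[next_id] = Track(track_id=next_id, boxes=[det])
        next_id += 1
    return next_id
```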
To deal with these issues, a wide range of solutions has been proposed over the past decades. In general, most tracking algorithms can be categorized into two classes according to their representation schemes: generative and discriminative models. Generative models typically learn an appearance model and use it to search for the image region with minimal reconstruction error as the tracking result. Typical generative algorithms are sparse-representation methods, which represent the object with a set of target and trivial templates in order to handle partial occlusion, illumination change, and pose variation. Discriminative models pose object tracking as a detection problem in which a classifier is learned to separate the target object from its surrounding background within a local region. Unlike generative methods, discriminative approaches use both target and background information to find a decision boundary that differentiates the target from the background; this idea is employed in tracking-by-detection methods, where a discriminative classifier is trained online using sample patches of the target and the surrounding background.
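The contrast between the two schemes can be made concrete with how each one scores a candidate image region. This is a minimal sketch under simplifying assumptions: the generative score uses a plain least-squares fit (sparse-representation trackers use an L1-regularized fit instead), and the discriminative score is a linear classifier; the function names are illustrative.

```python
import numpy as np

def generative_score(candidate: np.ndarray, templates: np.ndarray) -> float:
    """Generative: score a candidate feature vector (shape (d,)) by the negative
    reconstruction error under a set of templates (columns of a (d, k) matrix)."""
    coeffs, *_ = np.linalg.lstsq(templates, candidate, rcond=None)
    residual = candidate - templates @ coeffs
    return -float(np.linalg.norm(residual))

def discriminative_score(candidate: np.ndarray, w: np.ndarray, b: float) -> float:
    """Discriminative: score a candidate by its signed distance from the decision
    boundary separating target from surrounding background."""
    return float(w @ candidate + b)

# In both schemes, tracking picks the candidate region with the highest score:
#     best = max(candidates, key=lambda c: score(feature(c)))
```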
Related Work
A class of tracking techniques called "tracking-by-detection" emerged after Andriluka et al. combined the advantages of detection and tracking in a single framework. These methods train a discriminative classifier in an online manner to separate the object from the background. The classifier bootstraps itself by using the current tracker state to extract positive and negative examples from the current frame. However, slight inaccuracies in the tracker can lead to incorrectly labeled training examples, which degrade the classifier and can cause drift. Babenko et al. show that using Multiple Instance Learning (MIL) instead of traditional supervised learning avoids these problems and can lead to a more robust tracker with fewer parameter tweaks.
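The self-training loop described above can be sketched as follows. This is a hedged illustration of the general scheme, not any cited tracker: `sample_patches` and `features` are hypothetical helpers supplied by the caller, and the classifier is a generic online linear model.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Online linear classifier updated frame by frame (default hinge loss).
clf = SGDClassifier()

def update_classifier(frame, estimated_box, sample_patches, features):
    """One bootstrap step of tracking-by-detection.

    `sample_patches(frame, box, radius, n)` and `features(patch)` are hypothetical
    helpers: the first crops n patches around `box`, the second turns a patch into
    a feature vector.
    """
    positives = sample_patches(frame, estimated_box, radius=4, n=8)    # near the current estimate
    negatives = sample_patches(frame, estimated_box, radius=30, n=32)  # surrounding background
    X = np.array([features(p) for p in positives + negatives])
    y = np.array([1] * len(positives) + [0] * len(negatives))
    # If the estimated box has drifted, some "positive" labels are wrong, which is
    # exactly the label-noise problem that Multiple Instance Learning sidesteps by
    # labeling bags of patches rather than individual patches.
    clf.partial_fit(X, y, classes=[0, 1])
```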
The particle filter (PF) realizes recursive Bayesian estimation based on the Monte Carlo method, using a set of random particles to discretely represent the posterior probability density function (PDF) of the object state. Particle filters perform very well on non-linear, non-Gaussian dynamic state-estimation problems and are widely used in object tracking. Since the introduction of the particle filter, several types of appearance models have been proposed for this framework, including color, contour, edge, and saliency models. However, the particle filter itself is computationally expensive because each particle must be processed separately; complex appearance models can dramatically increase the overall execution time, rendering the framework impractical for real-time applications. In addition, the particle filter, as a generative algorithm, performs worse in some complex visual scenarios than discriminative algorithms such as correlation filters and deep learning.
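A minimal bootstrap particle-filter step looks roughly as follows. The random-walk motion model and the externally supplied `observation_likelihood` (e.g. a color-histogram similarity) are placeholder assumptions chosen for brevity.

```python
import numpy as np

def particle_filter_step(particles, weights, observation_likelihood,
                         motion_std=5.0, rng=None):
    """One predict-update-resample step of a bootstrap particle filter.

    `particles` has shape (N, state_dim); `observation_likelihood(p)` returns the
    likelihood of the current frame given state p (appearance model of choice).
    """
    rng = rng or np.random.default_rng()
    # 1. Predict: propagate each particle through a simple random-walk motion model.
    particles = particles + rng.normal(scale=motion_std, size=particles.shape)
    # 2. Update: weight each particle by the observation likelihood. Each particle
    #    is evaluated separately, which is why the cost grows with the particle count.
    weights = np.array([observation_likelihood(p) for p in particles])
    weights = weights / weights.sum()
    # 3. Estimate: posterior mean of the object state.
    estimate = (weights[:, None] * particles).sum(axis=0)
    # 4. Resample: draw particles in proportion to their weights to avoid degeneracy.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles)), estimate
```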
Traditional correlation filters such as ASEF and UMACE are trained offline and used for object detection or target identification; however, this training regime is poorly suited to tracking. Object tracking requires robust filters that can be trained from a single frame and dynamically adapted as the appearance of the target changes. Bolme et al. introduce a regularized variant of ASEF named Minimum Output Sum of Squared Error (MOSSE) that is suitable for visual tracking; a tracker based on MOSSE filters is robust and effective. Because correlation filters can be interpreted as linear classifiers, a natural question is whether they can take advantage of the kernel trick to operate in richer non-linear feature spaces. Several researchers have investigated this question: Henriques et al. derive the Kernelized Correlation Filter (KCF), and Patnaik et al. propose kernel SDF filters.
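The core of a MOSSE-style filter can be written in a few lines: solve for the filter (in the Fourier domain) that maps the target patch to a Gaussian-shaped response, accumulate it with a running average, and locate the target as the response peak. This sketch omits the preprocessing (log transform, cosine window) and affine perturbations used by Bolme et al.; variable names and the learning rate are assumptions.

```python
import numpy as np

def gaussian_response(shape, sigma=2.0):
    """Desired correlation output: a 2-D Gaussian peaked at the patch center (FFT'd)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - w // 2) ** 2 + (ys - h // 2) ** 2) / (2 * sigma ** 2))
    return np.fft.fft2(g)

def mosse_update(patch, A=None, B=None, eta=0.125, eps=1e-5):
    """Accumulate numerator A and denominator B of a MOSSE-style filter from one
    grayscale training patch; H* = A / B is the filter in the Fourier domain."""
    F = np.fft.fft2(patch)
    G = gaussian_response(patch.shape)
    A_new = G * np.conj(F)
    B_new = F * np.conj(F) + eps
    if A is None:                         # first frame: train from a single patch
        return A_new, B_new
    return eta * A_new + (1 - eta) * A, eta * B_new + (1 - eta) * B  # online adaptation

def mosse_respond(patch, A, B):
    """Correlate a search patch with the current filter; the response peak gives
    the new target location (row, col)."""
    H_conj = A / B
    response = np.real(np.fft.ifft2(H_conj * np.fft.fft2(patch)))
    return np.unravel_index(np.argmax(response), response.shape)
```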
In the past few years, deep learning architectures have been used successfully to produce very promising results on complicated tasks such as image classification and speech recognition. The key to this success is the use of deep architectures to learn rich invariant features via multiple non-linear transformations. Wang et al. argue that visual tracking can benefit from deep learning for the same reasons and propose a novel deep learning tracker (DLT) for robust visual tracking. DLT uses a stacked denoising autoencoder (SDAE) to learn generic image features from a large auxiliary image dataset and then transfers the learned features to the online tracking task. They later bring the biologically inspired convolutional neural network (CNN) framework to visual tracking to address the challenge of limited labeled training data. Subsequently, Nam and Han propose a novel CNN architecture, referred to as the Multi-Domain Network (MDNet), which learns a shared representation of targets from multiple annotated video sequences, where each video is regarded as a separate domain. In addition, Milan et al. present an approach based on recurrent neural networks (RNNs) to address the challenging problems of data association and trajectory estimation, and show that an RNN-based approach can learn complex motion models in realistic environments.
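The multi-domain idea behind MDNet can be sketched in PyTorch as a shared backbone plus one binary (target vs. background) head per training video. The layer sizes below are illustrative placeholders, not the original MDNet architecture.

```python
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Convolutional layers shared across all training videos (domains)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, x):
        return self.features(x)

class MultiDomainTracker(nn.Module):
    """MDNet-style sketch: one shared representation, one target-vs-background
    head per training video; sizes are illustrative, not the actual MDNet."""
    def __init__(self, num_domains: int):
        super().__init__()
        self.backbone = SharedBackbone()
        self.heads = nn.ModuleList([nn.Linear(128, 2) for _ in range(num_domains)])

    def forward(self, patches: torch.Tensor, domain: int) -> torch.Tensor:
        return self.heads[domain](self.backbone(patches))

# Offline training: each mini-batch comes from a single video, so only that
# video's head is updated together with the shared backbone. At test time a new
# head is attached and fine-tuned online for the unseen sequence.
```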
Tracking System
A tracking system generally consists of four basic components:
Future Work
Some research work can be carried out in the future, such as:
Bibliography
[1] Bolme D S, Beveridge J R, Draper B A, et al. Visual object tracking using adaptive correlation filters[C]. Computer Vision and Pattern Recognition, 2010: 2544-2550.
[2] Babenko B, Yang M, Belongie S J, et al. Robust Object Tracking with Online Multiple Instance Learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(8): 1619-1632.
[3] Yilmaz A, Javed O, Shah M, et al. Object tracking: A survey[J]. ACM Computing Surveys, 2006, 38(4).
[4] Andriluka M, Roth S, Schiele B, et al. People-tracking-by-detection and people-detection-by-tracking[C]. Computer Vision and Pattern Recognition, 2008: 1-8.
[5] Kalal Z, Mikolajczyk K, Matas J, et al. Tracking-Learning-Detection[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(7): 1409-1422.
[6] Truong M T, Pak M, Kim S, et al. Single object tracking using particle filter framework and saliency-based weighted color histogram[J]. Multimedia Tools and Applications, 2018, 77(22): 30067-30088.
[7] Zhou T, Ouyang Y, Wang R, et al. Particle filter based on real-time Compressive Tracking[C]. International Conference on Audio, Language and Image Processing, 2016: 754-759.
[8] Wang N, Yeung D Y. Learning a Deep Compact Image Representation for Visual Tracking[C]. Neural Information Processing Systems, 2013: 809-817.
[9] Choi J, Chang H J, Yun S, et al. Attentional Correlation Filter Network for Adaptive Visual Tracking[C]. Computer Vision and Pattern Recognition, 2017: 4828-4837.
[10] Milan A, Rezatofighi S H, Dick A R, et al. Online Multi-Target Tracking Using Recurrent Neural Networks[C]. National Conference on Artificial Intelligence, 2016: 4225-4232.
[11] Wang N, Shi J, Yeung D Y, et al. Understanding and Diagnosing Visual Tracking Systems[C]. International Conference on Computer Vision, 2015: 3101-3109.
[12] Wang N, Li S, Gupta A, et al. Transferring Rich Feature Hierarchies for Robust Visual Tracking[J]. arXiv preprint, 2015.
[13] Wei J, Hongjuan L, Wei S, et al. A new particle filter object tracking algorithm based on dynamic transition model[C]. International Conference on Information and Automation, 2016: 1832-1835.
[14] Huang L, Ma B, Shen J, et al. Visual Tracking by Sampling in Part Space[J]. IEEE Transactions on Image Processing, 2017, 26(12): 5800-5810.
[15] Zhang K, Zhang L, Liu Q, et al. Fast Visual Tracking via Dense Spatio-Temporal Context Learning[C]. European Conference on Computer Vision, 2014: 127-141.
[16] Nam H, Han B. Learning Multi-domain Convolutional Neural Networks for Visual Tracking[C]. Computer Vision and Pattern Recognition, 2016: 4293-4302.
[17] Henriques J F, Caseiro R, Martins P, et al. High-Speed Tracking with Kernelized Correlation Filters[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(3): 583-596.
[18] Patnaik R, Casasent D. Fast FFT-based distortion-invariant kernel filters for general object recognition[C]. Proceedings of SPIE, 2009, 7252.
[19] Henriques J F, Caseiro R, Martins P, et al. Exploiting the circulant structure of tracking-by-detection with kernels[C]. European Conference on Computer Vision, 2012: 702-715.