Gaussian Processes in Machine Learning
Dmitry P. Vetrov
Dorodnicyn Computing Centre of the Russian Academy of Sciences
Our research group
My colleague: Dmitry Kropotov, PhD student of MSU
Students: Nikita Ptashko, Oleg Vasiliev, Pavel Tolpegin, Igor Tolstov, Oleg Kurchin
Overview
Kernel selection problem
Introduction to random processes
GP regression
Covariance function selection
Classification tasks
Open problems
Kernel selection
Kernel methods are a state-of-the-art methodology for data-mining tasks. The best known of them are:
SVM (Support Vector Machines)
Logistic regression
RVM (Relevance Vector Machines)
Kernel Fisher discriminant
RBF neural networks
The performance of all such methods depends significantly on the choice of a proper kernel. The problem of kernel selection is one of the most intriguing problems in machine learning. To tell the truth, nobody knows how to select good kernels…
Random processes I
A random process is a set of random values indexed by a one-dimensional parameter (or a multi-dimensional one in the case of random fields): x(t, ω), where t is the index variable and ω is the random component.
Random processes II
If we fix the index we get a single random value; if we fix the random component we get a single function of the index variable. Random processes thus have a dual nature: they may be treated as random functions and possess both functional and probabilistic features.
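In the notation x(t, ω) introduced above, this duality reads:

```latex
% Duality of a random process x(t, \omega):
% fixing the index gives a random variable, fixing the outcome gives a sample path.
\[
  x(t_0, \cdot)\ \text{is a random variable}, \qquad
  x(\cdot, \omega_0)\ \text{is an ordinary function of } t \ \text{(a realization).}
\]
```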
Random processes III
Main characteristics of random processes (defined below):
Mean function
Covariance function
The covariance function is symmetric and non-negative definite. Looks like a kernel function in SVM, doesn't it?
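For a process x(t), these characteristics are:

```latex
% Mean and covariance function of a random process x(t)
\[
  m(t) = \mathbb{E}\,[\,x(t)\,], \qquad
  K(t, s) = \mathbb{E}\,\bigl[(x(t) - m(t))\,(x(s) - m(s))\bigr].
\]
% K is symmetric, K(t,s) = K(s,t), and non-negative definite:
% \sum_{i,j} c_i c_j K(t_i, t_j) \ge 0 for any points t_1,...,t_n and reals c_1,...,c_n.
```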
Stationary processes
Stationary processes have constant characteristics which do not depend on the concrete value of the index variable: the mean is constant and the covariance function depends only on the difference t − s. In this case we may assume that the process has zero mean. Most of random process theory is developed for stationary processes.
Gaussian processes
A Gaussian process (GP) is a process all of whose finite-dimensional marginal distributions are Gaussian. A GP is uniquely defined by its mean and covariance functions. Without loss of generality we may assume that the GP is centered at zero.
Examples of GPs
Examples of GPs with RBF covariance functions of different width (figure).
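A minimal NumPy sketch of drawing such samples; the two width values here are only illustrative:

```python
import numpy as np

def rbf_cov(t, s, width):
    """RBF (squared exponential) covariance function between two 1-D point sets."""
    return np.exp(-(t[:, None] - s[None, :]) ** 2 / (2.0 * width ** 2))

t = np.linspace(0.0, 10.0, 200)
rng = np.random.default_rng(0)

for width in (0.3, 2.0):  # illustrative "narrow" and "wide" kernels
    K = rbf_cov(t, t, width) + 1e-8 * np.eye(len(t))  # jitter for numerical stability
    # Draw 3 sample paths of a zero-mean GP with this covariance
    samples = rng.multivariate_normal(np.zeros(len(t)), K, size=3)
    print(f"width={width}: drew samples of shape {samples.shape}")
```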
Observation of GPs
What can we tell about the value of a GP at a point t? In the absence of any additional information, only its prior marginal distribution x(t) ~ N(0, K(t, t)). But what if we knew the values of the GP at some points t_1, …, t_n?
Conditional distribution of GP
We may use the known values of the GP as regression inputs and predict the most likely value of x(t). We may even find its conditional marginal distribution.
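For a zero-mean GP observed at points t_1, …, t_n with values y = (y_1, …, y_n), conditioning gives a Gaussian distribution at a new point t:

```latex
% Conditional (posterior) distribution of a zero-mean GP at a new point t,
% given observed values y at points t_1, ..., t_n.
\[
  x(t) \mid y \;\sim\; \mathcal{N}\!\left( \mathbf{k}^{\top} K^{-1} y,\;
        K(t,t) - \mathbf{k}^{\top} K^{-1} \mathbf{k} \right),
\]
% where K_{ij} = K(t_i, t_j) and k_i = K(t, t_i).
```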
Parameters of prediction
Note that the prediction expression is explicit and does not require any optimization. We may rewrite the prediction as a weighted sum of covariance functions centered at the training points (see the sketch below).
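A minimal NumPy sketch of this explicit prediction, assuming an RBF covariance and noise-free observations; the training data here are made up for illustration:

```python
import numpy as np

def rbf_cov(a, b, width=1.0):
    """RBF covariance between two sets of 1-D points."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * width ** 2))

# Hypothetical training data
t_train = np.array([0.0, 1.0, 2.5, 4.0])
y_train = np.sin(t_train)
t_test = np.linspace(0.0, 5.0, 100)

K = rbf_cov(t_train, t_train) + 1e-8 * np.eye(len(t_train))  # jitter
k_star = rbf_cov(t_test, t_train)

# Explicit prediction: a linear solve, no iterative optimization.
weights = np.linalg.solve(K, y_train)   # w = K^{-1} y
mean = k_star @ weights                 # sum_i w_i K(t, t_i)
var = rbf_cov(t_test, t_test).diagonal() \
      - np.sum(k_star * np.linalg.solve(K, k_star.T).T, axis=1)
```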
Covariance function selection
In the absence of any prior preferences we may use the traditional maximum likelihood principle for adjusting a proper covariance function: e.g. if the covariance function is parameterized by θ, then the parameters can be selected as shown below.
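For a zero-mean GP with parameterized covariance K_θ evaluated at the n training points, this amounts to:

```latex
% Maximum (marginal) likelihood selection of covariance-function parameters \theta
\[
  \theta^{*} = \arg\max_{\theta}\; \log p(y \mid \theta)
             = \arg\max_{\theta}\;
               \Bigl( -\tfrac{1}{2}\, y^{\top} K_{\theta}^{-1} y
                      - \tfrac{1}{2}\, \log \det K_{\theta}
                      - \tfrac{n}{2}\, \log 2\pi \Bigr),
\]
% where (K_\theta)_{ij} = K_\theta(t_i, t_j).
```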
Illustration of covariance function selection (figure).
Connection to Bayesian inference
Assume we are given noisy measurements y of the process f to be predicted; its true values are f(t_i). The larger the difference between y_i and f(t_i), the less likely the observations are. We want to select a covariance function for which the mass of likely process realizations is largest. On one hand this expression can be treated as the evidence of the model, which is a popular means for model selection among Bayesians. On the other hand, it can be shown that it is exactly the likelihood for the covariance function which corresponds to the GP f.
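Written out, this quantity is:

```latex
% Model evidence = marginal likelihood of the observations,
% obtained by integrating the GP prior over realizations f.
\[
  p(y \mid \theta) \;=\; \int p(y \mid f)\; p(f \mid \theta)\, \mathrm{d}f ,
\]
% which for Gaussian observation noise coincides with the kernel (marginal)
% likelihood maximized on the previous slide.
```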
Classification problem I
It seems that GPs can't be applied to classification tasks because the outputs y should be discrete, either +1 or -1. Even if we decide to switch to the value of the discriminant function rather than to its sign, it is still unclear how to train the GP, as we do not know the values of the discriminant function even for the training sample – only its sign.
Classification problem II
Classification problem III
There are no explicit equations for GP classifiers' predictions. We may now calculate the values of the GP at the points of the training sample… …and with their use estimate the kernel likelihood.
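A common way to formalize this is the latent-GP formulation with a sigmoid link (assumed here to match the setup described above):

```latex
% Latent-GP classification: a GP f is squashed through a sigmoid;
% the predictive probability requires a non-analytic integral.
\[
  p(y = +1 \mid t) \;=\; \sigma\bigl(f(t)\bigr), \qquad f \sim \mathcal{GP},
\]
\[
  p(y_{*} = +1 \mid t_{*}, \text{data}) \;=\;
  \int \sigma(f_{*})\, p(f_{*} \mid \text{data})\, \mathrm{d}f_{*},
\]
% which has no closed form and needs approximation (e.g. Laplace or EP).
```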
Interesting observation
Polynomial RVM, RBF stationary GP, logistic regression, RBF RVM, and other kernel approaches… all fit into the family of non-stationary GPs with rich parameterized covariance functions.
Overfitting
Maximization of the kernel likelihood (or model evidence, in the alternative Bayesian notation) does not fit the right answers directly but operates in more abstract terms such as the adequacy of the GP to the observed data. If the data are noisy, then the best GP will be close to a white-noise process. It may seem that such an approach avoids overfitting, but…
Underfitting I
If there are many parameters in the covariance function to be adjusted, their choice via maximum kernel likelihood leads to underfitting, when significant regularities in the data are treated as noise. Example: the RVM, which is a non-stationary GP whose covariance function and parameters (sketched below) are to be adjusted.
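The standard RVM-as-a-GP view (the exact notation on the slide may differ) gives the covariance function:

```latex
% RVM viewed as a GP: weights w_i ~ N(0, 1/alpha_i) on basis functions phi_i
% induce a non-stationary covariance.
\[
  K(x, x') \;=\; \sum_{i=1}^{m} \frac{1}{\alpha_i}\, \phi_i(x)\, \phi_i(x'),
\]
% with the relevance parameters alpha_1, ..., alpha_m (and the observation noise
% variance sigma^2) adjusted by maximizing the kernel likelihood.
```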
Underfitting II Relevance Vector Regression
Thank you!