# SWSL

**Stuttgart Workshop on Statistical Learning**

## 9th-11th July 2019

**SWSL** is a mini-workshop in statistical machine learning

at the University of Stuttgart, specifically directed at PhD students and young postdocs.

We aim at organizing fruitful exchange between young scientists and giving them the opportunity to expose their research, while our keynote speakers highlight recent research topics.

This workshop is free to attend but we ask for prior registration by mail to the organizers (see below).

# Speaker

### Alexandra Carpentier

Otto-von-Guericke-University of Magdeburg

alexandra.carpentier@ovgu.de

**Adaptive inference and its relations to sequential decision making**

(based on joint works with Andrea Locatelli, Matthias Loeffler, Olga Klopp and Richard Nickl)

*Adaptive inference - namely adaptive estimation and adaptive confidence statements - is particularly important in high or infinite dimensional models in statistics. Indeed whenever the dimension becomes high or infinite, it is important to adapt to the underlying structure of the problem. While adaptive estimation is often possible, it is often the case that adaptive and honest confidence sets do not exist. This is known as the adaptive inference paradox. And this has consequences in sequential decision making. In this talk, I will present some classical results of adaptive inference and discuss how they impact sequential decision making.*

### Ingo Steinwart

University of Stuttgart

ingo.steinwart@mathematik.uni-stuttgart.de

**A Sober Look at Neural Network Initializations**

*Initializing the weights and the biases is a key part of the training process of a neural network. Unlike the subsequent optimization phase, however, the initialization phase has gained only limited attention in the literature. In this paper we discuss some consequences of commonly used initialization strategies for vanilla DNNs with ReLU activations. Based on these insights we then develop an alternative initialization strategy. Finally, we present some large scale experiments assessing the quality of the new initialization strategy. *

**Gergely Neu**AI group, DTIC

Universitat Pompeu Fabra

gergely.neu@gmail.com

### A unified view of entropy-regularized Markov decision processes

*We propose a general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs). Our approach is based on extending the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization functions. Our key result is showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations. This result enables us to formalize a number of state-of-the-art entropy-regularized reinforcement learning algorithms as approximate variants of Mirror Descent or Dual Averaging, and thus to argue about the convergence properties of these methods. In particular, we show that the exact version of the TRPO algorithm of Schulman et al. (2015) actually converges to the optimal policy, while the entropy-regularized policy gradient methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally, we illustrate empirically the effects of using various regularization techniques on learning performance in a simple reinforcement learning setup. *

## Contributed Talks

**Daniele Calandriello: **Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret*Gaussian processes (GP) are a popular Bayesian approach for the optimization of black-box functions. Despite their effectiveness in simple problems, GP-based algorithms hardly scale to complex high-dimensional functions, as their per-iteration time and space cost is at least quadratic in the number of dimensions *

*d and iterations t. Given a set of A alternative to choose from, the overall runtime *

*O(t^3A) quickly becomes prohibitive. In this paper, we introduce BKB (budgeted kernelized bandit), a novel approximate GP algorithm for optimization under bandit feedback that achieves near-optimal regret (and hence near-optimal convergence rate) with near-constant per-iteration complexity and no assumption on the input space or covariance of the GP. Combining a kernelized linear bandit algorithm (GP-UCB) with randomized matrix sketching technique (i.e., leverage score sampling), we prove that selecting inducing points based on their posterior variance gives an accurate low-rank approximation of the GP, preserving variance estimates and confidence intervals. As a consequence, BKB does not suffer from variance starvation, an important problem faced by many previous sparse GP approximations. Moreover, we show that our procedure selects at most O(d_eff) points, where d_eff is the effective dimension of the explored space, which is typically much smaller than both *

*d and t. This greatly reduces the dimensionality of the problem, thus leading to a *

*O(TAd_eff^2) runtime and O(Ad_eff) space complexity. *

** ****Luigi Carratino: ** Algorithmic Regularization for Fast and Optimal Large-Scale Machine Learning *Kernel methods provide a principled way to perform non linear learning. They enjoy optimal statistical properties and in a least squares setting they reduce to solving a linear system. However, in large scale problems computational requirements make them impractical. In this talk we will see how the algorithmic choices control the generalization properties of the solutions and how specific choices provide robust as well as resource efficient solvers. Ideas from statistics, randomized linear algebra and convex optimization will be used to minimize the computational footprint.* **Florian Dumpert: **Statistical properties of locally learnt SVM-based predictors*The huge amount of available data nowadays is a challenge for kernel-based machine learning algorithms like SVMs with respect to runtime and storage capacities. Local approaches might help to relieve these issues and to improve statistical accuracy. It is possible to show that these local approaches are consistent and robust in a basic sense and with respect to the so-called influence function which expresses the differentiability of the learning method: One can show that there is a differentiable dependency of our locally learned predictor on the underlying distribution. The assumptions of the proven theorems can be verified without knowing anything about this distribution. This makes the results interesting also from an applied point of view.***Simon Fischer: **Sobolev Norm Learning Rates for Regularized Least-Squares Algorithms

*Learning rates for least-squares regression are typically expressed in terms of $L_2$-norms. In this talk we extend these rates to norms stronger than the $L_2$-norm for kernel-based regression schemes without requiring the regression function to be contained in the hypothesis space. In the special case of Sobolev reproducing kernel Hillbert spaces used as hypotheses spaces, these stronger norms coincide with fractional Sobolev norms between the used Sobolev space and $L_2$. As a consequence, not only the target function but also some of its derivatives can be estimated without changing the algorithm. From a technical point of view, we combine the well-known integral operator techniques with an embedding property, which so far has only been used in combination with empirical process arguments. This combination results in new finite sample bounds with respect to the stronger norms. From these finite sample bounds our rates easily follows. Finally, we prove the asymptotic optimality of our results in many cases.*

**Damien Garreau:**Comparison-Based Random Forests

*Assume we are given a set of items from a general metric space, but we neither have access to the representation of the data nor to the distances between data points. Instead, suppose that we can actively choose a triplet of items (A, B, C) and ask an oracle whether item A is closer to item B or to item C. In this talk, I will present a novel random forest algorithm for regression and classification that relies only on such triplet comparisons.*

**Christina Göpfert:**When can unlabeled data improve the learning rate?

*In semi-supervised classification, one is given access both to*

labeled and unlabeled data. As unlabeled data is typically cheaper to

acquire than labeled data, this setup becomes advantageous as soon as

one can exploit the unlabeled data in order to produce a better

classifier than with labeled data alone. However, the conditions under

which such an improvement is possible are not fully understood. In my

talk, I focus on improvements in the minimax learning rate in terms of

the number of labeled examples (with the number of unlabeled examples

being allowed to depend on the number of labeled ones). I will discuss

possible pitfalls in demonstrating such improvements, conditions under

which an improvement of rates is provably impossible, and give examples

of improvements from 1/sqrt{l} to 1/l or exponential rates.

labeled and unlabeled data. As unlabeled data is typically cheaper to

acquire than labeled data, this setup becomes advantageous as soon as

one can exploit the unlabeled data in order to produce a better

classifier than with labeled data alone. However, the conditions under

which such an improvement is possible are not fully understood. In my

talk, I focus on improvements in the minimax learning rate in terms of

the number of labeled examples (with the number of unlabeled examples

being allowed to depend on the number of labeled ones). I will discuss

possible pitfalls in demonstrating such improvements, conditions under

which an improvement of rates is provably impossible, and give examples

of improvements from 1/sqrt{l} to 1/l or exponential rates.

**Anna Korba:**Maximum Mean Discrepancy Gradient Flow

*We construct a Wasserstein gradient flow of the maximum mean discrepancy (MMD) and study its convergence properties.*

*The MMD is an integral probability metric defined for a reproducing kernel Hilbert space (RKHS), and serves as a metric on probability measures for a sufficiently rich RKHS. We obtain conditions for convergence of the gradient flow towards a global optimum, that can be related to particle transport when optimizing neural networks. We also propose a way to regularize this MMD flow, based on an injection of noise in the gradient. This algorithmic fix comes with theoretical and empirical evidence. The practical implementation of the flow is straightforward, since both the MMD and its gradient have simple closed-form expressions, which can be easily estimated with samples.*

**Nicole Mücke: **Global Minima of DNNs: The Plenty Pantry*A common strategy to train deep neural networks (DNNs) is to use very large architectures and to train them until they (almost) achieve zero training error. Empirically observed good generalization performance on test data, even in the presence of lots of label noise, corroborate such a procedure. On the other hand, in statistical learning theory it is known that over-fitting models may lead to poor generalization properties, occurring in e.g. empirical risk minimization (ERM) over too large hypotheses classes. Inspired by this contradictory behavior, so-called interpolation methods have recently received much attention, leading to consistent and optimally learning methods for some local averaging schemes with zero training error. However, there is no theoretical analysis of interpolating ERM-like methods so far. We take a step in this direction by showing that for certain, large hypotheses classes, some interpolating ERMs enjoy very good statistical guarantees while others fail in the worst sense. Moreover, we show that the same phenomenon occurs for DNNs with zero training error and sufficiently large architectures.*

**Abishake:**Regularization schemes for statistical inverse problems

*We study a statistical inverse learning problem, where we observe the noisy image of a quantity through a operator at some random design points. We consider the regularization schemes to reconstruct the estimator of the quantity for the ill-posed inverse problem. We develop a theoretical analysis for the minimizer of the regularization scheme using the ansatz of reproducing kernel Hilbert spaces. We discuss optimal rates of convergence for the proposed scheme, uniformly over classes of admissible solutions, defined through appropriate source conditions.*

**Concentration of weakly dependent Banach-valued sums and applications to statistical learning methods**

Oleksandr Zadorozhnyi:

Oleksandr Zadorozhnyi:

*In this talk I present a Bernstein type inequality for Banach-valued random sums under weak-dependency assumption of general kind on the variables and smothness assumption on the underlying Banach norm. In the succeeding part this concentration inequlity is used in asymptotical regime to derive risk upper bounds for the family of spectral regularization methods for reproducing kernel decision rules, when trained on a sample coming from a τ−mixing process. Talk is based on joint work with Gilles Blanchard.*

### Schedule:

Tuesday, 9th of July 2019:

13.00 - 13.15 Welcome and Organization

13.15 - 14.00 Talk Florian Dumpert

14.00 - 14.45 Talk Luigi Carratino

14.45 - 15.15 Break (30 min)

15.15 - 16.00 Talk Simon Fischer

16.00 - 16.45 Talk Christina Göpfert

18.30 - open end Poster session

Wednesday, 10th of July 2019:

09.00 - 10.00 Keynote Alexandra Carpentier

10.00 - 10.30 Coffee Break

10.30 - 11.30 Keynote Ingo Steinwart

11.30 - 13.00 Break (1 h 30 min)

13.00 - 14.00 Keynote Gergely Neu

14.00 - 14.30 Break (30 min)

14.30 - 15.15 Talk Nicole Mücke

15.15 - 16.00 Talk Damien Garreau

16.00 - 16.15 Coffee Break (30 min)

16.15 - 17.00 Talk Anna Korba

19.00 - ... Guided tour through “Brauhaus Calwer-Eck” with beer tasting

Dinner at “Brauhaus Calwer-Eck”

(Calwer Straße 31, 70173 Stuttgart)

Thursday, 11th of July 2019:

09.00 - 09.15 Organization

09.15 - 10.00 Talk Oleksandr Zadorozhnyi

10.00 - 10.45 Talk Abishake Rastogi

10.45 - 11.00 Break

11.00 - 11.45 Talk Daniele Calandriello

11.45 Farewell

## Venue

University of Stuttgart

Institute for Stochastics and Applications (ISA)

Pfaffenwaldring 57

70569 Stuttgart

**Guide to Campus Vaihingen (Universität):**

***Coming from the city (including central station): **Take the subway from the station "Hauptbahnhof", "Stadtmitte" or "Feuersee" S1 (Herrenberg), S2 or S3 (both Flughafen) and get off at "Universität".***Coming from the airport:** The train stop is located directly beneath the airport. Just follow the signs marked "S". Take the subway S2 (Schorndorf) or S3 (Backnang) and get off at "Universität".

Don't forget to buy a ticket for two zones at one of the orange colored vending machines!

**Room: Tuesday: 8.122 (großer Fakultätssaal)Wednesday: 2.136 (kleiner Fakultätssaal)Thursday: Seminarraum IANS 7.122**

## Organizers

### Nicole Mücke

nicole.muecke@mathematik.uni-stuttgart.de

### Florian Dumpert

Florian.Dumpert@uni-bayreuth.de

supported by: *Nachwuchsförderung der DMV-Fachgruppe Stochastik*