# What is alpha in Vapnik's statistical learning theory?

I’m currently studying Vapnik’s theory of statistical learning. I rely on Vapnik (1995) and some secondary literature that is more accessible to me. Vapnik defines a learning machine as ‘an object’ capable of implementing a set of functions $$f(x, alpha), alpha in Lambda$$. This term appears in all following equations, eg the risk functional $$R(alpha)$$ is written as a function of $$alpha$$.

I’m having trouble understanding what is $$alpha$$ in practice and how does it relate to the VC dimension $$h$$. Suppose for example that I fit a simple regression tree on my data. What are the ‘learning machine’ and $$f(x, alpha)$$ in this context? Can I interpret $$alpha$$ as the paramaters (eg split variables, cutpoints etc.) and hyperparameters of my decision tree?

Cross Validated Asked on November 21, 2021

$$alpha$$ is the parameter or vector of parameters, including all so-called "hyperparameters," of a set of functions $$V$$, and has nothing to do with the VC dimension.

## Long Answer: What is $$alpha$$?

Statistical learning is the process of choosing an appropriate function (called a model) from a given class of possible functions. Given a set of functions $$V$$ (the class of possible models under consideration), it is often convenient to work with a parametrization of $$V$$ instead. This means choosing a parameter set $$Lambda$$ and a function $$g$$ called a parametrization where $$g : Lambda to V$$ is a surjective function, meaning that every function $$f in V$$ has at least one parameter $$alpha in Lambda$$ that maps to it. We call the elements $$alpha$$ of the parameter space $$Lambda$$ parameters, which can be numbers, vectors, or really any object at all. You can think of each $$alpha$$ as being a representative for one of the functions $$f in V$$. With a parametrization, we can write the set $$V$$ as $$V = { f(x, alpha) }_{alpha in Lambda}$$ (but this is bad notation, see footnote*).

Technically, it's not necessary to parametrize $$V$$, just convenient. We could use the set $$V$$ directly for statistical learning. For example, I could take

$$V = { log(x), x^3, sin (x), e^x, 1/x , sqrt{x} },$$

and we could define the risk functional $$R : V to mathbb{R}$$ in the standard way as the expected loss

$$R(f) = int L(y, f(x)) dF(x, y) = E[L(y, f(x))]$$

for some loss function $$L$$, a popular choice being $$L(y, x) = | y - f(x) |_2$$, and where $$F$$ is the joint cdf of the data $$(x, y)$$. The goal is then to choose the best model $$f^*$$, which is the one that minimizes the risk functional, i.e.

$$f^* = text{argmin}_{f in V} R(f) .$$

To make this easier to work with, Vapnik instead considers parametrizing the set $$V$$ with a parameter set $$Lambda$$ and a parametrization $$g : Lambda to V$$. With this, you can write every function $$f in V$$ as $$f = g(alpha)$$ for some parameter $$alpha in Lambda$$. This means that we can reinterpret the risk minimization problem as

$$alpha^* = text{argmin}_{alpha in Lambda} R(g(alpha)) quad text{ and } quad f^* = g(alpha^*) .$$

What Vapnik calls the risk functional is actually the function $$R circ g : Lambda to mathbb{R}$$ in the notation I've used, and if $$Lambda$$ is a set of numbers or vectors of numbers, then this has the advantage of being a function as opposed to a functional. This makes analysis much easier. For example, in the calculus of variations the trick of replacing a functional with a function is used to prove necessary conditions for minimizing a functional by converting a statement about a functional $$J$$ to a statement about a function $$Phi$$, which can then be analyzed by using standard calculus (see link for details).

In addition to being easier to analyze, it's also quite convenient to use a parametrization when the functions in $$V$$ are all of a similar form, such as the set of power functions $$V = { x, x^2, x^3, x^4, dots } = { x^alpha }_{alpha in mathbb{N}}$$ or the set of linear functions $$V = { mx + b }_{(m, b) in mathbb{R}^2} .$$

## $$alpha$$ in Practice: A Simple Example

To use your example, let's start with a very simple regression tree to model some data with one real-valued feature $$x in mathbb{R}$$ and a real-valued target $$y in mathbb{R}$$. Let's also assume for simplicity that we're only considering left-continuous decision trees with a depth of 1. This defines our function class $$V$$ implicitly as

$$V = { text{all functions which can be written as a left-continuous regression tree of depth 1} }$$

which is not a very mathematically convenient formulation. It would be much easier to work with this if we notice that the depth $$d$$ being exactly 1 means that there is one split point, which means that we can parametrize $$V$$ using the parametrization $$g : mathbb{R}^3 to V$$ defined by

$$g(alpha_1, alpha_2, alpha_3) = begin{cases} alpha_1 , & text{ if } x le alpha_3 \ alpha_2 , & text{ if } x > alpha_3 \ end{cases},$$ where $$alpha_3$$ is the split point, and $$alpha_1$$ and $$alpha_2$$ are the values of the function on the intervals $$(-infty, alpha_3]$$ and $$(alpha_3, infty)$$. Notice that in general parametrizations are not unique. For instance, there was nothing special about the order of these three parameters: I could rearrange them to get a different parametrization, or I could even use the parametrization

$$h(alpha_1, alpha_2, alpha_3) = begin{cases} alpha_1^5 - 2 alpha_1 + 5 , & text{ if } x le 1000alpha_3 \ tan(alpha_2) , & text{ if } x > 1000alpha_3 \ end{cases}.$$ What's important is that every $$f in V$$ can be represented by some parameter $$alpha = (alpha_1, alpha_2, alpha_3) in mathbb{R}^3$$, which is possible whether we use the parametrization $$g$$ or $$h$$.

## $$alpha$$ in Practice: A More Complicated Example

Now, let's say we want to use a more complicated model. Let's use a regression tree to model data with two real-valued features $$(x_1, x_2) in mathbb{R}^2$$ and a real-valued target $$y in mathbb{R}$$, and with decision trees with a maximum depth of 2. Parametrizing $$V$$ this time is much more complicated, because regression trees depend on the shape of the tree, which variable is being split at each node, and the actual value of the split point. Every full binary tree of depth $$d le 2$$ is one of five possible shapes, shown below: In addition, for each leaf on the tree, we have to specify a real number parameter, and for each branch vertex we have to specify which of the two features we're splitting on and what the value of the split point is. One way you could construct the parametrization would be to use a discrete variable to parametrize the possible tree shapes, another discrete variable for each node to parametrize whether $$x_1$$ or $$x_2$$ is being split, and then real-valued parameters for the actual values of the function on each piece of the domain. Once again, there are many ways of parametrizing this set, but here is one: Let $$Lambda = { 1, 2, 3, 4, 5 } times { 1, 2 }^3 times mathbb{R}^7$$ For a parameter $$alpha in Lambda$$, e.g. $$alpha = (4, (2, 1, 1), (0.18, 0.3, -0.5, 10000, 538, 10, pi))$$, the first coordinate determines the shape of the tree, as listed in order above; the second coordinate has three coordinates that determine which of the two features is split on at each branch node (note that the middle one is "unused" for shape 4, which is not an issue because parametrizations don't have to be injective functions); the third coordinate has seven coordinates, each of which is a real value corresponding to a node in the graph that

1. for leaves, determines the value of the regression tree on the corresponding piece of the domain,
2. for branch vertices, determines the split value,
3. and for unused vertices, is unused.

I've shown the graph corresponding to this parameter below: ## Relation to VC Dimension

$$alpha$$ has nothing to do with the VC dimension, because each $$alpha in Lambda$$ is a representative of one function $$f in V$$, and the VC dimension is a characteristic of the entire set of functions $$V$$. You could ask whether the parametrization $$g : Lambda to V$$ has something to do with the VC dimension. In fact, this might even be intuitive, because the VC dimension measures the "capacity" of the set of functions $$V$$. Often, the "number of parameters" is used as a proxy for the "capacity" as well. However, this intuitive concept does not formalize well. In fact, the example $$V = { sin(theta x) }_{theta in mathbb{R}}$$ has infinite VC dimension despite having only one parameter, so the notion of low "number of parameters" corresponding to low "capacity" does not hold. In fact, the "number of parameters" is not well defined in the first place, since parametrizations are not unique and can have different numbers of parameters (the minimum of which is almost always 1 because of space-filling curves).

## The Learning Machine

The learning machine is not simply the set $$V$$, however, but a process for estimating the data generating process that produces the training data $${ (x, y) }_{i = 1}^n$$. This might mean picking a function set $$V$$ in advance, and minimizing the empirical risk $$R_text{emp} (f) = sum_{i = 1}^n L(y_i, f(x_i))$$ over the set $$V$$, or in parametric form, minimizing $$R_text{emp} (g(alpha)) = sum_{i = 1}^n L(y_i, g(alpha)(x_i))$$ over the set $$Lambda$$. Note that $$g(alpha)$$ is itself a function, which $$x_i$$ is being plugged into in the above expression. This is why the notation $$g_alpha$$ is slightly better than $$g(alpha)$$, so we don't have to write awkward expressions like $$g(alpha)(x_i)$$.

The learning machine can also be much more complicated. For instance, it also includes any regularization being used. Limiting the set $$V$$ is one type of regularization used to avoid over-fitting, but of course there are other types as well.

## Footnote

* We should really write functions as $$f$$ not as $$f(x)$$, which is technically not a function but an element of the range of the function, so we could write $$V = { f(alpha) }_{alpha in Lambda}$$, or better yet $$V = { f_alpha }_{alpha in Lambda}$$ to avoid confusing the arguments of the function with the parameter that indicates which function we're talking about.

Answered by Eric Perkerson on November 21, 2021

## Related Questions

### What statistical analysis to used for kinetic data with multiple groups?

1  Asked on August 5, 2020 by carlos-valenzuela

### In R, why do the p-values from anova() change when you add more predictors?

0  Asked on August 4, 2020 by m-smith

### Random forest after cross validation

1  Asked on August 1, 2020 by steven-niggebrugge

### Grey relation between two datasets?

0  Asked on July 31, 2020 by msilvy

### General procedures for combined feature selection, model tuning, and model selection?

1  Asked on July 31, 2020 by uared1776

### Classification model not working for a large dataset

1  Asked on July 30, 2020 by gabriel-ullmann

### Sigma algebra generated by random variable on a set with generators

0  Asked on July 28, 2020 by gabriel

### What is the seasonal trend lowess model in time series?

0  Asked on July 28, 2020 by christopher-u

### Non seasonal and seasonal parameters of this time-series

0  Asked on July 27, 2020 by statsmonkey

### Extended Cox model and cox.zph

2  Asked on July 25, 2020 by finance