What is Online Machine Learning?

We understand the process of answering a sequence of questions given knowledge of the correct answers to previous questions

Posted by : Lokesh Kumar ✪ May 7, 2020 ✪ 8 min read.

What is online learning?
How’s an online learning problem setup?
Measuring performance of Online Algorithms
Motivating uses of Online Learning
Introducing Cover’s Impossibility Result

What is Online Learning?

In conventional machine learning, we have access to a large dataset on which we train ML algorithms. We then aim to answer queries depending on the nature of the problem (classification, regression). The performance is judged based on metrics like accuracy, MSE (mean squared error), etc. All learning of this kind falls into the sub-discipline of machine learning called Offline Learning.

Now consider the following cases:

The problem is so complex that theoretical modelling and applying appropriate optimization techniques to solve it might be infeasible.
The training dataset is so large that batch training is computationally infeasible.
The ML algorithm needs to face dynamic new patterns in the data, such as trend and seasonal variations.
The ML algorithm must adapt to the changing dynamics of data creation and keep itself relevant at all times.

Online Learning, a sub-discipline of machine learning, aims to solve these problems. Online learning agents aim to make a sequence of accurate predictions given knowledge of the correct answers to previous prediction tasks and, optionally, any additional information.

How’s an online learning problem setup?

Online Learning is performed in a sequence of consecutive rounds/timesteps.
At round $t$, the learner is given a question $x_t$ taken from the question set $\mathcal{X}$. The learner provides an answer $p_t$ after processing the input.
After the learner’s prediction, the correct answer $y_t$ from the answer set $\mathcal{Y}$ is revealed.
The learner suffers a loss $l(p_t,y_t)$ which measures the error in prediction.

For example, in the case of online classification (will it rain tomorrow?), the variables take the following form:

$x_t$ is the feature representation of today’s weather, and $\mathcal{X}$ denotes the input feature domain from which inputs are drawn.
$y_t \in \{0,1\}$ denotes whether it will rain tomorrow (0 = no rain, 1 = rain). Note that this is known only tomorrow, and hence all decisions must be taken with the algorithm’s estimate of $y_t$, i.e. $p_t$.
The loss function can be taken to be the 0-1 loss, i.e. $l(p_t,y_t) = \lvert p_t-y_t \rvert$.
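The round-by-round protocol above can be sketched in a few lines of Python. This is only a minimal illustration: the `PersistenceLearner` (it predicts whatever answer was revealed last) and the toy weather stream are assumptions made up for this example, not something the setup prescribes.

```python
from typing import List, Sequence, Tuple


def zero_one_loss(p: int, y: int) -> int:
    """0-1 loss for binary answers: 0 if correct, 1 if wrong."""
    return abs(p - y)


class PersistenceLearner:
    """Hypothetical toy learner: predicts the most recently revealed answer."""

    def __init__(self) -> None:
        self.last_answer = 0  # arbitrary initial guess (assumption)

    def predict(self, x: Sequence[float]) -> int:
        # Ignores x_t on purpose, to keep the sketch tiny.
        return self.last_answer

    def update(self, x: Sequence[float], y: int) -> None:
        self.last_answer = y


def run_online(learner: PersistenceLearner,
               stream: Sequence[Tuple[Sequence[float], int]]) -> List[int]:
    """Play the online protocol and return the loss suffered at each round."""
    losses = []
    for x_t, y_t in stream:
        p_t = learner.predict(x_t)              # learner answers first
        losses.append(zero_one_loss(p_t, y_t))  # then y_t is revealed
        learner.update(x_t, y_t)                # learner may adapt
    return losses


# Made-up stream of (features, did-it-rain) pairs.
stream = [([0.2], 0), ([0.9], 1), ([0.8], 1), ([0.1], 0), ([0.7], 1)]
losses = run_online(PersistenceLearner(), stream)
print(losses, sum(losses))  # [0, 1, 0, 1, 1] 3
```

The important part is the order inside the loop: the prediction $p_t$ is committed before $y_t$ is seen, and the update only ever uses answers from rounds that are already over.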

Measuring performance of Online Algorithms

We look specifically into two metrics:

Mistake bound (valid under realizability assumption)
Regret

Before we take the problem head-on, we need to define some notation to make understanding easier. $\mathcal{H}$ denotes the hypothesis class, a collection of mappings $\{h: \mathcal{X}\longrightarrow \mathcal{Y}\}$ from the domain to the target space. The learner can choose any mapping from $\mathcal{H}$.

Mistake Bound $M_A(\mathcal{H})$

Realizability Assumption: We assume that the $y_t$ generated by the environment come from a mapping $h^* \in \mathcal{H}$, i.e. $y_t = h^*(x_t)$ for all $t$. This means there exists a mapping in the set of allowed mappings $\mathcal{H}$ which exactly matches the true dynamics of the environment (generation of $y_t$).

Under this assumption, the aim of the learning agent is to find $h^*$ in $\mathcal{H}$ as soon as possible. For an online learning algorithm $A$, we denote by $M_A(\mathcal{H})$ the maximum number of mistakes $A$ might commit on a sequence of examples labelled by some $h^* \in \mathcal{H}$. A bound on $M_A(\mathcal{H})$ is called a mistake bound, and we want to design algorithms for which $M_A(\mathcal{H})$ is as small as possible. It’s easy to see that if the problem is not realizable, no finite mistake bound can be guaranteed: the number of mistakes can grow without bound as the sequence gets longer, i.e. $M_A(\mathcal{H}) \rightarrow \infty$.
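To get a feel for a mistake bound, here is a small sketch of one very simple strategy for a finite $\mathcal{H}$, often called the "Consistent" learner: predict with any hypothesis that agrees with everything seen so far, and discard the ones that disagree. This strategy is not discussed in this post; it is used here only as an illustration. The threshold hypotheses `make_threshold(k)` and the choice of $h^*$ are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[int], int]


def make_threshold(k: int) -> Hypothesis:
    """Hypothetical hypothesis h_k(x) = 1 if x >= k else 0."""
    def h(x: int) -> int:
        return 1 if x >= k else 0
    return h


def count_mistakes_consistent(hypotheses: Sequence[Hypothesis],
                              stream: Sequence[Tuple[int, int]]) -> int:
    """Run the 'Consistent' strategy and count its mistakes."""
    version_space: List[Hypothesis] = list(hypotheses)
    mistakes = 0
    for x_t, y_t in stream:
        # Predict with any hypothesis still consistent with the past.
        p_t = version_space[0](x_t)
        if p_t != y_t:
            mistakes += 1
        # Keep only hypotheses that agree with the revealed answer.
        # Under realizability h* always survives, so the list is never empty.
        version_space = [h for h in version_space if h(x_t) == y_t]
    return mistakes


hypotheses = [make_threshold(k) for k in range(9)]  # |H| = 9
h_star = hypotheses[5]                              # assumed true labeller
stream = [(x, h_star(x)) for x in range(8)]         # realizable sequence

mistakes = count_mistakes_consistent(hypotheses, stream)
print(mistakes, "<=", len(hypotheses) - 1)  # 5 <= 8
```

Every mistake removes at least the hypothesis that was used for the prediction, while $h^*$ is never removed, so this strategy makes at most $\lvert\mathcal{H}\rvert - 1$ mistakes on any realizable sequence.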

Regret $R_T(\mathcal{H})$

Now we can relax the realizability assumption: we no longer require the answers ($y_t$) to be generated by some $h^* \in \mathcal{H}$. Instead, we want the algorithm to be competitive with the best fixed predictor from $\mathcal{H}$. The notion of “regret” measures how much extra loss the algorithm has suffered so far for not having followed some fixed predictor $h \in \mathcal{H}$. So, we first define the regret with respect to some $h \in \mathcal{H}$.

The loss accumulated by the algorithm from $t=1$ to $T$ is $\sum_{t=1}^Tl(p_t,y_t)$. The loss accumulated by choosing a fixed $h(\cdot)$ for all timesteps is $\sum_{t=1}^Tl(h(x_t),y_t)$. The regret definition is now straightforward:

$\begin{equation} R_T(h) = \sum_{t=1}^Tl(p_t,y_t) - \sum_{t=1}^Tl(h(x_t),y_t) \end{equation}$

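The regret of $A$ relative to the hypothesis class $\mathcal{H}$ is the largest regret with respect to any $h \in \mathcal{H}$, i.e. the regret with respect to the best fixed predictor in hindsight:

$\begin{equation} R_T(\mathcal{H}) = \max_{h \in \mathcal{H}} R_T(h) \end{equation}$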
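The two regret definitions translate almost directly into code. The sketch below assumes the 0-1 loss and a tiny hypothesis class with two constant predictors; the sequences `ys` and `preds` are made-up numbers for illustration only.

```python
from typing import Callable, Sequence

Hypothesis = Callable[[object], int]


def cumulative_loss(preds: Sequence[int], ys: Sequence[int]) -> int:
    """Sum of 0-1 losses |p_t - y_t| over all rounds."""
    return sum(abs(p - y) for p, y in zip(preds, ys))


def regret_vs(h: Hypothesis, xs: Sequence[object],
              ys: Sequence[int], preds: Sequence[int]) -> int:
    """R_T(h): learner's loss minus the loss of the fixed predictor h."""
    return cumulative_loss(preds, ys) - cumulative_loss([h(x) for x in xs], ys)


def regret(hypotheses: Sequence[Hypothesis], xs: Sequence[object],
           ys: Sequence[int], preds: Sequence[int]) -> int:
    """R_T(H): regret against the best fixed predictor in hindsight."""
    return max(regret_vs(h, xs, ys, preds) for h in hypotheses)


def h0(x: object) -> int:
    return 0  # always answers 0


def h1(x: object) -> int:
    return 1  # always answers 1


xs = [None] * 6             # inputs are irrelevant for constant predictors
ys = [1, 1, 0, 1, 1, 0]     # revealed answers (made up)
preds = [0, 1, 1, 1, 0, 0]  # the learner's predictions (made up)

print(regret_vs(h0, xs, ys, preds))     # -1
print(regret_vs(h1, xs, ys, preds))     # 1
print(regret([h0, h1], xs, ys, preds))  # 1
```

Note that $R_T(h)$ can be negative for a particular $h$ (the learner beat that predictor), while $R_T(\mathcal{H})$ is driven by the predictor that did best on the sequence.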

Motivating uses of Online Learning

Here is a set of problems which can be modelled as online learning problems. We begin with the problem of prediction from expert advice.

Prediction from expert advice

Consider a simple scenario where, on each day, the learning agent must decide whether to engage in a transaction or not. Engaging in a transaction can result in profit or loss, which is decided by factors outside the agent’s control (the environment). The agent has access to a set of $d$ experts who, at the start of the day, each advise the agent whether to engage in the transaction or not.

The agent must use a strategy to combine the advice of the experts and take the appropriate action (engage in the transaction or not) so as to maximize the profit it receives.

Food for Thought: We can approach this problem in a batch learning fashion. Store the advice of the experts (0 = do not engage in the transaction, 1 = engage in the transaction) for $T$ timesteps, constructing the data matrix $X \in \mathbb{R}^{T \times d}$. The agent always engages in the transaction, thereby learning whether the transaction on a particular day is profitable or not, which constructs a 0-1 vector $y \in \mathbb{R}^T$ (0 for loss, 1 for profit). Now we have a supervised offline version of this problem, which can be solved by employing classification algorithms. Is it a good approach?

Online spam filtering

Emails arrive into the system and the classifier must identify whether each new email is spam or not. Note that the learner must adapt dynamically in order to reject spam emails that are adversarially generated to target the users.

Let’s represent each incoming email at time $t$ using a feature vector $x_t \in \mathbb{R}^d$. Assume we have a linear spam filter whose weights are $\theta \in \mathbb{R}^d$, with prediction rule $\hat{y}_t = \mathrm{sgn}(\theta^Tx_t)$ (-1 denotes spam, +1 denotes valid). It’s clear that the predictor must dynamically update itself, so we define a convex loss function $f_t(\theta) = l(\hat{y}_t, y_t)$, and the predictor must update itself (change $\theta$ appropriately) such that it minimizes the total loss up to the given time $t$.
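One possible way to "change $\theta$ appropriately" is a gradient step on a convex surrogate loss after every email. The sketch below assumes the hinge loss $\max(0, 1 - y_t\,\theta^Tx_t)$ and a fixed step size `lr`; both are choices made for this illustration, not something fixed by the problem statement. The two-dimensional features and the tiny email stream are made up as well.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def predict(theta: Sequence[float], x: Sequence[float]) -> int:
    """Linear filter: -1 denotes spam, +1 denotes valid (ties go to +1)."""
    return 1 if dot(theta, x) >= 0 else -1


def hinge_step(theta: Sequence[float], x: Sequence[float],
               y: int, lr: float = 0.5) -> List[float]:
    """One online subgradient step on the hinge loss (assumed surrogate)."""
    if y * dot(theta, x) < 1.0:
        # Loss is active: move theta in the direction y * x.
        return [w + lr * y * xi for w, xi in zip(theta, x)]
    return list(theta)  # zero loss, so no update


# Hypothetical features: [count of "free", count of "meeting"].
emails: List[Tuple[List[float], int]] = [
    ([1.0, 0.0], -1),
    ([0.0, 1.0], +1),
    ([1.0, 0.0], -1),
    ([0.0, 1.0], +1),
]

theta = [0.0, 0.0]
mistakes = 0
for x_t, y_t in emails:
    y_hat = predict(theta, x_t)        # predict before the label is known
    mistakes += int(y_hat != y_t)
    theta = hinge_step(theta, x_t, y_t)  # adapt after y_t is revealed

print(mistakes, theta)  # 1 [-1.0, 1.0]
```

The filter gets the very first email wrong (it knows nothing yet) and then classifies the rest correctly, because each update pushes the weight of the spammy feature down and the weight of the legitimate feature up.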

Recommendation systems

No online learning text is complete without a mention of recommendation systems, which revolutionized the entertainment and e-retail industries. The recommendation system problem can be cast as a matrix completion problem. Let’s consider a user-item matrix $X \in \mathbb{R}^{m\times n}$ where we have $m$ customers and $n$ items (may be songs, movies, etc.).

$\begin{equation} X_{ij} = \begin{cases} 0, & \text{if customer $i$ dislikes item $j$}\\ 1, & \text{if customer $i$ likes item $j$} \end{cases} \end{equation}$

In the online setting, at each iteration the algorithm outputs a preference matrix (its estimate of $X$) $X_t \in \mathcal{K}$, where $\mathcal{K} \subseteq \{0,1\}^{m\times n}$ is a subset of all possible 0-1 $m\times n$ matrices. Then the environment chooses a user-item pair $(i_t, j_t)$ along with the real preference for this pair, $y_t \in \{0,1\}$. Thus the loss experienced by the algorithm is

$\begin{equation} f_t(X) = (X_{i_t,j_t} - y_t)^2 \end{equation}$

We generally have other priors on the preference matrix $X$, such as it being a low-rank matrix, which we can leverage.
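The per-round loss $f_t$ only looks at the single entry that the environment picks. A tiny sketch, assuming for simplicity that the algorithm keeps the same made-up estimate `X_t` in every round (a real algorithm would change it between rounds):

```python
from typing import List, Sequence, Tuple


def round_loss(X_t: Sequence[Sequence[int]], i_t: int, j_t: int, y_t: int) -> int:
    """Squared loss on the one user-item entry revealed in this round."""
    return (X_t[i_t][j_t] - y_t) ** 2


# Hypothetical estimate for m = 2 customers and n = 3 items.
X_t = [
    [1, 0, 1],
    [0, 0, 1],
]

# Rounds chosen by the environment: (i_t, j_t, y_t), all made up.
rounds: List[Tuple[int, int, int]] = [(0, 1, 1), (1, 2, 1), (0, 0, 0)]

losses = [round_loss(X_t, i, j, y) for i, j, y in rounds]
print(losses, sum(losses))  # [1, 0, 1] 2
```

Since both the estimate and the true preference are 0 or 1, the squared loss here is just a mistake indicator for the queried pair.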

Introducing Cover’s Impossibility Result

Consider online binary classification with $\mathcal{Y}=\{0,1\}$ and the 0-1 loss $l(p,y)=\lvert p-y \rvert$. On each round the learner receives $x_t \in \mathcal{X}$ and predicts $p_t \in \{0,1\}$. Then $y_t$ is revealed by the environment and the learner pays a loss $l(p_t,y_t)$.

We will now show that achieving sub-linear regret is not possible in general, even for a very simple hypothesis class. This is Cover’s Impossibility Result.

Finite Hypothesis Class Assumption: $\lvert\mathcal{H}\rvert < \infty$

$\mathcal{H} = \{h_0, h_1\}$ where $h_b(x)=b\ \ \ \forall x, b \in \{0,1\}$ .

$\begin{equation} R_T(\mathcal{H}) = \max_{h \in \mathcal{H}}\left(\sum_{t=1}^T\lvert p_t-y_t\rvert - \sum_{t=1}^T\lvert h(x_t)-y_t\rvert\right) \end{equation}$

Now let’s say the environment waits for $A$ to predict $p_t$ and then explicitly sets $y_t = 1 - p_t$. Well, this is completely possible, and it is bad news. It means that $A$ commits a mistake at every time step from $t=1$ to $T$, so the first term equals $T$. What about the second term?

Consider the resulting sequence $y_1,...,y_T$. The best estimator in $\mathcal{H}$ is $h_b$, where $b$ is the majority label in this sequence. This means the number of mistakes made by $h_b$ is at most $T/2$. Therefore, the regret of any online algorithm is at least $T-T/2=T/2 = \Omega(T)$, which is not sublinear in $T$.
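The argument is easy to replay in code. In the sketch below the environment always reveals the opposite of the learner’s prediction, and we compare the learner with the two constant predictors $h_0$ and $h_1$. The two learners (`majority_so_far` and `always_zero`) are hypothetical examples; the same thing happens to any learner whose prediction the environment can see before choosing $y_t$.

```python
from typing import Callable, List, Sequence, Tuple

Learner = Callable[[Sequence[int]], int]


def majority_so_far(history: Sequence[int]) -> int:
    """Hypothetical learner: predict the majority label seen so far (ties -> 0)."""
    ones = sum(history)
    return 1 if ones > len(history) - ones else 0


def always_zero(history: Sequence[int]) -> int:
    """Hypothetical learner: always predict 0."""
    return 0


def play_against_adversary(learner: Learner, T: int) -> Tuple[List[int], List[int]]:
    """The environment waits for p_t and then reveals the opposite label."""
    preds: List[int] = []
    ys: List[int] = []
    for _ in range(T):
        p_t = learner(ys)  # learner only sees past answers
        y_t = 1 - p_t      # adversarial choice
        preds.append(p_t)
        ys.append(y_t)
    return preds, ys


def regret_vs_constants(preds: Sequence[int], ys: Sequence[int]) -> int:
    """Regret against H = {h_0, h_1} under the 0-1 loss."""
    learner_loss = sum(abs(p - y) for p, y in zip(preds, ys))
    best_constant_loss = min(sum(abs(b - y) for y in ys) for b in (0, 1))
    return learner_loss - best_constant_loss


T = 10
for learner in (majority_so_far, always_zero):
    preds, ys = play_against_adversary(learner, T)
    print(learner.__name__, regret_vs_constants(preds, ys))
# majority_so_far 5
# always_zero 10
```

In both cases the learner is wrong in all $T$ rounds, while the better of the two constant predictors is wrong in at most $T/2$ of them, so the regret never drops below $T/2$.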

Don’t get demotivated: we have a way around this by restricting the environment’s “power”.

Since this is a roadblock and no further analysis can be done, we now impose some restrictions on the adversary/environment so that meaningful analysis can be done.

Realizability of the Adversary
Randomization of Algorithm Prediction

Randomization of Algorithm Prediction: Here we randomize the algorithm’s prediction. The adversary may then be aware of the probability distribution of the algorithm’s outcome, but not of the actual $p_t$.

Next time, we will see some algorithms (randomized weighted majority, Winnow, Perceptron) and their regret bounds. Comment your suggestions below; I would love to hear your feedback!