In today’s post, we will explain a certain algorithm for matrix factorization models for recommender systems which goes by the name *Alternating Least Squares* (there are others, for example based on stochastic gradient descent). We will go through the basic ALS algorithm, as well as how one can modify it to incorporate user and item biases.

We will also go through the ALS algorithm for implicit feedback, and then explain how to modify *that* to incorporate user and item biases. The basic ALS model and the version from implicit feedback are discussed in many places (both online and in freely available research papers), but we aren’t aware of any good source for *implicit ALS with biases*… hence, this post.

Recommendation engines are becoming ubiquitous in the modern consumer landscape. From Netflix’s personalized movie recommendations to Amazon’s raft of suggested items (New For Your, Recommendations for You in the Kindle Store, etc…), to Spotify’s wildly popular personalized Discover Weekly Playlist, recommendation engines have become an essential tool for promoting content discovery and extending consumer engagement.

There has been a proliferation of algorithms in the data science community upon which recommender systems are built, but they all follow the basic central dogma that similar users like similar items. The hard part, of course, is deciding exactly what “similar” means and how to measure it. For Netflix, similar users are users who rate movies similarly. For Spotify, similar users are users who listen to the same kinds of music (a little thought will convince you that these two ideas, while similar, are not really the same).

The simplest type of recommender system, called a *collaborative filter*, is essentially a literal interpretation of “similar users like similar things” Goldberg1992. Once a metric for similarity is decided on, one just looks at the top rated items of the kk nearest neighbors (with respect to the similarity metric), filters out those items which the user of interest has already seen/consumed, and *voila*: personalized recommendations.

Collaborative filters have the advantages of being relatively easy to implement and relatively easy to interpret (e.g. we recommended movie A to user 1 because users 2, 3, and 4 all rated it highly and have similar ratings profiles to user 1). This being the Age of Data Science, though, it is no longer enough to have a reasonable explanation. We can actually measure how effectively a recommender system predicts the ratings of users on unseen items. Thus began Netflix’s famous quest for better recommender systems, leading to the famous million dollar prize and eventual solution Bell2008 (by a team of experts working for several years).

Although the recommender system that won the Netflix prize was never put into production (it was a carefully tuned mix of many statistical models, and the marginal gain obtained by implementing all of the models was not worth the technical price compared to simply implementing a few of the best ones), the data science community, and the wider commercial world, learned without a doubt that, while collaborative filters are explainable and easily implementable, one can do significantly better if one is willing to flex a bit of mathematical muscle.

There were essentially two types of recommender systems in the final solution to the Netflix prize, and they have become the bread and butter of the current state of the art: matrix factorization models and restricted Bolzmann machines (RBMs). We will discuss matrix factorization models in this post.

We will be interested in two refinements of the basic matrix factorization model for recommendations: using implicit feedback, and using user and item biases.

It was realized early on, even for collaborative filters, that recommender systems work a lot better if one accounts for user and item biases.

Suppose, for example, that we are trying to predict movie ratings and

*Bob*is grouchy and rates items with an average of 2 stars.*Alice*is chipper and cheery and rates things with an average of 4 stars.

Clearly 3 stars from *Bob* is very different than 3 stars from *Alice*. If we keep track of the average movie rating for each user (the *user bias*), and just try to predict the difference from that average, the recommender system works *much* better.

Similarly, suppose

*Snakes on a Plane*gets an average rating (over all users) of 1 star.*The Shawshank Redemption*gets an average rating of 5 stars.

Then a 3 star rating is very different for *Snakes on a Plane* than for *The Shawshank Redemption*. If we keep track of the average user rating for each movie (the *item bias*), and just try to predict the difference from that average, the recommender system also works *much* better.

Obviously, one should keep track of both user and item biases. (We will explain the technical details below.)

One of the challenges of recommender systems in the wider commercial world is that one rarely has explicit ratings data (Netflix strongly encourages users to rate movies and leverage the tech built around those ratings).

Instead, user-item interactions like clicks, purchases, song listens (or fast-forwards), etc, are used as a proxy to indicate a preference or distaste for a particular item (possibly very weakly). If a user listens to a song only once, maybe it indicates that they didn’t like the song. However, maybe they were busy or weren’t paying attention and actually would like to hear the song again. If a user listens to a song 100 times, it is a safe bet that they like the song. But it is hard to draw a line between unknown preference and known preference.

This kind of indirect information about user-item preferences is known as *implicit feedback*, and many smart people have thought long and hard about how to deal with it. We won’t delve much into the modeling of implicit feedback in today’s post. It turns out to be relatively effective to model a user’s preference on a scale of 0 (bad) to 1 (good) instead of modeling the user’s raw number of interactions (or whatever number is actually being measured).

We can interpret the number of user-item interations (song listens, for example) as a measure of our confidence in our model’s prediction for the user’s preference of the item.

Below, we’ll step through the details of how a matrix factorization model can be used to deal with implicit feedback. The basic algorithm we discuss is from the seminal paper Hu2008.

The most basic matrix factorization model for recommender systems models the rating r^r^ a user `u`

would give to an item `i`

byr^ui=xTuyi,r^ui=xuTyi,

where xTu=(x1u,x2u,…,xNu)xuT=(xu1,xu2,…,xuN) is a vector associated to the user, and yTi=(y1i,y2i,…,yNi)yiT=(yi1,yi2,…,yiN) is a vector associated to the item. The dimension of the vectors is the *rank* of the model, and the components are called *factors*.

We can collect the user-item ratings into a matrix Rˆ=(r^ui)R^=(r^ui). First collect the user vectors into a matrixXT=⎛⎝⎜⎜⎜⋮xu1⋮⋮xu2⋮⋯⋯⋯⋮xunusers⋮⎞⎠⎟⎟⎟,XT=(⋮⋮⋯⋮xu1xu2⋯xunusers⋮⋮⋯⋮),

and the item vectors into a matrixYT=⎛⎝⎜⎜⎜⋮yi1⋮⋮yi2⋮⋯⋯⋯⋮yinitems⋮⎞⎠⎟⎟⎟,YT=(⋮⋮⋯⋮yi1yi2⋯yinitems⋮⋮⋯⋮),

then we can express the above model asRˆ=XYT.R^=XYT.

Of course, there is no reason that the “true” user-item matrix R=(rui)R=(rui) should have rank NN, and indeed, we won’t even know the “true” user-item matrix. This model assumes that it can be approximated by a rank NN factorization:R∼Rˆ=XYT.R∼R^=XYT.

In other words, we want to find an nusers×Nnusers×N matrix XX and an nitems×Nnitems×N matrix YY such that Rˆ:=XYTR^:=XYT approximates RR.

** Remark** We follow the matrix conventions of Hu2008, and machine learning in general. Although the user- and item-factor vectors xuxu and yiyi (and vectors in general) are regarded as column vectors, they are arranged as rows into the composite matrices XX and YY.

One popular method for finding the matrices XX and YY given partial information about RR is known as *alternating least squares*. The idea is to find the parameters xjuxuj and yjiyij (the entries of the matrices XX and YY) which minimize the L2L2 cost functionC=∑u,i∈observed ratings(rui−xTuyi)2+λ(∑u∥xu∥2+∑i∥yi∥2).C=∑u,i∈observed ratings(rui−xuTyi)2+λ(∑u‖xu‖2+∑i‖yi‖2).

The constant λλ is called the regularization parameter and essentially penalizes the components of the matrices XX and YY if they get too large (in magnitude). This is important for numerical stability (and some kind of regularization is almost always used). It has an even more important effect in this setting, though:

**Fundamental Observation**: If we hold the item vectors YY fixed, CC is a *quadratic* function of the components of XX. Similarly, if we hold the user vectors XX fixed, CC is a quadratic function of the components of YY.

So in order to minimize CC, one could try the following:

- Hold the user vectors fixed and
*solve*the quadratic equation for the yjiyij’s. This will (probably) not be the global minimum of CC since we haven’t touched half of the variables (the xjuxuj’s), but we have at least decreased CC. - Hold the item vectors fixed and
*solve*the quadratic equation for the xjuxuj’s. - Repeat.

Some remarks are in order.

- Since CC is a convex function, and each step of the above algorithm is a minimization, the process must converge at some point.
- There are other methods for minimizing such a convex cost function, for example, gradient descent (or stochastic gradient descent, or batch SGD, etc…). One important difference here is that at each step of our algorithm, we find the
*exact*minimum; we don’t take small steps in a downward direction. So if we do step 1), a second application of step 1) will have no effect: we’re already at the absolute minimum of CC with YY held fixed. It also means that a single iteration of the above algorithm generally moves much further than an iteration of a gradient descent algorithm; i.e. we need fewer iterations for convergence.

The above algorithm is called (for obvious reasons) *alternating least squares*.

A bit of linear algebra yields the following algorithm (see here, for example):

- Initialize the user vectors XX somehow (e.g. randomly).
- For each item
`i`

, let riri be the vector of ratings of that item (it will have`n_users`

components; one for each user). Computeyi=(XTX+λI)−1XTriyi=(XTX+λI)−1XTrifor each item`i`

. (II is the N×NN×N identity matrix.) - For each user
`u`

, let ruru be the vector of ratings of that user (it will have`n_items`

components; one for each item). Computexu=(YTY+λI)−1YTruxu=(YTY+λI)−1YTrufor each user`u`

. - Repeat 2), 3) until desired level of convergence is achieved.

** Remark.** It is almost always faster to solve the linear system (XTX+λI)yi=XTri(XTX+λI)yi=XTri instead of inverting the matrix (XTX+λI)(XTX+λI) as written in the algorithm (similarly for the user vectors).

As we mentioned above, it turns out that most recommender systems perform better if user and item biases are taken into account.

Suppose we have a ratings system that allows each user to rate each item on a scale of 1 to 5 stars. Suppose we have two users: `Alice`

, who rates items with an average of `4`

stars, and `Bob`

, whose average rating is `1.5`

stars. If `Bob`

rates some new item with `3`

stars, it means something *very* different than if `Alice`

rates the same item with `3`

stars (`Bob`

*really* liked the new item, `Alice`

didn’t). The difference is what we call *user bias*.

For an explicit-ratings model (as discussed above, when we have explicit ratings for user-item pairs), one way to account for user bias is to model the actual user-item ratings asrui∼βu+r^ui,rui∼βu+r^ui,

where βuβu is the *user bias* of user `u`

and r^uir^ui is the model of whatever is left over; for example, r^ui=xTuyir^ui=xuTyi if we use the matrix factorization model discussed above. We can account for item bias in a similar way:rui∼βu+γi+r^ui;rui∼βu+γi+r^ui;

here, γiγi is the *item bias* of item `i`

.

** Remark**. If instead of matrix factorization we used a traditional collaborative filter, we would define r^uir^ui via kk-nearest neighbors using some similarity metric (Pearson is a standard choice), βuβu is the average rating of user

`u`

over all items, and γiγi is the average rating of item `i`

over all users.User and item biases can be directly incorporated into the ALS algorithm. We model the user-item ratings matrix asrui∼βu+γi+xTuyirui∼βu+γi+xuTyi

and minimize the cost functionCbiased=∑u,i∈observed ratings(rui−xTuyi)2+λ(∑u(∥xu∥2+β2u)+∑i(∥yi∥2+γ2i)).Cbiased=∑u,i∈observed ratings(rui−xuTyi)2+λ(∑u(‖xu‖2+βu2)+∑i(‖yi‖2+γi2)).

Again, because of the regularization, we can hold the user variables fixed and solve for the minimum in the item variables. Then we can hold the item variables fixed and solve for the minimum in the user variables.

The biases βuβu and γiγi appear in CbiasedCbiased without any coefficients, whereas all the other parameters appear at least once with some coefficient. So one approach would be to solve for the biases separately from the other parameters in each step.

It is easier, though, to leverage the work we’ve already done. We can rewrite our cost function at each step so that it looks like an unbiased model, and then just use the same formulas we found above for the unbiased ALS. The trick is to define new vectors that include the biases as components in the right way.

Let ββ be the vector of user biases (with `n_users`

components) and γγ the vector of item biases (with `n_items`

components). Here is the algorithm:

- Initialize user vectors randomly and set all biases to zero (or initialize them randomly, it doesn’t matter very much).
- For each item
`i`

, define three new vectors :rβi:=ri−βriβ:=ri−βwith components rβui:=rui−βuruiβ:=rui−βu (notice that both vectors have`n_users`

components),x~Tu:=(1,xTu),x~uT:=(1,xuT),andy~Ti:=(γi,yTi).y~iT:=(γi,yiT).Then Cbiased=∑(rβui−x˜Tuy˜i)2+λ(∑u∥x˜u∥2+∑i∥y˜i∥2)Cbiased=∑(ruiβ−x~uTy~i)2+λ(∑u‖x~u‖2+∑i‖y~i‖2). Hence, we find that the item bias and vector can be computed asy˜i:=(γiyi)=(X˜TX˜+λI)−1X˜Trβi.y~i:=(γiyi)=(X~TX~+λI)−1X~Triβ.(II is now the (N+1)×(N+1)(N+1)×(N+1) identity matrix, and X~X~ and Y~Y~ are matrices whose columns are the vectors x~ux~u and y~iy~i as usual.) - Now, for each user
`u`

, define three new vectors:rγu:=ru−γ,ruγ:=ru−γ,y˜Ti:=(1,yi),y~iT:=(1,yi),andx˜Tu:=(βu,xu).x~uT:=(βu,xu).The user bias and vector can be computed asx˜u:=(βuxu)=(Y˜TY˜+λI)−1Y˜Trγu.x~u:=(βuxu)=(Y~TY~+λI)−1Y~Truγ. - Repeat 2), 3) until convergence.

For many real-world user-item interactions there are no explicit rating data available. However, there is often nontrivial information about the interactions, e.g. clicks, listens/watches, purchases, etc. Such indirect “ratings” information about user-item interactions is known as **implicit feedback**. Modeling implicit feedback is a difficult but important problem. There are several ways to use the ALS matrix factorization to approach such a model. We present here a standard solution, presented (without bias corrections) in Hu2008.

The basic approach is to forget about modeling the implicit feedback directly. Rather, we want to understand whether user `u`

has a preference or not for item `i`

using a simple boolean variable which we denote by pui.pui. The number of clicks, listens, views, etc, will be interpreted as our confidence in our model.

Following the fundamental idea of matrix factorization models, we try to find a user vector xuxu for each user `u`

and an item vector yiyi for each item `i`

so thatpui∼xTuyi.pui∼xuTyi.It is important to note that we never actually observe puipui! (This is a very different situation than the explicit feedack models, as discussed above, where ruirui is the observer rating.)

Let’s assume the our implicit feedback is a positive integer (number of clicks, number of views, number of listens, etc). That is,rui=# of times user u interacted with item i.rui=# of times user u interacted with item i.

How do we go about finding the vectors xuxu and yiyi given some implicit feedback {rui}{rui}? If a user has interacted with an item, we have reason to believe that pui=1pui=1. The more they have interacted with that item, the stronger our belief in their preference.

To define our model, setpui={10if rui>0if rui=0.pui={1if rui>00if rui=0.We try to minimize the following cost function:Cimplicit:=∑u,i∈observed interactionscui(pui−xTuyi)2+λ(∑u∥xu∥2+∑i∥yi∥2),Cimplicit:=∑u,i∈observed interactionscui(pui−xuTyi)2+λ(∑u‖xu‖2+∑i‖yi‖2),where cuicui is our **confidence** in pui.pui. That is, the more a user has interacted with an item, the more we penalize our model for incorrectly predicting pui.pui.

If a user has never interacted with an item, it is possible that pui=1pui=1 and the user has just never encountered the item. It is also possible they are actively avoiding the item. To deal with this ambiguity, it is common to define the confidence bycui:=1+αrui,cui:=1+αrui,where αα is a parameter of the model that needs to be tuned to the dataset. There is some empirical evidence that setting it to the sparsity ratio (the ratio of nonzero entries to zero entries) works well; missing entries are often interpreted as slightly negative, and so one could interpret αα as balancing positive and negative interactions. See Johnson2014, for example. Another possibility that works well (and makes the model more robust against power users — those users who interact with items orders of magnitude more often than the average user), which is also mentioned in Hu2008, is to setcui=1+αlog(1+rui/ϵ),cui=1+αlog(1+rui/ϵ),where ϵϵ is yet another data-dependent parameter of the model.

The basic idea of the ALS algorithm for matrix factorization applies to minimizing CimplicitCimplicit: if we hold the user (resp. item) vectors fixed, then CimplicitCimplicit depends quadratically on the item (resp. user) vectors. That means we can do the same thing: hold the user vectors fixed and solve for the minimum in the item variables, then hold the item vectors fixed and solve for the minimum in the user variables, and repeat…

Of course, our cost function CimplicitCimplicit is slightly different, so the actual steps involving solving for the minimum look a bit different. The algorithm is as follows.

- Initialize user vectors.
- For each item
`i`

, let pipi be the vector whose components are puipui (for fixed`i`

, there will be`n_users`

components), let CiCi be the diagonal matrix with cuicui for fixed uu along the diagonal, and let di=Cipidi=Cipi (the reason for defining didi separately will be explained in the remarks about the`numpy`

implementation below). Computeyi=(XTCiX+λI)−1XTdi.yi=(XTCiX+λI)−1XTdi. - For each user
`u`

, let pupu be the vector whose components are puipui (for fixed`u`

; there will be`n_items`

components), let cucu be the vector whose components are cuicui, let CuCu be the diagonal matrix with cucu along the diagonal, and let du=Cupudu=Cupu. Computexu=(YTCuY+λI)−1YTdu.xu=(YTCuY+λI)−1YTdu. - Repeat 2), 3) until convergence.

*Remarks*

- As for standard (explicit) matrix factorization models, it is (almost) always better to solve (XTCiX+λI)yi=XTdi(XTCiX+λI)yi=XTdi than to invert the matrix (XTCiX+λI)(XTCiX+λI).
- Since cui=1+αruicui=1+αrui, which we can write in terms of matrices C:=(cui)C:=(cui) and R:=(rui)R:=(rui) as C=1+αRC=1+αR, the computation in step two can be rewritten asyi=(XTX+λI+αXTRiX)−1XTdi.yi=(XTX+λI+αXTRiX)−1XTdi.The term XTX+λIXTX+λI is independent of
`i`

, and can be computed once and reused at each step of the algorithm (similarly for YTY+λIYTY+λI in step 3).

`numpy`

Implementation of ALS for Implicit FeedbackIt turns out that the matrix multiplication XTCiXXTCiX and the matrix-vector product XTdiXTdi are both easier to implement in `numpy`

than to express in standard linear algebra notation. A direct implementation of the algorithm is of course possible, but a small bit of thought improves performance vastly.

`numpy`

Implementation NotesA direct implementation of the matrix product XTCiXXTCiX would be something like

Ci = np.diag(c[i]) XTCiX = np.dot(X, np.dot(Ci, np.dot(X.T)).

But the matrix product `np.dot(X, Ci)`

is just multiplying the rows of XX, in order, by the diagonal elements of CiCi, which is achieved very efficiently by simple `numpy`

broadcasting. So we can achieve the same matrix product via

XTCiX = np.dot(c[i] * X.T, X).

Moreover, since di=pi+αridi=pi+αri (which follows from puirui=ruipuirui=rui), we can rewrite the algorithm in terms of just the scaled raw counts αRαR and the preference pp asyi=(XTX+λI+αXTRiX)−1XT(pi+αri).yi=(XTX+λI+αXTRiX)−1XT(pi+αri).

Hence, we can completely avoid constructing the diagonal matrices CiCi and CuCu.

Finally, when `n_users`

and `n_items`

are large, RR (and hence pp) will generally be sparse matrices. The matrix (XTX+λI+αXTRiX)(XTX+λI+αXTRiX) is, however, N×NN×N and since the number of factors is generally small (on the order of 10), there is no great savings by using a sparse implementation of this matrix. On the other hand, only the user vectors corresponding to nonzero entries of riri (and pipi) are needed in the computation of XTRiXXTRiX and of XT(pi+αri)XT(pi+αri), so if the algorithm is parallelized, it is *very* useful to only ship those vectors to the workers doing the computation for yiyi (this is, for example, how the Spark implementation works). Our implementation is not parallelized (for the sake of clarity), so we don’t need to worry about shipping data to workers.

As with the standard matrix factorization model, matrix factorization models for implicit feedback tend to work better if user and item biases are accounted for; we introduce parameters βuβu and γiγi as before, and try to fit a modelpui∼βu+γi+p^ui,pui∼βu+γi+p^ui,where p^ui=xTuyip^ui=xuTyi as above, by minimizing the cost functionCbiasedimplicit:=∑u,i∈observed interactionscui(pui−βu−γi−xTuyi)2+λ(∑u(∥xu∥2+β2u)+∑i(∥yi∥2+γ2i)).Cimplicitbiased:=∑u,i∈observed interactionscui(pui−βu−γi−xuTyi)2+λ(∑u(‖xu‖2+βu2)+∑i(‖yi‖2+γi2)).

The interpretation of the biases in this case is slightly different:

- User Bias βuβu: A higher bias βuβu pushes the values of all preferences puipui up
*for all items*; that is, a user with a high bias likes a large variety of items; conversely, a low user bias means the user only likes a small selection of items. - Item Bias γiγi: A higher item bias γiγi pushes the preferences puipui up
*for all users*; that is, the item is more universally beloved (or mainstream); conversely, a low item bias might indicate a niche item.

Again, the principle of the ALS algorithm applies with appropriate modifications for the new cost function CbiasedimplicitCimplicitbiased. The algorithm is:

- Initialize user vectors and biases.
- Define vectorspβi:=pi−β,piβ:=pi−β,x~Tu=(1,xTu),x~uT=(1,xuT),andy~Ti=(γi,yTi).y~iT=(γi,yiT).Then computey~i=(X˜TCiX˜+λI)−1X˜Tdβi,y~i=(X~TCiX~+λI)−1X~Tdiβ,where we have defined dβi=Cipβidiβ=Cipiβ as in the previous implicit ALS algorithm.
- Define vectorspγu:=pu−γ,puγ:=pu−γ,y~Ti=(1,yTi),y~iT=(1,yiT),andx~Tu=(βu,xTu).x~uT=(βu,xuT).Then computex˜u=(Y˜TCuY˜+λI)−1Y˜Tdγu,x~u=(Y~TCuY~+λI)−1Y~Tduγ,where dγu=Cupγuduγ=Cupuγ.
- Repeat 2) and 3)

The same remarks apply as for biased ALS and implicit ALS.