MCTS is the cornerstone of AlphaGo and many AI applications. We aim to build some intuitions and along the way get our hands dirty.

Monte Carlo (Image from Unsplash)

Monte Carlo Tree Search (MCTS) is an important algorithm behind many major successes of recent AI applications such as AlphaGo’s striking showdown in 2016.

In this blog, we will first start with uninformed search in which we simply traverse through the whole search space to find the optima. It includes depth-first search and breadth-first search.

Then we follow by describing how MCTS algorithm works. In the end, we apply it to a toy example problem of finding the most rewarding leaf node of a binary tree.

Uninformed search

Uninformed search, as its name suggests, is a kind of generic search algorithms which…

Various evaluation metrics are used for evaluating the effectiveness of a recommender. We will focus mostly on ranking related metrics covering HR (hit ratio), MRR (Mean Reciprocal Rank), MAP (Mean Average Precision), NDCG (Normalized Discounted Cumulative Gain).

Recommender systems eventually output a ranking list of items regardless of different modelling choices. So it is important to look at how to evaluate directly ranking quality instead of other proxy metrics like mean squared error, etc.

HR (Hit Ratio)

In recommender settings, the hit ratio is simply the fraction of users for which the correct answer is included in the recommendation list of length L.

As one can see, the larger L is, the higher hit ratio becomes, because there is a higher chance that the correct answer is included in the recommendation list. …

A small tutorial or introduction about common loss functions used in machine learning, including cross entropy loss, L1 loss, L2 loss and hinge loss. Practical details are included for PyTorch.

Cross Entropy

Cross entropy loss is commonly used in classification tasks both in traditional ML and deep learning.

Image from this post

Note: logit here is used to refer to the unnormalized output of a NN, as in Google ML glossary. However, admittedly, this term is overloaded, as discussed in this post.

In this figure, the raw unnormalized output from a Neural Network is converted into probability by a softmax function.

A gentle intro into Matrix Factorization techniques in Recommender Systems, including FunkSVD , SVD++, and Non-negative Matrix Factorization

You mean this Matrix? (image source: Unsplash)


RS is to match items to users. The starting point is a user-item matrix filled with values representing either explicit feedback (user provided ratings) or implicit feedback (count of clicks, number of visits, watch time, etc.). One can also pose this question as a matrix filling problem: given the known entries in user-item matrix, how to fill in the missing entries given all sort of constraints?

We go back to the basics and look at Nearest Neighbour methods in the context of density estimation and recommender systems


Given the recent hyped progress in deep learning based approaches in AI research especially in NLP and CV, it is time to go back to the fundamentals. Many of the fundamental methods, such as neighbourhood based methods are surprisingly effective in many areas, even today. We take a brief look at nearest neighbour method and its applications in density estimation and recommender systems. User-based and item-based collaborative filtering are discussed. In the end, the cold start and data sparsity issue are also touched upon.

Nearest Neighbour in the context of density estimation

Let’s start with the problem of density estimation. In plain words, density estimation is the construction…

Immutability is highly encouraged in Scala applications. As a first class into Scala, it often starts with the difference between val and var. In the naive example when the variable is pointing to a primitive data type like Integer, it all seems obvious, but what if the variable is pointing to a more complex data type, such as a class with a mixture of var and val members?

Let’s dive in with a small example walk-through.

val is conveniently called “value”, implying its immutability, whereas var is called “variable”, implying its mutability. …

What are we really talking about when we talk about AI?

The term “Artificial Intelligence” is coined shortly after World War II during the 1956 Dartmouth Conference. John McCarthy persuaded other attendees (amongst Marvin Minsky, Claude Shannon, etc.) to accept AI as the name for this promising field.

(Anecodotes: Claude Shannon again? He is just everywhere. And Marvin Minsky? Isn’t him the villain who has discouraged Neural Network reseach and caused the AI winter in 1970s?)

Compared to Physics and other fields in science and engineering, AI is one of the newest field, claimed by Russel & Norvig in their…

ML Pipeline is an important feature provided by Scikit-Learn and Spark MLlib. It unifies data preprocessing, feature engineering and ML model under the same framework. This abstraction drastically improves maintainability of any ML project, and should be considered if you are serious about putting your model in production.

Let’s start by outlining some main benefits of using ML pipeline, already.

  • Unified API for both data transformations and modelling
  • No extra preprocessing steps needed during model serving time
  • Joint grid search possible including data preprocessing steps/parameters and model selection/parameters
  • Compatible with big data framework such as Spark

The last point might not be immediately obvious for some. Imagine if one day your model needs to train on large amount of data. You might have no better choice but to use some distributed framework like Spark. Using ML pipeline will make migrating to Spark a breeze to work with. One…

LSH is a technique used to find similar documents in a large corpus. Here we use “documents” in a broad sense to represent any data that can be represented as a set, such as a shopping basket containing a set of grocery items.

There are different LSH functions associated with different distance measures. LSH functions are used to group similar items into the same buckets. Before we can do that, we first need to define “similar” in terms of the distance measure under consideration. That’s why LSH functions are always associated with a distance measure.

Loosely speaking, for a given…

Bloom filter is a probabilistic data structure designed to tell you if a member is in a set in a highly memory and time efficient manner. In that sense, it is similar to a set, but it does it with an adjustable false positive rate.

A brilliant example is provided in this blog post.

Bloom filter is especially fit for large set, where it is hard to fit the whole set into main memory as it is, and when it is okay to allow certain rate of false positives.

One typical example is account name selection while creating your gmail…

Benjamin Wang

Machine Learning & Software Engineer in Amsterdam, Holland

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store