# Otsu Thresholding — Mathematical Secrets behind Image Binarization

Simple yet effective method – Otsu thresholding (envisioned in 1979) makes the conversion from grayscale to a binary image. It is based on the observation of the grayscale histogram and used only the zeroth- and first-order cumulative moment.

Even though this approach was proposed for over 40 years, it is still powerful to cover several practical applications. Interestingly, Otsu thresholding is everywhere, for example, widely used in those image binarization functions in Python or Matlab. In the article, I try to use easy-to-understand mathematical-supported examples to show you why Otsu thresholding makes sense.

## + How Otsu thresholding does?

Given an 8-bit grayscale image *I*, and the pixel value *Pi* of the *i*-th pixel, grayscale histogram *H* can be created by the number of occurrences of pixel value [0, 255]. Otsu thresholding directly adopts this statistical data to find the optimal cut *t** that gives the best separation of the classes (for image binarization, foreground, and background).

Algorithmically, first establish the histogram *H *(see the above illustration, colored in blue). Second, for each threshold *t* in [0, 255], pixels can be separated into two classes, *C1* and *C2*; those pixels whose *Pi* < *t* are put into *C1*, otherwise into *C2*. Then, we know the possibilities of *C1* and *C2* separated by* t*, denoted as *W1* and *W2*, respectively. For example, *W1* = (#pixels in *C1*) / (total pixels count). Third, given *H*, *W1*, and *W2*, for each *t*, compute the between-class variance *Vb* or within-class variance *Vw *(see the above illustration, it computes *Vb* and draws as a red curve). Finally, the optimal cut *t** corresponds to *t* whose *Vb* is* *maximum or *Vw *is minimum.

## + Why do Otsu thresholding make sense

It can be explained in the following three different ways.

well-thresholded classes would be separated in gray levels, and conversely, a threshold giving the best separation of classes in gray levels would be the best threshold.

From this standpoint stated in the original paper [1], the author introduces “class separability” (*CS*, for short) to evaluate the goodness of the thresholding result.

**(0) Higher** ** CS**The higher the better. Why? Leave for you to think, and I will explain in the next section.

Either to minimize within-class variance *Vw* or to maximize between-class variance *Vb* can make foreground and background well-separated. Why?

**(1) Maximum Vb**Given three classes

*C1*,

*C2*, and

*C3*contains black (grayscale value = 0), gray (grayscale value = 128), and white color (grayscale value = 255), respectively. Why does

*C1*look more different to

*C3*than to

*C2*? Exactly due to the grayscale value difference between the two classes (at this point, you can think of it as

*Vb*). Suppose one class represents the foreground and the other one is the background. Which two classes are the result that we expect?

*C1*and

*C3*, since they are of the most visual difference between the two classes. That is why we prefer higher

*Vb*for binarization.

**(2) Minimum Vw**Given two separation cases, one is

*C1*: pure black and

*C2*:pure white, and the other is

*C1*and

*C2*with gray added. Which separation is the result we expect (can be visually distinguished one from the other without efforts)? The former, since each element is more identical to others in the same class than those in the latter one. In other words, the total variance of elements in

*C1*and

*C2*(denoted as

*V1*and

*V2*) in the former case is smaller than those in the latter one (at this point, you can think of

*V1*+

*V2*as

*Vw*). Here comes the reason that the lower

*Vb*is the better choice for binarization.

## + Let’s go deeper

In this section, we talk about the connection between “class separability” (*CS*, for short) and *Vb* (and *Vw*). Take a look at the following illustration.

Why do the classes on the left-hand side be in good separability? Two main reasons: (1) the classes (blue points, and yellow points) are farther from each other, and (2) every element in each class is most concentrated. Interestingly, as mentioned in the previous section, (1) implies maximum *Vb*, and (2) implies minimum *Vw*.

The term *CS* is from Fisher’s linear discriminant analysis. Otsu thresholding can be seen as a discrete version of *CS* and is equivalent to *Vb*/*Vw*. Recall that higher *CS* implies better thresholding results. Therefore, the larger *Vb* and smaller *Vw*, consistent with reasons (1) and (2), can make the value *CS* larger.

Since variance either lies between or within the class, the total variance *Vt* is held fixed (i.e., *Vt* = *Vb* + *Vw*). We can rewrite *Vb*/*Vw* as *Vb*/(*Vt*-*Vb*) or (*Vt*-*Vw*)/*Vw. *Observed that the larger *Vb* has the same contribution to the smaller *Vw* in that equation, and thus either one of the two reasons (1) and (2) can yield higher *CS.*

## + What are the mathematical expressions of Vb and Vw?

So far, we know why the best cut for image binarization in Otsu thresholding lies on the threshold *t* whose *Vb *is maximum (*Vw *is minimum). Do you curious about what *Vw *and *Vb *exactly are? To the best of my knowledge, little online material is talking about how to derive *Vb* and *Vw*. Now, let’s back to the image domain. You may have seen the equation, *Vw* = *W1* ×*V1* + *W2* × *V2*.

Here, I expand each term by their definitions where *N* is the total pixel count in *I*, and *W1* and *W2* are the probabilities of the two classes *C1* and *C2* separated by *t*. *C1 *contains pixels with values less than *t*, while *C2* contains those pixels which are not in *C1*. Since the size (pixel count) of *C1* might be different from that of *C2*, we have to convert *V1*+*V2* (as mentioned in (2) Minimum Vw) to a weighted variance sum for *Vw*.

As for *Vb*, you may have seen the following two equations.

In the second equation, *Vb* is proportional to the difference between class representatives (as mentioned in (1) Maximum *Vb*). How to derive it? Why does it have a quadratic term? Unlike *Vw*, it is not straightforward to define the *Vb* in a math way. However, we can use *Vt* and *Vw* to do back derivation.

First, expand squared total mean as colored in red where -2 is the clue to complete a square (*a*-*b*)² = *a²* - 2*ab* + *b²*. Second, substitute the weighted sum of *μ1* and *μ2* for *μt*. Third, use *W1*+*W2*=1 to merge and arrange the expansion terms. With these, we can derive the two equations.

There are two observations from the second derived equation:

- given fixed
*W1*×*W2*, a large mean difference encourages larger*Vb.* - given fixed mean difference, maximum
*Vb*happens at*W1*=*W2*.

For the observations above, *Vb* forms a push-pull factor during maximization where a push between *μ1*and *μ2 *and a pull between *W1*and *W2. *Also, during threshold finding in Otsu binarization, a pull exists within each class to make elements concentrated (i.e., similar pixel values); and a push exists between the two classes to let one be away from the other (i.e., larger mean difference).

## + The gap between ideal and real

Up to now, we know Otsu thresholding makes the decision is based on the observation of statistical data — histogram. In an ideal case, the histogram of one image has a sharp valley between two peaks (foreground and background) where the optimal cut lies. However, most real images do not have such distribution and make it hard to figure out the true valley precisely. For nearly 40 years, various extensions have been proposed to alleviate this problem for better performance.

## + Takeaway

- Otsu thresholding is a binarization approach that utilizes the grayscale histogram to find the best separation result.
- The optimal threshold lies on the separation result with maximum between-class variance (
*Vb*) or minimum within-class variance (*Vw*). - The separation result is the production of push-pull optimization that makes each class more concentrated while pushing classes to be far away from each other.

## + Reference

[1] Otsu, Nobuyuki. “A threshold selection method from gray-level histograms.” *IEEE transactions on systems, man, and cybernetics* 9.1 (1979): 62–66.