We can see that the MI on the right-hand side of the first row of Eq. (10) has two terms. The first is the entropy of the class label variable $r$; it is not a function of the window-shifting transform $T$. The second is the conditional entropy of $r$ given the chunked spectral feature vector $x_{w,b}$. When the class label variable $r$ and the chunked spectral feature vector $x_{w,b}$ are related, the conditional entropy $H(r|x_{w,b})$ decreases. For example, if $r$ can be reliably predicted from $x_{w,b}$, then $r$ becomes less uncertain once $x_{w,b}$ is observed. Intuitively interpreting entropy as the amount of uncertainty, a decrease in uncertainty means lower entropy. In other words, if $x_{w,b}$ is a good observation, i.e., an informative feature subset for predicting the unknown variable $r$, the uncertainty of $r$ given $x_{w,b}$ is reduced more than it would be by a less discriminative feature subset, and $H(r|x_{w,b})$ decreases accordingly. From Eq. (10), a smaller $H(r|x_{w,b})$ yields a larger MI, $I(r, x_{w,b})$. Consequently, since $x_{w,b}=T_{w,b}(x)$, maximizing the MI in Eq. (10) encourages the spectral window to shift to a wavelength region in which the spectral features are better able to predict the class label.
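The decomposition $I(r, x_{w,b}) = H(r) - H(r|x_{w,b})$ discussed above can be illustrated numerically. The sketch below (not from the paper; the function names and the toy joint distributions are hypothetical) compares a feature that is strongly related to the class label against one that is independent of it, showing that only the former yields a positive MI.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(r; x) = H(r) - H(r|x) for a discrete joint distribution
    with joint[i, j] = P(r = i, x = j)."""
    p_r = joint.sum(axis=1)   # marginal of the class label r
    p_x = joint.sum(axis=0)   # marginal of the feature x
    h_r = entropy(p_r)
    # H(r|x) = sum_j P(x = j) * H(r | x = j)
    h_r_given_x = sum(
        p_x[j] * entropy(joint[:, j] / p_x[j])
        for j in range(joint.shape[1]) if p_x[j] > 0
    )
    return h_r - h_r_given_x

# Toy example: a discriminative feature, where x predicts r
# correctly with probability 0.9 ...
informative = np.array([[0.45, 0.05],
                        [0.05, 0.45]])
# ... versus an uninformative feature, independent of r.
uninformative = np.outer([0.5, 0.5], [0.5, 0.5])

# Observing the informative feature lowers H(r|x) below H(r) = 1 bit,
# so the MI is positive; for the independent feature the MI is zero.
print(mutual_information(informative))
print(mutual_information(uninformative))
```

In the window-shifting setting, each candidate window position $(w, b)$ induces a different feature distribution; maximizing the MI selects the position whose features behave like the informative case above.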