One challenge is encountering a combination that never appeared in our training data. For example, what if we've never seen a case where someone watched a movie while homework was "On it" during sunny weather? That conditional probability would be zero, and since Naive Bayes multiplies all the feature probabilities together, a single zero wipes out the entire product!
Laplace Smoothing
The solution? Use Laplace smoothing:
P(feature∣Watch)=(count(feature,Watch)+1)/(count(Watch)+N)
Where N is the number of possible values for that feature. For example:
- Homework status has 3 possible values (Yes/No/On it), so N = 3
- Weather has 2 possible values (Sunny/Rainy), so N = 2
- Friend availability has 2 possible values (TRUE/FALSE), so N = 2
Let's see this in action. If we never observed someone watching a movie with homework "On it," instead of:
P(HW=OnIt∣Watch=yes)=0/8=0
We use smoothing:
P(HW=OnIt∣Watch=yes)=(0+1)/(8+3)=1/11≈0.09
This gives us a small but non-zero probability, which is more realistic than assuming it's impossible because we haven't seen it yet!
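Here's a minimal Python sketch of that calculation. The function name `smoothed_prob` and the specific counts for "Yes" and "No" are invented for illustration; only the 8 watch nights, the 3 possible homework values, and the never-seen "On it" case come from the example above.

```python
from collections import Counter

def smoothed_prob(feature_value, feature_counts, class_count, n_values):
    """Laplace-smoothed estimate of P(feature_value | class).

    feature_counts: counts of each feature value observed within this class
    class_count:    total number of training rows for this class
    n_values:       number of possible values the feature can take (N)
    """
    return (feature_counts[feature_value] + 1) / (class_count + n_values)

# Counts within the "Watch = yes" class (hypothetical split of the 8 watch nights):
hw_counts_watch = Counter({"Yes": 5, "No": 3, "On it": 0})  # "On it" never observed
watch_count = 8    # 8 movie nights in total
n_hw_values = 3    # homework can be Yes / No / On it

print(smoothed_prob("On it", hw_counts_watch, watch_count, n_hw_values))  # 1/11 ≈ 0.0909
```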
So, don't forget to use smoothing when you find a probability of zero!
Handling Very Small Probabilities
What happens when we multiply probabilities that are close to zero?
Let's look at a challenging (hypothetical) scenario where several smoothed probabilities are very small:
- P(Watch)=8/14≈0.571
- P(HW=OnIt∣Watch)=0.001 (after smoothing)
- P(Sunny∣Watch)=0.001 (after smoothing)
- P(Friend=FALSE∣Watch)=0.001 (after smoothing)
If we multiply these tiny numbers:
0.571×0.001×0.001×0.001=0.000000000571 (about 5.7×10⁻¹⁰)
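To see what the computer actually does with these numbers, here is a quick Python check. The four probabilities are the ones from the scenario above; the 200-feature case is an exaggerated example, purely to show how underflow eventually kicks in.

```python
import math

# The four probabilities from the scenario above
probs = [0.571, 0.001, 0.001, 0.001]
print(math.prod(probs))  # 5.71e-10 -- still representable as a float...

# ...but with many features the product underflows to exactly 0.0,
# even though the true value (here 1e-600) is not zero.
many_tiny = [0.001] * 200
print(math.prod(many_tiny))  # 0.0
```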
A single number this small is still fine for a computer, but multiply a few dozen of them together and the product can underflow to zero. The solution? We can use logarithms:
Instead of calculating:
P(Watch)×P(HW∣Watch)×P(Weather∣Watch)×P(Friend∣Watch)
We calculate:
log(P(Watch))+log(P(HW∣Watch))+log(P(Weather∣Watch))+log(P(Friend∣Watch))
This turns our multiplication into addition and helps avoid numerical underflow (when numbers get too small for computers to handle accurately). When we compare scenarios, bigger log probabilities (closer to zero) are better than smaller ones (more negative).
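Here is a short sketch of the comparison in log space. The probabilities for "Watch = yes" are the ones from the scenario above; the second list, for a competing "Watch = no" class, is invented purely so there is something to compare against.

```python
import math

# Smoothed probabilities for the "Watch = yes" scenario above
probs_watch = [0.571, 0.001, 0.001, 0.001]
# Hypothetical probabilities for a competing "Watch = no" class
probs_no_watch = [0.429, 0.002, 0.01, 0.005]

def log_score(probs):
    """Sum of log-probabilities: log P(class) + sum of log P(feature | class)."""
    return sum(math.log(p) for p in probs)

score_watch = log_score(probs_watch)        # ≈ -21.28
score_no_watch = log_score(probs_no_watch)  # ≈ -16.96

# The larger (less negative) log score wins -- the same ranking as
# multiplying the raw probabilities, but with no risk of underflow.
prediction = "Watch" if score_watch > score_no_watch else "Don't watch"
print(score_watch, score_no_watch, prediction)
```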