Saturday, April 21, 2018

The "No Free Lunch" Theorem

The "No Free Lunch" theorem was first published by  David Wolpert and William Macready in their 1996 paper "No Free Lunch Theorems for Optimization".

In computational complexity and optimization, the no free lunch theorem is a result stating that for certain classes of mathematical problems, the computational cost of finding a solution, averaged over all problems in the class, is the same for every solution method. No method therefore offers a shortcut.

A model is a simplified version of the observations. The simplifications are meant to discard the superfluous details that are unlikely to generalize to new instances. However, to decide what to keep and what to discard, you must make assumptions. For example, a linear model assumes that the data is fundamentally linear and that the distance between the instances and the straight line is just noise, which can safely be ignored.
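To make that concrete, here is a tiny NumPy sketch (synthetic data, purely illustrative) of the linear-model assumption in action: the fit keeps only a slope and an intercept and treats the scatter around the line as noise.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)  # a straight line plus noise

# The linear model keeps only two numbers (slope and intercept); everything
# else about the observations is discarded as noise.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # roughly 2 and 1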

David Wolpert demonstrated that if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. This is called the "No Free Lunch Theorem" (NFL).

NFL states that no model is a priori guaranteed to work better than any other. The only way to know for sure which model is best is to evaluate them all. Since this is not possible, in practice you make some reasonable assumptions about the data and evaluate only a few reasonable models.
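In practice that evaluation usually means cross-validating a handful of candidate models and keeping the best one. A minimal scikit-learn sketch of the idea (the dataset and the model choices are placeholders, not a recommendation):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# A toy dataset standing in for the data we made assumptions about.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A few "reasonable" models; NFL says none of them wins a priori.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(),
    "k-nearest neighbours": KNeighborsClassifier(),
}

for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")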




Sunday, April 15, 2018

Review: Focal Loss for Dense Object Detection

The paper Focal Loss for Dense Object Detection introduces a new self-balancing loss function that aims to address the extreme foreground/background class imbalance found in one-stage object detection networks.

y : ground-truth binary class label, +1 or -1
p : the model's estimated probability that the input belongs to the positive class (y = +1)

Given Cross Entropy (CE) loss for binary classification:
CE(p, y) =
-log(p) ,  if y = 1
-log(1 - p), if y = -1

The paper introduces the Focal Loss (FL) term as follows
FL(p,y) =
-(1-p)^gamma * log(p), if y = +1
-(p)^gamma * log(1-p), if y = -1

The focusing parameter gamma is >= 0: gamma = 0 disables the modulating factor (recovering plain CE), and the authors find gamma = 2 to work best in practice.
Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss.
Easy examples are those that are already classified with high confidence: p close to 1 when y = +1, or p close to 0 when y = -1.
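The per-example losses are simple enough to write down directly. Here is a minimal Python sketch of binary FL and CE (my own helper names, not the authors' code); note that the worked examples below use base-10 logarithms, so the logarithm is passed in as a parameter:

import math

def focal_loss(p, y, gamma=2.0, log=math.log):
    # p     : estimated probability that the input belongs to class +1
    # y     : ground-truth label, +1 or -1
    # gamma : focusing parameter; gamma = 0 recovers plain cross entropy
    if y == 1:
        return -((1.0 - p) ** gamma) * log(p)
    return -(p ** gamma) * log(1.0 - p)

def cross_entropy(p, y, log=math.log):
    # Plain binary cross entropy, i.e. focal loss with gamma = 0.
    return focal_loss(p, y, gamma=0.0, log=log)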

Example 1
gamma = 2.0
p = 0.9
y = +1
FL(0.9, +1) = - ( 1 - 0.9 ) ^ 2.0 * log(0.9) = 0.00045 
CE(0.9, +1) = - log(0.9) = 0.0457

Example 2
gamma = 2.0
p = 0.99
y = +1
FL(0.99, +1) = - ( 1 - 0.99 ) ^ 2.0 * log(0.99) = 0.000000436
CE(0.99, +1) = - log(0.99) = 0.00436

That means a near-certain prediction (a very easy example) gets a much smaller FL than its cross entropy loss, while an ambiguous prediction (p close to 0.5) keeps a much larger relative contribution.
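As a sanity check, the helpers sketched above reproduce these numbers when given a base-10 logarithm:

print(focal_loss(0.9, +1, gamma=2.0, log=math.log10))   # ~0.000458
print(cross_entropy(0.9, +1, log=math.log10))           # ~0.0458
print(focal_loss(0.99, +1, gamma=2.0, log=math.log10))  # ~0.000000436
print(cross_entropy(0.99, +1, log=math.log10))          # ~0.00436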

In practice the authors use an alpha-balanced variant of FL:

FL(p, y) =
-alpha(y) * ( 1 - p ) ^ gamma * log(p), if y = +1
-alpha(y) * ( p ) ^ gamma * log(1 - p), if y = -1

Where alpha(y) is a class-dependent weighting factor (alpha for y = +1 and 1 - alpha for y = -1) that counteracts the class imbalance. This form yields slightly improved accuracy over the non-alpha-balanced form.
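Extending the earlier sketch to the alpha-balanced form is a one-line change per branch (alpha = 0.25 is the value the paper reports pairing well with gamma = 2; the helper name is mine):

def alpha_balanced_focal_loss(p, y, gamma=2.0, alpha=0.25, log=math.log):
    # alpha weights the positive (foreground) class, 1 - alpha the negative
    # (background) class.
    if y == 1:
        return -alpha * ((1.0 - p) ** gamma) * log(p)
    return -(1.0 - alpha) * (p ** gamma) * log(1.0 - p)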

The authors then build a network to show off the capabilities of their loss function. The network is called RetinaNet and it's a standard Feature Pyramid Network (FPN) backbone with two subnetworks (one for object classification, one for box regression) attached at each feature map level. It's a very common design for a one-stage detector, similar to SSD (edit: essentially the same as SSD) and YOLO. A slight differentiation is the prior used when initializing the bias of the object-classification subnet and the way the total loss is accumulated over the dense set of anchors.
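For the prior-based initialization, the paper sets the bias of the classification subnet's final layer to b = -log((1 - pi) / pi) with pi = 0.01, so that at the start of training every anchor predicts background with high confidence and the loss is not swamped by the huge number of easy negatives. A rough PyTorch sketch of just that initialization (layer shapes and names are illustrative, not the authors' code):

import math
import torch.nn as nn

num_anchors, num_classes = 9, 80  # illustrative values
prior = 0.01                      # pi from the paper

# Final conv layer of the classification subnet: one score per anchor per class.
cls_head = nn.Conv2d(256, num_anchors * num_classes, kernel_size=3, padding=1)

# Initialize so that sigmoid(bias) ~= prior, i.e. every location starts out
# as confident background.
nn.init.normal_(cls_head.weight, mean=0.0, std=0.01)
nn.init.constant_(cls_head.bias, -math.log((1.0 - prior) / prior))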


