- [[multi-class logistic regression]], [[one-vs-rest classification strategy]], [[binary cross-entropy loss function|binary cross-entropy]], [[cross entropy]]

# Idea

For [[multi-class classification]] in neural networks, the cross-entropy loss is used rather than the [[one-vs-rest classification strategy]]. This loss is also known as the [[softmax]] loss; it generalizes [[logistic regression]] to more than two classes. The approach fits a single classifier for all classes, and the prediction directly outputs the most likely class. A multinomial formulation also exists for [[support vector machines]], but it is less commonly used.

The [[loss function]] associated with [[softmax]], the cross-entropy loss, is:

$$
L(\mathbf{a},y)=\begin{cases}
-\log(a_1), & \text{if } y=1\\
\quad\quad\vdots\\
-\log(a_N), & \text{if } y=N
\end{cases} \tag{3}
$$

where $y$ is the target category for this example and $\mathbf{a}$ is the output of the softmax function; in particular, the values in $\mathbf{a}$ are probabilities that sum to one. Note that in (3) only the line corresponding to the target contributes to the loss; the other lines are zero.

To write the cost equation we need an indicator function that is 1 when the index matches the target and 0 otherwise:

$$
\mathbf{1}\{y == n\} = \begin{cases}
1, & \text{if } y==n\\
0, & \text{otherwise}
\end{cases}
$$

The [[cost function]] is then:

$$
J(\mathbf{w},b) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{N} \mathbf{1}\left\{y^{(i)} == j\right\} \log \frac{e^{z^{(i)}_j}}{\sum_{k=1}^N e^{z^{(i)}_k}} \right] \tag{4}
$$

where $m$ is the number of examples and $N$ is the number of outputs. This is the average of all the per-example losses.

![[20231231193313.png]]

```python
# sklearn
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(multi_class="multinomial", solver="lbfgs")

# tensorflow (model defined elsewhere)
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(loss=SparseCategoricalCrossentropy())
```

## TensorFlow implementation

The "obvious" implementation, which is NOT RECOMMENDED:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.datasets import make_blobs

# make dataset for example
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0, random_state=30)

model = Sequential(
    [
        Dense(25, activation='relu'),
        Dense(15, activation='relu'),
        Dense(4, activation='softmax')   # softmax activation here
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001),
)

model.fit(X_train, y_train, epochs=10)

# output is a vector of probabilities for each example
p = model.predict(X_train)
```

The better and numerically more stable implementation uses a linear output layer and lets the loss apply the softmax internally (`from_logits=True`).

![[20231231193853.png]]

```python
# make dataset for example
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0, random_state=30)

preferred_model = Sequential(
    [
        Dense(25, activation='relu'),
        Dense(15, activation='relu'),
        Dense(4, activation='linear')   # linear!
    ]
)
preferred_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # NOTE!!
    optimizer=tf.keras.optimizers.Adam(0.001),
)

preferred_model.fit(X_train, y_train, epochs=10)

# outputs are logits, not probabilities!
p_preferred = preferred_model.predict(X_train)

# convert to probabilities
sm_preferred = tf.nn.softmax(p_preferred).numpy()

# get category for the first 5 observations
for i in range(5):
    print(f"{p_preferred[i]}, category: {np.argmax(p_preferred[i])}")
```

Note that to select the most likely category, the softmax is not required: because softmax is monotonic, one can find the index of the largest output (logit) directly with [np.argmax()](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html). The first sketch below illustrates this.

### SparseCategoricalCrossentropy or CategoricalCrossentropy

TensorFlow has two potential formats for target values, and the choice of loss defines which is expected:

- `SparseCategoricalCrossentropy`: expects the target to be an integer corresponding to the index of the true class. For example, if there are 10 potential target values, y would be between 0 and 9.
- `CategoricalCrossentropy`: expects the target value of an example to be one-hot encoded, where the value at the target index is 1 while the other N-1 entries are zero. With 10 potential target values and a target of 2, the label would be `[0,0,1,0,0,0,0,0,0,0]`.

The second sketch below compares the two formats on the same logits.
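To make equations (3)–(4) and the argmax shortcut concrete, here is a minimal NumPy sketch; the logits `Z` and targets `y` are made-up values used only for illustration:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)   # shift logits for stability
    ez = np.exp(z)
    return ez / ez.sum(axis=-1, keepdims=True)

# made-up logits for m=3 examples and N=4 classes, with their target categories
Z = np.array([[2.0, 1.0, 0.1, -1.0],
              [0.5, 2.5, 0.3,  0.0],
              [1.0, 0.2, 0.2,  3.0]])
y = np.array([0, 1, 3])

A = softmax(Z)                                  # probabilities; each row sums to one

# loss (3): only the probability of the target class contributes
losses = -np.log(A[np.arange(len(y)), y])

# cost (4): the average of the per-example losses
J = losses.mean()
print(losses, J)

# softmax is monotonic, so argmax over logits and over probabilities agree
assert np.array_equal(np.argmax(Z, axis=1), np.argmax(A, axis=1))
```

Subtracting the row maximum before exponentiating is the same kind of rescaling that makes the `from_logits=True` path more stable than applying softmax in the output layer.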
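And a minimal sketch of the two target formats described above, again with made-up logits: integer targets go to `SparseCategoricalCrossentropy`, the same targets one-hot encoded go to `CategoricalCrossentropy`, and both give the same mean loss:

```python
import numpy as np
import tensorflow as tf

# made-up logits for 3 examples and 4 classes
logits = np.array([[2.0, 1.0, 0.1, -1.0],
                   [0.5, 2.5, 0.3,  0.0],
                   [1.0, 0.2, 0.2,  3.0]], dtype=np.float32)

y_sparse = np.array([0, 1, 3])            # integer class indices
y_onehot = tf.one_hot(y_sparse, depth=4)  # same targets, one-hot encoded

sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
dense_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

# both compute the same mean cross-entropy over the batch
print(sparse_loss(y_sparse, logits).numpy())
print(dense_loss(y_onehot, logits).numpy())
```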
# References

- https://campus.datacamp.com/courses/linear-classifiers-in-python/logistic-regression-3?ex=9
- [[Martins 2016 from softmax to sparsemax]]
- [Softmax - Neural network training | Coursera](https://www.coursera.org/learn/advanced-learning-algorithms/lecture/mzLuU/softmax)
- [Neural Network with Softmax output - Neural network training | Coursera](https://www.coursera.org/learn/advanced-learning-algorithms/lecture/ZQPG3/neural-network-with-softmax-output)
- [Improved implementation of softmax - Neural network training | Coursera](https://www.coursera.org/learn/advanced-learning-algorithms/lecture/Tyil1/improved-implementation-of-softmax)