Activation Functions:

Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
ReLU: $f(x) = \max(0, x)$
Softmax (for $n$ classes): $\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}$
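
A minimal NumPy sketch of these three activations (the helper names and toy inputs are illustrative; subtracting the max inside softmax is a standard numerical-stability trick not shown in the formula):

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)), squashes each value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # max(0, x), applied element-wise
    return np.maximum(0.0, x)

def softmax(x):
    # exponentiate and normalize; subtracting the max avoids overflow
    z = np.exp(x - np.max(x))
    return z / z.sum()

logits = np.array([2.0, 1.0, 0.1])
print(sigmoid(logits), relu(logits), softmax(logits))
```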

Feedforward Network:

Neuron Activation: $a = \sigma(Wx + b)$
Dropout Regularization: $y = f(Wx + b) \cdot \text{mask}$, where each entry of $\text{mask}$ is drawn from $\text{Bernoulli}(1 - p)$ and $p$ is the dropout probability
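
A sketch of one dense layer with dropout, assuming ReLU for $f$ and the common "inverted dropout" convention of rescaling by $1/(1-p)$ at training time (the rescaling is not part of the formula above):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_relu(x, W, b):
    # a = f(Wx + b) with f = ReLU
    return np.maximum(0.0, W @ x + b)

def dropout(y, p, training=True):
    # zero each unit with probability p, then rescale by 1/(1-p) ("inverted dropout")
    if not training:
        return y
    mask = rng.random(y.shape) >= p
    return y * mask / (1.0 - p)

x = rng.standard_normal(4)
W = rng.standard_normal((3, 4))
b = np.zeros(3)
print(dropout(dense_relu(x, W, b), p=0.5))
```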

Cross-Entropy Loss:

$L = -\sum_{i=1}^n y_i \log(\hat{y}_i)$, where $y_i$ is the true label, $\hat{y}_i$ the predicted probability.
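
A small numerical example of the loss, assuming a one-hot true label; the `eps` term only guards against $\log(0)$:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # L = -sum_i y_i * log(y_hat_i)
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.array([0.0, 1.0, 0.0])    # one-hot true label
y_pred = np.array([0.1, 0.7, 0.2])    # predicted class probabilities
print(cross_entropy(y_true, y_pred))  # -log(0.7) ≈ 0.357
```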

Gradient Descent Update Rule: Weight update: $w := w - \eta \nabla L(w)$
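
A toy sketch of the update rule on $L(w) = \lVert w \rVert^2$ (gradient $2w$); the learning rate and iteration count are arbitrary:

```python
import numpy as np

def sgd_step(w, grad, lr):
    # w := w - eta * grad L(w)
    return w - lr * grad

# minimize L(w) = ||w||^2, whose gradient is 2w
w = np.array([1.0, -2.0])
for _ in range(50):
    w = sgd_step(w, grad=2.0 * w, lr=0.1)
print(w)  # converges toward the minimizer [0, 0]
```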

TF-IDF: Inverse Document Frequency: $\text{IDF}(t) = \log \frac{N}{1 + \text{DF}(t)}$, where $N$ is the total number of documents and $\text{DF}(t)$ is the number of documents containing term $t$; the $+1$ avoids division by zero.
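
A sketch of this smoothed IDF (with a raw-count TF) on a toy corpus of tokenized documents; other TF weightings are common and not implied by the formula:

```python
import math

def idf(term, docs):
    # IDF(t) = log(N / (1 + DF(t)))
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1 + df))

def tf_idf(term, doc, docs):
    # raw count of the term in the document as TF
    return doc.count(term) * idf(term, docs)

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "slept"]]
print(tf_idf("dog", docs[1], docs))  # 1 * log(3/2) ≈ 0.405
```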

Word2Vec (Skip-gram): Probability: $P(w_O | w_I) = \frac{\exp(v_{w_O} \cdot v_{w_I})}{\sum_{w} \exp(v_w \cdot v_{w_I})}$, where $v_{w_I}$ is the vector of the input (center) word, $v_{w_O}$ the vector of the output (context) word, and the sum runs over the vocabulary.
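
A sketch of the full-softmax skip-gram probability. It uses separate input and output embedding matrices (the usual Word2Vec setup), whereas the formula above writes both as $v$; real implementations typically replace the full softmax with negative sampling or hierarchical softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 100
V_in = rng.standard_normal((vocab_size, dim)) * 0.1   # input (center-word) vectors
V_out = rng.standard_normal((vocab_size, dim)) * 0.1  # output (context-word) vectors

def skipgram_prob(w_out, w_in):
    # P(w_O | w_I): softmax over dot products of w_I's vector with every output vector
    scores = V_out @ V_in[w_in]
    scores -= scores.max()                            # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[w_out]

print(skipgram_prob(w_out=42, w_in=7))
```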

RNN Cell Update:

$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$
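
A sketch of unrolling this update over a toy sequence; dimensions and the random initialization are arbitrary:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
W_xh = rng.standard_normal((hidden_dim, input_dim)) * 0.1
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                         # initial hidden state
for x_t in rng.standard_normal((5, input_dim)):  # toy sequence of 5 inputs
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h)
```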

GRU Cell:

Update gate: $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$
Reset gate: $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$
Memory content: $\tilde{h}_t = \tanh(W \cdot [r_t \ast h_{t-1}, x_t])$
Final memory: $h_t = (1 - z_t) \ast h_{t-1} + z_t \ast \tilde{h}_t$
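
A step-function sketch of the GRU equations above; biases are omitted to match the formulas, and $[a, b]$ is implemented as vector concatenation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W):
    xh = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
    z = sigmoid(W_z @ xh)                                     # update gate
    r = sigmoid(W_r @ xh)                                     # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]))  # memory content
    return (1.0 - z) * h_prev + z * h_tilde                   # final memory

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
W_z, W_r, W = (rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1
               for _ in range(3))

h = np.zeros(hidden_dim)
for x_t in rng.standard_normal((5, input_dim)):
    h = gru_step(x_t, h, W_z, W_r, W)
print(h)
```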

LSTM Cell:

Forget gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
Input gate: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
Output gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
Candidate cell state: $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
Cell state: $C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t$
Hidden state: $h_t = o_t \ast \tanh(C_t)$
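
A step-function sketch of the LSTM equations above, again treating $[h_{t-1}, x_t]$ as concatenation; the parameter shapes and initialization are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_o, b_o, W_C, b_C):
    xh = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f = sigmoid(W_f @ xh + b_f)          # forget gate
    i = sigmoid(W_i @ xh + b_i)          # input gate
    o = sigmoid(W_o @ xh + b_o)          # output gate
    C_tilde = np.tanh(W_C @ xh + b_C)    # candidate cell state
    C = f * C_prev + i * C_tilde         # cell state
    h = o * np.tanh(C)                   # hidden state
    return h, C

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
weights = [rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1 for _ in range(4)]
biases = [np.zeros(hidden_dim) for _ in range(4)]
params = [p for pair in zip(weights, biases) for p in pair]  # W_f, b_f, W_i, b_i, ...

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.standard_normal((5, input_dim)):
    h, C = lstm_step(x_t, h, C, *params)
print(h)
```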

Self-Attention:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$ with Query ($Q$), Key ($K$), and Value ($V$) matrices, and dimension $d_k$.
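
A NumPy sketch of scaled dot-product attention for a single (unbatched, unmasked) sequence:

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, with the softmax taken over the key dimension
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 5, 16, 16
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_v))
print(attention(Q, K, V).shape)  # (5, 16)
```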

Multi-Head Attention:

$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$, where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.
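
A sketch of multi-head attention built on the attention function above, with per-head projections $W_i^Q, W_i^K, W_i^V$ and output projection $W^O$; head count and dimensions are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_attention(Q, K, V, heads, W_O):
    # heads: one (W_Q, W_K, W_V) projection triple per head
    outs = [attention(Q @ W_Q, K @ W_K, V @ W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outs, axis=-1) @ W_O  # Concat(head_1, ..., head_h) W^O

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 32, 4
d_k = d_model // n_heads
heads = [tuple(rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.standard_normal((n_heads * d_k, d_model)) * 0.1
x = rng.standard_normal((seq_len, d_model))
print(multi_head_attention(x, x, x, heads, W_O).shape)  # (5, 32); self-attention uses Q = K = V = x
```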

Encoder-Decoder Framework:

$P(Y | X) = \prod_{t=1}^T P(y_t | y_{<t}, X)$, with input $X$, output $Y$, and token $y_t$ at time step $t$.
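
A tiny worked example of the factorization: the sequence log-probability is the sum of the per-step conditional log-probabilities (the step probabilities below are made up):

```python
import math

# log P(Y | X) = sum_t log P(y_t | y_<t, X)
step_probs = [0.9, 0.6, 0.8]   # made-up P(y_t | y_<t, X) for t = 1..3
log_prob = sum(math.log(p) for p in step_probs)
print(math.exp(log_prob))      # 0.432 = 0.9 * 0.6 * 0.8
```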

Beam Search Decoding:

Sequence probability: $P(Y | X) = \prod_{t=1}^T P(y_t | y_{<t}, X)$; at each decoding step, beam search keeps only the $k$ highest-scoring partial sequences (the beam) rather than all possible continuations.
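
A sketch of beam search under these scores, keeping the `beam_size` best partial sequences by cumulative log-probability; the toy next-token model, special-token ids, and lengths are placeholders for a real decoder:

```python
import numpy as np

def beam_search(next_token_probs, beam_size, max_len, bos=0, eos=1):
    # next_token_probs(prefix) -> array of P(y_t | y_<t, X) over the vocabulary
    beams = [([bos], 0.0)]                     # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:              # finished hypotheses are kept as-is
                candidates.append((prefix, score))
                continue
            for tok, p in enumerate(next_token_probs(prefix)):
                candidates.append((prefix + [tok], score + np.log(p + 1e-12)))
        # keep only the beam_size highest-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

rng = np.random.default_rng(0)

def toy_model(prefix):
    # stand-in for a real decoder: returns a random next-token distribution
    logits = rng.standard_normal(6)
    return np.exp(logits) / np.exp(logits).sum()

for prefix, score in beam_search(toy_model, beam_size=3, max_len=4):
    print(prefix, round(score, 3))
```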

Transformer Feedforward Network:

Position-wise FFN: $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$
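
A sketch of the position-wise FFN; it applies the same two linear maps (with a ReLU in between) to every position independently:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 32, 128
W1 = rng.standard_normal((d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)
x = rng.standard_normal((seq_len, d_model))
print(ffn(x, W1, b1, W2, b2).shape)  # (5, 32)
```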

Layer Normalization:

$\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta$
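
A sketch of layer normalization over the last (feature) axis, with learnable scale $\gamma$ and shift $\beta$ passed in as arrays:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # normalize each row over its features, then rescale by gamma and shift by beta
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
d_model = 8
x = rng.standard_normal((3, d_model)) * 5 + 2
y = layer_norm(x, gamma=np.ones(d_model), beta=np.zeros(d_model))
print(y.mean(axis=-1), y.std(axis=-1))  # per-row mean ≈ 0, std ≈ 1
```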

Attention Score (Dot Product):

$\text{score}(Q, K) = \frac{Q \cdot K^T}{\sqrt{d_k}}$

BERT MLM Objective:

Masked Language Modeling loss: $L_{\text{MLM}} = -\sum_{i=1}^M \log P(x_i | x_{\setminus i})$, with $x_i$ a masked token, $M$ the count of masked tokens, and $x_{\setminus i}$ the context.
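
A sketch of the MLM loss given per-position predicted distributions; the masked positions, true token ids, and random "model outputs" are placeholders:

```python
import numpy as np

def mlm_loss(probs, masked_positions, true_ids):
    # L_MLM = -sum over masked positions i of log P(x_i | context)
    return -sum(np.log(probs[pos, tok] + 1e-12)
                for pos, tok in zip(masked_positions, true_ids))

rng = np.random.default_rng(0)
seq_len, vocab = 10, 50
logits = rng.standard_normal((seq_len, vocab))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # per-position softmax
print(mlm_loss(probs, masked_positions=[2, 7], true_ids=[13, 4]))
```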