
Apr 18, 2022 | Tarique Anwar Writes:

The main reason ReLU is used is that it is simple, fast, and empirically it seems to work well.

But with the emergence of Transformer-based models, different activation functions and GLU variants have been experimented with, and they do seem to perform better. Some of them are (a brief sketch follows the list):

  • GeLU²
  • Swish¹
  • GLU³
  • GEGLU⁴
  • SwiGLU⁴

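As a rough sketch (mine, not the article's code), here is how these could be written in PyTorch. The gated variants follow the usual GLU formulation with two weight matrices `W`, `V` and biases `b`, `c`; those parameter names are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def swish(x, beta: float = 1.0):
    # Swish_beta(x) = x * sigmoid(beta * x); beta = 1 recovers SiLU
    return x * torch.sigmoid(beta * x)

def gelu(x):
    # GeLU(x) = x * Phi(x), where Phi is the standard normal CDF
    return F.gelu(x)

def glu(x, W, V, b, c):
    # GLU: a sigmoid-gated linear branch multiplied by a plain linear branch
    return torch.sigmoid(x @ W + b) * (x @ V + c)

def geglu(x, W, V, b, c):
    # GEGLU: the sigmoid gate of GLU replaced with GeLU
    return F.gelu(x @ W + b) * (x @ V + c)

def swiglu(x, W, V, b, c):
    # SwiGLU: the sigmoid gate of GLU replaced with Swish
    return swish(x @ W + b) * (x @ V + c)
```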
We will go over some of these in detail, but before that, let's see where exactly these activations are utilized in a Transformer architecture.
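For context, the activation sits inside the position-wise feed-forward sub-layer of each Transformer block, FFN(x) = act(xW₁ + b₁)W₂ + b₂ (ReLU in the original architecture). A minimal PyTorch sketch of that slot, with assumed names and dimensions rather than anything from the article:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    # Illustrative position-wise FFN; d_model and d_ff values are assumptions.
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.act = nn.ReLU()   # the slot that GeLU, Swish, or a GLU variant replaces
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # FFN(x) = act(x W1 + b1) W2 + b2, applied independently at each position
        return self.w2(self.act(self.w1(x)))
```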

Read Activation function and GLU variants for Transformer models
