The ‘swish’ activation function is f(x) = x · sigmoid(β·x).
β is typically set to 1, but it doesn’t have to be; you can make it a parameter for the model to learn if you want. I’ve played with this and not seen any significant benefit, though; in my experience, tuning the learning rate and/or batch size is more impactful than a learned activation function. Also, you can end up with vanishing or exploding gradients if you don’t constrain β, and even with a constraint, β might saturate depending on what happens during training.
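If you want to try it anyway, here’s a minimal sketch of a learnable-β swish, assuming PyTorch (the `Swish` module name and the clamp range are my own illustrative choices, not a standard API):

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    """f(x) = x * sigmoid(beta * x), with beta either fixed or learned."""

    def __init__(self, beta: float = 1.0, learnable: bool = False):
        super().__init__()
        if learnable:
            # Registered as a parameter, so the optimizer updates it
            # alongside the model's weights.
            self.beta = nn.Parameter(torch.tensor(beta))
        else:
            # Registered as a buffer: moves with .to(device) but isn't trained.
            self.register_buffer("beta", torch.tensor(beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clamping beta is one way to guard against the vanishing/exploding
        # gradients mentioned above; the [0.1, 10] range here is arbitrary.
        beta = self.beta.clamp(0.1, 10.0)
        return x * torch.sigmoid(beta * x)
```

Drop it in wherever you’d use nn.ReLU(), e.g. act = Swish(learnable=True).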
The choice of activation function itself is more impactful than allowing it to be dynamic/learned.
Happy learning!