where $\beta$ can be constant (usually set to 1) or trainable.
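For concreteness, a minimal NumPy sketch of this definition (the function name swish and the argument beta are illustrative, not from any particular library):

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x) = x / (1 + exp(-beta * x))."""
    return x / (1.0 + np.exp(-beta * x))
```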
The swish family was designed to smoothly interpolate between a linear function and the ReLU function.
When considering positive values, Swish is a particular case of the doubly parameterized sigmoid shrinkage function defined in [2] (Eq. 3). Variants of the swish function include Mish.[3]
For $\beta = 0$, swish reduces to the scaled linear function $f(x) = x/2$; for $\beta = 1$, it is the sigmoid linear unit (SiLU); and as $\beta \to \infty$, it converges pointwise to the ReLU function. Thus, the swish family smoothly interpolates between a linear function and the ReLU function.[1]
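As an illustrative numeric check of these limiting cases (restating the swish sketch above; beta = 50 stands in for "large beta"):

```python
import numpy as np

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-3.0, 3.0, 7)
print(np.allclose(swish(x, beta=0.0), x / 2.0))                 # beta = 0: linear x/2
print(np.allclose(swish(x, beta=1.0), x / (1.0 + np.exp(-x))))  # beta = 1: SiLU
print(np.allclose(swish(x, beta=50.0), np.maximum(x, 0.0)))     # large beta: approaches ReLU
```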
Since $\mathrm{swish}_\beta(x) = \tfrac{1}{\beta}\,\mathrm{swish}_1(\beta x)$, all instances of swish have the same shape as the default $\mathrm{swish}_1$, zoomed by $1/\beta$. One usually sets $\beta > 0$. When $\beta$ is trainable, this constraint can be enforced by $\beta = e^{b}$, where $b$ is trainable.
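One possible way to implement this constraint, sketched here with PyTorch (the class name TrainableSwish and the init_beta argument are made up for illustration):

```python
import torch
import torch.nn as nn

class TrainableSwish(nn.Module):
    """Swish with a trainable beta kept positive via beta = exp(b)."""
    def __init__(self, init_beta: float = 1.0):
        super().__init__()
        # Optimize b = log(beta); exponentiating guarantees beta > 0.
        self.b = nn.Parameter(torch.log(torch.tensor(float(init_beta))))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        beta = torch.exp(self.b)
        return x * torch.sigmoid(beta * x)

# Quick check of the rescaling identity swish_beta(x) = swish_1(beta * x) / beta:
act = TrainableSwish(init_beta=2.5)
x = torch.linspace(-3.0, 3.0, 7)
beta = torch.exp(act.b)
print(torch.allclose(act(x), (beta * x) * torch.sigmoid(beta * x) / beta))
```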
The SiLU was first proposed alongside the GELU in 2016,[4] and was proposed again in 2017 as the sigmoid-weighted linear unit (SiL) for reinforcement learning.[5][1] Over a year after its initial discovery, it was proposed once more under the name SWISH, originally without the learnable parameter $\beta$ (so that $\beta$ implicitly equaled 1); the swish paper was later updated to add the learnable parameter $\beta$.
Hendrycks, Dan; Gimpel, Kevin (2016). "Gaussian Error Linear Units (GELUs)". arXiv:1606.08415 [cs.LG].
Elfwing, Stefan; Uchibe, Eiji; Doya, Kenji (2017-11-02). "Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning". arXiv:1702.03118v3 [cs.LG].