Sigmoid vs ReLU Activation Functions: The Inference Cost of Losing Geometric Context



A deep neural network can be understood as a geometric system, where each layer reshapes the input space to form increasingly complex decision boundaries. For this to work effectively, layers must preserve meaningful spatial information — particularly how far a data point lies from these boundaries — since this distance enables deeper layers to build rich, non-linear representations.

Sigmoid disrupts this process by compressing all inputs into a narrow range between 0 and 1. As values move away from decision boundaries, they become indistinguishable, causing a loss of geometric context across layers. This leads to weaker representations and limits the effectiveness of depth.

ReLU, on the other hand, preserves magnitude for positive inputs, allowing distance information to flow through the network. This enables deeper models to remain expressive without requiring excessive width or compute.
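The compression argument is easy to see numerically. As a quick illustration (not part of the article's experiment), applying both functions to points at increasing distance from a boundary at z = 0 shows Sigmoid saturating while ReLU passes the distances through:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# Points at increasing distance from a decision boundary at z = 0.
z = np.array([1.0, 3.0, 6.0, 10.0])

print(sigmoid(z))  # values crowd toward 1.0 — distances become indistinguishable
print(relu(z))     # magnitudes pass through unchanged
```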


In this article, we focus on this forward-pass behavior — analyzing how Sigmoid and ReLU differ in signal propagation and representation geometry using a two-moons experiment, and what that means for inference efficiency and scalability.



Setting up the dependencies

```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

plt.rcParams.update({
    "font.family": "monospace",
    "axes.spines.top": False,
    "axes.spines.right": False,
    "figure.facecolor": "white",
    "axes.facecolor": "#f7f7f7",
    "axes.grid": True,
    "grid.color": "#e0e0e0",
    "grid.linewidth": 0.6,
})

T = {
    "bg": "white",
    "panel": "#f7f7f7",
    "sig": "#e05c5c",
    "relu": "#3a7bd5",
    "c0": "#f4a261",
    "c1": "#2a9d8f",
    "text": "#1a1a1a",
    "muted": "#666666",
}
```

Creating the dataset

To study the effect of activation functions in a controlled setting, we first generate a synthetic dataset using scikit-learn’s make_moons. This creates a non-linear, two-class problem where simple linear boundaries fail, making it ideal for testing how well neural networks learn complex decision surfaces.

We add a small amount of noise to make the task more realistic, then standardize the features using StandardScaler so both dimensions are on the same scale — ensuring stable training. The dataset is then split into training and test sets to evaluate generalization.

Finally, we visualize the data distribution. This plot serves as the baseline geometry that both Sigmoid and ReLU networks will attempt to model, allowing us to later compare how each activation function transforms this space across layers.

```python
X, y = make_moons(n_samples=400, noise=0.18, random_state=42)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

fig, ax = plt.subplots(figsize=(7, 5))
fig.patch.set_facecolor(T["bg"])
ax.set_facecolor(T["panel"])
ax.scatter(X[y == 0, 0], X[y == 0, 1], c=T["c0"], s=40, edgecolors="white",
           linewidths=0.5, label="Class 0", alpha=0.9)
ax.scatter(X[y == 1, 0], X[y == 1, 1], c=T["c1"], s=40, edgecolors="white",
           linewidths=0.5, label="Class 1", alpha=0.9)
ax.set_title("make_moons — our dataset", color=T["text"], fontsize=13)
ax.set_xlabel("x₁", color=T["muted"])
ax.set_ylabel("x₂", color=T["muted"])
ax.tick_params(colors=T["muted"])
ax.legend(fontsize=10)
plt.tight_layout()
plt.savefig("moons_dataset.png", dpi=140, bbox_inches="tight")
plt.show()
```


Creating the Network

Next, we implement a small, controlled neural network to isolate the effect of activation functions. The goal here is not to build a highly optimized model, but to create a clean experimental setup where Sigmoid and ReLU can be compared under identical conditions.

We define both activation functions (Sigmoid and ReLU) along with their derivatives, and use binary cross-entropy as the loss since this is a binary classification task. The TwoLayerNet class represents a simple 3-layer feedforward network (2 hidden layers + output), where the only configurable component is the activation function.

A key detail is the initialization strategy: we use He initialization for ReLU and Xavier initialization for Sigmoid, ensuring that each network starts in a fair and stable regime based on its activation dynamics.

The forward pass computes activations layer by layer, while the backward pass performs standard gradient descent updates.
The implementation also includes diagnostic methods, get_hidden and get_z_trace, which let us inspect how signals evolve across layers — essential for analyzing where geometric information is preserved or lost.
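The article's implementation itself is not reproduced here, so the following is a minimal sketch of the setup it describes — both activations with their derivatives, binary cross-entropy, a TwoLayerNet with He/Xavier initialization, and the get_hidden / get_z_trace diagnostics. The hidden width and learning rate are illustrative assumptions, not the article's values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(a):
    return a * (1.0 - a)          # derivative expressed via the activation

def relu(z):
    return np.maximum(0.0, z)

def relu_deriv(z):
    return (z > 0).astype(float)  # derivative expressed via the pre-activation

def bce_loss(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

class TwoLayerNet:
    """Two hidden layers + sigmoid output; the hidden activation is the only knob."""
    def __init__(self, activation="relu", hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.activation = activation
        dims = [(2, hidden), (hidden, hidden), (hidden, 1)]
        # He initialization for ReLU, Xavier for Sigmoid, as described above.
        scale = (lambda fan_in: np.sqrt(2.0 / fan_in)) if activation == "relu" \
            else (lambda fan_in: np.sqrt(1.0 / fan_in))
        self.W = [rng.normal(0, scale(i), size=(i, o)) for i, o in dims]
        self.b = [np.zeros(o) for _, o in dims]

    def _act(self, z):
        return relu(z) if self.activation == "relu" else sigmoid(z)

    def forward(self, X):
        self.z1 = X @ self.W[0] + self.b[0]; self.a1 = self._act(self.z1)
        self.z2 = self.a1 @ self.W[1] + self.b[1]; self.a2 = self._act(self.z2)
        self.z3 = self.a2 @ self.W[2] + self.b[2]
        self.out = sigmoid(self.z3)            # sigmoid output for binary labels
        return self.out

    def backward(self, X, y, lr=0.05):
        n = len(X)
        d3 = (self.out - y.reshape(-1, 1)) / n  # combined BCE + sigmoid gradient
        if self.activation == "relu":
            d2 = d3 @ self.W[2].T * relu_deriv(self.z2)
            d1 = d2 @ self.W[1].T * relu_deriv(self.z1)
        else:
            d2 = d3 @ self.W[2].T * sigmoid_deriv(self.a2)
            d1 = d2 @ self.W[1].T * sigmoid_deriv(self.a1)
        for i, (gW, gd) in enumerate([(X.T @ d1, d1),
                                      (self.a1.T @ d2, d2),
                                      (self.a2.T @ d3, d3)]):
            self.W[i] -= lr * gW
            self.b[i] -= lr * gd.sum(axis=0)

    def get_hidden(self, X):
        """Activations of both hidden layers, for geometry inspection."""
        self.forward(X)
        return self.a1, self.a2

    def get_z_trace(self, x):
        """Mean |pre-activation| per layer for a single point."""
        self.forward(x.reshape(1, -1))
        return [np.abs(z).mean() for z in (self.z1, self.z2, self.z3)]
```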

By maintaining consistency in architecture, data, and training setup, we ensure that any performance differences or variations in internal representations can be directly attributed to the activation function itself. This approach sets the stage for a clear comparison of the impact of different activation functions on signal propagation and expressiveness.

Next, both networks are trained under identical conditions to ensure a fair comparison: one uses Sigmoid and the other ReLU, and both are initialized with the same random seed so they start from equivalent weight configurations.

During the training loop, which spans 800 epochs using mini-batch gradient descent, the training data is shuffled, split into batches, and both networks are updated in parallel. This setup ensures that the only changing variable between the two runs is the activation function.

Furthermore, we monitor the loss after each epoch and log it at regular intervals to track the training progress of both networks. Observing the evolution of each network over time allows us to analyze not only their convergence speed but also whether they continue to improve or reach a plateau.

This step is crucial as it helps identify the first signs of divergence between the two models. If both models start similarly but show different behaviors during training, it indicates that the discrepancy stems from how each activation function propagates and maintains information within the network.

Concretely, the loop runs for a fixed number of epochs with a given batch size and learning rate, updating both networks and recording each one's loss history so their progress can be compared at regular intervals.
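Since the training code itself is not shown here, the mechanics of the loop — per-epoch shuffling, mini-batch splits, parallel updates to both models, and periodic logging — can be sketched as follows. A simple logistic model stands in for the two networks, and the data, model names, and hyperparameters are illustrative, not the article's.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # toy linearly separable labels

# Stand-in for the two networks: one weight vector per "model".
# In the article's setup, the Sigmoid and ReLU nets would be updated here instead.
models = {"sigmoid-net": np.zeros(2), "relu-net": np.zeros(2)}
history = {name: [] for name in models}

def step(w, xb, yb, lr=0.1):
    """One mini-batch gradient step for a logistic model."""
    p = 1.0 / (1.0 + np.exp(-(xb @ w)))
    return w - lr * xb.T @ (p - yb) / len(xb)

epochs, batch_size = 50, 32
for epoch in range(epochs):
    order = rng.permutation(len(X))             # shuffle every epoch
    for start in range(0, len(X), batch_size):  # split into mini-batches
        idx = order[start:start + batch_size]
        for name in models:                     # update both models in parallel
            models[name] = step(models[name], X[idx], y[idx])
    for name, w in models.items():              # log loss once per epoch
        p = np.clip(1.0 / (1.0 + np.exp(-(X @ w))), 1e-12, 1 - 1e-12)
        history[name].append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
    if epoch % 10 == 0:                         # report progress at intervals
        print(epoch, {n: round(h[-1], 4) for n, h in history.items()})
```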

The training loss curves clearly illustrate the divergence between the Sigmoid and ReLU networks. Despite starting from the same initialization and training conditions, the Sigmoid network plateaus around a loss of ~0.28 by epoch 400, indicating a lack of further improvement. In contrast, the ReLU network continues to reduce loss steadily, reaching a much lower value by epoch 800. This difference highlights the limitation of Sigmoid’s compression in preserving information compared to ReLU’s ability to maintain and refine the signal, leading to continued learning.

The decision boundary plots further emphasize the difference in performance between the two networks. The Sigmoid network learns a linear boundary that struggles to capture the complex structure of the dataset, resulting in lower accuracy. On the other hand, the ReLU network adapts a non-linear boundary that closely follows the data distribution, achieving higher accuracy. This difference is attributed to ReLU’s preservation of magnitude across layers, allowing for more expressive and accurate decision boundaries.

Lastly, the layer-by-layer signal trace chart shows how the signal propagates across layers for a single data point, making it clear where the Sigmoid network loses the signal that the ReLU network retains.

This comparison reveals significant differences in how the two activation functions handle signal magnitude across depth.

Sigmoid quickly compresses the pre-activation magnitude to a narrow band, erasing meaningful differences as we move deeper into the network. In contrast, ReLU retains and amplifies the magnitude, allowing for higher values in the final layers. This results in the output neuron in the ReLU network making decisions based on a strong, well-separated signal, while the Sigmoid network struggles with a weak, compressed signal.
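This compression can be reproduced with a toy depth experiment (an illustration, not the article's figure): push standard-normal data through stacked random layers and track the spread of the pre-activations at each depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
h = rng.normal(size=(256, 64))                 # 256 points, width-64 layers
depth, width = 6, 64

def trace(activation, scale):
    """Std of pre-activations at each layer under a given init scale."""
    a, stds = h.copy(), []
    for _ in range(depth):
        W = rng.normal(0, scale, size=(width, width))
        z = a @ W
        stds.append(z.std())                   # spread of pre-activations
        a = activation(z)
    return stds

sig_stds = trace(sigmoid, np.sqrt(1.0 / width))                       # Xavier scale
relu_stds = trace(lambda z: np.maximum(0.0, z), np.sqrt(2.0 / width)) # He scale

print([round(s, 3) for s in sig_stds])   # shrinks well below the input scale
print([round(s, 3) for s in relu_stds])  # stays roughly constant with depth
```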

The hidden-space scatter plots further highlight the impact of depth. In the Sigmoid network, the classes collapse into a tight, overlapping region, becoming less separable with depth. ReLU, by contrast, preserves variation and clear class separability as depth increases, giving the output layer a much stronger signal to act on.

These observations emphasize the importance of choosing the right activation function based on the desired performance and behavior of the neural network.


Meet Arham Islam, a Civil Engineering Graduate (2022) from Jamia Millia Islamia, New Delhi. He has a strong passion for Data Science, particularly Neural Networks and their applications in various fields.

