What is a Neural Network?
Neural networks are machine learning models that mimic the complex functions of the human brain. These models consist of interconnected nodes, or neurons, that process data, learn patterns, and enable tasks such as pattern recognition and decision-making.
History and Evolution of Neural Networks
1943: First Artificial Neuron Model by McCulloch and Pitts
1958: Rosenblatt Develops the Perceptron
1969: Minsky and Papert Expose Perceptron Limitations, Contributing to the First AI Winter
1986: Rumelhart, Hinton and Williams Popularize Backpropagation
2012: A CNN Wins the ImageNet Competition, Boosting Deep Learning
2017: Transformers Advance AI in NLP and Generative Models
Neural networks are capable of learning and identifying patterns directly from data without pre-defined rules. These networks are built from several key components:
Neurons: The basic units that receive inputs; each neuron is governed by a threshold and an activation function.
Connections: Links between neurons that carry information, regulated by weights and biases.
Weights and Biases: These parameters determine the strength and influence of connections.
Propagation Functions: Mechanisms that help process and transfer data across layers of neurons.
Learning Rule: The method that adjusts weights and biases over time to improve accuracy.
Learning in Neural Networks
Learning in neural networks follows a structured, three-stage process:
Input Computation: Data is fed into the network.
Output Generation: Based on the current parameters, the network generates an output.
Iterative Refinement: The network refines its output by adjusting weights and biases, gradually improving its performance on diverse tasks.
In an adaptive learning environment:
The neural network is exposed to a simulated scenario or dataset.
Parameters such as weights and biases are updated in response to new data or conditions.
With each adjustment, the network's response evolves, allowing it to adapt effectively to different tasks or environments.
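As a concrete illustration of these components, the sketch below builds a single artificial neuron in Python. The inputs, weights, and bias are made-up values for illustration, and a sigmoid plays the role of the activation function.

```python
import math

def neuron(inputs, weights, bias):
    """A single artificial neuron: weighted sum of inputs plus bias,
    passed through a sigmoid activation that squashes the result
    into the range (0, 1)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-z))

# Made-up weights and bias, purely for illustration.
output = neuron(inputs=[1.0, 0.0, 1.0], weights=[0.5, -0.2, 0.3], bias=0.1)
print(round(output, 3))
```

A learning rule would then nudge `weights` and `bias` whenever this output disagrees with the desired one, which is exactly the three-stage process described above.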
Importance of Neural Networks
Identify Complex Patterns: Recognize intricate structures and relationships in data; adapt to dynamic and changing environments.
Learn from Data: Handle vast datasets efficiently; improve performance with experience and retraining.
Drive Key Technologies: Power natural language processing (NLP); enable self-driving vehicles; support automated decision-making systems.
Boost Efficiency: Streamline workflows and processes; enhance productivity across industries.
Backbone of AI: Serve as the core driver of artificial intelligence progress; continue shaping the future of technology and innovation.
Layers in Neural Network Architecture
A neural network typically consists of three types of layers:
Input Layer: This is where the network receives its input data. Each input neuron in the layer corresponds to a feature in the input data.
Hidden Layers: These layers perform most of the computational heavy lifting. A neural network can have one or multiple hidden layers. Each layer consists of units (neurons) that transform the inputs into something that the output layer can use.
Output Layer: The final layer produces the output of the model. The format of these outputs varies depending on the specific task, such as classification or regression.
Working of Neural Networks
1. Forward Propagation
When data is input into the network, it passes through the network in the forward direction, from the input layer through the hidden layers to the output layer. This process is known as forward propagation.
Linear Transformation: Each neuron in a layer receives inputs which are multiplied by the weights associated with the connections. These products are summed together and a bias is added to the sum.
This can be represented mathematically as: $$z = w_1x_1 + w_2x_2 + \dots + w_nx_n + b$$ where $w_i$ are the weights, $x_i$ are the inputs and $b$ is the bias.
Activation: The result of the linear transformation (denoted as $z$) is then passed through an activation function. The activation function is crucial because it introduces non-linearity into the system, enabling the network to learn more complex patterns. Popular activation functions include ReLU, sigmoid and tanh.
2. Backpropagation
After forward propagation, the network evaluates its performance using a loss function which measures the difference between the actual output and the predicted output. The goal of training is to minimize this loss. This is where backpropagation comes into play:
Loss Calculation: The network calculates the loss which provides a measure of error in the predictions. The loss function could vary; common choices are mean squared error for regression tasks or cross-entropy loss for classification.
Gradient Calculation: The network computes the gradients of the loss function with respect to each weight and bias in the network. This involves applying the chain rule of calculus to find out how much each part of the output error can be attributed to each weight and bias.
Weight Update: Once the gradients are calculated, the weights and biases are updated using an optimization algorithm like stochastic gradient descent (SGD). The weights are adjusted in the opposite direction of the gradient to minimize the loss. The size of the step taken in each update is determined by the learning rate.
3. Iteration
This process of forward propagation, loss calculation, backpropagation, and weight update is repeated for many iterations over the dataset. Over time, this iterative process reduces the loss and the network's predictions become more accurate.
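The cycle above (forward pass, loss calculation, gradient step, repeated over the data) can be sketched for a single linear neuron. The toy data and learning rate below are illustrative choices, not values from the text.

```python
# Toy data following y = 2x (illustrative values).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, b = 0.0, 0.0          # start from arbitrary initial parameters
lr = 0.05                # learning rate

for epoch in range(200):
    for x, y in data:
        y_pred = w * x + b         # 1. forward propagation
        error = y_pred - y         # 2. loss is error ** 2
        grad_w = 2 * error * x     # 3. chain rule: d(error^2)/dw
        grad_b = 2 * error         #    and d(error^2)/db
        w -= lr * grad_w           # 4. step against the gradient
        b -= lr * grad_b

print(round(w, 2), round(b, 2))
```

After enough iterations the parameters settle near $w = 2$, $b = 0$, i.e. the loss has been driven toward its minimum, which is the whole point of the iterative refinement described above.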
Through these steps, neural networks can adapt their parameters to better approximate the relationships in the data, thereby improving their performance on tasks such as classification, regression, or any other predictive modeling.
Example of Email Classification
Let's consider a record of an email dataset:
Email ID: 1 | Email Content: "Get free gift cards now!" | Sender: spam@example.com | Subject Line: "Exclusive Offer" | Label: 1 (spam)
To classify this email, we will create a feature vector based on the analysis of keywords such as "free", "win", and "offer". The feature vector of the record is:
"free": Present (1)
"win": Absent (0)
"offer": Present (1)
How Neurons Process Data in a Neural Network
In a neural network, input data is passed through multiple layers, including one or more hidden layers. Each neuron in these hidden layers performs several operations, transforming the input into a usable output.
Input Layer: The input layer contains 3 nodes that indicate the presence of each keyword.
Hidden Layer: The input vector is passed through the hidden layer. Each neuron in the hidden layer performs two primary operations: a weighted sum followed by an activation function.
Weights:
Neuron H1: $[0.5, -0.2, 0.3]$
Neuron H2: $[0.4, 0.1, -0.5]$
Input Vector: $[1, 0, 1]$
Weighted Sum Calculation
For H1: $(1 \times 0.5) + (0 \times -0.2) + (1 \times 0.3) = 0.5 + 0 + 0.3 = 0.8$
For H2: $(1 \times 0.4) + (0 \times 0.1) + (1 \times -0.5) = 0.4 + 0 - 0.5 = -0.1$
Activation Function (ReLU)
Here we use the ReLU activation function:
H1 Output: $\text{ReLU}(0.8) = 0.8$
H2 Output: $\text{ReLU}(-0.1) = 0$
Output Layer: The activated values from the hidden neurons are sent to the output neuron, where they are again processed using a weighted sum and an activation function.
Output Weights: $[0.7, 0.2]$
Input from Hidden Layer: $[0.8, 0]$
Weighted Sum: $(0.8 \times 0.7) + (0 \times 0.2) = 0.56 + 0 = 0.56$
Activation (Sigmoid): $\sigma(0.56) = \frac{1}{1+e^{-0.56}} \approx 0.636$
Final Classification: The output value of approximately $0.636$ indicates the probability of the email being spam. Since this value is greater than $0.5$, the neural network classifies the email as spam (1).
Learning of a Neural Network
Learning with Supervised Learning
In supervised learning, a neural network learns from labeled input-output pairs provided by a teacher. The network generates outputs based on inputs and, by comparing these outputs to the known desired outputs, an error signal is created. The network iteratively adjusts its parameters to minimize errors until it reaches an acceptable performance level.
Learning with Unsupervised Learning
Unsupervised learning involves data without labeled output variables. The primary goal is to understand the underlying structure of the input data ($X$). Unlike supervised learning, there is no instructor to guide the process. Instead, the focus is on modeling data patterns and relationships, with techniques like clustering and association commonly used.
Learning with Reinforcement Learning
Reinforcement learning enables a neural network to learn through interaction with its environment. The network receives feedback in the form of rewards or penalties, guiding it to find an optimal policy or strategy that maximizes cumulative rewards over time. This approach is widely used in applications like gaming and decision-making.
Types of Neural Networks
Commonly used types of neural networks include:
Feedforward Networks: It is a simple artificial neural network architecture in which data moves from input to output in a single direction.
Single-layer Perceptron: It has one layer and it applies weights, sums inputs, and uses activation to produce output.
Multilayer Perceptron (MLP): It is a type of feedforward neural network with three or more layers, including an input layer, one or more hidden layers, and an output layer. It uses nonlinear activation functions.
Convolutional Neural Network (CNN): It is designed for image processing. It uses convolutional layers to automatically learn features from images, enabling effective image recognition and classification.
Recurrent Neural Network (RNN): Handles sequential data using feedback loops to retain context over time.
Long Short-Term Memory (LSTM): A type of RNN with memory cells and gates to handle long-term dependencies and avoid vanishing gradients.
What is Perceptron?
A Perceptron is the simplest form of a neural network that makes decisions by combining inputs with weights and applying an activation function. It is mainly used for binary classification problems. It forms the basic building block of many deep learning models.
Takes multiple inputs and assigns weights
Computes a weighted sum and applies a threshold
Outputs either 0 or 1 (binary outcome)
Forms the foundation of larger neural networks
Core Components
Inputs ($x_1, x_2, \dots, x_n$): These are the features or measurable attributes of a data point that the perceptron uses to make a decision. Each input provides a signal that contributes to the final output. Example: For an OR gate, the inputs are binary: $(x_1, x_2) \in \{0,1\}^2$. Inputs themselves have no inherent influence unless multiplied by weights.
Weights ($w_1, w_2, \dots, w_n$): Weights determine how strongly each input contributes to the prediction. A larger weight means the corresponding input has a higher impact. Weights are learned during training, adjusting based on errors. They act like importance scores for each feature.
Bias ($b$): The bias is a constant value added to the weighted sum to shift the decision boundary. It allows the perceptron to classify correctly even when all input features are zero.
Bias ensures the model is not forced to pass the decision boundary through the origin.
Difference Between Weights and Bias
Weights control how much each input influences the output. Bias controls when the perceptron activates, independent of any input. Mathematically, weights tilt the decision line and bias shifts the line up/down or left/right.
Net Input (Weighted Sum): This is the combined effect of all inputs and their weights: $$z = \sum_{i=1}^n w_ix_i + b$$ It represents the activation strength before passing through the activation function. Whether $z$ is high or low enough determines the final class.
Activation Function (Step Function): The activation function converts the numerical input into a binary output: $$\hat{y} = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases}$$ It introduces a non-linearity in the decision-making, although the decision boundary remains linear. The output is always 0 or 1, making perceptrons suitable for binary classification.
Fundamentals of Neural Network
A neural network extends the perceptron by connecting many neurons across multiple layers.
Input layer: The input layer provides the network with the raw feature vector: $$x = (x_1, x_2, \dots, x_n)$$ No computation happens here; it simply passes the input values to the next layer.
Hidden layers: Hidden layers contain multiple perceptrons (neurons) that learn intermediate representations of the data.
Hidden Layer Computation: $$z^{(1)} = W^{(1)}x + b^{(1)}$$ $$a^{(1)} = \sigma(z^{(1)})$$ where: $W^{(1)}$ is the weight matrix for the hidden layer, $b^{(1)}$ is the bias vector, $\sigma$ is a non-linear activation function (ReLU, Sigmoid, Tanh, etc.). Hidden layers identify complex patterns not visible from the raw input alone. Adding more hidden layers improves model expressiveness.
Output layer: The output layer produces the final prediction, which may be binary, multi-class or a continuous value.
Output Layer Computation: $$z^{(2)} = W^{(2)}a^{(1)} + b^{(2)}$$ $$\hat{y} = \sigma(z^{(2)})$$ The output activation depends on the task:
Sigmoid: binary classification
Softmax: multi-class classification
Linear: regression
Because of multiple layers and non-linear activations, neural networks can model complex, non-linear decision boundaries, while a single perceptron can only model a straight line.
Working (Perceptron Training)
Training a perceptron means finding suitable weights $w_i$ and bias $b$ such that most training points are correctly classified.
Compute the Weighted Sum: The perceptron first calculates a weighted combination of the input features, along with a bias term that helps shift the decision boundary. $$z = \sum_{i=1}^n w_ix_i + b$$
Apply the Activation Function (Step Function): The perceptron uses a simple threshold activation to convert the numerical value into a binary class label. $$\hat{y} = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases}$$
Compare Prediction with Actual Output: The perceptron checks if the predicted output matches the true label. $$\text{error} = y - \hat{y}$$
Update the Weights (Learning Rule): Whenever the perceptron misclassifies a sample, it updates each weight by an amount proportional to the error and the input value. $$w_i \leftarrow w_i + \eta (y - \hat{y}) x_i$$
Update the Bias Term: The bias is adjusted similarly to shift the decision boundary left or right. $$b \leftarrow b + \eta (y - \hat{y})$$
Repeat for All Samples Across Multiple Epochs: The perceptron cycles through the entire dataset several times (epochs), refining weights gradually until it reaches a stable solution.
Final Learned Model: After training, the perceptron produces predictions using: $$\hat{y} = \text{step}(W^Tx + b)$$
Activation Functions
1. Sigmoid Activation Function
Formula: $$\sigma(x) = \frac{1}{1+e^{-x}}$$
Meaning: It squeezes any value into the range 0 to 1. Works like an "ON/OFF but smooth" function.
Use: Used in binary classification (output probability). Good when you need an output between 0 and 1.
Problem: Can cause vanishing gradients (very small gradients).
2. ReLU (Rectified Linear Unit)
Formula: $$\text{ReLU}(x) = \max(0, x)$$
Meaning: If the value is positive $\rightarrow$ keep it; if negative $\rightarrow$ make it 0.
Use: The most popular activation in hidden layers of deep learning. Helps models train faster.
Advantages: Simple. Avoids the vanishing gradient on the positive side.
Problem: For negative values, the gradient becomes 0 $\rightarrow$ the Dead ReLU problem.
3. Hyperbolic Tangent (tanh)
Formula: $$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
Range: -1 to +1
Meaning: Similar to sigmoid but centered at 0. Positive values $\rightarrow$ positive output; negative values $\rightarrow$ negative output.
Use: Better than sigmoid in many cases because of its zero-centered output.
Problem: Still suffers from vanishing gradients.
4. Softmax Activation Function
Formula: $$\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
Meaning: Converts a list of numbers into probabilities that sum to 1. Example: an output like $[2.3, 1.0, 0.2] \rightarrow \text{softmax} \rightarrow \approx [0.72, 0.20, 0.09]$
Use: Used in multi-class classification (3 or more classes).
Important: Always used in the output layer, not in hidden layers.
What is Gradient Descent?
Gradient Descent is an iterative optimization algorithm used to minimize a cost function by adjusting model parameters in the direction of the steepest descent of the function's gradient. In simple terms, it finds the optimal values of weights and biases by gradually reducing the error between predicted and actual outputs.
Imagine you're at the top of a hill and your goal is to find the lowest point in the valley. You can't see the entire valley from the top, but you can feel the slope under your feet.
Start at the Top: You begin at the top of the hill (this is like starting with random guesses for the model's parameters).
Feel the Slope: You look around to find out which direction the ground is sloping down. This is like calculating the gradient, which tells you the steepest way downhill.
Take a Step Down: Move in the direction where the slope is steepest (this is adjusting the model's parameters). The bigger the slope, the bigger the step you take.
Repeat: You keep repeating the process, feeling the slope and moving downhill, until you reach the bottom of the valley (this is when the model has learned and minimized the error).
The key idea is that, just like walking down a hill, Gradient Descent moves towards the "bottom" or minimum of the loss function, which represents the error in predictions. Moving in the opposite direction of the gradient allows the algorithm to gradually descend towards lower values of the function and eventually reach its minimum. These gradients guide the updates, ensuring convergence towards the optimal parameter values. The size of the steps taken during descent is defined by the learning rate.
What is Learning Rate?
The learning rate is an important hyperparameter in gradient descent that controls how big or small the steps are when updating the model parameters. It determines how quickly or slowly the algorithm converges toward the minimum of the cost function.
If the learning rate is too small: The algorithm takes tiny steps during each iteration and converges very slowly. This can significantly increase training time and computational cost, especially for large datasets.
If the learning rate is too big: The algorithm may take huge steps, overshooting the minimum of the cost function without settling. It may fail to converge and instead oscillate or diverge; such unchecked growth of the updates is closely related to the exploding gradient problem.
To address unstable gradients, several techniques can be used:
Careful weight initialization: Initializing the weights so that they lie in an appropriate range helps keep activations and gradients well-scaled.
Using a different activation function, such as the Rectified Linear Unit (ReLU), can help mitigate the vanishing gradient problem.
Gradient clipping: Restrict the gradients to a predefined range to prevent them from becoming excessively large or small.
Batch normalization: Normalizing the input of each layer prevents activation functions from saturating, reducing the vanishing and exploding gradient problems.
Choosing the right learning rate leads to fast and stable convergence, improving the efficiency of the training process, although in very deep networks vanishing and exploding gradients can still arise.
Mathematics Behind Gradient Descent
For simplicity, let's consider a linear regression model with a single input feature $x$ and target $y$. The loss function (or cost function) over $n$ data points is the Mean Squared Error (MSE): $$J(w,b) = \frac{1}{n} \sum_{i=1}^n (y_p - y)^2$$ Here: $y_p = xw + b$ is the predicted value, $w$ is the weight (slope of the line), $b$ is the bias (intercept), and $n$ is the number of data points. To optimize the model parameter $w$, we compute the gradient of the loss function with respect to $w$. This involves taking the partial derivative of $J(w,b)$.
The gradient with respect to $w$ is: $$\frac{\partial J(w,b)}{\partial w} = \frac{\partial}{\partial w} \left[ \frac{1}{n} \sum_{i=1}^n (y_p - y)^2 \right]$$ $$\frac{\partial J(w,b)}{\partial w} = \frac{2}{n} \sum_{i=1}^n (y_p - y) \frac{\partial}{\partial w}(y_p - y)$$ Substituting $y_p = xw + b$: $$\frac{\partial J(w,b)}{\partial w} = \frac{2}{n} \sum_{i=1}^n (xw + b - y) x$$ Final gradient with respect to $w$: $$\frac{\partial J(w,b)}{\partial w} = \frac{2}{n} \sum_{i=1}^n (y_p - y) x$$
Gradient Descent Update: Once the gradient is calculated, we update the parameter $w$ in the direction opposite to the gradient (to minimize the loss function): $$w \leftarrow w - \gamma \frac{\partial J(w,b)}{\partial w}$$ Here $\gamma$ is the learning rate (the step size for each update) and $\frac{\partial J(w,b)}{\partial w}$ is the gradient with respect to $w$. When the gradient is positive, subtracting it decreases $w$ and hence reduces the cost function. When the gradient is negative, the same rule applies: subtracting a negative quantity increases $w$, again moving it toward the minimum.
Working of Gradient Descent
Step 1: Initialize the parameters of the model randomly.
Step 2: Compute the gradient of the cost function with respect to each parameter. This involves taking the partial derivative of the cost function with respect to each parameter.
Step 3: Update the parameters of the model by taking steps in the opposite direction of the gradient. Here we choose a hyperparameter, the learning rate, denoted by $\gamma$, which decides the size of each step.
Step 4: Repeat steps 2 and 3 iteratively until the best parameters for the defined model are found.
What is Forward Propagation in Neural Networks?
Forward propagation is the fundamental process in a neural network where input data passes through multiple layers to generate an output.
It is the process by which input data passes through each layer of a neural network to generate output. This process takes place before backpropagation updates the weights. It determines the output of a neural network for a given set of inputs and the current state of the model parameters (weights and biases). Understanding this process helps in optimizing neural networks for various tasks like classification, regression, and more. Below is the step-by-step working of forward propagation:
Input Layer
The input data is fed into the network through the input layer. Each feature in the input dataset corresponds to a neuron in this layer. The input is usually normalized or standardized to improve model performance.
Hidden Layers
The input moves through one or more hidden layers where transformations occur. Each neuron in a hidden layer computes a weighted sum of inputs and applies an activation function to introduce non-linearity. Each neuron receives inputs and computes $z = Wx + b$, where: $W$ is the weight matrix, $x$ is the input vector, $b$ is the bias term. An activation function such as ReLU or sigmoid is then applied.
Output Layer
The last layer in the network generates the final prediction. The activation function of this layer depends on the type of problem: Softmax (for multi-class classification), Sigmoid (for binary classification), Linear (for regression tasks).
Prediction
The network produces an output based on the current weights and biases. The loss function evaluates the error by comparing the predicted output with the actual values.
Mathematical Explanation of Forward Propagation
Consider a neural network with one input layer, two hidden layers, and one output layer.
1. Layer 1 (First Hidden Layer)
The transformation is: $A^{[1]} = \sigma(W^{[1]}X + b^{[1]})$ where: $W^{[1]}$ is the weight matrix, $X$ is the input vector, $b^{[1]}$ is the bias vector, $\sigma$ is the activation function.
2. Layer 2 (Second Hidden Layer)
$A^{[2]} = \sigma(W^{[2]}A^{[1]} + b^{[2]})$
3. Output Layer
$Y = \sigma(W^{[3]}A^{[2]} + b^{[3]})$ where $Y$ is the final output. Thus the complete equation for forward propagation is: $Y = \sigma(W^{[3]}\,\sigma(W^{[2]}\,\sigma(W^{[1]}X + b^{[1]}) + b^{[2]}) + b^{[3]})$
This equation illustrates how data flows through the network: Weights ($W$) determine the importance of each input. Biases ($b$) adjust activation thresholds. Activation functions ($\sigma$) introduce non-linearity to enable complex decision boundaries.
Backpropagation in Neural Network
Backpropagation, short for Backward Propagation of Errors, is a key algorithm used to train neural networks by minimizing the difference between predicted and actual outputs. It works by propagating errors backward through the network, using the chain rule of calculus to compute gradients and then iteratively updating the weights and biases. Combined with optimization techniques like gradient descent, backpropagation enables the model to reduce loss across epochs and effectively learn complex patterns from data. Backpropagation plays a critical role in how neural networks improve over time. Here's why:
Efficient Weight Update: It computes the gradient of the loss function with respect to each weight using the chain rule, making it possible to update weights efficiently.
Scalability: The Backpropagation algorithm scales well to networks with multiple layers and complex architectures, making deep learning feasible.
Automated Learning: With Backpropagation, the learning process becomes automated and the model can adjust itself to optimize its performance.
Working of Backpropagation Algorithm
The Backpropagation algorithm involves two main steps: the Forward Pass and the Backward Pass.
1. Forward Pass
In the forward pass, input data is fed into the input layer. These inputs, combined with their respective weights, are passed to hidden layers. For example, in a network with two hidden layers (h1 and h2), the output from h1 serves as the input to h2.
Before applying an activation function, a bias is added to the weighted inputs. Each hidden layer computes the weighted sum ('a') of the inputs, then applies an activation function like ReLU (Rectified Linear Unit) to obtain the output ('o'). The output is passed to the next layer, where an activation function such as softmax converts the weighted outputs into probabilities for classification.
2. Backward Pass
In the backward pass, the error (the difference between the predicted and actual output) is propagated back through the network to adjust the weights and biases. One common measure of the error is the squared error: $$\text{Error} = (\text{Predicted Output} - \text{Actual Output})^2$$ Once the error is calculated, the network adjusts the weights using gradients, which are computed with the chain rule. These gradients indicate how much each weight and bias should be adjusted to minimize the error in the next iteration. The backward pass continues layer by layer, ensuring that the network learns and improves its performance. The activation function, through its derivative, plays a crucial role in computing these gradients during Backpropagation.
Example of Backpropagation in Machine Learning
Let's walk through an example of Backpropagation in machine learning. Assume the neurons use the sigmoid activation function for the forward and backward pass. The target output is 0.5 and the learning rate is 1.
Forward Propagation
1. Initial Calculation
The weighted sum at each node is calculated using: $$a_j = \sum (w_{i,j} \times x_i)$$ Where $a_j$ is the weighted sum of all the inputs and weights at each node, $w_{i,j}$ represents the weight between the $i^{th}$ input and the $j^{th}$ neuron, and $x_i$ represents the value of the $i^{th}$ input.
Output ($o_j$): After applying the activation function to $a_j$, we get the output of the neuron: $$o_j = \text{activation function}(a_j)$$
2. Sigmoid Function
The sigmoid function returns a value between 0 and 1, introducing non-linearity into the model. $$y_j = \frac{1}{1+e^{-a_j}}$$
3. Computing Outputs
At the h1 node: $a_1 = (w_{1,1}x_1) + (w_{2,1}x_2) = (0.2 \times 0.35) + (0.2 \times 0.7) = 0.21$
Once we have calculated $a_1$, we can find $y_3$: $y_3 = F(a_1) = \frac{1}{1+e^{-0.21}} \approx 0.552$
Similarly we find the values of $y_4$ at h2 and $y_5$ at O3:
$a_2 = (w_{1,2}x_1) + (w_{2,2}x_2) = (0.3 \times 0.35) + (0.3 \times 0.7) = 0.315$
$y_4 = F(0.315) = \frac{1}{1+e^{-0.315}} \approx 0.578$
$a_3 = (w_{1,3}y_3) + (w_{2,3}y_4) = (0.3 \times 0.552) + (0.9 \times 0.578) \approx 0.686$
$y_5 = F(0.686) = \frac{1}{1+e^{-0.686}} \approx 0.665$
4. Error Calculation
Our target output is 0.5 but we obtained 0.665. The error is: $$\text{Error}_j = y_{\text{target}} - y_j$$ $\text{Error}_5 = 0.5 - 0.665 = -0.165$ Using this error value we will be backpropagating.
Back Propagation
1. Calculating Gradients
The change in each weight is calculated as: $$\Delta w_{ij} = \eta \times \delta_j \times O_i$$ Where $\delta_j$ is the error term for each unit and $\eta$ is the learning rate.
2. Output Unit Error
For O3: $$\delta_5 = y_5(1-y_5)(y_{\text{target}} - y_5)$$ $\delta_5 = 0.665 \times (1 - 0.665) \times (-0.165) \approx -0.0368$
3. Hidden Unit Error
For h1: $$\delta_3 = y_3(1-y_3)(w_{1,3} \times \delta_5)$$ $\delta_3 = 0.552 \times (1 - 0.552) \times (0.3 \times -0.0368) \approx -0.00273$
For h2: $$\delta_4 = y_4(1-y_4)(w_{2,3} \times \delta_5)$$ $\delta_4 = 0.578 \times (1 - 0.578) \times (0.9 \times -0.0368) \approx -0.00807$
4. Weight Updates
For the weights from hidden to output layer ($\eta = 1$):
$\Delta w_{2,3} = \eta \times \delta_5 \times y_4 = -0.0368 \times 0.578 \approx -0.0213$
New weight: $w_{2,3}(\text{new}) = w_{2,3} + \Delta w_{2,3} = 0.9 - 0.0213 = 0.8787$
For the weights from input to hidden layer ($\eta = 1$):
$\Delta w_{1,1} = \eta \times \delta_3 \times x_1 = -0.00273 \times 0.35 \approx -0.00095$
New weight: $w_{1,1}(\text{new}) = w_{1,1} + \Delta w_{1,1} = 0.2 - 0.00095 = 0.19905$
Similarly the other weights are updated:
$w_{1,2}(\text{new}) \approx 0.2972$
$w_{1,3}(\text{new}) \approx 0.2797$
$w_{2,1}(\text{new}) \approx 0.1981$
$w_{2,2}(\text{new}) \approx 0.2943$
Through the backward pass, the weights are updated. Repeating the forward pass with the new weights gives:
$y_3 \approx 0.552$
$y_4 \approx 0.577$
$y_5 \approx 0.660$
The new error, $0.5 - 0.660 = -0.160$, is smaller in magnitude than before, so the network has moved closer to the target. Since $y_5$ is still not the target output, the process of calculating the error and backpropagating continues until the loss is acceptably small. This demonstrates how Backpropagation iteratively updates the weights to minimize the error until the network accurately predicts the output.
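The worked example can be checked numerically. The sketch below reproduces the forward and backward pass with the weights, inputs, target, and learning rate given above; tiny differences from the rounded hand calculations are expected.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Inputs, target and learning rate from the example.
x1, x2, target, lr = 0.35, 0.7, 0.5, 1.0
w11, w21 = 0.2, 0.2   # inputs -> h1
w12, w22 = 0.3, 0.3   # inputs -> h2
w13, w23 = 0.3, 0.9   # h1, h2 -> output

# Forward pass
y3 = sigmoid(w11 * x1 + w21 * x2)   # h1 output, ~0.552
y4 = sigmoid(w12 * x1 + w22 * x2)   # h2 output, ~0.578
y5 = sigmoid(w13 * y3 + w23 * y4)   # network output, ~0.665

# Backward pass: the sigmoid derivative is y * (1 - y)
d5 = y5 * (1 - y5) * (target - y5)  # output unit error
d3 = y3 * (1 - y3) * (w13 * d5)     # h1 error
d4 = y4 * (1 - y4) * (w23 * d5)     # h2 error

# Weight updates: w <- w + lr * delta * (input feeding that weight)
w13 += lr * d5 * y3
w23 += lr * d5 * y4
w11 += lr * d3 * x1
w21 += lr * d3 * x2
w12 += lr * d4 * x1
w22 += lr * d4 * x2

# Repeat the forward pass with the updated weights
y5_new = sigmoid(w13 * sigmoid(w11 * x1 + w21 * x2)
                 + w23 * sigmoid(w12 * x1 + w22 * x2))
print(round(y5, 3), round(y5_new, 3))  # y5_new moves toward the 0.5 target
```

Running the forward/backward cycle in a loop drives the output arbitrarily close to 0.5, which is the iterative process the example describes.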
Linearly Separable Data
Linearly separable data means you can draw a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions) that perfectly separates the classes. In simple words: a single straight line is enough to divide the data into two groups without any mistakes.
Example (Simple Explanation)
Imagine you have two types of points: red points on the left and blue points on the right. If you can draw one straight line such that all red points are on one side and all blue points are on the other side, then the data is linearly separable.
Formal Definition
A dataset is linearly separable if there exists a hyperplane $$W^T x + b = 0$$ such that: for Class 1, $W^T x + b > 0$; for Class 2, $W^T x + b < 0$. This means the classes can be separated without any overlap using one straight boundary.
Where Linearly Separable Data Occurs
Simple classification problems
AND, OR logical functions
Perfectly clean 2-class datasets
Non-Linearly Separable Example
XOR problem: no single straight line can separate the two classes $\rightarrow$ not linearly separable.
Why It Matters?
Linear classifiers like Logistic Regression, the Perceptron, and Linear SVM work best on linearly separable data. If data is not linearly separable, we need: kernels (SVM with RBF), neural networks, or feature transformations.
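A quick way to see the difference is to train a perceptron, using the learning rule from earlier, on OR versus XOR. The learning rate and epoch count below are arbitrary illustrative choices.

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """Perceptron learning rule: w_i <- w_i + lr * (y - y_hat) * x_i."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in samples:
            y_hat = 1 if w1 * x1 + w2 * x2 + b >= 0 else 0
            w1 += lr * (y - y_hat) * x1
            w2 += lr * (y - y_hat) * x2
            b += lr * (y - y_hat)
    return w1, w2, b

def accuracy(samples, w1, w2, b):
    hits = sum((1 if w1 * x1 + w2 * x2 + b >= 0 else 0) == y
               for (x1, x2), y in samples)
    return hits / len(samples)

OR  = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(accuracy(OR, *train_perceptron(OR)))    # OR is linearly separable
print(accuracy(XOR, *train_perceptron(XOR)))  # XOR is not
```

On OR the perceptron converges to a perfect separating line, while on XOR no setting of $w_1, w_2, b$ can classify all four points, so its accuracy stays below 100% no matter how long it trains.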