<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter1</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_0.webp</image:loc>
      <image:caption>The Machine Learning Landscape. If you already know all the Machine Learning basics, you may want to skip directly to Chapter 2. If you are not sure, try to answer all the questions listed at the end of the chapter before moving on</image:caption>
      <image:title>The Machine Learning Landscape</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_1.webp</image:loc>
      <image:caption>Why Use Machine Learning? You would test your program and repeat steps 1 and 2 until it is good enough. The traditional approach</image:caption>
      <image:title>Why Use Machine Learning?</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_2.webp</image:loc>
      <image:caption>Why Use Machine Learning? In contrast, a spam filter based on Machine Learning techniques automatically learns which words and phrases are good predictors of spam by detecting unusually frequent patterns of words in the spam examples compared to the ham examples. Machine Learning approach</image:caption>
      <image:title>Why Use Machine Learning?</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_3.webp</image:loc>
      <image:caption>Why Use Machine Learning? In contrast, a spam filter based on Machine Learning techniques automatically notices that “For U” has become unusually frequent in spam flagged by users, and it starts flagging them without your intervention. Automatically adapting to change</image:caption>
      <image:title>Why Use Machine Learning?</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_4.webp</image:loc>
      <image:caption>Why Use Machine Learning? Machine Learning can help humans learn. To summarize, Machine Learning is great for</image:caption>
      <image:title>Why Use Machine Learning?</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_5.webp</image:loc>
      <image:caption>Supervised/Unsupervised Learning. In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels. A labeled training set for supervised learning (e.g., spam classification)</image:caption>
      <image:title>Supervised/Unsupervised Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_6.webp</image:loc>
      <image:caption>Supervised/Unsupervised Learning. Chapter 1: The Machine Learning Landscape. In Machine Learning an attribute is a data type (e.g., “Mileage”), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., “Mileage = 15,000”).</image:caption>
      <image:title>Supervised/Unsupervised Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_7.webp</image:loc>
      <image:caption>Supervised/Unsupervised Learning. In Machine Learning an attribute is a data type (e.g., “Mileage”), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., “Mileage = 15,000”). Regression</image:caption>
      <image:title>Supervised/Unsupervised Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_8.webp</image:loc>
      <image:caption>Supervised/Unsupervised Learning. In unsupervised learning, as you might guess, the training data is unlabeled. The system tries to learn without a teacher. An unlabeled training set for unsupervised learning</image:caption>
      <image:title>Supervised/Unsupervised Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_9.webp</image:loc>
      <image:caption>Supervised/Unsupervised Learning. Clustering. Visualization algorithms are also good examples of unsupervised learning algorithms: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted.</image:caption>
      <image:title>Supervised/Unsupervised Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_10.webp</image:loc>
      <image:caption>Supervised/Unsupervised Learning. Visualization algorithms are also good examples of unsupervised learning algorithms: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted. Example of a t-SNE visualization highlighting semantic clusters</image:caption>
      <image:title>Supervised/Unsupervised Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_11.webp</image:loc>
      <image:caption>Supervised/Unsupervised Learning. A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. It is often a good idea to try to reduce the dimension of your training data using a dimensionality reduction algorithm before you feed it to another Machine Learning algorithm (such as a supervised learning algorithm).</image:caption>
      <image:title>Supervised/Unsupervised Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_12.webp</image:loc>
      <image:caption>Supervised/Unsupervised Learning. Yet another important unsupervised task is anomaly detection — for example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm. Anomaly detection</image:caption>
      <image:title>Supervised/Unsupervised Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_13.webp</image:loc>
      <image:caption>Supervised/Unsupervised Learning. Some photo-hosting services, such as Google Photos, are good examples of this. Semisupervised learning</image:caption>
      <image:title>Supervised/Unsupervised Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_14.webp</image:loc>
      <image:caption>Supervised/Unsupervised Learning. Reinforcement Learning. For example, many robots implement Reinforcement Learning algorithms to learn how to walk.</image:caption>
      <image:title>Supervised/Unsupervised Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_15.webp</image:loc>
      <image:caption>Batch and Online Learning. In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Online learning</image:caption>
      <image:title>Batch and Online Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_16.webp</image:loc>
      <image:caption>Batch and Online Learning. Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine’s main memory (this is called out-of-core learning). This whole process is usually done offline (i.e., not on the live system), so online learning can be a confusing name. Think of it as incremental learning</image:caption>
      <image:title>Batch and Online Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_17.webp</image:loc>
      <image:caption>Batch and Online Learning. This whole process is usually done offline (i.e., not on the live system), so online learning can be a confusing name. Think of it as incremental learning. Using online learning to handle huge datasets</image:caption>
      <image:title>Batch and Online Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_18.webp</image:loc>
      <image:caption>Instance-Based Versus Model-Based Learning. This is called instance-based learning: the system learns the examples by heart, then generalizes to new cases using a similarity measure. Instance-based learning</image:caption>
      <image:title>Instance-Based Versus Model-Based Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_19.webp</image:loc>
      <image:caption>Instance-Based Versus Model-Based Learning. Another way to generalize from a set of examples is to build a model of these examples, then use that model to make predictions. This is called model-based learning. Model-based learning</image:caption>
      <image:title>Instance-Based Versus Model-Based Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_20.webp</image:loc>
      <image:caption>Instance-Based Versus Model-Based Learning. Do you see a trend here? There does seem to be a trend here!</image:caption>
      <image:title>Instance-Based Versus Model-Based Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_21.webp</image:loc>
      <image:caption>Instance-Based Versus Model-Based Learning. This model has two model parameters, θ0 and θ1. By tweaking these parameters, you can make your model represent any linear function, as shown in the figure. A few possible linear models</image:caption>
      <image:title>Instance-Based Versus Model-Based Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_22.webp</image:loc>
      <image:caption>Instance-Based Versus Model-Based Learning. Now the model fits the training data as closely as possible (for a linear model), as you can see in the figure. The linear model that fits the training data best</image:caption>
      <image:title>Instance-Based Versus Model-Based Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_23.webp</image:loc>
      <image:caption>Instance-Based Versus Model-Based Learning. print(lin_reg_model.predict(X_new)) # outputs [[5.96242338]]. If you had used an instance-based learning algorithm instead, you would have found that Slovenia has the closest GDP per capita to that of Cyprus ($20,732), and since the OECD data tells us that Slovenians’ life satisfaction is 5.7, you would have predicted a life satisfaction of 5.7 for Cyprus.</image:caption>
      <image:title>Instance-Based Versus Model-Based Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_24.webp</image:loc>
      <image:caption>The Unreasonable Effectiveness of Data. In a famous paper published in 2001, Microsoft researchers Michele Banko and Eric Brill showed that very different Machine Learning algorithms, including fairly simple ones, performed almost identically well on a complex problem of natural language disambiguation</image:caption>
      <image:title>The Unreasonable Effectiveness of Data</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_25.webp</image:loc>
      <image:caption>Nonrepresentative Training Data. For example, the set of countries we used earlier for training the linear model was not perfectly representative; a few countries were missing. The figure shows what the data looks like when you add the missing countries. A more representative training sample</image:caption>
      <image:title>Nonrepresentative Training Data</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_26.webp</image:loc>
      <image:caption>Overfitting the Training Data. The figure shows an example of a high-degree polynomial life satisfaction model that strongly overfits the training data. Even though it performs much better on the training data than the simple linear model, would you really trust its predictions? Overfitting the training data</image:caption>
      <image:title>Overfitting the Training Data</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_27.webp</image:loc>
      <image:caption>Overfitting the Training Data. How confident are you that the W-satisfaction rule generalizes to Rwanda or Zimbabwe? Obviously this pattern occurred in the training data by pure chance, but the model has no way to tell whether a pattern is real or simply the result of noise in the data. Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible solutions are</image:caption>
      <image:title>Overfitting the Training Data</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_28.webp</image:loc>
      <image:caption>Overfitting the Training Data. Regularization reduces the risk of overfitting. The amount of regularization to apply during learning can be controlled by a hyperparameter.</image:caption>
      <image:title>Overfitting the Training Data</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter1_29.webp</image:loc>
      <image:caption>Testing and Validating. If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization error is high, it means that your model is overfitting the training data. It is common to use 80% of the data for training and hold out 20% for testing</image:caption>
      <image:title>Testing and Validating</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter10</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_0.webp</image:loc>
      <image:caption>Biological Neurons. Before we discuss artificial neurons, let’s take a quick look at a biological neuron (represented in the figure). Biological neuron</image:caption>
      <image:title>Biological Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_1.webp</image:loc>
      <image:caption>Biological Neurons. How biological neural networks (BNNs) work is still the subject of active research, but some parts of the brain have been mapped, and it seems that neurons are often organized in consecutive layers, as shown in the figure. Multiple layers in a biological neural network (human cortex)</image:caption>
      <image:title>Biological Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_2.webp</image:loc>
      <image:caption>Logical Computations with Neurons. Warren McCulloch and Walter Pitts proposed a very simple model of the biological neuron, which later became known as an artificial neuron : it has one or more binary (on/off) inputs and one binary output. ANNs performing simple logical computations</image:caption>
      <image:title>Logical Computations with Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_3.webp</image:loc>
      <image:caption>The Perceptron. The Perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. Linear threshold unit</image:caption>
      <image:title>The Perceptron</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_4.webp</image:loc>
      <image:caption>The Perceptron. Equation 10-1. Common step functions used in Perceptrons</image:caption>
      <image:title>The Perceptron</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_5.webp</image:loc>
      <image:caption>The Perceptron. Equation 10-1. Common step functions used in Perceptrons: heaviside(z) = 0 if z &lt; 0, 1 if z ≥ 0</image:caption>
      <image:title>The Perceptron</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_6.webp</image:loc>
      <image:caption>The Perceptron. A Perceptron with two inputs and three outputs is represented in the diagram. This Perceptron can classify instances simultaneously into three different binary classes, which makes it a multioutput classifier. Perceptron diagram</image:caption>
      <image:title>The Perceptron</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_7.webp</image:loc>
      <image:caption>The Perceptron. Equation 10-2. Perceptron learning rule (weight update)</image:caption>
      <image:title>The Perceptron</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_8.webp</image:loc>
      <image:caption>The Perceptron. Equation 10-2. Perceptron learning rule (weight update)</image:caption>
      <image:title>The Perceptron</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_9.webp</image:loc>
      <image:caption>The Perceptron. Equation 10-2. Perceptron learning rule (weight update): w_{i,j}^(next step) = w_{i,j} + η (y_j − ŷ_j) x_i</image:caption>
      <image:title>The Perceptron</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_10.webp</image:loc>
      <image:caption>The Perceptron. However, it turns out that some of the limitations of Perceptrons can be eliminated by stacking multiple Perceptrons. XOR classification problem and an MLP that solves it</image:caption>
      <image:title>The Perceptron</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_11.webp</image:loc>
      <image:caption>Multi-Layer Perceptron and Backpropagation. An MLP is composed of one (passthrough) input layer, one or more layers of LTUs, called hidden layers, and one final layer of LTUs called the output layer (see the figure). Multi-Layer Perceptron</image:caption>
      <image:title>Multi-Layer Perceptron and Backpropagation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_12.webp</image:loc>
      <image:caption>Multi-Layer Perceptron and Backpropagation. Activation functions and their derivatives. An MLP is often used for classification, with each output corresponding to a different binary class (e.g., spam/ham, urgent/not-urgent, and so on).</image:caption>
      <image:title>Multi-Layer Perceptron and Backpropagation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_13.webp</image:loc>
      <image:caption>Multi-Layer Perceptron and Backpropagation. An MLP is often used for classification, with each output corresponding to a different binary class (e.g., spam/ham, urgent/not-urgent, and so on). A modern MLP (including ReLU and softmax) for classification</image:caption>
      <image:title>Multi-Layer Perceptron and Backpropagation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_14.webp</image:loc>
      <image:caption>Multi-Layer Perceptron and Backpropagation. From Biological to Artificial Neurons. Biological neurons seem to implement a roughly sigmoid (S-shaped) activation function, so researchers stuck to sigmoid functions for a very long time.</image:caption>
      <image:title>Multi-Layer Perceptron and Backpropagation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_15.webp</image:loc>
      <image:caption>Training an MLP with TensorFlow’s High-Level API. Under the hood, the DNNClassifier class creates all the neuron layers, based on the ReLU activation function (we can change this by setting the activation_fn hyperparameter). The TF.Learn API is still quite new, so some of the names and functions used in these examples may evolve a bit by the time you read this book. However, the general ideas should not change</image:caption>
      <image:title>Training an MLP with TensorFlow’s High-Level API</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_16.webp</image:loc>
      <image:caption>Construction Phase. activation_fn=None). The tensorflow.contrib package contains many useful functions, but it is a place for experimental code that has not yet graduated to be part of the main TensorFlow API.</image:caption>
      <image:title>Construction Phase</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter10_17.webp</image:loc>
      <image:caption>Construction Phase. loss = tf.reduce_mean(xentropy, name=&quot;loss&quot;). The sparse_softmax_cross_entropy_with_logits() function is equivalent to applying the softmax activation function and then computing the cross entropy, but it is more efficient, and it properly takes care of corner cases like logits equal to 0.</image:caption>
      <image:title>Construction Phase</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter11</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_0.webp</image:loc>
      <image:caption>Vanishing/Exploding Gradients Problems. Logistic activation function saturation</image:caption>
      <image:title>Vanishing/Exploding Gradients Problems</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_1.webp</image:loc>
      <image:caption>Xavier and He Initialization. n_inputs + n_outputs. When the number of input connections is roughly equal to the number of output connections</image:caption>
      <image:title>Xavier and He Initialization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_2.webp</image:loc>
      <image:caption>Xavier and He Initialization. n_inputs + n_outputs</image:caption>
      <image:title>Xavier and He Initialization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_3.webp</image:loc>
      <image:caption>Xavier and He Initialization. n_inputs + n_outputs. ReLU (and its variants)</image:caption>
      <image:title>Xavier and He Initialization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_4.webp</image:loc>
      <image:caption>Xavier and He Initialization. Chapter 11: Training Deep Neural Nets. fan-in and fan-out like in Xavier initialization. This is also the default for the variance_scaling_initializer() function, but you can change this by setting the argument mode=&quot;FAN_AVG&quot;</image:caption>
      <image:title>Xavier and He Initialization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_5.webp</image:loc>
      <image:caption>Nonsaturating Activation Functions. datasets it runs the risk of overfitting the training set. Leaky ReLU</image:caption>
      <image:title>Nonsaturating Activation Functions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_6.webp</image:loc>
      <image:caption>Nonsaturating Activation Functions. Equation 11-2. ELU activation function</image:caption>
      <image:title>Nonsaturating Activation Functions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_7.webp</image:loc>
      <image:caption>Nonsaturating Activation Functions. Equation 11-2. ELU activation function: ELU_α(z) = α(exp(z) − 1) if z &lt; 0, z if z ≥ 0</image:caption>
      <image:title>Nonsaturating Activation Functions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_8.webp</image:loc>
      <image:caption>Nonsaturating Activation Functions. ELU_α(z) = α(exp(z) − 1) if z &lt; 0, z if z ≥ 0</image:caption>
      <image:title>Nonsaturating Activation Functions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_9.webp</image:loc>
      <image:caption>Nonsaturating Activation Functions. ELU_α(z) = α(exp(z) − 1) if z &lt; 0, z if z ≥ 0</image:caption>
      <image:title>Nonsaturating Activation Functions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_10.webp</image:loc>
      <image:caption>Nonsaturating Activation Functions. ELU_α(z) = α(exp(z) − 1) if z &lt; 0, z if z ≥ 0</image:caption>
      <image:title>Nonsaturating Activation Functions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_11.webp</image:loc>
      <image:caption>Nonsaturating Activation Functions. ELU_α(z) = α(exp(z) − 1) if z &lt; 0</image:caption>
      <image:title>Nonsaturating Activation Functions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_12.webp</image:loc>
      <image:caption>Nonsaturating Activation Functions. ELU_α(z) = z if z ≥ 0. ELU activation function</image:caption>
      <image:title>Nonsaturating Activation Functions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_13.webp</image:loc>
      <image:caption>Nonsaturating Activation Functions. The main drawback of the ELU activation function is that it is slower to compute than the ReLU and its variants (due to the use of the exponential function), but dur‐ ing training this is compensated by the faster convergence rate. So which activation function should you use for the hidden layers of your deep neural networks?</image:caption>
      <image:title>Nonsaturating Activation Functions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_14.webp</image:loc>
      <image:caption>Batch Normalization. μ_B = (1/m_B) ∑_{i=1}^{m_B} x^(i)</image:caption>
      <image:title>Batch Normalization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_15.webp</image:loc>
      <image:caption>Batch Normalization</image:caption>
      <image:title>Batch Normalization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_16.webp</image:loc>
      <image:caption>Batch Normalization</image:caption>
      <image:title>Batch Normalization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_17.webp</image:loc>
      <image:caption>Batch Normalization. σ_B² = (1/m_B) ∑_{i=1}^{m_B} (x^(i) − μ_B)²</image:caption>
      <image:title>Batch Normalization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_18.webp</image:loc>
      <image:caption>Batch Normalization. z^(i) = γ x̂^(i) + β. μ_B is the empirical mean, evaluated over the whole mini-batch B</image:caption>
      <image:title>Batch Normalization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_19.webp</image:loc>
      <image:caption>Batch Normalization. Batch Normalization does, however, add some complexity to the model (although it removes the need for normalizing the input data since the first hidden layer will take care of that, provided it is batch-normalized). You may find that training is rather slow at first while Gradient Descent is searching for the optimal scales and offsets for each layer, but it accelerates once it has found reasonably good values</image:caption>
      <image:title>Batch Normalization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_20.webp</image:loc>
      <image:caption>Batch Normalization. Let’s walk through this code.</image:caption>
      <image:title>Batch Normalization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_21.webp</image:loc>
      <image:caption>Batch Normalization. Let’s walk through this code. Next we define bn_params, which is a dictionary that defines the parameters that will be passed to the batch_norm() function, including is_training of course.</image:caption>
      <image:title>Batch Normalization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_22.webp</image:loc>
      <image:caption>Batch Normalization. Next we define bn_params, which is a dictionary that defines the parameters that will be passed to the batch_norm() function, including is_training of course.</image:caption>
      <image:title>Batch Normalization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_23.webp</image:loc>
      <image:caption>Reusing Pretrained Layers. For example, suppose that you have access to a DNN that was trained to classify pictures into 100 different categories, including animals, plants, vehicles, and everyday objects. Reusing pretrained layers</image:caption>
      <image:title>Reusing Pretrained Layers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_24.webp</image:loc>
      <image:caption>Reusing Pretrained Layers. Reusing pretrained layers. If the input pictures of your new task don’t have the same size as the ones used in the original task, you will have to add a preprocessing step to resize them to the size expected by the original model.</image:caption>
      <image:title>Reusing Pretrained Layers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_25.webp</image:loc>
      <image:caption>Reusing a TensorFlow Model. First we build the new model, making sure to copy the original model’s hidden layers 1 to 3. The more similar the tasks are, the more layers you want to reuse (starting with the lower layers). For very similar tasks, you can try keeping all the hidden layers and just replace the output layer</image:caption>
      <image:title>Reusing a TensorFlow Model</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_26.webp</image:loc>
      <image:caption>Unsupervised Pretraining. You have a complex task to solve, no similar model you can reuse, and little labeled training data but plenty of unlabeled training data. Unsupervised pretraining</image:caption>
      <image:title>Unsupervised Pretraining</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_27.webp</image:loc>
      <image:caption>Momentum optimization. Equation 11-4. Momentum algorithm</image:caption>
      <image:title>Momentum optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_28.webp</image:loc>
      <image:caption>Momentum optimization. Equation 11-4. Momentum algorithm</image:caption>
      <image:title>Momentum optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_29.webp</image:loc>
      <image:caption>Momentum optimization. Equation 11-4. Momentum algorithm: 1. m ← βm + η∇θJ(θ)</image:caption>
      <image:title>Momentum optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_30.webp</image:loc>
      <image:caption>Momentum optimization. Equation 11-4. Momentum algorithm: 1. m ← βm + η∇θJ(θ); 2. θ ← θ − m</image:caption>
      <image:title>Momentum optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_31.webp</image:loc>
      <image:caption>Momentum optimization. Momentum can help the optimizer roll past local optima. Due to the momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum.</image:caption>
      <image:title>Momentum optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_32.webp</image:loc>
      <image:caption>Nesterov Accelerated Gradient. Equation 11-5. Nesterov Accelerated Gradient algorithm</image:caption>
      <image:title>Nesterov Accelerated Gradient</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_33.webp</image:loc>
      <image:caption>Nesterov Accelerated Gradient. Equation 11-5. Nesterov Accelerated Gradient algorithm</image:caption>
      <image:title>Nesterov Accelerated Gradient</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_34.webp</image:loc>
      <image:caption>Nesterov Accelerated Gradient. Equation 11-5. Nesterov Accelerated Gradient algorithm</image:caption>
      <image:title>Nesterov Accelerated Gradient</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_35.webp</image:loc>
      <image:caption>Nesterov Accelerated Gradient. Equation 11-5: m ← βm + η∇θJ(θ + βm)</image:caption>
      <image:title>Nesterov Accelerated Gradient</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_36.webp</image:loc>
      <image:caption>Nesterov Accelerated Gradient. NAG ends up being significantly faster than regular Momentum optimization. Regular versus Nesterov Momentum optimization</image:caption>
      <image:title>Nesterov Accelerated Gradient</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_37.webp</image:loc>
      <image:caption>AdaGrad. Equation 11-6. AdaGrad algorithm</image:caption>
      <image:title>AdaGrad</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_38.webp</image:loc>
      <image:caption>AdaGrad. Equation 11-6. AdaGrad algorithm</image:caption>
      <image:title>AdaGrad</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_39.webp</image:loc>
      <image:caption>AdaGrad. Equation 11-6. AdaGrad algorithm</image:caption>
      <image:title>AdaGrad</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_40.webp</image:loc>
      <image:caption>AdaGrad</image:caption>
      <image:title>AdaGrad</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_41.webp</image:loc>
      <image:caption>AdaGrad. 1. s ← s + ∇θJ(θ) ⊗ ∇θJ(θ)</image:caption>
      <image:title>AdaGrad</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_42.webp</image:loc>
      <image:caption>AdaGrad. 1. s ← s + ∇θJ(θ) ⊗ ∇θJ(θ)</image:caption>
      <image:title>AdaGrad</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_43.webp</image:loc>
      <image:caption>AdaGrad. 1. s ← s + ∇θJ(θ) ⊗ ∇θJ(θ)</image:caption>
      <image:title>AdaGrad</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_44.webp</image:loc>
      <image:caption>AdaGrad. Equation 11-6. AdaGrad algorithm: 1. s ← s + ∇θJ(θ) ⊗ ∇θJ(θ); 2. θ ← θ − η∇θJ(θ) ⊘ √(s + ε)</image:caption>
      <image:title>AdaGrad</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_45.webp</image:loc>
      <image:caption>AdaGrad. si ← si + (∂J(θ)/∂θi)²</image:caption>
      <image:title>AdaGrad</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_46.webp</image:loc>
      <image:caption>AdaGrad. si ← si + (∂J(θ)/∂θi)²</image:caption>
      <image:title>AdaGrad</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_47.webp</image:loc>
      <image:caption>AdaGrad. si ← si + (∂J(θ)/∂θi)²; θi ← θi − η (∂J(θ)/∂θi) / √(si + ε), for all parameters θi (simultaneously)</image:caption>
      <image:title>AdaGrad</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_48.webp</image:loc>
      <image:caption>AdaGrad. In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. AdaGrad versus Gradient Descent</image:caption>
      <image:title>AdaGrad</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_49.webp</image:loc>
      <image:caption>RMSProp. Equation 11-7. RMSProp algorithm</image:caption>
      <image:title>RMSProp</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_50.webp</image:loc>
      <image:caption>RMSProp. Equation 11-7. RMSProp algorithm</image:caption>
      <image:title>RMSProp</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_51.webp</image:loc>
      <image:caption>RMSProp. Equation 11-7. RMSProp algorithm</image:caption>
      <image:title>RMSProp</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_52.webp</image:loc>
      <image:caption>RMSProp</image:caption>
      <image:title>RMSProp</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_53.webp</image:loc>
      <image:caption>RMSProp</image:caption>
      <image:title>RMSProp</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_54.webp</image:loc>
      <image:caption>RMSProp</image:caption>
      <image:title>RMSProp</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_55.webp</image:loc>
      <image:caption>RMSProp</image:caption>
      <image:title>RMSProp</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_56.webp</image:loc>
      <image:caption>RMSProp</image:caption>
      <image:title>RMSProp</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_57.webp</image:loc>
      <image:caption>RMSProp</image:caption>
      <image:title>RMSProp</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_58.webp</image:loc>
      <image:caption>RMSProp. Equation 11-7: s ← βs + (1 − β)∇θJ(θ) ⊗ ∇θJ(θ)</image:caption>
      <image:title>RMSProp</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_59.webp</image:loc>
      <image:caption>Adam Optimization. Equation 11-8. Adam algorithm</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_60.webp</image:loc>
      <image:caption>Adam Optimization. Equation 11-8. Adam algorithm</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_61.webp</image:loc>
      <image:caption>Adam Optimization. Equation 11-8. Adam algorithm</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_62.webp</image:loc>
      <image:caption>Adam Optimization</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_63.webp</image:loc>
      <image:caption>Adam Optimization</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_64.webp</image:loc>
      <image:caption>Adam Optimization</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_65.webp</image:loc>
      <image:caption>Adam Optimization</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_66.webp</image:loc>
      <image:caption>Adam Optimization</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_67.webp</image:loc>
      <image:caption>Adam Optimization</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_68.webp</image:loc>
      <image:caption>Adam Optimization</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_69.webp</image:loc>
      <image:caption>Adam Optimization</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_70.webp</image:loc>
      <image:caption>Adam Optimization. 1. m ← β1m + (1 − β1)∇θJ(θ)</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_71.webp</image:loc>
      <image:caption>Adam Optimization. 3. m ← m / (1 − β1^T)</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_72.webp</image:loc>
      <image:caption>Adam Optimization. 4. s ← s / (1 − β2^T)</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_73.webp</image:loc>
      <image:caption>Adam Optimization. 4. s ← s / (1 − β2^T); 5. θ ← θ − ηm ⊘ √(s + ε)</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_74.webp</image:loc>
      <image:caption>Adam Optimization. Faster Optimizers. All the optimization techniques discussed so far only rely on the first-order partial derivatives ( Jacobians ).</image:caption>
      <image:title>Adam Optimization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_75.webp</image:loc>
      <image:caption>Learning Rate Scheduling. interrupt training before it has converged properly, yielding a suboptimal solution (see. Learning curves for various learning rates η</image:caption>
      <image:title>Learning Rate Scheduling</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_76.webp</image:loc>
      <image:caption>ℓ1 and ℓ2 Regularization. reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES) loss = tf.add_n([base_loss] + reg_losses, name=&quot;loss&quot;). Don’t forget to add the regularization losses to your overall loss, or else they will simply be ignored</image:caption>
      <image:title>ℓ1 and ℓ2 Regularization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_77.webp</image:loc>
      <image:caption>Dropout. Dropout regularization. It is quite surprising at first that this rather brutal technique works at all.</image:caption>
      <image:title>Dropout</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_78.webp</image:loc>
      <image:caption>Dropout. scope=&quot;outputs&quot;). You want to use the dropout() function in tensorflow.contrib.layers, not the one in tensorflow.nn. The first one turns off (no-op) when not training, which is what you want, while the second one does not</image:caption>
      <image:title>Dropout</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_79.webp</image:loc>
      <image:caption>Dropout. Chapter 11: Training Deep Neural Nets. Dropconnect is a variant of dropout where individual connections are dropped randomly rather than whole neurons. In general dropout performs better</image:caption>
      <image:title>Dropout</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_80.webp</image:loc>
      <image:caption>Max-Norm Regularization. Dropconnect is a variant of dropout where individual connections are dropped randomly rather than whole neurons. In general dropout performs better</image:caption>
      <image:title>Max-Norm Regularization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_81.webp</image:loc>
      <image:caption>Max-Norm Regularization. Dropconnect is a variant of dropout where individual connections are dropped randomly rather than whole neurons. In general dropout performs better. Another regularization technique that is quite popular for neural networks is called max-norm regularization: for each neuron, it constrains the weights w of the incoming connections such that ∥w∥2 ≤ r, where r is the max-norm hyperparameter and</image:caption>
      <image:title>Max-Norm Regularization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_82.webp</image:loc>
      <image:caption>Max-Norm Regularization. Another regularization technique that is quite popular for neural networks is called max-norm regularization: for each neuron, it constrains the weights w of the incoming connections such that ∥w∥2 ≤ r, where r is the max-norm hyperparameter and</image:caption>
      <image:title>Max-Norm Regularization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_83.webp</image:loc>
      <image:caption>Max-Norm Regularization. Another regularization technique that is quite popular for neural networks is called max-norm regularization: for each neuron, it constrains the weights w of the incoming connections such that ∥w∥2 ≤ r, where r is the max-norm hyperparameter and ∥·∥2 is the ℓ2 norm</image:caption>
      <image:title>Max-Norm Regularization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_84.webp</image:loc>
      <image:caption>Max-Norm Regularization. ∥·∥2 is the ℓ2 norm</image:caption>
      <image:title>Max-Norm Regularization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_85.webp</image:loc>
      <image:caption>Max-Norm Regularization. ∥·∥2 is the ℓ2 norm</image:caption>
      <image:title>Max-Norm Regularization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_86.webp</image:loc>
      <image:caption>Max-Norm Regularization. ∥·∥2 is the ℓ2 norm</image:caption>
      <image:title>Max-Norm Regularization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_87.webp</image:loc>
      <image:caption>Max-Norm Regularization</image:caption>
      <image:title>Max-Norm Regularization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_88.webp</image:loc>
      <image:caption>Max-Norm Regularization. We typically implement this constraint by computing ∥w∥2 after each training step and clipping w if needed (w ← w r / ∥w∥2)</image:caption>
      <image:title>Max-Norm Regularization</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_89.webp</image:loc>
      <image:caption>Data Augmentation. For example, if your model is meant to classify pictures of mushrooms, you can slightly shift, rotate, and resize every picture in the training set by various amounts and add the resulting pictures to the training set (see. Generating new training instances from existing ones</image:caption>
      <image:title>Data Augmentation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter11_90.webp</image:loc>
      <image:caption>Data Augmentation. the API documentation for more details). This makes it easy to implement data augmentation for image datasets. Another powerful technique to train very deep neural networks is to add skip connections (a skip connection is when you add the input of a layer to the output of a higher layer).</image:caption>
      <image:title>Data Augmentation</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter12</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_0.webp</image:loc>
      <image:caption>Distributing TensorFlow Across Devices and Servers. In this chapter we will see how to use TensorFlow to distribute computations across multiple devices (CPUs and GPUs) and run them in parallel (see. Executing a TensorFlow graph across multiple devices in parallel</image:caption>
      <image:title>Distributing TensorFlow Across Devices and Servers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_1.webp</image:loc>
      <image:caption>Installation. Chapter 12: Distributing TensorFlow Across Devices and Servers. If you don’t own any GPU cards, you can use a hosting service with GPU capability such as Amazon AWS.</image:caption>
      <image:title>Installation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_2.webp</image:loc>
      <image:caption>Installation. Nvidia’s Compute Unified Device Architecture library (CUDA) allows developers to use CUDA-enabled GPUs for all sorts of computations (not just graphics acceleration). TensorFlow uses CUDA and cuDNN to control GPUs and boost DNNs</image:caption>
      <image:title>Installation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_3.webp</image:loc>
      <image:caption>Managing the GPU RAM. program #2 will only see GPU cards 2 and 3 (numbered 1 and 0, respectively). Everything will work fine (see. Each program gets two GPUs for itself</image:caption>
      <image:title>Managing the GPU RAM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_4.webp</image:loc>
      <image:caption>Managing the GPU RAM. Each program gets all four GPUs, but with only 40% of the RAM each. If you run the nvidia-smi command while both programs are running, you should see that each process holds roughly 40% of the total RAM of each card</image:caption>
      <image:title>Managing the GPU RAM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_5.webp</image:loc>
      <image:caption>Placing Operations on Devices. c = a * b. The &quot;/cpu:0&quot; device aggregates all CPUs on a multi-CPU system. There is currently no way to pin nodes on specific CPUs or to use just a subset of all CPUs</image:caption>
      <image:title>Placing Operations on Devices</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_6.webp</image:loc>
      <image:caption>Parallel Execution. TensorFlow manages a thread pool on each device to parallelize operations (see. Parallelized execution of a TensorFlow graph</image:caption>
      <image:title>Parallel Execution</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_7.webp</image:loc>
      <image:caption>Parallel Execution. As soon as operation C finishes, the dependency counters of operations D and E will be decremented and will both reach 0, so both operations will be sent to the inter-op thread pool to be executed. You can control the number of threads per inter-op pool by setting the inter_op_parallelism_threads option.</image:caption>
      <image:title>Parallel Execution</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_8.webp</image:loc>
      <image:caption>Multiple Devices Across Multiple Servers. computations (such a job is usually named &quot;worker&quot;). TensorFlow cluster</image:caption>
      <image:title>Multiple Devices Across Multiple Servers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_9.webp</image:loc>
      <image:caption>The Master and Worker Services. Protocol buffers are a lightweight binary data interchange format. All servers in a TensorFlow cluster may communicate with any other server in the cluster, so make sure to open the appropriate ports on your firewall</image:caption>
      <image:title>The Master and Worker Services</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_10.webp</image:loc>
      <image:caption>Sharding Variables Across Multiple Parameter Servers. p2 = 3 * s # pinned to /job:worker/task:1/gpu:1. This example assumes that the parameter servers are CPU-only, which is typically the case since they only need to store and com‐ municate parameters, not perform intensive computations</image:caption>
      <image:title>Sharding Variables Across Multiple Parameter Servers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_11.webp</image:loc>
      <image:caption>Sharding Variables Across Multiple Parameter Servers. Resource containers make it easy to share variables across sessions in flexible ways. Resource containers</image:caption>
      <image:title>Sharding Variables Across Multiple Parameter Servers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_12.webp</image:loc>
      <image:caption>Asynchronous Communication Using TensorFlow Queues. every step. Using queues to load the training data asynchronously</image:caption>
      <image:title>Asynchronous Communication Using TensorFlow Queues</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_13.webp</image:loc>
      <image:caption>Asynchronous Communication Using TensorFlow Queues. q = tf.FIFOQueue(capacity=10, dtypes=[tf.float32], shapes=[[2]], name=&quot;q&quot;, shared_name=&quot;shared_q&quot;). To share variables across sessions, all you had to do was to specify the same name and container on both ends.</image:caption>
      <image:title>Asynchronous Communication Using TensorFlow Queues</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_14.webp</image:loc>
      <image:caption>Asynchronous Communication Using TensorFlow Queues. print (b_val) # [[1., 2.], [3., 4.], [5., 6.]]. If you run dequeue_a on its own, it will dequeue a pair and return only the first element; the second element will be lost (and simi‐ larly, if you run dequeue_b on its own, the first element will be lost)</image:caption>
      <image:title>Asynchronous Communication Using TensorFlow Queues</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_15.webp</image:loc>
      <image:caption>Loading Data Directly from the Graph. You must set trainable=False so the optimizers don’t try to tweak this variable. This example assumes that all of your training set (including the labels) consists only of float32 values. If that’s not the case, you will need one variable per type</image:caption>
      <image:title>Loading Data Directly from the Graph</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_16.webp</image:loc>
      <image:caption>Loading Data Directly from the Graph. A graph dedicated to reading training instances from CSV files. In the training graph, you need to create the shared instance queue and simply dequeue mini-batches from it</image:caption>
      <image:title>Loading Data Directly from the Graph</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_17.webp</image:loc>
      <image:caption>Loading Data Directly from the Graph. In this example, the first mini-batch will contain the first two instances of the CSV file, and the second mini-batch will contain the last instance. TensorFlow queues don’t handle sparse tensors well, so if your training instances are sparse you should parse the records after the instance queue</image:caption>
      <image:title>Loading Data Directly from the Graph</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_18.webp</image:loc>
      <image:caption>Loading Data Directly from the Graph. Reading simultaneously from multiple files. For this we need to write a small function to create a reader and the nodes that will read and push one instance to the instance queue</image:caption>
      <image:title>Loading Data Directly from the Graph</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_19.webp</image:loc>
      <image:caption>One Neural Network per Device. By running several client sessions in parallel (in different threads or different processes), connecting them to different servers, and configuring them to use different devices, you can quite easily train or run many neural networks in parallel, across all devices and all machines in your cluster (see. Training one neural network per device</image:caption>
      <image:title>One Neural Network per Device</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_20.webp</image:loc>
      <image:caption>One Neural Network per Device. It also works perfectly if you host a web service that receives a large number of queries per second (QPS) and you need your neural network to make a prediction for each query. Another option is to serve your neural networks using TensorFlow Serving .</image:caption>
      <image:title>One Neural Network per Device</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_21.webp</image:loc>
      <image:caption>In-Graph Versus Between-Graph Replication. In-graph replication. Alternatively, you can create one separate graph for each neural network and handle synchronization between these graphs yourself.</image:caption>
      <image:title>In-Graph Versus Between-Graph Replication</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_22.webp</image:loc>
      <image:caption>In-Graph Versus Between-Graph Replication. Alternatively, you can create one separate graph for each neural network and handle synchronization between these graphs yourself. Between-graph replication</image:caption>
      <image:title>In-Graph Versus Between-Graph Replication</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_23.webp</image:loc>
      <image:caption>Model Parallelism. This cross-device communication is represented by the dashed arrows. It is likely to completely cancel out the benefit of the parallel computation, since cross-device communication is slow (especially if it is across separate machines). Splitting a fully connected neural network</image:caption>
      <image:title>Model Parallelism</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_24.webp</image:loc>
      <image:caption>Model Parallelism. However, as we will see in Chapter 13 , some neural network architectures, such as convolutional neural networks, contain layers that are only partially connected to the lower layers, so it is much easier to distribute chunks across devices in an efficient way. Splitting a partially connected neural network</image:caption>
      <image:title>Model Parallelism</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_25.webp</image:loc>
      <image:caption>Model Parallelism. Splitting a deep recurrent neural network. In short, model parallelism can speed up running or training some types of neural networks, but not all, and it requires special care and tuning, such as making sure that devices that need to communicate the most run on the same machine</image:caption>
      <image:title>Model Parallelism</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_26.webp</image:loc>
      <image:caption>Data Parallelism. Another way to parallelize the training of a neural network is to replicate it on each device, run a training step simultaneously on all replicas using a different mini-batch for each, and then aggregate the gradients to update the model parameters. Data parallelism</image:caption>
      <image:title>Data Parallelism</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_27.webp</image:loc>
      <image:caption>Data Parallelism. With synchronous updates , the aggregator waits for all gradients to be available before computing the average and applying the result (i.e., using the aggregated gradients to update the model parameters). To reduce the waiting time at each step, you could ignore the gradients from the slowest few replicas (typically ~10%).</image:caption>
      <image:title>Data Parallelism</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_28.webp</image:loc>
      <image:caption>Data Parallelism. Stale gradients can make the training algorithm diverge. Stale gradients when using asynchronous updates</image:caption>
      <image:title>Data Parallelism</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_29.webp</image:loc>
      <image:caption>Data Parallelism. time spent moving the data in and out of GPU RAM (and possibly across the network) will outweigh the speedup obtained by splitting the computation load. At that point, adding more GPUs will just increase saturation and slow down training. For some models, typically relatively small and trained on a very large training set, you are often better off training the model on a single machine with a single GPU</image:caption>
      <image:title>Data Parallelism</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter12_30.webp</image:loc>
      <image:caption>Data Parallelism. Chapter 12: Distributing TensorFlow Across Devices and Servers. Although 16-bit precision is the minimum for training neural networks, you can actually drop down to 8-bit precision after training to reduce the size of the model and speed up computations.</image:caption>
      <image:title>Data Parallelism</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter13</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_0.webp</image:loc>
      <image:caption>The Architecture of the Visual Cortex. David H. Hubel. Local receptive fields in the visual cortex</image:caption>
      <image:title>The Architecture of the Visual Cortex</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_1.webp</image:loc>
      <image:caption>The Architecture of the Visual Cortex. and Patrick Haffner, which introduced the famous LeNet-5 architecture, widely used to recognize handwritten check numbers. Why not simply use a regular deep neural network with fully connected layers for image recognition tasks?</image:caption>
      <image:title>The Architecture of the Visual Cortex</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_2.webp</image:loc>
      <image:caption>Convolutional Layer. CNN layers with rectangular local receptive fields</image:caption>
      <image:title>Convolutional Layer</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_3.webp</image:loc>
      <image:caption>Convolutional Layer. CNN layers with rectangular local receptive fields. Until now, all multilayer neural networks we looked at had layers composed of a long line of neurons, and we had to flatten input images to 1D before feeding them to the neural network.</image:caption>
      <image:title>Convolutional Layer</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_4.webp</image:loc>
      <image:caption>Convolutional Layer. A neuron located in row i, column j of a given layer is connected to the outputs of the neurons in the previous layer located in rows i to i + fh − 1, columns j to j + fw − 1, where fh and fw are the height and width of the receptive field. Connections between layers and zero padding</image:caption>
      <image:title>Convolutional Layer</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_5.webp</image:loc>
      <image:caption>Convolutional Layer. It is also possible to connect a large input layer to a much smaller layer by spacing out the receptive fields. Reducing dimensionality using a stride</image:caption>
      <image:title>Convolutional Layer</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_6.webp</image:loc>
      <image:caption>Filters. filter. During training, a CNN finds the most useful filters for its task, and it learns to combine them into more complex patterns (e.g., a cross is an area in an image where both the vertical filter and the horizontal filter are active). Applying two different filters to get two feature maps</image:caption>
      <image:title>Filters</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_7.webp</image:loc>
      <image:caption>Stacking Multiple Feature Maps. Up to now, for simplicity, we have represented each convolutional layer as a thin 2D layer, but in reality it is composed of several feature maps of equal sizes, so it is more accurately represented in 3D. The fact that all neurons in a feature map share the same parameters dramatically reduces the number of parameters in the model, but most importantly it means that once the CNN has learned to recognize a pattern in one location, it can recognize it in any other location.</image:caption>
      <image:title>Stacking Multiple Feature Maps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_8.webp</image:loc>
      <image:caption>Stacking Multiple Feature Maps. Input images are also composed of multiple sublayers: one per color channel. There are typically three: red, green, and blue (RGB). Grayscale images have just one channel, but some images may have many more, for example satellite images that capture extra light frequencies (such as infrared). Convolution layers with multiple feature maps, and images with three channels</image:caption>
      <image:title>Stacking Multiple Feature Maps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_9.webp</image:loc>
      <image:caption>TensorFlow Implementation. Padding options—input width: 13, filter width: 6, stride: 5. Unfortunately, convolutional layers have quite a few hyperparameters: you must choose the number of filters, their height and width, the strides, and the padding type.</image:caption>
      <image:title>TensorFlow Implementation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_10.webp</image:loc>
      <image:caption>Memory Requirements. During inference (i.e., when making a prediction for a new instance) the RAM occupied by one layer can be released as soon as the next layer has been computed, so you only need as much RAM as required by two consecutive layers. If training crashes because of an out-of-memory error, you can try reducing the mini-batch size.</image:caption>
      <image:title>Memory Requirements</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_11.webp</image:loc>
      <image:caption>Pooling Layer. Max pooling layer (2 × 2 pooling kernel, stride 2, no padding). This is obviously a very destructive kind of layer: even with a tiny 2 × 2 kernel and a stride of 2, the output will be two times smaller in both directions (so its area will be four times smaller), simply dropping 75% of the input values</image:caption>
      <image:title>Pooling Layer</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_12.webp</image:loc>
      <image:caption>CNN Architectures. Typical CNN architectures stack a few convolutional layers (each one generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another pooling layer, and so on. Typical CNN architecture</image:caption>
      <image:title>CNN Architectures</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_13.webp</image:loc>
      <image:caption>CNN Architectures. Typical CNN architecture. A common mistake is to use convolution kernels that are too large. You can often get the same effect as a 9 × 9 kernel by stacking two 3 × 3 kernels, for much less compute.</image:caption>
      <image:title>CNN Architectures</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_14.webp</image:loc>
      <image:caption>AlexNet</image:caption>
      <image:title>AlexNet</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_15.webp</image:loc>
      <image:caption>AlexNet. Local response normalization: jhigh = min(i + r/2, fn − 1)</image:caption>
      <image:title>AlexNet</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_16.webp</image:loc>
      <image:caption>GoogLeNet. The figure shows the architecture of an inception module. Inception module</image:caption>
      <image:title>GoogLeNet</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_17.webp</image:loc>
      <image:caption>GoogLeNet. In short, you can think of the whole inception module as a convolutional layer on steroids, able to output feature maps that capture complex patterns at various scales. The number of convolutional kernels for each convolutional layer is a hyperparameter. Unfortunately, this means that you have six more hyperparameters to tweak for every inception layer you add</image:caption>
      <image:title>GoogLeNet</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_18.webp</image:loc>
      <image:caption>GoogLeNet. GoogLeNet architecture. Let’s go through this network</image:caption>
      <image:title>GoogLeNet</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_19.webp</image:loc>
      <image:caption>ResNet. Residual learning. When you initialize a regular neural network, its weights are close to zero, so the network just outputs values close to zero.</image:caption>
      <image:title>ResNet</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_20.webp</image:loc>
      <image:caption>ResNet. Moreover, if you add many skip connections, the network can start making progress even if several layers have not started learning yet. Regular deep neural network (left) and deep residual network (right)</image:caption>
      <image:title>ResNet</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_21.webp</image:loc>
      <image:caption>ResNet. Now let’s look at ResNet’s architecture. ResNet architecture</image:caption>
      <image:title>ResNet</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_22.webp</image:loc>
      <image:caption>ResNet. Note that the number of feature maps is doubled every few residual units, at the same time as their height and width are halved (using a convolutional layer with stride 2). Skip connection when changing feature map size and depth</image:caption>
      <image:title>ResNet</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter13_23.webp</image:loc>
      <image:caption>ResNet. There are a few other architectures that you may want to look at, in particular VGGNet (runner-up of the ILSVRC 2014 challenge) and Inception-v4 (which merges the ideas of GoogLeNet and ResNet and achieves close to a 3% top-5 error rate on ImageNet). There is really nothing special about implementing the various CNN architectures we just discussed.</image:caption>
      <image:title>ResNet</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter14</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_0.webp</image:loc>
      <image:caption>Recurrent Neurons. Up to now we have mostly looked at feedforward neural networks, where the activations flow only in one direction, from the input layer to the output layer (except for a few networks in Appendix E). A recurrent neuron (left), unrolled through time (right)</image:caption>
      <image:title>Recurrent Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_1.webp</image:loc>
      <image:caption>Recurrent Neurons. You can easily create a layer of recurrent neurons. A layer of recurrent neurons (left), unrolled through time (right)</image:caption>
      <image:title>Recurrent Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_2.webp</image:loc>
      <image:caption>Recurrent Neurons. Equation 14-1. Output of a single recurrent neuron for a single instance</image:caption>
      <image:title>Recurrent Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_3.webp</image:loc>
      <image:caption>Recurrent Neurons. Equation 14-1. Output of a single recurrent neuron for a single instance</image:caption>
      <image:title>Recurrent Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_4.webp</image:loc>
      <image:caption>Recurrent Neurons. Equation 14-1. Output of a single recurrent neuron for a single instance</image:caption>
      <image:title>Recurrent Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_5.webp</image:loc>
      <image:caption>Recurrent Neurons. y(t) = ϕ(x(t)ᵀ · wx + y(t−1)ᵀ · wy + b)</image:caption>
      <image:title>Recurrent Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_6.webp</image:loc>
      <image:caption>Recurrent Neurons. Equation 14-2. Outputs of a layer of recurrent neurons for all instances in a mini- batch</image:caption>
      <image:title>Recurrent Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_7.webp</image:loc>
      <image:caption>Recurrent Neurons. Equation 14-2. Outputs of a layer of recurrent neurons for all instances in a mini- batch</image:caption>
      <image:title>Recurrent Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_8.webp</image:loc>
      <image:caption>Recurrent Neurons. Equation 14-2. Outputs of a layer of recurrent neurons for all instances in a mini- batch</image:caption>
      <image:title>Recurrent Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_9.webp</image:loc>
      <image:caption>Recurrent Neurons. Y(t) = ϕ(X(t) · Wx + Y(t−1) · Wy + b)</image:caption>
      <image:title>Recurrent Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_10.webp</image:loc>
      <image:caption>Recurrent Neurons. Y(t) = ϕ(X(t) · Wx + Y(t−1) · Wy + b) = ϕ([X(t) Y(t−1)] · W + b)</image:caption>
      <image:title>Recurrent Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_11.webp</image:loc>
      <image:caption>Recurrent Neurons. Y(t) = ϕ([X(t) Y(t−1)] · W + b)</image:caption>
      <image:title>Recurrent Neurons</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_12.webp</image:loc>
      <image:caption>Memory Cells. In general a cell’s state at time step t , denoted h ( t ) (the “h” stands for “hidden”), is a function of some inputs at that time step and its state at the previous time step: h ( t ) = f ( h ( t –1) , x ( t ) ). A cell’s hidden state and its output may be different</image:caption>
      <image:title>Memory Cells</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_13.webp</image:loc>
      <image:caption>Input and Output Sequences. Lastly, you could have a sequence-to-vector network, called an encoder , followed by a vector-to-sequence network, called a decoder (see the bottom-right network). Seq to seq (top left), seq to vector (top right), vector to seq (bottom left), delayed seq to seq (bottom right)</image:caption>
      <image:title>Input and Output Sequences</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_14.webp</image:loc>
      <image:caption>Static Unrolling Through Time. basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32). During backpropagation, the while_loop() operation does the appropriate magic: it stores the tensor values for each iteration during the forward pass so it can use them to compute gradients during the reverse pass</image:caption>
      <image:title>Static Unrolling Through Time</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_15.webp</image:loc>
      <image:caption>Handling Variable-Length Output Sequences. To train an RNN, the trick is to unroll it through time (like we just did) and then simply use regular backpropagation. This strategy is called backpropagation through time (BPTT). Backpropagation through time</image:caption>
      <image:title>Handling Variable-Length Output Sequences</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_16.webp</image:loc>
      <image:caption>Handling Variable-Length Output Sequences. Just like in regular backpropagation, there is a first forward pass through the unrolled network (represented by the dashed arrows); then the output sequence is evaluated</image:caption>
      <image:title>Handling Variable-Length Output Sequences</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_17.webp</image:loc>
      <image:caption>Handling Variable-Length Output Sequences. Just like in regular backpropagation, there is a first forward pass through the unrolled network (represented by the dashed arrows); then the output sequence is evaluated</image:caption>
      <image:title>Handling Variable-Length Output Sequences</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_18.webp</image:loc>
      <image:caption>Handling Variable-Length Output Sequences. Just like in regular backpropagation, there is a first forward pass through the unrolled network (represented by the dashed arrows); then the output sequence is evaluated</image:caption>
      <image:title>Handling Variable-Length Output Sequences</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_19.webp</image:loc>
      <image:caption>Handling Variable-Length Output Sequences</image:caption>
      <image:title>Handling Variable-Length Output Sequences</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_20.webp</image:loc>
      <image:caption>Handling Variable-Length Output Sequences</image:caption>
      <image:title>Handling Variable-Length Output Sequences</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_21.webp</image:loc>
      <image:caption>Handling Variable-Length Output Sequences. using a cost function C(Y(tmin), Y(tmin+1), …, Y(tmax)) (where tmin and tmax are the first and last output time steps, not counting the ignored outputs), and the gradients of that cost function are propagated backward through the unrolled network (represented by the solid arrows); and finally the model parameters are updated using the gradients computed during BPTT.</image:caption>
      <image:title>Handling Variable-Length Output Sequences</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_22.webp</image:loc>
      <image:caption>Training a Sequence Classifier. A fully connected layer of neurons (one per class) connected to the output of the last time step, followed by a softmax layer. Sequence classifier</image:caption>
      <image:title>Training a Sequence Classifier</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_23.webp</image:loc>
      <image:caption>Training a Sequence Classifier. We get over 98% accuracy—not bad! Plus you would certainly get a better result by tuning the hyperparameters, initializing the RNN weights using He initialization, training longer, or adding a bit of regularization (e.g., dropout). You can specify an initializer for the RNN by wrapping its construction code in a variable scope (e.g., use variable_scope(&quot;rnn&quot;, initializer=variance_scaling_initializer()) to use He initialization)</image:caption>
      <image:title>Training a Sequence Classifier</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_24.webp</image:loc>
      <image:caption>Training to Predict Time Series. Now let’s take a look at how to handle time series, such as stock prices, air temperature, brain wave patterns, and so on. Time series (left), and a training instance from that series (right)</image:caption>
      <image:title>Training to Predict Time Series</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_25.webp</image:loc>
      <image:caption>Training to Predict Time Series. cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu) outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32). In general you would have more than just one input feature.</image:caption>
      <image:title>Training to Predict Time Series</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_26.webp</image:loc>
      <image:caption>Training to Predict Time Series. A wrapper that proxies every method call to an underlying cell, but also adds some functionality. RNN cells using output projections</image:caption>
      <image:title>Training to Predict Time Series</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_27.webp</image:loc>
      <image:caption>Training to Predict Time Series. The predicted sequence for the instance we looked at earlier, after just 1,000 training iterations. Time series prediction</image:caption>
      <image:title>Training to Predict Time Series</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_28.webp</image:loc>
      <image:caption>Training to Predict Time Series. to [batch_size, n_steps, n_outputs]. Stack all the outputs, apply the projection, then unstack the result</image:caption>
      <image:title>Training to Predict Time Series</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_29.webp</image:loc>
      <image:caption>Creative RNN. X_batch = np.array(sequence[-n_steps:]).reshape(1, n_steps, 1) y_pred = sess.run(outputs, feed_dict={X: X_batch}) sequence.append(y_pred[0, -1, 0]). Creative sequences, seeded with zeros (left) or with an instance (right)</image:caption>
      <image:title>Creative RNN</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_30.webp</image:loc>
      <image:caption>Deep RNNs. Deep RNN (left), unrolled through time (right). To implement a deep RNN in TensorFlow, you can create several cells and stack them into a MultiRNNCell. In the following code we stack three identical cells (but you could very well use various kinds of cells with a different number of neurons)</image:caption>
      <image:title>Deep RNNs</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_31.webp</image:loc>
      <image:caption>Distributing a Deep RNN Across Multiple GPUs. outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32). Do not set state_is_tuple=False, or the MultiRNNCell will concatenate all the cell states into a single tensor, on a single GPU</image:caption>
      <image:title>Distributing a Deep RNN Across Multiple GPUs</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_32.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. So how does an LSTM cell work? The architecture of a basic LSTM cell is shown in the figure. LSTM cell</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_33.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. Equation 14-3. LSTM computations</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_34.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. Equation 14-3. LSTM computations</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_35.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. Equation 14-3. LSTM computations</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_36.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_37.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. i(t) = σ(Wxiᵀ · x(t) + Whiᵀ · h(t−1) + bi)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_38.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. i(t) = σ(Wxiᵀ · x(t) + Whiᵀ · h(t−1) + bi)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_39.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. i(t) = σ(Wxiᵀ · x(t) + Whiᵀ · h(t−1) + bi)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_40.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. i(t) = σ(Wxiᵀ · x(t) + Whiᵀ · h(t−1) + bi)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_41.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. f(t) = σ(Wxfᵀ · x(t) + Whfᵀ · h(t−1) + bf)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_42.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. f(t) = σ(Wxfᵀ · x(t) + Whfᵀ · h(t−1) + bf)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_43.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. f(t) = σ(Wxfᵀ · x(t) + Whfᵀ · h(t−1) + bf)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_44.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. f(t) = σ(Wxfᵀ · x(t) + Whfᵀ · h(t−1) + bf)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_45.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_46.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. o(t) = σ(Wxoᵀ · x(t) + Whoᵀ · h(t−1) + bo)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_47.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. o(t) = σ(Wxoᵀ · x(t) + Whoᵀ · h(t−1) + bo)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_48.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. o(t) = σ(Wxoᵀ · x(t) + Whoᵀ · h(t−1) + bo)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_49.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. o(t) = σ(Wxoᵀ · x(t) + Whoᵀ · h(t−1) + bo)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_50.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. g(t) = tanh(Wxgᵀ · x(t) + Whgᵀ · h(t−1) + bg)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_51.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. g(t) = tanh(Wxgᵀ · x(t) + Whgᵀ · h(t−1) + bg)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_52.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. g(t) = tanh(Wxgᵀ · x(t) + Whgᵀ · h(t−1) + bg). c(t) = f(t) ⊗ c(t−1) + i(t) ⊗ g(t)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_53.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. c(t) = f(t) ⊗ c(t−1) + i(t) ⊗ g(t). y(t) = h(t) = o(t) ⊗ tanh(c(t))</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_54.webp</image:loc>
      <image:caption>The Difficulty of Training over Many Time Steps. y(t) = h(t) = o(t) ⊗ tanh(c(t)). Wxi, Wxf, Wxo, Wxg are the weight matrices of each of the four layers for their connection to the input vector x(t)</image:caption>
      <image:title>The Difficulty of Training over Many Time Steps</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_55.webp</image:loc>
      <image:caption>GRU Cell. The Gated Recurrent Unit (GRU) cell was proposed by Kyunghyun Cho et al. in a 2014 paper that also introduced the Encoder–Decoder network we mentioned earlier. GRU cell</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_56.webp</image:loc>
      <image:caption>GRU Cell. Equation 14-4. GRU computations</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_57.webp</image:loc>
      <image:caption>GRU Cell. Equation 14-4. GRU computations</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_58.webp</image:loc>
      <image:caption>GRU Cell. Equation 14-4. GRU computations. z(t) = σ(Wxz^T · x(t) + Whz^T · h(t−1))</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_59.webp</image:loc>
      <image:caption>GRU Cell. z(t) = σ(Wxz^T · x(t) + Whz^T · h(t−1))</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_60.webp</image:loc>
      <image:caption>GRU Cell. z(t) = σ(Wxz^T · x(t) + Whz^T · h(t−1))</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_61.webp</image:loc>
      <image:caption>GRU Cell. z(t) = σ(Wxz^T · x(t) + Whz^T · h(t−1)). r(t) = σ(Wxr^T · x(t) + Whr^T · h(t−1))</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_62.webp</image:loc>
      <image:caption>GRU Cell. r(t) = σ(Wxr^T · x(t) + Whr^T · h(t−1))</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_63.webp</image:loc>
      <image:caption>GRU Cell. r(t) = σ(Wxr^T · x(t) + Whr^T · h(t−1))</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_64.webp</image:loc>
      <image:caption>GRU Cell. r(t) = σ(Wxr^T · x(t) + Whr^T · h(t−1))</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_65.webp</image:loc>
      <image:caption>GRU Cell. g(t) = tanh(Wxg^T · x(t) + Whg^T · (r(t) ⊗ h(t−1)))</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_66.webp</image:loc>
      <image:caption>GRU Cell. g(t) = tanh(Wxg^T · x(t) + Whg^T · (r(t) ⊗ h(t−1)))</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_67.webp</image:loc>
      <image:caption>GRU Cell. g(t) = tanh(Wxg^T · x(t) + Whg^T · (r(t) ⊗ h(t−1)))</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_68.webp</image:loc>
      <image:caption>GRU Cell. g(t) = tanh(Wxg^T · x(t) + Whg^T · (r(t) ⊗ h(t−1)))</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_69.webp</image:loc>
      <image:caption>GRU Cell</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_70.webp</image:loc>
      <image:caption>GRU Cell. h(t) = (1 − z(t)) ⊗ h(t−1) + z(t) ⊗ g(t)</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_71.webp</image:loc>
      <image:caption>GRU Cell. h(t) = (1 − z(t)) ⊗ h(t−1) + z(t) ⊗ g(t). Creating a GRU cell in TensorFlow is trivial</image:caption>
      <image:title>GRU Cell</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_72.webp</image:loc>
      <image:caption>Word Embeddings. Chapter 14: Recurrent Neural Networks. Embeddings are also useful for representing categorical attributes that can take on a large number of different values, especially when there are complex similarities between values.</image:caption>
      <image:title>Word Embeddings</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_73.webp</image:loc>
      <image:caption>An Encoder–Decoder Network for Machine Translation. Let’s take a look at a simple machine translation model that will translate English sentences to French. A simple machine translation model</image:caption>
      <image:title>An Encoder–Decoder Network for Machine Translation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter14_74.webp</image:loc>
      <image:caption>An Encoder–Decoder Network for Machine Translation. Note that at inference time (after training), you will not have the target sentence to feed to the decoder. Feeding the previous output word as input at inference time</image:caption>
      <image:title>An Encoder–Decoder Network for Machine Translation</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter15</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_0.webp</image:loc>
      <image:caption>Efficient Data Representations. An autoencoder with one hidden layer composed of two neurons (the encoder) and one output layer composed of three neurons (the decoder). The chess memory experiment (left) and a simple autoencoder (right)</image:caption>
      <image:title>Efficient Data Representations</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_1.webp</image:loc>
      <image:caption>Performing PCA with an Undercomplete Linear Autoencoder. PCA performed by an undercomplete linear autoencoder</image:caption>
      <image:title>Performing PCA with an Undercomplete Linear Autoencoder</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_2.webp</image:loc>
      <image:caption>Stacked Autoencoders. The architecture of a stacked autoencoder is typically symmetrical with regard to the central hidden layer (the coding layer). Stacked autoencoder</image:caption>
      <image:title>Stacked Autoencoders</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_3.webp</image:loc>
      <image:caption>Training One Autoencoder at a Time. Rather than training the whole stacked autoencoder in one go like we just did, it is often much faster to train one shallow autoencoder at a time and then stack all of them into a single stacked autoencoder (hence the name), as shown in the figure. Training one autoencoder at a time</image:caption>
      <image:title>Training One Autoencoder at a Time</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_4.webp</image:loc>
      <image:caption>Training One Autoencoder at a Time. Another approach is to use a single graph containing the whole stacked autoencoder, plus some extra operations to perform each training phase, as shown in the figure. A single graph to train a stacked autoencoder</image:caption>
      <image:title>Training One Autoencoder at a Time</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_5.webp</image:loc>
      <image:caption>Training One Autoencoder at a Time. During the execution phase, all you need to do is run the phase 1 training op for a number of epochs, then the phase 2 training op for some more epochs. Since hidden layer 1 is frozen during phase 2, its output will always be the same for any given training instance.</image:caption>
      <image:title>Training One Autoencoder at a Time</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_6.webp</image:loc>
      <image:caption>Visualizing the Reconstructions. The figure shows the resulting images. Original digits (left) and their reconstructions (right)</image:caption>
      <image:title>Visualizing the Reconstructions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_7.webp</image:loc>
      <image:caption>Visualizing Features. You may get low-level features such as the ones shown in the figure. Features learned by five neurons from the first hidden layer</image:caption>
      <image:title>Visualizing Features</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_8.webp</image:loc>
      <image:caption>Unsupervised Pretraining Using Stacked Autoencoders. If you really don’t have much labeled training data, you may want to freeze the pretrained layers (at least the lower ones). Unsupervised pretraining using autoencoders</image:caption>
      <image:title>Unsupervised Pretraining Using Stacked Autoencoders</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_9.webp</image:loc>
      <image:caption>Unsupervised Pretraining Using Stacked Autoencoders. Unsupervised pretraining using autoencoders. This situation is actually quite common, because building a large unlabeled dataset is often cheap (e.g., a simple script can download millions of images off the internet), but labeling them can only be done reliably by humans (e.g., classifying images as cute or not).</image:caption>
      <image:title>Unsupervised Pretraining Using Stacked Autoencoders</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_10.webp</image:loc>
      <image:caption>Denoising Autoencoders. The noise can be pure Gaussian noise added to the inputs, or it can be randomly switched off inputs, just like in dropout (introduced in Chapter 11). The figure shows both options. Denoising autoencoders, with Gaussian noise (left) or dropout (right)</image:caption>
      <image:title>Denoising Autoencoders</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_11.webp</image:loc>
      <image:caption>TensorFlow Implementation. […]. Since the shape of X is only partially defined during the construction phase, we cannot know in advance the shape of the noise that we must add to X.</image:caption>
      <image:title>TensorFlow Implementation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_12.webp</image:loc>
      <image:caption>TensorFlow Implementation. Once we have the mean activation per neuron, we want to penalize the neurons that are too active by adding a sparsity loss to the cost function. Sparsity loss</image:caption>
      <image:title>TensorFlow Implementation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_13.webp</image:loc>
      <image:caption>TensorFlow Implementation. Chapter 15: Autoencoders. The KL divergence between these distributions, noted D_KL(P ∥ Q), can be computed using Equation 15-1</image:caption>
      <image:title>TensorFlow Implementation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_14.webp</image:loc>
      <image:caption>TensorFlow Implementation. Equation 15-1. Kullback–Leibler divergence</image:caption>
      <image:title>TensorFlow Implementation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_15.webp</image:loc>
      <image:caption>TensorFlow Implementation. Equation 15-1. Kullback–Leibler divergence</image:caption>
      <image:title>TensorFlow Implementation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_16.webp</image:loc>
      <image:caption>TensorFlow Implementation. Equation 15-1. Kullback–Leibler divergence</image:caption>
      <image:title>TensorFlow Implementation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_17.webp</image:loc>
      <image:caption>TensorFlow Implementation</image:caption>
      <image:title>TensorFlow Implementation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_18.webp</image:loc>
      <image:caption>TensorFlow Implementation. D_KL(P ∥ Q) = Σ_i P(i) · log(P(i) / Q(i))</image:caption>
      <image:title>TensorFlow Implementation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_19.webp</image:loc>
      <image:caption>TensorFlow Implementation. Equation 15-2. KL divergence between the target sparsity p and the actual sparsity q</image:caption>
      <image:title>TensorFlow Implementation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_20.webp</image:loc>
      <image:caption>TensorFlow Implementation. Equation 15-2. KL divergence between the target sparsity p and the actual sparsity q</image:caption>
      <image:title>TensorFlow Implementation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_21.webp</image:loc>
      <image:caption>TensorFlow Implementation. Equation 15-2. KL divergence between the target sparsity p and the actual sparsity q</image:caption>
      <image:title>TensorFlow Implementation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_22.webp</image:loc>
      <image:caption>TensorFlow Implementation</image:caption>
      <image:title>TensorFlow Implementation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_23.webp</image:loc>
      <image:caption>TensorFlow Implementation. D_KL(p ∥ q) = p · log(p / q) + (1 − p) · log((1 − p) / (1 − q))</image:caption>
      <image:title>TensorFlow Implementation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_24.webp</image:loc>
      <image:caption>Variational Autoencoders. Variational autoencoder (left), and an instance going through it (right)</image:caption>
      <image:title>Variational Autoencoders</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_25.webp</image:loc>
      <image:caption>Variational Autoencoders. 1 - tf.log(eps + tf.square(hidden3_sigma)))</image:caption>
      <image:title>Variational Autoencoders</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_26.webp</image:loc>
      <image:caption>Variational Autoencoders. 1 - tf.log(eps + tf.square(hidden3_sigma))). One common variant is to train the encoder to output γ = log(σ²) rather than σ.</image:caption>
      <image:title>Variational Autoencoders</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter15_27.webp</image:loc>
      <image:caption>Generating Digits. Images of handwritten digits generated by the variational autoencoder. A majority of these digits look pretty convincing, while a few are rather “creative.” But don’t be too harsh on the autoencoder—it only started learning less than an hour ago.</image:caption>
      <image:title>Generating Digits</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter16</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_0.webp</image:loc>
      <image:caption>Learning to Optimize Rewards. Reinforcement Learning examples: (a) walking robot, (b) Ms. Pac-Man, (c) Go player, (d) thermostat, (e) automatic trader. Note that there may not be any positive rewards at all; for example, the agent may move around in a maze, getting a negative reward at every time step, so it had better find the exit as quickly as possible!</image:caption>
      <image:title>Learning to Optimize Rewards</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_1.webp</image:loc>
      <image:caption>Policy Search. The algorithm used by the software agent to determine its actions is called its policy. For example, the policy could be a neural network taking observations as inputs and outputting the action to take. Reinforcement Learning using a neural network policy</image:caption>
      <image:title>Policy Search</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_2.webp</image:loc>
      <image:caption>Policy Search. plus their offspring together constitute the second generation. You can continue to iterate through generations this way, until you find a good policy. Four points in policy space and the agent’s corresponding behavior</image:caption>
      <image:title>Policy Search</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_3.webp</image:loc>
      <image:caption>Introduction to OpenAI Gym. The make() function creates an environment, in this case a CartPole environment. The CartPole environment</image:caption>
      <image:title>Introduction to OpenAI Gym</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_4.webp</image:loc>
      <image:caption>Introduction to OpenAI Gym. (400, 600, 3). Unfortunately, the CartPole (and a few other environments) renders the image to the screen even if you set the mode to &quot;rgb_array&quot;.</image:caption>
      <image:title>Introduction to OpenAI Gym</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_5.webp</image:loc>
      <image:caption>Neural Network Policies. For example, if it outputs 0.7, then we will pick action 0 with 70% probability, and action 1 with 30% probability. Neural network policy</image:caption>
      <image:title>Neural Network Policies</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_6.webp</image:loc>
      <image:caption>Evaluating Actions: The Credit Assignment Problem. Discounted rewards. Of course, a good action may be followed by several bad actions that cause the pole to fall quickly, resulting in the good action getting a low score (similarly, a good actor may sometimes star in a terrible movie).</image:caption>
      <image:title>Evaluating Actions: The Credit Assignment Problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_7.webp</image:loc>
      <image:caption>Policy Gradients. Chapter 16: Reinforcement Learning. Researchers try to find algorithms that work well even when the agent initially knows nothing about the environment.</image:caption>
      <image:title>Policy Gradients</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_8.webp</image:loc>
      <image:caption>Markov Decision Processes. Example of a Markov chain. Markov decision processes were first described in the 1950s by Richard Bellman.</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_9.webp</image:loc>
      <image:caption>Markov Decision Processes. Example of a Markov decision process. Bellman found a way to estimate the optimal state value of any state s , noted V *( s ), which is the sum of all discounted future rewards the agent can expect on average after it reaches a state s , assuming it acts optimally.</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_10.webp</image:loc>
      <image:caption>Markov Decision Processes. Equation 16-1. Bellman Optimality Equation</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_11.webp</image:loc>
      <image:caption>Markov Decision Processes. Equation 16-1. Bellman Optimality Equation</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_12.webp</image:loc>
      <image:caption>Markov Decision Processes. Equation 16-1. Bellman Optimality Equation</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_13.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_14.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_15.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_16.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_17.webp</image:loc>
      <image:caption>Markov Decision Processes. V*(s) = max_a Σ_s′ T(s, a, s′) · [R(s, a, s′) + γ · V*(s′)], for all s</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_18.webp</image:loc>
      <image:caption>Markov Decision Processes. Equation 16-2. Value Iteration algorithm</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_19.webp</image:loc>
      <image:caption>Markov Decision Processes. Equation 16-2. Value Iteration algorithm</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_20.webp</image:loc>
      <image:caption>Markov Decision Processes. Equation 16-2. Value Iteration algorithm</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_21.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_22.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_23.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_24.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_25.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_26.webp</image:loc>
      <image:caption>Markov Decision Processes. V_{k+1}(s) ← max_a Σ_s′ T(s, a, s′) · [R(s, a, s′) + γ · V_k(s′)], for all s</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_27.webp</image:loc>
      <image:caption>Markov Decision Processes. V_k(s) is the estimated value of state s at the k-th iteration of the algorithm. This algorithm is an example of Dynamic Programming, which breaks down a complex problem (in this case estimating a potentially infinite sum of discounted future rewards) into tractable subproblems that can be tackled iteratively</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_28.webp</image:loc>
      <image:caption>Markov Decision Processes. Equation 16-3. Q-Value Iteration algorithm</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_29.webp</image:loc>
      <image:caption>Markov Decision Processes. Equation 16-3. Q-Value Iteration algorithm</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_30.webp</image:loc>
      <image:caption>Markov Decision Processes. Equation 16-3. Q-Value Iteration algorithm</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_31.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_32.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_33.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_34.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_35.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_36.webp</image:loc>
      <image:caption>Markov Decision Processes. Q_{k+1}(s, a) ← Σ_s′ T(s, a, s′) · [R(s, a, s′) + γ · max_a′ Q_k(s′, a′)], for all (s, a)</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_37.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_38.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_39.webp</image:loc>
      <image:caption>Markov Decision Processes</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_40.webp</image:loc>
      <image:caption>Markov Decision Processes. Once you have the optimal Q-Values, defining the optimal policy, noted π*(s), is trivial: when the agent is in state s, it should choose the action with the highest Q-Value for that state: π*(s) = argmax_a Q*(s, a)</image:caption>
      <image:title>Markov Decision Processes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_41.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning. Equation 16-4. TD Learning algorithm</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_42.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning. Equation 16-4. TD Learning algorithm</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_43.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning. Equation 16-4. TD Learning algorithm</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_44.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_45.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_46.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_47.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_48.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_49.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning. V_{k+1}(s) ← (1 − α) · V_k(s) + α · (r + γ · V_k(s′))</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_50.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning. α is the learning rate (e.g., 0.01). TD Learning has many similarities with Stochastic Gradient Descent, in particular the fact that it handles one sample at a time.</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_51.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning. Equation 16-5. Q-Learning algorithm</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_52.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning. Equation 16-5. Q-Learning algorithm</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_53.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning. Equation 16-5. Q-Learning algorithm</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_54.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_55.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_56.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_57.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_58.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_59.webp</image:loc>
      <image:caption>Temporal Difference Learning and Q-Learning. Q_{k+1}(s, a) ← (1 − α) Q_k(s, a) + α (r + γ max_{a′} Q_k(s′, a′))</image:caption>
      <image:title>Temporal Difference Learning and Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_60.webp</image:loc>
      <image:caption>Exploration Policies. Equation 16-6. Q-Learning using an exploration function: Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a′} f(Q(s′, a′), N(s′, a′)))</image:caption>
      <image:title>Exploration Policies</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_61.webp</image:loc>
      <image:caption>Exploration Policies. Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a′} f(Q(s′, a′), N(s′, a′)))</image:caption>
      <image:title>Exploration Policies</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_62.webp</image:loc>
      <image:caption>Learning to Play Ms. Pac-Man Using Deep Q-Learning. Ms. Pac-Man observation, original (left) and after preprocessing (right). Next, let’s create the DQN.</image:caption>
      <image:title>Learning to Play Ms. Pac-Man Using Deep Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_63.webp</image:loc>
      <image:caption>Learning to Play Ms. Pac-Man Using Deep Q-Learning. Next, let’s create the DQN. Deep Q-network to play Ms. Pac-Man</image:caption>
      <image:title>Learning to Play Ms. Pac-Man Using Deep Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_64.webp</image:loc>
      <image:caption>Learning to Play Ms. Pac-Man Using Deep Q-Learning. Equation 16-7. Deep Q-Learning cost function</image:caption>
      <image:title>Learning to Play Ms. Pac-Man Using Deep Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_65.webp</image:loc>
      <image:caption>Learning to Play Ms. Pac-Man Using Deep Q-Learning. Equation 16-7. Deep Q-Learning cost function</image:caption>
      <image:title>Learning to Play Ms. Pac-Man Using Deep Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_66.webp</image:loc>
      <image:caption>Learning to Play Ms. Pac-Man Using Deep Q-Learning. Equation 16-7. Deep Q-Learning cost function</image:caption>
      <image:title>Learning to Play Ms. Pac-Man Using Deep Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_67.webp</image:loc>
      <image:caption>Learning to Play Ms. Pac-Man Using Deep Q-Learning</image:caption>
      <image:title>Learning to Play Ms. Pac-Man Using Deep Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_68.webp</image:loc>
      <image:caption>Learning to Play Ms. Pac-Man Using Deep Q-Learning. J(θ_critic) = (1/m) ∑_{i=1}^{m} (y^{(i)} − Q(s^{(i)}, a^{(i)}, θ_critic))²</image:caption>
      <image:title>Learning to Play Ms. Pac-Man Using Deep Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_69.webp</image:loc>
      <image:caption>Learning to Play Ms. Pac-Man Using Deep Q-Learning. critic</image:caption>
      <image:title>Learning to Play Ms. Pac-Man Using Deep Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_70.webp</image:loc>
      <image:caption>Learning to Play Ms. Pac-Man Using Deep Q-Learning. For the critic, with y^{(i)} = r^{(i)} + γ max_{a′} Q(s′^{(i)}, a′, θ_actor)</image:caption>
      <image:title>Learning to Play Ms. Pac-Man Using Deep Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_71.webp</image:loc>
      <image:caption>Learning to Play Ms. Pac-Man Using Deep Q-Learning. J(θ_critic) is the cost function used to train the critic DQN. As you can see, it is just the Mean Squared Error between the target Q-Values y^{(i)}, as estimated by the actor DQN, and the critic DQN’s predictions of these Q-Values. The replay memory is optional, but highly recommended.</image:caption>
      <image:title>Learning to Play Ms. Pac-Man Using Deep Q-Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter16_72.webp</image:loc>
      <image:caption>Learning to Play Ms. Pac-Man Using Deep Q-Learning. Chapter 16: Reinforcement Learning</image:caption>
      <image:title>Learning to Play Ms. Pac-Man Using Deep Q-Learning</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter2</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_0.webp</image:loc>
      <image:caption>Working with Real Data. In this chapter we chose the California Housing Prices dataset from the StatLib repository. California housing prices</image:caption>
      <image:title>Working with Real Data</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_1.webp</image:loc>
      <image:caption>Look at the Big Picture. Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics. Since you are a well-organized data scientist, the first thing you do is to pull out your Machine Learning project checklist.</image:caption>
      <image:title>Look at the Big Picture</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_2.webp</image:loc>
      <image:caption>Frame the Problem. A Machine Learning pipeline for real estate investments. Pipelines</image:caption>
      <image:title>Frame the Problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_3.webp</image:loc>
      <image:caption>Frame the Problem. Have you found the answers?. If the data was huge, you could either split your batch learning work across multiple servers (using the MapReduce technique, as we will see later), or you could use an online learning technique instead</image:caption>
      <image:title>Frame the Problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_4.webp</image:loc>
      <image:caption>Select a Performance Measure. Equation 2-1. Root Mean Square Error (RMSE)</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_5.webp</image:loc>
      <image:caption>Select a Performance Measure. Equation 2-1. Root Mean Square Error (RMSE): RMSE(X, h) = √((1/m) ∑_{i=1}^{m} (h(x^{(i)}) − y^{(i)})²)</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_6.webp</image:loc>
      <image:caption>Select a Performance Measure. RMSE(X, h) = √((1/m) ∑_{i=1}^{m} (h(x^{(i)}) − y^{(i)})²)</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_7.webp</image:loc>
      <image:caption>Select a Performance Measure. RMSE(X, h) = √((1/m) ∑_{i=1}^{m} (h(x^{(i)}) − y^{(i)})²)</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_8.webp</image:loc>
      <image:caption>Select a Performance Measure. Recall that the transpose operator flips a column vector into a row vector (and vice versa)</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_9.webp</image:loc>
      <image:caption>Select a Performance Measure. Recall that the transpose operator flips a column vector into a row vector (and vice versa)</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_10.webp</image:loc>
      <image:caption>Select a Performance Measure. Recall that the transpose operator flips a column vector into a row vector (and vice versa)</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_11.webp</image:loc>
      <image:caption>Select a Performance Measure</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_12.webp</image:loc>
      <image:caption>Select a Performance Measure. Equation 2-2. Mean Absolute Error</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_13.webp</image:loc>
      <image:caption>Select a Performance Measure. Equation 2-2. Mean Absolute Error</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_14.webp</image:loc>
      <image:caption>Select a Performance Measure. Equation 2-2. Mean Absolute Error: MAE(X, h) = (1/m) ∑_{i=1}^{m} |h(x^{(i)}) − y^{(i)}|</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_15.webp</image:loc>
      <image:caption>Select a Performance Measure. MAE(X, h) = (1/m) ∑_{i=1}^{m} |h(x^{(i)}) − y^{(i)}|</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_16.webp</image:loc>
      <image:caption>Select a Performance Measure. Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms , are possible</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_17.webp</image:loc>
      <image:caption>Select a Performance Measure. Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms , are possible</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_18.webp</image:loc>
      <image:caption>Select a Performance Measure. Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms , are possible</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_19.webp</image:loc>
      <image:caption>Select a Performance Measure</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_20.webp</image:loc>
      <image:caption>Select a Performance Measure</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_21.webp</image:loc>
      <image:caption>Select a Performance Measure. Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm: it is the notion of distance you are familiar with. It is also called the ℓ2 norm, noted ∥ · ∥₂ (or just ∥ · ∥)</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_22.webp</image:loc>
      <image:caption>Select a Performance Measure</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_23.webp</image:loc>
      <image:caption>Select a Performance Measure</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_24.webp</image:loc>
      <image:caption>Select a Performance Measure</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_25.webp</image:loc>
      <image:caption>Select a Performance Measure. The ℓk norm: ∥v∥_k = (|v_0|^k + |v_1|^k + ⋯ + |v_n|^k)^{1/k}. ℓ0 just gives the cardinality of the vector (i.e., the number of elements), and ℓ∞ gives the maximum absolute value in the vector</image:caption>
      <image:title>Select a Performance Measure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_26.webp</image:loc>
      <image:caption>Creating an Isolated Environment. Your workspace in Jupyter. A notebook contains a list of cells.</image:caption>
      <image:title>Creating an Isolated Environment</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_27.webp</image:loc>
      <image:caption>Creating an Isolated Environment. A notebook contains a list of cells. Hello world Python notebook</image:caption>
      <image:title>Creating an Isolated Environment</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_28.webp</image:loc>
      <image:caption>Take a Quick Look at the Data Structure. Let’s take a look at the top five rows using the DataFrame’s head() method. Top five rows in the dataset</image:caption>
      <image:title>Take a Quick Look at the Data Structure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_29.webp</image:loc>
      <image:caption>Take a Quick Look at the Data Structure. The info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values. Housing info</image:caption>
      <image:title>Take a Quick Look at the Data Structure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_30.webp</image:loc>
      <image:caption>Take a Quick Look at the Data Structure. Let’s look at the other fields. The describe() method shows a summary of the numerical attributes. Summary of each numerical attribute</image:caption>
      <image:title>Take a Quick Look at the Data Structure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_31.webp</image:loc>
      <image:caption>Take a Quick Look at the Data Structure. plt.show(). A histogram for each numerical attribute</image:caption>
      <image:title>Take a Quick Look at the Data Structure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_32.webp</image:loc>
      <image:caption>Take a Quick Look at the Data Structure. Get the Data. Matplotlib relies on a user-specified graphical backend to draw on your screen.</image:caption>
      <image:title>Take a Quick Look at the Data Structure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_33.webp</image:loc>
      <image:caption>Take a Quick Look at the Data Structure. Chapter 2: End-to-End Machine Learning Project. Wait! Before you look at the data any further, you need to create a test set, put it aside, and never look at it</image:caption>
      <image:title>Take a Quick Look at the Data Structure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_34.webp</image:loc>
      <image:caption>Create a Test Set. Suppose you chatted with experts who told you that the median income is a very important attribute to predict median housing prices. Histogram of income categories</image:caption>
      <image:title>Create a Test Set</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_35.webp</image:loc>
      <image:caption>Create a Test Set. With similar code you can measure the income category proportions in the test set. This compares the income category proportions in the overall dataset, in the test set generated with stratified sampling, and in a test set generated using purely random sampling. Sampling bias comparison of stratified versus purely random sampling</image:caption>
      <image:title>Create a Test Set</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_36.webp</image:loc>
      <image:caption>Visualizing Geographical Data. housing.plot(kind=&quot;scatter&quot;, x=&quot;longitude&quot;, y=&quot;latitude&quot;). A geographical scatterplot of the data</image:caption>
      <image:title>Visualizing Geographical Data</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_37.webp</image:loc>
      <image:caption>Visualizing Geographical Data. housing.plot(kind=&quot;scatter&quot;, x=&quot;longitude&quot;, y=&quot;latitude&quot;, alpha=0.1). A better visualization highlighting high-density areas</image:caption>
      <image:title>Visualizing Geographical Data</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_38.webp</image:loc>
      <image:caption>Visualizing Geographical Data. California housing prices. This image tells you that the housing prices are very much related to the location (e.g., close to the ocean) and to the population density, as you probably knew already.</image:caption>
      <image:title>Visualizing Geographical Data</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_39.webp</image:loc>
      <image:caption>Looking for Correlations. The correlation coefficient ranges from –1 to 1. Standard correlation coefficient of various datasets (source: Wikipedia; public domain image)</image:caption>
      <image:title>Looking for Correlations</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_40.webp</image:loc>
      <image:caption>Looking for Correlations. Standard correlation coefficient of various datasets (source: Wikipedia; public domain image). The correlation coefficient only measures linear correlations (“if x goes up, then y generally goes up/down”).</image:caption>
      <image:title>Looking for Correlations</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_41.webp</image:loc>
      <image:caption>Looking for Correlations. scatter_matrix(housing[attributes], figsize=(12, 8)). Scatter matrix</image:caption>
      <image:title>Looking for Correlations</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_42.webp</image:loc>
      <image:caption>Looking for Correlations. Median income versus median house value. This plot reveals a few things.</image:caption>
      <image:title>Looking for Correlations</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_43.webp</image:loc>
      <image:caption>Feature Scaling. Prepare the Data for Machine Learning Algorithms. As with all the transformations, it is important to fit the scalers to the training data only, not to the full dataset (including the test set). Only then can you use them to transform the training set and the test set (and new data)</image:caption>
      <image:title>Feature Scaling</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_44.webp</image:loc>
      <image:caption>Better Evaluation Using Cross-Validation. Select and Train a Model. Scikit-Learn’s cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the preceding code computes -scores before calculating the square root.</image:caption>
      <image:title>Better Evaluation Using Cross-Validation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_45.webp</image:loc>
      <image:caption>Better Evaluation Using Cross-Validation. Wow, this is much better: Random Forests look very promising. You should save every model you experiment with, so you can come back easily to any model you want.</image:caption>
      <image:title>Better Evaluation Using Cross-Validation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_46.webp</image:loc>
      <image:caption>Fine-Tune Your Model. grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring=&apos;neg_mean_squared_error&apos;) grid_search.fit(housing_prepared, housing_labels)</image:caption>
      <image:title>Fine-Tune Your Model</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_47.webp</image:loc>
      <image:caption>Fine-Tune Your Model. Chapter 2: End-to-End Machine Learning Project. Since 30 is the maximum value of n_estimators that was evaluated, you should probably evaluate higher values as well, since the score may continue to improve</image:caption>
      <image:title>Fine-Tune Your Model</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_48.webp</image:loc>
      <image:caption>Fine-Tune Your Model. RandomForestRegressor(bootstrap=True, criterion=&apos;mse&apos;, max_depth=None, max_features=6, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=1, oob_score=False, random_state=None, verbose=. If GridSearchCV is initialized with refit=True (which is the default), then once it finds the best estimator using cross- validation, it retrains it on the whole training set.</image:caption>
      <image:title>Fine-Tune Your Model</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter2_49.webp</image:loc>
      <image:caption>Fine-Tune Your Model. default hyperparameter values (which was 52,634). Congratulations, you have successfully fine-tuned your best model!. Don’t forget that you can treat some of the data preparation steps as hyperparameters.</image:caption>
      <image:title>Fine-Tune Your Model</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter3</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_0.webp</image:loc>
      <image:caption>MNIST. plt.axis(&quot;off&quot;) plt.show(). This looks like a 5, and indeed that’s what the label tells us</image:caption>
      <image:title>MNIST</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_1.webp</image:loc>
      <image:caption>MNIST. A few more images from the MNIST dataset give you a feel for the complexity of the classification task. A few digits from the MNIST dataset</image:caption>
      <image:title>MNIST</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_2.webp</image:loc>
      <image:caption>Training a Binary Classifier. sgd_clf = SGDClassifier(random_state=42) sgd_clf.fit(X_train, y_train_5). The SGDClassifier relies on randomness during training (hence the name “stochastic”). If you want reproducible results, you should set the random_state parameter</image:caption>
      <image:title>Training a Binary Classifier</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_3.webp</image:loc>
      <image:caption>Confusion Matrix. If you are confused about the confusion matrix, this illustration may help. An illustrated confusion matrix</image:caption>
      <image:title>Confusion Matrix</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_4.webp</image:loc>
      <image:caption>Precision/Recall Tradeoff. Decision threshold and precision/recall tradeoff. Scikit-Learn does not let you set the threshold directly, but it does give you access to the decision scores that it uses to make predictions.</image:caption>
      <image:title>Precision/Recall Tradeoff</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_5.webp</image:loc>
      <image:caption>Precision/Recall Tradeoff. plot_precision_recall_vs_threshold(precisions, recalls, thresholds) plt.show(). Precision and recall versus the decision threshold</image:caption>
      <image:title>Precision/Recall Tradeoff</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_6.webp</image:loc>
      <image:caption>Precision/Recall Tradeoff. Precision and recall versus the decision threshold. You may wonder why the precision curve is bumpier than the recall curve.</image:caption>
      <image:title>Precision/Recall Tradeoff</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_7.webp</image:loc>
      <image:caption>Precision/Recall Tradeoff. Precision versus recall. You can see that precision really starts to fall sharply around 80% recall. You will probably want to select a precision/recall tradeoff just before that drop—for example, at around 60% recall. But of course the choice depends on your project</image:caption>
      <image:title>Precision/Recall Tradeoff</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_8.webp</image:loc>
      <image:caption>Precision/Recall Tradeoff. Great, you have a 90% precision classifier (or close enough)!. If someone says “let’s reach 99% precision,” you should ask, “at what recall?”</image:caption>
      <image:title>Precision/Recall Tradeoff</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_9.webp</image:loc>
      <image:caption>The ROC Curve. plot_roc_curve(fpr, tpr) plt.show(). ROC curve</image:caption>
      <image:title>The ROC Curve</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_10.webp</image:loc>
      <image:caption>The ROC Curve. &gt;&gt;&gt; roc_auc_score(y_train_5, y_scores) 0.97061072797174941. Since the ROC curve is so similar to the precision/recall (or PR) curve, you may wonder how to decide which one to use.</image:caption>
      <image:title>The ROC Curve</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_11.webp</image:loc>
      <image:caption>The ROC Curve. plt.show(). Comparing ROC curves</image:caption>
      <image:title>The ROC Curve</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_12.webp</image:loc>
      <image:caption>Multiclass Classification. &gt;&gt;&gt; sgd_clf.classes_[5] 5.0. When a classifier is trained, it stores the list of target classes in its classes_ attribute, ordered by value.</image:caption>
      <image:title>Multiclass Classification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_13.webp</image:loc>
      <image:caption>Error Analysis. plt.matshow(conf_mx, cmap=plt.cm.gray) plt.show(). This confusion matrix looks fairly good, since most images are on the main diagonal, which means that they were classified correctly.</image:caption>
      <image:title>Error Analysis</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_14.webp</image:loc>
      <image:caption>Error Analysis. np.fill_diagonal(norm_conf_mx, 0) plt.matshow(norm_conf_mx, cmap=plt.cm.gray) plt.show(). Now you can clearly see the kinds of errors the classifier makes.</image:caption>
      <image:title>Error Analysis</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_15.webp</image:loc>
      <image:caption>Error Analysis. plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5) plt.show(). The two 5×5 blocks on the left show digits classified as 3s, and the two 5×5 blocks on the right show images classified as 5s.</image:caption>
      <image:title>Error Analysis</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_16.webp</image:loc>
      <image:caption>Multioutput Classification. To illustrate this, let’s build a system that removes noise from images. The line between classification and regression is sometimes blurry, such as in this example.</image:caption>
      <image:title>Multioutput Classification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_17.webp</image:loc>
      <image:caption>Multioutput Classification. Scikit-Learn offers a few other averaging options and multilabel classifier metrics; see the documentation for more details. On the left is the noisy input image, and on the right is the clean target image. Now let’s train the classifier and make it clean this image</image:caption>
      <image:title>Multioutput Classification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter3_18.webp</image:loc>
      <image:caption>Multioutput Classification. clean_digit = knn_clf.predict([X_test_mod[some_index]]) plot_digit(clean_digit). Looks close enough to the target!</image:caption>
      <image:title>Multioutput Classification</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter4</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_0.webp</image:loc>
      <image:caption>Training Models. Finally, we will look at two more models that are commonly used for classification tasks: Logistic Regression and Softmax Regression. There will be quite a few math equations in this chapter, using basic notions of linear algebra and calculus.</image:caption>
      <image:title>Training Models</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_1.webp</image:loc>
      <image:caption>Linear Regression. Equation 4-2. Linear Regression model prediction (vectorized form)</image:caption>
      <image:title>Linear Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_2.webp</image:loc>
      <image:caption>Linear Regression. Equation 4-2. Linear Regression model prediction (vectorized form): ŷ = h_θ(x) = θᵀ·x</image:caption>
      <image:title>Linear Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_3.webp</image:loc>
      <image:caption>Linear Regression. Equation 4-3. MSE cost function for a Linear Regression model</image:caption>
      <image:title>Linear Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_4.webp</image:loc>
      <image:caption>Linear Regression. Equation 4-3. MSE cost function for a Linear Regression model</image:caption>
      <image:title>Linear Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_5.webp</image:loc>
      <image:caption>Linear Regression. Equation 4-3. MSE cost function for a Linear Regression model: MSE(X, h_θ) = (1/m) Σᵢ₌₁ᵐ (θᵀ·x⁽ⁱ⁾ − y⁽ⁱ⁾)²</image:caption>
      <image:title>Linear Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_6.webp</image:loc>
      <image:caption>Linear Regression. MSE(X, h_θ) = (1/m) Σᵢ₌₁ᵐ (θᵀ·x⁽ⁱ⁾ − y⁽ⁱ⁾)²</image:caption>
      <image:title>Linear Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_7.webp</image:loc>
      <image:caption>The Normal Equation. Equation 4-4. Normal Equation</image:caption>
      <image:title>The Normal Equation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_8.webp</image:loc>
      <image:caption>The Normal Equation. Equation 4-4. Normal Equation: θ̂ = (Xᵀ·X)⁻¹·Xᵀ·y</image:caption>
      <image:title>The Normal Equation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_9.webp</image:loc>
      <image:caption>The Normal Equation. θ̂ = (Xᵀ·X)⁻¹·Xᵀ·y. θ̂ is the value of θ that minimizes the cost function</image:caption>
      <image:title>The Normal Equation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_10.webp</image:loc>
      <image:caption>The Normal Equation. Now let’s compute θ using the Normal Equation. We will use the inv() function from NumPy’s Linear Algebra module (np.linalg) to compute the inverse of a matrix, and the dot() method for matrix multiplication. X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance</image:caption>
      <image:title>The Normal Equation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_11.webp</image:loc>
      <image:caption>Computational Complexity. The Normal Equation computes the inverse of X T · X , which is an n × n matrix (where n is the number of features). The Normal Equation gets very slow when the number of features grows large (e.g., 100,000)</image:caption>
      <image:title>Computational Complexity</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_12.webp</image:loc>
      <image:caption>Gradient Descent. Concretely, you start by filling θ with random values (this is called random initialization), and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum</image:caption>
      <image:title>Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_13.webp</image:loc>
      <image:caption>Gradient Descent. Learning rate too small. On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before.</image:caption>
      <image:title>Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_14.webp</image:loc>
      <image:caption>Gradient Descent. On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. Learning rate too large</image:caption>
      <image:title>Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_15.webp</image:loc>
      <image:caption>Gradient Descent. Gradient Descent pitfalls. Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function , which means that if you pick any two points on the curve, the line segment joining them never crosses the curve.</image:caption>
      <image:title>Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_16.webp</image:loc>
      <image:caption>Gradient Descent. In fact, the cost function has the shape of a bowl, but it can be an elongated bowl if the features have very different scales. The figure shows Gradient Descent on a training set where features 1 and 2 have the same scale (on the left), and on a training set where feature 1 has much smaller values than feature 2 (on the right). Gradient Descent with and without feature scaling</image:caption>
      <image:title>Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_17.webp</image:loc>
      <image:caption>Gradient Descent. As you can see, on the left the Gradient Descent algorithm goes straight toward the minimum, thereby reaching it quickly, whereas on the right it first goes in a direction almost orthogonal to the direction of the global minimum, and it ends with a long march down an almost flat valley. When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn’s StandardScaler class), or else it will take much longer to converge</image:caption>
      <image:title>Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_18.webp</image:loc>
      <image:caption>Batch Gradient Descent. To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter θⱼ.</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_19.webp</image:loc>
      <image:caption>Batch Gradient Descent. To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter θⱼ, noted ∂/∂θⱼ MSE(θ)</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_20.webp</image:loc>
      <image:caption>Batch Gradient Descent. Equation 4-5. Partial derivatives of the cost function</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_21.webp</image:loc>
      <image:caption>Batch Gradient Descent. Equation 4-5. Partial derivatives of the cost function</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_22.webp</image:loc>
      <image:caption>Batch Gradient Descent. Equation 4-5. Partial derivatives of the cost function: ∂/∂θⱼ MSE(θ) = (2/m) Σᵢ₌₁ᵐ (θᵀ·x⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_23.webp</image:loc>
      <image:caption>Batch Gradient Descent. ∂/∂θⱼ MSE(θ) = (2/m) Σᵢ₌₁ᵐ (θᵀ·x⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_24.webp</image:loc>
      <image:caption>Batch Gradient Descent. Equation 4-6. Gradient vector of the cost function: ∇θ MSE(θ) = (2/m)·Xᵀ·(X·θ − y)</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_25.webp</image:loc>
      <image:caption>Batch Gradient Descent. ∇θ MSE(θ) = (2/m)·Xᵀ·(X·θ − y)</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_26.webp</image:loc>
      <image:caption>Batch Gradient Descent. ∇θ MSE(θ) = (2/m)·Xᵀ·(X·θ − y)</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_27.webp</image:loc>
      <image:caption>Batch Gradient Descent. ∇θ MSE(θ) = (2/m)·Xᵀ·(X·θ − y)</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_28.webp</image:loc>
      <image:caption>Batch Gradient Descent. ∇θ MSE(θ) = (2/m)·Xᵀ·(X·θ − y)</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_29.webp</image:loc>
      <image:caption>Batch Gradient Descent. ∇θ MSE(θ) = (2/m)·Xᵀ·(X·θ − y). Notice that this formula involves calculations over the full training set X at each Gradient Descent step!</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_30.webp</image:loc>
      <image:caption>Batch Gradient Descent. Equation 4-7. Gradient Descent step</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_31.webp</image:loc>
      <image:caption>Batch Gradient Descent. Equation 4-7. Gradient Descent step</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_32.webp</image:loc>
      <image:caption>Batch Gradient Descent. Equation 4-7. Gradient Descent step</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_33.webp</image:loc>
      <image:caption>Batch Gradient Descent. Equation 4-7. Gradient Descent step: θ(next step) = θ − η·∇θ MSE(θ)</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_34.webp</image:loc>
      <image:caption>Batch Gradient Descent. Hey, that’s exactly what the Normal Equation found!. Gradient Descent with various learning rates</image:caption>
      <image:title>Batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_35.webp</image:loc>
      <image:caption>Stochastic Gradient Descent. On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average.</image:caption>
      <image:title>Stochastic Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_36.webp</image:loc>
      <image:caption>Stochastic Gradient Descent. Stochastic Gradient Descent first 10 steps. Note that since instances are picked randomly, some instances may be picked several times per epoch while others may not be picked at all.</image:caption>
      <image:title>Stochastic Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_37.webp</image:loc>
      <image:caption>Mini-batch Gradient Descent. The algorithm’s progress in parameter space is less erratic than with SGD, especially with fairly large mini-batches. Gradient Descent paths in parameter space</image:caption>
      <image:title>Mini-batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_38.webp</image:loc>
      <image:caption>Mini-batch Gradient Descent. Table row comparing algorithms for Linear Regression — Mini-batch GD: fast with large m, out-of-core support, fast with large n, ≥2 hyperparameters, scaling required, Scikit-Learn: n/a. There is almost no difference after training: all these algorithms end up with very similar models and make predictions in exactly the same way</image:caption>
      <image:title>Mini-batch Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_39.webp</image:loc>
      <image:caption>Polynomial Regression. y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1). Generated nonlinear and noisy dataset</image:caption>
      <image:title>Polynomial Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_40.webp</image:loc>
      <image:caption>Polynomial Regression. Polynomial Regression model predictions. Not bad: the model estimates ŷ = 0.56x² + 0.93x + 1.78 when in fact the original function was y = 0.5x² + 1.0x + 2.0 plus Gaussian noise</image:caption>
      <image:title>Polynomial Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_41.webp</image:loc>
      <image:caption>Polynomial Regression. features a², a³, b², and b³, but also the combinations ab, a²b, and ab². PolynomialFeatures(degree=d) transforms an array containing n features into an array containing (n + d)! / (d!·n!) features</image:caption>
      <image:title>Polynomial Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_42.webp</image:loc>
      <image:caption>Learning Curves. If you perform high-degree Polynomial Regression, you will likely fit the training data much better than with plain Linear Regression. High-degree Polynomial Regression</image:caption>
      <image:title>Learning Curves</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_43.webp</image:loc>
      <image:caption>Learning Curves. lin_reg = LinearRegression() plot_learning_curves(lin_reg, X, y). Learning curves</image:caption>
      <image:title>Learning Curves</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_44.webp</image:loc>
      <image:caption>Learning Curves. These learning curves are typical of an underfitting model. Both curves have reached a plateau; they are close and fairly high. If your model is underfitting the training data, adding more training examples will not help. You need to use a more complex model or come up with better features</image:caption>
      <image:title>Learning Curves</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_45.webp</image:loc>
      <image:caption>Learning Curves. Learning curves for the polynomial model</image:caption>
      <image:title>Learning Curves</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_46.webp</image:loc>
      <image:caption>Learning Curves. Learning curves for the polynomial model. One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error</image:caption>
      <image:title>Learning Curves</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_47.webp</image:loc>
      <image:caption>Ridge Regression. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. It is quite common for the cost function used during training to be different from the performance measure used for testing.</image:caption>
      <image:title>Ridge Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_48.webp</image:loc>
      <image:caption>Ridge Regression. Equation 4-8. Ridge Regression cost function</image:caption>
      <image:title>Ridge Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_49.webp</image:loc>
      <image:caption>Ridge Regression. Equation 4-8. Ridge Regression cost function</image:caption>
      <image:title>Ridge Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_50.webp</image:loc>
      <image:caption>Ridge Regression. Equation 4-8. Ridge Regression cost function</image:caption>
      <image:title>Ridge Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_51.webp</image:loc>
      <image:caption>Ridge Regression. J(θ) = MSE(θ) + α·(1/2)·Σᵢ₌₁ⁿ θᵢ²</image:caption>
      <image:title>Ridge Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_52.webp</image:loc>
      <image:caption>Ridge Regression. J(θ) = MSE(θ) + α·(1/2)·Σᵢ₌₁ⁿ θᵢ²</image:caption>
      <image:title>Ridge Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_53.webp</image:loc>
      <image:caption>Ridge Regression. J(θ) = MSE(θ) + α·(1/2)·Σᵢ₌₁ⁿ θᵢ²</image:caption>
      <image:title>Ridge Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_54.webp</image:loc>
      <image:caption>Ridge Regression. J(θ) = MSE(θ) + α·(1/2)·Σᵢ₌₁ⁿ θᵢ²</image:caption>
      <image:title>Ridge Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_55.webp</image:loc>
      <image:caption>Ridge Regression. Note that the bias term θ 0 is not regularized (the sum starts at i = 1, not 0).</image:caption>
      <image:title>Ridge Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_56.webp</image:loc>
      <image:caption>Ridge Regression. Note that the bias term θ 0 is not regularized (the sum starts at i = 1, not 0). It is important to scale the data (e.g., using a StandardScaler) before performing Ridge Regression, as it is sensitive to the scale of the input features. This is true of most regularized models</image:caption>
      <image:title>Ridge Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_57.webp</image:loc>
      <image:caption>Ridge Regression. Equation 4-9. Ridge Regression closed-form solution</image:caption>
      <image:title>Ridge Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_58.webp</image:loc>
      <image:caption>Ridge Regression. Equation 4-9. Ridge Regression closed-form solution</image:caption>
      <image:title>Ridge Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_59.webp</image:loc>
      <image:caption>Ridge Regression. Equation 4-9. Ridge Regression closed-form solution: θ̂ = (Xᵀ·X + αA)⁻¹·Xᵀ·y, where A is the identity matrix except with a 0 in the top-left cell (the bias term is not regularized)</image:caption>
      <image:title>Ridge Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_60.webp</image:loc>
      <image:caption>Lasso Regression. Equation 4-10. Lasso Regression cost function: J(θ) = MSE(θ) + α·Σᵢ₌₁ⁿ |θᵢ|</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_61.webp</image:loc>
      <image:caption>Lasso Regression. Equation 4-10. Lasso Regression cost function: J(θ) = MSE(θ) + α·Σᵢ₌₁ⁿ |θᵢ|</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_62.webp</image:loc>
      <image:caption>Lasso Regression. Equation 4-10. Lasso Regression cost function: J(θ) = MSE(θ) + α·Σᵢ₌₁ⁿ |θᵢ|</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_63.webp</image:loc>
      <image:caption>Lasso Regression. J(θ) = MSE(θ) + α·Σᵢ₌₁ⁿ |θᵢ|</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_64.webp</image:loc>
      <image:caption>Lasso Regression. The figure shows the same thing as the previous one but replaces Ridge models with Lasso models and uses smaller α values</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_65.webp</image:loc>
      <image:caption>Lasso Regression. the contours represent the same cost function plus an ℓ 1 penalty with α = 0.5. Lasso versus Ridge regularization</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_66.webp</image:loc>
      <image:caption>Lasso Regression. Lasso versus Ridge regularization. On the Lasso cost function, the BGD path tends to bounce across the gutter toward the end. This is because the slope changes abruptly at θ 2 = 0. You need to gradually reduce the learning rate in order to actually converge to the global minimum</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_67.webp</image:loc>
      <image:caption>Lasso Regression. Equation 4-11. Lasso Regression subgradient vector: g(θ, J) = ∇θ MSE(θ) + α·(sign(θ₁), sign(θ₂), …, sign(θₙ))ᵀ</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_68.webp</image:loc>
      <image:caption>Lasso Regression. sign(θᵢ) = −1 if θᵢ &lt; 0, 0 if θᵢ = 0, +1 if θᵢ &gt; 0</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_69.webp</image:loc>
      <image:caption>Lasso Regression. sign(θᵢ) = −1 if θᵢ &lt; 0, 0 if θᵢ = 0, +1 if θᵢ &gt; 0</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_70.webp</image:loc>
      <image:caption>Lasso Regression. sign(θᵢ) = −1 if θᵢ &lt; 0, 0 if θᵢ = 0, +1 if θᵢ &gt; 0</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_71.webp</image:loc>
      <image:caption>Lasso Regression</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_72.webp</image:loc>
      <image:caption>Lasso Regression. g(θ, J) = ∇θ MSE(θ) + α·(sign(θ₁), sign(θ₂), …, sign(θₙ))ᵀ</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_73.webp</image:loc>
      <image:caption>Lasso Regression. g(θ, J) = ∇θ MSE(θ) + α·(sign(θ₁), …, sign(θₙ))ᵀ</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_74.webp</image:loc>
      <image:caption>Lasso Regression. where sign(θᵢ) = −1 if θᵢ &lt; 0, 0 if θᵢ = 0, +1 if θᵢ &gt; 0</image:caption>
      <image:title>Lasso Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_75.webp</image:loc>
      <image:caption>Elastic Net. Equation 4-12. Elastic Net cost function</image:caption>
      <image:title>Elastic Net</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_76.webp</image:loc>
      <image:caption>Elastic Net. Equation 4-12. Elastic Net cost function</image:caption>
      <image:title>Elastic Net</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_77.webp</image:loc>
      <image:caption>Elastic Net. Equation 4-12. Elastic Net cost function</image:caption>
      <image:title>Elastic Net</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_78.webp</image:loc>
      <image:caption>Elastic Net. J(θ) = MSE(θ) + r·α·Σᵢ₌₁ⁿ |θᵢ| + ((1 − r)/2)·α·Σᵢ₌₁ⁿ θᵢ²</image:caption>
      <image:title>Elastic Net</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_79.webp</image:loc>
      <image:caption>Early Stopping. A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. Early stopping regularization</image:caption>
      <image:title>Early Stopping</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_80.webp</image:loc>
      <image:caption>Early Stopping. Early stopping regularization. With Stochastic and Mini-batch Gradient Descent, the curves are not so smooth, and it may be hard to know whether you have reached the minimum or not.</image:caption>
      <image:title>Early Stopping</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_81.webp</image:loc>
      <image:caption>Estimating Probabilities. Equation 4-13. Logistic Regression model estimated probability (vectorized form)</image:caption>
      <image:title>Estimating Probabilities</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_82.webp</image:loc>
      <image:caption>Estimating Probabilities. Equation 4-13. Logistic Regression model estimated probability (vectorized form)</image:caption>
      <image:title>Estimating Probabilities</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_83.webp</image:loc>
      <image:caption>Estimating Probabilities. Equation 4-13. Logistic Regression model estimated probability (vectorized form)</image:caption>
      <image:title>Estimating Probabilities</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_84.webp</image:loc>
      <image:caption>Estimating Probabilities. p̂ = h_θ(x) = σ(θᵀ·x)</image:caption>
      <image:title>Estimating Probabilities</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_85.webp</image:loc>
      <image:caption>Estimating Probabilities. Equation 4-14. Logistic function</image:caption>
      <image:title>Estimating Probabilities</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_86.webp</image:loc>
      <image:caption>Estimating Probabilities. Equation 4-14. Logistic function: σ(t) = 1 / (1 + exp(−t))</image:caption>
      <image:title>Estimating Probabilities</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_87.webp</image:loc>
      <image:caption>Estimating Probabilities. σ(t) = 1 / (1 + exp(−t))</image:caption>
      <image:title>Estimating Probabilities</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_88.webp</image:loc>
      <image:caption>Estimating Probabilities. σ(t) = 1 / (1 + exp(−t))</image:caption>
      <image:title>Estimating Probabilities</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_89.webp</image:loc>
      <image:caption>Estimating Probabilities. Logistic function. Once the Logistic Regression model has estimated the probability p̂ = h_θ(x) that an instance x belongs to the positive class, it can make its prediction ŷ easily (see Equation 4-15)</image:caption>
      <image:title>Estimating Probabilities</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_90.webp</image:loc>
      <image:caption>Training and Cost Function. Equation 4-16. Cost function of a single training instance: c(θ) = −log(p̂) if y = 1, −log(1 − p̂) if y = 0</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_91.webp</image:loc>
      <image:caption>Training and Cost Function. c(θ) = −log(p̂) if y = 1, −log(1 − p̂) if y = 0</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_92.webp</image:loc>
      <image:caption>Training and Cost Function. c(θ) = −log(p̂) if y = 1, −log(1 − p̂) if y = 0</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_93.webp</image:loc>
      <image:caption>Training and Cost Function. c(θ) = −log(p̂) if y = 1, −log(1 − p̂) if y = 0</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_94.webp</image:loc>
      <image:caption>Training and Cost Function. c(θ) = −log(p̂) if y = 1, −log(1 − p̂) if y = 0</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_95.webp</image:loc>
      <image:caption>Training and Cost Function. Equation 4-17. Logistic Regression cost function (log loss)</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_96.webp</image:loc>
      <image:caption>Training and Cost Function. Equation 4-17. Logistic Regression cost function (log loss)</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_97.webp</image:loc>
      <image:caption>Training and Cost Function. Equation 4-17. Logistic Regression cost function (log loss)</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_98.webp</image:loc>
      <image:caption>Training and Cost Function</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_99.webp</image:loc>
      <image:caption>Training and Cost Function. J(θ) = −(1/m) Σᵢ₌₁ᵐ [y⁽ⁱ⁾ log(p̂⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − p̂⁽ⁱ⁾)]</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_100.webp</image:loc>
      <image:caption>Training and Cost Function. J(θ) = −(1/m) Σᵢ₌₁ᵐ [y⁽ⁱ⁾ log(p̂⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − p̂⁽ⁱ⁾)]</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_101.webp</image:loc>
      <image:caption>Training and Cost Function. J(θ) = −(1/m) Σᵢ₌₁ᵐ [y⁽ⁱ⁾ log(p̂⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − p̂⁽ⁱ⁾)]</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_102.webp</image:loc>
      <image:caption>Training and Cost Function</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_103.webp</image:loc>
      <image:caption>Training and Cost Function</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_104.webp</image:loc>
      <image:caption>Training and Cost Function</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_105.webp</image:loc>
      <image:caption>Training and Cost Function. Equation 4-18. Logistic cost function partial derivatives</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_106.webp</image:loc>
      <image:caption>Training and Cost Function. Equation 4-18. Logistic cost function partial derivatives</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_107.webp</image:loc>
      <image:caption>Training and Cost Function. Equation 4-18. Logistic cost function partial derivatives</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_108.webp</image:loc>
      <image:caption>Training and Cost Function. ∂J(θ)/∂θⱼ = (1/m) Σᵢ₌₁ᵐ (σ(θᵀ·x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_109.webp</image:loc>
      <image:caption>Training and Cost Function. ∂J(θ)/∂θⱼ = (1/m) Σᵢ₌₁ᵐ (σ(θᵀ·x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_110.webp</image:loc>
      <image:caption>Training and Cost Function. ∂J(θ)/∂θⱼ = (1/m) Σᵢ₌₁ᵐ (σ(θᵀ·x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾</image:caption>
      <image:title>Training and Cost Function</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_111.webp</image:loc>
      <image:caption>Decision Boundaries. Flowers of three iris plant species. Let’s try to build a classifier to detect the Iris-Virginica type based only on the petal width feature. First let’s load the data</image:caption>
      <image:title>Decision Boundaries</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_112.webp</image:loc>
      <image:caption>Decision Boundaries. Estimated probabilities and decision boundary. The petal width of Iris-Virginica flowers (represented by triangles) ranges from 1.4 cm to 2.5 cm, while the other iris flowers (represented by squares) generally have a smaller petal width, ranging from 0.1 cm to 1.8 cm.</image:caption>
      <image:title>Decision Boundaries</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_113.webp</image:loc>
      <image:caption>Decision Boundaries. Linear decision boundary. Just like the other linear models, Logistic Regression models can be regularized using ℓ 1 or ℓ 2 penalties. Scikit-Learn actually adds an ℓ 2 penalty by default</image:caption>
      <image:title>Decision Boundaries</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_114.webp</image:loc>
      <image:caption>Decision Boundaries. Just like the other linear models, Logistic Regression models can be regularized using ℓ 1 or ℓ 2 penalties. Scikit-Learn actually adds an ℓ 2 penalty by default. The hyperparameter controlling the regularization strength of a Scikit-Learn LogisticRegression model is not alpha (as in other linear models), but its inverse: C. The higher the value of C, the less the model is regularized</image:caption>
      <image:title>Decision Boundaries</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_115.webp</image:loc>
      <image:caption>Softmax Regression. Equation 4-19. Softmax score for class k</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_116.webp</image:loc>
      <image:caption>Softmax Regression. Equation 4-19. Softmax score for class k: sₖ(x) = (θ⁽ᵏ⁾)ᵀ·x</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_117.webp</image:loc>
      <image:caption>Softmax Regression. Equation 4-20. Softmax function</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_118.webp</image:loc>
      <image:caption>Softmax Regression. Equation 4-20. Softmax function: p̂ₖ = σ(s(x))ₖ = exp(sₖ(x)) / Σⱼ₌₁ᴷ exp(sⱼ(x))</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_119.webp</image:loc>
      <image:caption>Softmax Regression</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_120.webp</image:loc>
      <image:caption>Softmax Regression</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_121.webp</image:loc>
      <image:caption>Softmax Regression. p̂ₖ = σ(s(x))ₖ = exp(sₖ(x)) / Σⱼ₌₁ᴷ exp(sⱼ(x))</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_122.webp</image:loc>
      <image:caption>Softmax Regression. Equation 4-21. Softmax Regression classifier prediction</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_123.webp</image:loc>
      <image:caption>Softmax Regression. Equation 4-21. Softmax Regression classifier prediction</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_124.webp</image:loc>
      <image:caption>Softmax Regression. Equation 4-21. Softmax Regression classifier prediction</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_125.webp</image:loc>
      <image:caption>Softmax Regression</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_126.webp</image:loc>
      <image:caption>Softmax Regression</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_127.webp</image:loc>
      <image:caption>Softmax Regression</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_128.webp</image:loc>
      <image:caption>Softmax Regression. ŷ = argmaxₖ σ(s(x))ₖ = argmaxₖ sₖ(x) = argmaxₖ ((θ⁽ᵏ⁾)ᵀ·x)</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_129.webp</image:loc>
      <image:caption>Softmax Regression. The argmax operator returns the value of a variable that maximizes a function. In this equation, it returns the value of k that maximizes the estimated probability σ(s(x))ₖ. The Softmax Regression classifier predicts only one class at a time (i.e., it is multiclass, not multioutput) so it should be used only with mutually exclusive classes such as different types of plants.</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_130.webp</image:loc>
      <image:caption>Softmax Regression. Chapter 4: Training Models</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_131.webp</image:loc>
      <image:caption>Softmax Regression. Cross entropy cost function: J(Θ) = −(1/m) Σᵢ₌₁ᵐ Σₖ₌₁ᴷ yₖ⁽ⁱ⁾ log(p̂ₖ⁽ⁱ⁾)</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_132.webp</image:loc>
      <image:caption>Softmax Regression. J(Θ) = −(1/m) Σᵢ₌₁ᵐ Σₖ₌₁ᴷ yₖ⁽ⁱ⁾ log(p̂ₖ⁽ⁱ⁾)</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_133.webp</image:loc>
      <image:caption>Softmax Regression. Equation 4-23. Cross entropy gradient vector for class k</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_134.webp</image:loc>
      <image:caption>Softmax Regression. Equation 4-23. Cross entropy gradient vector for class k: ∇θ⁽ᵏ⁾ J(Θ) = (1/m) Σᵢ₌₁ᵐ (p̂ₖ⁽ⁱ⁾ − yₖ⁽ⁱ⁾) x⁽ⁱ⁾</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_135.webp</image:loc>
      <image:caption>Softmax Regression</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter4_136.webp</image:loc>
      <image:caption>Softmax Regression. The figure shows the resulting decision boundaries, represented by the background colors. Softmax Regression decision boundaries</image:caption>
      <image:title>Softmax Regression</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter5</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_0.webp</image:loc>
      <image:caption>Linear SVM Classification. Large margin classification. Notice that adding more training instances “off the street” will not affect the decision boundary at all: it is fully determined (or “supported”) by the instances located on the edge of the street.</image:caption>
      <image:title>Linear SVM Classification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_1.webp</image:loc>
      <image:caption>Linear SVM Classification. Notice that adding more training instances “off the street” will not affect the decision boundary at all: it is fully determined (or “supported”) by the instances located on the edge of the street. SVMs are sensitive to the feature scales, as you can see in the figure: on the left plot, the vertical scale is much larger than the horizontal scale, so the widest possible street is close to horizontal.</image:caption>
      <image:title>Linear SVM Classification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_2.webp</image:loc>
      <image:caption>Linear SVM Classification. SVMs are sensitive to the feature scales, as you can see in the figure: on the left plot, the vertical scale is much larger than the horizontal scale, so the widest possible street is close to horizontal. Sensitivity to feature scales</image:caption>
      <image:title>Linear SVM Classification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_3.webp</image:loc>
      <image:caption>Soft Margin Classification. Hard margin sensitivity to outliers. To avoid these issues it is preferable to use a more flexible model.</image:caption>
      <image:title>Soft Margin Classification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_4.webp</image:loc>
      <image:caption>Soft Margin Classification. In Scikit-Learn’s SVM classes, you can control this balance using the C hyperparameter: a smaller C value leads to a wider street but more margin violations. The figure shows the decision boundaries and margins of two soft margin SVM classifiers on a nonlinearly separable dataset. Fewer margin violations versus large margin</image:caption>
      <image:title>Soft Margin Classification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_5.webp</image:loc>
      <image:caption>Soft Margin Classification. Fewer margin violations versus large margin. If your SVM model is overfitting, you can try regularizing it by reducing C</image:caption>
      <image:title>Soft Margin Classification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_6.webp</image:loc>
      <image:caption>Soft Margin Classification. array([ 1.]). Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class</image:caption>
      <image:title>Soft Margin Classification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_7.webp</image:loc>
      <image:caption>Soft Margin Classification. Alternatively, you could use the SVC class, using SVC(kernel=&quot;linear&quot;, C=1), but it is much slower, especially with large training sets, so it is not recommended. The LinearSVC class regularizes the bias term, so you should center the training set first by subtracting its mean.</image:caption>
      <image:title>Soft Margin Classification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_8.webp</image:loc>
      <image:caption>Nonlinear SVM Classification. Although linear SVM classifiers are efficient and work surprisingly well in many cases, many datasets are not even close to being linearly separable. Adding features to make a dataset linearly separable</image:caption>
      <image:title>Nonlinear SVM Classification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_9.webp</image:loc>
      <image:caption>Nonlinear SVM Classification. Linear SVM classifier using polynomial features</image:caption>
      <image:title>Nonlinear SVM Classification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_10.webp</image:loc>
      <image:caption>Polynomial Kernel. If your model is overfitting, you can try to reduce the polynomial degree. Conversely, if it is underfitting, you can try increasing it. The hyperparameter coef0 controls how much the model is influenced by high-degree polynomials versus low-degree polynomials. SVM classifiers with a polynomial kernel</image:caption>
      <image:title>Polynomial Kernel</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_11.webp</image:loc>
      <image:caption>Polynomial Kernel. SVM classifiers with a polynomial kernel. A common approach to find the right hyperparameter values is to use grid search (see Chapter 2 ).</image:caption>
      <image:title>Polynomial Kernel</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_12.webp</image:loc>
      <image:caption>Adding Similarity Features. Equation 5-1. Gaussian RBF</image:caption>
      <image:title>Adding Similarity Features</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_13.webp</image:loc>
      <image:caption>Adding Similarity Features. Equation 5-1. Gaussian RBF</image:caption>
      <image:title>Adding Similarity Features</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_14.webp</image:loc>
      <image:caption>Adding Similarity Features. Equation 5-1. Gaussian RBF</image:caption>
      <image:title>Adding Similarity Features</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_15.webp</image:loc>
      <image:caption>Adding Similarity Features</image:caption>
      <image:title>Adding Similarity Features</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_16.webp</image:loc>
      <image:caption>Adding Similarity Features</image:caption>
      <image:title>Adding Similarity Features</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_17.webp</image:loc>
      <image:caption>Adding Similarity Features. ϕγ(x, ℓ) = exp(−γ ‖x − ℓ‖²)</image:caption>
      <image:title>Adding Similarity Features</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_18.webp</image:loc>
      <image:caption>Adding Similarity Features. Similarity features using the Gaussian RBF. You may wonder how to select the landmarks.</image:caption>
      <image:title>Adding Similarity Features</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_19.webp</image:loc>
      <image:caption>Gaussian RBF Kernel. SVM classifiers using an RBF kernel. Other kernels exist but are used much more rarely.</image:caption>
      <image:title>Gaussian RBF Kernel</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_20.webp</image:loc>
      <image:caption>Gaussian RBF Kernel. Other kernels exist but are used much more rarely. With so many kernels to choose from, how can you decide which one to use?</image:caption>
      <image:title>Gaussian RBF Kernel</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_21.webp</image:loc>
      <image:caption>SVM Regression. Chapter 5: Support Vector Machines. Adding more training instances within the margin does not affect the model’s predictions; thus, the model is said to be ϵ-insensitive</image:caption>
      <image:title>SVM Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_22.webp</image:loc>
      <image:caption>SVM Regression. To tackle nonlinear regression tasks, you can use a kernelized SVM model. SVM regression using a 2nd-degree polynomial kernel</image:caption>
      <image:title>SVM Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_23.webp</image:loc>
      <image:caption>SVM Regression. svm_poly_reg = SVR(kernel=&quot;poly&quot;, degree=2, C=100, epsilon=0.1) svm_poly_reg.fit(X, y). SVMs can also be used for outlier detection; see Scikit-Learn’s documentation for more details</image:caption>
      <image:title>SVM Regression</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_24.webp</image:loc>
      <image:caption>Decision Function and Predictions. The figure shows the decision function that corresponds to the model on the right of the previous figure: it is a two-dimensional plane since this dataset has two features (petal width and petal length). Decision function for the iris dataset</image:caption>
      <image:title>Decision Function and Predictions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_25.webp</image:loc>
      <image:caption>Training Objective. The dashed lines represent the points where the decision function is equal to 1 or –1: they are parallel and at equal distance to the decision boundary, forming a margin around it.</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_26.webp</image:loc>
      <image:caption>Training Objective. The dashed lines represent the points where the decision function is equal to 1 or –1: they are parallel and at equal distance to the decision boundary, forming a margin around it. Consider the slope of the decision function: it is equal to the norm of the weight vector, w.</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_27.webp</image:loc>
      <image:caption>Training Objective. A smaller weight vector results in a larger margin</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_28.webp</image:loc>
      <image:caption>Training Objective. A smaller weight vector results in a larger margin</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_29.webp</image:loc>
      <image:caption>Training Objective. A smaller weight vector results in a larger margin. So we want to minimize w to get a large margin.</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_30.webp</image:loc>
      <image:caption>Training Objective. subject to t⁽ⁱ⁾(wᵀ·x⁽ⁱ⁾ + b) ≥ 1 for i = 1, 2, ⋯, m</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_31.webp</image:loc>
      <image:caption>Training Objective</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_32.webp</image:loc>
      <image:caption>Training Objective</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_33.webp</image:loc>
      <image:caption>Training Objective. We are minimizing ½ wᵀ·w, which is equal to ½‖w‖², rather than minimizing ‖w‖.</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_34.webp</image:loc>
      <image:caption>Training Objective</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_35.webp</image:loc>
      <image:caption>Training Objective. Minimizing ½‖w‖² rather than ‖w‖ gives the same result (since the values of w and b that minimize a value also minimize half of its square).</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_36.webp</image:loc>
      <image:caption>Training Objective. Minimizing ½‖w‖² rather than ‖w‖ gives the same result (since the values of w and b that minimize a value also minimize half of its square).</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_37.webp</image:loc>
      <image:caption>Training Objective. The values of w and b that minimize a value also minimize half of its square, and ½‖w‖² has a nice and simple derivative (it is just w).</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_38.webp</image:loc>
      <image:caption>Training Objective</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_39.webp</image:loc>
      <image:caption>Training Objective. ½‖w‖² has a simple derivative (it is just w), while ‖w‖ is not differentiable at w = 0. Optimization algorithms work much better on differentiable functions</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_40.webp</image:loc>
      <image:caption>Training Objective. subject to t⁽ⁱ⁾(wᵀ·x⁽ⁱ⁾ + b) ≥ 1 − ζ⁽ⁱ⁾ and ζ⁽ⁱ⁾ ≥ 0 for i = 1, 2, ⋯, m</image:caption>
      <image:title>Training Objective</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_41.webp</image:loc>
      <image:caption>The Dual Problem. Equation 5-6. Dual form of the linear SVM objective</image:caption>
      <image:title>The Dual Problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_42.webp</image:loc>
      <image:caption>The Dual Problem. Equation 5-6. Dual form of the linear SVM objective</image:caption>
      <image:title>The Dual Problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_43.webp</image:loc>
      <image:caption>The Dual Problem. Equation 5-6. Dual form of the linear SVM objective</image:caption>
      <image:title>The Dual Problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_44.webp</image:loc>
      <image:caption>The Dual Problem</image:caption>
      <image:title>The Dual Problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_45.webp</image:loc>
      <image:caption>The Dual Problem</image:caption>
      <image:title>The Dual Problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_46.webp</image:loc>
      <image:caption>The Dual Problem. minimize over α: ½ Σᵢ₌₁ᵐ Σⱼ₌₁ᵐ αᵢ αⱼ t⁽ⁱ⁾ t⁽ʲ⁾ (x⁽ⁱ⁾)ᵀ·x⁽ʲ⁾ − Σᵢ₌₁ᵐ αᵢ</image:caption>
      <image:title>The Dual Problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_47.webp</image:loc>
      <image:caption>The Dual Problem. ŵ = Σᵢ₌₁ᵐ α̂⁽ⁱ⁾ t⁽ⁱ⁾ x⁽ⁱ⁾</image:caption>
      <image:title>The Dual Problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_48.webp</image:loc>
      <image:caption>The Dual Problem. α⁽ⁱ⁾ &gt; 0</image:caption>
      <image:title>The Dual Problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_49.webp</image:loc>
      <image:caption>The Dual Problem. α⁽ⁱ⁾ &gt; 0</image:caption>
      <image:title>The Dual Problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_50.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_51.webp</image:loc>
      <image:caption>Kernelized SVM. Equation 5-9. Kernel trick for a 2nd-degree polynomial mapping</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_52.webp</image:loc>
      <image:caption>Kernelized SVM. Equation 5-9. Kernel trick for a 2nd-degree polynomial mapping</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_53.webp</image:loc>
      <image:caption>Kernelized SVM. Equation 5-9. Kernel trick for a 2nd-degree polynomial mapping</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_54.webp</image:loc>
      <image:caption>Kernelized SVM. ϕ(a)ᵀ·ϕ(b) = (aᵀ·b)²</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_55.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_56.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_57.webp</image:loc>
      <image:caption>Kernelized SVM. How about that? The dot product of the transformed vectors is equal to the square of the dot product of the original vectors: ϕ(a)ᵀ·ϕ(b) = (aᵀ·b)²</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_58.webp</image:loc>
      <image:caption>Kernelized SVM. How about that? The dot product of the transformed vectors is equal to the square of the dot product of the original vectors: ϕ(a)ᵀ·ϕ(b) = (aᵀ·b)²</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_59.webp</image:loc>
      <image:caption>Kernelized SVM. How about that? The dot product of the transformed vectors is equal to the square of the dot product of the original vectors: ϕ(a)ᵀ · ϕ(b) = (aᵀ · b)². Now here is the key insight: if you apply the transformation ϕ to all training instances, then the dual problem (see Equation 5-6) will contain the dot product ϕ(x⁽ⁱ⁾)ᵀ · ϕ(x⁽ʲ⁾).</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_60.webp</image:loc>
      <image:caption>Kernelized SVM. Equation 5-10. Common kernels</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_61.webp</image:loc>
      <image:caption>Kernelized SVM. Equation 5-10. Common kernels. Linear: K(a, b) = aᵀ · b</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_62.webp</image:loc>
      <image:caption>Kernelized SVM. Linear: K(a, b) = aᵀ · b</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_63.webp</image:loc>
      <image:caption>Kernelized SVM. Linear: K(a, b) = aᵀ · b</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_64.webp</image:loc>
      <image:caption>Kernelized SVM. Linear: K(a, b) = aᵀ · b</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_65.webp</image:loc>
      <image:caption>Kernelized SVM. Polynomial: K(a, b) = (γ aᵀ · b + r)ᵈ</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_66.webp</image:loc>
      <image:caption>Kernelized SVM. Polynomial: K(a, b) = (γ aᵀ · b + r)ᵈ</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_67.webp</image:loc>
      <image:caption>Kernelized SVM. Polynomial: K(a, b) = (γ aᵀ · b + r)ᵈ</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_68.webp</image:loc>
      <image:caption>Kernelized SVM. Polynomial: K(a, b) = (γ aᵀ · b + r)ᵈ</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_69.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_70.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_71.webp</image:loc>
      <image:caption>Kernelized SVM. Gaussian RBF: K(a, b) = exp(−γ ∥a − b∥²)</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_72.webp</image:loc>
      <image:caption>Kernelized SVM. Gaussian RBF: K(a, b) = exp(−γ ∥a − b∥²)</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_73.webp</image:loc>
      <image:caption>Kernelized SVM. Gaussian RBF: K(a, b) = exp(−γ ∥a − b∥²). Sigmoid: K(a, b) = tanh(γ aᵀ · b + r)</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_74.webp</image:loc>
      <image:caption>Kernelized SVM. Equation 5-11. Making predictions with a kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_75.webp</image:loc>
      <image:caption>Kernelized SVM. Equation 5-11. Making predictions with a kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_76.webp</image:loc>
      <image:caption>Kernelized SVM. Equation 5-11. Making predictions with a kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_77.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_78.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_79.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_80.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_81.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_82.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_83.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_84.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_85.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_86.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_87.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_88.webp</image:loc>
      <image:caption>Kernelized SVM. h(x⁽ⁿ⁾) = Σᵢ α̂⁽ⁱ⁾ t⁽ⁱ⁾ K(x⁽ⁱ⁾, x⁽ⁿ⁾) + b̂, summing over the support vectors (α̂⁽ⁱ⁾ > 0)</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_89.webp</image:loc>
      <image:caption>Kernelized SVM. h(x⁽ⁿ⁾) = Σᵢ α̂⁽ⁱ⁾ t⁽ⁱ⁾ K(x⁽ⁱ⁾, x⁽ⁿ⁾) + b̂, summing over the support vectors (α̂⁽ⁱ⁾ > 0)</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_90.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_91.webp</image:loc>
      <image:caption>Kernelized SVM. α̂⁽ⁱ⁾ &gt; 0. Note that since α̂⁽ⁱ⁾ ≠ 0 only for support vectors, making predictions involves computing the dot product of the new input vector x⁽ⁿ⁾ with only the support vectors, not all the training instances.</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_92.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_93.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_94.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_95.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_96.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_97.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_98.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_99.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_100.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_101.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_102.webp</image:loc>
      <image:caption>Kernelized SVM. b̂ = (1/nₛ) Σᵢ (1 − t⁽ⁱ⁾ Σⱼ α̂⁽ʲ⁾ t⁽ʲ⁾ K(x⁽ⁱ⁾, x⁽ʲ⁾)), averaging over the nₛ support vectors (α̂⁽ⁱ⁾ &gt; 0)</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_103.webp</image:loc>
      <image:caption>Kernelized SVM. b̂ = (1/nₛ) Σᵢ (1 − t⁽ⁱ⁾ Σⱼ α̂⁽ʲ⁾ t⁽ʲ⁾ K(x⁽ⁱ⁾, x⁽ʲ⁾)), averaging over the nₛ support vectors (α̂⁽ⁱ⁾ &gt; 0)</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_104.webp</image:loc>
      <image:caption>Kernelized SVM. b̂ = (1/nₛ) Σᵢ (1 − t⁽ⁱ⁾ Σⱼ α̂⁽ʲ⁾ t⁽ʲ⁾ K(x⁽ⁱ⁾, x⁽ʲ⁾)), averaging over the nₛ support vectors (α̂⁽ⁱ⁾ &gt; 0)</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_105.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_106.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_107.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_108.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_109.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_110.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_111.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_112.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_113.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_114.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_115.webp</image:loc>
      <image:caption>Kernelized SVM</image:caption>
      <image:title>Kernelized SVM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_116.webp</image:loc>
      <image:caption>Online SVMs. Equation 5-13. Linear SVM classifier cost function</image:caption>
      <image:title>Online SVMs</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_117.webp</image:loc>
      <image:caption>Online SVMs. Equation 5-13. Linear SVM classifier cost function</image:caption>
      <image:title>Online SVMs</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_118.webp</image:loc>
      <image:caption>Online SVMs. Equation 5-13. Linear SVM classifier cost function</image:caption>
      <image:title>Online SVMs</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_119.webp</image:loc>
      <image:caption>Online SVMs. J(w, b) = (1/2) wᵀ · w + C Σᵢ₌₁ᵐ max(0, 1 − t⁽ⁱ⁾(wᵀ · x⁽ⁱ⁾ + b))</image:caption>
      <image:title>Online SVMs</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_120.webp</image:loc>
      <image:caption>Online SVMs. J(w, b) = (1/2) wᵀ · w + C Σᵢ₌₁ᵐ max(0, 1 − t⁽ⁱ⁾(wᵀ · x⁽ⁱ⁾ + b))</image:caption>
      <image:title>Online SVMs</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter5_121.webp</image:loc>
      <image:caption>Online SVMs. It is also possible to implement online kernelized SVMs, for example using "Incremental and Decremental SVM Learning" or "Fast Kernel Classifiers with Online and Active Learning." However, these are implemented in Matlab and C++.</image:caption>
      <image:title>Online SVMs</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter6</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_0.webp</image:loc>
      <image:caption>Training and Visualizing a Decision Tree. Your first decision tree is shown in this figure. Iris Decision Tree</image:caption>
      <image:title>Training and Visualizing a Decision Tree</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_1.webp</image:loc>
      <image:caption>Making Predictions. 2.45 cm. One of the many qualities of Decision Trees is that they require very little data preparation. In particular, they don’t require feature scaling or centering at all</image:caption>
      <image:title>Making Predictions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_2.webp</image:loc>
      <image:caption>Making Predictions. pᵢ,ₖ is the ratio of class k instances among the training instances in the iᵗʰ node. Scikit-Learn uses the CART algorithm, which produces only binary trees: nonleaf nodes always have two children (i.e., questions only have yes/no answers).</image:caption>
      <image:title>Making Predictions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_3.webp</image:loc>
      <image:caption>Making Predictions. This figure shows the Decision Tree's decision boundaries. Decision Tree decision boundaries</image:caption>
      <image:title>Making Predictions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_4.webp</image:loc>
      <image:caption>The CART Training Algorithm. min_samples_leaf, min_weight_fraction_leaf, and max_leaf_nodes. As you can see, the CART algorithm is a greedy algorithm: it greedily searches for an optimum split at the top level, then repeats the process at each level.</image:caption>
      <image:title>The CART Training Algorithm</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_5.webp</image:loc>
      <image:caption>Gini Impurity or Entropy?. pᵢ,ₖ ≠ 0</image:caption>
      <image:title>Gini Impurity or Entropy?</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_6.webp</image:loc>
      <image:caption>Gini Impurity or Entropy?. Hᵢ = − Σₖ pᵢ,ₖ log(pᵢ,ₖ), summed over classes with pᵢ,ₖ ≠ 0</image:caption>
      <image:title>Gini Impurity or Entropy?</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_7.webp</image:loc>
      <image:caption>Regularization Hyperparameters. min_samples_split (the minimum number of samples a node must have before it can be split), min_samples_leaf (the minimum number of samples a leaf node must have), min_weight_fraction_leaf (same as min_samples_leaf but expressed as a fraction of the total number of weighted instances), max_leaf_nodes (maximum number of leaf nodes), and max_features (maximum number of features that are evaluated for splitting at each node). Other algorithms work by first training the Decision Tree without restrictions, then pruning (deleting) unnecessary nodes.</image:caption>
      <image:title>Regularization Hyperparameters</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_8.webp</image:loc>
      <image:caption>Regularization Hyperparameters. shows two Decision Trees trained on the moons dataset (introduced in Chapter 5 ). Regularization using min_samples_leaf</image:caption>
      <image:title>Regularization Hyperparameters</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_9.webp</image:loc>
      <image:caption>Regularization Hyperparameters. The resulting tree is represented in this figure. A Decision Tree for regression</image:caption>
      <image:title>Regularization Hyperparameters</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_10.webp</image:loc>
      <image:caption>Regularization Hyperparameters. Predictions of two Decision Tree regression models. The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE.</image:caption>
      <image:title>Regularization Hyperparameters</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_11.webp</image:loc>
      <image:caption>Regularization Hyperparameters. J(k, tₖ) = (m_left/m) · MSE_left + (m_right/m) · MSE_right</image:caption>
      <image:title>Regularization Hyperparameters</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_12.webp</image:loc>
      <image:caption>Regularization Hyperparameters. Just like for classification tasks, Decision Trees are prone to overfitting when dealing with regression tasks. Regularizing a Decision Tree regressor</image:caption>
      <image:title>Regularization Hyperparameters</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_13.webp</image:loc>
      <image:caption>Regularization Hyperparameters. Hopefully by now you are convinced that Decision Trees have a lot going for them: they are simple to understand and interpret, easy to use, versatile, and powerful. Sensitivity to training set rotation</image:caption>
      <image:title>Regularization Hyperparameters</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter6_14.webp</image:loc>
      <image:caption>Regularization Hyperparameters. Instability. Sensitivity to training set details</image:caption>
      <image:title>Regularization Hyperparameters</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter7</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_0.webp</image:loc>
      <image:caption>Voting Classifiers. Training diverse classifiers. A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier.</image:caption>
      <image:title>Voting Classifiers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_1.webp</image:loc>
      <image:caption>Voting Classifiers. A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier. Hard voting classifier predictions</image:caption>
      <image:title>Voting Classifiers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_2.webp</image:loc>
      <image:caption>Voting Classifiers. How is this possible?. The law of large numbers</image:caption>
      <image:title>Voting Classifiers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_3.webp</image:loc>
      <image:caption>Voting Classifiers. Similarly, suppose you build an ensemble containing 1,000 classifiers that are individually correct only 51% of the time (barely better than random guessing). Ensemble methods work best when the predictors are as independent from one another as possible.</image:caption>
      <image:title>Voting Classifiers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_4.webp</image:loc>
      <image:caption>Bagging and Pasting. In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor. Pasting/bagging training set sampling and training</image:caption>
      <image:title>Bagging and Pasting</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_5.webp</image:loc>
      <image:caption>Bagging and Pasting in Scikit-Learn. bag_clf.fit(X_train, y_train) y_pred = bag_clf.predict(X_test). The BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is the case with Decision Tree classifiers</image:caption>
      <image:title>Bagging and Pasting in Scikit-Learn</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_6.webp</image:loc>
      <image:caption>Bagging and Pasting in Scikit-Learn. A single Decision Tree versus a bagging ensemble of 500 trees. Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting, but this also means that predictors end up being less correlated so the ensemble’s variance is reduced.</image:caption>
      <image:title>Bagging and Pasting in Scikit-Learn</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_7.webp</image:loc>
      <image:caption>Random Forests. You can create an Extra-Trees classifier using Scikit-Learn's ExtraTreesClassifier class. Its API is identical to the RandomForestClassifier class. Similarly, the ExtraTreesRegressor class has the same API as the RandomForestRegressor class. It is hard to tell in advance whether a RandomForestClassifier will perform better or worse than an ExtraTreesClassifier.</image:caption>
      <image:title>Random Forests</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_8.webp</image:loc>
      <image:caption>Feature Importance. Similarly, if you train a Random Forest classifier on the MNIST dataset (introduced in Chapter 3 ) and plot each pixel’s importance, you get the image represented in. MNIST pixel importance (according to a Random Forest classifier)</image:caption>
      <image:title>Feature Importance</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_9.webp</image:loc>
      <image:caption>AdaBoost. For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to make predictions on the training set. AdaBoost sequential training with instance weight updates</image:caption>
      <image:title>AdaBoost</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_10.webp</image:loc>
      <image:caption>AdaBoost. parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better. Decision boundaries of consecutive predictors</image:caption>
      <image:title>AdaBoost</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_11.webp</image:loc>
      <image:caption>AdaBoost. Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that predictors have different weights depending on their overall accuracy on the weighted training set. There is one important drawback to this sequential learning technique: it cannot be parallelized (or only partially), since each predictor can only be trained after the previous predictor has been trained and evaluated.</image:caption>
      <image:title>AdaBoost</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_12.webp</image:loc>
      <image:caption>AdaBoost. Weight update rule: w⁽ⁱ⁾ ← w⁽ⁱ⁾ if ŷⱼ⁽ⁱ⁾ = y⁽ⁱ⁾, and w⁽ⁱ⁾ ← w⁽ⁱ⁾ exp(αⱼ) if ŷⱼ⁽ⁱ⁾ ≠ y⁽ⁱ⁾</image:caption>
      <image:title>AdaBoost</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_13.webp</image:loc>
      <image:caption>AdaBoost</image:caption>
      <image:title>AdaBoost</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_14.webp</image:loc>
      <image:caption>AdaBoost</image:caption>
      <image:title>AdaBoost</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_15.webp</image:loc>
      <image:caption>AdaBoost. w⁽ⁱ⁾ ← w⁽ⁱ⁾ exp(αⱼ) if ŷⱼ⁽ⁱ⁾ ≠ y⁽ⁱ⁾</image:caption>
      <image:title>AdaBoost</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_16.webp</image:loc>
      <image:caption>AdaBoost. w⁽ⁱ⁾ ← w⁽ⁱ⁾ exp(αⱼ) if ŷⱼ⁽ⁱ⁾ ≠ y⁽ⁱ⁾. Then all the instance weights are normalized (i.e., divided by Σᵢ₌₁ᵐ w⁽ⁱ⁾)</image:caption>
      <image:title>AdaBoost</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_17.webp</image:loc>
      <image:caption>AdaBoost</image:caption>
      <image:title>AdaBoost</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_18.webp</image:loc>
      <image:caption>AdaBoost. ŷ(x) = argmaxₖ Σⱼ αⱼ, summing over the predictors j for which ŷⱼ(x) = k</image:caption>
      <image:title>AdaBoost</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_19.webp</image:loc>
      <image:caption>AdaBoost. ada_clf.fit(X_train, y_train). If your AdaBoost ensemble is overfitting the training set, you can try reducing the number of estimators or more strongly regularizing the base estimator</image:caption>
      <image:title>AdaBoost</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_20.webp</image:loc>
      <image:caption>Gradient Boosting. Chapter 7: Ensemble Learning and Random Forests. The learning_rate hyperparameter scales the contribution of each tree.</image:caption>
      <image:title>Gradient Boosting</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_21.webp</image:loc>
      <image:caption>Gradient Boosting. GBRT ensembles with not enough predictors (left) and too many (right). In order to find the optimal number of trees, you can use early stopping (see Chap‐ ter 4 ).</image:caption>
      <image:title>Gradient Boosting</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_22.webp</image:loc>
      <image:caption>Gradient Boosting. Tuning the number of trees using early stopping. It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees first and then looking back to find the optimal number).</image:caption>
      <image:title>Gradient Boosting</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_23.webp</image:loc>
      <image:caption>Gradient Boosting. Boosting. It is possible to use Gradient Boosting with other cost functions. This is controlled by the loss hyperparameter (see Scikit-Learn’s documentation for more details)</image:caption>
      <image:title>Gradient Boosting</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_24.webp</image:loc>
      <image:caption>Stacking. The last Ensemble method we will discuss in this chapter is called stacking (short for stacked generalization ). Aggregating predictions using a blending predictor</image:caption>
      <image:title>Stacking</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_25.webp</image:loc>
      <image:caption>Stacking. Training the first layer. Next, the first layer predictors are used to make predictions on the second (held-out) set (see.</image:caption>
      <image:title>Stacking</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_26.webp</image:loc>
      <image:caption>Stacking. Next, the first layer predictors are used to make predictions on the second (held-out) set (see. Training the blender</image:caption>
      <image:title>Stacking</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter7_27.webp</image:loc>
      <image:caption>Stacking. It is actually possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression, and so on): we get a whole layer of blenders. Predictions in a multilayer stacking ensemble</image:caption>
      <image:title>Stacking</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter8</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_0.webp</image:loc>
      <image:caption>Dimensionality Reduction. Fortunately, in real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one. Reducing dimensionality does lose some information (just like compressing an image to JPEG can degrade its quality), so even though it will speed up training, it may also make your system perform slightly worse.</image:caption>
      <image:title>Dimensionality Reduction</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_1.webp</image:loc>
      <image:caption>The Curse of Dimensionality. We are so used to living in three dimensions that our intuition fails us when we try to imagine a high-dimensional space. Point, segment, square, cube, and tesseract (0D to 4D hypercubes)</image:caption>
      <image:title>The Curse of Dimensionality</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_2.webp</image:loc>
      <image:caption>Projection. A 3D dataset lying close to a 2D subspace. Notice that all training instances lie close to a plane: this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space.</image:caption>
      <image:title>Projection</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_3.webp</image:loc>
      <image:caption>Projection. Notice that all training instances lie close to a plane: this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space. The new 2D dataset after projection</image:caption>
      <image:title>Projection</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_4.webp</image:loc>
      <image:caption>Projection. However, projection is not always the best approach to dimensionality reduction. In many cases the subspace may twist and turn, such as in the famous Swiss roll toy dataset represented in. Swiss roll dataset</image:caption>
      <image:title>Projection</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_5.webp</image:loc>
      <image:caption>Projection. Simply projecting onto a plane (e.g., by dropping x3) would squash different layers of the Swiss roll together, as shown on the left. However, what you really want is to unroll the Swiss roll to obtain the 2D dataset on the right. Squashing by projecting onto a plane (left) versus unrolling the Swiss roll (right)</image:caption>
      <image:title>Projection</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_6.webp</image:loc>
      <image:caption>Manifold Learning. The decision boundary may not always be simpler with lower dimensions</image:caption>
      <image:title>Manifold Learning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_7.webp</image:loc>
      <image:caption>Preserving the Variance. Selecting the subspace onto which to project. It seems reasonable to select the axis that preserves the maximum amount of variance, as it will most likely lose less information than the other projections.</image:caption>
      <image:title>Preserving the Variance</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_8.webp</image:loc>
      <image:caption>Principal Components. Chapter 8: Dimensionality Reduction. The direction of the principal components is not stable: if you perturb the training set slightly and run PCA again, some of the new PCs may point in the opposite direction of the original PCs.</image:caption>
      <image:title>Principal Components</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_9.webp</image:loc>
      <image:caption>Principal Components. c2 = V.T[:, 1]. PCA assumes that the dataset is centered around the origin.</image:caption>
      <image:title>Principal Components</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_10.webp</image:loc>
      <image:caption>Choosing the Right Number of Dimensions. Yet another option is to plot the explained variance as a function of the number of dimensions (simply plot cumsum). Explained variance as a function of the number of dimensions</image:caption>
      <image:title>Choosing the Right Number of Dimensions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_11.webp</image:loc>
      <image:caption>Choosing the Right Number of Dimensions. X_mnist_recovered = pca.inverse_transform(X_mnist_reduced). MNIST compression preserving 95% of the variance</image:caption>
      <image:title>Choosing the Right Number of Dimensions</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_12.webp</image:loc>
      <image:caption>Kernel PCA. Swiss roll reduced to 2D using kPCA with various kernels</image:caption>
      <image:title>Kernel PCA</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_13.webp</image:loc>
      <image:caption>Selecting a Kernel and Tuning Hyperparameters. Another approach, this time entirely unsupervised, is to select the kernel and hyperparameters that yield the lowest reconstruction error. Kernel PCA and the reconstruction pre-image error</image:caption>
      <image:title>Selecting a Kernel and Tuning Hyperparameters</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_14.webp</image:loc>
      <image:caption>Selecting a Kernel and Tuning Hyperparameters. X_preimage = rbf_pca.inverse_transform(X_reduced). By default, fit_inverse_transform=False and KernelPCA has no inverse_transform() method. This method only gets created when you set fit_inverse_transform=True</image:caption>
      <image:title>Selecting a Kernel and Tuning Hyperparameters</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_15.webp</image:loc>
      <image:caption>LLE. lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10) X_reduced = lle.fit_transform(X). Unrolled Swiss roll using LLE</image:caption>
      <image:title>LLE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_16.webp</image:loc>
      <image:caption>LLE. Equation 8-4. LLE step 1: linearly modeling local relationships</image:caption>
      <image:title>LLE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_17.webp</image:loc>
      <image:caption>LLE</image:caption>
      <image:title>LLE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_18.webp</image:loc>
      <image:caption>LLE</image:caption>
      <image:title>LLE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_19.webp</image:loc>
      <image:caption>LLE</image:caption>
      <image:title>LLE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_20.webp</image:loc>
      <image:caption>LLE</image:caption>
      <image:title>LLE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_21.webp</image:loc>
      <image:caption>LLE. Constraints: wi,j = 0 if x(j) is not one of the k closest neighbors of x(i), and the weights sum to 1: ∑j wi,j = 1 for i = 1, 2, ⋯, m</image:caption>
      <image:title>LLE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_22.webp</image:loc>
      <image:caption>LLE. After this step, the weight matrix (containing the weights wi,j) encodes the local linear relationships between the training instances.</image:caption>
      <image:title>LLE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_23.webp</image:loc>
      <image:caption>LLE. If z(i) is the image of x(i) in the lower-dimensional space, then we want the squared distance between z(i) and ∑j wi,j z(j) to be as small as possible.</image:caption>
      <image:title>LLE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_24.webp</image:loc>
      <image:caption>LLE. Equation 8-5. LLE step 2: reducing dimensionality while preserving relationships</image:caption>
      <image:title>LLE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_25.webp</image:loc>
      <image:caption>LLE</image:caption>
      <image:title>LLE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_26.webp</image:loc>
      <image:caption>LLE</image:caption>
      <image:title>LLE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_27.webp</image:loc>
      <image:caption>LLE</image:caption>
      <image:title>LLE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter8_28.webp</image:loc>
      <image:caption>Other Dimensionality Reduction Techniques. Linear Discriminant Analysis (LDA) is actually a classification algorithm, but during training it learns the most discriminative axes between the classes, and these axes can then be used to define a hyperplane onto which to project the data. Reducing the Swiss roll to 2D using various techniques</image:caption>
      <image:title>Other Dimensionality Reduction Techniques</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/ml/mlchapter9</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_0.webp</image:loc>
      <image:caption>Up and Running with TensorFlow. TensorFlow is a powerful open source software library for numerical computation, particularly well suited and fine-tuned for large-scale Machine Learning. A simple computation graph</image:caption>
      <image:title>Up and Running with TensorFlow</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_1.webp</image:loc>
      <image:caption>Up and Running with TensorFlow. Developed by the Google Brain team, it powers many of Google’s large-scale services, such as Google Cloud Speech, Google Photos, and Google Search. Parallel computation on multiple CPUs/GPUs/servers</image:caption>
      <image:title>Up and Running with TensorFlow</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_2.webp</image:loc>
      <image:caption>Installation. $ pip3 install --upgrade tensorflow. For GPU support, you need to install tensorflow-gpu instead of tensorflow</image:caption>
      <image:title>Installation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_3.webp</image:loc>
      <image:caption>Managing Graphs. &gt;&gt;&gt; x2.graph is tf.get_default_graph() False. In Jupyter (or in a Python shell), it is common to run the same commands more than once while you are experimenting.</image:caption>
      <image:title>Managing Graphs</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_4.webp</image:loc>
      <image:caption>Lifecycle of a Node Value. print (z_val) # 15. In single-process TensorFlow, multiple sessions do not share any state, even if they reuse the same graph (each session would have its own copy of every variable).</image:caption>
      <image:title>Lifecycle of a Node Value</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_5.webp</image:loc>
      <image:caption>Implementing Gradient Descent. Let’s try using Batch Gradient Descent (introduced in Chapter 4) instead of the Normal Equation. When using Gradient Descent, remember that it is important to first normalize the input feature vectors, or else training may be much slower.</image:caption>
      <image:title>Implementing Gradient Descent</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_6.webp</image:loc>
      <image:caption>Feeding Data to the Training Algorithm. [ 12. 13. 14.]]. You can actually feed the output of any operations, not just placeholders. In this case TensorFlow does not try to evaluate these operations; it uses the values you feed it</image:caption>
      <image:title>Feeding Data to the Training Algorithm</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_7.webp</image:loc>
      <image:caption>Feeding Data to the Training Algorithm. best_theta = theta.eval(). We don’t need to pass the value of X and y when evaluating theta</image:caption>
      <image:title>Feeding Data to the Training Algorithm</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_8.webp</image:loc>
      <image:caption>Visualizing the Graph and Training Curves Using TensorBoard. sess.run(training_op, feed_dict={X: X_batch, y: y_batch}) [..]. Avoid logging training stats at every single training step, as this would significantly slow down training</image:caption>
      <image:title>Visualizing the Graph and Training Curves Using TensorBoard</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_9.webp</image:loc>
      <image:caption>Visualizing the Graph and Training Curves Using TensorBoard. Next open a browser and go to http://0.0.0.0:6006/ (or http://localhost:6006/ ). Visualizing training stats using TensorBoard</image:caption>
      <image:title>Visualizing the Graph and Training Curves Using TensorBoard</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_10.webp</image:loc>
      <image:caption>Visualizing the Graph and Training Curves Using TensorBoard. Visualizing the graph using TensorBoard</image:caption>
      <image:title>Visualizing the Graph and Training Curves Using TensorBoard</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_11.webp</image:loc>
      <image:caption>Visualizing the Graph and Training Curves Using TensorBoard. Visualizing the graph using TensorBoard. If you want to take a peek at the graph directly within Jupyter, you can use the show_graph() function available in the notebook for this chapter.</image:caption>
      <image:title>Visualizing the Graph and Training Curves Using TensorBoard</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_12.webp</image:loc>
      <image:caption>Name Scopes. A collapsed namescope in TensorBoard</image:caption>
      <image:title>Name Scopes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_13.webp</image:loc>
      <image:caption>Modularity. Equation 9-1. Rectified linear unit</image:caption>
      <image:title>Modularity</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_14.webp</image:loc>
      <image:caption>Modularity. Equation 9-1. Rectified linear unit</image:caption>
      <image:title>Modularity</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_15.webp</image:loc>
      <image:caption>Modularity. Equation 9-1. Rectified linear unit</image:caption>
      <image:title>Modularity</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_16.webp</image:loc>
      <image:caption>Modularity. h , b = max · + b , 0</image:caption>
      <image:title>Modularity</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_17.webp</image:loc>
      <image:caption>Modularity. Note that when you create a node, TensorFlow checks whether its name already exists, and if it does it appends an underscore followed by an index to make the name unique. Collapsed node series</image:caption>
      <image:title>Modularity</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_18.webp</image:loc>
      <image:caption>Modularity. with tf.name_scope(&quot;relu&quot;): [..]. A clearer graph using name-scoped units</image:caption>
      <image:title>Modularity</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_19.webp</image:loc>
      <image:caption>Sharing Variables. threshold = tf.get_variable(&quot;threshold&quot;). Once reuse is set to True, it cannot be set back to False within the block.</image:caption>
      <image:title>Sharing Variables</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_20.webp</image:loc>
      <image:caption>Sharing Variables. This code first defines the relu() function, then creates the relu/threshold variable (as a scalar that will later be initialized to 0.0) and builds five ReLUs by calling the relu() function. Variables created using get_variable() are always named using the name of their variable_scope as a prefix (e.g., &quot;relu/threshold&quot;), but for all other nodes (including variables created with tf.Variable()) the variable scope acts like a new name scope.</image:caption>
      <image:title>Sharing Variables</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_21.webp</image:loc>
      <image:caption>Sharing Variables. Variables created using get_variable() are always named using the name of their variable_scope as a prefix (e.g., &quot;relu/threshold&quot;), but for all other nodes (including variables created with tf.Variable()) the variable scope acts like a new name scope. Five ReLUs sharing the threshold variable</image:caption>
      <image:title>Sharing Variables</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/ml/images/mlchapter9_22.webp</image:loc>
      <image:caption>Sharing Variables. The resulting graph is slightly different from before, since the shared variable lives within the first ReLU. Five ReLUs sharing the threshold variable</image:caption>
      <image:title>Sharing Variables</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/docker/chapter1</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter1_0.webp</image:loc>
      <image:caption>What about Kubernetes. The Kubernetes Book . This is the ultimate book for mastering Kubernetes. I update both books annually to ensure they’re up-to-date with the latest and greatest developments in the cloud native ecosystem</image:caption>
      <image:title>What about Kubernetes</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/docker/chapter10</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter10_0.webp</image:loc>
      <image:caption>Docker Model Runner Architecture. 1 shows the high-level architecture and major components. 1 - Docker Model Runner Architecture</image:caption>
      <image:title>Docker Model Runner Architecture</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter10_1.webp</image:loc>
      <image:caption>Pull models from Docker Hub. architecture, training cut-off date, model variants, and even benchmark info. However, benchmark info is from the original model publisher, and you should always perform your own testing to see how well a model performs for your specific requirements. 2 - Model info card</image:caption>
      <image:title>Pull models from Docker Hub</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter10_2.webp</image:loc>
      <image:caption>Test your model. Open the Docker Desktop UI and click the Models tab in the left navigation bar. Click the model you want to test to open a chat session and then ask it the same questions. 3 - Docker Desktop’s Model Chat Window</image:caption>
      <image:title>Test your model</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter10_3.webp</image:loc>
      <image:caption>Use Docker Model Runner with Compose. and streams responses to the frontend. 4 - Chatbot architecture</image:caption>
      <image:title>Use Docker Model Runner with Compose</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter10_4.webp</image:loc>
      <image:caption>Use Docker Model Runner with Compose. Open your browser to http://localhost:3000 and ask your chatbot some questions. 5 - Working chatbot</image:caption>
      <image:title>Use Docker Model Runner with Compose</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter10_5.webp</image:loc>
      <image:caption>Connect to Open WebUI and use it. Once you’ve created your account, you’ll be automatically logged in and will see the Open WebUI interface as shown in 6. 6 - Open WebUI interface</image:caption>
      <image:title>Connect to Open WebUI and use it</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter10_6.webp</image:loc>
      <image:caption>Connect to Open WebUI and use it. 7 shows a very brief conversation asking how far away the moon is and then the sun. 7 - Conversational history</image:caption>
      <image:title>Connect to Open WebUI and use it</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/docker/chapter11</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter11_0.webp</image:loc>
      <image:caption>11: Docker and Wasm. We built the first wave on virtual machines (VMs), the second on containers, and we’re building the third on Wasm. Each wave drives smaller, faster, and more secure workloads, and all three are working together to drive the future of cloud computing. In this chapter, you’ll write a simple Wasm application and use Docker to containerize it and run it as a container. The goal is to introduce you to Wasm and show you how easy it is to work with Docker and Wasm together</image:caption>
      <image:title>11: Docker and Wasm</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter11_1.webp</image:loc>
      <image:caption>Configure Docker Desktop for Wasm. 2 shows some of the settings. 2 - Docker Desktop Wasm settings</image:caption>
      <image:title>Configure Docker Desktop for Wasm</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter11_2.webp</image:loc>
      <image:caption>Write a Wasm app. Point your browser to http://127.0.0.1:3000/hello and make sure the app works. 3 - Wasm app running locally</image:caption>
      <image:title>Write a Wasm app</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter11_3.webp</image:loc>
      <image:caption>Containerize a Wasm app. If you look at Docker Hub, you can see it’s recognized it as a wasi/wasm image. You’ll also see there’s no vulnerability analysis data. This is because image scanning tools can’t analyze Wasm images yet. 4 - Wasm image on Docker Hub</image:caption>
      <image:title>Containerize a Wasm app</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter11_4.webp</image:loc>
      <image:caption>Run a Wasm container. Connect your browser to http://localhost:5556/hello to see the app. 5 - Wasm app running in container</image:caption>
      <image:title>Run a Wasm container</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/docker/chapter2</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter2_0.webp</image:loc>
      <image:caption>The Docker technology. 1 shows the high-level architecture. The client and engine can be on the same host or connected over the network. 1 - Docker client and engine</image:caption>
      <image:title>The Docker technology</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter2_1.webp</image:loc>
      <image:caption>The Docker technology. the engine. 2 - Docker CLI and daemon hiding complexity</image:caption>
      <image:title>The Docker technology</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/docker/chapter3</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter3_0.webp</image:loc>
      <image:caption>Installing Docker Desktop on Mac. 1 shows the high-level architecture for Docker Desktop on Mac. 1</image:caption>
      <image:title>Installing Docker Desktop on Mac</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/docker/chapter4</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter4_0.webp</image:loc>
      <image:caption>Run the app as a container. You will see the following web page. 1</image:caption>
      <image:title>Run the app as a container</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/docker/chapter5</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter5_0.webp</image:loc>
      <image:caption>Docker Engine – The TLDR. 1 shows the components of the Docker Engine that create and run containers. Other components exist, but this simplified diagram focuses on the components that start and run containers. 1</image:caption>
      <image:title>Docker Engine – The TLDR</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter5_1.webp</image:loc>
      <image:caption>Breaking up the monolithic Docker daemon. 2 shows another view of the Docker Engine components that are used to run containers and lists the primary responsibilities of each component. 2 - Engine components and responsibilities</image:caption>
      <image:title>Breaking up the monolithic Docker daemon</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter5_2.webp</image:loc>
      <image:caption>Starting a new container (example). 3 summarizes the process. 3</image:caption>
      <image:title>Starting a new container (example)</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/docker/chapter6</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_0.webp</image:loc>
      <image:caption>Intro to images. Images are build-time constructs, whereas containers are run-time constructs. 1 shows the build and run nature of each and that you can start multiple containers from a single image. 1</image:caption>
      <image:title>Intro to images</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_1.webp</image:loc>
      <image:caption>Image registries. 2 shows the central nature of registries in the build &gt; share &gt; run pipeline. 2</image:caption>
      <image:title>Image registries</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_2.webp</image:loc>
      <image:caption>Image registries. Image registries contain one or more image repositories, and image repositories contain one or more images. 3 shows an image registry with three repositories, each with one or more images. 3 - Registry architecture</image:caption>
      <image:title>Image registries</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_3.webp</image:loc>
      <image:caption>Official repositories. 4 shows the official Alpine and NGINX repositories on Docker Hub. Both have the green Docker Official Image badge and have over a billion pulls each. Also, notice how both are available for a wide range of CPU architectures. 4 - Official repos on Docker Hub</image:caption>
      <image:title>Official repositories</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_4.webp</image:loc>
      <image:caption>Image naming and tagging. including the registry name, user/organization name, repository name, and tag. Docker automatically populates the registry and tag values if you don’t specify them. 5 - Fully qualified image name</image:caption>
      <image:title>Image naming and tagging</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_5.webp</image:loc>
      <image:caption>Images and layers. 6 shows an image with four layers. Docker takes care of stacking them and representing them as a single unified image. 6 - Image and stacked layers</image:caption>
      <image:title>Images and layers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_6.webp</image:loc>
      <image:caption>Images and layers. Each line ending with Pull complete represents a layer that Docker pulled. This image has five layers and is shown in 7 with layer IDs. 7 - Image layers and IDs</image:caption>
      <image:title>Images and layers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_7.webp</image:loc>
      <image:caption>Base layers. 24:04 image. 8</image:caption>
      <image:title>Base layers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_8.webp</image:loc>
      <image:caption>Base layers. It also shows that the layers are stored as independent objects, and the image is just metadata identifying the required layers and explaining how to stack them. Figure 9</image:caption>
      <image:title>Base layers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_9.webp</image:loc>
      <image:caption>Base layers. The file in the top layer is an updated version of File 5 directly below it. In this situation, the file in the higher layer obscures the file directly below it. This means you update files and make other changes to images by adding new layers containing the changes. Figure 10 - Stacking layers</image:caption>
      <image:title>Base layers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_10.webp</image:loc>
      <image:caption>Base layers. All three layers are stacked and merged into a single unified view. Figure 11 - Unified view of multi-layer image</image:caption>
      <image:title>Base layers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_11.webp</image:loc>
      <image:caption>Sharing image layers. Figure 12 - Two images sharing a layer</image:caption>
      <image:title>Sharing image layers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_12.webp</image:loc>
      <image:caption>Multi-architecture images. Figure 13 shows how manifest lists and manifests are related. Figure 13 - Manifest lists and manifests</image:caption>
      <image:title>Multi-architecture images</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_13.webp</image:loc>
      <image:caption>Vulnerability scanning with Docker Scout. SBOM of image already cached, 66 packages indexed. Detected 1 vulnerable package with 2 vulnerabilities</image:caption>
      <image:title>Vulnerability scanning with Docker Scout</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_14.webp</image:loc>
      <image:caption>Vulnerability scanning with Docker Scout. pkg:apk/alpine/expat@2.5.0-r2?os_name=alpine&amp;os_version=3.19. HIGH CVE-2023-52425</image:caption>
      <image:title>Vulnerability scanning with Docker Scout</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter6_15.webp</image:loc>
      <image:caption>Vulnerability scanning with Docker Scout. Figure 14 shows how this looks in Docker Desktop, and you get similar integrations and views in Docker Hub. Figure 14 - Docker Scout integration with Docker Desktop</image:caption>
      <image:title>Vulnerability scanning with Docker Scout</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/docker/chapter7</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter7_0.webp</image:loc>
      <image:caption>Containers – The TLDR. Figure 1 shows multiple containers started from a single image. The shared image is read-only, but you can write to the containers. Figure 1</image:caption>
      <image:title>Containers – The TLDR</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter7_1.webp</image:loc>
      <image:caption>Containers vs VMs. Figure 2 shows the two models side by side and attempts to demonstrate the more efficient nature of containers, with the same server running 3x more containers than VMs. Figure 2</image:caption>
      <image:title>Containers vs VMs</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter7_2.webp</image:loc>
      <image:caption>Images and Containers. As shown in Figure 3, Docker accomplishes this by creating a thin read-write layer for each container and placing it on top of the shared image. Figure 3 - Container R/W layers</image:caption>
      <image:title>Images and Containers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter7_3.webp</image:loc>
      <image:caption>Starting a container. If you’re not running Docker Desktop, you may need to substitute localhost with the name or IP of the host Docker is running on. Figure 4 - Web app running in container</image:caption>
      <image:title>Starting a container</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter7_4.webp</image:loc>
      <image:caption>Debugging slim images and containers with Docker Debug. $ docker debug ddd-ctr. This is an attach shell.</image:caption>
      <image:title>Debugging slim images and containers with Docker Debug</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter7_5.webp</image:loc>
      <image:caption>Debugging slim images and containers with Docker Debug. $ docker debug nigelpoulton/ddd-book:web0.1. Note: This is a sandbox shell. Changes will not affect the actual image</image:caption>
      <image:title>Debugging slim images and containers with Docker Debug</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/docker/chapter8</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter8_0.webp</image:loc>
      <image:caption>Containerizing an app – The TLDR. Run a container from the image. You can see these five steps in Figure 1. Figure 1 - Basic flow of containerizing an app</image:caption>
      <image:title>Containerizing an app – The TLDR</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter8_1.webp</image:loc>
      <image:caption>Create the Dockerfile. CREATED: .dockerignore CREATED: Dockerfile CREATED: compose.yaml CREATED: README.Docker.md. Your Docker files are ready!</image:caption>
      <image:title>Create the Dockerfile</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter8_2.webp</image:loc>
      <image:caption>Containerize the app. Then the WORKDIR, RUN, and COPY instructions added three more layers. You can see this in Figure 2. Figure 2 - Dockerfile and image layers</image:caption>
      <image:title>Containerize the app</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter8_3.webp</image:loc>
      <image:caption>Push the image to Docker Hub. Figure 3 shows how Docker figured out where to push the image. Figure 3</image:caption>
      <image:title>Push the image to Docker Hub</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter8_4.webp</image:loc>
      <image:caption>Test the app. You should see the app as shown in Figure 4. Figure 4</image:caption>
      <image:title>Test the app</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter8_5.webp</image:loc>
      <image:caption>Looking a bit closer. Figure 5 maps the Dockerfile instructions to image layers. The bold instructions with arrows create layers; the others create metadata. The layer IDs will be different in your environment. Figure 5</image:caption>
      <image:title>Looking a bit closer</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter8_6.webp</image:loc>
      <image:caption>Moving to production with multi-stage builds. Figure 6 shows a high-level workflow. Figure 6</image:caption>
      <image:title>Moving to production with multi-stage builds</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter8_7.webp</image:loc>
      <image:caption>Buildx, BuildKit, drivers, and Build Cloud. Figure 7 shows a Docker environment configured to talk to a local and a remote builder. Figure 7 - Docker build architecture</image:caption>
      <image:title>Buildx, BuildKit, drivers, and Build Cloud</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter8_8.webp</image:loc>
      <image:caption>Multi-architecture builds. Figure 8 shows how the images for both architectures appear on Docker Hub under the same repository and tag. Figure 8 - Multi-platform image</image:caption>
      <image:title>Multi-architecture builds</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/docker/chapter9</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter9_0.webp</image:loc>
      <image:caption>The sample app. We’ll use the sample app shown in Figure 1, with two services, a network, and a volume. Figure 1 - Sample app</image:caption>
      <image:title>The sample app</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter9_1.webp</image:loc>
      <image:caption>The sample app. The compose.yaml file tells Docker how to deploy the app. Figure 2 shows the app in more detail. Figure 2 - Detailed view of sample app</image:caption>
      <image:title>The sample app</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/docker/images/chapter9_2.webp</image:loc>
      <image:caption>Deploying apps with Compose. Connect to the Docker host on port 5001 to view it. You can connect to localhost:5001 if you’re running Docker Desktop. Refresh the page a few times and watch the counter increment. This is the app counting page refreshes and storing the value on the volume in the Redis service</image:caption>
      <image:title>Deploying apps with Compose</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter1</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image2.webp</image:loc>
      <image:caption>Single server setup. A journey of a thousand miles begins with a single step, and building a complex system is no different. To understand this setup, it is helpful to investigate the request flow and traffic source. Let us first look at the request flow</image:caption>
      <image:title>Single server setup</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image3.webp</image:loc>
      <image:caption>Single server setup. To understand this setup, it is helpful to investigate the request flow and traffic source. Let us first look at the request flow. Users access websites through domain names, such as api.mysite.com. Usually, the Domain Name System (DNS) is a paid service provided by 3rd parties and not hosted by our servers</image:caption>
      <image:title>Single server setup</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image4.webp</image:loc>
      <image:caption>Single server setup. GET /users/12 – Retrieve user object for id = 12</image:caption>
      <image:title>Single server setup</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image5.webp</image:loc>
      <image:caption>Database. With the growth of the user base, one server is not enough, and we need multiple servers: one for web/mobile traffic, the other for the database. Which databases to use?</image:caption>
      <image:title>Database</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image6.webp</image:loc>
      <image:caption>Load balancer. A load balancer evenly distributes incoming traffic among web servers that are defined in a load-balanced set. The figure shows how a load balancer works. As shown, users connect to the public IP of the load balancer directly.</image:caption>
      <image:title>Load balancer</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image7.webp</image:loc>
      <image:caption>Database replication. A master database generally only supports write operations. Advantages of database replication</image:caption>
      <image:title>Database replication</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image8.webp</image:loc>
      <image:caption>Database replication. The figure shows the system design after adding the load balancer and database replication. Let us take a look at the design</image:caption>
      <image:title>Database replication</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image9.webp</image:loc>
      <image:caption>Cache tier. The cache tier is a temporary data store layer, much faster than the database. After receiving a request, a web server first checks if the cache has the available response.</image:caption>
      <image:title>Cache tier</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image10.webp</image:loc>
      <image:caption>Cache tier. Interacting with cache servers is simple because most cache servers provide APIs for common programming languages. The following code snippet shows typical Memcached APIs</image:caption>
      <image:title>Cache tier</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image11.webp</image:loc>
      <image:caption>Considerations for using cache. Mitigating failures: A single cache server represents a potential single point of failure (SPOF), defined in Wikipedia as follows: “A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working” [8]. Eviction Policy: Once the cache is full, any requests to add items to the cache might cause existing items to be removed.</image:caption>
      <image:title>Considerations for using cache</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image12.webp</image:loc>
      <image:caption>Considerations for using cache. Here is how a CDN works at a high level: when a user visits a website, a CDN server closest to the user delivers static content. The figure demonstrates the CDN workflow</image:caption>
      <image:title>Considerations for using cache</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image13.webp</image:loc>
      <image:caption>Considerations for using cache. The figure demonstrates the CDN workflow. User A tries to get image.webp by using an image URL. The URL’s domain is provided by the CDN provider. The following two image URLs are samples used to demonstrate what image URLs look like on Amazon and Akamai CDNs</image:caption>
      <image:title>Considerations for using cache</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image14.webp</image:loc>
      <image:caption>Considerations of using a CDN. The figure shows the design after the CDN and cache are added. Static assets (JS, CSS, images, etc.) are no longer served by web servers; they are fetched from the CDN for better performance</image:caption>
      <image:title>Considerations of using a CDN</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image15.webp</image:loc>
      <image:caption>Stateful architecture. The figure shows an example of a stateful architecture, in which user A’s session data and profile image are stored in Server 1.</image:caption>
      <image:title>Stateful architecture</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image16.webp</image:loc>
      <image:caption>Stateless architecture. The figure shows the stateless architecture, in which HTTP requests from users can be sent to any web server, which fetches state data from a shared data store.</image:caption>
      <image:title>Stateless architecture</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image17.webp</image:loc>
      <image:caption>Stateless architecture. The figure shows the updated design with a stateless web tier. Here, we move the session data out of the web tier and store it in the persistent data store.</image:caption>
      <image:title>Stateless architecture</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image18.webp</image:loc>
      <image:caption>Data centers. The figure shows an example setup with two data centers. In the event of any significant data center outage, we direct all traffic to a healthy data center. Here, data center 2 (US-West) is offline, and 100% of the traffic is routed to data center 1 (US-East)</image:caption>
      <image:title>Data centers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image19.webp</image:loc>
      <image:caption>Data centers. In the event of any significant data center outage, we direct all traffic to a healthy data center. Here, data center 2 (US-West) is offline, and 100% of the traffic is routed to data center 1 (US-East). Several technical challenges must be resolved to achieve a multi-data-center setup</image:caption>
      <image:title>Data centers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image20.webp</image:loc>
      <image:caption>Message queue. A message queue is a durable component, stored in memory, that supports asynchronous communication. Decoupling makes the message queue a preferred architecture for building a scalable and reliable application.</image:caption>
      <image:title>Message queue</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image21.webp</image:loc>
      <image:caption>Message queue. However, if the queue is empty most of the time, the number of workers can be reduced</image:caption>
      <image:title>Message queue</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image22.webp</image:loc>
      <image:caption>Adding message queues and different tools. Logging, monitoring, metrics, and automation tools are included. As the data grows every day, your database gets more overloaded. It is time to scale the data tier</image:caption>
      <image:title>Adding message queues and different tools</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image23.webp</image:loc>
      <image:caption>Horizontal scaling. Horizontal scaling, also known as sharding, is the practice of adding more servers. Figure 20 compares vertical scaling with horizontal scaling. Sharding separates large databases into smaller, more easily managed parts called shards. Each shard shares the same schema, though the actual data on each shard is unique to the shard</image:caption>
      <image:title>Horizontal scaling</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image24.webp</image:loc>
      <image:caption>Horizontal scaling. If the result equals 0, shard 0 is used to store and fetch data. If the result equals 1, shard 1 is used. The same logic applies to other shards. The figure shows the user table in sharded databases</image:caption>
      <image:title>Horizontal scaling</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image25.webp</image:loc>
      <image:caption>Horizontal scaling. The figure shows the user table in sharded databases. The most important factor to consider when implementing a sharding strategy is the choice of the sharding key.</image:caption>
      <image:title>Horizontal scaling</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter1/image26.webp</image:loc>
      <image:caption>Horizontal scaling. Here, we shard databases to support rapidly increasing data traffic. At the same time, some of the non-relational functionalities are moved to a NoSQL data store to reduce the database load. Here is an article that covers many use cases of NoSQL [14]</image:caption>
      <image:title>Horizontal scaling</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter10</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image122.webp</image:loc>
      <image:caption>CHAPTER 10: DESIGN A NOTIFICATION SYSTEM. A notification is more than just a mobile push notification. The three notification formats are: mobile push notification, SMS message, and email. The figure shows an example of each of these notifications</image:caption>
      <image:title>CHAPTER 10: DESIGN A NOTIFICATION SYSTEM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image123.webp</image:loc>
      <image:caption>iOS push notification. We start by looking at how each notification type works at a high level. We primarily need three components to send an iOS push notification</image:caption>
      <image:title>iOS push notification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image124.webp</image:loc>
      <image:caption>iOS push notification. Payload: This is a JSON dictionary that contains a notification’s payload. Here is an example. APNS: This is a remote service provided by Apple to propagate push notifications to iOS devices</image:caption>
      <image:title>iOS push notification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image125.webp</image:loc>
      <image:caption>Android push notification. Android adopts a similar notification flow. Instead of using APNs, Firebase Cloud Messaging (FCM) is commonly used to send push notifications to Android devices</image:caption>
      <image:title>Android push notification</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image126.webp</image:loc>
      <image:caption>SMS message. For SMS messages, third party SMS services like Twilio [1], Nexmo [2], and many others are commonly used. Most of them are commercial services</image:caption>
      <image:title>SMS message</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image127.webp</image:loc>
      <image:caption>Email. Although companies can set up their own email servers, many of them opt for commercial email services. Sendgrid [3] and Mailchimp [4] are among the most popular email services, offering a better delivery rate and data analytics. The figure shows the design after including all the third-party services</image:caption>
      <image:title>Email</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image128.webp</image:loc>
      <image:caption>Email. The figure shows the design after including all the third-party services</image:caption>
      <image:title>Email</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image129.webp</image:loc>
      <image:caption>Contact info gathering flow. To send notifications, we need to gather mobile device tokens, phone numbers, or email addresses. As shown in the figure, when a user installs our app or signs up for the first time, API servers collect user contact info and store it in the database. The next figure shows simplified database tables to store contact info. Email addresses and phone numbers are stored in the user table, whereas device tokens are stored in the device table.</image:caption>
      <image:title>Contact info gathering flow</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image130.webp</image:loc>
      <image:caption>Contact info gathering flow. A user can have multiple devices, meaning a push notification can be sent to all of the user’s devices</image:caption>
      <image:title>Contact info gathering flow</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image131.webp</image:loc>
      <image:caption>High-level design. The figure shows the design, and each system component is explained below. Service 1 to N: a service can be a micro-service, a cron job, or a distributed system that triggers notification-sending events.</image:caption>
      <image:title>High-level design</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image132.webp</image:loc>
      <image:caption>High-level design. Introduce message queues to decouple the system components. The figure shows the improved high-level design. The best way to go through the diagram is from left to right</image:caption>
      <image:title>High-level design</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image133.webp</image:loc>
      <image:caption>Request body. Put notification data into message queues for parallel processing. Here is an example of the API to send an email. Cache: user info, device info, and notification templates are cached</image:caption>
      <image:title>Request body</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image134.webp</image:loc>
      <image:caption>Reliability. One of the most important requirements in a notification system is that it cannot lose data. Will recipients receive a notification exactly once?</image:caption>
      <image:title>Reliability</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image135.webp</image:loc>
      <image:caption>Monitor queued notifications. A key metric to monitor is the total number of queued notifications.</image:caption>
      <image:title>Monitor queued notifications</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image136.webp</image:loc>
      <image:caption>Events tracking. Notification metrics, such as open rate, click rate, and engagement are important in understanding customer behaviors.</image:caption>
      <image:title>Events tracking</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter10/image137.webp</image:loc>
      <image:caption>Updated design. Putting everything together, the figure shows the updated notification system design. In this design, many new components are added in comparison with the previous design</image:caption>
      <image:title>Updated design</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter11</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter11/image138.webp</image:loc>
      <image:caption>CHAPTER 11: DESIGN A NEWS FEED SYSTEM. Instagram feed, Twitter timeline, etc</image:caption>
      <image:title>CHAPTER 11: DESIGN A NEWS FEED SYSTEM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter11/image139.webp</image:loc>
      <image:caption>Feed publishing. The figure shows the high-level design of the feed publishing flow. User: a user can view news feeds on a browser or mobile app. A user makes a post with content “Hello” through the API</image:caption>
      <image:title>Feed publishing</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter11/image140.webp</image:loc>
      <image:caption>Newsfeed building. In this section, we discuss how the news feed is built behind the scenes. The figure shows the high-level design. User: a user sends a request to retrieve her news feed. The request looks like this</image:caption>
      <image:title>Newsfeed building</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter11/image141.webp</image:loc>
      <image:caption>Feed publishing deep dive. The figure outlines the detailed design for feed publishing. We have discussed most of the components in the high-level design; here we focus on two components: web servers and the fanout service</image:caption>
      <image:title>Feed publishing deep dive</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter11/image142.webp</image:loc>
      <image:caption>Fanout service. Let us take a close look at the fanout service, as shown in the figure. The fanout service works as follows</image:caption>
      <image:title>Fanout service</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter11/image143.webp</image:loc>
      <image:caption>Fanout service. Store &lt;post_id, user_id&gt; in the news feed cache. The figure shows an example of what the news feed looks like in the cache</image:caption>
      <image:title>Fanout service</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter11/image144.webp</image:loc>
      <image:caption>Newsfeed retrieval deep dive. The figure illustrates the detailed design for news feed retrieval. As shown, media content (images, videos, etc.) is stored in a CDN for fast retrieval. Let us look at how a client retrieves the news feed</image:caption>
      <image:title>Newsfeed retrieval deep dive</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter11/image145.webp</image:loc>
      <image:caption>Cache architecture. Cache is extremely important for a news feed system. We divide the cache tier into five layers, as shown in the figure. News Feed: it stores IDs of news feeds</image:caption>
      <image:title>Cache architecture</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter12</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image146.webp</image:loc>
      <image:caption>CHAPTER 12: DESIGN A CHAT SYSTEM. In this chapter, we explore the design of a chat system. Almost everyone uses a chat app. The figure shows some of the most popular apps in the marketplace. A chat app performs different functions for different people.</image:caption>
      <image:title>CHAPTER 12: DESIGN A CHAT SYSTEM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image147.webp</image:loc>
      <image:caption>Step 2 - Propose high-level design and get buy-in. The figure shows the relationships between clients (sender and receiver) and the chat service. When a client intends to start a chat, it connects to the chat service using one or more network protocols. For a chat service, the choice of network protocols is important. Let us discuss this with the interviewer</image:caption>
      <image:title>Step 2 - Propose high-level design and get buy-in</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image148.webp</image:loc>
      <image:caption>Polling. As shown in the figure, polling is a technique in which the client periodically asks the server if there are messages available.</image:caption>
      <image:title>Polling</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image149.webp</image:loc>
      <image:caption>Long polling. Because polling could be inefficient, the next progression is long polling. In long polling, a client holds the connection open until there are actually new messages available or a timeout threshold has been reached.</image:caption>
      <image:title>Long polling</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image150.webp</image:loc>
      <image:caption>WebSocket. WebSocket is the most common solution for sending asynchronous updates from server to client. The figure shows how it works: the WebSocket connection is initiated by the client.</image:caption>
      <image:title>WebSocket</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image151.webp</image:loc>
      <image:caption>WebSocket. Earlier we said that on the sender side HTTP is a fine protocol to use, but since WebSocket is bidirectional, there is no strong technical reason not to use it for sending as well. The figure shows how WebSocket (ws) is used on both the sender and receiver sides. Using WebSocket for both sending and receiving simplifies the design and makes implementation on both client and server more straightforward.</image:caption>
      <image:title>WebSocket</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image152.webp</image:loc>
      <image:caption>High-level design. As shown in the figure, the chat system is broken down into three major categories: stateless services, stateful services, and third-party integration.</image:caption>
      <image:title>High-level design</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image153.webp</image:loc>
      <image:caption>Scalability. However, it is perfectly fine to start with a single-server design; just make sure the interviewer knows this is a starting point. Putting everything we mentioned together, the figure shows the adjusted high-level design, in which the client maintains a persistent WebSocket connection to a chat server for real-time messaging.</image:caption>
      <image:title>Scalability</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image154.webp</image:loc>
      <image:caption>Message table for 1 on 1 chat. The figure shows the message table for 1 on 1 chat. The primary key is message_id, which helps to decide the message sequence. We cannot rely on created_at to decide the message sequence because two messages can be created at the same time.</image:caption>
      <image:title>Message table for 1 on 1 chat</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image155.webp</image:loc>
      <image:caption>Message table for group chat. The figure shows the message table for group chat. The composite primary key is (channel_id, message_id). Channel and group have the same meaning here. channel_id is the partition key because all queries in a group chat operate within a channel.</image:caption>
      <image:title>Message table for group chat</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image156.webp</image:loc>
      <image:caption>Service discovery. The figure shows how service discovery (Zookeeper) works: User A tries to log in to the app.</image:caption>
      <image:title>Service discovery</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image157.webp</image:loc>
      <image:caption>1 on 1 chat flow. The figure explains what happens when User A sends a message to User B: User A sends a chat message to Chat server 1.</image:caption>
      <image:title>1 on 1 chat flow</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image158.webp</image:loc>
      <image:caption>Message synchronization across multiple devices. Many users have multiple devices. We will explain how to sync messages across multiple devices. The figure shows an example of message synchronization, in which User A has two devices: a phone and a laptop. When User A logs in to the chat app with her phone, it establishes a WebSocket connection with Chat server 1. Similarly, there is a connection between the laptop and Chat server 1.</image:caption>
      <image:title>Message synchronization across multiple devices</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image159.webp</image:loc>
      <image:caption>Small group chat flow. In comparison to one-on-one chat, the logic of group chat is more complicated. Figures 12-14 and 12-15 explain the flow. The figure explains what happens when User A sends a message in a group chat.</image:caption>
      <image:title>Small group chat flow</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image160.webp</image:loc>
      <image:caption>Small group chat flow. On the recipient side, a recipient can receive messages from multiple users. Each recipient has an inbox (message sync queue) which contains messages from different senders. The figure illustrates the design.</image:caption>
      <image:title>Small group chat flow</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image161.webp</image:loc>
      <image:caption>User login. The user login flow is explained in the “Service Discovery” section.</image:caption>
      <image:title>User login</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image162.webp</image:loc>
      <image:caption>User logout. When a user logs out, it goes through the user logout flow as shown in the figure. The online status is changed to offline in the KV store, and the presence indicator shows the user is offline.</image:caption>
      <image:title>User logout</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image163.webp</image:loc>
      <image:caption>User disconnection. In the figure, the client sends a heartbeat event to the server every 5 seconds.</image:caption>
      <image:title>User disconnection</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter12/image164.webp</image:loc>
      <image:caption>Online status fanout. How do User A’s friends know about the status changes? The figure explains how it works. This design is effective for a small user group.</image:caption>
      <image:title>Online status fanout</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter13</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image165.webp</image:loc>
      <image:caption>CHAPTER 13: DESIGN A SEARCH AUTOCOMPLETE SYSTEM. When searching on Google or shopping at Amazon, as you type in the search box, one or more matches for the search term are presented to you.</image:caption>
      <image:title>CHAPTER 13: DESIGN A SEARCH AUTOCOMPLETE SYSTEM</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image166.webp</image:loc>
      <image:caption>Data gathering service. Let us use a simplified example to see how the data gathering service works.</image:caption>
      <image:title>Data gathering service</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image167.webp</image:loc>
      <image:caption>Query service. Frequency: it represents the number of times a query has been searched. When a user types “tw” in the search box, the top 5 searched queries are displayed, assuming the frequency table is based on Table 13-1.</image:caption>
      <image:title>Query service</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image168.webp</image:loc>
      <image:caption>Query service. When a user types “tw” in the search box, the top 5 searched queries are displayed, assuming the frequency table is based on Table 13-1. To get the top 5 frequently searched queries, execute the following SQL query.</image:caption>
      <image:title>Query service</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image169.webp</image:loc>
      <image:caption>Query service. To get the top 5 frequently searched queries, execute the following SQL query. This is an acceptable solution when the data set is small. When it is large, accessing the database becomes a bottleneck. We will explore optimizations in the deep dive.</image:caption>
      <image:title>Query service</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image170.webp</image:loc>
      <image:caption>Trie data structure. The figure shows a trie with search queries “tree”, “try”, “true”, “toy”, “wish”, and “win”. Search queries are highlighted with a thicker border. A basic trie stores characters in its nodes. To support sorting by frequency, frequency info needs to be included in the nodes. Assume we have the following frequency table.</image:caption>
      <image:title>Trie data structure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image171.webp</image:loc>
      <image:caption>Trie data structure. A basic trie stores characters in its nodes. To support sorting by frequency, frequency info needs to be included in the nodes. Assume we have the following frequency table. After adding frequency info to the nodes, the updated trie is shown in the figure.</image:caption>
      <image:title>Trie data structure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image172.webp</image:loc>
      <image:caption>Trie data structure. After adding frequency info to the nodes, the updated trie is shown in the figure. How does autocomplete work with a trie? Before diving into the algorithm, let us define some terms.</image:caption>
      <image:title>Trie data structure</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image173.webp</image:loc>
      <image:caption>c: number of children of a given node. Step 3: Sort the children and get the top 2. [true: 35] and [try: 29] are the top 2 queries with prefix “tr”. The time complexity of this algorithm is the sum of the time spent on each step above.</image:caption>
      <image:title>c: number of children of a given node</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image174.webp</image:loc>
      <image:caption>Cache top search queries at each node. The figure shows the updated trie data structure: the top 5 queries are stored on each node. For example, the node with prefix “be” stores the following: [best: 35, bet: 29, bee: 20, be: 15, beer: 10]. Let us revisit the time complexity of the algorithm after applying these two optimizations.</image:caption>
      <image:title>Cache top search queries at each node</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image175.webp</image:loc>
      <image:caption>Data gathering service. The figure shows the redesigned data gathering service. Each component is examined one by one. Analytics Logs: it stores raw data about search queries. Logs are append-only and are not indexed. Table 13-3 shows an example of the log file.</image:caption>
      <image:title>Data gathering service</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image176.webp</image:loc>
      <image:caption>Data gathering service. Analytics Logs. It stores raw data about search queries. Logs are append-only and are not indexed. Table 13-3 shows an example of the log file. Aggregators. The size of analytics logs is usually very large, and data is not in the right format. We need to aggregate data so it can be easily processed by our system</image:caption>
      <image:title>Data gathering service</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image177.webp</image:loc>
      <image:caption>Data gathering service. Table 13-4 shows an example of aggregated weekly data. The “time” field represents the start time of a week. The “frequency” field is the sum of the occurrences of the corresponding query in that week. Workers: workers are a set of servers that perform asynchronous jobs at regular intervals. They build the trie data structure and store it in the Trie DB.</image:caption>
      <image:title>Data gathering service</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image178.webp</image:loc>
      <image:caption>Data gathering service. Data on each trie node is mapped to a value in a hash table. The figure shows the mapping between the trie and the hash table, in which each trie node on the left is mapped to the &lt;key, value&gt; pair on the right. If you are unclear how key-value stores work, refer to Chapter 6: Design a key-value store.</image:caption>
      <image:title>Data gathering service</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image179.webp</image:loc>
      <image:caption>Query service. In the high-level design, the query service calls the database directly to fetch the top 5 results. Because that design is inefficient, the figure shows the improved design. A search query is sent to the load balancer.</image:caption>
      <image:title>Query service</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image180.webp</image:loc>
      <image:caption>Query service. The results are cached in the browser for 1 hour. Please note: “private” in cache-control means the results are intended for a single user and must not be cached by a shared cache. “max-age=3600” means the cache is valid for 3600 seconds, i.e., an hour. Data sampling: for a large-scale system, logging every search query requires a lot of processing power and storage, so data sampling is important. For instance, only 1 out of every N requests is logged by the system.</image:caption>
      <image:title>Query service</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image181.webp</image:loc>
      <image:caption>Update. Option 2: Update individual trie node directly.</image:caption>
      <image:title>Update</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image182.webp</image:loc>
      <image:caption>Delete. We have to remove hateful, violent, sexually explicit, or dangerous autocomplete suggestions.</image:caption>
      <image:title>Delete</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter13/image183.webp</image:loc>
      <image:caption>Scale the storage. To mitigate the data imbalance problem, we analyze historical data distribution patterns and apply smarter sharding logic, as shown in the figure.</image:caption>
      <image:title>Scale the storage</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter14</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image184.webp</image:loc>
      <image:caption>CHAPTER 14: DESIGN YOUTUBE. In this chapter, you are asked to design YouTube. The solution to this question can be applied to other interview questions, such as designing a video sharing platform like Netflix or Hulu. The figure shows the YouTube homepage. YouTube looks simple: content creators upload videos and viewers click play.</image:caption>
      <image:title>CHAPTER 14: DESIGN YOUTUBE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image185.webp</image:loc>
      <image:caption>Total daily storage space needed: 5 million * 10% * 300 MB = 150 TB. From the rough cost estimation, we know serving videos from the CDN costs a lot of money.</image:caption>
      <image:title>Total daily storage space needed: 5 million * 10% * 300 MB = 150 TB</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image186.webp</image:loc>
      <image:caption>Step 2 - Propose high-level design and get buy-in. At a high level, the system comprises three components. Client: you can watch YouTube on your computer, mobile phone, or smart TV.</image:caption>
      <image:title>Step 2 - Propose high-level design and get buy-in</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image187.webp</image:loc>
      <image:caption>Video uploading flow. The figure shows the high-level design for video uploading. It consists of the following components.</image:caption>
      <image:title>Video uploading flow</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image188.webp</image:loc>
      <image:caption>Flow a: upload the actual video. Update video metadata: metadata contains information about the video URL, size, resolution, format, user info, etc. The figure shows how to upload the actual video; the explanation is shown below.</image:caption>
      <image:title>Flow a: upload the actual video</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image189.webp</image:loc>
      <image:caption>Flow b: update the metadata. While a file is being uploaded to the original storage, the client in parallel sends a request to update the video metadata, as shown in the figure.</image:caption>
      <image:title>Flow b: update the metadata</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image190.webp</image:loc>
      <image:caption>Video streaming flow. Videos are streamed from the CDN directly. The edge server closest to you delivers the video, so there is very little latency. The figure shows the high-level design for video streaming.</image:caption>
      <image:title>Video streaming flow</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image191.webp</image:loc>
      <image:caption>Directed acyclic graph (DAG) model. To support different video processing pipelines and maintain high parallelism, it is important to add some level of abstraction and let client programmers define which tasks to execute. In the figure, the original video is split into video, audio, and metadata. Here are some of the tasks that can be applied to a video file.</image:caption>
      <image:title>Directed acyclic graph (DAG) model</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image192.webp</image:loc>
      <image:caption>Directed acyclic graph (DAG) model. Watermark: an image overlay on top of your video that contains identifying information about your video.</image:caption>
      <image:title>Directed acyclic graph (DAG) model</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image193.webp</image:loc>
      <image:caption>Video transcoding architecture. The proposed video transcoding architecture, which leverages cloud services, is shown in the figure. The architecture has six main components: preprocessor, DAG scheduler, resource manager, task workers, temporary storage, and encoded video as the output. Let us take a close look at each component.</image:caption>
      <image:title>Video transcoding architecture</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image194.webp</image:loc>
      <image:caption>Preprocessor. The architecture has six main components: preprocessor, DAG scheduler, resource manager, task workers, temporary storage, and encoded video as the output. Let us take a close look at each component. The preprocessor has 4 responsibilities</image:caption>
      <image:title>Preprocessor</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image195.webp</image:loc>
      <image:caption>Preprocessor. DAG generation: the preprocessor generates the DAG based on configuration files that client programmers write. The figure is a simplified DAG representation with 2 nodes and 1 edge, generated from the two configuration files below.</image:caption>
      <image:title>Preprocessor</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image196.webp</image:loc>
      <image:caption>Preprocessor. This DAG representation is generated from the two configuration files below. Cache data. The preprocessor is a cache for segmented videos. For better reliability, the preprocessor stores GOPs and metadata in temporary storage. If video encoding fails, the system could use persisted data for retry operations</image:caption>
      <image:title>Preprocessor</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image197.webp</image:loc>
      <image:caption>DAG scheduler. Cache data: the preprocessor is a cache for segmented videos. For better reliability, the preprocessor stores GOPs and metadata in temporary storage. If video encoding fails, the system can use the persisted data for retry operations. The DAG scheduler splits a DAG graph into stages of tasks and puts them in the task queue in the resource manager. The figure shows an example of how the DAG scheduler works.</image:caption>
      <image:title>DAG scheduler</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image198.webp</image:loc>
      <image:caption>DAG scheduler. The DAG scheduler splits a DAG graph into stages of tasks and puts them in the task queue in the resource manager. The figure shows an example of how the DAG scheduler works: the original video is split into three stages. Stage 1: video, audio, and metadata.</image:caption>
      <image:title>DAG scheduler</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image199.webp</image:loc>
      <image:caption>Resource manager. As shown in the figure, the original video is split into three stages. Stage 1: video, audio, and metadata. The resource manager is responsible for managing the efficiency of resource allocation. It contains 3 queues and a task scheduler, as shown in the figure.</image:caption>
      <image:title>Resource manager</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image200.webp</image:loc>
      <image:caption>Resource manager. Task scheduler: It picks the optimal task/worker, and instructs the chosen task worker to execute the job. The resource manager works as follows</image:caption>
      <image:title>Resource manager</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image201.webp</image:loc>
      <image:caption>Task workers. The task scheduler removes the job from the running queue once the job is done. Task workers run the tasks which are defined in the DAG. Different task workers may run different tasks as shown in</image:caption>
      <image:title>Task workers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image202.webp</image:loc>
      <image:caption>Task workers. Task workers run the tasks which are defined in the DAG. Different task workers may run different tasks as shown in</image:caption>
      <image:title>Task workers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image203.webp</image:loc>
      <image:caption>Temporary storage. Task workers run the tasks which are defined in the DAG. Different task workers may run different tasks, as shown in the figure. Multiple storage systems are used here. The choice of storage system depends on factors like data type, data size, access frequency, data life span, etc. For instance, metadata is frequently</image:caption>
      <image:title>Temporary storage</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image204.webp</image:loc>
      <image:caption>Encoded video. accessed by workers, and the data size is usually small. Thus, caching metadata in memory is a good idea. For video or audio data, we put them in blob storage. Data in temporary storage is freed up once the corresponding video processing is complete. Encoded video is the final output of the encoding pipeline. Here is an example of the output: funny_720p.mp4</image:caption>
      <image:title>Encoded video</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image205.webp</image:loc>
      <image:caption>Speed optimization: parallelize video uploading. Uploading a video as a whole unit is inefficient. We can split a video into smaller chunks by GOP alignment, as shown in the figure. This allows fast, resumable uploads when a previous upload fails. The job of splitting a video file by GOP can be implemented by the client to improve upload speed.</image:caption>
      <image:title>Speed optimization: parallelize video uploading</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image206.webp</image:loc>
      <image:caption>Speed optimization: parallelize video uploading. This allows fast, resumable uploads when a previous upload fails. The job of splitting a video file by GOP can be implemented by the client to improve upload speed, as shown in the figure.</image:caption>
      <image:title>Speed optimization: parallelize video uploading</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image207.webp</image:loc>
      <image:caption>Speed optimization: place upload centers close to users. Upload centers are set up across the globe: people in the United States can upload videos to the North America upload center, and people in China can upload videos to the Asian upload center. To achieve this, we use CDNs as upload centers.</image:caption>
      <image:title>Speed optimization: place upload centers close to users</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image208.webp</image:loc>
      <image:caption>Speed optimization: parallelism everywhere. Our design needs some modifications to achieve high parallelism. To make the system more loosely coupled, we introduce message queues, as shown in the figure.</image:caption>
      <image:title>Speed optimization: parallelism everywhere</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image209.webp</image:loc>
      <image:caption>Speed optimization: parallelism everywhere. After the message queue is introduced, the encoding module does not need to wait for the output of the download module anymore. If there are events in the message queue, the encoding module can execute those jobs in parallel</image:caption>
      <image:title>Speed optimization: parallelism everywhere</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image210.webp</image:loc>
      <image:caption>Safety optimization: pre-signed upload URL. Safety is one of the most important aspects of any product. To ensure only authorized users upload videos to the right location, we introduce pre-signed URLs, as shown in the figure. The upload flow is updated as follows.</image:caption>
      <image:title>Safety optimization: pre-signed upload URL</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter14/image211.webp</image:loc>
      <image:caption>Cost-saving optimization. For less popular content, we may not need to store many encoded video versions; short videos can be encoded on demand.</image:caption>
      <image:title>Cost-saving optimization</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter15</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image212.webp</image:loc>
      <image:caption>CHAPTER 15: DESIGN GOOGLE DRIVE. Let us take a moment to understand Google Drive before jumping into the design.</image:caption>
      <image:title>CHAPTER 15: DESIGN GOOGLE DRIVE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image213.webp</image:loc>
      <image:caption>CHAPTER 15: DESIGN GOOGLE DRIVE. Let us take a moment to understand Google Drive before jumping into the design.</image:caption>
      <image:title>CHAPTER 15: DESIGN GOOGLE DRIVE</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image214.webp</image:loc>
      <image:caption>Step 2 - Propose high-level design and get buy-in. The figure shows an example of how the /drive directory looks on the left side and its expanded view on the right side.</image:caption>
      <image:title>Step 2 - Propose high-level design and get buy-in</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image215.webp</image:loc>
      <image:caption>Move away from single server. As more files are uploaded, you eventually get the space-full alert shown in the figure. Only 10 MB of storage space is left! This is an emergency, as users cannot upload files anymore. The first solution that comes to mind is to shard the data so it is stored on multiple storage servers. The figure shows an example of sharding based on user_id.</image:caption>
      <image:title>Move away from single server</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image216.webp</image:loc>
      <image:caption>Move away from single server. Only 10 MB of storage space is left! This is an emergency, as users cannot upload files anymore. The first solution that comes to mind is to shard the data so it is stored on multiple storage servers. The figure shows an example of sharding based on user_id. You pull an all-nighter to set up database sharding and monitor it closely.</image:caption>
      <image:title>Move away from single server</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image217.webp</image:loc>
      <image:caption>Move away from single server. Redundant files are stored in multiple regions to guard against data loss and ensure availability. A bucket is like a folder in file systems. After putting files in S3, you can finally have a good night&apos;s sleep without worrying about data loss. To stop similar problems from happening in the future, you decide to research areas you can improve. Here are a few areas you find.</image:caption>
      <image:title>Move away from single server</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image218.webp</image:loc>
      <image:caption>Move away from single server. After applying the above improvements, you have successfully decoupled web servers, the metadata database, and file storage from a single server. The updated design is shown in the figure.</image:caption>
      <image:title>Move away from single server</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image219.webp</image:loc>
      <image:caption>Sync conflicts. For a large storage system like Google Drive, sync conflicts happen from time to time. In the figure, user 1 and user 2 try to update the same file at the same time, but user 1’s file is processed by our system first.</image:caption>
      <image:title>Sync conflicts</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image220.webp</image:loc>
      <image:caption>Sync conflicts. In the figure, user 1 and user 2 try to update the same file at the same time, but user 1’s file is processed by our system first. When multiple users are editing the same document at the same time, it is challenging to keep the document synchronized. Interested readers should refer to the reference materials [4] [5].</image:caption>
      <image:title>Sync conflicts</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image221.webp</image:loc>
      <image:caption>High-level design. The figure illustrates the proposed high-level design. Let us examine each component of the system. User: A user uses the application either through a browser or mobile app.</image:caption>
      <image:title>High-level design</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image222.webp</image:loc>
      <image:caption>Block servers. The figure shows how a block server works when a new file is added. A file is split into smaller blocks.</image:caption>
      <image:title>Block servers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image223.webp</image:loc>
      <image:caption>Block servers. The figure illustrates delta sync, meaning only modified blocks are transferred to cloud storage. The highlighted blocks “block 2” and “block 5” represent changed blocks. Using delta sync, only those two blocks are uploaded to cloud storage. Block servers allow us to save network traffic by providing delta sync and compression.</image:caption>
      <image:title>Block servers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image224.webp</image:loc>
      <image:caption>Metadata database. The figure shows the database schema design. Please note this is a highly simplified version, as it only includes the most important tables and interesting fields. User: The user table contains basic information about the user such as username, email, profile photo, etc.</image:caption>
      <image:title>Metadata database</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image225.webp</image:loc>
      <image:caption>Upload flow. Let us discuss what happens when a client uploads a file. To better understand the flow, we draw the sequence diagram shown in the figure. In it, two requests are sent in parallel: add file metadata and upload the file to cloud storage. Both requests originate from client 1.</image:caption>
      <image:title>Upload flow</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter15/image226.webp</image:loc>
      <image:caption>Download flow. Once a client knows a file has changed, it first requests metadata via API servers, then downloads blocks to reconstruct the file. The figure shows the detailed flow. Note that only the most important components are shown in the diagram due to space constraints. The notification service informs client 2 that a file has changed somewhere else.</image:caption>
      <image:title>Download flow</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter2</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter2/image27.webp</image:loc>
      <image:caption>Power of two. Although data volume can become enormous when dealing with distributed systems, the calculations all boil down to the basics.</image:caption>
      <image:title>Power of two</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter2/image28.webp</image:loc>
      <image:caption>Latency numbers every programmer should know.</image:caption>
      <image:title>Latency numbers every programmer should know</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter2/image29.webp</image:loc>
      <image:caption>1 µs = 10^-6 seconds = 1,000 ns. A Google software engineer built a tool to visualize Dr. Dean’s numbers. The tool also takes the time factor into consideration. Figure 2-1 shows the visualized latency numbers as of 2020 (source of figures: reference material [3]). By analyzing the numbers in the figure, we get the following conclusions.</image:caption>
      <image:title>1 µs = 10^-6 seconds = 1,000 ns</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter2/image30.webp</image:loc>
      <image:caption>Availability numbers. A service level agreement (SLA) is a commonly used term for service providers. Example: Estimate Twitter QPS and storage requirements</image:caption>
      <image:title>Availability numbers</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter3</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter3/image31.webp</image:loc>
      <image:caption>Example. The figures present high-level designs for the feed publishing and news feed building flows, respectively.</image:caption>
      <image:title>Example</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter3/image32.webp</image:loc>
      <image:caption>Example. The figures present high-level designs for the feed publishing and news feed building flows, respectively.</image:caption>
      <image:title>Example</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter3/image33.webp</image:loc>
      <image:caption>News feed retrieval. The figures show the detailed design for the two use cases, which will be explained in detail in Chapter 11.</image:caption>
      <image:title>News feed retrieval</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter3/image34.webp</image:loc>
      <image:caption>News feed retrieval. The figures show the detailed design for the two use cases, which will be explained in detail in Chapter 11.</image:caption>
      <image:title>News feed retrieval</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter4</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image35.webp</image:loc>
      <image:caption>Step 2 - Propose high-level design and get buy-in. Server-side implementation: the figure shows a rate limiter placed on the server side. Besides the client-side and server-side implementations, there is an alternative way. Instead of putting a rate limiter at the API servers, we create a rate limiter middleware, which throttles requests to your APIs as shown in the figure.</image:caption>
      <image:title>Step 2 - Propose high-level design and get buy-in</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image36.webp</image:loc>
      <image:caption>Step 2 - Propose high-level design and get buy-in. Besides the client-side and server-side implementations, there is an alternative way. Instead of putting a rate limiter at the API servers, we create a rate limiter middleware, which throttles requests to your APIs as shown in the figure. Let us use an example to illustrate how rate limiting works in this design.</image:caption>
      <image:title>Step 2 - Propose high-level design and get buy-in</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image37.webp</image:loc>
      <image:caption>Step 2 - Propose high-level design and get buy-in. Let us use an example to illustrate how rate limiting works in this design. Cloud microservices [4] have become widely popular, and rate limiting is usually implemented within a component called an API gateway.</image:caption>
      <image:title>Step 2 - Propose high-level design and get buy-in</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image38.webp</image:loc>
      <image:caption>Token bucket algorithm. A token bucket is a container with a pre-defined capacity. Each request consumes one token. When a request arrives, we check if there are enough tokens in the bucket. The figure explains how it works.</image:caption>
      <image:title>Token bucket algorithm</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image39.webp</image:loc>
      <image:caption>Token bucket algorithm. If there are not enough tokens, the request is dropped. The figure illustrates how token consumption, refill, and rate limiting logic work. In this example, the token bucket size is 4, and the refill rate is 4 per minute.</image:caption>
      <image:title>Token bucket algorithm</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image40.webp</image:loc>
      <image:caption>Token bucket algorithm. The figure illustrates how token consumption, refill, and rate limiting logic work. In this example, the token bucket size is 4, and the refill rate is 4 per minute. The token bucket algorithm takes two parameters.</image:caption>
      <image:title>Token bucket algorithm</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image41.webp</image:loc>
      <image:caption>Leaking bucket algorithm. Requests are pulled from the queue and processed at regular intervals. The figure explains how the algorithm works. The leaking bucket algorithm takes the following two parameters.</image:caption>
      <image:title>Leaking bucket algorithm</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image42.webp</image:loc>
      <image:caption>Fixed window counter algorithm. Let us use a concrete example to see how it works. In the figure, the time unit is 1 second, and the system allows a maximum of 3 requests per second. In each one-second window, if more than 3 requests are received, the extra requests are dropped as shown in the figure. A major problem with this algorithm is that a burst of traffic at the edges of time windows could allow more requests than the allowed quota to go through. Consider the following case.</image:caption>
      <image:title>Fixed window counter algorithm</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image43.webp</image:loc>
      <image:caption>Fixed window counter algorithm. A major problem with this algorithm is that a burst of traffic at the edges of time windows could allow more requests than the allowed quota to go through. Consider the following case: in the figure, the system allows a maximum of 5 requests per minute, and the available quota resets at the human-friendly round minute.</image:caption>
      <image:title>Fixed window counter algorithm</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image44.webp</image:loc>
      <image:caption>Sliding window log algorithm. We explain the algorithm with an example as shown in the figure. In this example, the rate limiter allows 2 requests per minute. Usually, Unix timestamps are stored in the log; however, a human-readable representation of time is used in our example for better readability.</image:caption>
      <image:title>Sliding window log algorithm</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image45.webp</image:loc>
      <image:caption>Sliding window counter algorithm. The sliding window counter algorithm is a hybrid approach that combines the fixed window counter and sliding window log. Assume the rate limiter allows a maximum of 7 requests per minute, and there are 5 requests in the previous minute and 3 in the current minute.</image:caption>
      <image:title>Sliding window counter algorithm</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image46.webp</image:loc>
      <image:caption>High-level architecture. The figure shows the high-level architecture for rate limiting, which works as follows. The client sends a request to the rate limiting middleware.</image:caption>
      <image:title>High-level architecture</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image47.webp</image:loc>
      <image:caption>Detailed design. The figure presents a detailed design of the system. Rules are stored on the disk. Workers frequently pull rules from the disk and store them in the cache.</image:caption>
      <image:title>Detailed design</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image48.webp</image:loc>
      <image:caption>Race condition. Race conditions can happen in a highly concurrent environment, as shown in the figure. Assume the counter value in Redis is 3.</image:caption>
      <image:title>Race condition</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image49.webp</image:loc>
      <image:caption>Synchronization issue. One possible solution is to use sticky sessions that allow a client to send traffic to the same rate limiter.</image:caption>
      <image:title>Synchronization issue</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image50.webp</image:loc>
      <image:caption>Synchronization issue. One possible solution is to use sticky sessions that allow a client to send traffic to the same rate limiter.</image:caption>
      <image:title>Synchronization issue</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter4/image51.webp</image:loc>
      <image:caption>Performance optimization. First, a multi-data-center setup is crucial for a rate limiter because latency is high for users located far away from the data center. Second, synchronize data with an eventual consistency model. If you are unclear about the eventual consistency model, refer to the “Consistency” section in “Chapter 6: Design a Key-value Store.”</image:caption>
      <image:title>Performance optimization</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter5</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image52.webp</image:loc>
      <image:caption>The rehashing problem. Let us use an example to illustrate how it works. As shown in Table 5-1, we have 4 servers and 8 string keys with their hashes. To fetch the server where a key is stored, we perform the modular operation f(key) % 4. For instance, hash(key0) % 4 = 1 means a client must contact server 1 to fetch the cached data. The figure shows the distribution of keys based on Table 5-1.</image:caption>
      <image:title>The rehashing problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image53.webp</image:loc>
      <image:caption>The rehashing problem. To fetch the server where a key is stored, we perform the modular operation f(key) % 4. For instance, hash(key0) % 4 = 1 means a client must contact server 1 to fetch the cached data. The figure shows the distribution of keys based on Table 5-1. This approach works well when the size of the server pool is fixed and the data distribution is even.</image:caption>
      <image:title>The rehashing problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image54.webp</image:loc>
      <image:caption>The rehashing problem. This approach works well when the size of the server pool is fixed and the data distribution is even. The figure shows the new distribution of keys based on Table 5-2.</image:caption>
      <image:title>The rehashing problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image55.webp</image:loc>
      <image:caption>The rehashing problem. The figure shows the new distribution of keys based on Table 5-2. As shown there, most keys are redistributed, not just the ones originally stored in the offline server (server 1).</image:caption>
      <image:title>The rehashing problem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image56.webp</image:loc>
      <image:caption>Hash space and hash ring. Now that we understand the definition of consistent hashing, let us find out how it works. By connecting both ends, we get a hash ring as shown in the figure.</image:caption>
      <image:title>Hash space and hash ring</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image57.webp</image:loc>
      <image:caption>Hash space and hash ring. By connecting both ends, we get a hash ring as shown in the figure.</image:caption>
      <image:title>Hash space and hash ring</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image58.webp</image:loc>
      <image:caption>Hash servers. Using the same hash function f, we map servers onto the ring based on server IP or name. The figure shows 4 servers mapped onto the hash ring.</image:caption>
      <image:title>Hash servers</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image59.webp</image:loc>
      <image:caption>Hash keys. One thing worth mentioning is that the hash function used here is different from the one in “the rehashing problem,” and there is no modular operation. As shown in the figure, 4 cache keys (key0, key1, key2, and key3) are hashed onto the hash ring.</image:caption>
      <image:title>Hash keys</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image60.webp</image:loc>
      <image:caption>Server lookup. To determine which server a key is stored on, we go clockwise from the key position on the ring until a server is found. The figure explains this process.</image:caption>
      <image:title>Server lookup</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image61.webp</image:loc>
      <image:caption>Add a server. In the figure, after a new server 4 is added, only key0 needs to be redistributed.</image:caption>
      <image:title>Add a server</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image62.webp</image:loc>
      <image:caption>Remove a server. The rest of the keys are unaffected</image:caption>
      <image:title>Remove a server</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image63.webp</image:loc>
      <image:caption>Two issues in the basic approach. Two problems are identified with this approach. Second, it is possible to have a non-uniform key distribution on the ring. For instance, if servers are mapped to the positions listed in the figure, most of the keys are stored on server 2, while server 1 and server 3 have no data.</image:caption>
      <image:title>Two issues in the basic approach</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image64.webp</image:loc>
      <image:caption>Two issues in the basic approach. Second, it is possible to have a non-uniform key distribution on the ring. For instance, if servers are mapped to the positions listed in the figure, most of the keys are stored on server 2, while server 1 and server 3 have no data. A technique called virtual nodes or replicas is used to solve these problems.</image:caption>
      <image:title>Two issues in the basic approach</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image65.webp</image:loc>
      <image:caption>Virtual nodes. The number of virtual nodes shown is arbitrarily chosen; in real-world systems, the number of virtual nodes is much larger. To find which server a key is stored on, we go clockwise from the key’s location and find the first virtual node encountered on the ring.</image:caption>
      <image:title>Virtual nodes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image66.webp</image:loc>
      <image:caption>Virtual nodes. To find which server a key is stored on, we go clockwise from the key’s location and find the first virtual node encountered on the ring. As the number of virtual nodes increases, the distribution of keys becomes more balanced.</image:caption>
      <image:title>Virtual nodes</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image67.webp</image:loc>
      <image:caption>Find affected keys. In the figure, server 4 is added onto the ring. The affected range starts from s4 (the newly added node) and moves anticlockwise around the ring until a server is found (s3). Thus, keys located between s3 and s4 need to be redistributed to s4. When a server (s1) is removed as shown in the figure, the affected range starts from s1 (the removed node) and moves anticlockwise around the ring until a server is found (s0). Thus, keys located between s0 and s1 must be redistributed to s2.</image:caption>
      <image:title>Find affected keys</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter5/image68.webp</image:loc>
      <image:caption>Find affected keys. When a server (s1) is removed as shown in the figure, the affected range starts from s1 (the removed node) and moves anticlockwise around the ring until a server is found (s0). Thus, keys located between s0 and s1 must be redistributed to s2.</image:caption>
      <image:title>Find affected keys</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter6</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image69.webp</image:loc>
      <image:caption>Hashed key: 253DDEC4. Here is a data snippet in a key-value store. In this chapter, you are asked to design a key-value store that supports the following operations</image:caption>
      <image:title>Hashed key: 253DDEC4</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image70.webp</image:loc>
      <image:caption>CAP theorem. The CAP theorem states that one of the three properties must be sacrificed to support two of the three properties, as shown in the figure. Nowadays, key-value stores are classified based on the two CAP characteristics they support.</image:caption>
      <image:title>CAP theorem</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image71.webp</image:loc>
      <image:caption>Ideal situation. In the ideal world, network partition never occurs. Data written to n1 is automatically replicated to n2 and n3. Both consistency and availability are achieved</image:caption>
      <image:title>Ideal situation</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image72.webp</image:loc>
      <image:caption>Real-world distributed systems. In a distributed system, partitions cannot be avoided, and when a partition occurs, we must choose between consistency and availability. If we choose consistency over availability (CP system), we must block all write operations to n1 and n2 to avoid data inconsistency among these three servers, which makes the system unavailable.</image:caption>
      <image:title>Real-world distributed systems</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image73.webp</image:loc>
      <image:caption>Data partition. Next, a key is hashed onto the same ring, and it is stored on the first server encountered while moving in the clockwise direction. For instance, key0 is stored on s1 using this logic. Using consistent hashing to partition data has the following advantages.</image:caption>
      <image:title>Data partition</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image74.webp</image:loc>
      <image:caption>Data replication. To achieve high availability and reliability, data must be replicated asynchronously over N servers, where N is a configurable parameter. With virtual nodes, the first N nodes on the ring may be owned by fewer than N physical servers. To avoid this issue, we only choose unique servers while performing the clockwise walk logic</image:caption>
      <image:title>Data replication</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image75.webp</image:loc>
      <image:caption>N = The number of replicas. Consider the following example, shown in the figure with N = 3. W = 1 does not mean data is written on one server.</image:caption>
      <image:title>N = The number of replicas</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image76.webp</image:loc>
      <image:caption>Inconsistency resolution: versioning. As shown in, both replica nodes n1 and n2 have the same value. Let us call this value the original value. Server 1 and server 2 get the same value for get(“name”) operation. Next, server 1 changes the name to “johnSanFrancisco”, and server 2 changes the name to “johnNewYork” as shown in. These two changes are performed simultaneously. Now, we have conflicting values, called versions v1 and v2</image:caption>
      <image:title>Inconsistency resolution: versioning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image77.webp</image:loc>
      <image:caption>Inconsistency resolution: versioning. Next, server 1 changes the name to “johnSanFrancisco”, and server 2 changes the name to “johnNewYork” as shown in. These two changes are performed simultaneously. Now, we have conflicting values, called versions v1 and v2. In this example, the original value could be ignored because the modifications were based on it.</image:caption>
      <image:title>Inconsistency resolution: versioning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image78.webp</image:loc>
      <image:caption>Inconsistency resolution: versioning. The above abstract logic is explained with a concrete example as shown in. A client writes a data item D1 to the system, and the write is handled by server Sx, which now has the vector clock D1[(Sx, 1)]</image:caption>
      <image:title>Inconsistency resolution: versioning</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image79.webp</image:loc>
      <image:caption>Failure detection. As shown in, all-to-all multicasting is a straightforward solution. However, this is inefficient when many servers are in the system. A better solution is to use decentralized failure detection methods like gossip protocol. Gossip protocol works as follows</image:caption>
      <image:title>Failure detection</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image80.webp</image:loc>
      <image:caption>Failure detection. If the heartbeat has not increased for more than predefined periods, the member is considered offline, as shown in the figure.</image:caption>
      <image:title>Failure detection</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image81.webp</image:loc>
      <image:caption>Handling temporary failures. If a server is unavailable due to network or server failures, another server will process requests temporarily.</image:caption>
      <image:title>Handling temporary failures</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image82.webp</image:loc>
      <image:caption>Handling permanent failures. Step 1: Divide the key space into buckets (4 in our example) as shown in the figure. A bucket is used as the root-level node to maintain a limited depth of the tree. Step 2: Once the buckets are created, hash each key in a bucket using a uniform hashing method.</image:caption>
      <image:title>Handling permanent failures</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image83.webp</image:loc>
      <image:caption>Handling permanent failures. Step 2: Once the buckets are created, hash each key in a bucket using a uniform hashing method. Step 3: Create a single hash node per bucket</image:caption>
      <image:title>Handling permanent failures</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image84.webp</image:loc>
      <image:caption>Handling permanent failures. Step 3: Create a single hash node per bucket. Step 4: Build the tree upward to the root by calculating the hashes of children.</image:caption>
      <image:title>Handling permanent failures</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image85.webp</image:loc>
      <image:caption>Handling permanent failures. Step 4: Build the tree upward to the root by calculating the hashes of children. To compare two Merkle trees, start by comparing the root hashes.</image:caption>
      <image:title>Handling permanent failures</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image86.webp</image:loc>
      <image:caption>System architecture diagram. Now that we have discussed different technical considerations in designing a key-value store, we can shift our focus to the architecture diagram shown in the figure. The main features of the architecture are listed as follows.</image:caption>
      <image:title>System architecture diagram</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image87.webp</image:loc>
      <image:caption>System architecture diagram. There is no single point of failure, as every node has the same set of responsibilities. As the design is decentralized, each node performs many tasks, as presented in the figure.</image:caption>
      <image:title>System architecture diagram</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image88.webp</image:loc>
      <image:caption>Write path. The figure explains what happens after a write request is directed to a specific node. Please note the proposed designs for the write/read paths are primarily based on the architecture of Cassandra [8]. The write request is persisted in a commit log file.</image:caption>
      <image:title>Write path</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image89.webp</image:loc>
      <image:caption>Read path. After a read request is directed to a specific node, it first checks if the data is in the memory cache. If so, the data is returned to the client, as shown in the figure. If the data is not in memory, it is retrieved from disk instead. We need an efficient way to find out which SSTable contains the key. A Bloom filter [10] is commonly used to solve this problem</image:caption>
      <image:title>Read path</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image90.webp</image:loc>
      <image:caption>Read path. The figure shows the read path when the data is not in memory. The system first checks if the data is in memory. If not, go to step 2</image:caption>
      <image:title>Read path</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter6/image91.webp</image:loc>
      <image:caption>Summary. This chapter covers many concepts and techniques. To refresh your memory, the following table summarizes features and corresponding techniques used for a distributed key-value store</image:caption>
      <image:title>Summary</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter7</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter7/image92.webp</image:loc>
      <image:caption>CHAPTER 7: DESIGN A UNIQUE ID GENERATOR IN DISTRIBUTED SYSTEMS. Here are a few examples of unique IDs. Step 1 - Understand the problem and establish design scope</image:caption>
      <image:title>CHAPTER 7: DESIGN A UNIQUE ID GENERATOR IN DISTRIBUTED SYSTEMS</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter7/image93.webp</image:loc>
      <image:caption>Multi-master replication. As shown in the figure, the first approach is multi-master replication. This approach uses the databases’ auto_increment feature.</image:caption>
      <image:title>Multi-master replication</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter7/image94.webp</image:loc>
      <image:caption>UUID. Here is an example of a UUID: 09c93e62-50b4-468d-bf8a-c07e1040bfb2. UUIDs can be generated independently without coordination between servers. The figure presents the UUID design. In this design, each web server contains an ID generator, and each web server is responsible for generating IDs independently</image:caption>
      <image:title>UUID</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter7/image95.webp</image:loc>
      <image:caption>Ticket Server. Ticket servers are another interesting way to generate unique IDs. Flickr developed ticket servers to generate distributed primary keys [2]. It is worth mentioning how the system works. The idea is to use a centralized auto_increment feature in a single database server (the Ticket Server). To learn more, refer to Flickr’s engineering blog article [2]</image:caption>
      <image:title>Ticket Server</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter7/image96.webp</image:loc>
      <image:caption>Twitter snowflake approach. Divide and conquer is our friend. Instead of generating an ID directly, we divide an ID into different sections. The figure shows the layout of a 64-bit ID. Each section is explained below</image:caption>
      <image:title>Twitter snowflake approach</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter7/image97.webp</image:loc>
      <image:caption>Step 3 - Design deep dive. In the high-level design, we discussed various options for designing a unique ID generator in distributed systems. Datacenter IDs and machine IDs are chosen at startup time and are generally fixed once the system is up and running.</image:caption>
      <image:title>Step 3 - Design deep dive</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter7/image98.webp</image:loc>
      <image:caption>Timestamp. The most important 41 bits make up the timestamp section. The maximum timestamp that can be represented in 41 bits is</image:caption>
      <image:title>Timestamp</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter8</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter8/image99.webp</image:loc>
      <image:caption>URL redirecting. The figure shows what happens when you enter a tinyurl into the browser. Once the server receives a tinyurl request, it converts the short URL to the long URL with a 301 redirect. The detailed communication between clients and servers is shown in the next figure</image:caption>
      <image:title>URL redirecting</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter8/image100.webp</image:loc>
      <image:caption>URL redirecting. The detailed communication between clients and servers is shown in the figure. One thing worth discussing here is 301 redirect vs 302 redirect</image:caption>
      <image:title>URL redirecting</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter8/image101.webp</image:loc>
      <image:caption>URL shortening. Let us assume the short URL looks like this: www.tinyurl.com/{hashValue}. To support the URL shortening use case, we must find a hash function f(x) that maps a long URL to the hashValue, as shown in the figure. The hash function must satisfy the following requirements</image:caption>
      <image:title>URL shortening</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter8/image102.webp</image:loc>
      <image:caption>Data model. In the high-level design, everything is stored in a hash table.</image:caption>
      <image:title>Data model</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter8/image103.webp</image:loc>
      <image:caption>Hash value length. The system must support up to 365 billion URLs based on the back-of-the-envelope estimation. Table 8-1 shows the length of hashValue and the corresponding maximum number of URLs it can support. When n = 7, 62 ^ n = ~3.5 trillion; 3.5 trillion is more than enough to hold 365 billion URLs, so the length of hashValue is 7</image:caption>
      <image:title>Hash value length</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter8/image104.webp</image:loc>
      <image:caption>Hash + collision resolution. To shorten a long URL, we should implement a hash function that hashes a long URL to a 7-character string. As shown in Table 8-2, even the shortest hash value (from CRC32) is too long (more than 7 characters). How can we make it shorter?</image:caption>
      <image:title>Hash + collision resolution</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter8/image105.webp</image:loc>
      <image:caption>Hash + collision resolution. This method can eliminate collisions; however, it is expensive to query the database to check if a shortURL exists for every request.</image:caption>
      <image:title>Hash + collision resolution</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter8/image106.webp</image:loc>
      <image:caption>Base 62 conversion. The figure shows the conversion process. Thus, the short URL is https://tinyurl.com/2TX. Comparison of the two approaches</image:caption>
      <image:title>Base 62 conversion</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter8/image107.webp</image:loc>
      <image:caption>Base 62 conversion. Table 8-3 shows the differences between the two approaches</image:caption>
      <image:title>Base 62 conversion</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter8/image108.webp</image:loc>
      <image:caption>URL shortening deep dive. As one of the core pieces of the system, we want the URL shortening flow to be logically simple and functional. Base 62 conversion is used in our design. We build the following diagram to demonstrate the flow. longURL is the input</image:caption>
      <image:title>URL shortening deep dive</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter8/image109.webp</image:loc>
      <image:caption>URL shortening deep dive. Save ID, shortURL, and longURL to the database as shown in Table 8-4. The distributed unique ID generator is worth mentioning. Its primary function is to generate globally unique IDs, which are used for creating shortURLs. In a highly distributed</image:caption>
      <image:title>URL shortening deep dive</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter8/image110.webp</image:loc>
      <image:caption>URL redirecting deep dive. The figure shows the detailed design of URL redirecting. As there are more reads than writes, the &lt;shortURL, longURL&gt; mapping is stored in a cache to improve performance. The flow of URL redirecting is summarized as follows</image:caption>
      <image:title>URL redirecting deep dive</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://whizan.xyz/books/systemdesign/chapter9</loc>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter9/image111.webp</image:loc>
      <image:caption>CHAPTER 9: DESIGN A WEB CRAWLER. A web crawler is also known as a robot or spider. A crawler is used for many purposes</image:caption>
      <image:title>CHAPTER 9: DESIGN A WEB CRAWLER</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter9/image112.webp</image:loc>
      <image:caption>Step 2 - Propose high-level design and get buy-in. Once the requirements are clear, we move on to the high-level design. Inspired by previous studies on web crawling [4] [5], we propose a high-level design as shown in the figure. First, we explore each design component to understand its functionality. Then, we examine the crawler workflow step by step</image:caption>
      <image:title>Step 2 - Propose high-level design and get buy-in</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter9/image113.webp</image:loc>
      <image:caption>URL Extractor. The URL Extractor parses and extracts links from HTML pages. The figure shows an example of the link extraction process. Relative paths are converted to absolute URLs by adding the “https://en.wikipedia.org” prefix</image:caption>
      <image:title>URL Extractor</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter9/image114.webp</image:loc>
      <image:caption>Web crawler workflow. To better explain the workflow step by step, sequence numbers are added to the design diagram, as shown in the figure</image:caption>
      <image:title>Web crawler workflow</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter9/image115.webp</image:loc>
      <image:caption>DFS vs BFS. Most links from the same web page are linked back to the same host. Standard BFS does not take the priority of a URL into consideration.</image:caption>
      <image:title>DFS vs BFS</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter9/image116.webp</image:loc>
      <image:caption>Politeness. The general idea of enforcing politeness is to download one page at a time from the same host. Queue router: It ensures that each queue (b1, b2, … bn) only contains URLs from the same host</image:caption>
      <image:title>Politeness</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter9/image117.webp</image:loc>
      <image:caption>Politeness. Mapping table: It maps each host to a queue. FIFO queues b1, b2 to bn: Each queue contains URLs from the same host</image:caption>
      <image:title>Politeness</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter9/image118.webp</image:loc>
      <image:caption>Priority. The figure shows the design that manages URL priority. Prioritizer: It takes URLs as input and computes their priorities</image:caption>
      <image:title>Priority</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter9/image119.webp</image:loc>
      <image:caption>Back queues: manage politeness. The figure presents the URL frontier design, which contains two modules</image:caption>
      <image:title>Back queues: manage politeness</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter9/image120.webp</image:loc>
      <image:caption>Distributed crawl. To achieve high performance, crawl jobs are distributed into multiple servers, and each server runs multiple threads.</image:caption>
      <image:title>Distributed crawl</image:title>
    </image:image>
    <image:image>
      <image:loc>https://whizan.xyz/books/systemdesign/chapter9/image121.webp</image:loc>
      <image:caption>Extensibility. As almost every system evolves, one of the design goals is to make the system flexible enough to support new content types. The crawler can be extended by plugging in new modules. The figure shows how to add new modules. The PNG Downloader module is plugged in to download PNG files</image:caption>
      <image:title>Extensibility</image:title>
    </image:image>
  </url>
</urlset>