PyTorch Regression
In this notebook, we will use a dataset containing data about passengers of the Titanic. Based on this data, we will build a Ridge Regression model, which here just means a Logistic Regression model that uses L2 regularization, to predict whether a person survived the sinking based on their passenger class, sex, the number of their siblings/spouses aboard, the number of their parents/children aboard, and the fare they paid.
First, we import everything we need for plotting data and creating a great model to make predictions on the data.
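A minimal set of imports for a notebook like this might look as follows (the original notebook's exact list may differ):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, random_split
```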
Data Exploration
Here, we can see what the data actually looks like. The first column indicates whether the person survived with a 1 or a 0, where 1 stands for survival and 0 for death. The rest of the columns are our input columns used to predict survival. We will, however, drop the Name column, as it does not hold important information for predicting survival. You can also see below that we have data for 887 persons and 8 total columns, of which 6 will be the input values and 1 (the Survived column) the corresponding label.
To get a little more familiar with the data, we can do some computations with it and plot it. First, we print the proportion of survivors, which can also be described as the overall probability of survival.
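Assuming the dataframe is loaded as titanic_df (a name I am assuming here), this is a one-liner with pandas:

```python
titanic_df = pd.read_csv('titanic.csv')  # file path assumed

# The Survived column holds 1 for survival and 0 for death,
# so its mean is the overall probability of survival.
survival_rate = titanic_df['Survived'].mean()
print(f'Overall probability of survival: {survival_rate:.2%}')
```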
When we look at how likely people from different classes and of different sexes are to survive, we can see a clear trend: the higher the class, the higher the survival probability, and women are more likely to survive than men. Ever wondered why this is the case? The answer is quite simple.
When the Titanic began to sink, women and children boarded the lifeboats before the men. The lower-class passengers were not treated equally at the time of the sinking: there were so many people in the lower class that not all of them could be informed by the stewardesses. Subsequently, it took them much longer to get to the deck for rescue, while first- and second-class passengers were already boarding the lifeboats. Also, the sailors fastened down the hatchways leading to the third-class section. They said they wanted to keep the air down there so the vessel could stay up longer. It meant all hope was gone for the passengers still down there.
Another reason why so many people died was the lack of safety measures aboard the Titanic. For example, there were not enough boats for the passengers to escape the ship: the lifeboats would only have been sufficient for half the people onboard, and due to bad organization not all of them were completely filled. More than half of the passengers were left behind. One good aspect, however, is that the laws for ship safety have become much stricter since this disaster. If you want to read about the sinking in detail, have a look at this: https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic
Looking at the prices, which are all measured in pounds, we can see the overall average fare and then the average fares of the different classes. Note that due to inflation, these amounts would be a lot higher in today's pounds.
There are also people of all ages on board, while the average age is 30 years.
To visualize the differences in survival probability mentioned and explained above, we can plot them as the following plots show. Here, you can see the differences between the classes and sexes very well.
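One way to produce such plots, assuming the column names Pclass, Sex and Survived:

```python
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# The mean of Survived per group is the empirical survival probability
titanic_df.groupby('Pclass')['Survived'].mean().plot.bar(ax=axes[0])
axes[0].set_title('Survival probability by class')

titanic_df.groupby('Sex')['Survived'].mean().plot.bar(ax=axes[1])
axes[1].set_title('Survival probability by sex')

plt.tight_layout()
plt.show()
```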
Let’s now look at the fare distribution and the fares across the different classes.
In the following we can clearly see that most passengers did not have any siblings/spouses or parents/children aboard.
Lastly, we look at the distribution of the passengers’ ages.
Data Preparation
Now, to be able to train our model, we want to convert our pandas dataframe into PyTorch tensors. To do this, we define the dataframe_to_arrays method, which does the conversion to NumPy arrays. To use the function, we need to specify three kinds of columns: input columns, categorical columns (columns that contain not numbers but strings standing for a category), and output columns. The function can then properly create NumPy arrays for the input data and labels, first converting the categorical columns to numerical ones. Then we can easily convert the arrays to PyTorch tensors and specify the desired data types, so that we are ready to define the model and train it on the data. Note also that the normalize parameter is set to True, which makes the function normalize the input data by squishing all values into the range between 0 and 1 with min-max normalization, so that the model can learn better from the now more uniform data.
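The code cell is not reproduced here; the following is a minimal sketch of what such a function could look like, assuming the column names of the common 887-row Titanic CSV and pandas category codes for the categorical columns:

```python
def dataframe_to_arrays(df, input_cols, categorical_cols, output_cols,
                        normalize=True):
    # Work on a copy so the original dataframe stays untouched
    df = df.copy(deep=True)
    # Convert categorical columns to numeric codes; for Sex the
    # alphabetical codes give female -> 0, male -> 1
    for col in categorical_cols:
        df[col] = df[col].astype('category').cat.codes
    inputs = df[input_cols].to_numpy(dtype='float32')
    targets = df[output_cols].to_numpy(dtype='float32')
    if normalize:
        # Min-max normalization: squish every input column into [0, 1]
        mins, maxs = inputs.min(axis=0), inputs.max(axis=0)
        inputs = (inputs - mins) / (maxs - mins)
    return inputs, targets

inputs_array, targets_array = dataframe_to_arrays(
    titanic_df,
    input_cols=['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard',
                'Parents/Children Aboard', 'Fare'],
    categorical_cols=['Sex'],
    output_cols=['Survived'])

# Convert the NumPy arrays to PyTorch tensors with the desired dtypes
inputs = torch.from_numpy(inputs_array).float()
targets = torch.from_numpy(targets_array).float()
```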
Now that we have the PyTorch tensors for the input data and the labels, we put them into a PyTorch TensorDataset, which contains pairs of inputs and labels.
Another thing we have to do is split the original dataset into one part for training the model and another for validating that the model is actually learning something. The validation dataset contains data the model has never seen before, and by making predictions on it we can see how well the model performs on unknown data. The accuracy on the validation data, along with the validation loss, will be used as a metric across all training epochs.
The last thing we do with our data is put it into a DataLoader (one for the validation data and one for the training data), which will be used to train the model with shuffled, batched data.
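Putting the dataset, the split and the loaders together, a minimal sketch (the split size and batch size are assumptions, not necessarily the notebook's values):

```python
dataset = TensorDataset(inputs, targets)

# Hold out part of the 887 rows for validation
val_size = 87
train_size = len(dataset) - val_size
train_ds, val_ds = random_split(dataset, [train_size, val_size])

batch_size = 64
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)
```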
Defining the Model Structure
Now we can create our model, which is just a simple Logistic Regression model with a linear layer that accepts 6 inputs and outputs 1 value between 0 and 1. The model's forward method returns the probability of survival it predicts by using the sigmoid activation function. This is necessary so that we can train the model to output a 1 when it thinks the person would survive and a 0 when it thinks the person would not survive, even though the model will probably never return an exact 1 or 0 but rather a probability closer to 1 or 0 after some time of training. Moreover, we define some other methods in our model for training and for computing and printing accuracies. One more thing to note is that as the loss function in training_step we use the Binary Cross Entropy loss. Lastly, we create an instance of TitanicModel called model, which we will train on the training data.
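The text names TitanicModel, forward, training_step and model; the implementation below is a sketch in the style common to such course notebooks, not necessarily the author's exact code:

```python
def accuracy(outputs, labels):
    # Threshold the predicted probabilities at 0.5 to get hard predictions
    preds = (outputs > 0.5).float()
    return (preds == labels).float().mean()

class TitanicModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(6, 1)  # 6 input features -> 1 output

    def forward(self, xb):
        # The sigmoid squashes the linear output into a probability in (0, 1)
        return torch.sigmoid(self.linear(xb))

    def training_step(self, batch):
        inputs, labels = batch
        out = self(inputs)
        return F.binary_cross_entropy(out, labels)  # BCE loss

    def validation_step(self, batch):
        inputs, labels = batch
        out = self(inputs)
        return {'val_loss': F.binary_cross_entropy(out, labels).detach(),
                'val_acc': accuracy(out, labels)}

@torch.no_grad()
def evaluate(model, val_loader):
    # Average the per-batch validation metrics
    model.eval()
    outputs = [model.validation_step(batch) for batch in val_loader]
    return {'val_loss': torch.stack([x['val_loss'] for x in outputs]).mean().item(),
            'val_acc': torch.stack([x['val_acc'] for x in outputs]).mean().item()}

model = TitanicModel()
```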
Training the Model
Now comes the cool part: the actual training! For this, we need a fit_one_cycle function to do the training for our model. Like any usual fit function, it uses an optimizer to adjust the model's parameters with a certain learning rate according to the gradients (the gradients are just the partial derivatives of the loss with respect to each parameter), which are obtained by backpropagating the loss through the model. There are, however, some things about this fit function I want to point out that go beyond the usual compute-the-loss-then-adjust-weights-and-biases routine; a sketch of the function follows the three points below.
Learning rate scheduling: This is a technique that replaces a fixed learning rate, usually by changing the learning rate after every batch of training. This can be done in several ways, but the way we will do it is with the "One Cycle Learning Policy", which starts with a smaller learning rate that gradually increases over the first 30% of the epochs and then decreases again for optimal learning. For this scheduler we just need to set the maximum learning rate it will reach over time. If you want to go deeper into this topic, I suggest you read this: https://sgugger.github.io/the-1cycle-policy.html
Weight decay / L2 Regularization: Another thing we use is weight decay, which adds the sum of the squared weights to the loss function so that bigger weights are penalized, since bigger weights are usually a sign of overfitting. Thereby, as the weights are lowered, we make the model generalize better and achieve better results on unknown data. See this for more information about weight decay: https://towardsdatascience.com/this-thing-called-weight-decay-a7cd4bcfccab
Gradient clipping: Lastly, there is gradient clipping. This is actually quite simple but still very useful. Gradient clipping just limits the gradients to a certain value, so that if a gradient would take the model in the wrong direction, its magnitude is capped and the model cannot be hurt by excessively large gradient values. Here is an interesting post about it: https://towardsdatascience.com/what-is-gradient-clipping-b8e815cdfb48
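Here is a sketch of what fit_one_cycle could look like, combining the three techniques above with PyTorch's built-in OneCycleLR scheduler and gradient value clipping (the signature is an assumption; evaluate comes from the model sketch earlier):

```python
def fit_one_cycle(epochs, max_lr, model, train_loader, val_loader,
                  weight_decay=0, grad_clip=None, opt_func=torch.optim.Adam):
    history = []
    # Weight decay (L2 regularization) is handled by the optimizer itself
    optimizer = opt_func(model.parameters(), max_lr, weight_decay=weight_decay)
    # One Cycle policy: the LR ramps up over the first ~30% of all steps,
    # then anneals back down; the scheduler steps once per batch
    sched = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr, epochs=epochs,
        steps_per_epoch=len(train_loader), pct_start=0.3)

    for epoch in range(epochs):
        model.train()
        train_losses = []
        for batch in train_loader:
            loss = model.training_step(batch)
            train_losses.append(loss.detach())
            loss.backward()
            if grad_clip is not None:
                # Gradient clipping: cap every gradient entry at +/- grad_clip
                nn.utils.clip_grad_value_(model.parameters(), grad_clip)
            optimizer.step()
            optimizer.zero_grad()
            sched.step()
        result = evaluate(model, val_loader)
        result['train_loss'] = torch.stack(train_losses).mean().item()
        history.append(result)
    return history
```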
Now, we should be all set to start the training. First, we compute the accuracy and loss on the validation data to see how the model performs initially versus after the training. For the training itself, we define the hyperparameters: the maximum learning rate, the number of epochs to train for, the weight decay value, the gradient clipping value, and the optimizer, which will be the Adam optimizer.
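A sketch of this setup, with illustrative hyperparameter values (the article's exact choices are not reproduced here):

```python
# Illustrative hyperparameters, not necessarily the notebook's values
epochs = 100
max_lr = 0.01
grad_clip = 0.1
weight_decay = 1e-4

print('Before training:', evaluate(model, val_loader))

history = fit_one_cycle(epochs, max_lr, model, train_loader, val_loader,
                        weight_decay=weight_decay, grad_clip=grad_clip,
                        opt_func=torch.optim.Adam)

print('After training:', evaluate(model, val_loader))
```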
After this brief training, our model should be very good at predicting survival for Titanic passengers (note that it can never be perfect, as there is always a component of luck involved in an individual's probability of survival). You can also see this when you look at the accuracy and loss on the validation data and compare them to the values computed before the training. To verify this further, let's plot the accuracy on the validation data over the course of the training phase, as well as the loss on both the training data and the validation data. The loss plot can also show us whether the model is overfitting: if the training loss keeps decreasing while the validation loss increases or stays the same, we are getting better and better on the training data but worse on the validation data, which is definitely not what we want. However, the model does not seem to be overfitting, which is great!
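Using the history returned by the fit_one_cycle sketch above, the plots could be produced like this:

```python
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot([r['val_acc'] for r in history], '-x')
axes[0].set_title('Validation accuracy')
axes[0].set_xlabel('epoch')

axes[1].plot([r['train_loss'] for r in history], '-bx', label='training')
axes[1].plot([r['val_loss'] for r in history], '-rx', label='validation')
axes[1].set_title('Loss')
axes[1].set_xlabel('epoch')
axes[1].legend()

plt.show()
```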
Save the Model
Now we need to save our model by writing its state, that is, all of its parameters, to a file, and log our hyperparameters, final accuracy, and final loss, so that we can later easily compare different model architectures and hyperparameter choices to see how well they perform.
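Saving the parameters is a one-liner (the file name is my choice; logging the hyperparameters depends on the experiment-tracking tool used and is omitted here):

```python
torch.save(model.state_dict(), 'titanic-logistic.pth')
```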
Note that when looking at the weights from the model.state_dict() output, we can see how important each of the input values is. For example, we can see that the class is associated with a negative weight, which makes sense, since people from a class described by a higher number, like class 3, were less likely to survive. The next weight shows the extreme importance of the sex for the prediction, as it is associated with the largest negative value, which is easy to understand if you know that a man is represented by 1 and a woman by 0. What we can also deduce from the last weight is that the larger the fare paid, the higher the survival probability, which makes sense too.
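The weights discussed above can be inspected like this:

```python
print(model.state_dict())
# An OrderedDict with 'linear.weight' (one weight per input column, in the
# same order as input_cols: Pclass, Sex, Age, ..., Fare) and 'linear.bias'
```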
Test the Model on Samples
With the training phase behind us, we can do some testing on various single examples from the validation data to get a feeling for how well the model performs. For this, we need a function that returns the model's prediction for a given dataset element, along with the person's data and whether the person actually survived. In order to display the data, it is important that we denormalize it again by mapping all values from the range between 0 and 1 back to their initial ranges and by converting the categorical column Sex back to the strings female and male from the numbers 0 and 1 they were represented by in the dataset.
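A minimal sketch of such a prediction helper, omitting the denormalization and display logic the text describes (which would need the stored column minima and maxima):

```python
def predict_single(input, label, model):
    # Add a batch dimension, run the model, read out the probability
    prob = model(input.unsqueeze(0)).item()
    prediction = 1 if prob > 0.5 else 0
    print(f'Predicted survival probability: {prob:.2f} -> {prediction}')
    print(f'Actually survived: {int(label.item())}')

# Try it on an element of the validation set
input, label = val_ds[0]
predict_single(input, label, model)
```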
As expected, the model gets most predictions right, with survival probabilities that make complete sense when looking at the input data. Even though it was wrong on the first prediction, this is not a bad sign, since there is always a component of luck involved, which makes a case like this not perfectly predictable. If we recall the survival probabilities for persons of different sexes and classes, we can see that the prediction is actually pretty close to them, which I think is a good sign.
Don’t you want to find out as well whether you would have survived the Titanic disaster? To do this, we have a nice function that asks you to input your data and then returns its prediction after converting the categorical values to numbers and normalizing the input data. Just think of a fare reasonable for your chosen class (or don’t, and try to break the predictions). You can, of course, completely make up data to test the model and see which people would have survived.
Lastly, we can make a submission .csv file for the Titanic competition on Kaggle to claim first place ;).
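As a sketch, assuming test inputs from Kaggle's test.csv have been prepared and normalized exactly like the training data (test_df and test_inputs are assumed names; Kaggle's submission format expects PassengerId and Survived columns):

```python
# test_inputs: tensor built from Kaggle's test.csv via dataframe_to_arrays
with torch.no_grad():
    probs = model(test_inputs).squeeze(1)
preds = (probs > 0.5).int().numpy()

submission = pd.DataFrame({'PassengerId': test_df['PassengerId'],
                           'Survived': preds})
submission.to_csv('submission.csv', index=False)
```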
See the entire notebook here.
Summary and Opportunities for Future Work
Lastly, I want to summarize the amazing things I learned from this nice project at Jovian. The first major takeaway was how to deal with pandas dataframes and data in general, which was usually done for me whenever I was provided a starter notebook. Since I did this project from scratch, I read about the pandas library and its various functions, so I was able to use this data for my project very well. I also learned quite a bit about data normalization.
Another thing I took away from this was a lot of knowledge about Logistic Regression, as I read quite a lot about the various approaches. For example, I read about why you would use 1 output neuron rather than 2 output neurons for binary classification and came to the conclusion that using 1 output neuron is less prone to overfitting, as it has fewer parameters, which makes total sense. This is also why I used this setup together with the Binary Cross Entropy loss for my model. Moreover, I learned the math behind regularization so I could better understand and implement it, which helped a lot when implementing regularization and choosing the weight decay hyperparameter.
Not to be forgotten are the things I learned about the disaster itself by examining the data and from additional research, which was very interesting.
To sum up, I cannot stress enough how great such projects are for learning, because by doing everything yourself you learn much better. I feel more comfortable with the PyTorch library and Machine Learning now.
I can’t wait to work on more challenging projects with other datasets in the future and to compete in various interesting Kaggle challenges, applying all the newly-learned things to deepen my knowledge of the world of AI and have fun. I am really looking forward to doing the thousands of projects that I have in mind!