import torch
import numpy as np

Data

- model: $y_i= w_0+w_1 x_i +\epsilon_i = 2.5 + 4x_i +\epsilon_i, \quad i=1,2,\dots,n$

- model: ${\bf y}={\bf X}{\bf W} +\boldsymbol{\epsilon}$

  • ${\bf y}=\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n\end{bmatrix}, \quad {\bf X}=\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n\end{bmatrix}, \quad {\bf W}=\begin{bmatrix} 2.5 \\ 4 \end{bmatrix}, \quad \boldsymbol{\epsilon}= \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n\end{bmatrix}$
torch.manual_seed(202150754)
n=100
ones= torch.ones(n)
x,_ = torch.randn(n).sort()    # sorted covariate values
X = torch.vstack([ones,x]).T   # design matrix [1, x]
W = torch.tensor([2.5,4])      # true coefficients (w0, w1)
ϵ = torch.randn(n)*0.5         # noise
y = X@W + ϵ                    # observed responses
ytrue = X@W                    # noiseless responses
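
As a reference point (a minimal sketch, not part of the original notes), the closed-form least-squares estimate $\hat{\bf W}=({\bf X}^\top{\bf X})^{-1}{\bf X}^\top{\bf y}$ should land near the true ${\bf W}=(2.5,4)$:

W_ols = torch.linalg.inv(X.T@X) @ X.T @ y   # closed-form OLS estimate
W_ols   # expected to be close to tensor([2.5, 4.0])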

Summary of steps 1~2

Method 1: define the model by hand + define the loss function by hand

What1=torch.tensor([-5.0,10.0],requires_grad=True) 
yhat1=X@What1
loss1=torch.mean((y-yhat1)**2) 
loss1
tensor(110.0313, grad_fn=<MeanBackward0>)
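
For reference, the quantity computed here is the mean squared error evaluated at the initial guess $\hat{\bf W}=(-5,10)$:

  • $\text{loss}(\hat{\bf W})=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat y_i)^2=\frac{1}{n}\|{\bf y}-{\bf X}\hat{\bf W}\|^2$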

Method 2: define the model with torch.nn (bias=False) + define the loss by hand

net2=torch.nn.Linear(in_features=2,out_features=1,bias=False) 
net2.weight.data= torch.tensor([[-5.0,10.0]]) 
yhat2=net2(X) 
loss2=torch.mean((y.reshape(100,1)-yhat2)**2) 
loss2
tensor(110.0313, grad_fn=<MeanBackward0>)
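
With bias=False, torch.nn.Linear computes ${\bf X}{\bf W}^\top$ from its stored weight, so this is the same computation as in method 1; a quick check (assuming net2 and yhat2 from the cell above):

torch.allclose(yhat2, X@net2.weight.T)   # expected: True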

Method 3: define the model with torch.nn (bias=True) + define the loss by hand

net3=torch.nn.Linear(in_features=1,out_features=1,bias=True) 
net3.weight.data= torch.tensor([[10.0]])
net3.bias.data= torch.tensor([[-5.0]]) 
yhat3=net3(x.reshape(100,1)) 
loss3=torch.mean((y.reshape(100,1)-yhat3)**2) 
loss3
tensor(110.0313, grad_fn=<MeanBackward0>)
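
With bias=True and a single input feature the layer computes $\hat y_i = w x_i + b$, so with $w=10$, $b=-5$ the predictions should coincide with those of method 2 (a quick check, assuming yhat2 is still in scope):

torch.allclose(yhat2, yhat3)   # expected: True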

Method 4: define the model by hand + use torch.nn.MSELoss() as the loss

What4=torch.tensor([-5.0,10.0],requires_grad=True) 
yhat4=X@What4 
lossfn=torch.nn.MSELoss() 
loss4=lossfn(y,yhat4) 
loss4
tensor(110.0313, grad_fn=<MseLossBackward0>)
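
Note that the documented signature is MSELoss()(input, target) with the prediction first; since the squared error is symmetric, swapping the two arguments gives the same value (a quick check):

torch.allclose(lossfn(y,yhat4), lossfn(yhat4,y))   # expected: True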

Method 5: define the model with torch.nn (bias=False) + use torch.nn.MSELoss() as the loss

net5=torch.nn.Linear(in_features=2,out_features=1,bias=False) 
net5.weight.data= torch.tensor([[-5.0,10.0]]) 
yhat5=net5(X) 
#lossfn=torch.nn.MSELoss() 
loss5=lossfn(y.reshape(100,1),yhat5) 
loss5 
tensor(110.0313, grad_fn=<MseLossBackward0>)

Method 6: define the model with torch.nn (bias=True) + use torch.nn.MSELoss() as the loss

net6=torch.nn.Linear(in_features=1,out_features=1,bias=True) 
net6.weight.data= torch.tensor([[10.0]])
net6.bias.data= torch.tensor([[-5.0]]) 
yhat6=net6(x.reshape(100,1)) 
loss6=lossfn(y.reshape(100,1),yhat6) 
loss6
tensor(110.0313, grad_fn=<MseLossBackward0>)

step3: differentiation

loss1

loss1.backward()
What1.grad.data
tensor([-17.3043,  14.8581])
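
For this loss the gradient also has a closed form, $\nabla_{\hat{\bf W}}\,\text{loss} = -\frac{2}{n}{\bf X}^\top({\bf y}-{\bf X}\hat{\bf W})$; a minimal check (assuming it is run right after the cell above, before any update):

-2/n*X.T@(y-X@What1.data)   # should match What1.grad up to floating-point error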

loss2

loss2.backward()
net2.weight.grad
tensor([[-17.3043,  14.8581]])

loss3

loss3.backward()
net3.bias.grad,net3.weight.grad
(tensor([[-17.3043]]), tensor([[14.8581]]))

loss4

loss4.backward()
What4.grad.data
tensor([-17.3043,  14.8581])

loss5

loss5.backward()
net5.weight.grad
tensor([[-17.3043,  14.8581]])

loss6

loss6.backward()
net6.bias.grad,net6.weight.grad
(tensor([[-17.3043]]), tensor([[14.8581]]))

step4: update

loss1

What1.data ## before the update
tensor([-5., 10.])
lr=0.1  # learning rate
What1.data = What1.data - lr*What1.grad.data ## after the update
What1
tensor([-3.2696,  8.5142], requires_grad=True)
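
For illustration only (running it here would apply a second update): the same step can be written without touching .data by wrapping it in torch.no_grad():

with torch.no_grad():
    What1 -= lr*What1.grad   # same update rule: W ← W - lr*grad
What1.grad.zero_()           # reset the gradient so it does not accumulate into the next backward()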

loss2

net2.weight.data 
tensor([[-3.2696,  8.5142]])
  • SGD: Implements stochastic gradient descent (optionally with momentum).
optmz2 = torch.optim.SGD(net2.parameters(),lr=0.1) 
list(net2.parameters())
[Parameter containing:
 tensor([[-3.2696,  8.5142]], requires_grad=True)]
optmz2.step() ## update 
net2.weight.data ## after the update
tensor([[-1.5391,  7.0284]])

loss3

net3.bias.data,net3.weight.data
(tensor([[-5.]]), tensor([[10.]]))
optmz3 = torch.optim.SGD(net3.parameters(),lr=0.1) 
optmz3.step()
net3.bias.data,net3.weight.data
(tensor([[1.9217]]), tensor([[4.0567]]))
list(net3.parameters())
[Parameter containing:
 tensor([[4.0567]], requires_grad=True),
 Parameter containing:
 tensor([[1.9217]], requires_grad=True)]

loss4

What4.data ## before the update
tensor([-5., 10.])
lr=0.1 
What4.data = What4.data - lr*What4.grad.data ## after the update
What4
tensor([-3.2696,  8.5142], requires_grad=True)

loss5

net5.weight.data 
tensor([[-5., 10.]])
optmz5 = torch.optim.SGD(net5.parameters(),lr=0.1) 
optmz5.step() ## update 
net5.weight.data ## after the update
tensor([[-3.2696,  8.5142]])

loss6

net6.bias.data,net6.weight.data
(tensor([[-5.]]), tensor([[10.]]))
optmz6 = torch.optim.SGD(net6.parameters(),lr=0.1) 
optmz6.step()
net6.bias.data,net6.weight.data
(tensor([[-3.2696]]), tensor([[8.5142]]))

Now we just repeat steps 1~4.

net=torch.nn.Linear(in_features=2,out_features=1,bias=False) ## define the model
optmz=torch.optim.SGD(net.parameters(),lr=0.1)
mseloss=torch.nn.MSELoss()
for epoc in range(100):
    # step1: yhat
    yhat=net(X) ## compute yhat
    # step2: loss
    loss=mseloss(y.reshape(100,1),yhat)
    # step3: differentiation
    loss.backward()
    # step4: update
    optmz.step()
    optmz.zero_grad() ## prevent the gradients from accumulating
list(net.parameters())
[Parameter containing:
 tensor([[2.5306, 3.9915]], requires_grad=True)]
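
A rough sanity check (a sketch, assuming it is run right after the loop): the fitted weights should be close to the true ${\bf W}=(2.5,4)$, and the training MSE should be close to the noise variance $0.5^2=0.25$.

mseloss(y.reshape(100,1),net(X))   # expected to be roughly 0.25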

Homework

Run the code below and observe the result.

net=torch.nn.Linear(in_features=2,out_features=1,bias=False) ## define the model
optmz=torch.optim.SGD(net.parameters(),lr=0.1)
mseloss=torch.nn.MSELoss()
for epoc in range(100):
    # step1: yhat
    yhat=net(X) ## compute yhat
    # step2: loss
    loss=mseloss(y.reshape(100,1),yhat)
    # step3: differentiation
    loss.backward()
    # step4: update
    optmz.step()
list(net.parameters())
[Parameter containing:
 tensor([[5.9608, 7.2038]], requires_grad=True)]

definition of SGD

CLASS > torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)

  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
  • lr (float) – learning rate
  • momentum (float, optional) – momentum factor (default: 0)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • dampening (float, optional) – dampening for momentum (default: 0)
  • nesterov (bool, optional) – enables Nesterov momentum (default: False)
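
With the defaults used above (momentum=0, dampening=0, weight_decay=0, nesterov=False), step() performs the plain gradient-descent update, i.e. exactly the manual update from step 4:

  • $\theta \leftarrow \theta - \text{lr}\cdot \nabla_{\theta}\,\text{loss}$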

Functions used

step(closure=None) > Performs a single optimization step.

Parameters > closure (callable, optional) – A closure that reevaluates the model and returns the loss.

zero_grad(set_to_none=False)> Sets the gradients of all optimized torch.Tensor s to zero.

Parameters> set_to_none (bool) – instead of setting to zero, set the grads to None. This will in general have lower memory footprint, and can modestly improve performance. However, it changes certain behaviors. For example:

  • When the user tries to access a gradient and perform manual ops on it, a None attribute or a Tensor full of 0s will behave differently.
  • If the user requests zero_grad(set_to_none=True) followed by a backward pass, .grads are guaranteed to be None for params that did not receive a gradient.
  • torch.optim optimizers have a different behavior if the gradient is 0 or None (in one case it does the step with a gradient of 0 and in the other it skips the step altogether).
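
A minimal sketch (not from the original notes; tmpnet and tmpopt are throwaway names) of the difference between the two settings:

tmpnet = torch.nn.Linear(2,1)
tmpopt = torch.optim.SGD(tmpnet.parameters(),lr=0.1)
tmpnet(torch.randn(5,2)).sum().backward()
tmpopt.zero_grad(set_to_none=False)
tmpnet.weight.grad   # a tensor of zeros
tmpnet(torch.randn(5,2)).sum().backward()
tmpopt.zero_grad(set_to_none=True)
tmpnet.weight.grad   # None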
