(review) Graph Neural Network-Based Anomaly Detection in Multivariate Time Series
Paper
- Notes and worked-through toy examples
Uses git
GitHub code: https://github.com/d-ailin/GDN
import pandas as pd
import matplotlib.pyplot as plt
- $N$ sensors
- $T_{train}$ time ticks
- $S_{train} = [\, s^{(1)}_{train}, \dots, s^{(T_{train})}_{train} \,]$
# Toy S_train: 3 sensors (서울, 경기, 부산) observed over 365 time ticks (N = 3, T_train = 365).
Strain = pd.concat([pd.DataFrame([f"서울{idx}" for idx in range(1, 366)]).T,
                    pd.DataFrame([f"경기{idx}" for idx in range(1, 366)]).T,
                    pd.DataFrame([f"부산{idx}" for idx in range(1, 366)]).T], axis=0)
Strain
Strain_1 = Strain.iloc[:, 0]      # s_train^(1): all sensors at time tick 1
Strain_1
Strain_365 = Strain.iloc[:, 364]  # s_train^(365): all sensors at the last time tick
Strain_365
$s_{train}^{(t)} \in \mathbb{R}^{N}$
# Toy S_test: the same 3 sensors over the next 365 time ticks (366-730).
Stest = pd.concat([pd.DataFrame([f"서울{idx}" for idx in range(366, 731)]).T,
                   pd.DataFrame([f"경기{idx}" for idx in range(366, 731)]).T,
                   pd.DataFrame([f"부산{idx}" for idx in range(366, 731)]).T], axis=0)
Stest
Stest_1 = Stest.iloc[:, 0]        # s_test^(1)
Stest_1
Stest_365 = Stest.iloc[:, 364]    # s_test^(365)
Stest_365
$a(t) \in \{ 0,1 \}$
$a(t) = 1$ indicates that time t is anomalous.
# Toy sensor series over 11 time ticks: flat at 2 with a spike (value 4) at t = 6.
plt.plot([1,2,3,4,5,6,7,8,9,10,11],[2,2,2,2,2,4,2,2,2,2,2])
Here $a(6) = 1$ and $a(t) = 0$ for every other time tick.
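A minimal sketch of the corresponding label vector for this toy series (my own illustration, not the paper's code), assuming 11 time ticks with the single anomaly at $t = 6$:
import numpy as np
a = np.zeros(11, dtype=int)   # a(t) for t = 1, ..., 11
a[6 - 1] = 1                  # a(6) = 1: the spike at t = 6 is anomalous
a                             # array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])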
we do this by introducing an embedding vector for each sensor, representing its characteristics:
$v_i \in \mathbb{R}^{d} , \text{for } i \in \{ 1,2, \dots , N \}$
['V_서울','V_부산', 'V_경기']
# Toy embedding v_서울 with d = 10 (here just the first 10 entries of the Seoul row).
v_서울 = pd.DataFrame([f"서울{idx}" for idx in range(1, 11)]).T
v_서울
The dimension $d$ of $v_i$ is chosen arbitrarily; each $v_i$ corresponds to one row (one sensor) of $S_{train}$.
- v_서울 above is built from 10 entries of the first row (the Seoul row), purely as a toy illustration
This prior information can be flexibly represented as a set of candidate relations $\mathcal{C}_i$ for each sensor $i$, i.e. the sensors it could be dependent on:
$\mathcal{C}_i \subseteq \{ 1,2, \dots , N \} \setminus \{ i \}$
C_서울 = ['경기','인천']   # candidate neighbors that 서울 could depend on
C_서울
As above, $\mathcal{C}_i$ here means the adjacent regions; how "adjacent" is defined (e.g., by distance) is up to us.
$e_{ji} = \cfrac{v_i^{\top} v_j}{\|v_i\| \cdot \|v_j\|}, \quad \text{for } j \in \mathcal{C}_i$
The above is cosine similarity between embeddings, used as a closeness measure between sensors.
$A_{ji} = \mathbb{1} \{ j \in \text{TopK}(\{ e_{ki} : k \in \mathcal{C}_i \}) \}$
$A_{ji}$ is 1 if sensor $j$ is connected to sensor $i$, and 0 otherwise.
Only the few candidates with the highest cosine similarity get selected.
Both $e_{ji}$ and $A_{ji}$ build on top of $\mathcal{C}_i$.
However large we make $\mathcal{C}_i$, anything beyond the top-$k$ kept in $A_{ji}$ is cut off!
TopK denotes the indices of top-k values among its input (i.e. the normalized dot products).
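A minimal numpy sketch of $e_{ji}$ and $A_{ji}$ (my own illustration, not the authors' implementation), using random embeddings and taking $\mathcal{C}_i = \{1,\dots,N\} \setminus \{i\}$ as the candidate set:
import numpy as np

np.random.seed(0)
N, d, k = 4, 10, 2                          # 4 sensors, embedding dim d = 10, keep top-k = 2
V = np.random.randn(N, d)                   # row i is the embedding v_i

# e_{ji}: cosine similarity between v_i and each candidate v_j
V_unit = V / np.linalg.norm(V, axis=1, keepdims=True)
e = V_unit @ V_unit.T                       # e[j, i] = cosine similarity of v_j and v_i
np.fill_diagonal(e, -np.inf)                # exclude j = i from the candidates

# A_{ji} = 1 only for the k candidates with the largest e_{ji}
A = np.zeros((N, N), dtype=int)
for i in range(N):
    topk = np.argsort(e[:, i])[-k:]         # indices of the k sensors most similar to i
    A[topk, i] = 1
A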
$x^{(t)} \in \mathbb{R}^{N \times w}$
$x^{(t)} := [s^{(t-w)},s^{(t-w+1)},\dots, s^{(t-1)}]$
# Toy x^(t) with w = 5: the columns [s^(t-w), s^(t-w+1), ..., s^(t-2), s^(t-1)] for each sensor.
x_t = pd.DataFrame([['서울t-w','서울t-w+1','서울t-w+2','서울t-2','서울t-1'],
                    ['경기t-w','경기t-w+1','경기t-w+2','경기t-2','경기t-1'],
                    ['인천t-w','인천t-w+1','인천t-w+2','인천t-2','인천t-1']])
x_t
This $x^{(t)}$ is a slice of $w$ consecutive columns taken from $S_{train}$.
Here $w$ is the sliding window size.
s_t_1 = x_t.iloc[:, 4]   # s^(t-1): the most recent column in the window
s_t_1
s_t_w = x_t.iloc[:, 0]   # s^(t-w): the oldest column in the window
s_t_w
So the model input at each time tick is an $N \times w$ slice of $S$.
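A small sketch of slicing one such $x^{(t)}$ out of the toy Strain built earlier (window size $w = 5$ and time tick $t = 100$ are arbitrary choices for illustration):
w, t = 5, 100                                  # assumed window size and target time tick
# Column j of Strain holds s^(j+1), so s^(t-w), ..., s^(t-1) sit in columns t-w-1, ..., t-2.
x_window = Strain.iloc[:, t - w - 1 : t - 1]   # N x w model input x^(t)
x_window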
Feature Extractor
$z_i^{(t)} = \text{ReLU} \big( \alpha_{i,i} \mathbf{W} x_{i}^{(t)} + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j} \mathbf{W} x_{j}^{(t)} \big)$
Here $z_{i}^{(t)}$ is a $d \times 1$ vector.
$x_{i}^{(t)} \in \mathbb{R}^{w}$
$\mathcal{N}(i) = \{ j | A_{ji} > 0 \}$
$\mathcal{N}(i) = \mathcal{N}(서울) = \{ 경기, 인천 \}$
$\mathbf{W} \in \mathbb{R}^{d \times w}$
$\mathbf{W}$ is the weight matrix we train.
$g_{i}^{(t)} = v_i \oplus \mathbf{W} x_{i}^{(t)}$
$(2d \times 1) = (d \times 1) \oplus (d \times 1)$, i.e., the concatenation of $v_i$ and $\mathbf{W} x_i^{(t)}$.
$\pi(i,j) = \text{LeakyReLU}(a^{\top}(g_{i}^{(t)} \oplus g_{j}^{(t)}))$
$\pi(i,j)$ is a linear score $a^{\top}(g_{i}^{(t)} \oplus g_{j}^{(t)})$ passed through LeakyReLU, analogous to $X\hat{\beta}$ in a linear model.
$\alpha_{i,j} = \cfrac{\exp(\pi(i,j))}{\sum_{k \in \mathcal{N}(i) \cup \{ i \}}\exp(\pi (i,k))}$
$\alpha$ behaves like a probability $p$: if the two sensors' features are similar it will be close to 1, regardless of physical distance.
- If $\alpha_{i,j}$ were near 0, sensor $j$'s influence would vanish from the ReLU expression for $z_i^{(t)}$.
We use LeakyReLU as the non-linear activation to compute the attention coefficient, and normalize the attention coefficients using the softmax function in Eq. (8) (the equation for $\alpha$ above).
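Putting the feature extractor together, a minimal numpy sketch (random toy parameters and a hand-picked adjacency; my own paraphrase of the equations above, not the official code):
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

np.random.seed(1)
N, w, d = 3, 5, 10
X = np.random.randn(N, w)                    # x_i^(t): each sensor's window of w past values
V = np.random.randn(N, d)                    # v_i: sensor embeddings
W = np.random.randn(d, w)                    # shared weight matrix W (learned in practice)
a_vec = np.random.randn(2 * d)               # attention vector a (learned in practice)
A = np.array([[0, 1, 1],                     # A[j, i] = 1 if sensor j is a neighbor of i
              [1, 0, 1],
              [1, 1, 0]])

Z = np.zeros((N, d))
for i in range(N):
    neigh = [i] + [j for j in range(N) if A[j, i] > 0]          # {i} ∪ N(i)
    g = {j: np.concatenate([V[j], W @ X[j]]) for j in neigh}    # g_j^(t) = v_j ⊕ W x_j^(t)
    pi = np.array([leaky_relu(a_vec @ np.concatenate([g[i], g[j]])) for j in neigh])
    alpha = np.exp(pi) / np.exp(pi).sum()                       # softmax over {i} ∪ N(i)
    Z[i] = np.maximum(0, sum(al * (W @ X[j]) for al, j in zip(alpha, neigh)))   # z_i^(t)
Z.shape                                      # (N, d): one d-dimensional representation per sensor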
Output Layer
From the above feature extractor, we obtain representations for all $N$ nodes, namely $\{ z^{(t)}_{1},\dots , z^{(t)}_{N} \}$. For each $z^{(t)}_{i}$, we element-wise multiply (denoted ◦) it with the corresponding time series embedding $v_i$, and use the results across all nodes as the input of stacked fully-connected layers with output dimensionality N, to predict the vector of sensor values at time step $t$, i.e. $s^{(t)}$:
$\hat{s}^{(t)} = f_{\theta} ([v_1 \circ z_{1}^{(t)} , \dots , v_N \circ z_{N}^{(t)}])$
Here $v_i$ and $z_{i}^{(t)}$ are both $d \times 1$ vectors.
- In particular, $z_{i}^{(t)}$ summarizes information from the neighboring sensors.
The model’s predicted output is denoted as $\hat{s}^{(t)}$. We use the Mean Squared Error between the predicted output $\hat{s}^{(t)}$ and the observed data, $s^{(t)}$, as the loss function for minimization
$L_{MSE} = \cfrac{1}{T_{train}-w} \displaystyle\sum^{T_{train}}_{t=w+1} ||\hat{s}^{(t)} - s^{(t)}||^{2}_{2}$
i.e., the mean of the squared prediction errors over all predicted time ticks.
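Continuing the sketch above, the output layer and one term of $L_{MSE}$ (here $f_{\theta}$ is stood in for by a single fully connected layer, purely for illustration):
# Element-wise multiply each z_i^(t) with its embedding v_i, flatten, and map to N outputs.
H = (V * Z).reshape(-1)                      # [v_1 ∘ z_1^(t), ..., v_N ∘ z_N^(t)], length N*d

W_out = 0.01 * np.random.randn(N, N * d)     # toy stand-in for the stacked FC layers f_theta
b_out = np.zeros(N)
s_hat = W_out @ H + b_out                    # predicted sensor vector \hat{s}^(t)

s_obs = np.random.randn(N)                   # observed s^(t) (toy values)
sq_err = np.sum((s_hat - s_obs) ** 2)        # ||\hat{s}^(t) - s^(t)||_2^2; L_MSE averages this over t
sq_err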
The anomalousness score compares the expected behavior at time t to the observed behavior, computing an error value $Err$ at time $t$ and sensor $i$:
$Err_i (t) = |s_{i}^{(t)} - \hat{s}_{i}^{(t)}|$
To prevent the deviations arising from any one sensor from being overly dominant over the other sensors, we perform a robust normalization of the error values of each sensor:
$a_i (t) = \cfrac{Err_i (t) - \tilde{\mu}_i}{\tilde{\sigma}_i}$
Robust standardization: $\tilde{\mu}_i$ and $\tilde{\sigma}_i$ are the median and inter-quartile range (IQR) of $Err_i(t)$ over time ticks.
to compute the overall anomalousness at time tick $t$, we aggregate over sensors using the max function (we use max as it is plausible for anomalies to affect only a small subset of sensors, or even a single sensor) :
$A(t) = \max\limits_i a_i(t)$
e.g., at each time tick $t$, take the largest $a_i(t)$ across all sensors, such as $a_{서울}(t)$ when the Seoul sensor deviates the most.
# Toy a_i(t) curves: each colored line is one time tick t (red = 1, ..., blue = 5), plotted across sensor indices 1-11.
plt.plot([1,2,3,4,5,6,7,8,9,10,11],[2,2,2,2,2,4,2,2,2,2,2],color='red')
plt.plot([1,2,3,4,5,6,7,8,9,10,11],[1.5,1.5,1.5,1.5,1.5,2.5,1.5,1.5,1.5,1.5,1.5],color='orange')
plt.plot([1,2,3,4,5,6,7,8,9,10,11],[1,1,1,1,1,2,1,1,1,1,1],color='yellow')
plt.plot([1,2,3,4,5,6,7,8,9,10,11],[0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5],color='green')
plt.plot([1,2,3,4,5,6,7,8,9,10,11],[3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5],color='blue')
If we take the red, orange, yellow, green, and blue curves to be $t = 1, 2, 3, 4, 5$ respectively, the red curve contains the largest anomaly score, so $A(1)$ is its maximum, $a_{서울}(1) = 4$.
Finally, a time tick t is labelled as an anomaly if $A_s(t)$ exceeds a fixed threshold. While different approaches could be employed to set the threshold such as extreme value theory (Siffer et al. 2017), to avoid introducing additional hyperparameters, we use in our experiments a simple approach of setting the threshold as the max of $A_s(t)$ over the validation data.
This maximum of $A_s(t)$ over the validation data is used as the threshold; any time tick whose score exceeds it is flagged as an anomaly.
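A minimal end-to-end sketch of the scoring steps above (toy predictions and an assumed fixed threshold; in the paper the threshold is the max of $A_s(t)$ over validation data):
import numpy as np

np.random.seed(2)
N, T = 3, 200
S_obs  = np.random.randn(N, T)                     # observed s_i(t) (toy)
S_pred = S_obs + 0.1 * np.random.randn(N, T)       # predicted \hat{s}_i(t) (toy)
S_pred[0, 150] += 5.0                              # inject one large deviation at t = 150

Err = np.abs(S_obs - S_pred)                                        # Err_i(t)
med = np.median(Err, axis=1, keepdims=True)                         # robust center (median)
iqr = (np.percentile(Err, 75, axis=1, keepdims=True)
       - np.percentile(Err, 25, axis=1, keepdims=True))             # robust scale (IQR)
a_scores = (Err - med) / iqr                                        # normalized a_i(t)

A_t = a_scores.max(axis=0)                         # A(t): max over sensors at each time tick
threshold = 10.0                                   # assumed; normally max of A_s(t) on validation data
np.where(A_t > threshold)[0]                       # flags t = 150 (0-indexed)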
Evaluation metrics
- RQ1 (Accuracy): Does our method outperform baseline methods in accuracy of anomaly detection in multivariate time series, based on ground truth labelled anomalies?
- Anomaly detection accuracy in terms of precision(%), recall(%), and F1-score
- RQ2 (Ablation): How do the various components of the method contribute to its performance?
- RQ3 (Interpretability of Model): How can we understand our model based on its embeddings and its learned graph structure?
- RQ4 (Localizing Anomalies): Can our method localize anomalies and help users to identify the affected sensors, as well as to understand how the anomaly deviates from the expected behavior?
Datasets
The Secure Water Treatment (SWaT)
Water Distribution (WADI)
Because the raw data are very large, downsampling is applied by taking the median over every 10 seconds; after this, both datasets have 2,160 samples each.
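A hedged sketch of that 10-second median downsampling with pandas (toy data and column names; not the preprocessing script from the repo):
import numpy as np
import pandas as pd

# Toy raw readings: 2 sensors sampled once per second for an hour.
idx = pd.date_range('2021-01-01', periods=3600, freq='1s')
raw = pd.DataFrame({'sensor_1': np.random.randn(3600),
                    'sensor_2': np.random.randn(3600)}, index=idx)

down = raw.resample('10s').median()   # median of every 10-second bucket
down.shape                            # (360, 2)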
# The GDN repo ships the MSL dataset under data/msl; load it directly from GitHub.
S_train = pd.read_csv('https://raw.githubusercontent.com/d-ailin/GDN/main/data/msl/train.csv')
S_train
S_test = pd.read_csv('https://raw.githubusercontent.com/d-ailin/GDN/main/data/msl/test.csv')
S_test
listset = pd.read_csv('https://raw.githubusercontent.com/d-ailin/GDN/main/data/msl/list.txt', header=None)   # feature (sensor) names
listset