This tutorial is about GODE: Graph fourier transform based Outlier Detection using Emprical Bayesian thresholding paper.
Install
!pip install git+https://github.com/seoyeonc/GODE_pkg.git
Collecting git+https://github.com/seoyeonc/GODE_pkg.git
Cloning https://github.com/seoyeonc/GODE_pkg.git to /tmp/pip-req-build-yj3lha0v
Running command git clone --filter=blob:none --quiet https://github.com/seoyeonc/GODE_pkg.git /tmp/pip-req-build-yj3lha0v
Resolved https://github.com/seoyeonc/GODE_pkg.git to commit 31de35a92b36c51f792474ca0e6d96840364818f
Preparing metadata (setup.py) ... done
Requirement already satisfied: torch==2.0.1 in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from GODE-pkg==0.0.1) (2.0.1)
Requirement already satisfied: scipy==1.10.1 in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from GODE-pkg==0.0.1) (1.10.1)
Requirement already satisfied: statsmodels==0.14.0 in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from GODE-pkg==0.0.1) (0.14.0)
Requirement already satisfied: numpy<1.27.0,>=1.19.5 in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from scipy==1.10.1->GODE-pkg==0.0.1) (1.23.5)
Requirement already satisfied: packaging>=21.3 in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from statsmodels==0.14.0->GODE-pkg==0.0.1) (23.0)
Requirement already satisfied: patsy>=0.5.2 in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from statsmodels==0.14.0->GODE-pkg==0.0.1) (0.5.3)
Requirement already satisfied: pandas>=1.0 in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from statsmodels==0.14.0->GODE-pkg==0.0.1) (2.0.1)
Requirement already satisfied: filelock in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from torch==2.0.1->GODE-pkg==0.0.1) (3.9.0)
Requirement already satisfied: typing-extensions in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from torch==2.0.1->GODE-pkg==0.0.1) (4.5.0)
Requirement already satisfied: sympy in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from torch==2.0.1->GODE-pkg==0.0.1) (1.11.1)
Requirement already satisfied: networkx in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from torch==2.0.1->GODE-pkg==0.0.1) (2.8.4)
Requirement already satisfied: jinja2 in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from torch==2.0.1->GODE-pkg==0.0.1) (3.1.2)
Requirement already satisfied: tzdata>=2022.1 in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from pandas>=1.0->statsmodels==0.14.0->GODE-pkg==0.0.1) (2023.3)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from pandas>=1.0->statsmodels==0.14.0->GODE-pkg==0.0.1) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from pandas>=1.0->statsmodels==0.14.0->GODE-pkg==0.0.1) (2023.3)
Requirement already satisfied: six in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from patsy>=0.5.2->statsmodels==0.14.0->GODE-pkg==0.0.1) (1.16.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from jinja2->torch==2.0.1->GODE-pkg==0.0.1) (2.1.1)
Requirement already satisfied: mpmath>=0.19 in /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages (from sympy->torch==2.0.1->GODE-pkg==0.0.1) (1.2.1)
Building wheels for collected packages: GODE-pkg
Building wheel for GODE-pkg (setup.py) ... done
Created wheel for GODE-pkg: filename=GODE_pkg-0.0.1-py3-none-any.whl size=5549 sha256=9620982596f7b580e2b9433f9b31f17d4eb473218cf0e7dc87e9590d1669d5b9
Stored in directory: /tmp/pip-ephem-wheel-cache-pidojor0/wheels/31/6c/30/59343bcf10b1fadb0fee243ad2b5ea8efc398ff37f2fd2b4b9
Successfully built GODE-pkg
Installing collected packages: GODE-pkg
Successfully installed GODE-pkg-0.0.1
0. Import
import numpy as np
import pandas as pd
from pygsp import graphs, filters, plotting, utils
import pickle
1. Linear
1.1. Data
Data description
- Graph \({\cal G}=({\cal V},E,{\bf W})\)
- Vertex set \({\cal V}=\{1,2,\dots,n\}=\{ v_1,v_2,\dots,v_N\}\)
- \(N=1000\)
- \(W_{ij} = \begin{cases} 1, & \text{ if } j-i = 1 \\ 0, & \text{ otherwise}\end{cases}\)
- Graph signal \(y:{\cal V} \to \mathbb{R}\)
\[y_i=\frac{10}{N}v_i+\eta_i+\epsilon_i,\]
- \(\eta_i = \begin{cases} U(5,7) & \text{ with probability 0.025}\\ U(-7,-5) & \text{ with probability 0.025} \\ 0 & \text{ otherwise } \end{cases}\)
- \(\epsilon_i \sim N(0,1)\)
Generate Dataset
np.random.seed(6)
n = 1000
eta_sparsity = 0.05
epsilon = np.around(np.random.normal(size=n),15)
signal = np.random.choice(np.concatenate((np.random.uniform(-7, -5, round(n*eta_sparsity/2)).round(15), np.random.uniform(5, 7, round(n*eta_sparsity/2)).round(15), np.repeat(0, n - round(n*eta_sparsity)))), n)
eta = signal + epsilon
outlier_true_linear= signal.copy()
outlier_true_linear = list(map(lambda x: 1 if x!=0 else 0,outlier_true_linear))
x_1 = np.linspace(0,2,n)
y1_1 = 5 * x_1
y_1 = y1_1 + eta # eta = signal + epsilon
_df=pd.DataFrame({'x':x_1, 'y':y_1})
index_of_trueoutlier_bool = signal!=0
1.2. GODE
Lin = GODE_ebayes.Linear(_df)
outlier_old_linear, outlier_linear, outlier_index_linear = GODE_ebayes.GODE_Anomalous(Lin(_df),contamination=0.05)
1.3. Plot
GODE_ebayes.Linear_plot(Lin(_df),index_of_trueoutlier_bool, outlier_index_linear)
how to access data
array([0. , 0.002002 , 0.004004 , 0.00600601, 0.00800801,
0.01001001, 0.01201201, 0.01401401, 0.01601602, 0.01801802])
array([-0.31178367, -6.08358567, 0.23784081, -0.86906177, -2.44674061,
0.96330157, 1.18712379, -1.44402316, 1.71937116, -0.33980351])
array([0.26224717, 0.37091714, 0.37104804, 0.3712662 , 0.3715716 ,
0.37196424, 0.37244408, 0.37301111, 0.3736653 , 0.37440661])
1.4. Confusion matrix
Conf_linear = GODE_ebayes.Conf_matrx(outlier_true_linear,outlier_linear)
Accuracy: 0.999
Precision: 1.000
Recall: 0.980
F1 Score: 0.990
{'Accuracy': 0.999,
'Precision': 1.0,
'Recall': 0.9803921568627451,
'F1 Score': 0.99009900990099}
2. Orbit
2.1. Data
Data description
- Graph \({\cal G}=({\cal V},E,{\bf W})\)
- \(N=1000\)
- \({\boldsymbol v}_i=(r_i \cos(\theta_i), r_i \sin(\theta_i))\)
- \(r_i= 5 + \cos(\frac{12\pi (i - 1)}{n - 1})\)
- \(\theta_i= -\pi + \frac{{\pi(N-2)(i - 1)}}{N(N - 1)}\)
- graph signal \(y:{\cal V} \to \mathbb{R}\)
\[y_i = 10 \cdot \sin\left(\frac{{6\pi \cdot (i - 1)}}{{N - 1}}\right)+\epsilon_i+\eta_i\]
- \(\eta_i = \begin{cases} U(1,4) & \text{ with probability 0.025}\\ U(-4,-1) & \text{ with probability 0.025} \\ 0 & \text{ otherwise } \end{cases}\)
- \(\epsilon_i \sim N(0,1)\)
- \({\bf W} = \begin{cases} \exp\left(-\frac{|{\boldsymbol v}_i -{\boldsymbol v}_j|^2}{2\theta^2}\right), & \text{ if } |{\boldsymbol v}_i - {\boldsymbol v}_j| \le \kappa \\ 0, & \text{ otherwise}\end{cases}\)
- \(\theta=6.45\) and \(\kappa=1.21\)
Generate Dataset
n = 1000
eta_sparsity = 0.05
random_seed=77
np.random.seed(777)
epsilon = np.around(np.random.normal(size=n),15)
signal = np.random.choice(np.concatenate((np.random.uniform(-4, -1, round(n * eta_sparsity / 2)).round(15), np.random.uniform(1, 4, round(n * eta_sparsity / 2)).round(15), np.repeat(0, n - round(n * eta_sparsity)))), n)
eta = signal + epsilon
pi=np.pi
ang=np.linspace(-pi,pi-2*pi/n,n)
r=5+np.cos(np.linspace(0,12*pi,n))
vx=r*np.cos(ang)
vy=r*np.sin(ang)
f1=10*np.sin(np.linspace(0,6*pi,n))
f = f1 + eta
_df = pd.DataFrame({'x' : vx, 'y' : vy, 'f' : f,'f1':f1})
outlier_true_orbit = signal.copy()
outlier_true_orbit = list(map(lambda x: 1 if x!=0 else 0,outlier_true_orbit))
index_of_trueoutlier_bool = signal!=0
2.2. GODE
Or = GODE_ebayes.Orbit(_df)
Or.fit(sd=15,method = 'Euclidean',kappa=1.21)
100%|██████████| 1000/1000 [00:01<00:00, 697.19it/s]
outlier_old_orbit, outlier_orbit, outlier_index_orbit = GODE_ebayes.GODE_Anomalous(Or(_df),contamination=0.05)
2.3. Plot
GODE_ebayes.Orbit_plot(Or(_df),index_of_trueoutlier_bool,outlier_index_orbit)
how to access data
array([-6. , -5.99916963, -5.9966797 , -5.99253378, -5.98673778])
array([-7.34788079e-16, -3.76943905e-02, -7.53604665e-02, -1.12969981e-01,
-1.50494820e-01])
array([0. , 0.18867305, 0.37727893, 0.56575049, 0.75402065])
array([-0.46820879, -0.63415181, 0.31189882, -0.14761143, 1.66037153])
array([-0.12358846, 0.28544397, 0.69162952, 0.75432307, 0.83443358])
2.4. Confusion matrix
Conf_orbit = GODE_ebayes.Conf_matrx(outlier_true_orbit,outlier_orbit)
Accuracy: 0.957
Precision: 0.560
Recall: 0.571
F1 Score: 0.566
{'Accuracy': 0.957,
'Precision': 0.56,
'Recall': 0.5714285714285714,
'F1 Score': 0.5656565656565656}
3. Bunny
3.1. Data
Data description
- Stanford bunny data
- \(n=2503\)
Generate Dataset
eta_sparsity = 0.05
random_seed=77
n = 2503
with open("Bunny.pkl", "rb") as file:
loaded_obj = pickle.load(file)
_df = pd.DataFrame({'x':loaded_obj['x'],'y':loaded_obj['y'],'z':loaded_obj['z'],'f':loaded_obj['f']+loaded_obj['noise'],'f1':loaded_obj['f'],'noise':loaded_obj['noise']})
outlier_true_bunny = loaded_obj['unif'].copy()
outlier_true_bunny = list(map(lambda x: 1 if x !=0 else 0,outlier_true_bunny))
index_of_trueoutlier_bool_bunny = loaded_obj['unif']!=0
_W = loaded_obj['W'].copy()
3.2. GODE
bu = GODE_ebayes.BUNNY(_df,_W)
outlier_old_bunny, outlier_bunny, outlier_index_bunny = GODE_ebayes.GODE_Anomalous(bu(_df),contamination=eta_sparsity)
3.3. Plot
GODE_ebayes.Bunny_plot(bu(_df),index_of_trueoutlier_bool_bunny,outlier_index_bunny)
how to access data
{'x': array([ 0.26815193, -0.58456893, -0.02730755, ..., 0.15397547,
-0.45056488, -0.29405249]),
'y': array([ 0.39314334, 0.63468595, 0.33280949, ..., 0.80205526,
0.6207154 , -0.40187451]),
'z': array([-0.13834514, -0.22438843, 0.08658215, ..., 0.33698514,
0.58353051, -0.08647485]),
'f1': array([-1.54422488, -0.03596483, -0.93972715, ..., -0.01924028,
-0.02470869, -0.26266752]),
'f': array([-1.63569131, 0.49423926, -1.04026277, ..., -1.0694093 ,
-0.24395499, 0.41729667]),
'fhat': array([-1.66060553, 0.09516849, -1.11731239, ..., 0.23695946,
0.16766298, -0.11652469])}
3.4. Confusion matrix
Conf_bunny = GODE_ebayes.Conf_matrx(outlier_true_bunny,outlier_bunny)
Accuracy: 0.988
Precision: 0.864
Recall: 0.900
F1 Score: 0.882
{'Accuracy': 0.9884139033160207,
'Precision': 0.864,
'Recall': 0.9,
'F1 Score': 0.8816326530612244}
4. Earthquake
4.1. Data
Data description
USGS data from \(2010\) to \(2014\)
Graph \({\cal G}=({\cal V},E,{\bf W})\)
Vertex set \({\cal V}= \{v_1, v_2, \ldots, v_n\}\)
\(v:= \tt{( Latitude}\), \(\tt{Longitude)}\)
\(N=10762\)
\({\bf W_{ij}} = \begin{cases} \exp(-\frac{\rho(i,j)}{2 \theta^2}), & \text{ if } |\rho(i,j)| \le \kappa \\ 0, & \text{ otherwise}\end{cases}\)
\(\rho(i,j)=hs({\boldsymbol v}_i, {\boldsymbol v}_j)\)
\(\theta = 8810.87\) and \(\kappa = 2500\)
graph signal \(y:{\cal V} \to \mathbb{R}\)
_df = pd.read_csv('./earthquake_tutorial.csv')
_df = _df.assign(Year=list(map(lambda x: x.split('-')[0], _df.time))).rename(columns={'latitude' : 'x', 'longitude' : 'y', 'mag': 'f'}).iloc[:,1:]
_df.Year = _df.Year.astype(np.float64)
4.2. GODE
Er = GODE_ebayes.Earthquake(_df.query("2010 <= Year < 2011"))
Er.fit(sd=20, method = 'Haversine',kappa=2500)
100%|██████████| 4790/4790 [00:30<00:00, 157.15it/s]
outlier_old_earthquake, outlier_earthquake, outlier_index_earthquake = GODE_ebayes.GODE_Anomalous(Er(_df.query("2010 <= Year < 2011")),contamination=0.05)
4.3. Plot
GODE_ebayes.Earthquake_plot(Er(_df.query("2010 <= Year < 2011")),outlier_index_earthquake,lat_center=37.7749, lon_center=-122.4194,fThresh=7,adjzoom=5,adjmarkersize = 40)
how to access data
Er(_df.query("2010 <= Year < 2011"))
{'x': array([ 0.663, -19.209, -31.83 , ..., 40.726, 30.646, 26.29 ]),
'y': array([ -26.045, 167.902, -178.135, ..., 51.925, 83.791, 99.866]),
'f': array([5.5, 5.1, 5. , ..., 5. , 5.2, 5. ]),
'fhat': array([5.55678619, 5.05719136, 5.02115785, ..., 5.49869582, 5.45532648,
5.42616814])}