Abstract
Introduction
In recent years, the field of spatio-temporal datasets has emerged, enabling the simultaneous consideration of both the time and space dimensions. Examples include consumer demand data, wind turbine records, neuroscience data such as brain networks, traffic data such as taxi GPS traces, and more.
- Spatio-temporal datasets have emerged.
Classic time-series statistical methods for analyzing such data already exist, but they are limited by certain conditions, such as assumptions about the data. In particular, these classic methods struggle to account for spatio-temporal correlations and are not designed to work with spatio-temporal data.
- But such data are hard to analyze with traditional time-series methods.
Consequently, analyzing spatio-temporal data with appropriate geometric deep learning frameworks lets us use more of the available information, which improves accuracy.
- So we should use GDL.
- Broadly, there are two approaches: traditional time-series methods and GDL.
However, dealing with spatio-temporal datasets often presents a common challenge: the frequent occurrence of irregularly observed data. For instance, as highlighted in prior work, traffic sensor data commonly suffers from missing observations due to electronic unit failures, which can significantly degrade prediction accuracy.
- But spatio-temporal methods do not work when observations are missing.
The difficulty in handling irregular data is that many traditional data analysis procedures were designed for datasets with complete observations.
- Reason 1: most GDL methods assume the data are fully observed.
This is why it is essential to transform incomplete data into complete data before conducting any learning or analysis.
- So we need to construct complete data.
Therefore, the purpose of this paper is to find values that closely approximate the complete data when we are given irregularly observed data.
- The goal of this paper is to find values close to the complete data.
Table 1: Definitions
Notations | Definitions or Description |
---|---|
\({\cal G}\) | Input Graph |
\(T\) | the length of the time interval |
\(\cal{V}\) | a set of vertices |
\(V\) | the set of node indices |
\(N\) | the number of vertices |
\(\cal{E}\) | a set of undirected edges |
\(\textbf{y}\) | a graph signal, a function defined on the vertices of the graph \(\cal{G}\) |
\(\textbf{W}\) | a weighted adjacency matrix |
Graphs are well known as generic forms of data representation for describing the geometric structure of a data domain, so we interpret the data on the graph domain. Table 1 lists the notation we use.
- According to Shuman et al. (2013), graphs are well known as a generic form of data representation for describing the geometric structure of a data domain, so we interpret the data on a graph.
To understand our method simply, consider the following example with a sample dataset:
- Graph \({\cal G} = (\cup_{t \in {\cal T}} {\cal V}_t, \cup_{t \in {\cal T}} {\cal E}_t, \textbf{W})\), \({\cal T} := \{1, 2, \dots, 10000\}\)
- \(\cal{V}\) is a set of vertices: \({\cal V} = \{ v_0, v_1 \}\), \(|{\cal V}| = 2\), \(V = \{ 0, 1\}\)
- \(\cal{E}\) is a set of undirected edges: \({\cal E} = \{ (0, 1)\}\)
- a graph signal \(\textbf{y}_t: {\cal V} \to \mathbb{R}\) for each \(t \in {\cal T}\)
- node 0 \(y = \cos(2t) + \cos(4t) + \cos(8t) + \epsilon\)
- node 1 \(y = \cos(2t) + \cos(4t) + \epsilon\)
- \({\cal Y} = \{ y_{v,t}: t \in {\cal T}, v \in V\}\)
- Define \({\cal Y} = ({\cal O} , {\cal M})\)
- The missing data \({\cal M} = \{ y_{v,t_v}: t_v \in {\cal M}_v, v \in V \}\)
- missing rate on node 0: 60% missing at random
- missing rate on node 1: 60% missing at random, plus block missing from \(t = 1600\) to \(t = 4000\)
- The observed data \({\cal O} = \{ y_{v,t_v}: t_v \in {\cal O}_v, v \in V \}\)
To learn from this example, we used GConvGRU. After updating the graph signal set \(\cal{Y}\) for 200 epochs, we achieved meaningful outcomes (animation link).
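As a minimal sketch (assuming \(\epsilon\) is standard Gaussian noise, which the text leaves unspecified), the toy dataset and its missing pattern could be generated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
t = np.arange(T)

# Graph signals on the two nodes (epsilon assumed standard Gaussian).
y0 = np.cos(2 * t) + np.cos(4 * t) + np.cos(8 * t) + rng.normal(size=T)
y1 = np.cos(2 * t) + np.cos(4 * t) + rng.normal(size=T)
Y = np.stack([y0, y1], axis=1)          # shape (T, |V|)

# Missing masks: True marks a missing observation (the set M).
mask = np.zeros((T, 2), dtype=bool)
mask[rng.random(T) < 0.6, 0] = True     # node 0: 60% missing at random
mask[rng.random(T) < 0.6, 1] = True     # node 1: 60% missing at random
mask[1600:4000, 1] = True               # node 1: block missing, t = 1600..4000

Y_obs = np.where(mask, np.nan, Y)       # NaN marks M; the rest is the set O
```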
In this paper, we expect to contribute as follows:
- We can appropriately predict when we are given a spatio-temporal dataset with a high missing rate.
- We can cope appropriately whenever we encounter spatio-temporal data with a high missing rate.
Methodology
Self-Consistent Estimator
To deal with incomplete data, we introduce the concept of "self-consistency", which was proposed by \citet{efron1967two}, who originally used it to address censored data. Later, \citet{hastie1989principal} defined principal curves as smooth curves that maintain self-consistency across a distribution or dataset. In particular, the term "self-consistency" was applied to regression function estimators by \citet{flury1996self, lee2007self}. Following \citet{lee2007self}, we can estimate $\hat{f}_{obs}$, the underlying function of the observed data, with this self-consistent equation:
$E(\hat{f}_{com} |x_{obs}, f = \hat{f}_{obs}) = \hat{f}_{obs} \cdots (1)$
$\hat{f}_{com}$ is an estimate of $f$ via $x_{com}$, which is assumed to be complete data consisting of $\{ x_{obs}, x_{mis}\}$, where $x_{obs}$ is the observed data and $x_{mis}$ is the missing part, which is not available. Additionally, \citet{lee2007self} obtained the "optimal" estimate $\hat{f}_{com}$ for their regression function $f$, facilitating the derivation of their "best" incomplete-data estimator $\hat{f}_{obs}$ using the corresponding complete-data procedure. Since this equation does not depend on the estimation procedure, it is potentially applicable to our dataset when missing values are present. Under this condition, we extend the self-consistent equation as follows:
$E ( \hat{s}_{t,com} | \{ f \{ n \} : n \in O \}, \{ s_t \} = \{ \hat{s}_{t, obs} \})= \hat{s}_{t, obs}, t = 1, 2, \dots T \cdots (2)$
Within this context, $\hat{s}_{t,com}$ represents an estimate of $s_t$ derived from complete data, while $\hat{s}_{t,obs}$ is an estimate derived from observed data. We utilize $\{ \tilde{f} (n): n \in M\}$ in place of the missing data $\{ f(n): n \in M\}$, which are not available. After calculating an estimated complete dataset $\{ \hat{f} (n)\} = \{ f(n) : n \in O \} \cup \{ \tilde{f}(n) : n \in M\}$, it can be explained by the corresponding decomposition:
$\hat{f}(n) = \sum^T_{t=1} \hat{s}_t(n) \cdots (3)$
The simplest way to obtain $\hat{f}_{obs}$ is to update $\{ f^{(i)}(n): n \in M\}$ and decompose $\hat{f}^{(i)}(n) = \sum^T_{t=1} \hat{s}^{(i)}_t(n)$ iteratively. Furthermore, $f^{(i)}(n)$ satisfies the condition of self-consistency if $f^{(i)}(n) = f^{(i+1)}(n)$, as verified by \citet{cox1984analysis, flury1996self}.
Note that we define $i$ as the epoch and use a linear interpolation method for the missing part of the data in our experiments.
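A minimal sketch of this iteration follows; `smooth` is a hypothetical stand-in for the decomposition-and-smoothing step (the GFT + EbayesThresh update described in the next subsection), and the missing part is initialized by linear interpolation as in our experiments:

```python
import numpy as np

def linear_interpolate(y, missing):
    """Initialize missing entries of a 1-D series by linear interpolation."""
    x = np.arange(len(y))
    out = y.copy()
    out[missing] = np.interp(x[missing], x[~missing], y[~missing])
    return out

def self_consistent_update(y_obs, missing, smooth, max_iter=200, tol=1e-6):
    """Iterate f^(i) -> f^(i+1) until self-consistency: f^(i) == f^(i+1).

    `smooth` maps a complete series to its decomposition sum_t s_t(n);
    only the missing part M is updated at each epoch i.
    """
    f = linear_interpolate(y_obs, missing)
    for _ in range(max_iter):
        f_hat = smooth(f)                 # hat f^(i)(n) = sum_t hat s_t^(i)(n)
        f_new = f.copy()
        f_new[missing] = f_hat[missing]   # update only n in M
        if np.max(np.abs(f_new - f)) < tol:
            break                         # self-consistency reached
        f = f_new
    return f
```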
\subsection{Ebayesthresh}
Let $F(\omega)$ denote the spectral representation of $x_t$ and $F_s(\omega)$ its deterministic component. If $F(\omega)$ can be properly estimated and $F_s(\omega)$ can be properly extracted from $F(\omega)$, the deterministic term (underlying function) of $x_t$ can be obtained. We employ empirical Bayes thresholding because we estimate $F(\omega)$ by a periodogram of $x_t$ and extract $F_s(\omega)$ from $F(\omega)$ \cite{johnstone2005ebayesthresh}. We assume throughout that the observations satisfy
$$X_i \sim N(\mu_i,1)$$
Within a Bayesian context, the notion of sparsity is naturally modeled by a suitable prior distribution for the parameters $\mu_i$. We model the $\mu_i$ as having independent prior distributions each given by the mixture
$$f_{\tt{prior}}(\mu)=(1-w)\delta_0(\mu)+w\gamma(\mu).$$
Here the function \(\gamma\) is usually chosen as a Laplace density with scale parameter \(a > 0\), \(\gamma(u)=\frac{a}{2}e^{-a|u|}\). The empirical Bayes approach estimates each \(\mu_i\) by its posterior median.
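To make the estimator concrete, here is a numerical sketch of the posterior median under this spike-and-Laplace prior. It integrates the posterior on a grid rather than using the closed-form expressions of the EbayesThresh package, and the values of $w$ and $a$ are illustrative assumptions (EbayesThresh estimates $w$ from the data):

```python
import numpy as np
from scipy.stats import norm

def posterior_median(x, w=0.3, a=0.5):
    """Posterior median of mu for X ~ N(mu, 1) under the mixture prior
    (1 - w) * delta_0 + w * Laplace(a), computed by grid integration."""
    grid = np.linspace(-20, 20, 40001)             # includes mu = 0 exactly
    gamma = 0.5 * a * np.exp(-a * np.abs(grid))    # Laplace prior density
    spike = (1 - w) * norm.pdf(x)                  # posterior mass at mu = 0
    slab = w * gamma * norm.pdf(x - grid)          # unnormalized slab density
    dz = grid[1] - grid[0]
    cdf = np.cumsum(slab) * dz                     # integral of the slab part
    cdf[grid >= 0] += spike                        # add the atom at zero
    cdf /= spike + slab.sum() * dz                 # normalize
    return grid[np.searchsorted(cdf, 0.5)]

print(posterior_median(0.5))   # small |x| -> shrunk exactly to 0
print(posterior_median(4.0))   # large |x| -> close to x
```

For small observations the atom at zero dominates the posterior, so the median is exactly zero; this is what makes the rule a thresholding rule.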
EbayesThresh on the Graph Domain
First, we give information about \({\cal G} = \{ V, E, {\bf f} \}\), where \(|V|\) is the number of nodes and \(E\) denotes the edges of the graph, which encode the connections between nodes. We consider an adjacency matrix \({\bf W} \in \mathbb{R}^{T \times T}\) by assuming that the nodes are connected sequentially, as in a time-series domain, and compute the graph Laplacian. We then obtain \({\bf D}\), the diagonal degree matrix, and normalize the Laplacian. The normalized graph Laplacian can be decomposed as \({\bf \tilde L}={\bf V}{\bf \Lambda}{\bf V}^\top\). Now we take the graph Fourier transform of \({\bf f}\) by calculating \({\bf V}^\top{\bf f}\), which allows us to obtain its periodogram. We want to obtain \(f_{\tt trimed}\), so we use empirical Bayes thresholding to estimate \(f_{\tt threshed}\), known as the step function, from the \(f\) values. We can then impute missing values at their indices to calculate \(f_{\tt trimed}\).
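The following sketch mirrors this pipeline on a chain (time-series) graph, assuming the signal has already been made complete (e.g., by the interpolation step above); `threshold_fn` is a hypothetical stand-in for the empirical Bayes rule applied to the periodogram:

```python
import numpy as np

def graph_fourier_threshold(f, threshold_fn):
    """Sketch of the spectral smoothing step: build a chain graph over the
    T time points, take the graph Fourier transform of f, threshold its
    periodogram, and transform back."""
    T = len(f)
    W = np.zeros((T, T))
    idx = np.arange(T - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0        # chain adjacency
    deg = W.sum(axis=1)
    D = np.diag(deg)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L_norm = d_inv_sqrt @ (D - W) @ d_inv_sqrt     # normalized Laplacian
    lam, V = np.linalg.eigh(L_norm)                # tilde L = V Lambda V^T
    f_hat = V.T @ f                                # graph Fourier transform
    power = f_hat ** 2                             # periodogram
    keep = threshold_fn(power)                     # mask of retained frequencies
    return V @ (f_hat * keep)                      # thresholded reconstruction
```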
Overview of Method
\begin{algorithm}
\caption{IT-STGNN algorithm}
\begin{algorithmic}[1]
\INPUT Graph ${\cal G}$ with partially observed signal $\{ f(n): n \in O \}$
\ENSURE Estimated complete signal $\{ \hat{f}(n) \}$
\STATE Initialize the missing part $\{ \tilde{f}^{(0)}(n): n \in M \}$ by linear interpolation
\WHILE{self-consistency is not reached, i.e., $f^{(i)}(n) \neq f^{(i+1)}(n)$}
\STATE Form the complete data $\{ \hat{f}^{(i)}(n) \} = \{ f(n): n \in O \} \cup \{ \tilde{f}^{(i)}(n): n \in M \}$
\STATE Apply the graph Fourier transform and EbayesThresh to decompose $\hat{f}^{(i)}(n) = \sum^T_{t=1} \hat{s}^{(i)}_t(n)$
\STATE Update the missing entries $\{ \tilde{f}^{(i+1)}(n): n \in M \}$ from the thresholded signal
\ENDWHILE
\end{algorithmic}
\end{algorithm}
When the graph signal \({\bf f}\) is given on \({\cal G} = (V, E)\) for spatio-temporal data, the detailed steps of our proposed method are summarized in the algorithm above.
Experiments
Conditions
In this section, we evaluate the performance of our proposed method across different architectures by changing the missing proportion of several datasets. Real-world datasets often include a substantial number of missing values, and as the rate of missing values rises, it becomes progressively harder for models to capture the trends in the data. Therefore, we conduct these experiments with gradually higher rates of missing values. Furthermore, we introduce incomplete data by selecting missing entries randomly and in blocks to simulate real-world scenarios. The experiments were conducted in three parts:
\begin{itemize}
\item{Baseline}: We conducted experiments with the original, fully observed data (Table \ref{tab:datainfo}).
\item{Randomly Missing}: Missing values were selected completely at random, at varying proportions, at every node.
\item{Block Missing}: We assumed that some nodes experience missing values during a specific interval.
\end{itemize}
Models Description
We included nine recurrent graph convolutional temporal graph neural network methods \citep{rozemberczki2021pytorch}, which incorporate deep learning and parametric learning approaches for processing spatio-temporal signals. We also used GNAR (Generalised Network AutoRegressive), proposed by \citet{knight2019generalised}. GNAR is a network-based extension of the AR (autoregressive) model that combines a parametric learning structure with the underlying graph, and it predicts new values from past ones \citep{knight2019generalised}. That is why the forecast values of GNAR converge to zero. As a result, we employ ten different methods in this paper and present an explanation of each model in Table \ref{tab:modelexp} in the \textbf{Appendix}. We utilize the Mean Squared Error (MSE) as the evaluation metric to assess the forecasting accuracy on the datasets. The results are presented as Mean $\pm$ Standard Deviation (SD).
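For reference, here is a minimal sketch of one of these recurrent architectures (GConvGRU with a linear readout) using the PyTorch Geometric Temporal API; the filter size and Chebyshev order $K$ are illustrative choices, not the exact settings of our experiments:

```python
import torch
import torch.nn.functional as F
from torch_geometric_temporal.nn.recurrent import GConvGRU

class RecurrentGCN(torch.nn.Module):
    """A GConvGRU cell followed by a linear layer that maps the hidden
    node states to one-step-ahead node-level forecasts."""
    def __init__(self, node_features, filters=32):
        super().__init__()
        self.recurrent = GConvGRU(node_features, filters, K=2)
        self.linear = torch.nn.Linear(filters, 1)

    def forward(self, x, edge_index, edge_weight):
        h = F.relu(self.recurrent(x, edge_index, edge_weight))
        return self.linear(h)
```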
Datasets
We carried out our experiments using several datasets, which are available in PyTorch Geometric Temporal \citep{rozemberczki2021pytorch} as Static Graph Temporal Signals (Table \ref{tab:datainfo}). Each dataset was split into training and testing subsets, allocating 80\% for training and 20\% for testing.
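A minimal sketch of this data setup using the PyTorch Geometric Temporal loaders, shown here for the $\tt{Chickenpox}$ dataset:

```python
from torch_geometric_temporal.dataset import ChickenpoxDatasetLoader
from torch_geometric_temporal.signal import temporal_signal_split

# Load a Static Graph Temporal Signal and apply the 80/20 split used above.
dataset = ChickenpoxDatasetLoader().get_dataset()
train_dataset, test_dataset = temporal_signal_split(dataset, train_ratio=0.8)
```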
Randomly Missing values
\begin{ex}
Figure \ref{fig:exfig1} illustrates the outcomes for the $\tt{Chickenpox}$ dataset, organized by level of missing data, with GConvLSTM employed. As the missing-data rate increases, both the Classic (STGNN) and Proposed (IT-STGNN) models exhibit a rising mean squared error (MSE). Particularly noteworthy is the comparison between the two trendlines: as the rate of missing values increases, the MSE of the Proposed method rises more gradually, whereas the Classic model displays a more rapid increase. This suggests that as the ratio of missing values grows, our proposed method predicts relatively better than the Classic models.
\end{ex}
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{figure/ex1.png}
\caption{MSE of the Classic (STGNN) and Proposed (IT-STGNN) models on the $\tt{Chickenpox}$ dataset with GConvLSTM, under varying rates of randomly missing values.}
\label{fig:exfig1}
\end{figure}
\begin{ex}
Figure \ref{fig:exfig2} gives an overview of the experiment outcomes across five datasets and nine architectures, showcasing the impact of varying rates of randomly distributed missing values. It is clear that most results show an incremental rise in MSE as the rate of randomly generated missing values increases. Specifically, the MSE of the Classic (STGNN) methods rose sharply compared to the Proposed methods (IT-STGNN).
\end{ex}
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{figure/ex2.png}
\caption{MSE across five datasets and nine architectures under varying rates of randomly missing values.}
\label{fig:exfig2}
\end{figure}
\textbf{MSE ranking}
\begin{ex}
MSE rankings were established independently of the missing-value rates, with higher MSE values positioned towards the right (Figure \ref{fig:exfig3}). Observations across datasets consistently highlighted the proposed methods' lower MSE compared to the classic methods. Additionally, on the Windmillsmall dataset, all proposed methods reliably showed lower MSE values than the classic methods.
\end{ex}
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{figure/ex3.png}
\caption{MSE rankings of the Proposed and Classic methods across datasets, with higher MSE values positioned towards the right.}
\label{fig:exfig3}
\end{figure}
\textbf{Ratio of datasets' description}
\begin{ex}
Figure \ref{fig:exfig4} illustrates the ratio ($\frac{T}{V}$) of each dataset's information based on Table \ref{tab:datainfo} ($y$-axis) against the MSE difference between the Proposed and Classic methods ($x$-axis). According to Figure \ref{fig:exfig4}, as the proportion ($\frac{T}{V}$) grows, the MSE difference also goes up. This suggests that our proposed methods would outperform on datasets that have many time points ($T$).
\end{ex}
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{figure/ex4.png}
\caption{Dataset ratio $\frac{T}{V}$ ($y$-axis) versus the MSE difference between the Proposed and Classic methods ($x$-axis).}
\label{fig:exfig4}
\end{figure}
\subsection{Block missing values}
\begin{ex}
We highlight the results when datasets have block-missing values (Figure \ref{fig:exfig5}). There was only a slight MSE difference between the Proposed and Classic methods. Even though none of the results showed dramatic differences, the MSE of our proposed methods was generally lower than that of the Classic ones. It would also be valuable to experiment with higher block-missing rates in future research.
\end{ex}
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{figure/ex5.png}
\caption{MSE of the Proposed and Classic methods under block-missing values.}
\label{fig:exfig5}
\end{figure}
Conclusion
This paper aims at finding an effective way to predict real spatio-temporal datasets. First, we introduced IT-STGNN, a novel method that achieves self-consistency by iteratively updating estimates with EbayesThresh in order to analyze spatio-temporal datasets with missing values. Briefly, our proposed method works better than classic methods when the spatio-temporal dataset is assumed to be incomplete. Overall, most experiments indicate that our method outperformed the other methods, regardless of whether the data were missing randomly or in blocks.
% Additionally, it remains for a future study to conceive joint time and graph stationary when we calculate the GSO(Graph Shift operator), especially in large datasets. It is expected a better performance of our method shift when simultaneously considering spatio temporal aspects since we consider only time information when calculating GSO in this paper. In this case, we tested on $\tt{Pedalme}$ and $\tt{Wikimath}$ datasets by getting GSO with nodes and time-series of graph signals. All settings are the same as in each dataset's experiments without calculating GSO, and the results are in Table \ref{tb:gsoone} on \textbf{Appendix} and Table \ref{tb:gsotwo} on \textbf{Appendix}. We calculate the weight matrix when we get GSO on $\tt{Padalme}$ dataset in Table \ref{tb:gsotwo} on \textbf{Appendix}. The performances of methods of incomplete data with random missing values achieve lower MSE(Mean Squared Error) compared to other methods. Specifically, GConv GRU, LRGCN, EvolveGCNH, and DCRNN show lower MSE values.
% The output with block missing data also records lower MSE than other methods. In this case, the methods that exhibit lower MSE are GLSTM, EvolveGCNO, EvolveGCNH, and DCRNN. Since the $\tt{Padalme}$ dataset covers a short time range (less than 40 periods), it would perform well if it had more time points. We consider GSO considering the spatio aspect on $\tt{Wikimath}$ dataset by supposing having missing data at the same time points. We conducted different approaches for each dataset because the $\tt{Wikimath}$ dataset has a large number of time points, requiring a more powerful CPU with ample memory. When comparing the performance using Table \ref{tb:gsotwo} on \textbf{Appendix}, it shows that considering joint time and graph stationary works better, except for EvolveGCNH.
Lastly, we only used static graph temporal signals, so we expect to explore dynamic graph temporal signals in future research.
In conclusion, IT-STGNN can successfully predict spatial and temporal features and places no limitations on extending the forecasting method.
Appendix
Results
Conditions we set
Time versus Joint Time and Graph Stationarity
Pedalme
Wikimath
\begin{table}[H]
\centering
\small
\begin{threeparttable}[H]
\label{wikimath_GFT_table}
\begin{tabular}{lcccc}
\toprule
& \multicolumn{2}{c}{ \textbf{Random (80\%)} } & \multicolumn{2}{c}{ \textbf{The same missing (51.2\%)} } \\
& Classic & Proposed & Classic & Proposed\tnote{1} \\\midrule
GConvGRU & 0.932$\pm$0.043 & 0.687$\pm$0.021 & 0.726$\pm$0.015 & \textbf{0.533$\pm$0.003} \\
GConvLSTM & 1.423$\pm$0.121 & 0.920$\pm$0.069 & 0.963$\pm$0.098 & \textbf{0.653$\pm$0.033} \\
GCLSTM & 1.407$\pm$0.117 & 0.815$\pm$0.058 & 0.824$\pm$0.052 & \textbf{0.622$\pm$0.011} \\
LRGCN & 1.105$\pm$0.099 & 0.769$\pm$0.045 & 0.810$\pm$0.064 & \textbf{0.624$\pm$0.019} \\
DyGrEncoder & 0.770$\pm$0.045 & 0.606$\pm$0.017 & 0.626$\pm$0.027 & \textbf{0.561$\pm$0.031} \\
EvolveGCNO & 0.915$\pm$0.063 & 0.877$\pm$0.045 & 0.753$\pm$0.026 & \textbf{0.745$\pm$0.017} \\
EvolveGCNH & 0.863$\pm$0.038 & 0.780$\pm$0.027 & 0.818$\pm$0.031 & 0.794$\pm$0.031 \\
TGCN & 0.827$\pm$0.030 & 0.771$\pm$0.020 & 0.782$\pm$0.030 & \textbf{0.750$\pm$0.039} \\
DCRNN & 0.846$\pm$0.031 & 0.672$\pm$0.007 & 0.665$\pm$0.015 & \textbf{0.592$\pm$0.005} \\
\bottomrule
\end{tabular}
\begin{tablenotes}
\item[1] Joint Time and Graph Stationarity, which is considered to be jointly stationary in both the vertex and the time domain
\end{tablenotes}
\caption{
The performance on the $\tt{Wikimath}$ dataset with the Graph Shift Operator (GSO), compared under two situations: one that considers only time stationarity and another that considers joint time and graph stationarity, assuming missing data at the same time points for each node.
}
\end{threeparttable}
\label{tb:gsotwo}
\end{table}