Mischa Dombrowski
Advisors:
Philipp Schlieper (M.Sc.), An Nguyen (M.Sc.), Prof. Dr. Björn Eskofier
Duration:
05/2021 – 11/2021
Abstract:
Ever since the Transformer architecture was introduced by Vaswani et al. [1], it has taken over the natural language processing community. Many state-of-the-art solutions such as BERT [2] or GPT [3] use a Transformer-like architecture as the backbone of their deep learning model. Given its success in capturing and learning temporal dependencies for these kinds of problems, the natural next step was to examine how the model performs on time series problems. This led to a first set of publications on the performance of the Transformer applied to forecasting problems [4, 5], each of which reports promising results compared to its deep and shallow learning benchmarks. However, most of these papers provide little to no insight into the characteristics of the data; instead, they simply apply the novel architecture to the problem and analyze the performance. From these results alone, it is therefore hard to assess whether the Transformer architecture will be useful for a new dataset, and valuable information could be gathered by filling this gap.
The goal of this thesis is to provide a data-focused analysis of the applicability of the Transformer architecture, with a focus on univariate time series forecasting. This will be done by empirically analyzing which characteristics of the data lead to the Transformer performing better or worse than alternatives such as long short-term memory (LSTM) networks [8] or convolutional neural networks (CNNs). First, an exhaustive literature search will be conducted over publications that compare at least two of the aforementioned models, such as [6, 7, 10], to see whether intuition can be built about which type of data is best suited for each of the three models. Then a synthetic dataset will be created, similar to other publications that use synthetic data [4, 9]. The difference is that the focus will be on the ability to adjust relevant characteristics of a typical real-world dataset, such as the length of the signal, short- or long-term dependencies, the dynamics of the time series, and the sequence length.
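To make the planned data generation more concrete, the following minimal sketch (plain NumPy) illustrates how such characteristics could be exposed as tunable parameters; all function names, default values, and the simple trend-plus-seasonality-plus-noise form are illustrative assumptions, not the generator that will actually be used in the thesis.

import numpy as np

def make_synthetic_series(n_steps=1000, period=50, trend_slope=0.001, noise_std=0.1, seed=0):
    """Generate one univariate series whose key characteristics can be tuned.

    n_steps     -- total length of the signal
    period      -- length of the seasonal cycle (proxy for short- vs. long-term dependencies)
    trend_slope -- slow drift controlling the overall dynamics
    noise_std   -- standard deviation of the observation noise
    (All parameters and defaults are hypothetical.)
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps)
    seasonal = np.sin(2 * np.pi * t / period)          # periodic component
    trend = trend_slope * t                            # deterministic trend
    noise = rng.normal(0.0, noise_std, size=n_steps)   # i.i.d. Gaussian noise
    return seasonal + trend + noise

def to_windows(series, input_len=96, horizon=24):
    """Slice a series into (input, target) pairs for supervised forecasting."""
    X, y = [], []
    for start in range(len(series) - input_len - horizon + 1):
        X.append(series[start : start + input_len])
        y.append(series[start + input_len : start + input_len + horizon])
    return np.array(X), np.array(y)

series = make_synthetic_series()
X, y = to_windows(series)
print(X.shape, y.shape)   # (881, 96) (881, 24) for the defaults above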
Finally, the three models will be compared with respect to these characteristics, and it will be evaluated whether one of them can be considered superior when these characteristics are known. Additionally, the advantages and disadvantages of each model will be discussed in light of the experimental results, including aspects such as computational complexity.
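As a rough illustration of how the three model families could be instantiated on an equal footing and compared by a simple proxy for capacity (parameter count), the sketch below uses PyTorch; the chosen layer sizes, pooling strategies, and hyperparameters are assumptions for illustration only and do not reflect the architectures that will be evaluated in the thesis.

import torch.nn as nn

INPUT_LEN, HORIZON, D_MODEL = 96, 24, 64   # hypothetical window and model sizes

class LSTMForecaster(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(input_size=1, hidden_size=D_MODEL, batch_first=True)
        self.head = nn.Linear(D_MODEL, HORIZON)

    def forward(self, x):                              # x: (batch, INPUT_LEN, 1)
        out, _ = self.rnn(x)
        return self.head(out[:, -1])                   # forecast from the last hidden state

class CNNForecaster(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, D_MODEL, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(D_MODEL, HORIZON)

    def forward(self, x):                              # x: (batch, INPUT_LEN, 1)
        z = self.conv(x.transpose(1, 2)).squeeze(-1)   # convolve over time, pool to one vector
        return self.head(z)

class TransformerForecaster(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(1, D_MODEL)             # note: positional encoding omitted for brevity
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, HORIZON)

    def forward(self, x):                              # x: (batch, INPUT_LEN, 1)
        z = self.encoder(self.embed(x))                # self-attention over all time steps
        return self.head(z.mean(dim=1))                # mean-pool over time, then project

for model in (LSTMForecaster(), CNNForecaster(), TransformerForecaster()):
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{model.__class__.__name__}: {n_params} trainable parameters")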
References:
[1] Vaswani, Ashish et al.: Attention is all you need. 31st Conference on Neural Information Processing Systems, 2017.
[2] Devlin, Jacob et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Google AI Language, 2019.
[3] Brown, Tom B. et al.: Language Models are Few-Shot Learners. OpenAI, 2020.
[4] Li, Shiyang et al.: Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting. 33rd Conference on Neural Information Processing Systems, 2019.
[5] Wu, Neo et al.: Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case. Google LLC, 2020.
[6] Agarwal, Bushagara et al.: Deep Learning based Time Series Forecasting. 19th IEEE International Conference on Machine Learning and Applications, 2020.
[7] Koprinska, Irena et al.: Convolutional Neural Networks for Energy Time Series Forecasting. Proceedings of the International Joint Conference on Neural Networks, 2018.
[8] Hochreiter, Sepp et al.: Long Short-Term Memory. Neural Computation, 1997.
[9] Borovykh, Anastasia et al.: Conditional Time Series Forecasting with Convolutional Neural Networks. Journal of Computational Finance, Forthcoming, 2018.
[10] El Idrissi, Touria et al.: Deep Learning for Blood Glucose Prediction: CNN vs LSTM. International Conference on Computational Science and Its Applications, 2020.