Design and implementation of a database anomaly monitoring system based on AI algorithms

Published by the Meituan Technical Team on 2022-11-24
The Meituan database platform R&D team faces an increasingly urgent need for database anomaly detection. In order to discover, localize, and mitigate anomalies more quickly and intelligently, we have developed a database anomaly detection service based on AI algorithms. This article introduces some of our practices and thinking across the dimensions of feature analysis, algorithm selection, model training, and real-time detection, hoping to offer some inspiration or help to colleagues engaged in related work.

1. Background

Databases are widely used in Meituan's core business scenarios, which demand high stability and have very low tolerance for anomalies. Rapid anomaly detection, localization, and mitigation for databases therefore become more and more important. For anomaly monitoring, the traditional fixed-threshold alarm approach relies on expert experience for rule configuration and cannot flexibly or dynamically adjust thresholds for different business scenarios, so a small problem can easily grow into a major fault.

AI-based database anomaly detection can inspect key metrics around the clock based on the historical behavior of the database. It can detect risks while anomalies are still in their infancy, expose them earlier, and assist R&D personnel in localizing and mitigating problems before they worsen. Based on these factors, the Meituan database platform R&D team decided to develop a database anomaly detection service. Next, this article explains some of our thinking and practice across several dimensions: feature analysis, algorithm selection, model training, and real-time detection.

2. Feature Analysis

2.1 Finding the patterns in data changes

Before development and coding begin, there is a very important task: identifying, from existing historical monitoring metrics, the patterns in which time series data change, so that an appropriate algorithm can be selected according to the characteristics of the data distribution. The following are some representative metric distribution charts selected from historical data:

Figure 1: Database metric patterns

From the figure above, we can see that the data mainly exhibit three patterns: periodicity, drift, and stationarity [1]. We can therefore model samples with these common characteristics in the early stage, which covers most scenarios. Next, we analyze the data from the three perspectives of periodicity, drift, and stationarity, and discuss the algorithm design process.

2.1.1 Periodic changes

  1. Extract the trend component and isolate the residual sequence. Use a moving average to extract the long-term trend term, and subtract it from the original sequence to obtain the residual sequence. (Periodicity analysis is unrelated to the trend; if the trend component is not separated, the autocorrelation is significantly affected and the period is hard to identify.)
  2. Compute the circular autocorrelation sequence of the residuals. Cyclically shift the residual sequence and take the vector dot product with the original residuals to obtain the autocorrelation sequence. (Circular autocorrelation avoids the decay at large lags.)
  3. Determine the period T from the peak coordinates of the autocorrelation sequence. Extract the series of local peaks of the autocorrelation sequence and take the spacing of their abscissas as the period. (If the autocorrelation value at the period point is below a given threshold, the series is considered to have no significant periodicity.)

The specific process is as follows (a code sketch follows the figure):

Figure 2: Period extraction process
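As a reference, here is a minimal NumPy sketch of the three-step flow above; the window size, the peak rule, and the correlation threshold `min_corr` are illustrative assumptions rather than the production settings:

```python
import numpy as np

def detect_period(series, window=60, min_corr=0.5):
    x = np.asarray(series, dtype=float)
    # 1. Extract the long-term trend with a moving average and subtract it
    #    to obtain the residual sequence (periodicity analysis needs the
    #    trend removed, or the autocorrelation is distorted).
    trend = np.convolve(x, np.ones(window) / window, mode="same")
    resid = x - trend
    resid -= resid.mean()
    # 2. Circular autocorrelation: dot products of the residual with its
    #    cyclically shifted copies, computed efficiently via the FFT.
    spec = np.fft.rfft(resid)
    acf = np.fft.irfft(spec * np.conj(spec), n=len(resid))
    acf /= acf[0]  # normalize so acf[0] == 1
    # 3. Local peaks of the autocorrelation; the peak spacing gives the
    #    period (here approximated by the first peak lag).
    half = len(acf) // 2
    peaks = [i for i in range(1, half)
             if acf[i] > acf[i - 1] and acf[i] > acf[i + 1]]
    if not peaks or acf[peaks[0]] < min_corr:
        return None  # no significant periodicity
    return peaks[0]
```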

2.1.2 Drift change

A sequence to be modeled is usually required to have no obvious long-term trend and no global drift; otherwise the generated model usually cannot adapt well to the latest behavior of the metric [2]. We collectively refer to cases where the mean of the time series changes significantly over time, or where a global change point exists, as drift scenarios. In order to accurately capture the latest trend of the time series, we need to judge whether drift exists in the historical data in the early stage of modeling. Examples of global drift and of a mean shift in a periodic series are shown below:

Figure 3: Data drift examples

Database metrics are affected by complex factors such as business activity, and much of the data changes aperiodically; modeling needs to tolerate these changes. Therefore, unlike the classical change-point detection problem, in the anomaly detection scenario we only need to detect the case where the data were historically stable and then drifted. Weighing algorithm complexity against practical performance, we adopted a drift detection method based on median filtering. The main process includes the following steps:

  1. Median smoothing

    a. According to the size of the given window, extract the median within the window to obtain the trend component of the time series.

    b. The window must be large enough to avoid the influence of periodic factors, and the filter delay must be corrected.

    c. The median, rather than the mean, is used for smoothing to avoid the influence of abnormal samples.

  2. Determine whether the smoothed sequence is monotonic

    a. For the median-smoothed series, if every point is greater than (less than) the previous point, the series is increasing (decreasing).

    b. If the sequence is strictly increasing or strictly decreasing, the metric clearly has a long-term trend, and detection can terminate early.

  3. Traverse the smoothed sequence and apply the following two rules to determine whether drift exists (see the sketch after this list)

    a. If the maximum of the sequence to the left of the current sample point is less than the minimum of the sequence to its right, a sudden increase drift (upward shift) exists.

    b. If the minimum of the sequence to the left of the current sample point is greater than the maximum of the sequence to its right, a sudden drop drift (downward shift) exists.
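The following is a minimal NumPy sketch of the three steps above; the window size and the return conventions are illustrative assumptions (in production the window must exceed the period length, as noted in step 1):

```python
import numpy as np

def detect_drift(series, window=25):
    x = np.asarray(series, dtype=float)
    # 1. Median smoothing: a centered sliding-window median extracts the
    #    trend, stays robust to abnormal samples, and avoids filter delay.
    half = window // 2
    smooth = np.array([np.median(x[max(0, i - half): i + half + 1])
                       for i in range(len(x))])
    # 2. A strictly increasing/decreasing smoothed series means a clear
    #    long-term trend, so detection can terminate early.
    diffs = np.diff(smooth)
    if np.all(diffs > 0) or np.all(diffs < 0):
        return "trend"
    # 3. Level-shift rules, using prefix/suffix extrema so each split
    #    point is checked in O(1).
    prefix_max = np.maximum.accumulate(smooth)
    prefix_min = np.minimum.accumulate(smooth)
    suffix_max = np.maximum.accumulate(smooth[::-1])[::-1]
    suffix_min = np.minimum.accumulate(smooth[::-1])[::-1]
    for i in range(1, len(smooth)):
        if prefix_max[i - 1] < suffix_min[i]:
            return "up"    # sudden increase drift (upward shift)
        if prefix_min[i - 1] > suffix_max[i]:
            return "down"  # sudden drop drift (downward shift)
    return None
```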

2.1.3 Stationary change

For a time series metric, if its statistical properties do not change with observation time, we consider the series stationary. Time series with a long-term trend component or a periodic component are therefore not stationary. Specific examples are shown in the following figure:

Figure 4: Stationary data examples

Given this, we can use the unit root test (Augmented Dickey-Fuller test) [3] to judge whether a given time series is stationary. Specifically, for the historical data of a metric over a given time range, we consider the time series stationary when both of the following conditions hold (a minimal check is sketched after the list):

  1. The p-value obtained by the adfuller test for the time series data of the last 1 day is less than 0.05.
  2. The p-value obtained by the adfuller test for the time series data of the last 7 days is less than 0.05.
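A minimal version of this two-window check, using the `adfuller` function from statsmodels (the sampling granularity `points_per_day` is an assumption):

```python
from statsmodels.tsa.stattools import adfuller

def is_stationary(series, points_per_day=1440, alpha=0.05):
    # ADF null hypothesis: the series has a unit root (non-stationary).
    # The series is treated as stationary only if both the last 1 day
    # and the last 7 days reject the null at the given level.
    p_day = adfuller(series[-points_per_day:])[1]       # p-value, last day
    p_week = adfuller(series[-7 * points_per_day:])[1]  # p-value, last week
    return p_day < alpha and p_week < alpha
```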

3. Algorithm selection

3.1 Distribution law and algorithm selection

Based on the product introductions on time series anomaly detection published by well-known companies in the industry, our accumulated experience, and sampled analysis of actual online metrics, we find that the probability density functions of the metrics conform to the following distributions:

Figure 5: Skewed distribution examples

For the above distributions, we investigated some common algorithms and selected the boxplot, the median absolute deviation, and extreme value theory as the final anomaly detection algorithms. The following table compares common detection algorithms for time series data:

Algorithm                        | Applicability (symmetric distribution) | Applicability (skewed distribution) | Normality requirement | Anomaly tolerance
3Sigma                           | High                                    | Low                                 | High                  | Low
Median Absolute Deviation (MAD)  | High                                    | Low                                 | High                  | High
Boxplot                          | High                                    | Medium                              | Medium                | High
Extreme Value Theory (EVT)       | Medium                                  | High                                | Low                   | Low

The main reason we did not choose 3Sigma is its low tolerance for anomalies, while the median absolute deviation theoretically tolerates anomalies better; so when the data present a highly symmetric distribution, we use the median absolute deviation (MAD) instead of 3Sigma for detection. We use different detection algorithms for different data distributions (for the principles of each algorithm, see the appendix at the end of the article; a dispatch sketch follows the list):

  1. Low-skew, highly symmetric distribution: median absolute deviation (MAD)
  2. Moderately skewed distribution: boxplot
  3. Highly skewed distribution: extreme value theory (EVT)
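A minimal dispatch sketch of this selection rule, using `scipy.stats.skew`; the concrete skewness thresholds are assumptions, since the article does not publish the production values:

```python
from scipy.stats import skew

# Illustrative skewness thresholds (assumed, not the production values).
LOW_SKEW, HIGH_SKEW = 0.5, 1.0

def choose_detector(samples):
    s = abs(skew(samples))
    if s < LOW_SKEW:
        return "MAD"      # low skew, high symmetry
    if s < HIGH_SKEW:
        return "boxplot"  # moderately skewed
    return "EVT"          # highly skewed
```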

With the above analysis, we can lay out the specific process for producing a model from the samples:

Figure 6: Algorithm modeling process

The overall modeling process of the algorithm is shown in the figure above. It mainly covers the following branches: time series drift detection, time series stationarity analysis, time series periodicity analysis, and skewness calculation, each analyzed in the sections above. For a periodic series, the samples used to model a given time index are drawn from a window around that index on each historical day.

For example, for a given time series from 2022/03/01 00:00:00 to 2022/03/08 00:00:00, with a window size of 5 and a period of one day, the sample points used to model time index 30 (i.e., 00:30:00 at one-minute granularity) come from the following time windows (reproduced by the sketch after the list):

[03/01 00:25:00, 03/01 00:35:00]
[03/02 00:25:00, 03/02 00:35:00]
...
[03/07 00:25:00, 03/07 00:35:00]
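A small sketch that reproduces these windows, assuming one-minute granularity as in the example (the helper `sample_windows` is hypothetical, introduced only for illustration):

```python
from datetime import datetime, timedelta

def sample_windows(start, days, index, window_minutes=5):
    # One window per historical day, centered on the same time index.
    for d in range(days):
        center = start + timedelta(days=d, minutes=index)
        yield (center - timedelta(minutes=window_minutes),
               center + timedelta(minutes=window_minutes))

# Time index 30 of a series starting 2022/03/01, with 7 days of history:
for lo, hi in sample_windows(datetime(2022, 3, 1), 7, 30):
    print(lo, hi)  # 2022-03-01 00:25:00  2022-03-01 00:35:00, ...
```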

Skewness calculation. The time series samples are converted into a probability distribution (histogram), and the skewness of the distribution is calculated. If the absolute value of the skewness exceeds the threshold, the detection threshold is modeled with extreme value theory; if it is below the threshold, the threshold is produced by the boxplot or the median absolute deviation.

3.2 Case sample modeling

A case is selected here to show the data analysis and modeling process, to give a clearer picture of the above flow. In the figure below, panel (a) is the original series; panel (b) is the series folded by day span; panel (c) zooms in on the trend of the samples within one time-index interval of panel (b); and the black curve in panel (d) is the lower threshold corresponding to the time indexes in panel (c).

Figure 7: Modeling case

The figure below shows the sample distribution histogram and the thresholds for region (c) above (with some abnormal samples removed). We can see that under this highly skewed distribution, the threshold calculated by the EVT algorithm is more reasonable.

Figure 8: Threshold comparison under a skewed distribution

4. Model training and real-time detection

4.1 Data flow process

In order to detect large-scale second-level data in real time, we designed the following technical solution based on Flink real-time stream processing (a framework-agnostic sketch of the detection loop follows the list):

  1. Real-time detection: based on Flink real-time stream processing, consume Mafka (Meituan's internal message queue component) for online detection, store the results in Elasticsearch (hereinafter ES), and generate anomaly records.
  2. Offline training: use Squirrel (Meituan's internal KV database) as the task queue, read training data from MOD (Meituan's internal operations data warehouse) and parameters from the configuration table, train the models, and save them to ES; training can be triggered automatically or manually, and models are loaded and updated by periodically reading the model library.
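As a rough illustration of the online path, here is a framework-agnostic sketch of the detection loop; `stream`, `model_store`, and `result_sink` are hypothetical stand-ins for the Mafka consumer, the ES model library, and the ES result index:

```python
def online_detect(stream, model_store, result_sink):
    for point in stream:                         # consume metric points in real time
        model = model_store.get(point.metric)    # model trained offline, refreshed periodically
        if model is None:
            continue                             # metric not modeled yet
        lower, upper = model.bounds(point.time_index)
        if not lower <= point.value <= upper:
            result_sink.save_anomaly(point)      # persist anomaly record (e.g., to ES)
```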

The specific offline training and online detection design is as follows:

Figure 9: Offline training and online detection technical design

4.2 Anomaly Detection Process

The anomaly detection algorithm adopts a divide-and-conquer idea overall: in the model training stage, features are identified and extracted from historical data and an appropriate detection algorithm is selected. The system is divided into offline training and online detection. Offline, data preprocessing, time series classification, and time series modeling are performed on historical data; online, the offline-trained models are loaded for real-time anomaly detection. The specific design is shown in the following figure:

Figure 10: Anomaly detection process

5. Product Operation

To iterate the algorithm efficiently and keep improving precision and recall through operation, we use the case backtracking capability of Horae (a scalable time series anomaly detection system within Meituan) to form a closed loop of online detection, case preservation, analysis and optimization, result evaluation, and release.

Figure 11: Operation process

Currently, the anomaly detection algorithm indicators are as follows:

  • Precision: among randomly sampled cases flagged as anomalous, the proportion manually verified as truly anomalous is 81%.
  • Recall: based on sources such as faults and alarms, we review the anomalies of each metric of the corresponding instances and compute recall from the monitoring results, which is 82%.
  • F1-score: the harmonic mean of precision and recall, 81%.

6. Future Outlook

At present, Meituan's database anomaly monitoring capability has been largely built, and we will continue to optimize and expand it. Specific directions include:

  1. Anomaly type identification. Detect the type of anomaly (e.g., mean shift, volatility change, spike), support alarm subscription by anomaly type, and feed the type as a feature into downstream diagnosis systems to improve the database autonomy ecosystem [4].
  2. Build a human-in-the-loop environment. Support automatic learning from feedback annotations to ensure continuous model optimization [5].
  3. Support more database scenarios. Platformize the anomaly detection capability to cover more database scenarios, such as DB end-to-end error reporting and node network monitoring.

7. Appendix

7.1 Absolute median difference

The Median Absolute Deviation (MAD) is a robust measure of the dispersion of univariate numerical data [6], usually calculated as:

$$\mathrm{MAD} = C \cdot \operatorname{median}\left(\left|x_i - \operatorname{median}(X)\right|\right)$$

with samples outside $[\operatorname{median}(X) - k \cdot \mathrm{MAD},\ \operatorname{median}(X) + k \cdot \mathrm{MAD}]$ treated as anomalous. When the prior is a normal distribution, C is generally taken as 1.4826 and k as 3. MAD assumes that the middle 50% of the samples are normal and that abnormal samples fall in the two outer 25% tails. When the samples are normally distributed, MAD adapts to outliers in the data set better than the standard deviation: the standard deviation uses the squared distance from each point to the mean, so larger deviations receive larger weight and the influence of outliers on the result cannot be ignored, whereas a small number of outliers does not affect the MAD result. However, MAD places higher requirements on the normality of the samples.
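A minimal sketch of the MAD bounds under the normal prior described above:

```python
import numpy as np

def mad_bounds(samples, c=1.4826, k=3):
    # Bounds: median(X) +/- k * C * median(|x_i - median(X)|).
    x = np.asarray(samples, dtype=float)
    med = np.median(x)
    mad = c * np.median(np.abs(x - med))
    return med - k * mad, med + k * mad
```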

7.2 Box Plot

The box plot mainly describes the degree of dispersion and symmetry of the sample distribution through several statistics, including:

  • Q0: Minimum
  • Q1: Lower quartile
  • Q2: Median
  • Q3: Upper quartile
  • Q4: Maximum

Figure 12: Box plot

The distance between Q1 and Q3 is called the interquartile range (IQR). A sample more than 1.5 × IQR above the upper quartile (or more than 1.5 × IQR below the lower quartile) is regarded as an outlier. Unlike the three-sigma rule, which rests on the normality assumption, the boxplot in general makes no assumption about the underlying data distribution; it describes the dispersion of the samples and has a higher tolerance for the potential abnormal samples they contain. For skewed data, an adjusted boxplot fits the data distribution better [7].
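A minimal sketch of the boxplot fences with the usual 1.5 × IQR rule:

```python
import numpy as np

def boxplot_bounds(samples, k=1.5):
    # Fences: [Q1 - k*IQR, Q3 + k*IQR], with IQR = Q3 - Q1.
    q1, q3 = np.percentile(samples, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
```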

7.3 Extreme Value Theory

Real-world data are difficult to describe with a known distribution; for some extreme events (anomalies), probabilistic models (e.g., the Gaussian distribution) tend to assign a probability of 0. Extreme value theory [8] infers the distribution of the extreme events we may observe, without any distribution assumption on the original data; this is the extreme value distribution (EVD). Its mathematical expression (the complementary cumulative distribution function) is as follows:

$$\bar{F}_t(x) = P(X - t > x \mid X > t) \approx \left(1 + \frac{\gamma x}{\delta}\right)^{-\frac{1}{\gamma}}$$

Here t is the empirical threshold of the samples, which can be set differently for different scenarios, and $\gamma$ and $\delta$ are the shape and scale parameters of the generalized Pareto distribution. When a given sample exceeds the manually set empirical threshold t, the excess $X - t$ follows the generalized Pareto distribution. Through maximum likelihood estimation we obtain the parameter estimates $\hat{\gamma}$ and $\hat{\delta}$, and the model threshold is calculated by the following formula:

$$z_q \simeq t + \frac{\hat{\delta}}{\hat{\gamma}}\left(\left(\frac{qn}{N_t}\right)^{-\hat{\gamma}} - 1\right)$$

In the above formula, q is the risk parameter, n is the total number of samples, and $N_t$ is the number of samples satisfying $X - t > 0$. Since there is usually no prior information for estimating the empirical threshold t, an empirical quantile of the samples can be used as t, with the quantile chosen according to the actual situation.
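A minimal peaks-over-threshold sketch of this calculation, using `scipy.stats.genpareto` for the maximum likelihood fit; the risk parameter `q` and the quantile used for t are illustrative assumptions:

```python
import numpy as np
from scipy.stats import genpareto

def evt_threshold(samples, q=1e-4, t_quantile=0.98):
    x = np.asarray(samples, dtype=float)
    t = np.quantile(x, t_quantile)  # empirical threshold t (a quantile)
    excess = x[x > t] - t           # peaks over the threshold
    # MLE fit of the generalized Pareto distribution; location fixed at 0.
    gamma, _, delta = genpareto.fit(excess, floc=0)
    n, n_t = len(x), len(excess)
    # z_q = t + (delta/gamma) * ((q*n/N_t)**(-gamma) - 1); assumes gamma != 0.
    return t + (delta / gamma) * ((q * n / n_t) ** (-gamma) - 1)
```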

8. References

  • [1] Ren, H., Xu, B., Wang, Y., Yi, C., Huang, C., Kou, X., ... & Zhang, Q. (2019, July). Time-series anomaly detection service at Microsoft. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 3009-3017).
  • [2] Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., & Zhang, G. (2018). Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12), 2346-2363.
  • [3] Mushtaq, R. (2011). Augmented Dickey-Fuller test.
  • [4] Ma, M., Yin, Z., Zhang, S., Wang, S., Zheng, C., Jiang, X., ... & Pei, D. (2020). Diagnosing root causes of intermittent slow queries in cloud databases. Proceedings of the VLDB Endowment, 13(8), 1176-1189.
  • [5] Holzinger, A. (2016). Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Informatics, 3(2), 119-131.
  • [6] Leys, C., Ley, C., Klein, O., Bernard, P., & Licata, L. (2013). Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4), 764-766.
  • [7] Hubert, M., & Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis, 52(12), 5186-5201.
  • [8] Siffer, A., Fouque, P. A., Termier, A., & Largouet, C. (2017, August). Anomaly detection in streams with extreme value theory. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1067-1075).

About the team

Meituan Basic Technology Department/Database R&D Center/Database Platform R&D Team is responsible for providing efficient and convenient database access for each business line of Meituan, helping Meituan DBAs to maintain the database stably and quickly, and providing an analysis and diagnosis platform to achieve database autonomy.


This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication; please credit "Content reproduced from the Meituan technical team." This article may not be reproduced or used commercially without permission. For any commercial use, please email tech@meituan.com to apply for authorization.
