
Anomalies

In data analysis, anomaly detection (also referred to as outlier detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the data.

-- Wikipedia

re_data supports three anomaly detection methods for monitoring data. No single method fits every case: the right choice usually depends on the nature of your dataset, so it is worth exploring the available methods. A good rule of thumb when choosing a method is:

  • When the data is normally distributed and the sample size is substantial, use Z Score.
  • When the sample size is small or the data is not normally distributed, use Modified Z Score.
  • Boxplot characterizes variation through quartiles. Use it when you need a measure of variation that is robust and does not emphasize extreme high or low values.

Z Score

  • In a statistical distribution, the Z Score tells you how far a given data point is from the rest of the data. Technically, it measures how many standard deviations a given observation lies from the mean.
  • Z Score = $\dfrac{x - mean}{Standard Deviation}$
  • Configuring Z Score:
# globally (dbt_project.yml)
vars:
  re_data:anomaly_detector:
    name: z_score
    threshold: 3

# configuring z score as anomaly detector per model
{{
    config(
        re_data_monitored=true,
        re_data_anomaly_detector={'name': 'z_score', 'threshold': 3}
    )
}}
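
To make the formula concrete, the Z Score check can be sketched in plain Python. This is only an illustration of the arithmetic, not re_data's implementation: the function names `z_scores` and `z_score_anomalies` are hypothetical, the population standard deviation is an assumed choice, and flagging points whose absolute score exceeds the threshold is an assumption about how the threshold is applied.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mean) / standard deviation for every value in the sample."""
    mu = mean(values)
    # Assumption: population standard deviation. A sample standard
    # deviation (statistics.stdev) would give slightly smaller scores.
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so no point deviates from the mean.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def z_score_anomalies(values, threshold=3):
    """Return the values whose absolute Z Score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]
```

Raising `threshold` flags fewer points, while lowering it makes the check more sensitive.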

Modified Z Score (Default)

  • When the Z Score is not an ideal technique, for example when the data is not normally distributed or the sample size is small, it can be modified to reduce its sensitivity to extreme values. That sensitivity comes from the mean, which is affected by extreme values; the Modified Z Score uses the median instead.

  • Modified Z Score = $\dfrac{0.6745 * (x - median)}{MedianAD}$

  • If MedianAD = 0, then the Modified Z Score is defined as $\dfrac{x - median}{1.253314 * MeanAD}$

  • Configuring Modified Z Score:

# globally (dbt_project.yml)
vars:
  re_data:anomaly_detector:
    name: modified_z_score
    threshold: 3

# configuring modified z score as anomaly detector per model
{{
    config(
        re_data_monitored=true,
        re_data_anomaly_detector={'name': 'modified_z_score', 'threshold': 3}
    )
}}
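
The two formulas above, including the fallback for MedianAD = 0, can be sketched in Python as follows. This is a hedged illustration rather than re_data's code: the function name `modified_z_scores` is hypothetical, MedianAD is read as the median absolute deviation and MeanAD as the mean absolute deviation, and measuring both around the median is an assumption.

```python
from statistics import mean, median


def modified_z_scores(values):
    """Return the Modified Z Score of every value in the sample."""
    med = median(values)
    # Assumption: absolute deviations are measured from the median.
    abs_dev = [abs(x - med) for x in values]
    median_ad = median(abs_dev)
    if median_ad != 0:
        return [0.6745 * (x - med) / median_ad for x in values]
    # Fallback when MedianAD is 0: scale by the mean absolute deviation.
    mean_ad = mean(abs_dev)
    if mean_ad == 0:
        # Every value equals the median, so nothing deviates.
        return [0.0 for _ in values]
    return [(x - med) / (1.253314 * mean_ad) for x in values]
```

The fallback matters for metrics that rarely change: when more than half of the values are identical, MedianAD is 0 and the first formula would divide by zero.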

Boxplot

  • A boxplot is a method for graphically showing the locality, spread, and skewness of numerical data through its quartiles.
  • As an anomaly detector, boxplot does not draw any graph. Instead, an upper bound and a lower bound are calculated, and any point below the lower bound or above the upper bound is considered an outlier.
  • $lower\_bound = first\_quartile - (whisker\_boundary\_multiplier * inter\_quartile\_range)$
  • $upper\_bound = third\_quartile + (whisker\_boundary\_multiplier * inter\_quartile\_range)$
  • The whisker boundary multiplier is commonly set to 1.5, but it is configurable:
# globally (dbt_project.yml)
vars:
  re_data:anomaly_detector:
    name: boxplot
    whisker_boundary_multiplier: 1.5

# configuring boxplot as anomaly detector per model
{{
    config(
        re_data_monitored=true,
        re_data_anomaly_detector={'name': 'boxplot', 'whisker_boundary_multiplier': 1.5}
    )
}}
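
The bound calculation above can be sketched in Python like this. It is a minimal illustration under stated assumptions, not re_data's implementation: the function names `boxplot_bounds` and `boxplot_anomalies` are hypothetical, and the quartiles use the "inclusive" interpolation from Python's `statistics.quantiles`, which may differ from the quartile function of your data warehouse.

```python
from statistics import quantiles


def boxplot_bounds(values, whisker_boundary_multiplier=1.5):
    """Return (lower_bound, upper_bound) computed from the sample quartiles."""
    # Assumption: "inclusive" quartile interpolation; other methods
    # produce slightly different quartiles on small samples.
    first_quartile, _, third_quartile = quantiles(values, n=4, method="inclusive")
    inter_quartile_range = third_quartile - first_quartile
    lower_bound = first_quartile - whisker_boundary_multiplier * inter_quartile_range
    upper_bound = third_quartile + whisker_boundary_multiplier * inter_quartile_range
    return lower_bound, upper_bound


def boxplot_anomalies(values, whisker_boundary_multiplier=1.5):
    """Return the values that fall outside the boxplot bounds."""
    lower_bound, upper_bound = boxplot_bounds(values, whisker_boundary_multiplier)
    return [x for x in values if x < lower_bound or x > upper_bound]
```

A larger `whisker_boundary_multiplier` widens the bounds and flags fewer points; a smaller one narrows them.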