Anomalies

In data analysis, anomaly detection (also referred to as outlier detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the data.

-- Wikipedia

re_data supports three anomaly detection methods for monitoring data. No single method fits every case; the right choice usually depends on the nature of your dataset, so it's worth exploring the methods available. A good rule of thumb when choosing a method is:

  • When the data is normally distributed and the sample size is substantial, use Z Score.
  • When the sample size is small or the data is not normally distributed, use Modified Z Score.
  • Use Boxplot when you need a robust, quartile-based measure of variation that is not dominated by extreme high or low values.

Z Score

  • In a statistical distribution, the Z score tells you how far a given data point is from the rest of the crowd. Technically speaking, it measures how many standard deviations a given observation lies from the mean.
  • Z Score = $\dfrac{x - \text{mean}}{\text{standard deviation}}$
  • Configuring Z Score:
# globally
vars:
  re_data:anomaly_detector:
    name: z_score
    threshold: 3
    direction: both

# configuring z score as anomaly detector per model
{{
  config(
    re_data_monitored=true,
    re_data_anomaly_detector={'name': 'z_score', 'threshold': 3, 'direction': 'both'}
  )
}}
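
To make the Z Score formula above concrete, here is a minimal Python sketch. It is an illustration only, not re_data's implementation: the function names `z_scores` and `z_score_anomalies` are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return the Z score of every value relative to the whole sample."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample (n - 1) standard deviation
    if sigma == 0:
        # All values are identical, so nothing deviates from the mean.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def z_score_anomalies(values, threshold=3):
    """Return the values whose absolute Z score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


if __name__ == "__main__":
    row_counts = [100] * 19 + [500]  # made-up metric values
    print(z_score_anomalies(row_counts))
```

Note that a single extreme value inflates both the mean and the standard deviation, which is the sensitivity that the Modified Z Score is meant to reduce.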

Modified Z Score (Default)

  • The Z Score is not ideal when the data is not normally distributed or the sample size is small, because the mean it relies on is itself pulled by extreme values. The Modified Z Score replaces the mean with the median to reduce this sensitivity.

  • Modified Z Score = $\dfrac{0.6745 * (x - \text{median})}{\text{MedianAD}}$, where MedianAD is the median absolute deviation.

  • If MedianAD = 0, the modified Z score is instead defined as $\dfrac{x - \text{median}}{1.253314 * \text{MeanAD}}$, where MeanAD is the mean absolute deviation.

  • Configuring Modified Z Score:

# globally
vars:
  re_data:anomaly_detector:
    name: modified_z_score
    threshold: 3
    direction: both

# configuring modified z score as anomaly detector per model
{{
  config(
    re_data_monitored=true,
    re_data_anomaly_detector={'name': 'modified_z_score', 'threshold': 3, 'direction': 'both'}
  )
}}
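
The two Modified Z Score formulas above, including the fallback for MedianAD = 0, can be sketched in Python as follows. This is an illustration under stated assumptions rather than re_data's actual code: the function name `modified_z_scores` is hypothetical, and both MedianAD and MeanAD are assumed to be computed from absolute deviations around the median.

```python
from statistics import mean, median


def modified_z_scores(values):
    """Return the modified Z score of every value in the sample."""
    med = median(values)
    abs_dev = [abs(x - med) for x in values]  # deviations around the median

    median_ad = median(abs_dev)
    if median_ad != 0:
        return [0.6745 * (x - med) / median_ad for x in values]

    # Fallback when more than half of the values equal the median.
    mean_ad = mean(abs_dev)  # assumption: MeanAD taken around the median
    if mean_ad == 0:
        # Every value is identical, so nothing deviates at all.
        return [0.0 for _ in values]
    return [(x - med) / (1.253314 * mean_ad) for x in values]


if __name__ == "__main__":
    print(modified_z_scores([10, 12, 11, 13, 12, 11, 50]))  # regular formula
    print(modified_z_scores([100] * 19 + [500]))            # MedianAD = 0 fallback
```

The second call shows why the fallback exists: when most values are identical, MedianAD is zero and the first formula would divide by zero.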

Boxplot

  • A boxplot is a method for graphically showing the locality, spread, and skewness of groups of numerical data through their quartiles.
  • As an anomaly detector, boxplot does not draw any graph; only the lower and upper bounds are calculated. Any point below the lower bound or above the upper bound is considered an outlier.
  • $lower\_bound = first\_quartile - (whisker\_boundary\_multiplier * inter\_quartile\_range)$
  • $upper\_bound = third\_quartile + (whisker\_boundary\_multiplier * inter\_quartile\_range)$
  • The whisker boundary multiplier is commonly set to 1.5, but it is configurable:
# globally
vars:
  re_data:anomaly_detector:
    name: boxplot
    whisker_boundary_multiplier: 1.5
    direction: both

# configuring boxplot as anomaly detector per model
{{
  config(
    re_data_monitored=true,
    re_data_anomaly_detector={'name': 'boxplot', 'whisker_boundary_multiplier': 1.5, 'direction': 'both'}
  )
}}
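
As a rough illustration of the bound formulas above, the following Python sketch computes the boxplot bounds and flags points outside them. The function names are hypothetical, and the quartile method (`statistics.quantiles` with `method="inclusive"`) is an assumption; databases may compute quartiles slightly differently, which shifts the bounds.

```python
from statistics import quantiles


def boxplot_bounds(values, whisker_boundary_multiplier=1.5):
    """Return (lower_bound, upper_bound) derived from the quartiles."""
    # assumption: inclusive quartile interpolation; other methods exist
    first_quartile, _, third_quartile = quantiles(values, n=4, method="inclusive")
    inter_quartile_range = third_quartile - first_quartile
    whisker = whisker_boundary_multiplier * inter_quartile_range
    return first_quartile - whisker, third_quartile + whisker


def boxplot_outliers(values, whisker_boundary_multiplier=1.5):
    """Return the values that fall outside the boxplot bounds."""
    lower, upper = boxplot_bounds(values, whisker_boundary_multiplier)
    return [x for x in values if x < lower or x > upper]


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up metric values
    print(boxplot_bounds(sample))
    print(boxplot_outliers(sample))
```

A larger multiplier widens the bounds and reports fewer outliers; a smaller one does the opposite.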

Direction

The re_data_anomaly_detector also takes an optional "direction" argument. If the direction is set to "up", only anomalies caused by increases are reported; if it is set to "down", only anomalies caused by decreases are reported. When set to "both", anomalies in either direction are included. It defaults to "both" if not specified. This is useful when monitoring metrics where an increase and a decrease have different implications. For example, with the number of paying customers, a big increase isn't a problem for the business, but a big decrease would be a huge problem.
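
The direction semantics described above can be sketched for a score-based detector as follows. This is a hypothetical helper written for illustration, not a function from re_data; it assumes a signed score where positive values mean the metric increased and negative values mean it decreased.

```python
def is_anomaly(score, threshold=3, direction="both"):
    """Decide whether a signed score counts as an anomaly for a direction."""
    if direction == "up":
        return score > threshold       # only unusually large increases
    if direction == "down":
        return score < -threshold      # only unusually large decreases
    if direction == "both":
        return abs(score) > threshold  # deviations in either direction
    raise ValueError(f"unknown direction: {direction!r}")


if __name__ == "__main__":
    # A sharp drop in paying customers, expressed as a score of -4.2.
    for direction in ("up", "down", "both"):
        print(direction, is_anomaly(-4.2, direction=direction))
```

With direction "up", the drop in this example is ignored, which matches the paying-customers case only if increases are the concern; for that metric, "down" or "both" would be the sensible setting.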