Plugins: Anomaly Detector

Detect outliers in time series data using statistical methods such as Z-score and MAD (Median Absolute Deviation).

Written by Santiago Pachon Robayo
Updated over a week ago

In today's data-driven world, the ability to monitor and analyze time series data is paramount for businesses and organizations across various domains. However, amidst the vast and continuous stream of data, identifying anomalies—those unexpected deviations from the norm—can be an arduous task, often akin to finding needles in a haystack.

Ubidots has recognized the critical need for automated anomaly detection to empower its users to quickly spot irregularities in their time series data. This article delves into a solution offered by Ubidots: the Anomaly Detection Plugin.

# 1. Understanding anomalies

In the context of statistics, an anomaly, also known as an outlier, refers to a data point or observation that significantly deviates from the expected or typical behavior of a dataset. Anomalies are values that are noticeably different from the majority of data points in a given dataset, and they can occur for various reasons, including errors in data collection, measurement inaccuracies, or genuinely unusual events or phenomena. Anomalies are important to identify and understand because they can have a significant impact on statistical analysis and the conclusions drawn from data.

# 2. The power of algorithms: Z-score and MAD

Ubidots has integrated two statistical anomaly detection algorithms into its platform: Z-score and MAD (Median Absolute Deviation).

These algorithms offer a statistical approach to quantifying deviations from the expected data patterns. The Z-score measures how many standard deviations a data point is from the mean, while MAD gauges the median of absolute deviations from the median.

# 3. Z-score (standard score)

The Z-score, also known as the "standard score", quantifies how many standard deviations a data point is away from the mean of the dataset. It's calculated using the formula:

$$Z = \frac{X - \mu}{\sigma}$$

Where:

Z is the Z-score of the data point.

X is the data point.

μ is the mean (average) of the dataset.

σ is the standard deviation of the dataset.

A high positive or negative Z-score indicates that the data point is far from the mean, making it a potential anomaly. This algorithm works best when the data is approximately normally distributed.
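As a minimal sketch of the formula above, the following Python snippet computes the Z-score of every point in a list. The sample values are made up for illustration, and the use of the population standard deviation (`pstdev`) is an assumption; this article does not state which variant the plugin uses.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the Z-score of each value: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # All values are identical, so no point deviates from the mean.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

# Hypothetical sample values, for illustration only.
prices = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 40.0]
scores = z_scores(prices)

for price, score in zip(prices, scores):
    print(f"{price:6.1f} -> Z = {score:+.2f}")
```

The last value sits far from the others, so it receives by far the largest Z-score in the list.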

# 4. Common thresholds for Z-score

Threshold of ±2 or ±3 standard deviations:

• One of the most common approaches to Z-score-based anomaly detection is to treat data points whose Z-score is above +2 or below -2 (or, more strictly, beyond ±3) as potential outliers.

• Data points falling outside the chosen range are considered significantly different from the mean and are treated as anomalies.

• The choice between ±2 and ±3 depends on the desired level of sensitivity to outliers. ±3 is more conservative and may capture fewer anomalies, while ±2 is less conservative and may include more potential outliers.
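The effect of the threshold can be illustrated with a short sketch. The readings below are invented, and the function name `zscore_anomalies` is hypothetical; it simply flags values whose absolute Z-score exceeds the threshold.

```python
from statistics import mean, pstdev

def zscore_anomalies(values, threshold=2.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return []
    return [x for x in values if abs(x - mu) / sigma > threshold]

# Hypothetical readings: 18 values close to 50 plus two unusually high ones.
readings = [48, 52, 50, 49, 51, 50,
            48, 52, 50, 49, 51, 50,
            48, 52, 50, 49, 51, 50,
            66, 75]

print(zscore_anomalies(readings, threshold=2))  # [66, 75]
print(zscore_anomalies(readings, threshold=3))  # [75]
```

With a threshold of 2, both high readings are flagged; with a threshold of 3, only the most extreme one remains.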

# 5. MAD (Median Absolute Deviation)

The Median Absolute Deviation (MAD) is a robust measure of statistical dispersion that relies on the median rather than the mean. It quantifies the median of the absolute deviations of data points from the dataset's median. The formula for MAD is as follows:

$$\mathrm{MAD} = \operatorname{median}\left(\left|X_i - \operatorname{median}(X)\right|\right)$$

Where:

MAD is the Median Absolute Deviation.

X represents the data points in the dataset, and X_i is an individual data point.

The median and MAD are robust central tendency and dispersion measures, respectively.

# 6. Common thresholds for MAD (Median Absolute Deviation)

Threshold of ±2 or ±3 MADs:

• Similar to the Z-score, a common approach to MAD-based anomaly detection is to treat data points that lie more than 2 or 3 MADs above or below the median as potential outliers.

• Data points falling outside this range are regarded as anomalies.

• As with the Z-score, the choice between ±2 and ±3 depends on the desired level of sensitivity.
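The MAD calculation and its thresholds can be sketched in a few lines of Python. The function names and sample values are hypothetical. No scaling constant is applied to the MAD here; whether the plugin applies one is not stated in this article.

```python
from statistics import median

def mad(values):
    """Median of the absolute deviations from the median."""
    med = median(values)
    return median([abs(x - med) for x in values])

def mad_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` MADs from the median."""
    med = median(values)
    spread = mad(values)
    if spread == 0:
        # More than half of the values equal the median; nothing to scale by.
        return []
    return [x for x in values if abs(x - med) / spread > threshold]

# Hypothetical sample values, for illustration only.
prices = [10, 12, 11, 13, 12, 11, 40]

print(mad(prices))                         # 1
print(mad_anomalies(prices, threshold=2))  # [40]
```

Because both the center and the spread come from medians, the single extreme value barely affects the MAD itself, which is why it stands out so clearly.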

In the following sections, we will explore the Ubidots Anomaly Detection Plugin and how it uses these algorithms to help users spot anomalies in their time series data, with energy prices as a worked example.

# 7. What algorithm to use?

The choice of algorithm should be guided by the characteristics of the data, for the following reasons:

• Data distribution matters:

The underlying distribution of your data can significantly impact the performance of anomaly detection algorithms. Different algorithms have different assumptions about data distribution and using the algorithm that aligns with your data's distribution can lead to more accurate results.

• Consider data characteristics (skewness and kurtosis):

Examine the skewness (a measure of asymmetry) and kurtosis (a measure of tail heaviness) of your data. If your data is highly skewed or exhibits heavy tails, MAD may be a better choice.
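One way to examine these characteristics is to compute the skewness and excess kurtosis directly, as in the sketch below. The `suggest_algorithm` helper and its cutoff values of 1.0 are illustrative assumptions, not rules used by the plugin; pick cutoffs that suit your own data.

```python
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the average cubed Z-score."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0
    return sum(((x - mu) / sigma) ** 3 for x in values) / len(values)

def excess_kurtosis(values):
    """Population kurtosis minus 3, so a normal distribution scores about 0."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0
    return sum(((x - mu) / sigma) ** 4 for x in values) / len(values) - 3.0

def suggest_algorithm(values, skew_cutoff=1.0, kurtosis_cutoff=1.0):
    """Hypothetical rule of thumb: prefer MAD for skewed or heavy-tailed data."""
    if abs(skewness(values)) > skew_cutoff or excess_kurtosis(values) > kurtosis_cutoff:
        return "MAD"
    return "Z-score"

# Hypothetical datasets, for illustration only.
symmetric = [1, 2, 3, 4, 5]
skewed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 50]

print(suggest_algorithm(symmetric))  # Z-score
print(suggest_algorithm(skewed))     # MAD
```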

## 7.1. Z-score for normally distributed data:

The Z-score is well-suited for data that follows a normal distribution (bell-shaped curve) or approximately normal distribution. In a normal distribution, the majority of data points cluster around the mean, and extreme values are relatively rare.

## 7.2. MAD for non-normally distributed data:

The Median Absolute Deviation (MAD) is robust and suitable for data that may not follow a normal distribution. It is less affected by outliers and extreme values and is more appropriate for skewed or heavy-tailed distributions.

In practice, to effectively analyze and detect anomalies in datasets, it's often beneficial to consider both ends of the distribution. This is where the concepts of 'Right MAD' and 'Left MAD' come into play. These variations of the MAD calculation allow us to assess the spread or dispersion of data points on both the right and left sides of the central measure, which is typically the median.

Right MAD: Right MAD focuses on data points that are greater than or to the right of the median. It quantifies the dispersion in the right tail of the dataset, helping to identify values significantly higher than the median. In the context of the anomaly detection plugin, the 'Right MAD' corresponds to the upper boundary for anomaly detection, which signifies the threshold above which data points are considered anomalies.

Left MAD: Conversely, Left MAD considers data points that are less than or to the left of the median. It quantifies the dispersion in the left tail of the dataset, aiding in the identification of values significantly lower than the median. In the context of the anomaly detection plugin, the 'Left MAD' corresponds to the lower boundary for anomaly detection, indicating the threshold below which data points are flagged as anomalies.

By utilizing both Right MAD and Left MAD, we ensure comprehensive coverage of the data distribution, allowing us to identify and manage anomalies at both ends of the spectrum. This approach aligns with the upper and lower boundaries output by the plugin, providing a holistic view of the anomalies within the dataset.
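The idea of separate left and right MADs can be sketched as follows. This is one common way to define them (the median of the absolute deviations on each side of the median), and the boundary formula of median minus or plus threshold times the side MAD is an assumption; the plugin's exact calculation is not documented in this article.

```python
from statistics import median

def left_right_mad(values):
    """Return (median, left MAD, right MAD) for the given values."""
    med = median(values)
    left = median([med - x for x in values if x <= med])
    right = median([x - med for x in values if x >= med])
    return med, left, right

def mad_bounds(values, threshold=2.0):
    """Assumed boundaries: median -/+ threshold times the MAD of that side."""
    med, left, right = left_right_mad(values)
    return med - threshold * left, med + threshold * right

# Hypothetical right-skewed sample, for illustration only.
data = [8, 9, 10, 10, 11, 12, 14, 18, 30]

lower, upper = mad_bounds(data, threshold=2)
anomalies = [x for x in data if x < lower or x > upper]

print(lower, upper)  # 9 17
print(anomalies)     # [8, 18, 30]
```

In this sample the right tail is more spread out than the left, so the upper boundary sits farther from the median than the lower one.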

# 8. Installing the anomaly detection plugin

Step 1. In your Ubidots account, go to the "Devices" tab, click on "Plugins", then click on the "+" icon to create a new Plugin. Search for the Anomaly Detector, click on it, and follow the on-screen steps:

Step 2. After reading the plugin descriptions, the plugin configuration options will appear as follows:

Device label: Label of the device that contains the variable of interest for anomaly detection.

Variable label: Label of the variable that holds the time series of interest for anomaly detection.

Reference dataset: Time interval of historical data used to calculate the upper and lower reference values that determine what counts as an anomaly.

Threshold (optional): Enter a positive integer or float value. Typical thresholds are 2 or 3. The default is 2.

Sliding window: Number of dots to be analyzed to detect anomalies.

Save anomalies (optional): Options: "yes", "no". Set by default to "no". When set to "yes" the plugin saves the outliers found while training into a new variable called "anomaly_value".

Evaluation period: Frequency at which new analysis will be run.

Your Ubidots token: Select the Ubidots token you'd like to use for this plugin.
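To make the roles of the reference dataset, threshold, and sliding window more concrete, here is a rough sketch of how such parameters could interact. It is not the plugin's implementation: the function name, the Z-score boundaries, and the sample values are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def detect(reference, recent, threshold=2.0, sliding_window=3):
    """Flag recent values that fall outside bounds learned from reference data.

    reference      -- historical values (the "reference dataset")
    recent         -- newly received values
    threshold      -- number of standard deviations for the bounds
    sliding_window -- how many of the latest values to evaluate
    """
    mu, sigma = mean(reference), pstdev(reference)
    lower, upper = mu - threshold * sigma, mu + threshold * sigma
    window = recent[-sliding_window:]
    anomalies = [x for x in window if x < lower or x > upper]
    return lower, upper, anomalies

# Hypothetical values, for illustration only.
reference = [50, 52, 48, 51, 49, 50, 52, 48]
recent = [49, 51, 50, 55, 46]

lower, upper, anomalies = detect(reference, recent, threshold=2, sliding_window=3)
print(lower, upper)  # 47.0 53.0
print(anomalies)     # [55, 46]
```

Only the last three values are evaluated here, so earlier points in `recent` are ignored even if they would have been out of range.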

# 9. Plotting the anomalies detected

In this example, we detect the energy price values that may be considered anomalies. Depending on the distribution of the energy prices over the previous 6 months, we may use either Z-score or MAD.

After installing the plugin and choosing the Z-score algorithm with a threshold of 2 standard deviations above and below the mean of the previous 6 months of data, the plugin creates additional variables alongside the original one.

In a dashboard, use a Line Chart widget to plot both the original variable and the anomaly_values variable created by the plugin. Showing the original variable as a line and the anomalies as dots makes the anomalies easier to identify. You can also display the upper and lower thresholds by adding horizontal lines to your chart.