Outlier

June 3, 2026 in AI & ChatGPT

An outlier is a point that appears unusual compared with the rest of the data. It may be a rare but valid observation, a measurement error, a data quality problem, an anomaly or a signal that something important is happening.

In machine learning, outliers matter because they can strongly affect statistics, models, visualisations and business decisions. A single unusual value can distort an average, influence a regression line, change clustering results or make a model learn patterns that do not generalise well.

An outlier is not automatically bad. Sometimes it is an error that should be fixed or removed. Sometimes it is the most important point in the dataset. Fraud, equipment failure, rare disease patterns, unusual customer behaviour or security incidents can all appear as outliers.

An outlier is a data point that looks unusual compared with the rest of the dataset. It may be noise, an error, a rare valid case or a meaningful anomaly.

What an outlier means

An outlier is an observation that differs noticeably from other observations. In a simple chart, it may appear far away from the main group of points. In a table, it may be an extremely high or low value. In a machine learning model, it may be an example that does not behave like most of the training data.

For example, if most customers spend between 20 and 200 dollars, a customer spending 20 000 dollars may be an outlier. If most sensor readings are between 70 and 90 degrees, a value of 500 degrees may be an outlier. If most website sessions last under 10 minutes, a session lasting 20 hours may need inspection.

The important question is not only whether the value is unusual. The important question is why it is unusual.

A simple example of an outlier

Imagine a dataset of house prices in one city district. Most houses cost between 250 000 and 600 000 dollars. One property costs 8 000 000 dollars.

That property is an outlier. But it may have several possible explanations:

  • valid rare case – it is a luxury property with unusually large land,
  • data entry error – one zero was added by mistake,
  • wrong category – it is a commercial building, not a normal house,
  • different population – it belongs to a different market segment,
  • fraud or manipulation – the recorded price may not reflect a normal transaction.

The same numerical outlier can mean very different things depending on context.

Outlier vs anomaly

The terms outlier and anomaly are often used together, but they are not always identical.

Outlier usually means a data point that is statistically unusual compared with the rest of the data.

Anomaly often means an unusual point that may indicate something important, suspicious or abnormal in the real world.

For example, a very large order may be an outlier. It becomes an anomaly if it suggests fraud, system misuse or an operational issue.

In practice, many machine learning systems use the terms almost interchangeably. But the distinction is useful: an outlier is unusual in the data; an anomaly is unusual and potentially meaningful.

Outlier vs noise

Noise is random variation or unwanted disturbance in data. An outlier can be caused by noise, but not every outlier is noise.

For example, a faulty sensor may randomly produce extreme readings. Those values are outliers caused by noise or measurement error.

But a rare equipment failure may also create an unusual reading. That reading is an outlier, but it is not meaningless noise. It is a real signal.

This distinction matters because noise should often be reduced or filtered, while meaningful outliers should be investigated.

An outlier should not be removed automatically. First ask whether it is an error, noise, a rare valid case or an important signal.

Outlier vs data error

A data error is a wrong value. It may be caused by manual entry, broken tracking, wrong units, duplicate records, corrupted files, failed sensors or bad integration between systems.

A data error may look like an outlier. But a valid outlier is not an error.

For example:

  • Data error – customer age is recorded as 350 years.
  • Valid outlier – customer age is 101 years.
  • Data error – order value is 10 000 because cents were treated as dollars.
  • Valid outlier – order value is 10 000 because it was a genuine enterprise purchase.

The right treatment depends on the cause. A data error should be corrected or removed. A valid outlier should usually be kept or handled carefully.

Why outliers matter

Outliers matter because they can change analysis results.

They can:

  • distort averages – one extreme value can move the mean strongly,
  • affect standard deviation – extreme values increase spread,
  • influence regression models – a few points can pull the fitted line,
  • change clustering – unusual points can distort group structure,
  • hurt model training – a model may learn unstable patterns,
  • hide real signals – rare but important events may be treated as noise,
  • reveal data quality problems – outliers can show broken tracking or input errors,
  • identify risk – fraud, failures and attacks often appear as unusual behaviour.

Outliers are therefore both a problem and an opportunity.

Outliers in statistics

In statistics, outliers are observations that lie unusually far from other observations. They can affect descriptive statistics and model estimates.

The mean is especially sensitive to outliers. If most values are small and one value is extremely large, the mean can become misleading. The median is usually more robust because it depends on the middle value, not the magnitude of extreme values.

For example, if a company looks at average order value and one customer places an unusually large order, the average may rise. But that does not mean typical customers are spending more. The median may better describe the usual customer.

This is why analysts often compare mean, median, percentiles and distribution plots.
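The difference between mean and median is easy to see on a small example. The sketch below uses made-up order values (an assumption for illustration only) and Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical order values in dollars: nine typical orders and one very large one.
orders = [40, 55, 60, 62, 70, 75, 80, 90, 120, 20000]

print(mean(orders))    # 2065.2, pulled far upward by the single large order
print(median(orders))  # 72.5, still close to a typical order
```

Nothing about the typical customer changed, yet the mean suggests an order value that no typical customer comes close to.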

Outliers in machine learning

In machine learning, outliers can affect both training and evaluation.

A model trained on data with strong outliers may learn patterns that do not represent normal behaviour. This can increase overfitting, reduce generalisation and make predictions unstable.

Some models are more sensitive to outliers than others. Linear regression, k-nearest neighbours and distance-based clustering can be strongly affected. Tree-based models are often more robust, but they are not immune.

The effect depends on:

  • the model type,
  • the number of outliers,
  • how extreme they are,
  • whether they are errors or valid cases,
  • whether the target variable is affected,
  • whether outliers appear in training, validation or production data.

Outliers and overfitting

Outliers can contribute to overfitting when a model learns rare details from the training data as if they were reliable patterns.

For example, a model may learn that a very specific combination of features predicts high revenue because one unusual customer had that combination. If the pattern does not repeat in future data, the model will perform poorly.

This is especially risky when the dataset is small or high-dimensional. With many features, the model has more opportunities to find accidental relationships around unusual points.

Outlier handling is therefore part of model robustness.

Outliers can make a model look smarter than it is. The model may learn rare accidents from the training data instead of stable patterns that work on new data.

Outliers and data leakage

Outliers can also reveal data leakage.

For example, a churn model may contain a feature that is normally zero, but has an extreme value after a customer cancels. If this feature is included during training, the model may appear extremely accurate because it is using information from the future.

An outlier in feature importance, model explanation or prediction score can be a warning sign. It may indicate that one variable is unrealistically powerful.

This is why outliers should be checked not only in raw values, but also in model behaviour.

Types of outliers

Outliers can appear in several forms.

Point outlier – one observation is unusual compared with the rest of the data. Example: one extremely high transaction amount.

Contextual outlier – a value is unusual only in a specific context. Example: 30 degrees Celsius may be normal in summer but unusual in winter.

Collective outlier – a group of observations is unusual together, even if individual points do not look extreme. Example: many login attempts from different locations within a short time.

These types matter because different detection methods may be needed.

Point outliers

A point outlier is the simplest type. One data point stands far away from the rest.

For example:

  • a transaction amount much higher than usual,
  • a sensor reading far outside normal range,
  • a customer with extremely high lifetime value,
  • a website session with unusually long duration,
  • a product with unusually high return rate.

Point outliers are often easy to spot in simple charts, but they become harder to detect in high-dimensional data.

Contextual outliers

A contextual outlier depends on the situation.

For example, a website traffic spike may be normal during a campaign but unusual on a quiet weekend. A high electricity usage value may be normal during winter heating but strange in summer. A high order value may be normal for B2B customers but unusual for ordinary retail customers.

This is why context matters. A value can be normal in one segment and abnormal in another.

Good outlier detection should consider time, geography, customer type, product category, seasonality and business rules.
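One way to take context into account is to compute thresholds per segment instead of over the whole dataset. The sketch below is a minimal illustration: the segment names, the order values and the quartile-based rule with a multiplier of 1.5 are all assumptions, not a recommended configuration.

```python
from collections import defaultdict
from statistics import quantiles

def fences(values, k=1.5):
    """Lower and upper limits based on the quartiles of `values`."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

# Hypothetical orders: (customer type, order value in dollars).
orders = [("retail", v) for v in [20, 25, 30, 35, 40, 45, 50, 60, 5000]]
orders += [("b2b", v) for v in [3000, 3500, 4000, 4500, 5000, 5500, 6000, 7000]]

# Group values by segment so each segment gets its own limits.
by_segment = defaultdict(list)
for segment, value in orders:
    by_segment[segment].append(value)

flagged = []
for segment, values in by_segment.items():
    low, high = fences(values)
    flagged += [(segment, v) for v in values if v < low or v > high]

print(flagged)  # [('retail', 5000)]
```

The same 5000 dollar order is flagged for a retail customer and left alone for a B2B customer, because each value is compared only with its own segment.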

Collective outliers

A collective outlier is a group pattern that looks unusual, even if each individual point is not extreme.

For example:

  • many small transactions from one account,
  • repeated login attempts from different countries,
  • a sudden group of similar support tickets,
  • a cluster of sensor readings that slowly drift together,
  • a sudden change in user behaviour after a product update.

Collective outliers are important in fraud detection, cybersecurity, monitoring and operational analytics.

How to detect outliers visually

Visualisation is often the first step.

Useful charts include:

  • box plot – shows median, quartiles and extreme values,
  • histogram – shows distribution shape and extreme tails,
  • scatter plot – shows unusual points in two dimensions,
  • time-series plot – shows spikes, drops and drift over time,
  • UMAP or t-SNE plot – can help inspect unusual points in high-dimensional representations.

Visual methods are useful because they help people notice patterns quickly. But they are not enough for large or complex datasets.

How to detect outliers statistically

Statistical outlier detection uses rules or thresholds.

Common approaches include:

  • z-score – flags values far from the mean in standard deviation units,
  • IQR method – flags values outside a range based on quartiles,
  • percentile thresholds – flag values above or below chosen percentiles,
  • robust statistics – use the median and the median absolute deviation,
  • distribution-based methods – compare values to an assumed probability distribution.

These methods are simple and useful, but they depend on assumptions. A z-score works better when the distribution is roughly normal. Percentile thresholds may be better for skewed business data.

Z-score method

A z-score measures how many standard deviations a value is from the mean.

If a value has a very high or very low z-score, it may be considered an outlier.

For example, in a roughly normal distribution, values more than three standard deviations from the mean are often treated as unusual. But this rule should not be used blindly.

The z-score method can fail when the data is heavily skewed, has multiple groups or contains many extreme values that distort the mean and standard deviation.
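A minimal sketch of the z-score rule, using only the standard library. The readings, the threshold of 3 and the helper names `z_scores` and `flag_by_z` are assumptions for illustration. The second call shows one way the rule can fail: in a very small sample, the extreme value inflates the mean and standard deviation so much that its own z-score stays below the threshold.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return (value - mean) / standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # all values identical, nothing stands out
    return [(v - mu) / sigma for v in values]

def flag_by_z(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical sensor readings: 30 values between 70 and 79, plus one of 500.
readings = [70 + (i % 10) for i in range(30)] + [500]
print(flag_by_z(readings))       # [500]

# Only five values: 1000 is obviously extreme, but its z-score is about 2.
small_sample = [10, 11, 12, 13, 1000]
print(flag_by_z(small_sample))   # []
```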

IQR method

The IQR method uses the interquartile range. The interquartile range is the distance between the first quartile and the third quartile.

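A common rule flags values that lie more than 1.5 times the IQR below the first quartile or more than 1.5 times the IQR above the third quartile. The multiplier 1.5 is a convention and can be adjusted.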

This method is more robust than z-score because it uses quartiles instead of mean and standard deviation. It is often useful for skewed data.

However, it still needs context. In business data, high-value customers or large transactions may be valid and important, even if they appear beyond the IQR threshold.
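A small sketch of the IQR rule, assuming the conventional multiplier of 1.5 and made-up order values. Quartiles can be computed in several slightly different ways; this example uses the default method of `statistics.quantiles`, so other tools may give slightly different limits on the same data.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical order values in dollars.
order_values = [20, 35, 40, 45, 50, 60, 75, 90, 120, 200, 20000]

low, high = iqr_fences(order_values)
outside = [v for v in order_values if v < low or v > high]

print(low, high)  # -80.0 240.0
print(outside)    # [20000]
```

The 200 dollar order stays inside the limits, while the 20 000 dollar order is flagged for review. Whether it is an error or a genuine large purchase still has to be checked.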

Machine learning methods for outlier detection

Machine learning can detect outliers when simple statistical rules are not enough.

Common methods include:

  • Isolation Forest – isolates unusual points through random splits,
  • Local Outlier Factor – compares local density around a point with its neighbours,
  • One-Class SVM – learns a boundary around normal data,
  • DBSCAN – can identify low-density points as noise,
  • autoencoders – use reconstruction error to detect unusual inputs,
  • robust covariance methods – detect points far from a multivariate distribution.

These methods are useful, but they are not magic. They still require good features, appropriate scaling, validation and domain review.

Isolation Forest

Isolation Forest is a common outlier detection method. It works from the idea that unusual points are often easier to isolate than normal points.

The algorithm randomly splits the data. Outliers tend to be separated with fewer splits because they are different from the rest.

Isolation Forest is useful for tabular data and can handle larger datasets reasonably well. However, results depend on feature quality, contamination settings and preprocessing.

It should be validated against known examples, business rules or manual review where possible.
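As a rough sketch, this is how Isolation Forest might be tried with scikit-learn. The data is synthetic, and the `contamination` value of 0.02 is an assumption chosen for this toy example, not a recommended setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data: 200 ordinary points around the origin and two far-away points.
ordinary = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
unusual = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([ordinary, unusual])

# contamination is the expected share of outliers; here it is a guess.
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = model.fit_predict(X)    # -1 = flagged as outlier, 1 = treated as normal
scores = model.score_samples(X)  # lower score = easier to isolate

print(labels[-2:])               # labels of the two far-away points
print(int((labels == -1).sum())) # total number of flagged points
```

Because `contamination` fixes the share of points that get flagged, some ordinary points near the edge of the cloud are flagged as well. This is one reason the output should be reviewed rather than trusted as is.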

Local Outlier Factor

Local Outlier Factor, often shortened to LOF, detects points that have lower local density than their neighbours.

This is useful because some datasets contain regions with different densities. The same distance to the nearest neighbours may be normal in a sparse region but unusual in a dense one.

LOF compares each point with nearby points. If a point is much more isolated than its neighbours, it receives a higher outlier score.

This makes LOF useful for local anomaly detection, but it can be sensitive to distance metrics and high-dimensional data.
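The density idea can be illustrated with scikit-learn's `LocalOutlierFactor`. The data below is synthetic and the choice of 20 neighbours is simply the library default; both are assumptions for this sketch.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)

# A tight cluster, a loose cluster, and one point that sits a short
# absolute distance from the tight cluster but far relative to its density.
dense = rng.normal(loc=0.0, scale=0.2, size=(100, 2))
sparse = rng.normal(loc=10.0, scale=2.0, size=(100, 2))
near_dense = np.array([[1.5, 1.5]])
X = np.vstack([dense, sparse, near_dense])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # -1 = flagged, 1 = treated as normal
scores = -lof.negative_outlier_factor_   # about 1 = similar density to neighbours

print(labels[-1])   # label of the point near the dense cluster
print(scores[-1])   # its LOF score, clearly above 1
```

Points in the loose cluster can be just as far from their neighbours without being flagged, because their neighbours are equally spread out.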

One-Class SVM

One-Class SVM learns a boundary around normal data. Points outside that boundary can be treated as unusual.

This method can be useful when you have mostly normal examples and want to detect future observations that do not match them.

However, One-Class SVM can be sensitive to parameters and scaling. It may also become difficult to use on very large datasets.

As with other methods, the result should be evaluated in the context of the actual task.
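A minimal sketch of this workflow with scikit-learn: fit on data assumed to be mostly normal, then score new observations. The two features, their ranges and the settings `nu=0.05` and `gamma="scale"` are assumptions for illustration. Scaling is included in the pipeline because the method is sensitive to feature scale.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)

# Synthetic "normal" training data with two features on very different scales.
train = rng.normal(loc=[50.0, 0.5], scale=[5.0, 0.05], size=(300, 2))

# nu is roughly the share of training points allowed outside the boundary.
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
model.fit(train)

new_points = np.array([
    [51.0, 0.52],  # close to the training data
    [90.0, 0.10],  # far from anything seen in training
])
print(model.predict(new_points))  # 1 = inside the boundary, -1 = outside
```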

Autoencoders for outlier detection

An autoencoder can be trained to reconstruct normal data. If the model reconstructs an input poorly, that input may be unusual.

This is often used in anomaly detection for images, signals, sensors and complex numerical data.

The basic idea is:

  • train the autoencoder on mostly normal data,
  • measure reconstruction error,
  • flag inputs with unusually high reconstruction error,
  • investigate flagged cases.

A high reconstruction error is not final proof of an anomaly. It is a signal that the input differs from what the model learned as normal.
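The four steps above can be sketched without a deep learning framework. Note the substitution: the example below does not train a neural autoencoder. It uses a linear projection computed with PCA (via NumPy's SVD) as a stand-in, because it follows the same compress, reconstruct and compare logic. The data, the single retained component and the 99th percentile threshold are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "normal" data: three features that move together, plus small noise.
t = rng.normal(size=(500, 1))
direction = np.array([[1.0, 2.0, -1.0]])
normal = t @ direction + rng.normal(scale=0.05, size=(500, 3))

# Step 1: "train" on mostly normal data (learn the mean and main direction).
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:1]  # keep one component as the compressed representation

def reconstruction_error(x):
    """Squared error between each row of x and its reconstruction."""
    centred = x - mean
    encoded = centred @ components.T   # compress
    decoded = encoded @ components     # reconstruct
    return np.sum((centred - decoded) ** 2, axis=1)

# Steps 2 and 3: measure errors and pick a threshold from the training data.
threshold = np.percentile(reconstruction_error(normal), 99)

new = np.array([
    [1.0, 2.0, -1.0],   # follows the learned relationship
    [1.0, -2.0, 1.0],   # similar magnitudes, but the relationship is broken
])
flags = reconstruction_error(new) > threshold
print(flags)  # [False  True]
```

Step 4, investigating the flagged case, stays a manual task. The second point has feature values of ordinary size. It is flagged only because the combination does not match what was learned as normal.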

Outliers in high-dimensional data

Outlier detection becomes harder in high-dimensional data.

In high-dimensional space, distances can become less meaningful. Many points may appear similarly far from each other. Irrelevant or noisy dimensions can hide the true structure.

This is connected to the curse of dimensionality. As the number of dimensions grows, the space becomes sparse and local neighbourhoods become harder to trust.

For high-dimensional outlier detection, it is often useful to:

  • remove irrelevant features,
  • use feature selection,
  • apply dimensionality reduction,
  • choose a meaningful distance metric,
  • validate results with domain knowledge,
  • avoid relying only on visual position.

Outliers and dimensionality reduction

Dimensionality reduction can help inspect outliers, but it can also hide or distort them.

Methods such as PCA, UMAP and t-SNE can make high-dimensional data easier to visualise. Unusual points may appear separated from the main group.

However, a point that looks isolated in a 2D plot is not automatically a true outlier in the original data. Projection can distort distances and neighbourhoods.

Dimensionality reduction should be used as an exploratory tool. The outlier should still be checked in the original feature space.

A point that looks unusual in a UMAP or t-SNE plot may be worth investigating, but the plot alone is not enough. Always check the original data.

Outliers and scaling

Scaling matters because many models and detection methods depend on distance.

If one feature has values from 0 to 1 and another has values from 0 to 1 000 000, the large-scale feature can dominate distance calculations.

This can make outlier detection misleading. A point may look unusual only because one feature has a larger numerical scale.

Common scaling methods include:

  • standard scaling – subtracts the mean and divides by the standard deviation,
  • min-max scaling – maps values to a fixed range,
  • robust scaling – uses median and quartiles, often better when outliers exist,
  • log transformation – can reduce the effect of heavy-tailed distributions.

The right transformation depends on the data distribution and the model.
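To see why robust scaling is often preferred when outliers exist, compare it with standard scaling on the same values. This is a hand-written sketch with made-up numbers; the helper names `standard_scale` and `robust_scale` are hypothetical and only mirror what library scalers typically do.

```python
from statistics import mean, median, pstdev, quantiles

def standard_scale(values):
    """Subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def robust_scale(values):
    """Subtract the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]

# Hypothetical feature: nine typical values and one extreme value.
values = [30, 32, 35, 38, 40, 42, 45, 48, 50, 5000]

std_scaled = standard_scale(values)
rob_scaled = robust_scale(values)

# Spread of the nine typical values after each kind of scaling.
std_spread = max(std_scaled[:9]) - min(std_scaled[:9])
rob_spread = max(rob_scaled[:9]) - min(rob_scaled[:9])
print(round(std_spread, 3), round(rob_spread, 3))
```

With standard scaling, the extreme value inflates the standard deviation, so the typical values are squeezed into a very narrow band and become hard to tell apart. With robust scaling they keep a usable spread, and the extreme value simply stays extreme.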

Outliers and skewed data

Business data is often skewed. Revenue, order value, page views, customer lifetime value and transaction counts often have long tails.

In skewed data, extreme values may be normal. A few customers may spend much more than the rest. A few pages may receive most traffic. A few products may produce most revenue.

Treating every extreme value as an error would be wrong.

For skewed data, it is useful to inspect:

  • log-transformed values,
  • percentiles,
  • segment-level distributions,
  • business rules,
  • time trends,
  • whether the value is possible and valid.

Outliers in time series

In time-series data, an outlier is not only about value size. Timing matters.

A value may be normal during one period and abnormal during another.

For example:

  • a traffic spike may be normal during Black Friday,
  • high electricity consumption may be normal in winter,
  • a sales drop may be expected after a campaign ends,
  • a sensor spike may be normal during startup but abnormal during stable operation.

Time-series outlier detection should consider trend, seasonality, calendar effects, known events and operational context.
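A very simple way to bring timing into the check is to compare each value with its own recent history instead of with the whole series. The sketch below uses a rolling median and median absolute deviation. The window of 7, the threshold of 5 and the `min_scale` floor are assumptions, and the approach deliberately ignores seasonality, calendar effects and known events, which a real system would need to handle.

```python
from statistics import median

def rolling_outliers(series, window=7, threshold=5.0, min_scale=1.0):
    """Return indexes whose value is far from the median of the previous window."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        centre = median(history)
        mad = median(abs(v - centre) for v in history)
        scale = max(mad, min_scale)  # avoid dividing by zero on flat history
        if abs(series[i] - centre) / scale > threshold:
            flagged.append(i)
    return flagged

# Hypothetical daily traffic with one spike.
traffic = [100, 102, 98, 101, 99, 103, 100, 101, 400, 99, 102, 100]
print(rolling_outliers(traffic))  # [8]
```

Because the baseline uses the median, the spike at index 8 does not cause the ordinary values that follow it to be flagged.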

Outliers in classification

In classification tasks, outliers can appear as unusual examples within a class or as mislabeled examples.

For example, a model may classify emails as spam or not spam. A mislabeled spam email in the normal class can confuse the model. A rare but valid email type may also look unusual.

Outliers in classification can:

  • increase training noise,
  • hurt boundary learning,
  • reveal mislabeled data,
  • show rare subgroups,
  • identify edge cases that need special handling.

Manual review of unusual training examples can improve dataset quality.

Outliers in regression

Regression models can be very sensitive to outliers, especially when outliers appear in the target variable.

For example, if a model predicts house prices, a few extremely expensive properties can strongly affect the learned relationship between features and price.

Possible treatments include:

  • checking whether the value is valid,
  • using log transformation of the target,
  • using robust regression methods,
  • training separate models for different segments,
  • winsorizing extreme values where justified,
  • evaluating with metrics less sensitive to extremes.

The best approach depends on whether the extreme values are errors or genuine cases.

Outliers in clustering

Clustering can be strongly affected by outliers. Some algorithms may create small clusters around outliers. Others may pull cluster centers away from dense regions.

For example, k-means can be influenced by extreme points because it uses distances to cluster centers. DBSCAN can mark low-density points as noise, which may be useful for outlier detection.

When clustering data, it is useful to check:

  • whether outliers distort cluster centers,
  • whether small clusters are meaningful or noise,
  • whether different scaling changes the result,
  • whether the clusters make business or domain sense.

Outliers in embeddings

Embeddings represent objects such as texts, images, products or users as vectors. Outliers in embedding space can reveal unusual content or representation problems.

For example:

  • a document may appear far from all other documents,
  • an image may have an unusual visual style,
  • a product may be placed near the wrong category,
  • a user profile may not match normal behavioural groups,
  • a prompt may be unlike the rest of the evaluation set.

Embedding outliers should be reviewed carefully. They may reveal rare useful cases, data errors, mislabeled items or model weaknesses.

Outliers in business analytics

In business analytics, outliers can appear in many places:

  • unusually high order value,
  • sudden traffic spike,
  • large drop in conversion rate,
  • customer with very high refund rate,
  • campaign with abnormal cost per lead,
  • product with unusual return rate,
  • salesperson with unusually high or low performance,
  • website session with strange behaviour.

These outliers should not be ignored. They can reveal data tracking problems, operational issues, fraud, seasonality, business opportunities or segment differences.

Outliers in marketing data

Marketing data often contains outliers because user behaviour and campaign performance are uneven.

Examples include:

  • one campaign generating most conversions,
  • one keyword spending too much budget,
  • one landing page receiving unusually high traffic,
  • one source producing many low-quality leads,
  • one day showing abnormal conversion rate,
  • one customer segment behaving very differently.

Marketing outliers can be valuable. They may show a winning opportunity, broken tracking, bot traffic, fraud, seasonality or a one-off event.

The right response is investigation, not automatic removal.

Outliers in fraud detection

Fraud detection often focuses on unusual behaviour.

A transaction may be suspicious because it is much larger than usual, occurs from an unusual location, happens at an unusual time or combines several small unusual signals.

Fraud outliers are often contextual. A large transaction may be normal for one customer and suspicious for another. A login from another country may be normal for a frequent traveller and unusual for someone else.

This is why fraud systems often combine rules, machine learning, anomaly detection, user history and human review.

Outliers in sensor data

Sensor data can contain outliers because of measurement errors, hardware issues, environmental changes or real equipment problems.

For example:

  • a temperature sensor may produce a sudden impossible spike,
  • a vibration sensor may show early signs of machine failure,
  • a pressure reading may drift slowly over time,
  • a flow meter may produce missing or frozen values.

Some sensor outliers are noise. Others are early warning signs. The difference can matter operationally.

How to handle outliers

There is no universal rule for handling outliers. The right approach depends on cause and task.

Common approaches include:

  • investigate – understand why the value is unusual,
  • correct – fix clear data errors where the correct value is known,
  • remove – delete values that are clearly invalid and harmful,
  • winsorize – cap extreme values at a chosen threshold,
  • transform – use log or robust scaling to reduce extreme influence,
  • segment – model unusual groups separately,
  • flag – create a feature indicating that a value is unusual,
  • keep – retain valid outliers when they are important for the task.

The worst approach is automatic removal without understanding.
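Of the approaches listed above, winsorizing is the easiest to show in a few lines. The sketch below caps values at the 5th and 95th percentiles. Those percentiles, the data and the helper name `winsorize` are assumptions. The thresholds should come from the task, not from this example.

```python
from statistics import quantiles

def winsorize(values, lower, upper):
    """Cap every value to the range [lower, upper] without dropping any rows."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical data: values 1 to 99 plus one extreme value.
values = list(range(1, 100)) + [10_000]

# quantiles(..., n=20) returns the 5th, 10th, ..., 95th percentiles.
cuts = quantiles(values, n=20, method="inclusive")
lower, upper = cuts[0], cuts[-1]

capped = winsorize(values, lower, upper)
changed = sum(1 for before, after in zip(values, capped) if before != after)

print(lower, upper)   # 5.95 95.05
print(max(capped))    # 95.05
print(changed)        # 10
```

The number of rows stays the same, which is the main difference from removal. The count of changed values is worth recording when the treatment is documented.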

When to remove outliers

Removing outliers can be appropriate when they are clear errors or irrelevant to the modelling goal.

For example:

  • impossible values, such as negative age,
  • unit conversion mistakes,
  • duplicate records created by system errors,
  • test transactions in production data,
  • bot traffic when modelling human behaviour,
  • sensor failures that do not represent real operation.

Even then, removal should be documented. The team should know which rule was used and why.

When not to remove outliers

Outliers should not be removed when they represent real and relevant cases.

For example:

  • fraudulent transactions in a fraud detection model,
  • high-value customers in revenue modelling,
  • rare disease cases in medical analysis,
  • equipment failure signals in predictive maintenance,
  • security incidents in cybersecurity detection,
  • important edge cases in AI evaluation.

Removing these points may make the dataset cleaner but less useful.

If the outlier is part of the real problem the model must handle, removing it may make the model worse, not better.

Outlier treatment and model evaluation

Outlier handling must be included correctly in the evaluation pipeline.

If outliers are removed using information from the full dataset before splitting into training and test sets, this can create data leakage. The model may benefit from decisions that used test data.

A safer process is:

  • split the data first where appropriate,
  • learn thresholds or transformations on training data,
  • apply the same rules to validation and test data,
  • evaluate how the model performs with and without outlier treatment,
  • document the decision.

This is especially important when outlier treatment is automated.
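The safer process can be sketched as follows. The data is synthetic, the 80/20 split and the quartile-based capping rule are assumptions, and `fit_caps` and `apply_caps` are hypothetical helper names. The point is only the order of operations: limits are learned from the training part and then reused unchanged on the test part.

```python
import random
from statistics import quantiles

def fit_caps(train_values, k=1.5):
    """Learn capping limits from training data only."""
    q1, _, q3 = quantiles(train_values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def apply_caps(values, caps):
    """Apply previously learned limits to any dataset."""
    low, high = caps
    return [min(max(v, low), high) for v in values]

random.seed(0)
values = [random.gauss(100, 15) for _ in range(200)] + [5000.0, 7500.0]
random.shuffle(values)

# 1. Split first.
split = int(0.8 * len(values))
train, test = values[:split], values[split:]

# 2. Learn thresholds on the training data.
caps = fit_caps(train)

# 3. Apply the same rules to both parts.
train_capped = apply_caps(train, caps)
test_capped = apply_caps(test, caps)

print(caps)
print(max(train_capped), max(test_capped))
```

Calling `fit_caps(values)` before the split would let test rows influence the limits, which is the leakage described above.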

Outlier flags as features

Sometimes the best solution is not to remove the outlier, but to flag it.

For example, a model may benefit from knowing that a transaction amount is unusually high for that customer. Instead of deleting the transaction, the system can create a feature such as "amount_is_unusual".

This is useful when unusual behaviour carries predictive signal.

Outlier flags can help models distinguish between normal and unusual cases while still preserving the original information.
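A minimal sketch of such a flag is shown below. The rule (an amount more than five times the customer's previous median, once at least three earlier transactions exist) and the helper name `add_unusual_flag` are assumptions for illustration. Only earlier transactions are used for each row, so the flag does not rely on information from the future.

```python
from statistics import median

def add_unusual_flag(transactions, factor=5.0, min_history=3):
    """Add an amount_is_unusual flag based on each customer's earlier amounts."""
    history = {}
    rows = []
    for customer, amount in transactions:  # assumed to be in time order
        past = history.setdefault(customer, [])
        typical = median(past) if len(past) >= min_history else None
        is_unusual = typical is not None and amount > factor * typical
        rows.append({
            "customer": customer,
            "amount": amount,              # original value is preserved
            "amount_is_unusual": is_unusual,
        })
        past.append(amount)
    return rows

# Hypothetical transactions: (customer id, amount).
transactions = [
    ("a", 20), ("a", 25), ("a", 22), ("a", 400),
    ("b", 300), ("b", 350), ("b", 320), ("b", 400),
]
rows = add_unusual_flag(transactions)
print([r["amount_is_unusual"] for r in rows])
# [False, False, False, True, False, False, False, False]
```

The same amount of 400 is flagged for customer "a" and not for customer "b", and no transaction is deleted.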

Robust methods

Robust methods are designed to be less sensitive to outliers.

Examples include:

  • median instead of mean,
  • interquartile range instead of standard deviation,
  • robust scaling,
  • robust regression,
  • tree-based models,
  • quantile-based methods,
  • Huber loss in regression.

Robust methods do not make outlier analysis unnecessary. They simply reduce the risk that a few extreme values dominate the result.
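Huber loss shows the idea in a compact form. It behaves like squared loss for small errors and like absolute loss for large ones, so a single extreme residual contributes far less to the total. The sketch below assumes the common definition with a transition point `delta` of 1.0.

```python
def squared_loss(residual):
    """Half the squared residual."""
    return 0.5 * residual ** 2

def huber_loss(residual, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    size = abs(residual)
    if size <= delta:
        return 0.5 * size ** 2
    return delta * (size - 0.5 * delta)

print(squared_loss(0.5), huber_loss(0.5))  # 0.125 0.125
print(squared_loss(100), huber_loss(100))  # 5000.0 99.5
```

For the small residual both losses agree. For the large residual, squared loss is about fifty times larger than Huber loss, which is why one extreme point can dominate a least-squares fit.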

Outliers and model explainability

Model explainability can help identify whether outliers are influencing model behaviour.

For example, feature importance may show that one unusual variable has too much influence. Local explanations may show why a model produced an extreme prediction for one case. Error analysis may reveal that the model fails mainly on rare examples.

Outliers are often useful during model debugging because they show where assumptions break.

However, explanations should be interpreted carefully. A model may explain an outlier prediction using features that are correlated, noisy or affected by data leakage.

Outliers and AI systems

Modern AI systems often work with high-dimensional data, embeddings and complex model outputs. Outliers can appear not only in raw data, but also in prompts, retrieved documents, embeddings, tool calls and generated outputs.

For example, an AI agent may receive a user request that is very different from normal requests. A retrieval system may find a document that is far from expected content. A monitoring system may detect an unusual pattern in tool usage.

These outliers can indicate:

  • edge cases,
  • misuse attempts,
  • prompt injection,
  • data drift,
  • system errors,
  • new user needs,
  • gaps in evaluation.

Outliers and prompt injection

In systems based on large language models, unusual inputs can sometimes be security-relevant.

For example, a prompt that looks very different from normal user requests may be a harmless edge case. It may also be an attempt at prompt injection or jailbreaking.

This does not mean every unusual prompt is malicious. It means that outlier monitoring can be useful in AI safety and security.

A good AI system should log and review unusual requests, especially when they involve tool use, sensitive data or attempts to override instructions.

Outliers and monitoring

Outlier detection is often used in monitoring systems.

Examples include:

  • traffic spikes,
  • conversion rate drops,
  • server latency increases,
  • unusual API usage,
  • unexpected model prediction drift,
  • abnormal tool calls by AI agents,
  • unusual cost increases,
  • rare error patterns.

Monitoring outliers helps teams detect problems early. But alerts should be designed carefully. Too many false alarms can make people ignore the system.

Common mistakes with outliers

Outliers are easy to mishandle.

Common mistakes include:

  • removing outliers automatically – unusual values may be valid and important,
  • ignoring outliers completely – they may distort analysis or reveal problems,
  • using one threshold blindly – different segments may need different rules,
  • not checking for data errors – impossible values should be corrected or removed,
  • confusing rare with wrong – rare events can be real,
  • using mean when median is better – averages can be distorted by extremes,
  • forgetting time context – seasonal spikes may be normal,
  • creating data leakage – outlier treatment must be part of a valid pipeline.

When outliers are useful

Outliers can be extremely useful because they reveal what normal data hides.

They can show:

  • fraud,
  • system failures,
  • rare customer segments,
  • high-value opportunities,
  • bad tracking implementation,
  • new market behaviour,
  • edge cases for AI evaluation,
  • early warning signs in operations.

In many projects, the outliers are not the data to throw away. They are the data to understand first.

How to document outlier handling

Outlier treatment should be documented so that the analysis can be reviewed and repeated.

Good documentation should include:

  • which variables were checked,
  • which method was used,
  • which threshold was applied,
  • how many values were affected,
  • whether values were removed, capped, transformed or flagged,
  • why the decision was made,
  • how the decision affected model performance,
  • whether domain experts reviewed the result.

This is important for trust, auditing and future model maintenance.

How to remember outlier

An outlier can be compared to one person standing far away from a crowd. That person may be lost, suspicious, important or simply different. You cannot tell which from distance alone.

The same is true in data. An outlier is a signal to investigate, not an automatic instruction to delete.

Outlier = an unusual data point. The key question is why it is unusual and whether it should be corrected, removed, transformed, flagged or investigated.

Related terms

  • Anomaly – an unusual observation or pattern that may indicate something important or abnormal.
  • Noise – random variation or unwanted disturbance in data.
  • Data error – an incorrect value caused by measurement, entry, tracking or processing problems.
  • Outlier detection – methods used to identify unusual observations in a dataset.
  • Novelty detection – detecting new observations that differ from normal training data.
  • Isolation Forest – a machine learning method for detecting unusual points by isolating them through random splits.
  • Local Outlier Factor – a method that detects points with lower local density than their neighbours.
  • One-Class SVM – a method that learns a boundary around normal data.
  • Autoencoder – a neural network that can detect unusual data through reconstruction error.
  • Overfitting – a situation where a model learns training data too closely and performs poorly on new data.
  • Feature selection – selecting useful original variables and removing irrelevant or noisy ones.
  • Curse of dimensionality – a problem where high-dimensional spaces become sparse and harder to learn from reliably.
  • Embedding – a numerical representation of content or objects, often used for similarity search and clustering.
  • UMAP – a nonlinear dimensionality reduction method often used for visualising high-dimensional data.
  • t-SNE – a method used mainly for visualising high-dimensional data in two or three dimensions.
  • Machine learning – the broader field in which systems learn patterns from data and use them for prediction, classification or analysis.
  • Prompt injection – an attack or failure mode where external content tries to manipulate an AI system’s instructions.
  • Jailbreaking – attempts to bypass a model’s safety rules or restrictions.

Sources and further reading

  • What are outliers in the data? – itl.nist.gov – June 2026 – defines an outlier as an observation that lies an abnormal distance from other values in a random sample.
  • Detection of Outliers – itl.nist.gov – June 2026 – explains why identifying potential outliers is important in exploratory data analysis.
  • Novelty and Outlier Detection – scikit-learn.org – June 2026 – explains outlier detection and novelty detection as anomaly detection tasks.
  • IsolationForest – scikit-learn.org – June 2026 – documents Isolation Forest as an algorithm for detecting outliers by isolating observations.
  • LocalOutlierFactor – scikit-learn.org – June 2026 – documents Local Outlier Factor for detecting observations with abnormal local density.
  • OneClassSVM – scikit-learn.org – June 2026 – documents One-Class SVM for unsupervised outlier detection.
  • What Is Anomaly Detection? – ibm.com – June 2026 – explains anomaly detection and the use of machine learning for identifying unusual patterns.
  • Anomaly Detection in Machine Learning – ibm.com – June 2026 – explains supervised, unsupervised and semi-supervised anomaly detection approaches.
  • Numerical data: Normalization – developers.google.com – June 2026 – explains normalization, log scaling and the effect of outliers in numerical data.
  • Preparing and curating your data for machine learning – cloud.google.com – June 2026 – discusses data preparation steps including transformations and replacing outliers depending on the model.
