Dimensionality reduction
Dimensionality reduction is a process in which data with a large number of variables is transformed into a simpler form with fewer dimensions. The goal is not to make the data smaller at any cost, but to preserve as much useful information as possible and remove what is redundant, irrelevant or unhelpful for the specific task.
In machine learning, dimensionality reduction is used mainly when a dataset has too many input variables, often called features. These can be columns in a table, pixels in an image, numerical representations of words, document embeddings or signals from different measurements. The more such variables a model receives, the more difficult training, computation and interpretation can become.
Dimensionality reduction can simplify the model, reduce computational cost, remove noise from the data and sometimes improve the model’s ability to generalise to new data. It is not, however, a step that can be applied automatically and always be correct. If used poorly, it can also remove information that is important for the final result.
Dimensionality reduction means making overly complex data easier to work with. Instead of hundreds or thousands of variables, the goal is to keep the ones that matter most for the task, or to create a smaller number of new variables that still preserve the key information.
What dimension means in data
A dimension in data is one property, one direction of description or one column in a table. If you have a customer table with columns such as age, city, number of purchases, order value and last website visit, you are working with five variables. If the same table has 300 columns, the data already has a high number of dimensions.
With ordinary tables, this is fairly easy to imagine. Each column is one dimension. With other types of data, it is less obvious. An image can be described by millions of pixels. Text can be converted into a numerical vector. A single document can have an embedding made of hundreds or thousands of numbers. Each of these numbers is one dimension in the space where the model works with the data.
People can easily imagine a two-dimensional space, such as a chart with an X and Y axis. A three-dimensional space is still understandable. Computers, however, can work with spaces that have hundreds or thousands of dimensions. This is possible mathematically, but it creates practical problems.
When a table has 10 columns, a person can still roughly understand what is happening in the data. When it has 10,000 columns, it becomes almost impossible to see manually which relationships matter, which repeat the same information and which are just noise.
Example: a table of houses
Imagine a dataset about houses. Each house is described by one hundred columns.
The columns include, among others, floor area, location, building age, heating type, energy class, land size, number of rooms, roof condition, distance to school, distance to public transport, window material, facade colour, door handle type and garden orientation.
If you want to estimate house price, some variables will be very important. Location, floor area, technical condition, land size, energy efficiency and access to services may be strong signals. Other variables may matter less. Facade colour may influence the impression of a buyer in some cases, but it is unlikely to be as important as location or property size for a basic price estimate.
Dimensionality reduction in this case is similar to the work of an experienced analyst who looks at the table and says: “These one hundred variables are too much for this task. The model mainly needs the variables that actually help explain the price.”
By contrast, door handle type, facade colour or decorative details are likely to be only weak signals for house price prediction. If they stay in the dataset without anyone checking their usefulness, they add noise and computation without improving the estimate.
Why too many variables can be a problem
At first glance, it may seem that the more data a model receives, the better. In machine learning, this is not always true. More variables do not automatically mean more useful information. They can also mean more noise, more duplicated signals, more accidental relationships and a higher risk of overfitting.
The model may start learning relationships that exist only in the training data but do not repeat in the real world. This is similar to a student memorising answers from one test without understanding the subject. The student may pass the same test, but fail when the questions change.
Datasets with many dimensions also increase computational demands. Model training can be slower, storage can become more expensive and tuning can be more complicated. It also becomes harder to explain why the model made a particular decision.
- Computational cost increases – the model has to process more input values.
- The risk of overfitting increases – the model may learn accidental patterns from training data.
- The data becomes harder to understand – people have more difficulty seeing which variables really matter.
- More noise enters the model – some variables may hurt the model more than help it.
- Similarity becomes harder to measure – in high-dimensional spaces, distances between points tend to become nearly equal, so the nearest and the farthest neighbour are harder to tell apart.
More columns are not automatically an advantage. If a model receives many weak, duplicated or random variables, it may learn worse than with a smaller but cleaner set of inputs.
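The effect of weak or random columns can be illustrated with a small experiment. The sketch below is only an illustration on synthetic data: the five "informative" columns, their weights, the 50 noise columns and the helper names `make_data` and `holdout_mse` are all assumptions made for this example. It fits ordinary least squares once on all columns and once on the informative columns only, and compares the error on rows the model has not seen.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n_rows, n_noise):
    """Five informative columns followed by n_noise columns of pure noise."""
    informative = rng.normal(size=(n_rows, 5))
    noise_columns = rng.normal(size=(n_rows, n_noise))
    true_weights = np.array([3.0, -2.0, 1.5, 1.0, -1.0])  # assumed for the demo
    y = informative @ true_weights + rng.normal(scale=1.0, size=n_rows)
    return np.hstack([informative, noise_columns]), y

def holdout_mse(X_train, y_train, X_new, y_new):
    """Fit least squares on the training rows, return mean squared error on new rows."""
    weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return float(np.mean((X_new @ weights - y_new) ** 2))

# Few training rows relative to the number of columns, plenty of new rows.
X_train, y_train = make_data(60, n_noise=50)
X_new, y_new = make_data(1000, n_noise=50)

mse_all_columns = holdout_mse(X_train, y_train, X_new, y_new)
mse_informative_only = holdout_mse(X_train[:, :5], y_train, X_new[:, :5], y_new)

print("error on new data, all 55 columns:     ", round(mse_all_columns, 2))
print("error on new data, 5 informative only: ", round(mse_informative_only, 2))
```

With only 60 training rows, the model that also sees the 50 noise columns has enough freedom to fit accidental patterns, which is why its error on new rows is expected to be clearly higher. In real data the informative columns are not known in advance, so this is a demonstration of the risk, not a selection method.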
The curse of dimensionality
The concept of the curse of dimensionality is closely connected with dimensionality reduction.
It describes a situation where the complexity of the space in which the model learns grows rapidly as the number of dimensions increases. Data becomes sparse in such a space, and the model needs far more examples to find stable relationships.
A simple example: if you want to separate points on a line, you work with one dimension. On a plane, you work with two dimensions, and in space with three. The number of possible combinations grows exponentially with each additional dimension: if every axis is divided into 10 intervals, a line has 10 cells, a plane has 100 and a three-dimensional space has 1,000. If the number of data examples does not grow at a similar pace, most of these cells stay empty, and the model has less reliable information about how to make decisions in that space.
In practice, this means that the model may find patterns in the training data that look convincing but are actually random. A smaller number of high-quality variables can produce a more stable result than a huge number of weak inputs.
The more dimensions you have in the data, the larger the space the model has to search. If there are not enough examples, the model can easily invent relationships that do not generalise.
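Both effects described in this section can be checked with a few lines of standard-library Python. The following is a rough sketch on uniformly random points, which is an assumption and not how real data behaves. It counts how many grid cells are needed at 10 intervals per axis, and it measures the "relative contrast" between the farthest and the nearest point from a query point (the helper `relative_contrast` is a name chosen for this example).

```python
import math
import random

random.seed(0)

def relative_contrast(n_points, n_dims):
    """(farthest - nearest) / nearest distance from one random query point."""
    query = [random.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [random.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)

# Number of cells needed to cover the space at 10 intervals per axis.
for n_dims in (1, 2, 3, 10):
    print(n_dims, "dimensions ->", 10 ** n_dims, "cells")

contrast_2d = relative_contrast(500, 2)
contrast_500d = relative_contrast(500, 500)
print("relative contrast in 2 dimensions:  ", round(contrast_2d, 2))
print("relative contrast in 500 dimensions:", round(contrast_500d, 2))
```

In two dimensions the nearest point is typically many times closer than the farthest one. In 500 dimensions all points end up at a similar distance from the query, which is why similarity-based methods become less reliable without more data or fewer dimensions.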
Two main approaches to dimensionality reduction
Dimensionality reduction is usually divided into two main approaches. The first is feature selection. The second is feature extraction. Both have the same goal – to simplify the data. They differ in how they work with the original variables.
Feature selection
Feature selection means keeping only the most important variables from the original dataset. The original columns do not change. The method only decides which of them should be used and which should be left out.
If you have a table with 100 columns and the analysis selects the 20 most important ones, you are still working with original variables. You are just using fewer of them. This is useful especially when you need to explain the result to a person. You still know that the model works with location, floor area, property condition or energy class.
Feature selection can be done in different ways. Sometimes it is based on correlation, sometimes on feature importance in a model, sometimes on statistical tests and sometimes on expert knowledge of the domain. In practice, the best approach is often to combine technical analysis with human judgement.
The advantage of feature selection: the resulting data remains understandable. If you keep original columns, you can more easily explain why the model uses those specific inputs.
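One of the simplest forms of feature selection, ranking columns by their correlation with the target, can be sketched as follows. The column names, the synthetic price formula and the helper `select_top_k` are assumptions made for illustration. Correlation only captures linear, one-column-at-a-time relationships, so in practice it would be combined with other checks and with domain knowledge.

```python
import numpy as np

rng = np.random.default_rng(0)

column_names = [
    "floor_area", "location_score", "facade_colour_code",
    "energy_class", "door_handle_code", "garden_orientation",
]
X = rng.normal(size=(500, len(column_names)))

# Hypothetical target: price depends on three of the six columns plus noise.
price = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 3] + rng.normal(size=500)

def select_top_k(X, y, k):
    """Return the indices of the k columns most correlated with y (in column order)."""
    correlations = np.array(
        [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]
    )
    strongest = np.argsort(-np.abs(correlations))[:k]
    return np.sort(strongest), correlations

selected, correlations = select_top_k(X, price, k=3)
X_selected = X[:, selected]  # still original columns, just fewer of them
selected_names = [column_names[j] for j in selected]

print("kept columns:", selected_names)
print("shape before:", X.shape, "after:", X_selected.shape)
```

Because the kept columns are unchanged originals, the reduced table can still be read and explained in the same terms as the full one.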
Feature extraction
Feature extraction works differently. It does not only select some of the original variables. Instead, it creates new variables as combinations of the original ones. These new variables can summarise information from several columns at once.
Imagine you have several variables describing the size of a property: living area, number of rooms, land size, number of floors and built-up area. These variables may be related. Feature extraction can create a new combined variable that expresses the general size or spatial complexity of the property.
The advantage is that this method can simplify the data significantly. The disadvantage is lower interpretability. A newly created variable may not have a simple human meaning. It can be mathematically useful but harder to explain.
Feature selection keeps some of the original columns. Feature extraction creates new columns that summarise the original data.
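The difference between the two approaches can be shown on the property-size example. This is a minimal sketch under stated assumptions: the five size columns are synthetic, driven by a hypothetical hidden `overall_size`, and the extracted variable is a hand-made composite (the average of the standardised columns), which is one simple way to combine columns and not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_houses = 300

# Hypothetical hidden "overall size" that drives five related size columns:
# living area, number of rooms, land size, number of floors, built-up area.
overall_size = rng.normal(size=n_houses)
scales = np.array([40.0, 1.5, 300.0, 0.5, 30.0])
offsets = np.array([120.0, 4.0, 600.0, 2.0, 90.0])
column_noise = rng.normal(size=(n_houses, 5)) * scales * 0.5
size_columns = offsets + np.outer(overall_size, scales) + column_noise

# Feature selection: keep two of the original columns unchanged.
selected = size_columns[:, [0, 2]]

# Feature extraction: build one new column that summarises all five.
standardised = (size_columns - size_columns.mean(axis=0)) / size_columns.std(axis=0)
size_score = standardised.mean(axis=1, keepdims=True)

agreement = np.corrcoef(size_score[:, 0], overall_size)[0, 1]
print("selected shape:", selected.shape)
print("extracted shape:", size_score.shape)
print("correlation of the new column with the hidden size:", round(agreement, 3))
```

The extracted `size_score` has no unit and no direct meaning such as square metres, which illustrates the interpretability cost mentioned above.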
PCA – principal component analysis
PCA, or Principal Component Analysis, is one of the best-known dimensionality reduction methods. It is a linear method that looks for new axes in the data so that they capture as much variance in the original variables as possible.
In simple terms, PCA tries to find directions in which the data differs the most. The first principal component captures the largest part of the variance. The second component captures the largest part of what remains, in a direction perpendicular to the first. The process can continue further, and the user then decides how many components to keep.
PCA is often used when you have many numerical variables that are related to each other. It can reduce the number of inputs before model training, remove redundancy or prepare data for visualisation.
- Before modelling – PCA can reduce the number of inputs for an algorithm.
- Data visualisation – data with many variables can be converted into two or three components.
- Removing redundancy – if variables carry similar information, PCA can compress it.
- Faster computation – fewer inputs often mean faster training.
PCA is not a universal solution. It preserves mainly variance, not necessarily what matters most for a specific prediction task. If a variable is important from a domain perspective but has low variance, PCA may push it into a late component that is then discarded. That is why the result must always be checked against the actual goal.
PCA works with mathematical variance in the data. This does not mean it automatically preserves every variable that is important from a business, medical, legal or expert perspective.
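The idea of finding new axes ordered by variance can be sketched with NumPy alone. The function `pca` below is an illustrative helper written for this example, using the singular value decomposition of the centred data. The synthetic table (six columns driven by two hidden factors) is an assumption. Libraries such as scikit-learn provide a ready-made PCA, so this is only meant to show what happens underneath.

```python
import numpy as np

def pca(X, n_components):
    """Return (projected data, principal directions, explained-variance ratios)."""
    X_centred = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centred, full_matrices=False)
    explained_variance = singular_values ** 2 / (X.shape[0] - 1)
    ratios = explained_variance / explained_variance.sum()
    components = Vt[:n_components]
    return X_centred @ components.T, components, ratios

rng = np.random.default_rng(0)

# Six related columns that really depend on only two hidden factors.
hidden = rng.normal(size=(200, 2))
mixing = np.array([
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
])
X = hidden @ mixing + rng.normal(scale=0.05, size=(200, 6))

projected, components, ratios = pca(X, n_components=2)

print("shape before:", X.shape, "after:", projected.shape)
print("share of variance per component:", np.round(ratios, 3))
```

Here the first two components carry almost all of the variance because the data was built that way. On real data the shares are usually less clear-cut, and a high share of variance still says nothing about usefulness for a particular prediction task.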
t-SNE – visualising complex data
t-SNE (t-distributed stochastic neighbour embedding) is a method often used for visualising high-dimensional data. Its goal is to convert complex data into two or three dimensions so that clusters and similarities between points can be seen more easily.
In practice, t-SNE is used with images, text embeddings, biological data and other datasets where we want to see whether similar objects naturally group together. It can help reveal that some groups of data are close to each other, while others are clearly different.
It is important to understand that t-SNE is mainly a tool for exploration and visualisation. The resulting chart can be very useful, but it should not be read too literally. Distances between clusters, cluster sizes and exact point positions do not always have a simple meaning, and the picture can change with the perplexity setting and the random starting position.
If you have thousands of text documents converted into embeddings, t-SNE can help display them on a map. Similar documents may appear close to each other and form thematic clusters.
Important warning: a nice t-SNE chart is not proof that the data has exactly the structure you see. It is an analytical aid, not a final explanation of reality.
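A typical t-SNE workflow can be sketched with scikit-learn's `TSNE` class. The example assumes scikit-learn is installed and uses three synthetic, well-separated groups of 50-dimensional vectors as a stand-in for real document embeddings. The group sizes, the perplexity value and the helper `neighbour_purity` are choices made for this illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for embeddings: three groups of 50-dimensional vectors.
n_per_group, n_dims = 30, 50
centres = rng.normal(scale=10.0, size=(3, n_dims))
X = np.vstack([c + rng.normal(size=(n_per_group, n_dims)) for c in centres])
labels = np.repeat(np.arange(3), n_per_group)

# Perplexity must be smaller than the number of points; 10 is an arbitrary choice.
embedding = TSNE(n_components=2, perplexity=10.0, random_state=0).fit_transform(X)

def neighbour_purity(points, labels):
    """Share of points whose nearest neighbour on the map has the same label."""
    distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(distances, np.inf)
    return float(np.mean(labels[distances.argmin(axis=1)] == labels))

purity = neighbour_purity(embedding, labels)
print("map shape:", embedding.shape)
print("nearest neighbour has the same group:", round(purity, 2))
```

The groups here are separated on purpose, so the map is expected to show them clearly. With real data it is worth rerunning with several perplexity values and seeds, and confirming any apparent cluster with a check such as the purity score above or with known labels, before drawing conclusions from the picture.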
UMAP – fast nonlinear dimensionality reduction
UMAP, or Uniform Manifold Approximation and Projection, is another method used for dimensionality reduction and data visualisation. Like t-SNE, it helps convert complex high-dimensional data into a more readable form, often into two or three dimensions.
UMAP is popular mainly because it is often fast and practical even for larger datasets. It is commonly used with embeddings, biological data, image data and customer segments. It can be used for visualisation, but also as a more general nonlinear dimensionality reduction method.
As with t-SNE, the result must be interpreted carefully. UMAP can show interesting structure, but it does not mean that every cluster automatically corresponds to a real category. The chart is the beginning of analysis, not the end.
t-SNE and UMAP are most useful when you want to look at complex data. They can convert thousands of dimensions into a readable map, but their output should be verified with further analysis.
Autoencoders and neural networks
Dimensionality reduction can also be done with neural networks. A typical example is an autoencoder. It is a model that learns to compress data and then reconstruct it. In the middle of the model sits a narrow layer, often called the bottleneck, which has to capture the important information in fewer values.
An autoencoder can be imagined as a system that receives a complex input, such as an image, and tries to turn it into a shorter internal representation. It then tries to reconstruct the original input from that internal representation. If it does this well, it means it has learned to capture important features of the data.
Autoencoders can be useful for complex data where simple linear methods are not enough. They are used with images, audio, text and sensor data, and for anomaly detection. Their disadvantages are greater complexity, higher data requirements and lower explainability.
An autoencoder is like a person who makes short notes from a long article and then tries to retell the article from those notes. If the notes contain what is important, the result will still make sense.
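The compress-then-reconstruct idea can be shown with a deliberately tiny example. The sketch below is a linear autoencoder written in plain NumPy and trained with gradient descent. Real autoencoders use nonlinear layers and a deep-learning framework, so this is a simplification chosen to keep the example short. The synthetic data, the code size of 2, the learning rate and the number of steps are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: six observed columns driven by two hidden factors plus a little noise.
hidden = rng.normal(size=(200, 2))
mixing = np.array([
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
])
X = hidden @ mixing + rng.normal(scale=0.05, size=(200, 6))
X = X - X.mean(axis=0)

n_features, n_code = X.shape[1], 2
encoder = rng.normal(scale=0.1, size=(n_features, n_code))  # input -> short code
decoder = rng.normal(scale=0.1, size=(n_code, n_features))  # short code -> reconstruction

def reconstruction_loss(X, encoder, decoder):
    """Mean squared difference between the input and its reconstruction."""
    return float(np.mean((X @ encoder @ decoder - X) ** 2))

initial_loss = reconstruction_loss(X, encoder, decoder)

learning_rate = 0.05
for _ in range(2000):
    code = X @ encoder                  # compress
    residual = code @ decoder - X       # reconstruct and compare
    scale = 2.0 / residual.size
    grad_decoder = scale * code.T @ residual
    grad_encoder = scale * X.T @ (residual @ decoder.T)
    encoder -= learning_rate * grad_encoder
    decoder -= learning_rate * grad_decoder

final_loss = reconstruction_loss(X, encoder, decoder)
code = X @ encoder  # the reduced, two-value representation of every row

print("loss before training:", round(initial_loss, 4))
print("loss after training: ", round(final_loss, 4))
print("code shape:", code.shape)
```

The two values in `code` play the role of the short notes in the analogy above: if the reconstruction error is low, they contain what is needed to retell the original six columns.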
Dimensionality reduction and embeddings
Modern artificial intelligence often works with embeddings. An embedding is a numerical representation of content. It can represent a word, sentence, document, image, product or user. Such an embedding can have hundreds or thousands of dimensions.
With embeddings, dimensionality reduction is mainly used for two reasons. The first is visualisation. If you want to see how similar documents, images or products are, you can convert their embeddings into a two-dimensional map. The second is optimisation. Smaller vectors can be faster to store, search and compare.
However, it is important to make sure that reducing an embedding does not remove too much semantic information. A smaller vector can be more practical, but if it no longer distinguishes important differences between objects, the model or search system can become worse.
An e-commerce store can convert every product into an embedding. This makes it possible to find similar products by meaning, not only by the same name. Dimensionality reduction can help visualise these products on a similarity map or speed up work with large numbers of vectors.
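Whether a shortened embedding still distinguishes similar items from different ones can be checked directly. The following sketch uses synthetic vectors as a stand-in for real embeddings (256 numbers per item that mostly vary along 16 hidden directions, which is an assumption), reduces them with a PCA-style projection and compares each item's ten nearest neighbours before and after. The helper names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_items, n_dims, n_hidden = 300, 256, 16
hidden = rng.normal(size=(n_items, n_hidden))
embeddings = hidden @ rng.normal(size=(n_hidden, n_dims))
embeddings += rng.normal(scale=0.1, size=(n_items, n_dims))
embeddings -= embeddings.mean(axis=0)

def reduce_with_pca(X, n_components):
    """Project centred data onto its first n_components principal directions."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

def top_neighbours(X, k):
    """For every row, the indices of the k most cosine-similar other rows."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    similarity = unit @ unit.T
    np.fill_diagonal(similarity, -np.inf)
    return np.argsort(-similarity, axis=1)[:, :k]

def neighbour_overlap(X_full, X_small, k=10):
    """Average share of the k nearest neighbours that survive the reduction."""
    full, small = top_neighbours(X_full, k), top_neighbours(X_small, k)
    shared = [len(set(a) & set(b)) for a, b in zip(full, small)]
    return float(np.mean(shared)) / k

overlap_32 = neighbour_overlap(embeddings, reduce_with_pca(embeddings, 32))
overlap_2 = neighbour_overlap(embeddings, reduce_with_pca(embeddings, 2))

print("neighbours kept with 32 dimensions:", round(overlap_32, 2))
print("neighbours kept with 2 dimensions: ", round(overlap_2, 2))
```

A two-dimensional version is fine for drawing a map, but the low overlap suggests it should not replace the full vectors in a search system. How far real embeddings can be shortened depends on the model that produced them and has to be measured on real queries.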
How to decide how many dimensions to keep
There is no universal number that works for every dataset. The right number of dimensions depends on what you are trying to solve, how much data you have, how good the input variables are and what model you use.
With PCA, the usual measure is explained variance, the share of the total variance that each component captures. If the first few components capture most of the variance in the data, it may make sense to work only with them. But this should not be the only criterion. It is also important to check how the final model behaves on validation or test data after dimensionality reduction.
In practice, the key questions are:
- Have we preserved enough important information?
- Did the model or computation become faster?
- Did performance improve on new data?
- Did we lose variables that are important from a domain perspective?
- Is the result still explainable?
- Did we create only a nicer chart without practical value?
Dimensionality reduction makes sense when it simplifies the data without making the model worse at solving the real task. If it only makes the table smaller but worsens the result, it is not a good reduction.
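The explained-variance check mentioned for PCA can be sketched as a small helper. The function `components_for_variance` and the 95 percent threshold are assumptions made for this example. Such thresholds are conventions, and the resulting number is a starting point that still needs to be confirmed on validation data.

```python
import numpy as np

def components_for_variance(X, threshold=0.95):
    """Smallest number of principal components whose cumulative share of
    variance reaches the threshold, plus the cumulative shares themselves."""
    X_centred = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centred, compute_uv=False)
    ratios = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

rng = np.random.default_rng(0)

# Synthetic table: ten columns that depend on three hidden factors plus noise.
hidden = rng.normal(size=(400, 3))
X = hidden @ rng.normal(size=(3, 10)) + rng.normal(scale=0.05, size=(400, 10))

k, cumulative = components_for_variance(X, threshold=0.95)

print("cumulative share of variance:", np.round(cumulative, 3))
print("components needed for 95 %:", k)
```

Because the table was generated from three hidden factors, at most three components are needed here. The same helper can be run on real data, followed by training the model with `k` components and comparing its validation score with the score on the full table.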
Where dimensionality reduction is used
Dimensionality reduction is used in data analytics, machine learning, artificial intelligence, bioinformatics, marketing, image processing and text analysis. It is not just an academic technique, but a practical tool for working with large and complex datasets.
- Machine learning – reducing the number of input variables before model training.
- Data visualisation – converting complex data into two or three dimensions.
- Image processing – compressing visual information and finding important visual features.
- Text analytics – working with embeddings, topics and semantic similarity.
- Recommendation systems – simplifying relationships between users, products and behaviour.
- Bioinformatics – analysing datasets with very large numbers of variables, such as gene expression data.
- Anomaly detection – finding unusual patterns after simplifying the data space.
- Marketing analytics – segmenting customers based on many different signals.
Typical mistakes in dimensionality reduction
Dimensionality reduction can be very useful, but only if it is used carefully. The biggest mistake is treating it as an automatic technical step without considering the purpose of the analysis. Data should not be reduced only because a smaller table looks cleaner.
- Reducing dimensions before splitting the data – if the method is fitted on the whole dataset before creating training and test sets, data leakage can occur.
- Removing an important signal – some variables may be rare but essential for the problem.
- Incorrect scaling – with methods such as PCA, variables with larger ranges can dominate the result unless the data is first brought to a comparable scale, for example by standardising each column.
- Trusting the chart too much – visualisations from t-SNE or UMAP can be useful, but they may not describe reality precisely.
- Losing interpretability – new components may work mathematically, but people may not understand their meaning.
- Ignoring domain context – a purely technical choice of variables may ignore business, legal, medical or other expert knowledge.
An analyst may reduce the number of variables, speed up the model and produce a nice chart. But unless performance is checked on new data, it is not clear whether dimensionality reduction actually helped or only hid a problem.
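Two of the mistakes listed above, fitting the reduction before splitting the data and skipping scaling, can be avoided with a fixed order of steps. The sketch below is one possible arrangement in plain NumPy. The three synthetic columns and the helpers `fit_scaled_pca` and `transform` are assumptions made for illustration. The split comes first, and the means, standard deviations and principal directions are computed from the training rows only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 400

# Hypothetical columns on very different scales.
X = np.column_stack([
    rng.normal(loc=5000.0, scale=1000.0, size=n_rows),  # e.g. price in thousands
    rng.normal(loc=0.5, scale=0.1, size=n_rows),        # e.g. a ratio
    rng.normal(loc=3.0, scale=1.0, size=n_rows),        # e.g. number of rooms
])

# 1. Split first, so the held-out rows never influence the fitted statistics.
order = rng.permutation(n_rows)
train, held_out = X[order[:300]], X[order[300:]]

def fit_scaled_pca(X_train, n_components):
    """Learn scaling statistics and principal directions from training rows only."""
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    _, _, Vt = np.linalg.svd((X_train - mean) / std, full_matrices=False)
    return mean, std, Vt[:n_components]

def transform(X, mean, std, components):
    """Apply the already fitted scaling and projection to any rows."""
    return ((X - mean) / std) @ components.T

# 2. Fit on the training rows, then 3. apply the same transformation to both parts.
mean, std, components = fit_scaled_pca(train, n_components=2)
train_reduced = transform(train, mean, std, components)
held_out_reduced = transform(held_out, mean, std, components)

# For contrast: without scaling, the first direction is almost entirely column 0.
_, _, Vt_unscaled = np.linalg.svd(train - train.mean(axis=0), full_matrices=False)
unscaled_first_loading = abs(Vt_unscaled[0, 0])

print("reduced shapes:", train_reduced.shape, held_out_reduced.shape)
print("weight of the large-range column without scaling:", round(unscaled_first_loading, 4))
```

The held-out rows are transformed with statistics they did not help to compute, so a score measured on them is a fairer estimate of behaviour on new data.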
When dimensionality reduction should not be used
Dimensionality reduction is not always useful. If you have a small number of well-chosen and well-understood variables, there is no need to transform them at any cost. Sometimes it is better to work with the original columns because they are clear and easy to inspect.
Caution is especially important when model decisions must be explained. If a model helps decide about credit, health risk, insurance claims or another sensitive outcome, it can be a problem if you cannot explain what a newly created component actually means.
Dimensionality reduction should not replace understanding of the data. It should be a tool that helps, not a magic button that automatically fixes poor-quality inputs.
The goal is not to have as few variables as possible. The goal is to have enough useful variables for good model decisions while still allowing people to review the result.
Dimensionality reduction in everyday life
A similar principle appears outside machine learning as well. When choosing a phone, you do not compare every individual component. You care about price, camera quality, battery life, screen size, performance and brand. Dozens of technical parameters are simplified into a few main criteria.
When a recruiter evaluates a candidate, they also do not treat every detail in the CV equally. Education, experience, specific skills, language knowledge and relevant projects usually matter more than minor details. Some information may be interesting, but not decisive for the position.
Dimensionality reduction works in a similar way. It does not say that other information does not exist. It tries to separate what matters from what does not have enough weight for a specific decision.
How dimensionality reduction relates to data quality
Data quality is essential for machine learning. Even the best algorithm will not help if it receives inaccurate, incorrect or poorly prepared inputs. Dimensionality reduction can remove some noise, but it cannot fix a poor data collection process, wrong labels or a misunderstood business problem.
If the dataset contains duplicates, extreme values, missing values or incorrectly created variables, these issues need to be handled before dimensionality reduction or together with it. Otherwise, the method may only compress poor-quality data into a smaller but still poor-quality form.
Dimensionality reduction does not replace data cleaning. If the inputs are poor, reducing the number of dimensions will not solve the problem by itself.
How to remember dimensionality reduction
You can think of dimensionality reduction as simplifying a complex map. A world map could contain every road, every elevation point, every land boundary and every surface type. But if you only want to find a route from Prague to Brno, you do not need that much detail. You need a simpler map that keeps what matters for your specific purpose.
The same applies to data. For one task, location, price and size may be important. For another task, user behaviour, visit time and device type may matter more. Dimensionality reduction helps find a form of data that is simpler but still useful.
Dimensionality reduction means selecting or creating a smaller number of variables that preserve the most important information for the task.
Related terms
- Machine learning – the broader field in which systems learn from data instead of relying only on fixed rules. Dimensionality reduction is one way to prepare data in a simpler and more useful form.
- Embedding – a numerical representation of content such as text, images or documents. Embeddings often have hundreds or thousands of dimensions, so dimensionality reduction can help with visualisation, similarity analysis or optimisation.
- Large language model (LLM) – a language-focused AI model. It is related indirectly because modern language models work with vectors, embeddings and large numerical representations of text.
- Feature selection – choosing the most important original variables. It is one of the two main approaches to dimensionality reduction because it reduces the number of columns by leaving some of them out.
- Feature extraction – creating new variables from original data. Unlike feature selection, it does not only keep original columns, but creates a new compact representation.
- PCA – principal component analysis. One of the best-known dimensionality reduction methods, often used with numerical data to reduce variables while preserving a large part of the variance.
- t-SNE – a method used mainly for visualising high-dimensional data. It helps convert complex representations into two or three dimensions so that clusters and similarities can be seen.
- UMAP – a method for nonlinear dimensionality reduction, often used in a similar way to t-SNE, especially for visualising embeddings, biological data or larger datasets.
- Overfitting – a situation where a model learns training data too closely and performs worse on new data. Dimensionality reduction can reduce this risk by removing some unnecessary or noisy inputs.
- Autoencoder – a neural network that learns to compress data and reconstruct it. It can be used as a more advanced dimensionality reduction method for complex data such as images or signals.