PCA (principal component analysis)

May 22, 2026 in AI&ChatGPT

PCA, short for principal component analysis, is one of the best-known dimensionality reduction methods. It is often used with numerical data to reduce the number of variables while preserving a large part of the variance in the original dataset.

In machine learning, datasets often contain many input variables. Some of them carry useful information, some repeat similar information and some mostly add noise. PCA helps transform such data into a simpler form by creating a smaller number of new variables called principal components.

The purpose of PCA is not to make data smaller at any cost. The purpose is to keep the main structure of the data while reducing complexity. If the original dataset has 100 numerical columns, PCA may transform it into 5, 10 or 20 principal components that still preserve most of the important variation.

PCA transforms many numerical variables into fewer new variables called principal components. These components are built so that the first ones capture the largest part of the variance in the data.

What PCA means

Principal component analysis is a statistical method for simplifying complex numerical data. It looks for directions in the data where values vary the most. These directions become principal components.

A principal component is usually not one of the original columns. It is a new variable created as a weighted (linear) combination of the original variables. This is why PCA is different from simply choosing a few columns and deleting the rest.

Imagine a dataset with many measurements about houses: floor area, land size, number of rooms, number of floors, built-up area and storage space. These variables may all describe the broader idea of property size. PCA can create one component that captures much of this shared size-related information.

The original variables remain the source, but the model or analyst can work with fewer components instead of many overlapping columns.
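The house example can be sketched with synthetic data. This is a minimal illustration, not a real dataset: the hidden `size` factor, the column names and the noise level are all assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical hidden "property size" factor driving all six measurements.
size = rng.normal(size=n)
columns = ["floor_area", "land_size", "rooms", "floors", "built_up_area", "storage"]
X = np.column_stack([size + 0.3 * rng.normal(size=n) for _ in columns])

# Standardise, then take the eigenvalues of the covariance matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]  # descending order
share = eigenvalues / eigenvalues.sum()

print(f"Share of variance on the first component: {share[0]:.2f}")
```

Because the six synthetic columns are built from one shared factor, most of the variance lands on a single component. Real housing data would rarely be this clean.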

A simple example of PCA

Imagine that you have a dataset about customers. Each customer is described by many columns: number of purchases, total spend, average order value, number of website visits, newsletter clicks, discount usage, time since last purchase, number of product categories viewed and support interactions.

Some of these variables may be connected. A customer who buys often may also have high total spend. A customer who reads newsletters may also visit the website more often. A customer who reacts to discounts may behave differently from a customer who buys without promotions.

PCA can reduce these many related variables into a smaller number of components. One component may represent general purchasing activity. Another may represent engagement. Another may represent discount sensitivity.

The exact meaning of each component is not automatic. It must be interpreted from the original variables and their loadings. But the basic idea is clear: PCA turns many related measurements into fewer summary dimensions.

PCA is useful when many variables describe similar or related parts of the same reality. Instead of forcing the model to process every column separately, PCA creates a smaller representation of the main patterns.

Why PCA is used

PCA is used when a dataset has many numerical variables and the analyst wants to simplify it. This can make data easier to visualise, easier to process and sometimes easier to use in a machine learning model.

Too many variables can create practical problems. Training may be slower. The model may become more sensitive to noise. Some variables may be redundant. The data may be harder to understand. Visualisation may be almost impossible if the dataset has hundreds or thousands of dimensions.

PCA can help with:

  • dimensionality reduction – reducing many variables into fewer components,
  • data visualisation – projecting high-dimensional data into two or three dimensions,
  • noise reduction – dropping low-variance components, which often (but not always) carry mostly noise,
  • faster computation – working with fewer input dimensions,
  • removing redundancy – compressing correlated variables into fewer components,
  • exploratory analysis – finding structure, clusters or outliers in complex data.

PCA and dimensionality reduction

PCA is one of the classic methods of dimensionality reduction. Dimensionality reduction means transforming data with many variables into a simpler form with fewer dimensions.

If a dataset has 200 numerical variables, it may not be practical to use all of them directly. Some may carry almost the same information. Some may be weak. Some may only increase computational cost. PCA can create a reduced representation that keeps the strongest patterns.

This does not mean PCA always improves the model. It means PCA gives the analyst a way to reduce complexity while trying to preserve the most important variation.

PCA is not the only dimensionality reduction method, but it is one of the most common starting points because it is well understood, widely implemented and useful for many numerical datasets.

What a principal component is

A principal component is a new axis in the data. It is created from the original variables and arranged so that it captures as much variance as possible.

The first principal component captures the largest possible amount of variance in the data. The second principal component captures the largest remaining variance while being orthogonal to the first, which makes the two components uncorrelated. The third component captures the next largest part, and so on.

This ordering matters. The first few components are usually the most important. Later components often capture smaller and weaker patterns.

In practice, the analyst usually keeps only some of the components. If the first 10 components explain most of the variance, the remaining components may be ignored, depending on the task.
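The ordering and orthogonality of components can be checked numerically. The sketch below assumes synthetic data with an arbitrary mixing matrix and uses NumPy's singular value decomposition as one possible way to obtain the directions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 300 rows, 4 correlated columns (the mixing matrix is arbitrary).
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by singular value.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T
component_variances = scores.var(axis=0, ddof=1)

# Directions are orthonormal, and the scores are mutually uncorrelated.
gram = Vt @ Vt.T
score_cov = np.cov(scores, rowvar=False)
off_diagonal = score_cov - np.diag(np.diag(score_cov))

print(np.round(component_variances, 3))
print(np.allclose(gram, np.eye(4)), np.allclose(off_diagonal, 0.0))
```

The variances come out in descending order, which is what makes it reasonable to keep the first few components and drop the rest.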

PCA and variance

Variance describes how far values spread around their mean. If values are spread out, variance is high. If values are close together, variance is low.

PCA is built around variance. It tries to find directions in the data where the values change the most. These directions are considered important because they capture strong structure in the dataset.

However, high variance does not automatically mean high usefulness for every task. A variable or component may vary a lot without being relevant to the target you care about. Another weak-looking component may contain information that matters for prediction.

PCA preserves variance, not meaning. A component can explain a lot of variance and still be less useful for a specific business, medical or predictive task than expected.
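A small synthetic sketch shows how variance and usefulness can come apart. The data is invented for the illustration: one column has a large spread but no link to the label, the other has a small spread but separates the two classes. The columns are deliberately left unstandardised so that the variance gap stays visible.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
label = rng.integers(0, 2, size=n)

# Column 0: large spread, unrelated to the label.
# Column 1: small spread, but it separates the two classes.
X = np.column_stack([
    rng.normal(scale=10.0, size=n),
    (label - 0.5) * 2.0 + rng.normal(scale=0.2, size=n),
])
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T

def abs_corr_with_label(values):
    """Absolute correlation between a score column and the label."""
    return abs(np.corrcoef(values, label)[0, 1])

corr_pc1 = abs_corr_with_label(scores[:, 0])
corr_pc2 = abs_corr_with_label(scores[:, 1])
print(f"|corr| with label: PC1={corr_pc1:.2f}, PC2={corr_pc2:.2f}")
```

In this setup the first component holds almost all of the variance and almost none of the label information, so keeping only that component would throw away the useful signal.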

Explained variance

Explained variance tells you how much of the original variation is captured by each principal component.

For example, the first component may explain 45 % of the variance, the second 20 %, the third 10 % and the fourth 5 %. Together, the first four components would explain 80 % of the variance.

This information helps decide how many components to keep. If the first few components explain most of the variance, PCA can reduce the dataset strongly. If many components are needed to explain enough variance, the data may not compress well into only a few dimensions.

A common mistake is to choose the number of components mechanically. Keeping 95 % of the variance may sound reasonable, but it is not always the best rule. The right number depends on the purpose of the analysis, the model performance, the need for interpretability and the cost of losing information.
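The arithmetic from the example above can be checked directly. The helper `components_needed` is a hypothetical name introduced here; it only reports where a variance threshold is reached, which is one input to the decision rather than the decision itself.

```python
import numpy as np

# Per-component shares from the example above (first four components only).
explained = np.array([0.45, 0.20, 0.10, 0.05])
cumulative = np.cumsum(explained)

def components_needed(cumulative_share, threshold):
    """Smallest number of components whose cumulative share reaches the threshold,
    or None if the listed components never reach it."""
    # Small tolerance guards against floating-point rounding in the running sum.
    reached = np.nonzero(cumulative_share >= threshold - 1e-12)[0]
    return int(reached[0]) + 1 if reached.size else None

print(np.round(cumulative, 2))
print(components_needed(cumulative, 0.75))
print(components_needed(cumulative, 0.95))
```

With these four shares, a 95 % threshold is never reached, so more components would have to be inspected before that rule could be applied at all.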

Scree plot

A scree plot is a chart that shows how much variance is explained by each principal component. It helps the analyst see where the added value of further components begins to decline.

In many datasets, the first few components explain a large part of the variance. Later components explain less and less. The point where the curve begins to flatten is often called the elbow.

The elbow can help choose how many components to keep, but it should not be treated as a perfect rule. A scree plot is a guide, not a final decision. The chosen number of components should still be checked against the actual goal of the analysis.
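A scree plot does not need a charting library to be useful. The sketch below prints a text-only version for synthetic data that is assumed to contain two strong hidden factors spread over eight columns; the bar width of 50 characters is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data: 2 strong hidden factors spread over 8 columns, plus noise.
factors = rng.normal(size=(400, 2))
X = factors @ rng.normal(size=(2, 8)) + 0.3 * rng.normal(size=(400, 8))
Xc = X - X.mean(axis=0)

singular_values = np.linalg.svd(Xc, compute_uv=False)
ratio = singular_values**2 / np.sum(singular_values**2)

# Text-only scree plot: one bar per component.
for i, r in enumerate(ratio, start=1):
    print(f"PC{i}: {'#' * int(round(r * 50)):<50} {r:.3f}")
```

In this constructed case the bars drop sharply after the second component. Real data usually gives a less obvious elbow.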

PCA vs feature selection

PCA is often confused with feature selection. Both can reduce complexity, but they work differently.

Feature selection keeps some of the original variables and removes others. If a dataset has 100 columns, feature selection may keep 20 of them. The meaning of the selected variables remains clear.

PCA creates new variables from the original variables. If a dataset has 100 columns, PCA may create 20 principal components. These components are mathematical combinations of the original columns.

This makes PCA powerful, but sometimes harder to explain. Feature selection says: “These original columns are important.” PCA says: “These new combinations of columns capture most of the variation.”

Feature selection keeps selected original variables. PCA creates new variables from the original ones. That is the key difference.
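The difference can be made concrete in a few lines. This sketch uses synthetic data; the two column indices kept by the "feature selection" step are chosen arbitrarily, not by any selection criterion.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # synthetic, 5 columns
Xc = X - X.mean(axis=0)

# Feature selection: keep two original columns unchanged (indices chosen arbitrarily).
selected = Xc[:, [0, 3]]

# PCA: build two new columns, each a weighted sum of all five originals.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
weights = Vt[:2]              # shape (2, 5): one row of weights per component
components = Xc @ weights.T   # shape (200, 2)

print(np.array_equal(selected[:, 0], Xc[:, 0]))  # a selected column is an original column
print(np.round(weights, 2))                      # every original column gets a weight
```

Both results have two columns, but only the selected ones can still be read under their original names.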

Feature extraction and PCA

PCA belongs to feature extraction. Feature extraction means creating new variables from the original data. These new variables are meant to capture useful information in a more compact form.

This is different from feature selection, where the original variables stay unchanged. PCA changes the representation of the dataset. It produces principal components that may be useful for modelling, visualisation or analysis, but they are no longer the original input columns.

This distinction matters in practice. If you need interpretability, feature selection may be easier to explain. If you need compression, denoising or visualisation, PCA may be more useful.

How PCA works in simplified steps

The mathematical details of PCA can be complex, but the practical logic can be explained in a few steps.

  1. Prepare the data – select numerical variables and handle missing or invalid values.
  2. Centre the data – subtract the mean of each variable so that it is centred around zero.
  3. Scale the data if needed – standardise variables when they use different units or ranges.
  4. Find directions of maximum variance – PCA identifies the axes where the data varies the most.
  5. Create principal components – the original variables are transformed into new components.
  6. Choose how many components to keep – usually the leading components, which explain the most variance, are retained.
  7. Use the reduced data – the selected components can be used for visualisation, analysis or modelling.

In technical terms, PCA is usually computed through eigenvectors and eigenvalues of the covariance matrix or through singular value decomposition. In practical work, analysts usually use statistical software or machine learning libraries rather than calculating PCA manually.
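The steps above can be sketched in NumPy. The function name `pca_eig` and its parameters are illustrative assumptions, not a library API. The sketch follows the covariance-matrix route and then cross-checks it against singular value decomposition on synthetic data.

```python
import numpy as np

def pca_eig(X, n_components, scale=False):
    """PCA via eigen-decomposition of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)                      # step 2: centre
    Xc = X - mean
    if scale:                                  # step 3: optional standardisation
        Xc = Xc / X.std(axis=0, ddof=1)
    cov = np.cov(Xc, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # step 4 (ascending order)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    components = eigenvectors[:, :n_components].T     # step 6: keep the first k
    scores = Xc @ components.T                        # step 5: transform
    return scores, components, eigenvalues

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 6)) @ rng.normal(size=(6, 6))   # synthetic data
scores, components, eigenvalues = pca_eig(X, n_components=2)

# Cross-check with SVD of the centred data: same variances, same directions up to sign.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(eigenvalues, s**2 / (len(X) - 1)))
print(np.allclose(np.abs(components @ Vt[:2].T), np.eye(2)))
```

The sign of each direction is arbitrary, which is why the comparison uses absolute values. Libraries may flip signs differently without changing the result in any meaningful way.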

Why scaling matters in PCA

Scaling is one of the most important practical issues in PCA. PCA is sensitive to the scale of variables. If one variable is measured in thousands and another in small decimal values, the larger-scale variable can dominate the result.

For example, if a dataset contains income in euros and age in years, income may have much larger numerical variance simply because of its unit. PCA may then treat income as more important even if age is also meaningful.

This is why variables are often standardised before PCA. Standardisation usually means subtracting the mean and dividing by the standard deviation. After that, variables are on a comparable scale.

If variables use different units or ranges, PCA can be misleading without proper scaling. A large numerical range can look like importance even when it is only a measurement effect.
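The income and age example can be reproduced with synthetic numbers. The means and spreads below are invented, and the two columns are generated independently, so any dominance of income in the raw result comes from its unit alone.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
# Synthetic, independent columns: income in euros, age in years.
income = rng.normal(loc=40_000, scale=12_000, size=n)
age = rng.normal(loc=45, scale=12, size=n)
X = np.column_stack([income, age])

def first_component(data):
    """Direction of the first principal component of the given data."""
    centred = data - data.mean(axis=0)
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    return Vt[0]

raw_pc1 = first_component(X)
standardised_pc1 = first_component((X - X.mean(axis=0)) / X.std(axis=0))

print("Raw weights (income, age):         ", np.round(np.abs(raw_pc1), 3))
print("Standardised weights (income, age):", np.round(np.abs(standardised_pc1), 3))
```

Without scaling, the first component is almost entirely income. With only two standardised columns the weights always come out equal in magnitude, so the point here is simply that income no longer dominates because of its unit.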

Scores and loadings

Two important terms in PCA are scores and loadings.

Scores describe where each observation appears in the new component space. If you reduce a dataset to two principal components, each row in the dataset receives a score on component 1 and component 2. These scores can be used to create a two-dimensional chart.

Loadings describe how strongly each original variable contributes to each principal component. They help interpret what a component may represent. If floor area, room count and land size all have strong loadings on the first component, that component may be related to property size.

Scores are mainly about observations. Loadings are mainly about variables. Together, they help connect the reduced PCA space back to the original dataset.
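Scores and loadings can be computed side by side. This sketch uses synthetic housing-style data with invented column names. It also assumes one common convention for loadings: direction weights scaled by the component's standard deviation, which for standardised data equals the correlation between a variable and a component. Some texts and libraries use the unscaled weights instead.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
size = rng.normal(size=n)   # hypothetical hidden "size" factor
names = ["floor_area", "room_count", "land_size", "year_built"]
X = np.column_stack([
    size + 0.4 * rng.normal(size=n),
    size + 0.4 * rng.normal(size=n),
    size + 0.4 * rng.normal(size=n),
    rng.normal(size=n),     # unrelated to size in this synthetic setup
])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

_, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T                       # one row per observation
# Loadings here: direction weights scaled by each component's standard deviation.
loadings = Vt.T * (s / np.sqrt(n))      # one row per original variable

print("scores shape:", scores.shape, "loadings shape:", loadings.shape)
for name, value in zip(names, loadings[:, 0]):
    print(f"{name:>11}: {abs(value):.2f}")
```

The three size-related columns load strongly on the first component while the unrelated column does not, which is the kind of pattern that would justify reading that component as "property size".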

When PCA is useful

PCA is useful when the dataset contains many numerical variables and some of them are correlated. It is especially helpful when the goal is to simplify the data, visualise patterns or reduce redundancy before modelling.

PCA is commonly used in:

  • data visualisation – reducing many variables to two or three components for plotting,
  • exploratory data analysis – finding the main structure in complex datasets,
  • preprocessing for machine learning – reducing input size before modelling,
  • image processing – compressing image information and finding major patterns,
  • bioinformatics – analysing gene expression or other high-dimensional biological data,
  • finance – summarising correlated market or risk variables,
  • marketing analytics – reducing many customer behaviour signals into fewer patterns,
  • recommendation systems – simplifying high-dimensional user or product data.

PCA for visualisation

One of the most common uses of PCA is visualisation. High-dimensional data is difficult to imagine. PCA can reduce it to two or three components so that the dataset can be plotted on a chart.

This can help reveal clusters, outliers, trends or group differences. For example, if customer segments separate clearly on the first two principal components, the analyst may see that the dataset contains meaningful behavioural groups.

However, PCA visualisations must be interpreted carefully. A two-dimensional chart is only a simplified view of the original data. If the first two components explain only a small part of the variance, the chart may hide important structure.

PCA and embeddings

Modern AI systems often work with embeddings. An embedding is a numerical representation of content, such as a word, sentence, document, image, product or user. Embeddings can have hundreds or thousands of dimensions.

PCA can be used to reduce the dimensionality of embeddings for analysis, visualisation or storage optimisation. For example, document embeddings can be projected into two dimensions to create a map of similar documents. Product embeddings can be reduced to inspect whether related products cluster together.

But reducing embeddings can remove semantic information. A smaller vector may be faster to process, but it may also become less precise. Therefore, PCA on embeddings should be tested against the real task, such as search quality, recommendation quality or clustering usefulness.
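One way to test a reduction against a task is to check whether nearest neighbours survive it. The sketch below is only a stand-in: the "embeddings" are synthetic vectors with 20 assumed hidden directions, and the helper names (`top_neighbours`, `reduce_with_pca`, `neighbour_overlap`) are hypothetical. With real embeddings, the same check would be run against the actual search or recommendation metric.

```python
import numpy as np

rng = np.random.default_rng(8)
# Stand-in for real embeddings: 300 vectors in 256 dimensions with
# most structure in 20 hidden directions (purely synthetic assumption).
hidden = rng.normal(size=(300, 20))
E = hidden @ rng.normal(size=(20, 256)) + 0.5 * rng.normal(size=(300, 256))

def top_neighbours(vectors, k=5):
    """Indices of the k most cosine-similar vectors for each row (excluding itself)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)
    return np.argsort(-sim, axis=1)[:, :k]

def reduce_with_pca(vectors, n_components):
    """Project centred vectors onto their first n_components principal directions."""
    centred = vectors - vectors.mean(axis=0)
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ Vt[:n_components].T

def neighbour_overlap(n_components, k=5):
    """Average fraction of each row's top-k neighbours kept after reduction."""
    full = top_neighbours(E, k)
    reduced = top_neighbours(reduce_with_pca(E, n_components), k)
    shared = [len(set(a) & set(b)) for a, b in zip(full, reduced)]
    return float(np.mean(shared)) / k

for d in (2, 20, 64):
    print(f"{d:>3} dims: neighbour overlap {neighbour_overlap(d):.2f}")
```

A two-dimensional projection is fine for a map of documents, but in this setup it keeps far fewer of the original neighbours than a 20-dimensional one.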

PCA and RAG systems

In RAG systems, documents are often converted into embeddings and searched by similarity. PCA can be useful for analysing those embeddings, visualising document clusters or checking whether different types of content form clear groups.

For example, an internal knowledge base may contain product manuals, support tickets, contracts and technical documentation. PCA can help show whether these content groups are separated in embedding space or whether the retrieval system may confuse them.

However, PCA should not be treated as proof that a RAG system works correctly. Retrieval quality must still be evaluated with realistic questions, source checks and user feedback.

PCA and large language models

PCA is not usually a direct training method for large language models. LLMs rely on different architectures and training objectives. Still, PCA can be useful when analysing high-dimensional representations related to language models.

For example, researchers or engineers may use PCA to inspect embeddings, activation patterns, model outputs or clusters of documents. PCA can help turn complex numerical representations into a simpler view that humans can examine.

This does not mean PCA explains everything inside a language model. It is only one analytical tool that can help inspect patterns in high-dimensional data.

PCA and model performance

PCA can improve model performance in some cases, but it is not guaranteed. It may help when the original data has many correlated variables, noise or unnecessary dimensions. It may also speed up training by reducing the number of inputs.

At the same time, PCA can remove information that matters for prediction. PCA preserves variance, not necessarily predictive usefulness. A component that explains little variance may still contain information that is important for a specific target variable.

That is why PCA should be evaluated on validation or test data. It should not be used only because it is a well-known method.

PCA and overfitting

Overfitting happens when a model learns the training data too closely and performs poorly on new data. PCA can sometimes reduce overfitting by reducing the number of input dimensions and removing part of the noise.

This can be useful when a dataset has many correlated variables or too many features compared with the number of examples. A simpler representation can make it harder for the model to memorise accidental details.

But PCA is not a guaranteed cure for overfitting. If the model is poorly validated, if the data is biased or if the components remove important predictive information, PCA may not help. The result must always be checked on unseen data.

PCA and model explainability

PCA can make data smaller, but it can also make the result less interpretable. Original variables usually have clear names: price, age, income, floor area, purchase count or click rate. Principal components are combinations of these variables.

A component may be partly related to income, partly to purchase behaviour and partly to engagement. This can be useful mathematically but harder to explain to a business user.

This is why PCA has a complicated relationship with model explainability. It may simplify the data structure, but the resulting components may be less intuitive than the original features.

PCA often improves mathematical compactness, but it can reduce human interpretability. This trade-off matters in business, healthcare, finance and other high-impact areas.

PCA and data leakage

Data leakage is a modelling error where the model receives information during training that would not be available in real use. PCA can contribute to leakage if it is fitted before the train-test split.

This may sound surprising because PCA does not use the target variable directly. But PCA still learns the structure of the data. If it learns that structure from the entire dataset, the test set has already influenced the transformation.

The correct workflow is simple: split the data first, fit PCA only on the training data and then apply the learned transformation to validation or test data. In cross-validation, PCA should be fitted separately inside each training fold.

In a machine learning workflow, PCA must be fitted only on training data. Fitting PCA on the full dataset before evaluation can create data leakage.
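The split-first workflow can be sketched in plain NumPy. The data is synthetic, the 80/20 split and the choice of three components are arbitrary, and the `transform` helper is a hypothetical name. Note that the scaling statistics are also learned from the training rows only.

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))  # synthetic data

# 1. Split first (a simple shuffled 80/20 split).
indices = rng.permutation(len(X))
train, test = X[indices[:160]], X[indices[160:]]

# 2. Fit centring, scaling and PCA on the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)
_, _, Vt = np.linalg.svd((train - mean) / std, full_matrices=False)
components = Vt[:3]

# 3. Apply the same learned transformation to both sets.
def transform(data):
    return ((data - mean) / std) @ components.T

train_reduced = transform(train)
test_reduced = transform(test)
print(train_reduced.shape, test_reduced.shape)
```

The test rows never influence `mean`, `std` or `components`. In cross-validation the same three steps would be repeated inside each fold.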

PCA and bagging or ensemble models

PCA can sometimes be used before more complex modelling methods, including ensemble methods. For example, it may reduce a high-dimensional input space before training a classifier or regression model.

However, not every model needs PCA. Some methods can already handle many variables well. Tree-based models and ensemble methods such as bagging may be less sensitive to correlated inputs than some linear models, although this depends on the task and data.

The practical question is not whether PCA is theoretically elegant. The practical question is whether it improves validation performance, stability, speed, interpretability or cost for the specific project.

How to choose the number of components

Choosing the number of principal components is one of the most important PCA decisions. There is no single correct number for all datasets.

Several approaches are common:

  • explained variance threshold – keep enough components to explain, for example, 90 % or 95 % of the variance,
  • scree plot – look for the point where additional components add much less information,
  • model validation – choose the number of components that works best on validation data,
  • interpretability – keep a number of components that can still be explained and documented,
  • operational constraints – choose fewer components if speed, storage or simplicity matters.

For prediction tasks, validation performance is especially important. A component count that explains a lot of variance is not always the component count that produces the best predictive model.
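Choosing the component count by validation can be sketched as a simple loop. The example below assumes a synthetic linear regression task and a single fixed train/validation split; `validation_mse` is a hypothetical helper, and a real project would more likely use cross-validation.

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 300, 12
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # synthetic inputs
y = X @ rng.normal(size=p) + rng.normal(size=n)         # synthetic target

train_X, val_X = X[:200], X[200:]
train_y, val_y = y[:200], y[200:]

# Fit centring and PCA on the training rows only.
mean = train_X.mean(axis=0)
_, _, Vt = np.linalg.svd(train_X - mean, full_matrices=False)

def validation_mse(k):
    """Least-squares regression on the first k components, scored on validation rows."""
    A_train = np.column_stack([np.ones(len(train_X)), (train_X - mean) @ Vt[:k].T])
    A_val = np.column_stack([np.ones(len(val_X)), (val_X - mean) @ Vt[:k].T])
    coef, *_ = np.linalg.lstsq(A_train, train_y, rcond=None)
    return float(np.mean((A_val @ coef - val_y) ** 2))

errors = {k: validation_mse(k) for k in range(1, p + 1)}
best_k = min(errors, key=errors.get)
print("best k on validation data:", best_k)
```

Because the synthetic target depends on every input direction, low-variance components matter here and the best count tends to be high. That is exactly the case a variance threshold alone would miss.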

Limitations of PCA

PCA is powerful, but it has several limitations. These limitations are important because PCA is sometimes treated as a universal solution for complex data.

  • PCA is linear – it works best when important structure can be captured by linear combinations of variables.
  • PCA focuses on variance – high variance does not always mean high relevance for the prediction task.
  • PCA can reduce interpretability – components are combinations of original variables.
  • PCA is sensitive to scaling – variables with larger numerical ranges can dominate if data is not standardised.
  • PCA can be affected by outliers – extreme values can distort the direction of components.
  • PCA does not understand business meaning – it only follows statistical structure in the data.
  • PCA can cause data leakage if applied incorrectly – it must be fitted only on training data in modelling workflows.
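The outlier limitation listed above is easy to demonstrate. In this sketch a synthetic two-dimensional cloud is stretched horizontally, and a single invented extreme point is then added far along the vertical axis.

```python
import numpy as np

rng = np.random.default_rng(11)
# Synthetic 2-D cloud stretched along the horizontal axis.
X = rng.normal(size=(200, 2)) * np.array([3.0, 1.0])

def first_direction(data):
    """Direction of the first principal component of the given data."""
    centred = data - data.mean(axis=0)
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    return Vt[0]

clean = first_direction(X)
# Add a single extreme point far along the vertical axis.
with_outlier = first_direction(np.vstack([X, [0.0, 100.0]]))

def angle_to_horizontal(direction):
    """Angle between a direction and the horizontal axis, in degrees (0 to 90)."""
    return float(np.degrees(np.arctan2(abs(direction[1]), abs(direction[0]))))

print(f"clean: {angle_to_horizontal(clean):.1f} deg, "
      f"with one outlier: {angle_to_horizontal(with_outlier):.1f} deg")
```

One point out of 201 is enough to swing the first component from roughly horizontal to roughly vertical, which is why outliers are worth inspecting before PCA is applied.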

Common mistakes when using PCA

PCA is widely used, but it is also easy to misuse. Many mistakes happen because PCA is treated as a technical shortcut instead of a method that requires judgement.

  • using PCA without scaling – variables with larger ranges can dominate the components,
  • keeping too few components – important information may be removed,
  • keeping too many components – the data may not become meaningfully simpler,
  • interpreting components too literally – components are mathematical combinations, not automatically real-world factors,
  • using PCA on unsuitable data – categorical variables, missing values or non-linear patterns may require other methods,
  • applying PCA before train-test split – this can cause data leakage,
  • assuming PCA improves every model – some models work better with original variables,
  • confusing variance with importance – variance is not the same as predictive value or causal relevance.

When PCA should not be used

PCA is not always the right choice. If a dataset has a small number of clear, meaningful variables, reducing them may only make the analysis harder to explain.

PCA may also be a poor fit when the important structure is strongly non-linear. In those cases, methods such as t-SNE, UMAP, autoencoders or other representation learning approaches may be more useful, depending on the goal.

PCA should also be used cautiously in high-impact decisions where explanations must be clear. If a model affects credit, health, insurance or legal outcomes, replacing original variables with abstract components may make the decision harder to justify.

PCA in everyday language

You can think of PCA as summarising a detailed report into a few main themes. The report may contain dozens of related metrics. PCA tries to find the main patterns that explain how those metrics vary together.

For example, many housing variables may be summarised into components related to property size, location quality and building condition. Many customer behaviour variables may be summarised into components related to engagement, purchase activity and discount sensitivity.

This analogy is not perfect, because PCA is mathematical and the components are not always easy to name. But it captures the basic idea: PCA turns many related measurements into fewer summary dimensions.

How to remember PCA

PCA can be remembered as a method that rotates the view of the data. Instead of looking at the dataset through the original columns, PCA finds new directions where the data spreads the most. Then it keeps the most informative directions and discards the weaker ones.

The method is useful because complex data often contains repeated structure. PCA can compress that structure into fewer components. But it must be used carefully, because the components preserve variance, not necessarily meaning, causality or business importance.
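The "rotated view" idea can be verified numerically on synthetic data. Keeping every component only changes the coordinate system (a rotation, possibly with a reflection), so the total variance stays the same. Information is lost only when components are discarded, and the amount lost equals the variance on the dropped directions.

```python
import numpy as np

rng = np.random.default_rng(12)
X = rng.normal(size=(250, 5)) @ rng.normal(size=(5, 5))  # synthetic data
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keeping every component is an orthogonal change of axes: total variance is unchanged.
rotated = Xc @ Vt.T
total_before = Xc.var(axis=0, ddof=1).sum()
total_after = rotated.var(axis=0, ddof=1).sum()

# Keeping only k components discards the variance on the dropped directions.
k = 2
approx = (Xc @ Vt[:k].T) @ Vt[:k]             # project down, then map back
lost = np.mean(np.sum((Xc - approx) ** 2, axis=1))
dropped = rotated[:, k:].var(axis=0, ddof=0).sum()

print(np.isclose(total_before, total_after), np.isclose(lost, dropped))
```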

PCA is a dimensionality reduction method that creates new components from original numerical variables. It is useful for simplifying data, but the result must always be checked against the real task.

Related terms

  • Machine learning – the broader field in which systems learn patterns from data and use them for predictions, classifications or recommendations.
  • Dimensionality reduction – reducing the number of dimensions in a dataset while preserving useful information. PCA is one of the best-known methods in this group.
  • Feature selection – choosing a subset of original variables instead of creating new components.
  • Embedding – a numerical representation of content such as text, images or documents. PCA can be used to analyse or visualise high-dimensional embeddings.
  • Large language model (LLM) – a language-focused AI model. PCA is not usually a direct LLM training method, but it can help analyse high-dimensional representations related to language models.
  • RAG – retrieval-augmented generation. PCA can help analyse embedding spaces used in retrieval systems, but it does not replace retrieval evaluation.
  • Model explainability – the ability to understand why a model produced a certain output or prediction. PCA can make this harder if components are difficult to interpret.
  • Overfitting – a situation where a model learns training data too closely and performs poorly on new data.
  • Data leakage – a modelling error where information from evaluation data or future data influences training or preprocessing.
  • Bagging – an ensemble method that trains multiple model versions and combines their outputs.
  • Feature extraction – creating new variables from original data. PCA is a classic example of feature extraction.
  • Principal component – a new variable created by PCA as a combination of original variables.
  • Explained variance – the amount of original variation captured by a principal component or a set of components.
  • Scree plot – a chart used to inspect how much variance is explained by each principal component.
  • Loadings – values showing how strongly original variables contribute to principal components.
  • Scores – transformed values showing where observations fall in the new PCA component space.
  • Singular value decomposition – a matrix factorisation method commonly used to compute PCA in practice.
  • Curse of dimensionality – a problem where high-dimensional spaces become sparse and harder for models to learn from reliably.

