Feature selection

June 12, 2026 in AI&ChatGPT

Feature selection is a machine learning technique used to choose the most useful input variables from an original dataset. The goal is not to create new variables, but to keep the features that actually help the model solve a specific task and remove those that are irrelevant, redundant or noisy.

Feature selection is one of the main forms of dimensionality reduction. If a dataset has tens, hundreds or thousands of columns, a model does not necessarily need all of them. Some features carry a strong signal, some are weak, some repeat information already present elsewhere, and some only add noise.

In machine learning, more input data does not automatically mean a better model. A larger number of variables can give the model more information, but it can also make training slower, increase the risk of overfitting and make the final model harder to explain.

Feature selection means selecting a smaller set of useful input variables from the original dataset. The original meaning of the selected columns remains unchanged – the method only decides which columns should stay and which should be removed.

What does "feature" mean?

A feature is one input variable used by a model. In tabular data, a feature is usually one column. In other types of data, such as text, images or signals, a feature may be a numerical property extracted from the original input.

If a model predicts the price of a house, features may include floor area, location, number of rooms, building age, heating type and energy efficiency. If a model predicts whether a customer will buy again, features may include previous purchases, order value, source of traffic, newsletter interaction or time since the last website visit.

So a feature is not a mysterious concept. It is simply one piece of information the model can use when learning patterns from data.

Why feature selection matters

At first glance, it may seem logical that a model should receive as many variables as possible. In reality, the model learns from everything it is given – useful signals, weak signals, duplicated information and random noise.

Feature selection helps the model work with a smaller, cleaner and more meaningful set of inputs. This can make the model faster, easier to understand and sometimes more accurate on new data. It can also reduce the chance that the model learns accidental patterns that only exist in the training dataset.

Feature selection is mainly used to:

  • simplify the model – the model does not need to process variables that do not help,
  • reduce computational cost – fewer inputs often mean faster training and prediction,
  • remove noise – weak or random variables can make the model less stable,
  • reduce overfitting – the model has fewer opportunities to learn accidental patterns,
  • improve explainability – people can better understand what the model is using,
  • reduce data collection effort – there is no reason to collect inputs that add no value.

The goal of feature selection is not to remove as many columns as possible. The goal is to keep the variables that are most useful for the specific task.

Example: predicting house prices

Imagine you have a dataset about houses. Each house is described by one hundred variables. The table contains floor area, location, number of rooms, property condition, land size, roof age, heating type, energy efficiency, distance to school, public transport access, window material, facade colour, door handle type and number of outdoor lights.

If the goal is to predict house price, some variables will probably matter a lot. Location, floor area, technical condition, land size, energy efficiency and access to services may be strong signals. Other variables may be much weaker. Facade colour or door handle type may influence buyer perception in some situations, but they are unlikely to be as important as location or usable floor area.

Feature selection helps decide which variables should remain in the model. Instead of working with one hundred inputs, the model may work with twenty variables that explain most of what is useful for the prediction task.

Relevant, irrelevant and redundant features

Not all variables in a dataset have the same value. Some help the model. Some do not. Others repeat information that is already available in another column.

A relevant feature helps the model make better decisions. For house price prediction, this may be location, floor area or property condition. For purchase prediction, it may be previous order count or time since the last visit.

An irrelevant feature has little or no relationship with the target variable. If the model predicts house price, a randomly generated row ID will probably not carry useful information.

A redundant feature may contain information, but that information is already present elsewhere. For example, floor area in square metres and the same area in square feet tell the model exactly the same thing, because one column is just a fixed multiple of the other.

A redundant feature is not always useless, but it often adds complexity without adding new information. This can make the model harder to interpret and sometimes less stable.
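One common way to spot this kind of redundancy is to look for pairs of columns that are almost perfectly correlated. The following is a minimal sketch using only the Python standard library. The function names, the toy data and the 0.95 threshold are illustrative assumptions, not a recommended setting.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (a constant column has zero variance).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def redundant_pairs(columns, threshold=0.95):
    """Return pairs of column names whose absolute correlation exceeds threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(columns[first], columns[second])) > threshold:
                pairs.append((first, second))
    return pairs

# Hypothetical toy data: the same floor area in two units, plus a room count.
area_m2 = [50.0, 72.0, 95.0, 120.0, 64.0]
houses = {
    "area_m2": area_m2,
    "area_sqft": [a * 10.7639 for a in area_m2],
    "rooms": [2, 3, 3, 5, 1],
}

print(redundant_pairs(houses))  # [('area_m2', 'area_sqft')]
```

Correlation only detects linear redundancy between two columns at a time, so a flagged pair is a prompt for review rather than an automatic reason to delete a column.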

Feature selection vs feature extraction

Feature selection is often confused with feature extraction. Both techniques can reduce the dimensionality of a dataset, but they work in different ways.

Feature selection keeps a subset of the original variables. If a table has 100 columns, a feature selection method may keep 20 of them and remove the rest. The meaning of the selected columns remains the same.

Feature extraction creates new variables from the original data. It does not simply choose existing columns. It transforms the original variables into a new representation. Principal Component Analysis, usually shortened to PCA, is a common example.

Feature selection is usually easier to explain because the model still uses original columns. If the selected features are location, floor area and energy efficiency, a business user can understand what the model is using. With feature extraction, the new variables may be mathematically useful but harder to interpret.
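The difference can be sketched in a few lines of Python. The column names and weights below are made up for illustration. A method such as PCA would learn its weights from the data, whereas this toy example simply hard-codes them.

```python
# Hypothetical toy rows; column names are illustrative only.
rows = [
    {"floor_area": 80.0, "rooms": 3.0, "facade_colour_code": 2.0},
    {"floor_area": 120.0, "rooms": 5.0, "facade_colour_code": 7.0},
]

def select_features(rows, keep):
    """Feature selection: keep a subset of the original, still-named columns."""
    return [{name: row[name] for name in keep} for row in rows]

def extract_feature(rows, weights):
    """Feature extraction: build one new column as a weighted mix of originals."""
    return [sum(w * row[name] for name, w in weights.items()) for row in rows]

selected = select_features(rows, ["floor_area", "rooms"])
# Made-up weights; a real extraction method would estimate them from the data.
component = extract_feature(rows, {"floor_area": 0.5, "rooms": 10.0})

print(selected)   # original columns, original meaning
print(component)  # [70.0, 110.0], a new variable with no direct real-world name
```

The selected rows can still be read as floor area and room count, while the extracted value has no direct real-world name.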

Main types of feature selection methods

Feature selection methods are commonly divided into three groups: filter methods, wrapper methods and embedded methods. They differ mainly in how they evaluate variables and whether they use a specific model during the selection process.

Filter methods

Filter methods evaluate features using statistical criteria before the model is trained. They may look at correlation, mutual information, chi-square tests, variance or other measures that estimate how useful a variable may be.

Their advantage is speed. They are useful as an early step when you have many variables and need to remove obviously weak, empty or unsuitable columns.

The limitation is that filter methods often evaluate features separately. A variable that looks weak on its own may still be useful in combination with another variable. A simple filter method may not detect that.

Wrapper methods

Wrapper methods evaluate different combinations of variables by training and testing a specific model. Instead of judging features only by a statistical score, they look at how the model performs with different feature sets.

For example, a wrapper method may gradually add variables or remove them. The model is repeatedly trained, and the method checks which combination produces the best result.

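The advantage is that wrapper methods are tied to the actual model and task. The disadvantage is computational cost. With p variables there are 2^p - 1 possible non-empty subsets, so trying every combination quickly becomes infeasible, and even greedy strategies that add or remove one variable at a time require many rounds of training.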
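A greedy forward search is one simple wrapper strategy. The sketch below is an illustration under stated assumptions: the model is a 1-nearest-neighbour classifier scored by leave-one-out accuracy, the data is a made-up toy set, and the function names are hypothetical. In a real project the score would come from cross-validation on the training data.

```python
def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the given feature indices."""
    correct = 0
    for i, row in enumerate(X):
        best_j, best_dist = None, float("inf")
        for j, other in enumerate(X):
            if i == j:
                continue
            dist = sum((row[f] - other[f]) ** 2 for f in features)
            if dist < best_dist:
                best_j, best_dist = j, dist
        correct += y[best_j] == y[i]
    return correct / len(X)

def forward_selection(X, y, score):
    """Add one feature at a time and stop when no candidate improves the score."""
    remaining = list(range(len(X[0])))
    selected, best = [], 0.0
    while remaining:
        candidates = [(score(X, y, selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best:
            break  # no strict improvement, so keep the current set
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best

# Toy data: feature 0 separates the classes, features 1 and 2 do not.
X = [
    [1.0, 5.0, 0.3],
    [1.2, 1.0, 0.9],
    [0.9, 3.0, 0.1],
    [3.0, 4.8, 0.8],
    [3.2, 1.1, 0.2],
    [2.9, 3.1, 0.5],
]
y = [0, 0, 0, 1, 1, 1]

print(forward_selection(X, y, loo_1nn_accuracy))  # ([0], 1.0)
```

Each round retrains and rescores the model once per remaining feature, which is why this approach becomes slow as the number of variables grows.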

Embedded methods

Embedded methods perform feature selection during model training. Feature selection is not a separate step before modelling, but part of the learning process itself.

Examples include regularised models such as Lasso, whose L1 penalty can shrink some feature weights to exactly zero and so drop those features from the model, and tree-based models that estimate feature importance based on how much each variable improves decision splits.
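To illustrate how an L1 penalty removes features, here is a minimal coordinate-descent sketch of Lasso. It assumes centred data with no intercept and uses the objective (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1. The function names, the toy data and the value of alpha are assumptions chosen for illustration, and a production project would normally rely on a tested library implementation instead.

```python
def soft_threshold(value, alpha):
    """Shrink value towards zero by alpha, returning exactly 0.0 inside [-alpha, alpha]."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 one weight at a time.

    Assumes centred data, no intercept, and no all-zero columns.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(w[k] * X[i][k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w

# Toy data: y depends strongly on feature 0 and only very weakly on feature 1.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.1, 1.9, -1.9, -2.1]  # equals 2.0 * feature 0 + 0.1 * feature 1

weights = lasso_coordinate_descent(X, y, alpha=0.5)
print(weights)  # feature 0 keeps a weight near 1.5, feature 1 is set to 0.0
```

The weak feature ends with a weight of exactly zero, so it is effectively removed, while the strong feature is kept with a shrunken weight.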

Filter methods are fast, wrapper methods test features through a specific model, and embedded methods select features during training. In practice, technical selection is often combined with domain knowledge.

How feature selection works in practice

Feature selection is not a single button that automatically fixes a dataset. A good selection process combines understanding of the task, data quality checks, technical analysis and validation on data the model has not seen during training.

First, you need to understand what the model should do. A useful feature set for price prediction may be very different from a useful feature set for fraud detection, churn prediction, text classification or recommendation.

Then the dataset needs to be checked. Empty columns, technical identifiers, duplicates, extreme values and variables unavailable in real operation should be reviewed before any formal feature selection method is applied.

A practical workflow often looks like this:

  1. define the target – what the model should predict or classify,
  2. check data quality – missing values, duplicates, outliers and invalid variables,
  3. remove obviously unsuitable variables – empty columns, meaningless IDs or leakage-prone fields,
  4. apply a feature selection method – correlation, feature importance, recursive feature elimination or another approach, fitted on the training data only,
  5. train and test the model – check whether the selected features actually help,
  6. review the result with domain knowledge – the selection should make practical sense.
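Steps 2 and 3 of this workflow can be partly automated before any formal selection method is run. The sketch below is one possible way to do it. The function name, the column names and the 50% missing-value limit are hypothetical, and the lists of ID and leakage-prone columns still have to come from someone who knows the data.

```python
def drop_unsuitable(columns, id_columns=(), leakage_columns=(), max_missing=0.5):
    """Remove IDs, leakage-prone fields, mostly empty columns and constant columns."""
    kept = {}
    for name, values in columns.items():
        if name in id_columns or name in leakage_columns:
            continue
        missing_share = sum(v is None for v in values) / len(values)
        if missing_share > max_missing:
            continue
        if len({v for v in values if v is not None}) <= 1:
            continue  # a constant column cannot help distinguish rows
        kept[name] = values
    return kept

# Hypothetical churn dataset with made-up values.
customers = {
    "customer_id": [101, 102, 103, 104],
    "cancellation_date": [None, "2025-03-01", None, "2025-04-12"],
    "fax_number": [None, None, None, None],
    "country": ["CZ", "CZ", "CZ", "CZ"],
    "orders": [1, 4, 2, 7],
    "days_since_visit": [30, 2, None, 11],
}

cleaned = drop_unsuitable(
    customers,
    id_columns={"customer_id"},
    leakage_columns={"cancellation_date"},
)
print(list(cleaned))  # ['orders', 'days_since_visit']
```

Only the columns that survive this basic check are passed on to a statistical or model-based selection method.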

Data leakage in feature selection

One of the biggest risks in feature selection is data leakage. Data leakage occurs when the model receives information during training that would not be available in real use.

For example, imagine you want to predict whether a customer will cancel a contract. If the dataset contains a column called cancellation date, the model is not really predicting churn. It is using information that already reveals the outcome.

A similar problem can occur if feature selection is performed on the entire dataset before the data is split into training and test sets. In that case, the test set has indirectly influenced which features were selected, so the evaluation may look better than it really is.

Feature selection, scaling and other learned transformations should be fitted only on training data. Test data should remain separate until final evaluation.
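The ordering matters more than the specific method. The sketch below splits first, fits the selector on the training rows only, and then applies the already chosen columns to the test rows. The helper names and the toy score (gap between class means, which depends on feature scale) are assumptions made for illustration.

```python
import random

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle and split rows into (train, test) before any selection happens."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_selector(train_rows, k):
    """Pick the k features with the largest gap between class means (toy score).

    Assumes binary labels 0 and 1, both present in the training rows.
    """
    n_features = len(train_rows[0][0])
    scores = []
    for f in range(n_features):
        by_class = {0: [], 1: []}
        for features, label in train_rows:
            by_class[label].append(features[f])
        mean_0 = sum(by_class[0]) / len(by_class[0])
        mean_1 = sum(by_class[1]) / len(by_class[1])
        scores.append((abs(mean_1 - mean_0), f))
    return sorted(f for _, f in sorted(scores, reverse=True)[:k])

def apply_selector(rows, selected):
    """Keep only the already selected feature indices; nothing is re-fitted here."""
    return [([features[f] for f in selected], label) for features, label in rows]

# Toy rows of (features, label): feature 0 is informative, feature 1 is not.
rows = [
    ([0.1, 0.7], 0), ([0.3, 0.2], 0), ([0.2, 0.9], 0), ([0.4, 0.4], 0),
    ([10.2, 0.6], 1), ([10.1, 0.3], 1), ([10.4, 0.8], 1), ([10.3, 0.1], 1),
]

train, test = split_rows(rows)
selected = fit_selector(train, k=1)           # uses training rows only
train_selected = apply_selector(train, selected)
test_selected = apply_selector(test, selected)  # test rows never influence the choice
print(selected)  # [0]
```

With cross-validation the same rule applies inside every fold: the selector is re-fitted on each training part and only applied to the held-out part.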

Feature selection and overfitting

Overfitting happens when a model adapts too closely to the training data. It learns not only general patterns, but also random details that do not hold in new data.

A large number of features can increase this risk. The model has more opportunities to find accidental combinations that look useful in the training set but do not represent stable relationships.

Feature selection can reduce overfitting by removing weak, duplicated or noisy inputs. The model then works with a smaller set of variables and has less space to learn accidental patterns.

Feature selection and model explainability

One of the strongest advantages of feature selection is better explainability. If a model uses 12 understandable variables instead of 800 columns, it is much easier to explain why it makes a certain prediction.

This matters in areas where model decisions have real consequences – credit scoring, insurance, healthcare, business forecasting, risk management or automated prioritisation. In these cases, a good score is not enough. People need to understand what the model is using and whether those inputs are appropriate.

Feature selection can also help identify variables that are unstable, legally problematic, ethically sensitive or indirectly connected to information the model should not use.

When feature selection helps the most

Feature selection is most useful when the number of variables is high, some of them are redundant or noisy, and the model needs to remain understandable.

It is often useful for tabular data, marketing analytics, medical datasets, financial models, customer segmentation, text analysis and anomaly detection.

Feature selection is especially valuable when:

  • the dataset has many columns – hundreds or thousands of input variables,
  • there are few rows compared with the number of variables – this increases overfitting risk,
  • features repeat similar information – multiple columns describe almost the same thing,
  • the data contains noise – some inputs are weak, random or unreliable,
  • the model needs to be explainable – fewer variables are easier to inspect,
  • data collection is expensive – there is no reason to collect inputs that do not help.

When feature selection may not be worth it

Feature selection is not always necessary. If you already have a small number of well-understood, high-quality and domain-relevant variables, further reduction may not help.

It can also be risky when a variable looks weak by itself but becomes useful in combination with other variables. Removing it too aggressively can make the model worse.

Feature selection should therefore not be a mechanical deletion of columns based on one score. It should be a tested process that improves or at least preserves model performance while making the model simpler and more understandable.

The goal is not to have the smallest possible number of variables. The goal is to keep enough useful information for the model to perform well and for humans to review the result.

Feature selection in marketing and e-commerce

In marketing and e-commerce, feature selection can be used for purchase prediction, customer segmentation, customer lifetime value estimation, churn prediction, lead scoring or personalisation.

A model may have many potential inputs: number of website visits, device type, traffic source, number of orders, average basket value, email engagement, viewed categories, time since last purchase or discount usage.

Not all of these signals are equally useful. For repeat purchase prediction, the time since the last order, purchase frequency and favourite product categories may be more useful than minor technical details about the first website visit.

Feature selection for text data

Text data can produce a very large number of features. If a model works with words, tokens or n-grams, the vocabulary can contain tens of thousands of items. Not every word helps the model recognise meaning, topic or category.

Feature selection can remove words or expressions that are too common, too rare or not informative enough. In text classification, it can help identify terms that actually distinguish one category from another.
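Filtering terms by document frequency is one simple way to do this. The sketch below keeps a term only if it appears in at least `min_df` documents and in no more than a `max_df_ratio` share of them. The function name, the thresholds and the example sentences are assumptions for illustration, and the whitespace tokenisation is deliberately naive.

```python
from collections import Counter

def select_terms(documents, min_df=2, max_df_ratio=0.8):
    """Keep terms that are neither too rare nor too common across documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))  # count each term once per document
    return sorted(
        term
        for term, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )

# Made-up example documents.
docs = [
    "the invoice is overdue",
    "the invoice got paid",
    "the match ended level",
    "the match was postponed",
]

print(select_terms(docs))  # ['invoice', 'match']
```

The word "the" is dropped because it appears in every document, and words that occur only once are dropped as too rare. Document frequency does not look at the labels, so a supervised score such as chi-square or mutual information can be applied afterwards to find terms that separate the categories.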

With modern embeddings, the situation is different because text is represented as a dense numerical vector rather than as counts of individual words, so a single dimension no longer corresponds to a specific term. Still, the underlying problem is similar: the system needs a representation that keeps useful information and avoids unnecessary noise.

How to evaluate whether feature selection helped

The key question is not how many variables were removed. The key question is whether the model became better, faster, more stable or easier to explain.

The result should be evaluated on validation or test data. Training performance alone is not enough because an overfitted model can look good on training data and fail on new data.

Evaluation should consider:

  • model performance – did the relevant metric improve or stay stable?
  • stability – does performance hold up on new data and across different data splits?
  • speed – did training or prediction become faster?
  • explainability – is it clearer what the model is using?
  • data effort – can unnecessary collection and cleaning be reduced?
  • domain sense – does the selected feature set make practical sense?

Common mistakes in feature selection

Feature selection can improve a model, but it can also damage it. The most common mistake is relying only on an automatic method without checking whether the result makes technical and practical sense.

Typical mistakes include:

  • selecting features on the whole dataset – this can cause data leakage into the test set,
  • removing weak-looking variables blindly – some variables matter only in combination with others,
  • ignoring domain knowledge – a technical method may miss business, medical or legal meaning,
  • confusing correlation with causation – a feature may be related to the target but not explain it,
  • keeping leakage variables – some columns may reveal the answer directly,
  • reducing too aggressively – useful signals can be removed,
  • evaluating only one model – a feature set good for one model may not be best for another.

Bad feature selection can create a simpler but worse model. Every removed variable should be justified by validation, domain sense or both.

How to remember feature selection

Feature selection can be compared to choosing the most important clues in an investigation. An investigator may have hundreds of pieces of information, but not all of them matter equally. Some point directly to the answer, some are background details and some may be misleading.

A model does not need every variable just because we can measure it. It needs the variables that help it make better decisions in a specific task.

Feature selection means choosing a smaller set of useful input variables from the original dataset. The data becomes simpler, but the meaning of the selected columns remains clear.

Related terms

  • Dimensionality reduction – a broader term for reducing the number of dimensions in a dataset. Feature selection is one form of dimensionality reduction because it reduces the number of original input variables.
  • Machine learning – the broader field in which systems learn from data instead of relying only on fixed manual rules. Feature selection is part of preparing useful data for machine learning models.
  • Embedding – a numerical representation of content such as text, images or documents. Embeddings can have many dimensions, which makes representation quality and dimensionality important.
  • Large language model (LLM) – a language-focused AI model. Feature selection is not usually applied to LLMs in the same direct tabular way, but both topics relate to how models use signals and representations.
  • Feature extraction – a method that creates new variables from original data instead of selecting existing ones.
  • Feature engineering – the process of creating, modifying or preparing variables so that they are more useful for modelling.
  • Overfitting – a situation where a model learns training data too closely and performs poorly on new data.
  • Data leakage – a situation where the model receives information during training that would not be available in real use.
  • Model explainability – the ability to understand why a model produced a certain output or prediction.
