Polychoric Correlation in R

2 min read · Jun 3, 2022

Understanding Polychoric Correlation for Ordinal Data Analysis

Working with ordinal data presents unique challenges in data analysis, particularly when it comes to finding correlations between features. Polychoric correlation offers a robust methodology for addressing these complexities.

Feature selection

Feature selection is a critical step in the model-building process, involving the identification of a relevant subset of features for use in predictive modeling. Effective feature selection can significantly impact model performance by:

Simplifying models for improved interpretability
Reducing training times
Mitigating the curse of dimensionality
Enhancing data compatibility with various learning algorithms
Encoding inherent symmetries in the input space

Several strategies exist for assessing feature importance or reducing the feature set:

Unsupervised Methods: These approaches do not utilize the target variable, focusing instead on eliminating redundant variables.
Supervised Methods: In contrast, supervised methods leverage the target variable to identify and remove irrelevant features.
Dimensionality Reduction: This technique projects input data into a lower-dimensional feature space, preserving essential relationships while reducing complexity.

Correlation analysis is a straightforward, unsupervised approach to discovering relationships between features. However, the standard (Pearson) correlation coefficient is designed for continuous, interval-scaled data. When features are ordinal, polychoric correlation is the more appropriate measure.

Ordinal Data

Ordinal data consists of ordered categories, such as disease staging (e.g., advanced, moderate, mild) or levels of pain (e.g., severe, moderate, mild, none). Unlike nominal data, where categories lack a specific order, ordinal data provides a clear hierarchy among categories.

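Pearson correlation can mislead when applied to ordinal data, because it treats the category codes as equally spaced numbers and tends to understate the strength of the association when there are only a few categories. Polychoric correlation instead assumes that each ordinal variable is a coarsened version of a continuous latent variable, that the two latent variables are jointly normal, and estimates the correlation between those latent variables.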

Polychoric correlation

R is renowned for its statistical capabilities and has packages that compute polychoric correlation. Python, by contrast, has no built-in support for it, and at the time of writing its common packages did not appear to offer it either. To demonstrate polychoric correlation, we use a dataset known as the dermatology data. Dermatologists often struggle to tell apart the diseases in the erythemato-squamous group. Diagnosing these skin conditions poses several challenges, including:

The imprecision of clinical test results
Similar features across individuals

To facilitate analysis, we can convert continuous clinical test results into ordinal categories:

1–50: Low (level 0)
51–100: Medium (level 1)
101–150: High (level 2)
151 and above: Very High (level 3)
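The binning above can be sketched in base R with cut(). This is a minimal illustration, not the original code: the helper name to_ordinal_level and the sample values are made up, and the sketch assumes results are positive, so anything at or below 0 becomes NA.

```r
# Hypothetical helper: map a continuous test result to the ordinal levels 0-3.
# Intervals are (0, 50], (50, 100], (100, 150], (150, Inf], matching the
# ranges listed above for whole-number results. Values <= 0 become NA.
to_ordinal_level <- function(x) {
  bins <- cut(x, breaks = c(0, 50, 100, 150, Inf), labels = FALSE, right = TRUE)
  # cut() returns interval codes 1..4; shift to 0..3 and mark them as ordered.
  factor(bins - 1L, levels = 0:3, ordered = TRUE)
}

results <- c(12, 50, 51, 100, 149.5, 151, 300)  # made-up example values
levels_out <- to_ordinal_level(results)
print(levels_out)
```

Storing the result as an ordered factor keeps the ranking explicit, which is what functions for ordinal data generally expect.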

With our ordinal data structured, we can proceed to code the polychoric correlation analysis.

R language — Polychor
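A minimal sketch of the computation, assuming the polycor package is installed and using its polychor() function. The data here is synthetic, generated only so the example runs on its own; it is not the dermatology dataset, and the names test_a and test_b are placeholders.

```r
# Assumes: install.packages("polycor") has been run beforehand.
library(polycor)

set.seed(42)
n <- 500

# Two latent normal variables with a true correlation of 0.6 (synthetic data).
latent_a <- rnorm(n)
latent_b <- 0.6 * latent_a + sqrt(1 - 0.6^2) * rnorm(n)

# Coarsen each latent variable into four ordered levels (0-3).
cuts <- c(-Inf, -0.5, 0.3, 1, Inf)
test_a <- ordered(cut(latent_a, cuts, labels = FALSE) - 1L)
test_b <- ordered(cut(latent_b, cuts, labels = FALSE) - 1L)

# Polychoric correlation; the default (ML = FALSE) is the quicker two-step estimate.
rho <- polychor(test_a, test_b)
print(rho)

# Pearson on the integer codes, for comparison. It is typically smaller in
# magnitude than the polychoric estimate on coarsened data like this.
pearson_on_codes <- cor(as.integer(test_a), as.integer(test_b))
print(pearson_on_codes)
```

For a whole data frame of ordered factors, polycor's hetcor() returns the full correlation matrix in one call.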

Running the analysis yields a correlation value between -1 and 1 for each pair of features. We can use this output to filter the clinical tests and identify those most relevant for diagnosing erythemato-squamous diseases. In this way, polychoric correlation reveals relationships between ordinal features that a standard correlation coefficient may understate.
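One way to act on the output is to flag strongly correlated pairs as candidates for removal. The sketch below uses base R only; the matrix values, the test names, and the 0.8 cutoff are all invented for illustration and would need to be chosen for the real data.

```r
# Hypothetical polychoric correlation matrix for four made-up clinical tests.
tests <- c("test_a", "test_b", "test_c", "test_d")
cor_mat <- matrix(c(
   1.00,  0.85, 0.10, -0.30,
   0.85,  1.00, 0.05, -0.20,
   0.10,  0.05, 1.00,  0.90,
  -0.30, -0.20, 0.90,  1.00
), nrow = 4, byrow = TRUE, dimnames = list(tests, tests))

threshold <- 0.8  # arbitrary cutoff, an assumption for this example

# Restrict to the upper triangle so each pair is reported once.
idx <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)

redundant_pairs <- data.frame(
  feature_1   = rownames(cor_mat)[idx[, "row"]],
  feature_2   = colnames(cor_mat)[idx[, "col"]],
  correlation = cor_mat[idx]
)
print(redundant_pairs)
```

From each flagged pair, one feature can be kept and the other dropped, ideally guided by domain knowledge about which test is cheaper or more reliable.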

Written by Mohammed Shammeer