Polychoric Correlation in R

Mohammed Shammeer
2 min read · Jun 3, 2022

Understanding Polychoric Correlation for Ordinal Data Analysis

Polychoric correlation

Working with ordinal data presents unique challenges in data analysis, particularly when estimating correlations between features. Polychoric correlation addresses this by estimating the correlation between the continuous, normally distributed latent variables that are assumed to underlie each pair of ordinal features.

Feature selection

Feature selection is a critical step in the model-building process, involving the identification of a relevant subset of features for use in predictive modeling. Effective feature selection can significantly impact model performance by:

  • Simplifying models for improved interpretability
  • Reducing training times
  • Mitigating the curse of dimensionality
  • Enhancing data compatibility with various learning algorithms
  • Encoding inherent symmetries in the input space

Several strategies exist for assessing feature importance or reducing the feature set:

  1. Unsupervised Methods: These approaches do not utilize the target variable, focusing instead on eliminating redundant variables.
  2. Supervised Methods: In contrast, supervised methods leverage the target variable to identify and remove irrelevant features.
  3. Dimensionality Reduction: This technique projects input data into a lower-dimensional feature space, preserving essential relationships while reducing complexity.

Correlation analysis is a straightforward, unsupervised approach to discovering relationships between features. However, the standard Pearson coefficient assumes continuous, interval-scaled data. When the features are ordinal, polychoric correlation is the better fit.

Ordinal Data

Ordinal data consists of ordered categories, such as disease staging (e.g., advanced, moderate, mild) or levels of pain (e.g., severe, moderate, mild, none). Unlike nominal data, where categories lack a specific order, ordinal data provides a clear hierarchy among categories.

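Traditional correlation measures can mislead when applied to ordinal data. Pearson correlation, for example, treats the category codes as equally spaced numbers and tends to understate the strength of the association when there are only a few categories. Polychoric correlation instead treats each ordinal variable as a discretized version of an underlying normal variable and estimates the correlation between those underlying variables, which gives a more accurate picture of the relationship between two ordinal variables.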

Polychoric correlation

R is well known for its statistical tooling and offers polychoric correlation through its packages, whereas at the time of writing Python had no built-in support for it, either in the core language or in its common packages. The demonstration here therefore uses R. The dataset used is the dermatology dataset. Dermatologists often struggle to diagnose the group of erythemato-squamous diseases. Diagnosing these skin conditions poses several challenges, including:

  • The imprecision of clinical test results
  • Individuals presenting with very similar features

To facilitate analysis, we can convert continuous clinical test results into ordinal categories:

  • 1–50: Low (level 0)
  • 51–100: Medium (level 1)
  • 101–150: High (level 2)
  • 151 and above: Very High (level 3)
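
The binning above can be sketched in base R with `cut()`. This is only an illustration: the vector `test_result` and its values are hypothetical and are not taken from the dermatology data.

```r
# Hypothetical continuous test results (illustrative values only)
test_result <- c(12, 50, 51, 100, 101, 150, 151, 240)

# Right-closed intervals: (0, 50], (50, 100], (100, 150], (150, Inf)
level <- cut(
  test_result,
  breaks = c(0, 50, 100, 150, Inf),
  labels = c(0, 1, 2, 3),
  right = TRUE,
  ordered_result = TRUE   # keep the ordering Low < Medium < High < Very High
)

print(level)

# Integer codes 0..3, in case a numeric representation is needed later
level_int <- as.integer(as.character(level))
```

Keeping the result as an ordered factor, rather than plain integers, makes the ordinal nature of the variable explicit to functions that distinguish between the two.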

With our ordinal data structured, we can proceed to code the polychoric correlation analysis.

R language — Polychor
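
A minimal sketch of such an analysis, assuming the `polycor` package is installed and using its `polychor()` and `hetcor()` functions. The columns `test_a` and `test_b` are simulated stand-ins with hypothetical names, not the actual dermatology features, and the cut points are arbitrary.

```r
library(polycor)  # assumed to be installed: install.packages("polycor")

set.seed(42)
n <- 500

# Simulate two correlated latent normal variables (stand-ins for real data)
latent_a <- rnorm(n)
latent_b <- 0.6 * latent_a + sqrt(1 - 0.6^2) * rnorm(n)

# Discretize a latent variable into four ordered levels (0..3)
to_ordinal <- function(z) {
  cut(z, breaks = c(-Inf, -0.5, 0.3, 1, Inf), labels = 0:3,
      ordered_result = TRUE)
}
test_a <- to_ordinal(latent_a)
test_b <- to_ordinal(latent_b)

# Polychoric correlation for one pair of ordinal variables
rho <- polychor(test_a, test_b)

# Pearson correlation on the integer codes, for comparison
pearson <- cor(as.integer(test_a), as.integer(test_b))

# Correlation matrix across all ordinal columns of a data frame;
# hetcor() uses polychoric correlation for pairs of ordered factors
ordinal_df <- data.frame(test_a, test_b)
corr_matrix <- hetcor(ordinal_df)$correlations

print(c(polychoric = rho, pearson = pearson))
print(corr_matrix)
```

For a real dataset, the simulated columns would be replaced by the ordinal feature columns, converted to ordered factors first.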

Running the analysis yields a correlation value for each pair of features, ranging from -1 to 1. This output lets us filter the features and identify the clinical tests most relevant for diagnosing erythemato-squamous diseases; for example, two tests with a very high absolute correlation carry largely the same information. Polychoric correlation thus gives a clearer view of the relationships between ordinal features than a measure designed for continuous data.
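
One possible way to do this filtering in base R is sketched below. The matrix `corr`, the feature names, and the 0.8 cutoff are all hypothetical values chosen for illustration, not results from the dermatology data.

```r
# Hypothetical polychoric correlation matrix for three clinical tests
features <- c("test_a", "test_b", "test_c")
corr <- matrix(
  c(1.00, 0.85, 0.10,
    0.85, 1.00, 0.20,
    0.10, 0.20, 1.00),
  nrow = 3, byrow = TRUE,
  dimnames = list(features, features)
)

threshold <- 0.8  # assumed cutoff; tune for the problem at hand

# Restrict to the upper triangle so each pair is reported only once
idx <- which(abs(corr) > threshold & upper.tri(corr), arr.ind = TRUE)

redundant_pairs <- data.frame(
  feature_1   = rownames(corr)[idx[, "row"]],
  feature_2   = colnames(corr)[idx[, "col"]],
  correlation = corr[idx]
)

print(redundant_pairs)
```

Each row of `redundant_pairs` names two features that are strongly correlated, so one of them is a candidate for removal.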

Written by Mohammed Shammeer

Chapter Lead at Geek Community. Machine Learning Enthusiast.