A probabilistic method for detecting multivariate extreme outliers

Shafiu Jibrin, Irwin S. Pressman, Matias Salibian-Barrera

Research output: Contribution to journal (Article)

Abstract

Given a data set arising from a series of observations, an outlier is a value that deviates so substantially from the natural variability of the data set as to arouse suspicion that it was generated by a different mechanism. We call an observation an extreme outlier if it lies at an abnormal distance from the "center" of the data set. We introduce the Monte Carlo SCD algorithm for detecting extreme outliers. The algorithm finds extreme outliers in terms of a subset of the data set called the outer shell. Each iteration of the algorithm runs in polynomial time, and this cost could be reduced by preprocessing the data to reduce its size. A new feature of this approach is that it estimates a relative measure of the degree to which a data point on the outer shell is an outlier (its "outlierness"). This measure has potential for serendipitous discoveries in data mining, where unusual or special behavior is of interest. Other applications include spatial filtering and smoothing in digital image processing. We apply this method to baseball data and identify the ten most exceptional pitchers of the 1998 American League. To illustrate another useful application, we also show that the SCD can be used to reduce the solution time of the D-optimal experimental design problem.

Original language: English (US)
Pages (from-to): 157-170
Number of pages: 14
Journal: International Journal of Nonlinear Sciences and Numerical Simulation
Volume: 5
Issue number: 2
State: Published - 2004

Keywords

  • D-optimal design
  • Extreme outliers
  • Monte Carlo
  • Outlierness
  • Redundancy
  • Semidefinite programming

ASJC Scopus subject areas

  • Engineering (miscellaneous)
  • Computational Mechanics
  • Mechanics of Materials
  • Applied Mathematics
  • Modeling and Simulation
  • Physics and Astronomy(all)
  • Statistical and Nonlinear Physics

Cite this

A probabilistic method for detecting multivariate extreme outliers. / Jibrin, Shafiu; Pressman, Irwin S.; Salibian-Barrera, Matias.

In: International Journal of Nonlinear Sciences and Numerical Simulation, Vol. 5, No. 2, 2004, p. 157-170.

@article{93dfa58f337a42a99547277282f8dcc9,
title = "A probabilistic method for detecting multivariate extreme outliers",
abstract = "Given a data set arising from a series of observations, an outlier is a value that deviates substantially from the natural variability of the data set as to arouse suspicions that it was generated by a different mechanism. We call an observation an extreme outlier if it lies at an abnormal distance from the {"}center{"} of the data set. We introduce the Monte Carlo SCD algorithm for detecting extreme outliers. The algorithm finds extreme outliers in terms of a subset of the data set called the outer shell. Each iteration of the algorithm is polynomial. This could be reduced by preprocessing the data to reduce its size. This approach has an interesting new feature. It estimates a relative measure of the degree to which a data point on the outer shell is an outlier (its {"}outlierness{"}). This measure has potential for serendipitous discoveries in data mining where unusual or special behavior is of interest. Other applications include spatial filtering and smoothing in digital image processing. We apply this method to baseball data and identify the ten most exceptional pitchers of the 1998 American League. To illustrate another useful application, we also show that the SCD can be used to reduce the solution time of the D-optimal experimental design problem.",
keywords = "D-optimal design, Extreme outliers, Monte Carlo, Outlierness, Redundancy, Semidefinite programming",
author = "Shafiu Jibrin and Pressman, {Irwin S.} and Matias Salibian-Barrera",
year = "2004",
language = "English (US)",
volume = "5",
pages = "157--170",
journal = "International Journal of Nonlinear Sciences and Numerical Simulation",
issn = "1565-1339",
publisher = "Walter de Gruyter GmbH \& Co. KG",
number = "2",

}

TY - JOUR

T1 - A probabilistic method for detecting multivariate extreme outliers

AU - Jibrin, Shafiu

AU - Pressman, Irwin S.

AU - Salibian-Barrera, Matias

PY - 2004

Y1 - 2004

AB - Given a data set arising from a series of observations, an outlier is a value that deviates substantially from the natural variability of the data set as to arouse suspicions that it was generated by a different mechanism. We call an observation an extreme outlier if it lies at an abnormal distance from the "center" of the data set. We introduce the Monte Carlo SCD algorithm for detecting extreme outliers. The algorithm finds extreme outliers in terms of a subset of the data set called the outer shell. Each iteration of the algorithm is polynomial. This could be reduced by preprocessing the data to reduce its size. This approach has an interesting new feature. It estimates a relative measure of the degree to which a data point on the outer shell is an outlier (its "outlierness"). This measure has potential for serendipitous discoveries in data mining where unusual or special behavior is of interest. Other applications include spatial filtering and smoothing in digital image processing. We apply this method to baseball data and identify the ten most exceptional pitchers of the 1998 American League. To illustrate another useful application, we also show that the SCD can be used to reduce the solution time of the D-optimal experimental design problem.

KW - D-optimal design

KW - Extreme outliers

KW - Monte Carlo

KW - Outlierness

KW - Redundancy

KW - Semidefinite programming

UR - http://www.scopus.com/inward/record.url?scp=2442456474&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=2442456474&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:2442456474

VL - 5

SP - 157

EP - 170

JO - International Journal of Nonlinear Sciences and Numerical Simulation

JF - International Journal of Nonlinear Sciences and Numerical Simulation

SN - 1565-1339

IS - 2

ER -