Quantifying comparison of large detrital geochronology data sets

Joel E Saylor, Kurt E. Sundell

Research output: Contribution to journalArticle

49 Citations (Scopus)

Abstract

The increase in detrital geochronological data presents challenges to existing approaches to data visualization and comparison, and highlights the need for quantitative techniques able to evaluate and compare multiple large data sets. We test five metrics commonly used as quantitative descriptors of sample similarity in detrital geochronology: The Kolmogorov-Smirnov (K-S) and Kuiper tests, as well as Cross-correlation, Likeness, and Similarity coefficients of probability density plots (PDPs), kernel density estimates (KDEs), and locally adaptive, variable-bandwidth KDEs (LA-KDEs). We assess these metrics by applying them to 20 large synthetic data sets and one large empirical data set, and evaluate their utility in terms of sample similarity based on the following three criteria. (1) Similarity of samples from the same population should systematically increase with increasing sample size. (2) Metrics should maximize sensitivity by using the full range of possible coefficients. (3) Metrics should minimize artifacts resulting from sample-specific complexity. K-S and Kuiper test p-values passed only one criterion, indicating that they are poorly suited as quantitative descriptors of sample similarity. Likeness and Similarity coefficients of PDPs, as well as K-S and Kuiper test D and V values, performed better by passing two of the criteria. Cross-correlation of PDPs passed all three criteria. All coefficients calculated from KDEs and LA-KDEs failed at least two of the criteria. As hypothesis tests of derivation from a common source, individual K-S and Kuiper p-values too frequently reject the null hypothesis that samples come from a common source when they are identical. However, mean p-values calculated by repeated subsampling and comparison (minimum of 4 trials) consistently yield a binary discrimination of identical versus different source populations. Cross-correlation and Likeness of PDPs and Cross-correlation of KDEs yield the widest divergence in coefficients and thus a consistent discrimination between identical and different source populations, with Cross-correlation of PDPs requiring the smallest sample size. In light of this, we recommend acquisition of large detrital geochronology data sets for quantitative comparison. We also recommend repeated subsampling of detrital geochronology data sets and calculation of the mean and standard deviation of the comparison metric in order to capture the variability inherent in sampling a multimodal population. These statistical tools are implemented using DZstats, a MATLAB-based code that can be accessed via an executable file graphical user interface. It implements all of the statistical tests discussed in this paper, and exports the results both as spreadsheets and as graphic files.

Original languageEnglish (US)
Pages (from-to)203-220
Number of pages18
JournalGeosphere
Volume12
Issue number1
DOIs
StatePublished - 2016
Externally publishedYes

Fingerprint

geochronology
comparison
spreadsheet
visualization
artifact
test
divergence
sampling

ASJC Scopus subject areas

  • Geology
  • Stratigraphy

Cite this

Quantifying comparison of large detrital geochronology data sets. / Saylor, Joel E; Sundell, Kurt E.

In: Geosphere, Vol. 12, No. 1, 2016, p. 203-220.

Research output: Contribution to journalArticle

Saylor, Joel E ; Sundell, Kurt E. / Quantifying comparison of large detrital geochronology data sets. In: Geosphere. 2016 ; Vol. 12, No. 1. pp. 203-220.
@article{b2a017071fc5446b87c5571f7d1482a7,
title = "Quantifying comparison of large detrital geochronology data sets",
abstract = "The increase in detrital geochronological data presents challenges to existing approaches to data visualization and comparison, and highlights the need for quantitative techniques able to evaluate and compare multiple large data sets. We test five metrics commonly used as quantitative descriptors of sample similarity in detrital geochronology: The Kolmogorov-Smirnov (K-S) and Kuiper tests, as well as Cross-correlation, Likeness, and Similarity coefficients of probability density plots (PDPs), kernel density estimates (KDEs), and locally adaptive, variable-bandwidth KDEs (LA-KDEs). We assess these metrics by applying them to 20 large synthetic data sets and one large empirical data set, and evaluate their utility in terms of sample similarity based on the following three criteria. (1) Similarity of samples from the same population should systematically increase with increasing sample size. (2) Metrics should maximize sensitivity by using the full range of possible coefficients. (3) Metrics should minimize artifacts resulting from sample-specific complexity. K-S and Kuiper test p-values passed only one criterion, indicating that they are poorly suited as quantitative descriptors of sample similarity. Likeness and Similarity coefficients of PDPs, as well as K-S and Kuiper test D and V values, performed better by passing two of the criteria. Cross-correlation of PDPs passed all three criteria. All coefficients calculated from KDEs and LA-KDEs failed at least two of the criteria. As hypothesis tests of derivation from a common source, individual K-S and Kuiper p-values too frequently reject the null hypothesis that samples come from a common source when they are identical. However, mean p-values calculated by repeated subsampling and comparison (minimum of 4 trials) consistently yield a binary discrimination of identical versus different source populations. Cross-correlation and Likeness of PDPs and Cross-correlation of KDEs yield the widest divergence in coefficients and thus a consistent discrimination between identical and different source populations, with Cross-correlation of PDPs requiring the smallest sample size. In light of this, we recommend acquisition of large detrital geochronology data sets for quantitative comparison. We also recommend repeated subsampling of detrital geochronology data sets and calculation of the mean and standard deviation of the comparison metric in order to capture the variability inherent in sampling a multimodal population. These statistical tools are implemented using DZstats, a MATLAB-based code that can be accessed via an executable file graphical user interface. It implements all of the statistical tests discussed in this paper, and exports the results both as spreadsheets and as graphic files.",
author = "Saylor, {Joel E} and Sundell, {Kurt E.}",
year = "2016",
doi = "10.1130/GES01237.1",
language = "English (US)",
volume = "12",
pages = "203--220",
journal = "Geosphere",
issn = "1553-040X",
publisher = "Geological Society of America",
number = "1",

}

TY - JOUR

T1 - Quantifying comparison of large detrital geochronology data sets

AU - Saylor, Joel E

AU - Sundell, Kurt E.

PY - 2016

Y1 - 2016

N2 - The increase in detrital geochronological data presents challenges to existing approaches to data visualization and comparison, and highlights the need for quantitative techniques able to evaluate and compare multiple large data sets. We test five metrics commonly used as quantitative descriptors of sample similarity in detrital geochronology: The Kolmogorov-Smirnov (K-S) and Kuiper tests, as well as Cross-correlation, Likeness, and Similarity coefficients of probability density plots (PDPs), kernel density estimates (KDEs), and locally adaptive, variable-bandwidth KDEs (LA-KDEs). We assess these metrics by applying them to 20 large synthetic data sets and one large empirical data set, and evaluate their utility in terms of sample similarity based on the following three criteria. (1) Similarity of samples from the same population should systematically increase with increasing sample size. (2) Metrics should maximize sensitivity by using the full range of possible coefficients. (3) Metrics should minimize artifacts resulting from sample-specific complexity. K-S and Kuiper test p-values passed only one criterion, indicating that they are poorly suited as quantitative descriptors of sample similarity. Likeness and Similarity coefficients of PDPs, as well as K-S and Kuiper test D and V values, performed better by passing two of the criteria. Cross-correlation of PDPs passed all three criteria. All coefficients calculated from KDEs and LA-KDEs failed at least two of the criteria. As hypothesis tests of derivation from a common source, individual K-S and Kuiper p-values too frequently reject the null hypothesis that samples come from a common source when they are identical. However, mean p-values calculated by repeated subsampling and comparison (minimum of 4 trials) consistently yield a binary discrimination of identical versus different source populations. Cross-correlation and Likeness of PDPs and Cross-correlation of KDEs yield the widest divergence in coefficients and thus a consistent discrimination between identical and different source populations, with Cross-correlation of PDPs requiring the smallest sample size. In light of this, we recommend acquisition of large detrital geochronology data sets for quantitative comparison. We also recommend repeated subsampling of detrital geochronology data sets and calculation of the mean and standard deviation of the comparison metric in order to capture the variability inherent in sampling a multimodal population. These statistical tools are implemented using DZstats, a MATLAB-based code that can be accessed via an executable file graphical user interface. It implements all of the statistical tests discussed in this paper, and exports the results both as spreadsheets and as graphic files.

AB - The increase in detrital geochronological data presents challenges to existing approaches to data visualization and comparison, and highlights the need for quantitative techniques able to evaluate and compare multiple large data sets. We test five metrics commonly used as quantitative descriptors of sample similarity in detrital geochronology: The Kolmogorov-Smirnov (K-S) and Kuiper tests, as well as Cross-correlation, Likeness, and Similarity coefficients of probability density plots (PDPs), kernel density estimates (KDEs), and locally adaptive, variable-bandwidth KDEs (LA-KDEs). We assess these metrics by applying them to 20 large synthetic data sets and one large empirical data set, and evaluate their utility in terms of sample similarity based on the following three criteria. (1) Similarity of samples from the same population should systematically increase with increasing sample size. (2) Metrics should maximize sensitivity by using the full range of possible coefficients. (3) Metrics should minimize artifacts resulting from sample-specific complexity. K-S and Kuiper test p-values passed only one criterion, indicating that they are poorly suited as quantitative descriptors of sample similarity. Likeness and Similarity coefficients of PDPs, as well as K-S and Kuiper test D and V values, performed better by passing two of the criteria. Cross-correlation of PDPs passed all three criteria. All coefficients calculated from KDEs and LA-KDEs failed at least two of the criteria. As hypothesis tests of derivation from a common source, individual K-S and Kuiper p-values too frequently reject the null hypothesis that samples come from a common source when they are identical. However, mean p-values calculated by repeated subsampling and comparison (minimum of 4 trials) consistently yield a binary discrimination of identical versus different source populations. Cross-correlation and Likeness of PDPs and Cross-correlation of KDEs yield the widest divergence in coefficients and thus a consistent discrimination between identical and different source populations, with Cross-correlation of PDPs requiring the smallest sample size. In light of this, we recommend acquisition of large detrital geochronology data sets for quantitative comparison. We also recommend repeated subsampling of detrital geochronology data sets and calculation of the mean and standard deviation of the comparison metric in order to capture the variability inherent in sampling a multimodal population. These statistical tools are implemented using DZstats, a MATLAB-based code that can be accessed via an executable file graphical user interface. It implements all of the statistical tests discussed in this paper, and exports the results both as spreadsheets and as graphic files.

UR - http://www.scopus.com/inward/record.url?scp=84958232313&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84958232313&partnerID=8YFLogxK

U2 - 10.1130/GES01237.1

DO - 10.1130/GES01237.1

M3 - Article

AN - SCOPUS:84958232313

VL - 12

SP - 203

EP - 220

JO - Geosphere

JF - Geosphere

SN - 1553-040X

IS - 1

ER -