Delta: A new measure of agreement between two raters
Abstract
The most common measure of agreement for categorical data is the coefficient kappa. However, kappa performs poorly when the marginal distributions are very asymmetric, it is not easy to interpret, and its definition is based on the hypothesis of independence of the responses (which is more restrictive than the hypothesis that kappa has a value of zero). This paper defines a new measure of agreement, delta, ‘the proportion of agreements that are not due to chance’, which comes from a model of multiple‐choice tests and does not share these limitations. The paper shows that kappa and delta generally take very similar values, except when the marginal distributions are strongly unbalanced. The case of 2 × 2 tables (which admits very simple solutions) is considered in detail.
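To make kappa's sensitivity to asymmetric marginals concrete, here is a minimal sketch using the standard definition of Cohen's kappa, (p_o − p_e) / (1 − p_e). It is not taken from the paper: the function name `cohen_kappa` and both 2 × 2 tables are hypothetical illustrations. Delta is not computed, because the abstract does not state its formula.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square contingency table (rows: rater A, columns: rater B)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed proportion of agreement: the diagonal cells.
    p_o = sum(table[i][i] for i in range(k)) / n
    # Marginal proportions for each rater.
    row_marg = [sum(table[i]) / n for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Agreement expected if the two raters responded independently.
    p_e = sum(row_marg[i] * col_marg[i] for i in range(k))
    return (p_o - p_e) / (1 - p_e)


# Hypothetical tables, both with 80 agreements out of 100 subjects.
balanced = [[40, 10], [10, 40]]    # symmetric marginals (50/50)
unbalanced = [[80, 10], [10, 0]]   # strongly asymmetric marginals (90/10)

print(round(cohen_kappa(balanced), 3))    # approximately 0.6
print(round(cohen_kappa(unbalanced), 3))  # approximately -0.111
```

Both tables show the same observed agreement (0.80), yet kappa drops from about 0.6 to a negative value once the marginals become strongly unbalanced. This is the behaviour the abstract cites as a limitation of kappa.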