Educational and Psychological Measurement: A Comprehensive Overview

Introduction

Educational and psychological measurement encompasses the theories and techniques used to quantify psychological attributes, often focusing on latent constructs that are not directly observable. This field, known as psychometrics, is vital in psychology and education, covering testing, assessment, and related activities. It aims to provide objective evaluations of various traits, skills, and knowledge.

What is Psychometrics?

Psychometrics is a specialized field within psychology and education devoted to testing, measurement, assessment, and related activities. It is concerned with the objective measurement of latent constructs that cannot be directly observed. Psychometricians, who are typically psychologists with advanced graduate training in psychometrics and measurement theory, develop, evaluate, and improve psychological tests. They work in academic institutions, testing organizations like Pearson and the Educational Testing Service, and as independent consultants.

The word psychometry derives from the Greek ψυχή (psukhē, "spirit, soul") and μέτρον (metron, "measure").

The Role of Psychometricians

A psychometrician is an individual with a theoretical knowledge of measurement techniques who is qualified to develop, evaluate, and improve psychological tests. These professionals often focus on constructing and validating assessment instruments, including surveys, scales, and questionnaires.

Historical Foundations

The rational approach to psychological testing emerged from two primary streams of thought:


  1. Measurement of Individual Differences: Stemming from the work of Darwin, Galton, and Cattell.
  2. Psychophysical Measurements: Originating from Herbart, Weber, Fechner, and Wundt.

Darwin, Galton, and the Measurement of Individual Differences

Charles Darwin's theory of natural selection, outlined in his 1859 book On the Origin of Species, highlighted individual differences within species and their adaptive significance. Francis Galton, inspired by Darwin, explored these differences in his 1869 book Hereditary Genius. Galton's work emphasized characteristics that make some individuals more "fit" than others, laying the groundwork for measuring sensory and motor functions, such as reaction time, visual acuity, and physical strength. Galton, often referred to as "the father of psychometrics," devised and included mental tests among his anthropometric measures. James McKeen Cattell extended Galton's work, further solidifying the field.

Herbart, Weber, Fechner, and Psychophysical Measurements

The origin of psychometrics also has connections to the related field of psychophysics. E. H. Weber built upon Herbart's work and tried to prove the existence of a psychological threshold, positing that a minimum stimulus is necessary to activate a sensory system. Building on Herbart and Weber, G. T. Fechner devised the law that the strength of a sensation grows as the logarithm of the stimulus intensity. Wilhelm Wundt, credited with founding the science of psychology, followed Weber and Fechner.
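Fechner's logarithmic law can be made concrete with a short sketch. The function name, the threshold, and the stimulus values below are illustrative choices of ours, not from the source:

```python
import math

def fechner_sensation(intensity, threshold, k=1.0):
    """Weber-Fechner law: perceived sensation grows as the logarithm
    of stimulus intensity above the absolute threshold."""
    if intensity <= threshold:
        return 0.0  # below the psychological threshold, no sensation
    return k * math.log(intensity / threshold)

# Each doubling of the stimulus adds the same increment of sensation:
s1 = fechner_sensation(2.0, threshold=1.0)
s2 = fechner_sensation(4.0, threshold=1.0)
s3 = fechner_sensation(8.0, threshold=1.0)
print(round(s2 - s1, 3), round(s3 - s2, 3))  # 0.693 0.693
```

The equal increments for equal ratios are exactly the logarithmic relationship Fechner proposed.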

L. L. Thurstone, founder and first president of the Psychometric Society, developed and applied the law of comparative judgment, closely connected to the psychophysical theory of Ernst Heinrich Weber and Gustav Fechner. Spearman and Thurstone also contributed significantly to factor analysis.

Evolution of Measurement in Social Sciences

The definition of measurement in the social sciences has a long history. Stevens's definition of measurement was put forward in response to the British Ferguson Committee, whose chair, A. Ferguson, was a physicist. The committee was appointed in 1932 by the British Association for the Advancement of Science to investigate the possibility of quantitatively estimating sensory events. The committee's report highlighted the importance of the definition of measurement. While Stevens's response was to propose a new definition, which has had considerable influence in the field, this was by no means the only response to the report. Another response came from Reese (1943): "Measurement in psychology and physics are in no sense different. Physicists can measure when they can find the operations by which they may meet the necessary criteria; psychologists have to do the same. They need not worry about the mysterious differences between the meaning of measurement in the two sciences."

These divergent responses are reflected in alternative approaches to measurement. Methods based on covariance matrices are typically employed on the premise that numbers, such as raw scores derived from assessments, are measurements. Such approaches implicitly entail Stevens's definition of measurement, which requires only that numbers are assigned according to some rule. On the other hand, when measurement models such as the Rasch model are employed, numbers are not assigned based on a rule. Instead, in keeping with Reese's statement above, specific criteria for measurement are stated, and the goal is to construct procedures or operations that provide data that meet the relevant criteria.


Key Areas of Application

Psychometric theory has found applications in measuring personality, attitudes, beliefs, and academic achievement. Because these latent constructs cannot be observed directly, much of the research and science in this discipline has been devoted to measuring them as close to the true score as possible.

Intelligence Testing

The first psychometric instruments were designed to measure intelligence. Alfred Binet and Theodore Simon developed the Binet-Simon test in France, which was later adapted for use in the U.S.

Personality Testing

Personality testing is another significant area in psychometrics. While there is no universally accepted theory of personality, instruments such as the Minnesota Multiphasic Personality Inventory, measures based on the Five-Factor Model (or "Big Five"), the Personality and Preference Inventory, and the Myers-Briggs Type Indicator are widely used.

Core Measurement Theories

Psychometricians have developed several measurement theories, including:

  • Classical Test Theory (CTT)
  • Item Response Theory (IRT)
  • Rasch Model for Measurement

Classical Test Theory (CTT)

Classical Test Theory (CTT) is a foundational concept in psychometrics, focusing on understanding and improving the reliability and validity of psychological tests and assessments. CTT posits that every test score is composed of two components: the true score (the actual ability or trait being measured) and the error score (random error that affects the measurement). The primary goal of CTT is to minimize the impact of error and estimate the true score as accurately as possible.
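The true-score decomposition can be illustrated with a small simulation. This is a hedged sketch with invented parameters (a true-score SD of 10 and an error SD of 5), not an analysis of any real assessment:

```python
import random

random.seed(0)  # reproducible simulated data

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Classical test theory: observed score X = true score T + error E,
# with E random and uncorrelated with T.  Reliability is the share
# of observed-score variance attributable to true scores.
true_scores = [random.gauss(50.0, 10.0) for _ in range(2000)]
observed = [t + random.gauss(0.0, 5.0) for t in true_scores]

reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))  # theoretical value: 100 / (100 + 25) = 0.8
```

The simulation shows why a noisier instrument (larger error SD) yields lower reliability: the error variance inflates the denominator while the true-score variance is unchanged.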


Item Response Theory (IRT)

Item response theory models the relationship between latent traits and responses to test items. Among other advantages, IRT provides a basis for obtaining an estimate of the location of a test-taker on a given latent trait as well as the standard error of measurement of that location. For example, a university student's knowledge of history can be deduced from his or her score on a university test and then be compared reliably with a high school student's knowledge deduced from a less difficult test. Scores derived by classical test theory do not have this characteristic; actual ability (rather than ability relative to other test-takers) must instead be gauged by comparing scores to those of a "norm group" randomly selected from the population.
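The core of an IRT model is the item response function relating ability to the probability of a correct answer. A minimal sketch of the two-parameter logistic (2PL) function, which reduces to the Rasch form when discrimination is fixed at 1; the parameter values are illustrative:

```python
import math

def irt_probability(theta, difficulty, discrimination=1.0):
    """Probability of a correct response under the 2PL IRT model;
    with discrimination = 1.0 this is the Rasch model's form."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# A test-taker whose ability (theta) equals the item's difficulty
# has a 50% chance of a correct response:
print(irt_probability(theta=0.0, difficulty=0.0))        # 0.5
# Higher ability on an easier item raises that probability:
print(round(irt_probability(theta=1.5, difficulty=0.5), 2))  # 0.73
```

Because ability and difficulty sit on the same scale, scores from tests of different difficulty can be placed on a common metric, which is the property discussed above.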

Rasch Model

An approach that is mathematically similar to IRT but quite distinctive in its origins and features is the Rasch model for measurement.

Statistical Methods

Psychometricians use various statistical methods to analyze data, including:

  • Factor analysis
  • Multidimensional scaling
  • Cluster analysis
  • Structural equation modeling
  • Path analysis
  • Bi-factor analysis

Factor Analysis

Factor analysis is a method of determining the underlying dimensions of data. One of the main challenges faced by users of factor analysis is a lack of consensus on appropriate procedures for determining the number of latent factors. A usual procedure, known as the Kaiser criterion, is to stop factoring when eigenvalues drop below one, on the grounds that a factor with an eigenvalue below one accounts for less variance than a single standardized variable.
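The eigenvalue-greater-than-one rule can be shown with a small worked example. The eigenvalues below are derived analytically for an equicorrelated 3 × 3 correlation matrix, and the helper function is our own illustration, not a library API:

```python
def kaiser_retained(eigenvalues):
    """Kaiser criterion: retain factors whose eigenvalues exceed one,
    i.e., factors accounting for more variance than a single
    standardized variable contributes."""
    return [ev for ev in eigenvalues if ev > 1.0]

# For a 3x3 correlation matrix with all pairwise correlations r,
# the eigenvalues are (1 + 2r, 1 - r, 1 - r).  With r = 0.5:
r = 0.5
eigs = [1 + 2 * r, 1 - r, 1 - r]
print(kaiser_retained(eigs))  # [2.0] -> one factor retained
```

With three moderately correlated variables, a single dominant factor passes the criterion, matching the intuition that the shared variance is one-dimensional.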

Multidimensional Scaling

Multidimensional scaling is a method for finding a simple representation for data with a large number of latent dimensions.

Key Concepts: Reliability and Validity

Key concepts in classical test theory are reliability and validity. A reliable measure is one that measures a construct consistently across time, individuals, and situations. A valid measure is one that measures what it is intended to measure. Both reliability and validity can be assessed statistically.

Reliability

Reliability refers to the consistency and stability of a measurement. A reliable measure produces similar results under consistent conditions.

Internal consistency, which addresses the homogeneity of a single test form, may be assessed by correlating performance on two halves of a test, which is termed split-half reliability; the value of this Pearson product-moment correlation coefficient for two half-tests is adjusted with the Spearman-Brown prediction formula to correspond to the correlation between two full-length tests. Perhaps the most commonly used index of reliability is Cronbach's α, which is equivalent to the mean of all possible split-half coefficients.
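Both indices can be computed directly from item-level data. A self-contained sketch in which the scores are invented for illustration:

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(item_scores):
    """Cronbach's alpha from a list of per-item score lists (each
    inner list holds one item's scores across all respondents)."""
    k = len(item_scores)
    item_var_sum = sum(variance(item) for item in item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))

def spearman_brown(half_corr):
    """Step a split-half correlation up to full-length reliability."""
    return 2 * half_corr / (1 + half_corr)

# Hypothetical scores on 4 items for 5 respondents:
items = [
    [3, 4, 2, 5, 4],
    [3, 5, 2, 4, 4],
    [2, 4, 3, 5, 3],
    [3, 4, 2, 5, 5],
]
print(round(cronbach_alpha(items), 3))  # 0.914
print(round(spearman_brown(0.7), 3))    # 0.824
```

The Spearman-Brown step-up shows why split-half correlations understate reliability: each half is only half as long as the full test.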

Validity

Validity refers to the accuracy of a measurement. A valid measure assesses what it is intended to measure. There are a number of different forms of validity. Criterion-related validity refers to the extent to which a test or scale predicts a sample of behavior, i.e., the criterion, that is "external to the measuring instrument itself." That external sample of behavior can be many things, including another test; college grade point average, as when the high school SAT is used to predict performance in college; and even behavior that occurred in the past, for example, when a test of current psychological symptoms is used to predict the occurrence of past victimization (which is more accurately termed postdiction). When the criterion measure is collected at the same time as the measure being validated, the goal is to establish concurrent validity; when the criterion is collected later, the goal is to establish predictive validity. A measure has construct validity if it is related to measures of other constructs as required by theory. Content validity is a demonstration that the items of a test do an adequate job of covering the domain being measured.
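A criterion-related validity coefficient is, at its simplest, the Pearson correlation between test scores and the criterion. A sketch with invented numbers (not real SAT or GPA data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation, used here as a validity coefficient
    between test scores and an external criterion."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical admission-test scores and later first-year GPAs:
test = [520, 580, 610, 650, 700]
gpa = [2.6, 3.0, 2.9, 3.4, 3.6]
print(round(pearson_r(test, gpa), 2))  # 0.96
```

Because the criterion here is collected after the test, the coefficient would be interpreted as predictive validity; collected simultaneously, the same statistic would index concurrent validity.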

Standards for Educational and Psychological Testing

The considerations of validity and reliability typically are viewed as essential elements for determining the quality of any test. However, professional and practitioner associations frequently have placed these concerns within broader contexts when developing standards and making overall judgments about the quality of any test as a whole within a given context. In 2014, the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME) published a revision of the Standards for Educational and Psychological Testing, which describes standards for test development, evaluation, and use. The Standards cover essential topics in testing, including validity, reliability/errors of measurement, and fairness in testing. The book also establishes standards related to testing operations, including test design and development, scores, scales, norms, score linking, cut scores, test administration, scoring, reporting, score interpretation, test documentation, and rights and responsibilities of test takers and test users.

Joint Committee on Standards for Educational Evaluation

In the field of evaluation, and in particular educational evaluation, the Joint Committee on Standards for Educational Evaluation has published three sets of standards for evaluations. Each publication presents and elaborates a set of standards for use in a variety of educational settings. The standards provide guidelines for designing, implementing, assessing, and improving the identified form of evaluation. Each of the standards has been placed in one of four fundamental categories to promote educational evaluations that are proper, useful, feasible, and accurate. In these sets of standards, validity and reliability considerations are covered under the accuracy topic.

Criticisms and Controversies

Because psychometrics is based on latent psychological processes measured through correlations, there has been controversy about some psychometric measures. Critics, including practitioners in the physical sciences, have argued that such definition and quantification are difficult, and that such measurements are often misused by laypeople, as when personality tests are used in employment procedures. For example, the Myers-Briggs Type Indicator (MBTI) has questionable validity and has been the subject of much criticism.

The Role of Technology and Process Data

The increasing use of computer-based testing and learning environments is driving a significant reform of traditional measurement, with a tremendous amount of additional data collected during the process of learning and assessment (Bennett et al., 2007, 2010). Recent advances in computer technology make it convenient to collect process data in computer-based assessment. One example is time-stamped action data in an innovative item that allows interaction between a respondent and the item. When a respondent attempts an interactive item, his or her actions are recorded as an ordered sequence of multi-type, time-stamped events. These data, stored in log files and referred to as process data in this book, provide information beyond response data, which typically capture response accuracy only. With process data available in addition to response data, the measurement field is increasingly interested in drawing auxiliary information from the responding process to serve different assessment purposes. For instance, researchers have recently proposed different models for response time and for the joint modeling of responses and response time (e.g., Bolsinova and Molenaar; Costa et al.; Wang et al.).

Applications of Process Data

This Research Topic, collected in this edited e-book, explores the forefront of responses to the need to model new data sources and to incorporate process data in the statistical modeling of multiple possible assessment data. The book presents cutting-edge research on utilizing process data in addition to product data, such as item responses, in educational and psychological measurement for enhancing accuracy in ability parameter estimation (e.g., Bolsinova and Molenaar; De Boeck and Jeon; Engelhardt and Goldhammer; Klotzke and Fox; Liu C. et al.; Park et al.; Schweizer et al.; Wang et al.; Zhang and Wang), facilitating cognitive diagnosis (e.g., Guo and Zheng; Guo et al.; Jiang and Ma; Zhan, Liao et al.; Zhan, Jiao et al.), and detecting aberrant responding behavior (e.g., Liu H.). Throughout the book, methods for analyzing process data from technology-enhanced innovative items in large-scale assessments used for high-stakes decisions are addressed (e.g., Lee et al.; Stadler et al.). Further, methods for extracting useful information from process data in assessments such as serious games and simulations are also discussed (e.g., Liao et al.; Kroehne et al.; Ren et al.; Yuan et al.). Interdisciplinary studies that borrow data-driven methods from computer science, machine learning, artificial intelligence, and natural language processing are also highlighted (e.g., Ariel-Attali et al.; Chen et al.; Hao and Mislevy; Qiao and Jiao; Smink et al.), providing new perspectives on data exploration in educational and psychological measurement. Most importantly, the models integrating process data and product data in this book are of critical significance for linking traditional test data with the new features extracted from new data sources.

Statistical Modeling of Innovative Assessment Data

The book chapters demonstrate the use of process data and the integration of process and product data (item responses) in educational and psychological measurement. The chapters address issues in adaptive testing, problem-solving strategy, validity of test score interpretation, item pre-knowledge detection, cognitive diagnosis, complex dependence in joint modeling of responses and response time, and multidimensional modeling of these data types. The originality of this book lies in the statistical modeling of innovative assessment data such as log data, response time data, collaborative problem-solving tasks, dyad data, change process data, testlet data, and multidimensional data. Further, new statistical models are presented for analyzing process data in addition to response data such as transition profile analysis, the event history analysis approach, hidden Markov modeling, conditional scaling, multilevel modeling, text mining, Bayesian covariance structure modeling, mixture modeling, and multidimensional modeling.

The Future of Process Data in Assessment

As more and more data are collected in computer-based testing, process data will become a very important source of information to validate and facilitate the measurement of response accuracy and to provide supplementary information for understanding test-takers' behaviors, the reasons for missing data, and links with motivation studies. There is no doubt that there is high demand for such research in large-scale assessment, both high-stakes and low-stakes, as well as in personalized learning and assessment, to tailor the best sources and methods to help people learn and grow. This book is a timely addition to the current literature on psychological and educational measurement.

Analogies and Metaphors in Educational and Psychological Measurement

Educational and psychological measurement is often compared to physical measurement, using tools like rulers and scales. Testing, as the embodiment of educational and psychological measurement, is sometimes likened to medical instruments. However, it's important to recognize that tests are not as precise as thermometers.

Testing as a Blood Oximeter

Educational and psychological tests can be compared to blood oximeters, which measure oxygen saturation in the blood as an indicator of respiratory health. Similarly, tests measure math achievement as an indicator of college readiness. The traditional, ultra-standardized, multiple-choice test is a lot like the pulse ox: developed, for convenience and efficiency, based on a majority group of test takers, without fully considering the unique needs of underserved and minoritized students.

Social Responsibility

Koljatic et al. (2021) compare the testing industry to the sporting apparel industry, arguing that we need to accept more responsibility with respect to the social impacts of our products. One sporting apparel company recently released a shoe that can be put on and taken off hands-free, extending its lineup of more accessible footwear (Newcomb, 2021). This innovation is regarded as a major step forward, so to speak, in inclusive and individualized design (Patrick & Hollenbeck, 2021). However, concerns have been raised about accessibility in terms of high cost and limited availability (Weaver, 2021). Admission tests, like other large-scale assessments, have historically been inaccessible to students, by design, until the moment of administration.

Testing as Pharmaceuticals

Testing resembles the pharmaceutical industry, where standardized tests are like drugs. In both cases, the product can take years, and great expense, to develop. Both target practical issues faced by many people, for example, ulcerative colitis or pandemic learning loss. Both are designed in laboratory settings. And the countless, sometimes absurd, side effects make us question whether the potential benefits are worth the costs and risks.
