Test Reliability and Validity


Predictor – any variable used to predict a criterion (outcome)

Assessing the Quality of Predictors

Psychometric criteria = Reliability and Validity


Reliability

Is the predictor stable (reliable) over time?

You shouldn’t score differently when taking the test again

determined by the test itself.

consistency, stability, or equivalence of a measure

Test-Retest Reliability

Same test, same people, 2 separate times

simplest method

correlate scores of both test times for each person

Coefficient of stability – the correlation between the two sets of scores, showing how stable the test is over time
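As a minimal sketch of the test-retest steps above, the coefficient of stability is just a Pearson correlation between the two administrations. The scores and the helper name pearson_r are made up for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same five people on the same test, taken twice.
time_1 = [82, 75, 90, 68, 88]
time_2 = [80, 78, 92, 65, 85]

coefficient_of_stability = pearson_r(time_1, time_2)
print(round(coefficient_of_stability, 2))
```

The closer the value is to 1, the more stable the scores are across the two test times.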

Equivalent-Form Reliability

Same people, 2 separate tests on the same construct

very difficult, least popular

correlate scores of both test forms for each person

the higher the r, the higher the equivalent-form reliability

Coefficient of equivalence

Internal-Consistency Reliability

1 test (later broken into smaller tests)

homogeneity of the test

Two ways to test for internal-consistency reliability

Split-half reliability

Cronbach’s alpha or Kuder-Richardson 20
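A rough sketch of both internal-consistency approaches, assuming a small made-up table of item scores (rows are people, columns are items). The split-half estimate here uses an odd/even split and applies the Spearman-Brown correction, which is a common convention but an assumption of this sketch. Kuder-Richardson 20 gives the same result as alpha when items are scored 0/1.

```python
from math import sqrt
from statistics import variance

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def cronbach_alpha(rows):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def split_half(rows):
    """Correlate odd-item and even-item half scores, then apply Spearman-Brown."""
    odd = [sum(row[0::2]) for row in rows]
    even = [sum(row[1::2]) for row in rows]
    r_half = pearson_r(odd, even)
    return 2 * r_half / (1 + r_half)

# Hypothetical responses: 5 people x 4 items on a 1-5 scale.
scores = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]

alpha = cronbach_alpha(scores)
split_half_reliability = split_half(scores)
print(round(alpha, 2), round(split_half_reliability, 2))
```

Both numbers answer the same question: do the items hang together as a measure of one thing (homogeneity)?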

Inter-rater reliability

1 test rated by 2 or more researchers

objective and subjective

r = agreement (consistency) among ratings


Validity

Can we accurately draw inferences from predictor (test) scores?

Does it measure what we want it to measure?

determined by the use of the test

Types of validity

Construct Validity

Convergent validity coefficients

comparison of your test to pre-existing tests of construct

Divergent validity coefficients

comparison of your test to a test that measures something else

Criterion-Related Validity

Predictor relates to criterion (construct)

Two types (distinguished by timing)

Concurrent criterion-related validity

relationship of predictor and criterion at the same time

Predictive criterion-related validity

relationship of the predictor now and the criterion measured in the future

e.g., high school GPA and college performance

Validity coefficient

a somewhat lower coefficient is acceptable because high validity is hard to obtain; around .30 or so
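To put a validity coefficient of about .30 in perspective, one common reading (an addition to these notes, not part of them) is to square r to get the share of criterion variance the predictor accounts for. The function name variance_explained is hypothetical.

```python
def variance_explained(r):
    """Proportion of criterion variance accounted for by the predictor (r squared)."""
    return r ** 2

# Compare a typical validity coefficient with a stronger one.
for r in (0.30, 0.50):
    print(f"r = {r:.2f} -> {variance_explained(r):.0%} of criterion variance")
```

So even a predictor that counts as acceptable leaves most of the variation in the criterion unexplained.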

Content Validity

Does the test adequately cover the intended construct?

From the opinion of Subject matter experts (test developers)

No r coefficient; less rigorous than the other types of validity

Face Validity

Test appears to cover the intended construct

From the opinions of test takers

No r coefficient; less rigorous than the other types of validity

without this or content validity, going through the other steps is meaningless

Predictor development

Two dimensions to classify predictors

Psychological tests and inventories

History of psychological testing

Sir Francis Galton – first scientist to devise a way of systematically measuring people

Cattell – introduced the term mental test. He devised an early intelligence test based on sensory discrimination and reaction time.

Ebbinghaus – German psychologist who developed math and sentence-completion tests. In 1897 he reported that performance on the sentence-completion test was related to school performance.

Binet – French psychologist who developed a test of intelligence. It consisted of 30 problems covering judgment, comprehension, and reasoning.

Terman – Continued Binet’s research and popularized the concept of IQ (Intelligence Quotient)

Types of tests

Test vs Inventory

Test – answers are right or wrong (quiz)

Inventory – no right or wrong answers (personality inventory)

Speed versus Power Tests

speed – easy items with short time limit (typing)

power – difficult items with no time limit (final exam)

Individual versus Group Tests

individual – one test taker (IQ test)

group – several takers (Quiz)

Paper-and-Pencil versus Performance Tests

P&P – no physical task (quiz)

performance – requires physical skill (driving test)

Ethical standards in testing

APA code of professional ethics

Test user qualifications

Invasion of privacy

asking questions unrelated to construct or that are inherently intrusive

test reveals more information than is needed

Favorite sports team


religion, sexual orientation, pregnancy


who has access to the data/scores

confidential unless written release given by the test-taker

Sources of information about testing

Mental Measurements Yearbook (MMY) – big book of tests published every two years

Tests in Print VII – a bibliography of tests that helps locate them in the MMY

Test content

Intelligence tests

complex construct, multiple types of IQ

“g” = general mental ability

single best predictor of performance (r = 0.4 – 0.6)

Example Question

Four years ago, Jane was twice as old as Sam. Four years from now, Sam will be 3/4 of Jane’s age. How old is Jane now? (mensa.org)
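A quick worked check of the example question, written as a small search over whole-number ages. With J for Jane and S for Sam, the two conditions are J - 4 = 2(S - 4) and S + 4 = 3/4 (J + 4). The function name solve_ages and the age range searched are choices made for this sketch.

```python
def solve_ages(max_age=120):
    """Return (jane, sam) satisfying both conditions of the puzzle, or None."""
    for sam in range(4, max_age):
        # Four years ago, Jane was twice as old as Sam.
        jane = 2 * (sam - 4) + 4
        # Four years from now, Sam will be 3/4 of Jane's age
        # (cross-multiplied to stay in whole numbers).
        if 4 * (sam + 4) == 3 * (jane + 4):
            return jane, sam
    return None

jane_age, sam_age = solve_ages()
print(jane_age, sam_age)  # prints: 12 8
```

Jane is 12 now and Sam is 8: four years ago they were 8 and 4, and four years from now they will be 16 and 12.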

Sternberg’s triarchic theory of intelligence

Mechanical aptitude tests

Recognition of mechanical principles

concepts include: sound & heat conductance, velocity, gravity, and force

predictive feature of performance in manufacturing and production jobs

Example question: a car and a ball go off a 100-foot cliff at the same moment. Which hits the ground first? (Ignoring air resistance, they land at the same time.)
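The principle behind the cliff question can be sketched with the free-fall formula t = sqrt(2h / g), which has no mass term. This assumes both objects start from rest and ignores air resistance; the masses and the function name fall_time are made up for illustration.

```python
from math import sqrt

G_FT_PER_S2 = 32.17  # approximate gravitational acceleration in ft/s^2

def fall_time(height_ft, mass_kg):
    """Seconds to fall height_ft from rest, ignoring air resistance.

    mass_kg is accepted only to show that it plays no role in the result.
    """
    return sqrt(2 * height_ft / G_FT_PER_S2)

car_time = fall_time(100, mass_kg=1500)   # hypothetical car
ball_time = fall_time(100, mass_kg=0.5)   # hypothetical ball
print(round(car_time, 2), round(ball_time, 2))
```

Both times come out identical, roughly two and a half seconds for a 100-foot drop.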

Personality inventories

No right or wrong answers, level of agreement

Predictive of job success

Myers-Briggs Type Indicator

Big Five theory of personality (more scientifically supported than the others)

neuroticism – stability vs. instability

extraversion – sociable, assertive, outgoing

openness to experience – curious, imaginative

agreeableness – cooperative, helpful, easy going

conscientiousness – purposeful, organized


Integrity test

Assess honesty, integrity, and character

Used to identify those who might steal or engage in counterproductive work behaviors (CWBs), e.g., absenteeism

Overt integrity tests

Personality-based measures

these work the best

mainly tests conscientiousness and emotional stability personality factors

Physical abilities

assess strength, endurance, and movement quality

predictive of job performance in physically demanding jobs

static strength

the ability to use muscle to lift, push, pull, or carry objects

explosive strength

the ability to use short bursts of muscle force to propel oneself or an object

gross body coordination

the ability to coordinate the movement of arms, legs, and torso where the whole body is in motion


stamina

the ability of the lungs and the circulatory systems of the body to perform efficiently over time

Situational judgment tests

All answers are plausible, only 1 is appropriate in the given situation.

Computerized adaptive testing

like the GRE

each question is different depending on how you answer previously

adapts to test-taker in terms of difficulty

tailored testing (very popular, but expensive)
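A toy sketch of the adaptive idea: the next item gets harder after a correct answer and easier after an incorrect one. This simple up/down rule is an assumption made for illustration; operational adaptive tests generally rely on item response theory to pick items and estimate ability. The function name and the 1-10 difficulty scale are hypothetical.

```python
def run_adaptive_test(answers_correctly, n_items=10, start=5, lowest=1, highest=10):
    """Administer n_items, stepping difficulty up after a correct answer
    and down after an incorrect one. Returns the difficulties administered."""
    difficulty = start
    administered = []
    for _ in range(n_items):
        administered.append(difficulty)
        if answers_correctly(difficulty):
            difficulty = min(highest, difficulty + 1)
        else:
            difficulty = max(lowest, difficulty - 1)
    return administered

# Hypothetical test-taker who can answer items up to difficulty 7.
path = run_adaptive_test(lambda d: d <= 7)
print(path)  # prints: [5, 6, 7, 8, 7, 8, 7, 8, 7, 8]
```

The sequence settles around the test-taker's ability level, which is why each person ends up seeing a different set of questions.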

Testing on the Internet

Faster and cheaper

test security is a major issue

proctored: take the test at a specific site with a proctor

unproctored: take the test anywhere you have internet access, where nothing stops your buddies from helping you take it


Interviews

social interactions between interviewer and applicant

can be biased by shared similarities, nonverbal behavior, and verbal cues

unstructured – ask applicants different questions

focus on g, education, interests, and work experience

Structured – questions are consistent across applicants

focus on job knowledge, interpersonal and social skills, and problem solving

more valid, predict job performance and mental ability, apply fairness to applicants

Situational – like SJTs, focus on:

experiences you’ve likely had (“how did you handle that?”)

unforeseen situations that might arise (“how would you handle that?”)

Experience-based v. situational questions

“Illusion of validity” – We are not good judges of people, but we think we are.

Assessment centers

Assess applicants via standardized, group-oriented exercises evaluated by raters

rate the performance of applicant

General characteristics

Assess management-level personnel

appraise individuals in groups (10 to 20)

Performance rated by trained observers

Use multiple methods to assess performance (group exercises, personality measures, etc.)

Sources of criterion contamination

Work sample (High-fidelity simulations)

candidate performs actual task or representative task and is evaluated on proficiency

high fidelity = high realism

typically used in “blue collar” physical jobs, not those that involve social aspects

Situational exercises (problem presented and asked how you would solve it).

In-basket Exercise

Leaderless Group Discussion

more white collar, decision making ability

Low-fidelity simulations

Biographical information

Predictor of promotion, salary, absenteeism, and productivity

issues of fairness

Legal implications

Letters of recommendation

Very often used but least valid

Primarily positive

Drug testing

Best used when danger to self or others is present (e.g., forklift operator, truck driver)

Screening test

Confirmation test

Newer controversial methods of assessment

Polygraphs or Lie Detection

used in government agencies


used in other countries

predictive of affective states (e.g., stress)

Emotional Intelligence

ability to manage emotional responses in social situations

scientific status still unknown

Test reliability and validity must be defensible in court when used to make personnel decisions.

