A Short Course on Evaluation Basics
John W. Evans, PhD
Most NSU students face the joint task of implementing an intervention for their dissertation project and determining its effectiveness.
Beyond the dissertation, the ability to develop programs and to assess their effectiveness is something all educators must be able to do throughout their careers.
This course will enable you to:
Develop an understanding of what kinds of programs call for what type of evaluation.
Learn the characteristics of good evaluations and the pitfalls of bad ones.
Learn what kinds of evaluation are and are not appropriate for your dissertation.
What is the purpose of program evaluation?
To determine the extent to which a program or intervention is effective, i.e., to determine if it is successful or how well it meets its objectives.
Why is good evaluation important?
Validly determining whether an educational program works does two things:
If the evaluation evidence shows that a program
works, we are justified in continuing it (if it's an on-going, operating
program), or proceeding to full-scale implementation (if it's a pilot program).
In other words, solid evaluation evidence that a program is effective tells us
our educational efforts are working. We can be confident we are making a real
contribution to educating our students. This is not something we can take for
granted. Many programs do not, in fact, work.
On the other hand, if the evidence shows the program is not working, that is not a bad result, despite the fact that it may make your life politically or bureaucratically difficult. It's a good result because it prevents us from wasting time and money on something that is not achieving our objectives, and it provides the impetus and the basis for improving the program or substituting another one for it.
What are the characteristics of a good evaluation?
It is objective.
Self-assessments and subjective judgments of those responsible for a program have low credibility.
It is replicable.
Someone else should be able to re-do your evaluation and get the same results.
It is as methodologically strong as circumstances will permit.
We want to have confidence in the evaluation's findings, and we want the evaluation to be able to resist criticism and attack. Most educational and social action programs, and the evaluations of them, have their political supporters and detractors.
Its results are generalizable.
The results should apply to the broad range of students, classrooms, schools,
and situations to which the program is aimed, not just an atypical population or
situation.
Program Theory
Before we take even the first step toward evaluating the effectiveness of a program or intervention (which terms we'll use interchangeably), we need first to be very clear on the assumptions and the logic of the program we're proposing to evaluate:
Does it credibly address the problem?
How is it supposed to produce an impact on the problem?
A good way to answer these questions is to construct a program theory diagram like the one in Figure 1.
Figure 1. Example of a Program Theory Diagram.




Types of Evaluation
Evaluations can be categorized in many ways, but in this course we will focus on two main types, formative and summative, and the important differences in their purposes and methods.
Formative Evaluations
(Sometimes known as process, implementation, or monitoring
evaluations.)
Sample questions:
Does the program exist?
Is it operating as it is supposed to?
If not, what changes are needed to make it operational?
Are funds being appropriately spent?
Has the program hired and retained competent staff?
Are eligible participants being recruited?
Are they receiving the appropriate services?
Summative Evaluations
If we have satisfactory answers to all the formative
evaluation questions, that tells us the program is up and running properly and
delivering its services. That is extremely valuable information and something
that cannot be taken for granted. But it does not tell us whether the program
is effective. We must now ask different questions that require different
information and methods.
Is the program effective? Is it achieving its objectives? Is it producing the intended results?
Can any changes we observe in outcome or performance measures, results, or effectiveness criteria be confidently attributed to the program rather than other factors and conditions?
Alternative Summative Evaluation Methods
Self-Assessment. (The weakest and least credible type of summative evaluation, but sometimes it's all that can be done.)
Program director judgments.
Participant judgments.
Third Party Judgments and Observations. (Better than self-assessment.)
External evaluator judgments.
Expert professional judgments.
Data from non-participant observations.
Experimental and Quasi-Experimental Designs. (The strongest and therefore the most credible types of summative evaluation.)
Random assignment, pre-post measures on both treatment and control groups. (Clinical trials.)
Random assignment, post tests only on both treatment and control groups.
Pre-post measures on treatment and non-equivalent control group.
Statistical control of non-equated variables.
Interrupted time series: multiple historical measures on a treatment group before and after its exposure to the program.
a. Random assignment, pre-post measures on both treatment and control groups. (Clinical trials.)
From a large group of students, form two smaller
groups by randomly assigning students to one or the other of the two groups.
One will be the experimental or treatment group; it will receive the program.
The other will be the control group. It will not receive the program. (Why is
random assignment such a powerful technique?)
Administer the pre-test to both groups.
Expose the treatment group to the program while
withholding it from the control group.
Administer the post-test to both groups.
If there is a difference which favors the treatment group, you can be reasonably confident it was due to the program. If there is no difference, i.e., if the scores for both groups remained the same or changed equally (up or down), that indicates the program is not effective. (It is unlikely but not inconceivable that the treatment group would do significantly worse than the control group.)
Form two groups by random assignment. The treatment group gets the program; the control group does not.
T   Pre-Test   X   Post-Test
C   Pre-Test       Post-Test
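The steps of this design can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the student labels, and the scores are hypothetical and are not part of the course materials. It assigns students to groups at random and then compares the average pre-post gain of the two groups.

```python
import random
from statistics import mean

def randomly_assign(students, seed=0):
    """Shuffle the pool and split it into treatment and control halves."""
    rng = random.Random(seed)
    shuffled = list(students)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def mean_gain(pre, post, group):
    """Average post-test minus pre-test score for the students in `group`."""
    return mean(post[s] - pre[s] for s in group)

def estimated_effect(pre, post, treatment, control):
    """Treatment gain minus control gain: the estimated program effect."""
    return mean_gain(pre, post, treatment) - mean_gain(pre, post, control)

# Hypothetical students and scores, for illustration only.
students = [f"s{i}" for i in range(8)]
treatment, control = randomly_assign(students, seed=1)
pre = {s: 50 for s in students}
post = {s: (58 if s in treatment else 53) for s in students}
print(estimated_effect(pre, post, treatment, control))
```

Because the control group's gain is subtracted, a rise in scores that both groups share (from history, maturation, or testing) does not count as a program effect.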
b. Random assignment, post-tests only on both treatment and control groups.
From a large group of students, form two smaller
groups by randomly assigning students to one or the other of the two groups.
One will be the experimental or treatment group; it will receive the program.
The other will be the control group. It will not receive the program.
Pre-tests are omitted on the assumption that the scores of
the two groups are equal as a result of random assignment. This design is
useful in cases where a pre-test is impossible to administer or might alert the
participants to the program's intended effects.
Expose the treatment group to the program while
withholding it from the control group.
Administer the post-test to both groups.
If there is a difference which favors the treatment group, you can be reasonably confident it was due to the program. If there is no difference, i.e., if the post-test scores of the two groups are about the same, that indicates the program is not effective.
Form two groups by random assignment. The treatment group gets the program; the control group does not.
T   X   Post-Test
C       Post-Test
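With post-tests only, the analysis comes down to asking whether the difference between the two groups' post-test means is larger than random assignment alone would be likely to produce. One way to check this, offered here as a sketch rather than as the course's prescribed method, is a permutation test. The function name and the scores below are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(treatment_scores, control_scores,
                        n_permutations=5000, seed=0):
    """Return the observed difference in means and a one-sided p-value.

    The p-value is the share of random relabelings of the pooled scores
    whose difference in means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(treatment_scores) - mean(control_scores)
    pooled = list(treatment_scores) + list(control_scores)
    n_treat = len(treatment_scores)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treat]) - mean(pooled[n_treat:])
        if diff >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_permutations

# Hypothetical post-test scores, for illustration only.
treatment_scores = [80, 82, 85, 88, 90, 91]
control_scores = [60, 62, 65, 66, 68, 70]
difference, p_value = permutation_p_value(treatment_scores, control_scores)
print(difference, p_value)
```

A small p-value says the difference would rarely arise from the luck of the draw in assigning students, which is the sense in which we can be "reasonably confident" it was due to the program.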
c. Pre-post measures on treatment and non-equivalent control group.
The treatment group is composed of students already in the program through self- or external selection. Random assignment to treatment and control groups is not possible. Another group that is as similar to the treatment group as possible is selected for a control group. It does not receive the program.
Administer the pre-test to both groups. Differences can be adjusted later by statistical means.
Expose the treatment group to the program while withholding it from the control group.
Administer the post-test to both groups.
If there is a difference which favors the treatment group, you can be fairly confident (though less so than with random assignment) that it was due to the program. If there is no difference, i.e., if the scores for both groups remained the same or changed equally (up or down), that indicates the program is probably not effective.
The treatment group is already determined; select a similar control group. The treatment group gets the program; the control group does not.
T   Pre-Test   X   Post-Test
C   Pre-Test       Post-Test
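The statement that pre-test differences "can be adjusted later by statistical means" can be made concrete with a small sketch. The version below assumes one common approach, a covariance-style adjustment that uses the pooled within-group slope of post-test on pre-test. The function names and scores are hypothetical, and other adjustment methods exist.

```python
from statistics import mean

def pooled_within_slope(groups):
    """Slope of post-test on pre-test, pooled across groups.

    Each group is a list of (pre, post) score pairs.
    """
    sxy = 0.0
    sxx = 0.0
    for pairs in groups:
        pre_mean = mean(p for p, _ in pairs)
        post_mean = mean(q for _, q in pairs)
        sxy += sum((p - pre_mean) * (q - post_mean) for p, q in pairs)
        sxx += sum((p - pre_mean) ** 2 for p, _ in pairs)
    return sxy / sxx

def adjusted_difference(treatment, control):
    """Post-test gap between groups after removing the part explained by the pre-test gap."""
    slope = pooled_within_slope([treatment, control])
    post_gap = mean(q for _, q in treatment) - mean(q for _, q in control)
    pre_gap = mean(p for p, _ in treatment) - mean(p for p, _ in control)
    return post_gap - slope * pre_gap

# Hypothetical (pre, post) pairs; the treatment group starts 10 points higher.
treatment = [(50, 57), (60, 67), (70, 77)]
control = [(40, 42), (50, 52), (60, 62)]
print(adjusted_difference(treatment, control))
```

In this invented example the raw post-test gap is 15 points, but 10 of those points were already there at the pre-test. The adjustment can only remove differences on variables that were measured, which is why this design earns less confidence than random assignment.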
d. Controlling non-equated variables through statistical analysis (or making non-equivalent groups similar).
Oftentimes we want to evaluate the effectiveness of a program that is already in place, and we are not able to construct a treatment and a control group.
For example, suppose we wanted to evaluate the effectiveness of public schools vs. private schools on academic achievement. Let's say we looked at the average NAEP scores for 4th grade students in public and private schools, and we found the following:
Figure 2. Mean Mathematics NAEP Scores of 4th Grade Public and Private School Students.

So, should we conclude that private schools do a better job than public schools at producing higher student achievement?
You will see newspaper columnists and others reaching exactly that conclusion from the data in Figure 2. But clearly that would be a very premature judgment because we know that: (1) achievement is related to socioeconomic status (SES), and (2) there are more high-SES students in private schools than in public schools.
So what does the picture look like when we control for SES?
Figure 3. Mean Mathematics NAEP Scores of 4th Grade Public and Private School Students Controlling For Socioeconomic Status.

When we compare public and private students of the same SES, we find there is little difference in their achievement. But because there are more high SES students in private schools, the overall comparison (Figure 2) is misleading. (For the precise data on these comparisons, see Lubienski and Lubienski, "A New Look at Public and Private Schools." Phi Delta Kappan, May 2005.)
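The logic of controlling for SES can be sketched by computing the comparison twice: once overall, and once within each SES level. The records below are invented purely to show the mechanics; they are not NAEP results, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean

def mean_score_by(records, *keys):
    """Average the 'score' field within each combination of the given keys."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in keys)].append(record["score"])
    return {group: mean(scores) for group, scores in groups.items()}

# Invented illustration only; these are NOT NAEP results.
records = (
    [{"sector": "public", "ses": "low", "score": 220}] * 3
    + [{"sector": "public", "ses": "high", "score": 250}] * 1
    + [{"sector": "private", "ses": "low", "score": 220}] * 1
    + [{"sector": "private", "ses": "high", "score": 250}] * 3
)

overall = mean_score_by(records, "sector")            # comparison ignoring SES
within_ses = mean_score_by(records, "sector", "ses")  # comparison holding SES constant
print(overall)
print(within_ses)
```

In this made-up data the private sector looks better overall only because it enrolls more high-SES students; within each SES level the two sectors score the same.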
What kind of analysis like this would you do to evaluate the effect of traditionally vs. alternatively certified teachers on student achievement?
e. Interrupted time series: multiple historical measures on a treatment group before and after its exposure to the program.
In situations where a control group is not possible, three conditions together are considered good evidence of the program's impact: (1) data on the treatment group can be obtained for several periods both before and after the participants are exposed to the program, (2) there is a sudden change in scores when the program is introduced, and (3) the change continues in the periods that follow.
Time
1 2 3 4 5 6
Experimental Group 0 0 0 X X X X
New program is introduced.
(See Campbell's example of the effect of Connecticut speeding crackdown on traffic fatalities.)
Caution: Unlike treatment-and-control-group designs, where the same individuals are being compared on a pre-post basis, in the typical interrupted time series design, the historical data will usually be on different individuals. For example, to determine if a new reading program improved the achievement scores of 4th graders using the ITS design, you would compare the 4th grade scores for several years before and after the implementation of the new program. If, during this time, there has been a significant change in the quality of teachers or the demographic characteristics of the students, those changes, rather than the new program, could account for any improvement in reading scores. Thus, when using this design, it is important to ensure that the population being studied, and other factors that could affect the outcome, have not undergone major change.
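The interrupted time series logic in this section can be sketched as follows: fit a straight-line trend to the periods before the program, project that trend forward, and see how far the later scores depart from it. This is a simplified illustration under the assumption of a linear pre-program trend; the function names and scores are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

def departures_from_trend(scores, first_post_index):
    """Observed minus projected score for each period after the program starts.

    The projection extends the trend fitted to the pre-program periods only.
    """
    pre_periods = list(range(first_post_index))
    slope, intercept = fit_line(pre_periods, scores[:first_post_index])
    return [scores[t] - (intercept + slope * t)
            for t in range(first_post_index, len(scores))]

# Hypothetical yearly mean scores; the program starts in the fourth period.
scores = [50, 51, 52, 58, 59, 60]
print(departures_from_trend(scores, first_post_index=3))
```

A sudden departure that persists across the later periods matches conditions (2) and (3) above. A departure that fades, or that coincides with a change in the student population, is weaker evidence.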
Threats to the Validity of Experiments
(Summarized from Campbell, Cook, Klein, and Borg et al.) Researchers describe two types of design flaws or threats to validity that can affect the confidence we have in experiments: internal and external. Internal validity refers to the extent to which extraneous variables have been controlled by the researcher, so that any observed effect can be attributed to the treatment variable. External validity is the extent to which the findings of an experiment can be generalized beyond the sample and the setting in which it was carried out.
Threats to Internal Validity
History. Experimental treatments extend over a
period of time, providing the opportunity for other related, external events to
affect the outcome. If, for example, you were carrying out an experiment to
determine the effectiveness of a course designed to increase civic and political
awareness, and at the same time there was a widely publicized controversy over
local election issues that did not occur in other communities, that, rather than
the political awareness course, could account for any increase in awareness that
occurred.
Maturation. Any biological, physiological, or
psychological processes which change within the subjects of a study are
considered a threat of maturation to the internal validity of the study. You
might study the effects of a program of strength training in physical education
by comparing the scores before and after a semester of rigorous training for 7th
grade students. It should be apparent that the children were also 4 or 5 months
older at the end of the study and their normal gains in strength due to
maturation would need to be taken into account as you assessed the effects of
your study.
Testing. The effects of taking a test on the
outcomes of subsequent administration of the same or a highly related test can
affect the internal validity of a study. For example, if you were studying the
effects of a political awareness program, the pretest might sensitize subjects
in both experimental and control groups to be more aware of issues which could
influence their scores on the posttest. Additionally, the practice effect of
taking the same or a similar test might increase the scores as well.
Instrumentation. A learning gain might be
observed from pretest to posttest because the measuring instrument has changed.
Statistical Regression. This is the tendency
for individuals who have been selected to participate in a program on the basis
of their extremely low or high scores to score closer to the mean on a posttest
(i.e., "regress" to the mean) regardless of the effect of the program. Extreme
scores are often the result of factors and conditions that make them unstable.
Differential Selection. If random assignment of
subjects to treatment and control group was not used, and the way in which the
individuals for the treatment and control group were selected resulted in one of
the groups being significantly different from the other on variables related to
the outcome, e.g., ability or motivation, then those initial differences rather
than the intervention program could account for any pre-post differences.
Experimental Mortality. Also known simply as attrition, this refers to the fact that participants in both the treatment and control groups may be lost for various reasons. This reduces the overall numbers, which becomes a problem in testing for statistical significance between the two groups if the numbers become too small. The more important consideration, however, is that attrition is often non-random, i.e., those who drop out are likely to be different from those who don't in ability, motivation, severity of the problem, and other factors related to the outcome of the treatment. Thus, at the posttest, the two groups are likely to be different on these influential variables.
Experimental Treatment Diffusion. If, for
example, the teachers in the control group learn about what is going on in the
treatment class and begin to introduce elements of the treatment program in
their classes, and that in turn leads to greater achievement gains in the
control group than would otherwise have occurred, the control group has been
"contaminated," thereby reducing the differences between the two groups on the
posttest, and leading to the mistaken conclusion that the program was not
effective.
Compensatory Rivalry By the Control Group. This
extraneous variable involves a situation in which control group participants
perform beyond their usual level because they believe they are in competition
with or threatened by the treatment group. It is sometimes called the John
Henry effect after a legendary, low-wage laborer who worked on railroad tunnels
after the Civil War using pile drivers and hand drills to make holes for
dynamite. When a steam powered drill was introduced, Henry said he was faster
and challenged it to a contest. He outperformed the steam drill but died in the
attempt.
Resentful Demoralization of the Control Group.
This is the opposite of the John Henry effect and occurs when the performance of
participants in the control group is subnormal because they become discouraged,
believing they cannot compete with the advantages provided to the treatment
group.
Hawthorne Effect. This refers to situations in which individuals in the treatment group are aware that they are part of an experiment and feel they are expected to perform better. These expectations by themselves, not the treatment variable, may be what produce improvements in performance.
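Of the threats listed above, statistical regression is perhaps the least intuitive, and a small simulation can make it visible. The sketch below is an illustration under assumed conditions: each test score is modeled as a stable "true ability" plus random noise, no program is given at all, and every name and number is hypothetical.

```python
import random
from statistics import mean

def simulate_regression_to_mean(n_students=2000, cutoff_fraction=0.1, seed=0):
    """Select the lowest pretest scorers and return their mean pretest and posttest.

    No treatment is applied: each score is true ability plus independent noise.
    """
    rng = random.Random(seed)
    true_ability = [rng.gauss(50, 10) for _ in range(n_students)]
    pretest = [a + rng.gauss(0, 10) for a in true_ability]
    posttest = [a + rng.gauss(0, 10) for a in true_ability]
    n_selected = int(n_students * cutoff_fraction)
    lowest = sorted(range(n_students), key=lambda i: pretest[i])[:n_selected]
    return (mean(pretest[i] for i in lowest),
            mean(posttest[i] for i in lowest))

pre_mean, post_mean = simulate_regression_to_mean()
print(pre_mean, post_mean)
```

The selected students score higher on the posttest even though nothing was done for them, because part of what made their pretest scores extreme was bad luck that does not repeat. A program evaluated only on such a group would appear to work when it did not.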
Threats to External Validity
Ecological Validity and Population Validity. If the treatment effects can be obtained only under a special set of conditions (a particular classroom, school, or community) or only on a narrowly defined population, they clearly have limited applicability.
An Example Showing the Most Common Pitfall in Evaluating the Most Common Type of Dissertation:
The Intervention Project
8th grade math scores on the statewide
assessment test are below average. (Reading scores are average.)
The decision is made to revise the 8th grade
math course and use a new text with computerized instruction.
How should we evaluate this new program?
First, we must determine if the new math program was in fact implemented, which is not something we can take for granted. We must carry out a formative evaluation.
Second, we must determine if the new project
was effective in improving the math assessment scores. We must do a summative
(impact or effectiveness) evaluation.
So, at the end of the year, we administer the same test and compare the results with last year's scores. If the scores went up, we can conclude that the program was effective. If they stayed the same or went down, the program was not effective.
OK?
NOT OK, because:
We don't know what would have happened to those scores if we hadn't introduced the new program.
The math scores could have risen (or fallen) as a result of factors and conditions other than our program.
Like what?
Selective attrition or non-random dropouts. This could produce a change in the demographic composition of the 8th grade math students from the beginning to the end of the year that could result in a higher (or lower) percent of educationally disadvantaged students.
A change in math teachers from less well-qualified to better qualified (or the reverse).
An idiosyncratic event that affected the amount and quality of math instruction (e.g., a teacher strike, a school closure, etc.).
Other changes in the school curriculum or the students' exposure to more or better math instruction (e.g., science courses, out-of-school programs, etc.).
In light of these problems, how can we find out whether or not the new program improved math scores?
We need, for comparison, a measure of change under the conditions of non-treatment, i.e., we need an estimate of how much the students in our program would have improved if they hadn't experienced the program.
The best way to obtain such an estimate is
through a control group.
Ideally, the control group is comparable to the
treatment group in all respects except one: it didn't receive the treatment.
This means if the scores of the treatment group improve and those of the control
group don't, we can be reasonably confident the increase was due to the new
program.
If a control group is not possible, we can use the interrupted time-series design (assuming we have the necessary several periods of pre- and post-intervention data).
Summary of Key Points for Evaluating
Intervention Programs
You must determine if the intervention was in
fact implemented (formative evaluation). There are a lot of unfortunate
examples of proceeding to evaluate the effectiveness of a "non-program."
The effectiveness of a program cannot be
determined by simple, pre-post changes in outcome measures without controlling
for other factors that could have produced those changes.
What can be done if a control group is not possible, and if historical, time-series data are not available?
Do a full-scale formative or implementation evaluation, and postpone the impact evaluation until it can be done properly.
Retreat to a less powerful evaluation design (e.g., judgment-observation or self-assessment) while making explicitly clear its limitations.
General Principles:
Do the most powerful evaluation you can, but don't do one that is improperly done and will expose you to justified criticism.
But avoid premature summative evaluations.
Don't claim more than your evaluation will support.
When making statements or claims, err on the conservative side.
Strengths and Weaknesses of Different Summative Evaluation Methods
Strength/Weakness
| Type of Evaluation | Power, Strength, Credibility | Replicability | Logistical Feasibility | Technical, Methodological Difficulty | Time/Cost |
| Self-Assessment | Low | Low | High | Low | Low |
| Third Party Observations/Judgments | Medium | Medium | Medium to High | Low | Medium |
| Experimental and Quasi-Experimental Designs | High | High | Medium to Low | High | High |
Matrix of Type of Dissertation by Type of Evaluation
| Type of Dissertation | 1. Implement a new educational program and evaluate its effects on performance and behavior. | | | | | |
| Evaluation Question | Was it effective? | Was it implemented? | Were they implemented, and what were their effects? | Who, what, and where are the major social and political forces? | What happened? | What are the key factors and variables? |
| Type of Evaluation | | | | | | |
| Self-Assessment | Weak | Weak | Weak | Weak | Weak | Weak |
| Third-Party Observations/Judgments | Limited | OK | OK | OK | OK | OK |
| Experimental and Quasi-Experimental Designs | Strongest | N/A | OK, but may be difficult. | N/A | N/A | N/A |
Common Dissertation Projects
Effectiveness of a Title I Reading Project
Implementation of a Basic Skills Course
Effectiveness of a Special Education Teacher Training Program
Improving School Climate Through TQM
Effect of Technology Training on Teachers and Learning
Implementing a Staff Development Plan for ESL Teachers
Implementing an Induction Program for New Administrators
Effect of a Peer Helping Program on the Integration of ESL Students
Effect of a Compensatory Education Program on Competency Scores
Feasibility of Strategic Planning
Improving the Reading Skills of Grade 10 Students
Effect of a Pre Kindergarten Program on Children's Readiness
Effectiveness of Individual Instruction in the First Grade
Effect of an Accelerated Program on At Risk Learners
Effectiveness of Conflict Resolution Strategies
Effect of an Incentive Program on Sixth Grade Achievement
Effect of an Intervention Program on Middle Level Underachievers
Effect of Block Scheduling on Attitudes and Achievement
Comparative Effectiveness of Private Christian Schools
Effectiveness of Supervision Appraisal Waivers
Improving Student Achievement Through Staff Development
Improving Math Achievement With Specialized Software
Using On-Line Access to Improve Parent Satisfaction
Effect of a Mentoring Program on Disciplinary Referrals
Effectiveness of Alternative vs. Traditional Teacher Certification on Student Achievement