A Short Course on Evaluation Basics

John W. Evans, PhD

Most NSU students face the joint task of implementing an intervention for their dissertation project and determining its effectiveness. 

 

Beyond the dissertation, developing programs and assessing their effectiveness is something all educators must do throughout their careers.

 

This short course introduces the basic concepts and methods of program evaluation that you will need to do both.

What is the purpose of program evaluation?

To determine the extent to which a program or intervention is effective, i.e., to determine if it is successful or how well it meets its objectives.

Why is good evaluation important?

Validly determining whether an educational program works does two things:

  1. If the evaluation evidence shows that a program works, we are justified in continuing it (if it's an ongoing, operating program), or proceeding to full-scale implementation (if it's a pilot program).  In other words, solid evaluation evidence that a program is effective tells us our educational efforts are working.  We can be confident we are making a real contribution to educating our students.  This is not something we can take for granted.  Many programs do not, in fact, work.
     

  2. On the other hand, if the evidence shows the program is not working, that is not a bad result, despite the fact that it may make your life politically or bureaucratically difficult.  It's a good result because it prevents us from wasting time and money on something that is not achieving our objectives, and it provides the impetus and the basis for improving the program or substituting another one for it.

What are the characteristics of a good evaluation?

  1. It is objective.

    Self-assessments and subjective judgments of those responsible for a program have low credibility.

     

  2. It is replicable.

    Someone else should be able to re-do your evaluation and get the same results.

     

  3. It is as methodologically strong as circumstances will permit.

    We want to have confidence in the evaluation's findings, and we want the evaluation to be able to resist criticism and attack.  Most educational and social action programs, and the evaluations of them, have their political supporters and detractors.

     

  4. Its results are generalizable.

    The results should apply to the broad range of students, classrooms, schools, and situations to which the program is aimed, not just an atypical population or situation.

Program Theory

Before we take even the first step toward evaluating the effectiveness of a program or intervention (which terms we'll use interchangeably), we need first to be very clear on the assumptions and the logic of the program we're proposing to evaluate: 

 

Does it credibly address the problem? 

 

How is it supposed to produce an impact on the problem?

 

A good way to answer these questions is to construct a program theory diagram like the one in Figure 1. 

 

Figure 1. Example of a Program Theory Diagram.

Types of Evaluation

 

Evaluations can be categorized in many ways, but in this course we will focus on two main types, formative and summative, and on the important differences in their purposes and methods.

 

  1. Formative Evaluations

    (Sometimes known as process, implementation, or monitoring evaluations.)

    Sample questions:

     

    1. Does the program exist?
       

    2. Is it operating as it is supposed to?
       

    3. If not, what changes are needed to make it operational?
               

    4. Are funds being appropriately spent?
       

    5. Has the program hired and retained competent staff?
       

    6. Are eligible participants being recruited?
       

    7. Are they receiving the appropriate services?

       

  2. Summative Evaluations

    If we have satisfactory answers to all the formative evaluation questions, that tells us the program is up and running properly and delivering its services.  That is extremely valuable information and something that cannot be taken for granted.  But it does not tell us whether the program is effective.  We must now ask different questions that require different information and methods.

  1. Is the program effective?  Is it achieving its objectives?  Is it producing the intended results?
     

  2. Can any changes we observe in outcome or performance measures, results, or effectiveness criteria be confidently attributed to the program rather than other factors and conditions?

Alternative Summative Evaluation Methods

  1. Self-Assessment.  (The weakest and least credible type of summative evaluation, but sometimes it's all that can be done.)

     a. Program director judgments.

     b. Participant judgments.

  2. Third Party Judgments and Observations.  (Better than self-assessment.)

     a. External evaluator judgments.

     b. Expert professional judgments.

     c. Data from non-participant observations.

  3. Experimental and Quasi-Experimental Designs.  (The strongest and therefore the most credible types of summative evaluation.)

     a. Random assignment, pre-post measures on both treatment and control groups.  (Clinical trials.)

     b. Random assignment, post-tests only on both treatment and control groups.

     c. Pre-post measures on treatment and non-equivalent control group.

     d. Statistical control of non-equated variables.

     e. Interrupted time series:  multiple historical measures on a treatment group before and after its exposure to the program.

a.  Random assignment, pre-post measures on both treatment and control groups. (Clinical trials.)

  1. From a large group of students, form two smaller groups by randomly assigning students to one or the other of the two groups.  One will be the experimental or treatment group; it will receive the program.  The other will be the control group.  It will not receive the program.  (Why is random assignment such a powerful technique?)
     

  2. Administer the pre-test to both groups.
     

  3. Expose the treatment group to the program while withholding it from the control group.
     

  4. Administer the post-test to both groups.
     

  5. If there is a difference which favors the treatment group, you can be reasonably confident it was due to the program.  If there is no difference, i.e., if the scores for both groups remained the same or changed equally (up or down), that indicates the program is not effective.  (It is unlikely but not inconceivable that the treatment group would do significantly worse than the control group.)

 

            Form two groups by random assignment.
            The treatment group (T) gets the program; the control group (C) does not.

            T:    Pre-Test    X    Post-Test
            C:    Pre-Test         Post-Test
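
Why is random assignment such a powerful technique (the question raised in step 1)?  With enough students, chance alone tends to balance the two groups on every characteristic, measured or unmeasured, so a post-program difference can reasonably be attributed to the program.  Below is a minimal sketch of this design in Python; the student IDs and scores are invented, it assumes the scipy library is available, and the gain-score t-test is one common analysis choice, not the only one.

```python
# Hypothetical sketch of design (a): random assignment, pre-post measures.
import random
from scipy import stats

random.seed(42)                         # reproducible assignment
students = [f"student_{i}" for i in range(40)]
random.shuffle(students)                # random assignment: shuffle, then split
treatment, control = students[:20], students[20:]

# Invented pre- and post-test scores, keyed by student ID.
pre  = {s: random.gauss(70, 10) for s in students}
post = {s: pre[s] + random.gauss(5 if s in treatment else 0, 5) for s in students}

# Compare the gain scores (post - pre) of the two groups.
gains_t = [post[s] - pre[s] for s in treatment]
gains_c = [post[s] - pre[s] for s in control]
t_stat, p_value = stats.ttest_ind(gains_t, gains_c)
print(f"mean gain, treatment: {sum(gains_t)/len(gains_t):.1f}")
print(f"mean gain, control:   {sum(gains_c)/len(gains_c):.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

The essential point of the shuffle-and-split step is that nothing about a student influences which group he or she lands in.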

b.  Random assignment, post-tests only on both treatment and control groups.

  1. From a large group of students, form two smaller groups by randomly assigning students to one or the other of the two groups.  One will be the experimental or treatment group; it will receive the program.  The other will be the control group.  It will not receive the program.
     

  2. Pre-tests are omitted on the assumption that the scores of the two groups are equal as a result of random assignment.  This design is useful in cases where a pre-test is impossible to administer or might alert the participants to the program's intended effects.
     

  3. Expose the treatment group to the program while withholding it from the control group.
     

  4. Administer the post-test to both groups.
     

  5. If there is a difference which favors the treatment group, you can be reasonably confident it was due to the program.  If there is no difference, i.e., if the scores for both groups remained the same or changed equally (up or down), that indicates the program is not effective.

 

            Form two groups by random assignment.
            The treatment group (T) gets the program; the control group (C) does not.

            T:    X    Post-Test
            C:         Post-Test
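
Here is a correspondingly minimal sketch of the post-test-only analysis (invented scores, again assuming scipy).  Because random assignment is relied on to equate the groups at the outset, the analysis reduces to a direct comparison of post-test scores.

```python
# Hypothetical sketch of design (b): random assignment, post-tests only.
from scipy import stats

# Invented post-test scores; random assignment is assumed to have
# equated the groups before the program began.
post_treatment = [82, 78, 90, 75, 88, 85, 79, 91, 84, 77]
post_control   = [74, 80, 72, 69, 78, 71, 75, 70, 73, 76]

t_stat, p_value = stats.ttest_ind(post_treatment, post_control)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

Under this design the post-test difference itself is the estimate of the program effect; there is no pre-test to difference away.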

c.  Pre-post measures on treatment and non-equivalent control group.

  1. The treatment group is composed of students already in the program through self- or external selection.  Random assignment to treatment and control groups is not possible.  Another group that is as similar to the treatment group as possible is selected for a control group.  It does not receive the program. 

  2. Administer the pre-test to both groups.  Differences can be adjusted later by statistical means.

  3. Expose the treatment group to the program while withholding it from the control group.

  4. Administer the post-test to both groups.

  5. If there is a difference which favors the treatment group, you can be fairly confident (though less so than with random assignment) that it was due to the program.  If there is no difference, i.e., if the scores for both groups remained the same or changed equally (up or down), that indicates the program is probably not effective.

            The treatment group (T) is already determined; it gets the program.
            Select a similar control group (C); it does not.

            T:    Pre-Test    X    Post-Test
            C:    Pre-Test         Post-Test
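
One common way to make the statistical adjustment mentioned in step 2 is to regress the post-test on the pre-test plus a group indicator, an ANCOVA-style analysis.  The sketch below uses invented data and assumes the pandas and statsmodels libraries; it illustrates the idea rather than prescribing a method.

```python
# Hypothetical sketch of design (c): adjusting for pre-test differences
# between non-equivalent groups.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "pre":   [60, 65, 70, 72, 75, 58, 62, 68, 71, 74],
    "post":  [70, 74, 80, 83, 85, 62, 66, 71, 75, 78],
    "group": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],   # 1 = treatment, 0 = control
})

# Regress post-test on pre-test and group membership.
model = smf.ols("post ~ pre + group", data=df).fit()
# The coefficient on `group` estimates the program effect after
# statistically adjusting for initial (pre-test) differences.
print(f"adjusted program effect: {model.params['group']:.2f}")
print(f"p-value:                 {model.pvalues['group']:.3f}")
```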

d.  Controlling non-equated variables through statistical analysis (or making non-equivalent groups similar).

Oftentimes we want to evaluate the effectiveness of a program that is already in place, and we are not able to construct a treatment and a control group. 

 

For example, suppose we wanted to evaluate the effectiveness of public schools vs. private schools on academic achievement.  Let's say we looked at the average NAEP scores for 4th grade students in public and private schools, and we found the following:

Figure 2. Mean Mathematics NAEP Scores of 4th Grade Public and Private School Students.

So, should we conclude that private schools do a better job than public schools at producing higher student achievement?  

 

You will see newspaper columnists and others reaching exactly that conclusion from the data in Figure 2.  But clearly that would be a very premature judgment because we know that: (1) achievement is related to socioeconomic status, and (2) private schools enroll a higher proportion of high-SES students than public schools do.

 

So what does the picture look like when we control for SES?

Figure 3. Mean Mathematics NAEP Scores of 4th Grade Public and Private School Students, Controlling for Socioeconomic Status.

When we compare public and private students of the same SES, we find there is little difference in their achievement. But because there are more high SES students in private schools, the overall comparison (Figure 2) is misleading.  (For the precise data on these comparisons, see Lubienski and Lubienski, "A New Look at Public and Private Schools."  Phi Delta Kappan, May 2005.)
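
The mechanics of this kind of statistical control can be illustrated with a stratified comparison.  In the sketch below (Python with pandas), the numbers are invented to mimic the pattern described above, not the actual NAEP data: within each SES level the two sectors score about the same, but because private schools enroll more high-SES students, the unadjusted means favor them.

```python
# Hypothetical illustration of statistical control by stratification.
import pandas as pd

df = pd.DataFrame({
    "sector": ["public"]*6 + ["private"]*6,
    "ses":    ["low", "low", "mid", "mid", "high", "high"] * 2,
    "score":  [215, 217, 230, 232, 245, 247,      # public
               216, 218, 231, 233, 246, 248],     # private
})
# Invented enrollment pattern: public schools have more low-SES students,
# private schools more high-SES students (rows repeated by weight).
weights = [3, 3, 2, 2, 1, 1,   1, 1, 2, 2, 3, 3]
df = df.loc[df.index.repeat(weights)]

print(df.groupby("sector")["score"].mean())           # unadjusted: private higher
print(df.groupby(["ses", "sector"])["score"].mean())  # within SES: nearly equal
```

The same approach could be adapted to the teacher-certification question that follows.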

How would you use this kind of analysis to evaluate the effect of traditionally vs. alternatively certified teachers on student achievement?

 

e. Interrupted time series:  multiple historical measures on a treatment group before and after its exposure to the program.

In situations where a control group is not possible, the interrupted time series (ITS) design can provide good evidence of a program's impact if (1) data on the treatment group can be obtained for several periods both before and after the participants are exposed to the program, (2) there is a sudden change in scores when the program is introduced, and (3) the change persists afterward.

            Time                    1    2    3         4    5    6

            Experimental Group      O    O    O    X    O    O    O

                                                 New program is introduced.

 

(See Campbell's example of the effect of the Connecticut speeding crackdown on traffic fatalities.)

 


Caution:  Unlike treatment-and-control-group designs, where the same individuals are compared on a pre-post basis, in the typical interrupted time series design the historical data will usually be on different individuals.  For example, to determine if a new reading program improved the achievement scores of 4th graders using the ITS design, you would compare the 4th grade scores for several years before and after the implementation of the new program.  If, during this time, there has been a significant change in the quality of teachers or the demographic characteristics of the students, those changes, rather than the new program, could account for any improvement in reading scores.  Thus, when using this design, it is important to ensure that the population being studied, and other factors that could affect the outcome, have not undergone major change.
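
One common way to analyze an interrupted time series is segmented regression: fit a time trend plus an indicator for the post-program periods, and read the indicator's coefficient as the jump associated with the program's introduction.  A minimal sketch (invented annual scores, assuming pandas and statsmodels; the course text does not prescribe a particular analysis):

```python
# Hypothetical interrupted time-series analysis via segmented regression.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "score": [250, 251, 252, 260, 261, 263],  # 3 years pre, 3 years post
    "t":     range(1, 7),                     # time trend
    "post":  [0, 0, 0, 1, 1, 1],              # 1 once the program starts
})
model = smf.ols("score ~ t + post", data=df).fit()
# The `post` coefficient estimates the jump in scores when the program
# was introduced, over and above the pre-existing time trend.
print(model.params)
```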

Threats to The Validity of Experiments
(Summarized from Campbell, Cook, Klein, and Borg et al.)

Researchers describe two types of design flaws or threats to validity that can affect the confidence we have in experiments: internal and external.  Internal validity refers to the extent to which extraneous variables have been controlled by the researcher, so that any observed effect can be attributed to the treatment variable.  External validity is the extent to which the findings of an experiment can be generalized beyond the sample and the setting in which it was carried out.

Threats to Internal Validity

  1. History.  Experimental treatments extend over a period of time, providing the opportunity for other related, external events to affect the outcome.  If, for example, you were carrying out an experiment to determine the effectiveness of a course designed to increase civic and political awareness, and at the same time there was a widely publicized controversy over local election issues that did not occur in other communities, that, rather than the political awareness course, could account for any increase in awareness that occurred. 
     

  2. Maturation.  Any biological, physiological, or psychological processes which change within the subjects of a study are considered a threat of maturation to the internal validity of the study.  You might study the effects of a program of strength training in physical education by comparing the scores before and after a semester of rigorous training for 7th grade students.  It should be apparent that the children were also 4 or 5 months older at the end of the study and their normal gains in strength due to maturation would need to be taken into account as you assessed the effects of your study.
     

  3. Testing.  The effects of taking a test on the outcomes of subsequent administration of the same or a highly related test can affect the internal validity of a study.  For example, if you were studying the effects of a political awareness program, the pretest might sensitize subjects in both experimental and control groups to be more aware of issues which could influence their scores on the posttest.  Additionally, the practice effect of taking the same or a similar test might increase the scores as well.
     

  4. Instrumentation.  A learning gain might be observed from pretest to posttest because the measuring instrument has changed.
     

  5. Statistical Regression.  This is the tendency for individuals who have been selected to participate in a program on the basis of their extremely low or high scores to score closer to the mean on a posttest (i.e., "regress" to the mean) regardless of the effect of the program.  Extreme scores are often the result of factors and conditions that make them unstable.  (A small simulation after this list illustrates the effect.)
     

  6. Differential Selection.  If random assignment of subjects to treatment and control groups was not used, and the way in which the individuals for the treatment and control groups were selected resulted in one of the groups being significantly different from the other on variables related to the outcome, e.g., ability or motivation, then those initial differences, rather than the intervention program, could account for any pre-post differences.
     

  7. Experimental Mortality.  Also known simply as attrition, this refers to the fact that participants in both the experimental and control groups may be lost for various reasons.  This reduces the overall numbers, which becomes a problem in testing for statistical significance if the groups become too small.  The more important consideration, however, is that attrition is often non-random, i.e., those who drop out are likely to differ from those who don't in ability, motivation, severity of the problem, and other factors related to the outcome of the treatment.  Thus, at the posttest, the two groups are likely to differ on these influential variables.
     

  8. Experimental Treatment Diffusion.  If, for example, the teachers in the control group learn about what is going on in the treatment class and begin to introduce elements of the treatment program in their own classes, and that in turn leads to greater achievement gains in the control group than would otherwise have occurred, the control group has been "contaminated."  This reduces the differences between the two groups on the posttest and can lead to the mistaken conclusion that the program was not effective.
     

  9. Compensatory Rivalry by the Control Group.  This extraneous variable involves a situation in which control group participants perform beyond their usual level because they believe they are in competition with or threatened by the treatment group.  It is sometimes called the John Henry effect after a legendary, low-wage laborer who worked on railroad tunnels after the Civil War using pile drivers and hand drills to make holes for dynamite.  When a steam-powered drill was introduced, Henry said he was faster and challenged it to a contest.  He outperformed the steam drill but died in the attempt.
     

  10. Resentful Demoralization of the Control Group.  This is the opposite of the John Henry effect and occurs when the performance of participants in the control group is subnormal because they become discouraged, believing they cannot compete with the advantages provided to the treatment group.
     

  11. Hawthorne Effect.  This refers to situations in which individuals in the treatment group are aware that they are part of an experiment and feel they are expected to perform better.  These expectations by themselves, not the treatment variable, may be what produce improvements in performance.
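
A small simulation makes statistical regression (threat 5) concrete.  Using only the Python standard library and invented numbers, it selects students for their extremely low first-test scores and shows that they score closer to the mean on a second test even though no program was applied.

```python
# Hypothetical simulation of regression to the mean.
import random

random.seed(1)
true_ability = [random.gauss(100, 10) for _ in range(1000)]
test1 = [a + random.gauss(0, 10) for a in true_ability]  # score = ability + luck
test2 = [a + random.gauss(0, 10) for a in true_ability]  # fresh luck, no treatment

# Select the bottom ~10% on test 1, then look at the same students on test 2.
cutoff = sorted(test1)[100]
selected = [i for i, s in enumerate(test1) if s <= cutoff]
m1 = sum(test1[i] for i in selected) / len(selected)
m2 = sum(test2[i] for i in selected) / len(selected)
print(f"selected group, test 1 mean: {m1:.1f}")  # well below 100
print(f"selected group, test 2 mean: {m2:.1f}")  # closer to 100: regression
```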

Threats to External Validity

Ecological Validity and Population Validity.  If the treatment effects can be obtained only under a special set of conditions (a particular classroom, school, or community) or only with a narrowly defined population, they clearly have limited applicability.

An Example Showing the Most Common Pitfall in Evaluating the Most Common Type of Dissertation:
The Intervention Project

Suppose we introduce a new math program for our 8th grade students, and their average math scores rise from the beginning of the year to the end.  We conclude the program worked.  OK?

 

NOT OK, because:

 

We don't know what would have happened to those scores if we hadn't introduced the new program.

 

The math scores could have risen (or fallen) as a result of factors and conditions other than our program.

 

Like what?

  1. Selective attrition or non-random dropouts. This could produce a change in the demographic composition of the 8th grade math students from the beginning to the end of the year that could result in a higher (or lower) percent of educationally disadvantaged students.
     

  2. A change in math teachers from less well-qualified to better qualified (or the reverse).
     

  3. An idiosyncratic event that affected the amount and quality of math instruction (e.g., a teacher strike, a school closure, etc.).
     

  4. Other changes in the school curriculum or the students' exposure to more or better math instruction (e.g., science courses, out-of-school programs, etc.).

In light of these problems, how can we find out whether or not the new program improved math scores?

 

We need, for comparison, a measure of change under the conditions of non-treatment, i.e., we need an estimate of how much the students in our program would have improved if they hadn't experienced the program.

  1. The best way to obtain such an estimate is through a control group.

    Ideally, the control group is comparable to the treatment group in all respects except one:  it didn't receive the treatment.

    This means if the scores of the treatment group improve and those of the control group don't, we can be reasonably confident the increase was due to the new program.
     

  2. If a control group is not possible, we can use the interrupted time-series design (assuming we have the necessary several periods of pre- and post-intervention data).

Summary of Key Points for Evaluating
Intervention Programs

  1. You must determine if the intervention was in fact implemented (formative evaluation).  There are a lot of unfortunate examples of proceeding to evaluate the effectiveness of a "non-program."
     

  2. The effectiveness of a program cannot be determined by simple, pre-post changes in outcome measures without controlling for other factors that could have produced those changes. 
     

  3. What can be done if a control group is not possible, and if historical, time-series data are not available?

     a. Do a full-scale formative or implementation evaluation, and postpone the impact evaluation until it can be done properly.

     b. Retreat to a less powerful evaluation design (e.g., judgment-observation or self-assessment) while making its limitations explicitly clear.

  4. General Principles:

     a. Do the most powerful evaluation you can, but don't do one improperly; a flawed evaluation will expose you to justified criticism.

     b. Avoid premature summative evaluations.

     c. Don't claim more than your evaluation will support.

     d. When making statements or claims, err on the conservative side.

Strengths and Weaknesses of Different Summative
Evaluation Methods

 

                                          Power,
                                          Strength,                     Logistical        Technical,
Type of Evaluation                        Credibility   Replicability   Feasibility       Methodological   Time/Cost
                                                                                          Difficulty

Self-Assessment                           Low           Low             High              Low              Low

Third Party Observations/Judgments        Medium        Medium          Medium to High    Low              Medium

Experimental and Quasi-Experimental       High          High            Medium to Low     High             High
Designs

 

 

Matrix of Type of Dissertation
by Type of Evaluation

Type of Dissertation                         Evaluation Question              Self-        Third-Party      Experimental and
                                                                              Assessment   Observations/    Quasi-Experimental
                                                                                           Judgments        Designs

1. Implement a new educational program       Was it effective?                Weak         Limited          Strongest
   and evaluate its effects on
   performance and behavior.

2. Establish a new educational program       Was it implemented?              Weak         OK               N/A
   (no impact evaluation).

3. Implement and evaluate administrative     Were they implemented, and       Weak         OK               OK, but may be
   or organizational changes.                what were their effects?                                       difficult.

4. Determine important community             Who, what, and where are the     Weak         OK               N/A
   attitudes and influences.                 major social and political
                                             forces?

5. Chronicle the history of a major          What happened?                   Weak         OK               N/A
   policy change or event in the school.

6. Exploratory case study.                   What are the key factors         Weak         OK               N/A
                                             and variables?

 

Common Dissertation Projects

Effectiveness of a Title I Reading Project
Implementation of a Basic Skills Course
Effectiveness of a Special Education Teacher Training Program
Improving School Climate Through TQM
Effect of Technology Training on Teachers and Learning
Implementing a Staff Development Plan for ESL Teachers
Implementing an Induction Program for New Administrators
Effect of a Peer Helping Program on the Integration of ESL Students
Effect of a Compensatory Education Program on Competency Scores
Feasibility of Strategic Planning
Improving the Reading Skills of Grade 10 Students
Effect of a Pre-Kindergarten Program on Children's Readiness
Effectiveness of Individual Instruction in the First Grade
Effect of an Accelerated Program on At-Risk Learners
Effectiveness of Conflict Resolution Strategies
Effect of an Incentive Program on Sixth Grade Achievement
Effect of an Intervention Program on Middle Level Underachievers
Effect of Block Scheduling on Attitudes and Achievement
Comparative Effectiveness of Private Christian Schools
Effectiveness of Supervision Appraisal Waivers
Improving Student Achievement Through Staff Development
Improving Math Achievement With Specialized Software
Using On-Line Access to Improve Parent Satisfaction
Effect of a Mentoring Program on Disciplinary Referrals
Effectiveness of Alternative vs. Traditional Teacher Certification on Student Achievement

 
