Discussion of the Tennessee Value Added Assessment System

The following is an edited log of an asynchronous online discussion of the Tennessee Value Added Assessment System (TVAAS) for assessing teachers. This discussion took place in late 1994 and early 1995 on an internet LISTSERV known as EDPOLYAN, which was housed at the Arizona State University College of Education and managed by Gene Glass, a professor there. The log of the discussion has been edited in an attempt to add clarity and reduce redundancy. Each participant in the discussion was given the opportunity to review the edited log and make corrections.
Questions about this transcript of the discussion can be directed to Gene Glass at glass@asu.edu.

December 1995




=========================================================================
Date:         Fri, 9 Sep 1994 12:57:52 EDT
From:         Scriven@AOL.COM

"Do you really want to see teaching become an even higher-turnover
profession?" No, and it's because of the bad teacher evaluation systems now
in use that we lose many good teachers, so I'm trying to improve the
evaluation systems. Piece work is not the best way; it's just a
counter-example to the idea one can't use individual rewards.
 
I do include the TN Value Added System as one of the new and promising
efforts, and I'd like to hear what you see as its flaws.
 
The reason for rewarding teachers as individuals is that nearly all of them
work as individuals; the Deming approach is fine for situations where you
only have groups as the work unit.
------------------------------------------------------------

Date:         Fri, 9 Sep 1994 23:17:35 -0500
From:         SHERMAN DORN 
  
For those of you unaware of the Tennessee Value Added Assessment system
(mandated by state law), I'll try to describe it briefly:  you take
the standard scores from the state normative test for each child, subtract
from it the standard score for the child from the previous year, and use
the gain score (essentially a putative change up or down a normal curve)
in a statistical equation with district, school, and teacher as explanatory
variables.  (For statisticians out there, I know Bill Sanders, the
agronomist who created VAA [I guess he got the idea from measuring
corn yields or something similar], uses a mixed-model equation.  I don't
know, however, which are the fixed-effect variables and which the
random-effect variables.)
 
The problems with VAA?  Well, I can think of a few:
 
        a.  The intrinsic worth of the standardized tests is
        questionable, at best.  For test security, test questions
        and structure are confidential.  Which means we can't judge
        what, in fact, the things are measuring.  Not to mention the
        usual shenanigans that occur in schools with high-stakes
        tests.
 
        b.  The tests for different grades (first, second, etc.) have
        different norming populations that are not equivalent:  some
        people are retained in every grade, there's some differential
        migration, more kids in higher grades are certified in
        special ed (and thus likely excluded from the norming population) --
        with the end result that a standard score of 500 in first grade
        (putatively at the mean of a normal curve) cannot mean the
        same thing for ANY OTHER GRADE.  Ergo, subtracting standard
        scores is patent nonsense.  (Essentially, having noncomparable
        norming populations does two things:  shifts the mean around,
        and changes the standard deviation, but in ways we just can't
        know without having access to the original norming group.)
 
        c.  A gain score is a questionable basis for statistical
        analysis.
 
        d.  Since the standardized tests are taken at least one, and
        in some years almost two, months before the end of the year,
        gain scores conflate the effects of two different teachers.
 
        e.  VAA may seriously underestimate the effects of prior
        knowledge, social background, etc.  I know of no analysis to date
        of the implications of model misspecification for VAA -- as far
        as I know, Sanders just plugs in last year's scores, this year's
        scores, the districts, schools, and teachers, and lets the
        program rip, without reference to race, sex, gender, or even
        the possibility of a nonlinear relationship between last year's
        score and this year's.  Would the effects be different if you
        put in sex, race, economic class, perhaps a square of last year's
        scores, in the equation?  I bet no one knows.
 
        f.  VAA is not an evaluation system accessible to teacher
        understanding.  It is as abstruse and anxiety-provoking as
        anything one could imagine.  An incentive for better teaching?
        Absolutely not, from my experience working with teachers.
        An incentive for absolute panic?  You bet.
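
{Editor's sketch of point b, under stated assumptions.  If a standard
score is a student's distance from a norming group's mean, measured in
that group's standard deviations, then the same raw performance lands
on different standard scores when the norming group changes.  The
scale SD of 100 and both norming groups below are hypothetical; none
of these numbers come from the TCAP or CTBS/4 norms.}

```python
def standard_score(raw, norm_mean, norm_sd, scale_mean=500.0, scale_sd=100.0):
    """Place a raw score on a reporting scale defined by one norming group.

    scale_mean=500 follows the example in the message above;
    scale_sd=100 is an assumption made only for this illustration.
    """
    z = (raw - norm_mean) / norm_sd
    return scale_mean + scale_sd * z


# The same raw performance, scored against two hypothetical norming groups.
same_raw = 50.0
score_group_1 = standard_score(same_raw, norm_mean=50.0, norm_sd=10.0)
# Second group: lower scorers excluded, so its mean is higher and its SD smaller.
score_group_2 = standard_score(same_raw, norm_mean=55.0, norm_sd=8.0)

# Subtracting the two standard scores yields an apparent change even though
# the raw performance did not move.
apparent_gain = score_group_2 - score_group_1
print(score_group_1, score_group_2, apparent_gain)
```

{With these made-up numbers the subtraction reports a change that is
produced entirely by the shift in norming group, which is the concern
point b raises.  Whether the actual tests behave this way depends on
how their scales were constructed.}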
 
 
>The reason for rewarding teachers as individuals is that nearly all of them
>work as individuals; the Deming approach is fine for situations where you
>only have groups as the work unit.
 
The question is not whether people work as individuals or groups, but whether
the *work* is individual or group.  And, unless I'm wrong, the socialization
and education of children is not done by an individual but by lots of
people.  Is the third-grader who reads poorly the fault of the third-grade
teacher?  Perhaps a bit, for plenty of third-grade teachers will refuse to
teach a child who cannot read.  But it is certainly the fault of prior
teachers.  Same for a high-achieving child in a subject.
 
And the status quo of individual teaching does not justify the use of
individual incentives; we need to compare individual merit-pay schemes with
more collective incentive systems combined with a change in teaching
structure.  The fact that teachers do not work together now very often is
not a justification for disincentives from working together forever.
 
-----------------------------
Date:         Thu, 27 Oct 1994 13:22:16 LCL
From:         "William L. Sanders" 
 
     Last week, a response to one of Scriven's comments by Dorn (9-
9-94) was called to my attention.
 
     I am writing in response to this totally erroneous and
distortive description of the Tennessee Value-Added Assessment
System (TVAAS) .  Never in my 28+ years working in the academic
arena have I read or seen such a blatant misrepresentation of the
truth--a misrepresentation which, I hope, is based upon a lack of
knowledge instead of a deliberate attempt to sabotage.  I want to
respond to this gross misrepresentation point by point.
 
 
     First, I am not an agronomist; I am a statistician.  For many
years, I have had the responsibility for the Statistical and
Computing Services Unit within the Agricultural Experiment Station.
The University of Tennessee being the Land-Grant university of
Tennessee, not unlike Iowa State or Purdue or NC State, etc. has
supported statistical consulting and research through its
Experiment Station for many years.  I am also an Adjunct Professor
in the Department of Statistics which at UTK is in the College of
Business.  I let my professional record and research speak for
itself.  My research interests are and have been in the areas of :
1. experimental design and 2. statistical mixed models.  I came
into the educational research arena in the early '80's quite by
accident; but due to an apparent need to solve many of the
statistical problems which educational types were citing, this has
become in the past 5-6 years my primary focus.
 
     What is so distressing about Dorn's comments is the blatant
lack of knowledge about our work and methodology. He claims to know
me; as I write this, I would not know Sherman Dorn if he walked
into my office.  I have made over 200 presentations conceptualizing
this approach; if I met him during and after one of those
presentations, I certainly do not remember it.  If he is so
committed to 'trash' this approach at least he should have the
common decency to make accurate and objective academic arguments
without relying on petty slurs and innuendos.
 
     Second, his attempt to describe the modeling approaches is
totally inaccurate.  We do not even calculate simple gains. For
example, we use the whole observation vector for each child over
all subjects and grades.  In fact, we have a 'small' article later
presented at an American Statistical Association meeting that
demonstrates how this approach is superior to traditional
multivariate approaches in that the whole observational vector is
not lost due to one missing value for a variable.
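
{Editor's sketch of the missing-value point, under stated assumptions.
It does not reproduce the TVAAS computation; it only counts how much
data a complete-case (listwise deletion) analysis would discard
compared with an approach that can use every observed score.  The
score table is hypothetical.}

```python
import numpy as np

# Hypothetical scores: rows are students, columns are tests over time.
# np.nan marks a test the student did not take.
scores = np.array([
    [480.0, 510.0, 530.0, 555.0],
    [500.0, np.nan, 540.0, 560.0],
    [455.0, 470.0, np.nan, 505.0],
    [520.0, 545.0, 570.0, 590.0],
])

observed = ~np.isnan(scores)           # True where a score exists
complete_rows = observed.all(axis=1)   # students with no missing test

# Listwise deletion keeps only the complete rows.
kept_listwise = int(observed[complete_rows].sum())
# An all-available-data approach can draw on every observed score.
kept_all_available = int(observed.sum())

print(kept_listwise, kept_all_available)
```

{Here two students each miss one test.  Listwise deletion drops their
whole records, while the remaining scores of those students are still
available to a method that tolerates incomplete vectors.}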
 
     But the most telling comment of his ignorance of the TVAAS
process was when he indicated that he did not know what was
'random' in the model.  The process is built on the formulation of
the late C.R. Henderson, a Cornell animal breeder who was named a
fellow in the American Statistical Association for his pioneering
work in this area.  Henderson's development of the concept of best
linear unbiased prediction (BLUP), has been shown to be related to
other shrinkage estimator concepts by David Harville at Iowa State
and others (e.g., many Bayesian concepts, Kalman filtering from the
engineering sciences, some of the hierarchical linear model
concepts, etc.) .  However, we have found that Henderson's
formulations have tremendous computing advantages over some other
equivalent alternatives.  In fact, while I was a consultant to SAS
Institute, Inc., as they were planning and implementing their
rather new MIXED procedure, I strongly recommended that they use
this formulation in their development and manuals -- a
recommendation which was accepted.
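
{Editor's sketch of Henderson's mixed model equations in their
simplest textbook form, for y = Xb + Zu + e with independent random
effects and errors and a known variance ratio lam = var(e)/var(u).
The data, the single intercept, and lam = 4 are hypothetical, and this
is not the TVAAS model itself; it only shows the shrinkage behavior
of BLUP mentioned above.}

```python
import numpy as np


def solve_mme(y, X, Z, lam):
    """Solve Henderson's mixed model equations.

    Assumes var(u) = s2u * I and var(e) = s2e * I, with lam = s2e / s2u.
    Returns (fixed-effect estimates, predicted random effects).
    """
    p, q = X.shape[1], Z.shape[1]
    lhs = np.block([
        [X.T @ X, X.T @ Z],
        [Z.T @ X, Z.T @ Z + lam * np.eye(q)],
    ])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    solution = np.linalg.solve(lhs, rhs)
    return solution[:p], solution[p:]


# Hypothetical gains: teacher 0 has 2 students, teacher 1 has 6.
teacher = np.array([0, 0, 1, 1, 1, 1, 1, 1])
y = np.array([12.0, 14.0, 7.0, 9.0, 8.0, 6.0, 10.0, 8.0])
X = np.ones((len(y), 1))   # one fixed effect: the overall mean
Z = np.eye(2)[teacher]     # indicator matrix linking students to teachers
lam = 4.0                  # assumed variance ratio

beta_hat, u_hat = solve_mme(y, X, Z, lam)
print(beta_hat, u_hat)
```

{In this simple case each predicted teacher effect equals the
teacher's raw deviation from the estimated mean multiplied by
n / (n + lam), where n is the number of students.  The teacher with 2
students is pulled toward zero by a factor of 2/6, the teacher with 6
students by 6/10, so estimates resting on less information are shrunk
more.}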
 
     As we apply these approaches in the context of the estimation
of the teacher and school effects on the academic growth of
populations of students, we take advantage of the prior knowledge
of the distribution of the variance-covariance structure among
populations of teachers, as well as the variance-covariance
structure among students.  By so doing, we solve the following
problems:
           1. fractured student records
           2. teachers changing assignments
           3. modes of instruction (team teaching,
              departmental instruction, self contained classrooms)
           4. use of dissimilar indicator variables over time.
           5. combinations of different quantities and qualities
             of information
 
              etc., etc., etc., etc., .......
 
     Dorn indicated that "since the standardized tests are taken at
least one, in some years almost two, months before the end of the
year, gain scores conflate the effects of two different teachers".
This is another example of his lack of knowledge of the process.
We have developed a process which we call the 'stacked block
concept' which enables the partitioning of these effects.  This
concept is not totally dissimilar to the recovery of interblock
information from incomplete block designs.
 
     I invite any of you, including Dorn, to call and we will
supply you with as much detail and information as you wish.
 
     As to his concern with regard to model specification, we have
done a tremendous amount of computer simulation to evaluate just
that concern, and we can demonstrate to any reasonable person that
this process is very robust against rather severe misspecification.
 
     We welcome legitimate and qualified criticism.  For the past
four years, we have developed and implemented this process for the
whole state of Tennessee.  Presently, the data base contains over
3 million records longitudinally merged for the entire state's
student population.  To provide these estimates for all schools
(and next year all teachers grades 3-8 for the entire state), the
solution to tens of thousands of equations is required. We are
doing this computing on a dedicated RS-6000 workstation.  Because
of this work crunch, our writing has lagged; however, we do have a
sufficient amount of written material to enable informed comment.
Informed comment, we welcome; hatchet jobs like Dorn just pulled,
have no place on the internet.  I describe his recent message on
the board as worse than political dirty tricks.  That kind of
activity is not consistent with the academic reputation and
tradition of one of Tennessee's treasures, Vanderbilt University.
 
=========================================================================
Date:         Fri, 28 Oct 1994 07:41:07 -0500
From:         SHERMAN DORN 
 
I will reply at some length to Prof. Sanders' comments later, but just
a few points here:
 
1.  I have searched in several places for peer-reviewed descriptions
of the Tennessee Value-Added Assessment System (TVAAS), as well as looking
at the text of the legislation in Tennessee Code Annotated.  Until a
member of the staff at the center producing the assessments told me
last week of an article in the *most recent* Journal of Personnel
Evaluation in Education, I have been unable to locate a single
article about it through ERIC or the other periodical indices I have
searched.  What I described in September was the result of information
I had at the time -- which, by the way, was more than what teachers
and principals are given to understand about TVAAS.  The legislation
was passed several years ago; the fact that I have been unable to locate
peer-reviewed articles on the subject using abstract indices in 1994 is
not evidence of laziness or misrepresentation on my part, I believe.
 
(It is not evidence of either on the part of TVAAS center staff, either,
just not evidence about my actions.)
 
2.  I stated very clearly that I *knew* of no attempts to gauge the
results of model misspecification and that I thought it would result
in wide discrepancies -- and by this I mean discrepancies in the
rankings given schools.  My question about what are the random or fixed
effects stands:  the fact that the TVAAS uses a mixed-model methodology
which SAS now uses does not tell us whether the specific models used
in calculating TVAAS make sense in the real world.  I know SAS
has that model, and I have read the article in American Statistician which
Sanders co-authored and which is cited in the legislation as criteria for
acceptable statistical models.  The fact that Sanders' mixed-model
methodology is a good general approach to mixed models (where the
effects of some things are fixed and others are random, with the
possibility of interactions among the levels) is irrelevant to the
question of whether the model MEANS anything.
 
3.  Prof. Sanders' statistical expertise does not answer the policy-related
questions which I raised at the time -- or the additional ones I have
now.
 
Quickly re-reading my comments in September, I regret implying that
Sanders' expertise was in agronomy and not in statistics only.  I will
confirm that Sanders left me a message several days ago, and I have
sent an e-mail message explaining why I have not returned his call.
(I did so yesterday, before I saw his message.  The two may have passed
... rather, the fact that we did not see the other's message does not
imply anything except that the Internet is not instantaneous.)
 
(Yeesh, please mentally delete "only" from "in statistics only."  I forgot
to use the "/edit" command for this, and can't re-edit lines above.)
 
As mentioned above, a TVAAS staff person has pointed out to me a recent
article, and I will wait until I receive it and read it before I
comment further.  I have also recently become aware of Sanders'
presentation to the CREATE folks at Western Michigan State last year
and (according to one educator I know) of presentations about TVAAS
this summer.  Unfortunately, the CREATE gopher site at Western Michigan
is not yet up-and-running, and the campus' gopher does not have a
working faculty directory.  (CREATE is one of the federally-sponsored
research centers for education, and although I forget what the
acronym stands for, it is about assessment and evaluation.)
 

=========================================================================
Date:         Fri, 28 Oct 1994 12:02:53 MST
From:         Gene Glass 
 
Dear Professor Sanders:
   I like statistics; I made the better part of my living off of it for
 many years. But could we set it aside for just a minute while you answer
 a question or two for me?
   Although I have read little about the Tenn Value Added Assessment system,
 I gather that it is a means of measuring what it is that a particular
 teacher contributes to the basic skills learning of a class of students.
 Let me stipulate for the moment that for your sake all of the purely
 statistical considerations attendant to partialling out previous contributions
 of other teachers' "additions of value" to this year's teachers' addition
 of value have been resolved perfectly--above reproach; no statistician who
 understands mixed models, covariance adjustment and the like would
 question them. Let's just pretend that this is true.
   Now imagine--and it should be no strain on one's imagination to do so--
 that we have Teacher A and Teacher B and each has had the pretest (Sept)
 achievement status of their students impeccably measured. But A has a
 class with average IQ of 115 and B has a class of average IQ 90. Let's
 suppose that A and B teach to the very limit of their abilities all year
 long and that in the eyes of God, they are equally talented teachers.
 We would surely expect that A's students will achieve much more on the
 posttest (June) than B's. Anyone would assume so; indeed, we would be
 shocked if it were not so.
   Question: How does your system of measuring and adjusting and assigning
 numbers to teachers take these circumstances into account so that A and B
 emerge with equal "added value" ratings?
 

=========================================================================
Date:         Sat, 10 Dec 1994 20:55:23 CST

From: William Sanders
 
The process is built on the formulation of the late C.R.
Henderson, a Cornell animal breeder who was named a fellow in the
American Statistical Association for his pioneering work in this
area.  Henderson's development of the concept of best linear
unbiased prediction (BLUP), has been shown to be related to
other shrinkage estimator concepts by David Harville at Iowa
State and others (e.g., many Bayesian concepts, Kalman filtering
from the engineering sciences, some of the hierarchical linear
model concepts, etc.) .  However, we have found that Henderson's
formulations have tremendous computing advantages over some other
equivalent alternatives.  In fact, while I was a consultant to
SAS Institute, Inc., as they were planning and implementing their
rather new MIXED procedure, I strongly recommended that they use
this formulation in their development and manuals -- a
recommendation which was accepted.
 
     As we apply these approaches in the context of the
estimation of the teacher and school effects on the academic
growth of populations of students, we take advantage of the prior
knowledge of the distribution of the variance-covariance
structure among populations of teachers, as well as the
variance-covariance structure among students.  By so doing, we
solve the following problems:
           1. fractured student records
           2. teachers changing assignments
           3. modes of instruction (team teaching,
              departmental instruction, self contained classrooms)
           4. use of dissimilar indicator variables over time.
           5. combinations of different quantities and qualities
             of information
 
              etc., etc., etc., etc., .......
 
                                   .........
 
We do not calculate simple gains. For example, we use the whole
observation vector for each child over all subjects and grades.
In fact, we have a 'small' article later presented at an American
Statistical Association meeting that demonstrates how this
approach is superior to traditional multivariate approaches in
that the whole observational vector is not lost due to one
missing value for a variable.
                                  ..........
 
     {Some have worried} that "since the standardized tests are
taken at least one, in some years almost two, months before the
end of the year, gain scores conflate the effects of two
different teachers".  We have developed a process which we call
the 'stacked block concept' which enables the partitioning of
these effects.  This concept is not totally dissimilar to the
recovery of interblock information from incomplete block designs.
 
      We have done a tremendous amount of computer simulation to
evaluate just that concern, and we can demonstrate that this
process is very robust against rather severe misspecification.
                                  ..........
 
     We welcome legitimate and qualified criticism.  For the past
four years, we have developed and implemented this process for
the whole state of Tennessee.  Presently, the data base contains
over 3 million records longitudinally merged for the entire
state's student population.  To provide these estimates for all
schools (and next year all teachers grades 3-8 for the entire
state), the solution to tens of thousands of equations is
required. We are doing this computing on a dedicated RS-6000
workstation.
 
                                   Question:
      But is there any way that you or someone on your staff can
explain in some sort of intuitive way how the numbers you are
crunching have anything to do with children's learning, and
teachers' teaching -- and how you can know that for certain.  I
am not a statistician, but I do know something about the
relationship between math and the physical world, and I know that
just manipulating formulas will not always produce results that
reflect what occurs in the physical world.  How can you know that
what you are measuring is what you think you are measuring?
Especially since it seems unlikely we can directly test your
results to see whether you have "it" right; since there is no
"it" apart from your results that we can directly observe as
verification.
 
                       Answer and further introduction:
The Tennessee Value-Added Assessment system was developed on the
basis of three assumptions:  Fair and valid assessment is
necessary to program improvement; such assessment models can be
constructed; and it is reasonable to expect that children grow
academically at a rate commensurate with their peers if effective
instruction is taking place.  None of these assumptions is
original with us, but the way TVAAS addresses them is new.
 
Because it uses metrics that have long been suspect, at least in
part due to the fact that previous methodology was incapable of
rendering the necessary degree of fairness and validity, TVAAS is
likewise looked upon with suspicion.  That's to be expected.
However, TVAAS is not plagued by the flaws of previous models,
and it may be that a real paradigm shift is in order here.  This
is why:
 
1.  TVAAS utilizes the scaled gain scores students make over time
in order to model their learning patterns.  In this way, it is
possible to note when the normal pace of academic growth
deviates.  Obviously, this cannot be accomplished by figuring
simple gains for each child.  TVAAS uses the entire observational
vector for each child across time.  By using all the available
data, TVAAS estimates the effects of educational entities without
having to calculate the gain for individual kids.  The records of
students with fewer data points are weighted more lightly than
those of students with complete sets in the calculation of
educational effects, but the data of students with incomplete
sets are not ignored.
        The advantage of following growth over time is that the
child serves as his or her own "control."  Ability, race, and
many other factors that have been impossible to partition from
educational effects in the past are stable throughout the life of
the child.  By taking the child where we find her; by aggregating
the gains of cohorts of students in school systems, schools, and
classrooms; and by constructing covariance matrices among
teachers in a school, schools in a system, and systems in a
state, we can fairly and reliably attribute educational effects.
 
2.  The scaled gain scores are derived from the norm-referenced
items (the CTBS/4) of the Tennessee Comprehensive Assessment
Program (TCAP). TCAP is administered state wide to all students
in grades two through eight and ten.  Scores from science, math,
social studies, language arts, and reading furnish the data for
TVAAS.  The Education Improvement Act, the legislative package
which identifies TVAAS as the model for educational
accountability in Tennessee, mandates that it derive its data
from "fresh, non-redundant, equivalent tests," ensuring that
"teaching to the test" is minimized and perhaps eventually
extinguished due to lack of effectiveness.
        Of course, there is a very vocal group that excoriates
any use at all of standardized test data.  We do not contend that
the CTBS/4--or any other assessment tool, standardized or
otherwise--can accurately depict the totality of a child's
learning experience.  However, we believe that the information it
does provide is valuable.  On the other hand, if better
measures are found, TVAAS can easily utilize them in addition to
or instead of the CTBS/4, so long as they provide linear measures
with appropriate statistical properties.
 
3.  TVAAS does not prescribe any particular model for effective
instruction.  Teachers are free to teach as they see fit.  As
long as their students make appropriate gains, teachers can
assume that they are meeting the needs of their pupils,
regardless of the method they choose.
 
4.  TVAAS encourages appropriate instruction for all students.
We all know that students enter the classroom with various levels
of preparedness.  It is not expected that all students will
perform at the same level of competence or will achieve the same
outcomes from a year of instruction.  It is assumed, however,
that each child will achieve the normed gain for his or her age
and grade.  Four pilot studies and three years of state wide
reports have verified that this is a reasonable assumption since
gain has been shown to be unrelated to level of ability, racial
make-up of the student body, or socio-economic indicators such as
percentage of students who receive reduced-price and free
lunches.  What this means is that if each child is taught
according to his or her ability and level of preparedness, normal
gains are to be expected.  We think this attention to the needs
of the individual child is crucial. Schools, systems, and
teachers who do best under TVAAS are those who address the needs
of all their students.
 
5.  TVAAS can and does furnish far more than the sterile reports
schools are used to receiving as a result of standardized tests.
Each school in Tennessee will receive, beginning this year, a
report broken down by student achievement level, indicating the
gains of students in four to five divisions, ranging from low to
high.  Average gains for each achievement level in each tested
subject and grade will be provided. This allows schools to
pinpoint areas in which they are doing well and areas that need
attention.  For instance, a pattern of high gains for low
achievers and low gains for high achievers might suggest that a
school was doing well by its slower students but might not be
offering enough enrichment or accelerated classes for its high
achievers.  Schools can also assess their programs in specific
subject areas and grades.  All of this is essential to effective
program improvement.
 
6.  An outgrowth of TVAAS is an enormous data base of educational
information.  Currently, there are over three million pieces of
merged data in the data base.  This unique resource allows for an
unprecedented perspective on educational phenomena.  It is
already yielding significant findings and promises to continue to
enlighten research on a tremendous range of educational issues
for many years to come.  The University of Tennessee Value-Added
Research and Assessment Center hopes to form research
collaboratives to investigate the findings deriving from the
TVAAS data base as well as educational research questions
originating from other sources.
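
{Editor's sketch, under stated assumptions, of what the "own control"
claim in point 1 and the ability claim in point 4 depend on.  If a
stable student factor shifts only the level of scores, it cancels out
of a gain; if it also shifts the rate of growth, it does not.  The
two classes, the growth rates, and the nine-month year are
hypothetical, and the sketch does not say which case holds in the
Tennessee data.}

```python
def june_score(september, months, monthly_growth):
    """Linear growth over the school year: a deliberately simple assumption."""
    return september + months * monthly_growth


# Two hypothetical classes taught by equally effective teachers.
sept_a, sept_b = 520.0, 480.0
months = 9

# Case 1: the stable factor affects only the starting level.
# Both classes grow at the same rate, so the level difference cancels.
gain_a_level = june_score(sept_a, months, 5.0) - sept_a
gain_b_level = june_score(sept_b, months, 5.0) - sept_b

# Case 2: the stable factor also affects the growth rate.
# The gains now differ even though the teachers are equally effective.
gain_a_rate = june_score(sept_a, months, 6.0) - sept_a
gain_b_rate = june_score(sept_b, months, 4.0) - sept_b

print(gain_a_level, gain_b_level, gain_a_rate, gain_b_rate)
```

{In the first case the two classes show identical gains, as the "own
control" argument expects.  In the second case they do not, which is
the situation raised in the Teacher A and Teacher B question elsewhere
in this log.  The statement in point 4 that gain has been found
unrelated to ability is an empirical claim about which case applies.}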
 
{We wish to} encourage you to think about collaborative research
projects.  I don't think I'm stretching it when I say that the
TVAAS data base is an educational treasure, totally unprecedented
in educational research.  Consider the fact that we are talking
about a state wide population of students and teachers--not a
sample.  We want it to be used.
 
                                   Question:
    You talked about the question I sent, but did not quite
answer it, however.  Remember I am not a statistician, so it may
be difficult for you to answer what I am asking, but it is the
question of how you know that your method of analysis really does
isolate and "attribute educational effects."  For example,
suppose that second grade teachers teach material in such a way
that second graders seem to make normal gains, but that in
actuality the second grade teacher is teaching "surface tricks"
or algorithms that make it difficult for a child, in third or
fourth grade, to learn the material there because they were not
given an important kind of understanding then.  Would TVAAS
likely point to second grade as the problem?  Also, I can see
that TVAAS can give EVIDENCE for problem educational areas; do
you claim it is CONCLUSIVE evidence or just an indicator that
some specific investigation may need to be done in what seems to
be a problem place?
 
And, from what you said, it makes it seem (correct me if I am
wrong) that teachers are not held accountable for making the
kinds of gains with culturally deprived kids that teachers would
make in culturally advantaged areas -- unless some teachers in
culturally deprived, similar areas also made much better gains.
Did I understand this correctly?  What I am getting at here is
how the assessment is used with regard to evaluating, rewarding,
punishing, or weeding out teachers, or improving teaching methods,
etc.
 
      On re-reading your post about TVAAS I noticed one particular
sentence I want to ask you about, the one that says teachers,
schools and systems who do best under TVAAS are those that
address the needs of all their students.  This is the kind of
thing I was asking about --independent verification that the
people TVAAS says are best ARE best.  What led you to know that
this correlation you mention exists?  And would not that
independent kind of verification work in place of, or better
than, TVAAS? What I am getting at is how do you know that the
teachers your methods say are doing the best ARE the teachers who
are doing the best?
 
                        Further question (Gene Glass):
      I gather that TVAAS is a means of measuring what it is that
a particular teacher contributes to the basic skills learning of
a class of students.  Let me stipulate for the moment that for
your sake all of the purely statistical considerations attendant
to partialling out previous contributions of other teachers'
"additions of value" to this year's teachers' addition of value
have been resolved perfectly--above reproach; no statistician who
understands mixed models, covariance adjustment and the like
would question them. Let's just pretend that this is true.
   Now imagine--and it should be no strain on one's imagination
to do so--that we have Teacher A and Teacher B and each has had
the pretest (Sept) achievement status of their students
impeccably measured. But A has a class with average IQ of 115 and
B has a class of average IQ 90. Let's suppose that A and B teach
to the very limit of their abilities all year long and that in
the eyes of God, they are equally talented teachers. We would
surely expect that A's students will achieve much more on the
posttest (June) than B's. Anyone would assume so; indeed, we
would be shocked if it were not so.
   So the question is: How does your system of measuring and
adjusting and assigning numbers to teachers take these
circumstances into account so that A and B emerge with equal
"added value" ratings?
 
 
                                    Answer:
>     You talked about the question I have, but did not quite answer it,
>however.  Remember I am not a statistician, so it may be difficult for you to
>answer what I am asking, but it is the question of how you know that your
>method of analysis really does isolate and "attribute educational effects."
>For example, suppose that second grade teachers teach material in such a way
>that second graders seem to make normal gains, but that in actuality the
>second grade teacher is teaching "surface tricks" or algorithms that make it
>difficult for a child, in third or fourth grade, to learn the material there
>because they were not given an important kind of understanding then.  Would
>TVAAS likely point to second grade as the problem?
 
More than likely.  Although it is true that there are algorithms that
improve test taking skills to some degree, large gains over all subject
areas attributable to such "tricks" have not been documented.  As to
whether we can isolate teacher effects, the answer is yes.  I am not a
statistician, either.  I am a theoretician who interprets TVAAS to
non-statisticians.  Therefore, I will tell you what this model does
without telling you how.  If you want to know how, re-ask and I will
restate.
        TVAAS, in determining teacher effects, the first reports of which will
be issued in 1995, uses the gains of at least three years of a teacher's
students as data.  This information is entered in a covariance structure
that includes the performance of these students under their previous and
subsequent teachers, as well.  Since we can follow students over time, we
can examine whether one teacher is injecting an artificial high.  If a
teacher's students make tremendous gains above those achieved under their
previous teachers and then "fall off the table" the year after, we may suspect
that something bogus is happening.  The software is engineered to call
such data to our attention.  You have to understand that we are talking
about whole classes of students for three or more consecutive years and
that, usually, these students are dispersed to many subsequent teachers,
not just one who is offering substandard instruction.  At this point you
may be getting your first glimpse at how revolutionary TVAAS is.  These
data are computed simultaneously for all students in grades two through
eight in Tennessee, along with the covariance matrices for teacher,
school, and system effects.  As long as a child takes TCAP, we can follow
him or her from school to school, system to system, and teacher to
teacher.
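 
The check described above (a class gains far more than it did under previous
teachers and then "falls off the table" the following year) can be sketched
in a few lines. This is only an illustration, not the TVAAS software: the
function name, the threshold, and the gain values are all hypothetical, and
the real system works through a mixed-model covariance structure rather than
simple class means.

```python
from statistics import mean

def flag_artificial_high(prior_gains, current_gains, next_gains, threshold=10.0):
    """Return True when a class's mean gain jumps by more than `threshold`
    over the prior year and then falls by more than `threshold` the year after.
    The threshold is a hypothetical value chosen for illustration."""
    jump = mean(current_gains) - mean(prior_gains)
    drop = mean(current_gains) - mean(next_gains)
    return jump > threshold and drop > threshold

# Hypothetical scale-score gains for the same students in three consecutive years.
prior = [14, 16, 15, 13]       # under their previous teachers
current = [30, 28, 33, 29]     # under the teacher being examined
following = [5, 7, 4, 6]       # under their subsequent teachers

suspicious = flag_artificial_high(prior, current, following)
```

A flag of this kind would only call the data to someone's attention; it says
nothing by itself about why the gains rose and fell.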
 
> Also, I can see that TVAAS can give EVIDENCE for problem educational areas;
> do you claim it is CONCLUSIVE evidence or just an indicator that some
>specific investigation may need to be done in what seems to be a problem
>place?
 
Excellent question, and the answer is that TVAAS supplies indicators that
something is going well or ill in specific areas, at least in so far as
certain subject-specific achievement in specific grades.  TVAAS furnishes
this information to schools, systems, and, next year, to teachers for
grades 3-8 in science, social studies, math, language arts, and reading.
 
> And, from what you said, it makes it seem (correct me if I am wrong) that
> teachers are not held accountable for making the kinds of gains with
> culturally deprived kids that teachers would make in culturally advantaged
> areas -- unless some teachers in culturally deprived, similar areas also
> made much better gains.  Did I understand this correctly?
 
Absolutely not.  As I told you, where a kid starts does not influence the
expected gain for that child.  Deal is, all those schools in poorer areas
where the students enter disadvantaged can look good for once if they
achieve normal--not to mention extraordinary--gains.  This is really
important.  Some of our inner city schools are, for the first time, able
to demonstrate that they are bringing their kids along much faster than
our suburban showcases.  In the past, because scale scores labelled these
kids "substandard," no matter how much progress they made, they always
compared poorly to kids from enriched environments who started out miles
ahead of them.  Now, that's not an educational effect.  That's
environmental.  If we're trying to see what schools are doing, we have to
look at growth.  Some of our higher scoring schools and systems are
finding out that, unless their students make gains, even if they scale
well above the norm, they don't do well under TVAAS because they're
failing their kids by letting them coast.  Our best systems achieve
appropriate gains even in their very top scorers.  TVAAS was developed to
ensure that EVERY child, regardless of ability, achieved academic gains
normal for their peer group.
 
>  What I am getting at here is how the assessment is used with regard to
>evaluating rewarding/punishing/weeding out teachers or improving teaching
>methods, etc.
 
Hmmm.  Well, there's a complex question.  In Tennessee, there are state
and local evaluation models for performance evaluation of teachers.
Those involve a combination of principals, supervisors, and state
evaluators, depending on which cycle it is and what the teacher
requests.  We also have a Career Ladder which is a series of performance
evaluations, dialogs augmented with "evidence" which is something akin to
a portfolio, professional development initiatives, and leadership surveys
as well as principal surveys.
        The EIA (Education Improvement Act) states that TVAAS cannot be
the sole reason for dismissal of a teacher, and, while school and system
reports are public record, teacher reports are not and are provided only
to the teacher and "appropriate administrators," under the law.
        So far, and for as far as we can see, TVAAS is not part of a
teacher's formal evaluation.  It is, instead, a tool for self- and
program evaluation.
 
You previously asked how we knew whether the teachers we were evaluating as
best really were the best.  Well, once again, the first teacher reports don't
come out until next year.  However, in pilot studies done in the early
eighties, when principals were asked to predict whether their teachers would
be in the top, middle or bottom third as assessed by the prototype of TVAAS,
they were able to predict the bottom third of teachers in all subjects; there
was good discrimination between the top and average group of math teachers;
but they had no clue as to who their top and average language arts teachers
were.  That's very anecdotal, to be sure.  Interesting, but anecdotal.  The
reason I mention it is to say that TVAAS is an indicator.  Professional
knowledge, performance assessment, TVAAS--all of these are ways of knowing,
and although they are correlated, I contend that their emphases are different.
Performance assessment concentrates upon process.  Professional knowledge of a
teacher by an administrator may also be process of a different sort.  TVAAS is
product oriented.  We look at whether the child learns--not at everything s/he
learns, but at a portion that is assessed along the articulated curriculum, a
portion each parent is entitled to expect an adequately instructed child will
learn in the course of a year.  Of course a child will learn more, and we can
use that information in TVAAS, so long as it furnishes linear metrics with
appropriate statistical properties (I've said that before).  Right now, we've
merged the data for all the kids that have taken ACT and/or PLAN in the last 3
years and are about to enter data on the first year of state wide writing
assessment in grades 4, 8, and 10.  We're also in the process of developing
subject-specific high school tests which have to be on-line by 1999.  We're
going to be examining the relationships among all of these data to determine
whether some offer information not available from others, what the
intercorrelations might be, and, of course, we'll be on the look out for the
unexpected, too.  Wanna help?
 
                           Response to this answer:
 
I did want to comment quite favorably on one of the things you said TVAAS
does --pointing out schools and districts that are "coasting" on what I
would call culturally advantaged students, that is, those districts who
have "good" students to work with because of their home environments, etc.
and who don't add as much to their education as they could, or develop the
potential those students have.  I am in an area where three of the perennially
top four school districts (in the state) based on SAT etc. scores do not, I
believe, do much to develop their students' potentials or cultural advantage
nearly as much as they should and could.  I have argued for nearly twenty
years that the school district that finishes in one of the top five each year,
but has a much "lower" socio-economic mix of students with much less
culturally advantaged background is probably the best school district of the
bunch because it does more with the students it gets.  So I was really excited
to see that is something you seek as well.
     Rick Garlikov (dems042@uabdpo.dpo.uab.edu)
=========================================================================
=========================================================================
Date:         Mon, 12 Dec 1994 13:11:16 CST
 
Question from Sherman Dorn:
 
First, I want to divide TVAAS as a research tool from TVAAS as
an evaluation tool.  There is no doubt that a database with several
million records is an invaluable resource.  In addition, I
can't see anything inherently wrong with the mixed-model methodology
as described, and quite a bit good.  As a tool of policy,
I have some additional qualms about the effects of a high-stakes
evaluation instrument as the data and law currently exist.
But for now I'm only going to talk about TVAAS as research and
statistics.  Besides, I hold the state legislature responsible for
policy.  So, my thoughts:
 
1)  I think there are several good reasons to enter the prior year's
scale score as a covariate with the current year's score as the
dependent (okay, the matrix of scores, with the matrix of scores from
the prior years as covariates).  First, as Gene Glass mentioned on
EDPOLYAN's mailing list, a good bit of research suggests that
kids who do well on an omnibus test like the CTBS would expect to have
higher gains.  Second, the cross-sectional nature of norming on the
CTBS, at least for the data as it currently is in TVAAS, makes gains
in scale scores ambiguous in meaning:  (a) gains between grades in
1989 may conflate gains one might expect with differences in cohort
experiences; (b) the norming in 1989 did not represent what one would
expect from truly longitudinal cohorts -- due to differences in
populations, as well as grade retention, for example, the grade 3
norming population is not the grade 2 norming population aged a year.
The two are incomparable.  Both of these issues change the scale
scores in a linear fashion, so entering the prior years' scores as
covariates solves the problem.  (Solving Gene Glass' conundrum means
that one assumes a linear relationship between first set of scores and
second set of scores, but that's much more tenable than assuming an
expected gain that's constant across the distribution of first sets of
scores.)
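 
A small sketch of the contrast drawn in point 1, using made-up numbers. It
assumes, purely for illustration, that current scores are an exact linear
function of prior scores with a slope above 1, so that higher scorers gain
more. Simple gain scores then rise with the prior score, while the residuals
from regressing current on prior scores do not. The data and the helper
function are hypothetical and are not taken from TVAAS.

```python
from statistics import mean

def ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical prior-year and current-year scale scores for six students,
# constructed so that current = 1.2 * prior - 98.
prior = [600, 620, 640, 660, 680, 700]
current = [622, 646, 670, 694, 718, 742]

# Simple gain scores: these grow with the prior score.
gains = [c - p for p, c in zip(prior, current)]

# Covariate adjustment: regress current on prior and look at what is left over.
intercept, slope = ols(prior, current)
residuals = [c - (intercept + slope * p) for p, c in zip(prior, current)]
```

Under a constant expected gain, the highest scorer here would look 20 points
better than the lowest scorer; after the linear adjustment, no student
deviates from expectation.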
 
2)  There are a few potential problems I can identify with the
procedure TVAAS uses to include information from kids who are not
completely observed (e.g., not taking the language arts test in
fourth grade).  From what I gather -- and correct me if I'm wrong --
the gain score in an unobserved domain is imputed from a combination
of information about the location of the child (system, school, and
teacher) and information taken from the child's scores in observed
areas (i.e., the deviation from the gain one would expect for a
child in that particular system, school, and teacher), and then the
score is given a lower weighting in the algorithm.  First, I
assume that this is multiple rather than single imputation.
Beyond that, it appears that you're assuming that observation of
a single score is random (and thus it is reasonable to assume that
a child's deviation on an unobserved test score would be the same
as on the child's observed test scores).  Have you tried other
models -- specifically, that school officials may try to steer
kids into being absent on days when the test they're weakest on
is being conducted?  For kids in special education, that
bias is almost certainly true and is explicit in the ability of school
officials to exclude kids with disabilities from tests.  Finally, I'm
concerned about the implications of weighting kids' scores differently -- this
might end up making the kids whom school officials expect to do well
more important in the TVAAS system, and thus give teachers an incentive
to teach to kids who are already doing well.  If the
weights are something like .95 versus 1.00, that's pretty minor,
but more severe weighting might give some perverse incentives.
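 
To make the weighting concern in point 2 concrete, here is a minimal sketch
with invented numbers. It assumes a class mean computed as a weighted average
in which an imputed gain receives a smaller weight than an observed one. The
weights .95 and .25 and the gain values are hypothetical, and this is not a
description of the actual TVAAS algorithm.

```python
def weighted_mean(values, weights):
    """Weighted average of `values` using the given `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical class: three observed gains and one low imputed gain
# for a child who missed the test.
gains = [20.0, 22.0, 18.0, 4.0]
observed = [True, True, True, False]

def class_mean(imputed_weight):
    """Class mean gain when imputed scores get `imputed_weight` (observed get 1.0)."""
    weights = [1.0 if o else imputed_weight for o in observed]
    return weighted_mean(gains, weights)

full = class_mean(1.0)     # imputed score counts like any other
mild = class_mean(0.95)    # imputed score counts almost fully
severe = class_mean(0.25)  # imputed score is heavily discounted
```

With the mild weight the class mean barely moves from the fully weighted
value; with the severe weight the absent child's low gain nearly disappears
from the average, which is the kind of incentive the paragraph above worries
about.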
 
3)  I would hope that the TVAAS staff would urge the state legislature
to revise the state code regarding program and personnel evaluation
to eliminate the exclusion of students with disabilities from the
core of the system.  I've already heard reports of this creating
some perverse incentives for schools to ignore the needs of students
with disabilities, and it's part of why I asked about retained
students.  TVAAS staff are not responsible for state policy, but
they can go a long way toward convincing legislators of the need
to revise the statute so individuals with disabilities are not
ignored in program and personnel evaluation.
 
The first two items are really the gist of the matter, and I'm curious
as to your thoughts about them.
 
******************************************************
 
Answer from TVAAS (Tennessee Value-added Assessment System)
 
I don't have time to go into all your points right now, but I would like to
point out, as I did in my last message, that special ed kids are
included in the assessment of schools and systems.  We could easily
include them in teacher assessment, as well, but are precluded by law
from doing so.
 
A point that is very salient here:  all of our research has shown that
Dr. Glass, if he indeed attributes higher gains to higher scorers, is
mistaken.  Gain is not predicted by achievement level.  Low scoring
students are as likely as high scoring ones to achieve normal gains,
above-normal gains, and below-normal gains.  Eliminating potential low
scorers is ineffective in manipulating gain scores for cohorts of
students.  That's one of the good things about TVAAS--your low scorers
must be attended to AND ALSO you must provide challenging materials for your
high scoring students.
 
Please remember, too, that we do not merely compare one year's scores
with another.  The scores kids make are modelled into a description of
what learning is normal for each child over time.  Although we don't
model the actual learning curve for individuals, it is a good metaphor
for what happens deep within the calculations.  If we have, say, ninety kids
who drop from their normal learning curve in various years but under the same
teacher or, conversely, if they rise under a certain teacher and gain from
there the next year, we know something about that teacher.  (If they rise and
then show very low gains the next year, we know something else about the
teacher in question.)  As you know, in Tennessee schools, it is highly
unusual, except in very small schools, for students to remain together from
grade to grade--they tend to be dispersed among teachers. Therefore, within
the model, we must also take into consideration the
interactions among students, teachers, schools, and systems over time,
all calculated simultaneously.
 
Further, I must question how much "strategic cheating" goes on,
especially to the point of asking students to stay home from the test.
We are able to spot such wholesale manipulation of the testing situation as
well as many other forms of cheating, should they occur.  The law
specifies immediate and harsh penalties for those engaging in such
practices.  If anything, the EIA has cut down on such practices, and it is
very rare that such anomalies occur according to our best knowledge
and that of the SDOE's Department of Accountability.
 
After you read this stuff, would you please reask the questions I didn't get
to?  I'll be right here.
**************************************************
 
Sherman Dorn:
      If the data on Tennessee kids shows equivalent expected gains for
low-performing and higher-performing kids, that would be very interesting --
and important -- research.  (I still prefer using prior scores as
covariates, because of the non-equivalence of norming populations
in different grades -- unless it makes no difference in the order of
estimated effects.)
 
That finding may not hold, however, for kids in special education.
Because their participation in TCAPs is at local discretion, you may
be getting a selection effect.  (You may not be getting a selection
bias; you just don't know without 100% participation.)  If some
system(s) mandates participation for kids labeled LD, MR, or SED,
that would be a possible way of examining the issue (assuming the
system has some way of accommodating their needs appropriately and
that accommodation is held fairly constant across the years for a
specific child). (LD=learning disabled, MR=mental retardation,
SED=severe emotional disorder.)
 
Besides, my concern is less with conscious manipulation than the
fact that high-stakes testing can feed people's prejudices and
lead to perverse incentives regarding students with disabilities.
Because of the local decisions about testing participation for individuals
with disabilities, teachers and building principals may write off some kids in
*hopes* of improving test scores for others.  (The hope doesn't even have to
be rational.)  Assistant Commissioner Cannon acknowledged this
week in a public forum that he's heard stories of this type from across the
state; I've heard the same thing locally from special ed teachers.
It's very disconcerting when in a group of kids you know, the vast
majority were missing at least one TCAP score in reading or math -- when this
test is the major instrument of personnel evaluation.  Without
their mandatory participation in *some* form of assessment, their
needs will come after those of nondisabled students; it's just the
incentives their exclusion (either mandatory or discretionary) creates.
 
But, then again, this issue of incentives is a policy matter.
******************************************************
 
TVAAS:
Whether special ed students are required to take the test is something
over which we have no control.  As I said before, their gains are, by
law, a factor in the determination of school and school system
effectiveness, so it is certainly in the best interest of principals and
superintendents to meet the needs of these students.
 
I can do nothing whatever about superstition and wrong-headedness of
individuals.  I can only hope, and with good reason, I think, that
systems such as TVAAS will provide an incentive for real improvement for all
children.  That is the basis upon which it was created.
 
Mr. Cannon may have "heard of" such instances.  We have all "heard" of a lot
of things, some of which have basis in reality and some of which
don't.  Document such instances and they will be examined by the
Department of School Accountability.  We want such things stopped as much as
you do.
 
As for finding gains unrelated to level of achievement, yes, that is an
interesting finding, isn't it?  And our n's are huge.  We invite and
encourage use of the TVAAS database for educational research.  Perhaps
you have a point you would like to examine?
***************************************************************
 
(The following message also came to me off-list from someone else, and it
relates to this matter.  I have asked TVAAS for clarification, since I am
confused about what the law or policy is regarding special ed students'
taking, or being exempted from taking, the test that TVAAS uses to make its
evaluations. Rick Garlikov)
 
I read your posting about TVAAS with great interest and have followed up to
get some information about students with disabilities in the system.  I
learned the following.
 
The Reform Act provided that *all* students with disabilities were to be
exempted from testing in the TVAAS.  That has been changed to allow local
option as to whether or not they are included.  There are no consistent
criteria applied to make that decision - it is up to the local school and the
assessment team.  (The TN special education program includes the
gifted and they are not exempted from testing under the Act).
 
If students with disabilities are included in the testing, there will be 2
different reports issued - one with their scores included and one without.
 
I understand the TVAAS is causing a lot of consternation especially among
teachers.  As I understand it, the first year of the system, the *Report Card*
was issued only for the school systems and in year two, it included data for
each school in each system.  This is now year three and the data will be
specific to each teacher in the system as well at the end of this year.
 
I look forward to finding out more about this topic in your future
postings.
***************************************************
 
Rick Garlikov (dems042@uabdpo.dpo.uab.edu)
=========================================================================
=========================================================================
Date:         Tue, 13 Dec 1994 06:49:48 -0600
From:         SHERMAN DORN 
 
I can clarify somewhat the situation regarding students with disabilities.
The 1992 law establishing TVAAS says nothing about students with
disabilities except in the section on "estimates of teacher effects" --
i.e., reports on individual teachers.  *There* the legislature excluded
the scores explicitly from the calculation of teacher effects.
 
According to the Tennessee Value-Added and Research Center, they
include any tests from individuals with disabilities in the calculation of
school and school system effects.  That is, if a person took a test.
However, it is up to the discretion of local officials as to whether
individual students with disabilities will take the battery of
standardized tests that are the basis of the Tennessee Value-Added
Assessment System.  Tennessee only has a limited number of
accommodations possible in the administration of the TCAPs.  In part
because of those limited accommodations, in part because the TCAPs are
a VERY LONG battery of tests (schools only administer one part per
day), in part because of real or imagined fears on the part of
local officials, many *many* students with disabilities either do
not take the tests at all or only take a few sections.  (I could
call the Division of Accountability to ask about the actual numbers,
though it may take a while for me to get around to it during the
workday.)
 
In addition, as I wrote to the Tennessee Value-Added and Research
Center, the way they explain their imputation procedure it looks as
though they assume all missing scores are missing at random, and then
they weight the missing scores less than observed scores.
I think that a random assumption is poor for students with
disabilities, and that (depending on the weighting factors) lower
weights for the missing scores *still* gives incentives for
local officials to exclude individuals from some tests.
 
(By the way, there is a group which does research precisely on the
inclusion of students with disabilities in assessment systems.
It's the National Center for Educational Outcomes, located in the
Dept of Special Education at the University of Minnesota.  They have
a slew of reports available through ERIC.)
 
Sherman Dorn
=========================================================================

Date:         Tue, 13 Dec 1994 20:08:35 CST
From:         Rick Garlikov 
*************************************************************************
 
TVAAS:
As for who gets tested among special ed students, I really don't know,
and I'm going to have to seek that information from State Department
sources.  As I've said before, the EIA says that special ed students are
excluded from the TVAAS assessment of individual teachers, but their
scores are included in the assessment of schools and school systems.  I
checked with John Schneider who works with the raw data, and he says that the
data we receive distinguishes special ed students only by the time
they spend in special ed classes per week.  These students fall into one of
four groups depending upon the amount of time they spend in special
ed., but we don't use that data because whether they are included or not is a
question of "any" or "none," not "to what degree."   Therefore, gifted are
included or excluded along with the rest of special ed kids since
they are not identified as gifted on the data collection sheets; and all
special ed, regardless of the hours, are treated in the same manner.
=========================================================================
 
 
Date:         Thu, 15 Dec 1994 19:32:19 CST
From:         Rick Garlikov 
Subject:      Re: TVAAS #3
 
      From TVAAS:
 
I need to answer a couple of points I haven't addressed in the replies
below.
 
On Tue, 13 Dec 1994, Rick Garlikov wrote:
> *************************************************************************
>      Regarding TVAAS and Students with disabilities:
> From Sherman Dorn:
> I can clarify somewhat the situation regarding students with disabilities. The
> 1992 law establishing TVAAS says nothing about students with
> disabilities except in the section on "estimates of teacher effects" --
> i.e., reports on individual teachers.  *There* the legislature excluded the
> scores explicitly from the calculation of teacher effects.
>
> According to the Tennessee Value-Added and Research Center, they
> include any tests from individuals with disabilities in the calculation of
> school and school system effects.  That is, if a person took a test.
> However, it is up to the discretion of local officials as to whether
> individual students with disabilities will take the battery of
> standardized tests that are the basis of the Tennessee Value-Added
> Assessment System.  Tennessee only has a limited number of
> accommodations possible in the administration of the TCAPs.  In part
> because of those limited accommodations, in part because the TCAPs are
> a VERY LONG battery of tests (schools only administer one part per
> day), in part because of real or imagined fears on the part of
> local officials, many *many* students with disabilities either do
> not take the tests at all or only take a few sections.  (I could
> call the Division of Accountability to ask about the actual numbers,
> though it may take a while for me to get around to it during the
> workday.)
 
I spoke to State Testing today about the administration of TCAP to
special ed students.  They said that whether a special ed student is
required to take the test is not based on the decision of "officials," as
such.  In other words, special ed testing is not decided on a group
basis.  Instead, whether or not a special ed student is tested is a
decision based on the recommendations of that student's M-team, a group
composed of the student, his or her parent or guardian, the student's
teachers (some or all), guidance and/or social/psychological support
personnel, and, sometimes, an administrator.  This team also decides
whether any modifications in the testing need to be implemented such as
individual administration, provision of large-type editions, provision of
a reader, etc.
> In addition, as I wrote to the Tennessee Value-Added and Research
> Center, the way they explain their imputation procedure it looks as
> though they assume all missing scores are missing at random, and then
> they weight the missing scores less than observed scores.
> I think that a random assumption is poor for students with
> disabilities, and that (depending on the weighting factors) lower
> weights for the missing scores *still* gives incentives for
> local officials to exclude individuals from some tests.
 
Not at all.  If a student is missing all scores, that student
doesn't appear in the computations.  If we have some data on a student
and he or she does not take the test for some reason in some year, we do
not assume a "generic" score.  The missing data is specific to that child
and is a projection based upon past performance.  Furthermore, as
subsequent data are collected, the "missing" score is modified to be a
better estimation of what *that* student would have scored and all
computations incorporating that score are also modified to reflect the
new estimate.
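 
The idea of a child-specific projection that is revised as later data arrive
can be illustrated with a deliberately simple stand-in. The rule below is an
assumption made for this sketch and is not the TVAAS procedure: with no later
score, it adds the child's mean past yearly gain to the last observed score;
once a later score is known, it re-estimates the gap as the midpoint of its
two neighbours. The function name and all scores are hypothetical.

```python
from statistics import mean

def estimate_missing(before, after=None):
    """Estimate one missing yearly score for a single child.

    before: the child's observed scores in earlier years (at least two).
    after:  the score observed the year after the gap, if available yet.
    """
    last = before[-1]
    if after is not None:
        # Later data are in: revise the estimate using both neighbours.
        return (last + after) / 2
    # No later data yet: project forward from this child's own past gains.
    past_gains = [b - a for a, b in zip(before, before[1:])]
    return last + mean(past_gains)

history = [610, 630, 652]                        # hypothetical grades 3-5
first_guess = estimate_missing(history)          # grade 6 test was missed
revised = estimate_missing(history, after=700)   # grade 7 score now known
```

The point of the sketch is only that the estimate belongs to one child's own
record and changes when new data on that child come in.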
=========================================================================
Date:         Fri, 16 Dec 1994 06:56:25 -0600
From:         SHERMAN DORN 
 
Rick Garlikov asks:
 
>Why does TVAAS use (or bother to make) projected scores at all?  Why not
>just say something like "With 91% of the students taking the exam, the
>scores are ....."  What is the purpose of projected scores?  And don't
>projected scores defeat the purpose of taking the exam?
 
There are two general reasons why one would impute scores for any
statistical purpose.  (Impute means to plug in numbers for the missing
values.)  One is technical:  you are underestimating the variance of
a parameter if you only rely on complete records.  The other
is common-sense:  using only complete-data records does not give
the whole picture.  You may be getting a biased picture by only
including records of individuals with all tests.
 
As I understand it, the proper way to do imputation is to decide on
a few a priori models of censoring (or the pattern in which data
is missing) and then test each of those models through a process
called multiple imputation.  I'm going to ignore multiple
imputation and only talk about the models.
 
Some models of missing values are essentially random:  in the case of
test scores, you may reasonably assume that some kids are going to be
sick on the day of the test, and that such illnesses will be distributed
randomly.  On the other hand, some patterns are *not* random, and that
is what I believe to be the case with individuals with disabilities.
I am fairly sure that students in special education are much less
likely to take a subject test (even with accommodations) if they
are performing very low in that subject.
 
Now, what are the models that TVAAS uses?  According to the article
by Sanders and Horn in the Journal of Personnel Evaluation in Education
(1994), the model for all student scores is partitioned into school
system effect, school effect, teacher effect, and child deviation
from the other effects (and maybe an error term; I'm at home while
I type this).  As I read the article, the imputation model plugs in
a score for a particular subject that is a combination of school system
effect, school effect, and teacher effect for *that* subject and
child deviation *on other subjects and other years* (i.e.,
observable data).  That seems to me to be a random model of censoring;
it's fine if the child's deviation is randomly distributed across
subject areas; it is also fine if missing values are randomly distributed
(and thus you wouldn't expect the deviation for any missing value to be
higher or lower than deviations for observed values).
 
However, if my supposition is right, it is a poor model of censoring for
individuals with disabilities, because the child's deviation is likely
to be nonrandomly distributed, and you should expect much lower values for
deviations in the missing scores.  As I understand it, the TVAAS
imputation model could be overestimating the scores for individuals
with disabilities.
 
Sherman Dorn
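
[Editor's sketch of the censoring argument above.  This is not the TVAAS
model: the score model, the noise levels, and the missingness rates are all
invented for illustration.  It fills in missing scores from a regression on
another observed score and compares two cases: scores missing at random, and
scores missing mostly when the child did worse in this subject than the
other score would suggest.]

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def imputation_bias(rows, is_missing):
    """Mean of (imputed - true) over the rows treated as missing."""
    observed = [r for r, m in zip(rows, is_missing) if not m]
    missing = [r for r, m in zip(rows, is_missing) if m]
    slope, intercept = fit_line([p for p, _ in observed],
                                [c for _, c in observed])
    return statistics.fmean([intercept + slope * p - c for p, c in missing])


def simulate(n=20000, seed=0):
    """Return (bias when missing at random, bias when low scores go missing)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)
        prior = ability + rng.gauss(0.0, 0.5)    # score that is always observed
        current = ability + rng.gauss(0.0, 0.5)  # score that may go missing
        rows.append((prior, current))
    # Case 1: 20% of scores missing, unrelated to anything (hypothetical rate).
    random_missing = [rng.random() < 0.2 for _ in rows]
    # Case 2: a score is far more likely to be missing when the child is doing
    # poorly in this subject relative to the other score (hypothetical rates).
    low_missing = [rng.random() < (0.7 if c - p < -0.5 else 0.05)
                   for p, c in rows]
    return imputation_bias(rows, random_missing), imputation_bias(rows, low_missing)


bias_random, bias_low = simulate()
print(f"missing at random:       mean (imputed - true) = {bias_random:+.3f}")
print(f"missing when doing badly: mean (imputed - true) = {bias_low:+.3f}")
```

[In the first case the filled-in scores are about right on average.  In the
second they run high, which is the overestimation described above.  How large
the effect would be in real data depends on how strongly omission is tied to
low performance, which this sketch simply assumes.]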
=========================================================================
Date:         Fri, 16 Dec 1994 07:41:22 -0600
From:         SHERMAN DORN 
 
The Tennessee Value-Added and Research Center writes:
 
>I spoke to State Testing today about the administration of TCAP to
>special ed students.  They said that whether a special ed student is
>required to take the test is not based on the decision of "officials," as
>such.  In other words, special ed testing is not decided on a group
>basis.  Instead, whether or not a special ed student is tested is a
>decision based on the recommendations of that student's M-team, ...
 
I stand corrected.  I suspect that in some cases, while an
individualized education plan (or IEP) decided by an M-team may be
the document that *should* govern the taking of tests, parents and
guardians may be pressured to agree to certain testing omissions.
(IEP meetings are well-known in special education to be frequently
concerned with compliance with federal law and less with good
programming.)  Would it be possible for the Tennessee Value-Added
and Research Center to flag schools or school systems where
the pattern of test score omissions seems particularly out of line
(and this needs to be done by hours of service)?  That would be
a valuable service.
 
>> I think that a random assumption is poor for students with
>> disabilities, and that (depending on the weighting factors) lower
>> weights for the missing scores *still* gives incentives for
>> local officials to exclude individuals from some tests.
 
>Not at all.  If a student is missing all scores, that student
>doesn't appear in the computations.  If we have some data on a student
>and he or she does not take the test for some reason in some year, we do
>not assume a "generic" score.  The missing data is specific to that child
>and is a projection based upon past performance.
 
As I describe in my response to Rick, the notion of random imputation
is not isolated to plugging in the mean value.  Technically, I believe
that TVAAS uses an "ignorable response model," which I think is
inappropriate for individuals with disabilities.
 
=========================================================================
Date:         Fri, 16 Dec 1994 17:16:29 EST
From:         Anne Louise Pemberton 
 
Sherman,
   I find the problem with the disabled far less disturbing
than you fear AS LONG AS the assigned scores have some basis on
a/some former scores for the individual, and that efforts are
made to arrange for "make-up tests" for missed tests if/when an
avoidance pattern emerges on an individual student. ...
 
   What concerns me is the application of the data in "rating"
schools and especially individual teachers. If the tests are
only administered every 3 years rather than annually, you are
measuring ONLY the aggregate of the 3 teachers. I'm
uncomfortable with the idea that you can
magically/mathematically deduce exactly what each of the 3
accomplished individually. Perhaps I would feel more
comfortable if there was additional data input from the
teachers themselves, and others who physically observe the daily
goings on.  Things like noisy mechanical equipment,
insufficient textbooks or supplies, room too hot/cold,
dark/bright, etc. etc likewise limit the options a teacher has in
maximizing her/his effort. Are these factors rated and reported?
(Factoring out is still working with "unproven" factoring).
 
    I'm not a statistician, but wonder if we have instruments
and measurements for physical plants at the individual
classroom level. Is it possible to rate an instructional
facility? Is TVAAS doing it now? Should they be? Would a line
on the report say, for instance:
 
Mrs. Sunshine, Rm 323, Tiny Elementary.  1993 Student gain .79
grade/age level, in a 1.38 physical facility.
=========================================================================
Date:         Fri, 16 Dec 1994 16:58:58 CST
From:         Rick Garlikov 
 
    Regarding the question of why have projections about missing test
scores:
    TVAAS:
That's a very important question, so I asked Bill to frame an answer.  I
am forwarding it to you, having added only the meanings of BLUE and
BLUP for the benefit of the non-statisticians.
 
---------- Forwarded message ----------
 
One of the major advantages of the mixed model process is that it solves
the 'fractured record' problem.
 
All students do not have complete test records for various reasons: illness,
absenteeism, movement from the state, etc.  To complete longitudinal
analyses, only those students with complete records would be included if
traditional multivariate statistical processes were deployed.  However,
with the mixed model process all student records are included, yet still
provide BLUE* of the fixed effects and BLUP* of the random effects.  To
achieve this, the prediction and inclusion of the missing values of
the record for each student is NOT necessary.  Rather BLUE and BLUP come
directly from the solution of the mixed model equations.  Perhaps what has
generated some confusion on this point are our teaching examples which we have
presented in some of our short courses.  In these examples, we asked
attendees to visualize a situation in which we would use all other information
to predict the missing value for each student, then 'plug in' those predictions
with appropriate weights to obtain a conceptual approximation to BLUE.
      As to Sherman's concern of differential weighting of special
ed. students, even if some systems are 'playing' games, then the effect would
be observed on the estimate of means, and should have trivial effect on
the estimated gains for schools, systems, and teachers.
 
BLUE:  Best linear unbiased estimator
BLUP:  Best linear unbiased predictor
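
[Editor's sketch of how a mixed model can use incomplete records without
plugging in missing values.  This is a toy one-way random-effects model,
score = mu + student effect + error, with variance components assumed known.
It is not the TVAAS model, and the names and numbers are invented.  Each
student contributes however many scores exist; the fixed effect mu is
estimated by generalized least squares (BLUE) and each student effect is a
shrunken deviation from it (BLUP).]

```python
import statistics


def blue_and_blup(records, var_u, var_e):
    """records maps a student to a list of observed scores (any length >= 1).

    var_u is the assumed variance of student effects and var_e the assumed
    error variance.  Returns (estimate of mu, {student: predicted effect}).
    """
    means = {k: statistics.fmean(v) for k, v in records.items()}
    # Each student's mean has variance var_u + var_e / n, so weight by its inverse.
    weights = {k: 1.0 / (var_u + var_e / len(v)) for k, v in records.items()}
    mu_hat = sum(weights[k] * means[k] for k in records) / sum(weights.values())
    blup = {}
    for k, v in records.items():
        # Fewer scores means more shrinkage toward zero.
        shrink = var_u / (var_u + var_e / len(v))
        blup[k] = shrink * (means[k] - mu_hat)
    return mu_hat, blup


# Three students with three, one, and two scores on record (invented values).
records = {"A": [52, 55, 58], "B": [70], "C": [40, 44]}
mu_hat, effects = blue_and_blup(records, var_u=25.0, var_e=25.0)
print(f"estimated mean: {mu_hat:.2f}")
for student, effect in effects.items():
    print(f"{student}: {len(records[student])} score(s), predicted effect {effect:+.2f}")
```

[Student B, with a single score, stays in the analysis but is pulled halfway
back toward the overall mean; student A, with three scores, is shrunk less.
No missing score is ever filled in.]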
=========================================================================
Date:         Sat, 17 Dec 1994 09:10:28 CST
From:         Rick Garlikov 
 
It seems to me that Tennessee teachers probably would be worried about at
least four things regarding the Tennessee Value-added Assessment System:
(1) accuracy of the assessments, (2) administrative uses of the assessments,
(3) meaning of the reports issued, and (4) significance of what is tested.
 
With regard to (1), the point that Sherman makes about projections, and my
earlier question of independent verification of the statistical procedures
used, it seems to me a test is readily available: withhold the grading
of a randomly selected significant sample of students who DID take the test,
make the projections TVAAS thinks they can make to whatever degree of accuracy
about what those test scores probably are, then grade the tests and see how
close the projections are.  This would be an empirical test of a statistical
methodology.
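
[Editor's sketch of the check proposed above.  The projection rule here
(prior score plus the average gain of the graded students) is a stand-in
chosen only to make the example run; it is not the TVAAS procedure, and the
data are invented.  The point is the shape of the test: withhold some known
scores at random, project them, then compare.]

```python
import random
import statistics


def mean_gain_projection(prior, graded):
    """Stand-in projection: prior score plus the mean gain among graded students."""
    mean_gain = statistics.fmean([s["actual"] - s["prior"] for s in graded])
    return prior + mean_gain


def holdout_check(students, project, fraction=0.1, seed=0):
    """Withhold a random sample, project their scores, then compare with the truth."""
    rng = random.Random(seed)
    k = max(1, round(len(students) * fraction))
    withheld = set(rng.sample(range(len(students)), k))
    graded = [s for i, s in enumerate(students) if i not in withheld]
    errors = [project(students[i]["prior"], graded) - students[i]["actual"]
              for i in sorted(withheld)]
    return {
        "n_withheld": k,
        "mean_error": statistics.fmean(errors),           # systematic over/underprojection
        "mean_abs_error": statistics.fmean([abs(e) for e in errors]),
    }


# Invented data: 200 students whose true gain averages 10 points, with noise.
data_rng = random.Random(1)
students = []
for _ in range(200):
    prior = data_rng.gauss(700.0, 40.0)
    students.append({"prior": prior, "actual": prior + data_rng.gauss(10.0, 15.0)})

print(holdout_check(students, mean_gain_projection))
```

[Because the sample here is withheld at random, such a test checks the
projections only for scores that are missing at random.  It would not by
itself settle the concern about scores that are missing for a reason.]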
 
Regarding (3),
I would like to see sample reports for districts, schools, individual teachers,
to see what kinds of things TVAAS says -- i.e., what the reports actually
look like.   I am interested in the intelligibility of the reports and what
sort of form they take.  If you could supply examples of a report, please,
or some sample passages from a report.
    Rick Garlikov (dems042@uabdpo.dpo.uab.edu)
=========================================================================
Date:         Mon, 19 Dec 1994 17:25:09 +0000
From:         Harvey Goldstein 
 
I have come in on what I gather is the tail end of a discussion of missing
data in the analysis of TVAA system data to produce estimates of school
effects. Apologies therefore if the issue has been discussed already, and
also because I'm from a different educational system but one where we have
had quite a lot of debate about value added analysis using longitudinal
data.
 
I have two questions. The first refers to Rick Garlikov's message of Dec
16th which I don't really understand. If you are missing a test score which
is the dependent variable in the particular model being used then the
traditional procedure is to omit the case, here presumably the student.
This is somewhat inefficient, however, (and ignoring the issue for now
about bias) and there are a range of imputation procedures (e.g. multiple
imputation a la Rubin etc) which attempt to make use of whatever is
available. The same can be done for the predictor variables when missing.
I do NOT think, however, that this is a standard by-product of the mixed
model. What is a by-product of that model is when you have what is often
referred to as a repeated measures design where response measurements on
some occasions are missing...then a mixed model approach, as opposed to the
traditional multivariate one, allows you to ignore just the missing
occasions without throwing out each individual subject. As I understand
it, however, the value added model is not a repeated measures one, but an
analysis of 'outcome' scores, adjusting for prior achievement...or have I
missed something important.
 
A second question....in the UK the value added debate has been looking at
problems with the sampling errors (standard errors) of value added gain
scores...it turns out that these are typically so large that you cannot
make any statistically significant comparisons between most of your
schools...only those at opposite extremes of a ranking. Is this also the
case in Tennessee? If so what do you do about it when reporting?
 
=========================================================================
Date:         Mon, 19 Dec 1994 17:45:47 -0500
From:         Greg Camilli 
 
I think that we probably aren't on the tail end of the
debate on missing values. Regarding the TVAA procedures,
we may be just beginning.
 
I have a few questions (not related to missing values).
First, it was indicated earlier that the norm-referenced
items used by TCAP are somehow related to the CTBS/4.
In this regard, I'm wondering if items are sampled from
the CTBS, or whether new items are being written at every
assessment. The latter is suggested by the phrase
"fresh, non-redundant, equivalent tests." My second
question is regarding the CTBS scores per se. A number
of different metrics are available from CTB: is TVAA
using the IRT (developmental) score scale? Finally, there
is the suggestion that the CTBS/4 scores are linear
measures with appropriate statistical properties. I'm
wondering how this was established. Is the contention here
that the CTBS scale is known to be a linear metric?
(More to follow after I receive the answer to this question.)
 
Perhaps I could also ask if my interpretation of the missing data
procedure is correct. I understand that multiple imputation
does not affect the values of "effect" coefficients. The models
and methods used to obtain these bypass the estimation of individual
scores. However, multiple imputation increases the standard errors
of the estimated coefficients -- this simply reflects the notion
that with less than complete data, less is known about the parameter.
Thus, multiple imputation is a post hoc adjustment to estimation.
However, with multiple imputation one can always produce a
"complete" set of data (where imputed values have replaced missing
values) to expedite reporting or secondary analyses. (Please do not
hesitate to correct errors in this formulation.)
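
[Editor's sketch of the standard-error point above, using Rubin's combining
rules for multiple imputation.  The five estimates and their variances are
invented; they stand for one coefficient estimated once on each of five
completed data sets.  The pooled variance is the average within-imputation
variance plus (1 + 1/m) times the between-imputation variance.]

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool m >= 2 completed-data estimates and their squared standard errors.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # spread across imputations (m - 1 divisor)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total ** 0.5


# Invented results from m = 5 imputed data sets, each with standard error 0.2.
estimates = [2.0, 2.4, 1.9, 2.2, 2.0]
variances = [0.04] * 5
pooled, pooled_se = pool_rubin(estimates, variances)
print(f"pooled estimate {pooled:.2f}, pooled standard error {pooled_se:.3f}")
```

[With these numbers the pooled standard error comes out larger than the 0.2
obtained from any single completed data set, which is the sense in which
less is known about the parameter when data are missing.]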
 
=========================================================================
Date:         Mon, 19 Dec 1994 19:39:31 -0600
From:         Sherman Dorn 
 
Harvey Goldstein writes:
 
>[An explanation of the difference between multivariate imputation a la
>Rubin and repeated measures designs.]
 
Thanks -- no, no one has yet mentioned this.  I will admit being relatively
ignorant about repeated measures designs.  I suspect that when Bill Sanders
weighs in, TVAAS will be more akin to repeated measures.
 
>A second question....in the UK the value added debate has been looking at
>problems with the sampling errors (standard errors) of value added gain
>scores...it turns out that these are typically so large that you cannot
>make any statistically significant comparisons between most of your
>schools...only those at opposite extremes of a ranking. Is this also the
>case in Tennessee? If so what do you do about it when reporting?
 
I don't know if the Tennessee Value-Added and Research Center noted the
standard errors when talking to the press:  I know that the Nashville papers
neither reported standard errors nor took account of them when reporting
scores publicly.
 
In the 1983 report co-authored by Sanders on a feasibility study (using a
forerunner of the TVAAS model) in Knox County, the vast majority of teachers
fell within two times the standard error of the median, in all three grades
and all subjects tested.  In that case, it truly seemed that the gain score
methodology only distinguished the very worst and very best teachers from
the mass of teachers.
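
[Editor's sketch of the "within two standard errors of the median" reading
above.  The teacher labels, gain estimates, and standard errors are invented,
and the median is treated as a fixed reference point, which ignores its own
sampling error.]

```python
import statistics


def flag_distinguishable(estimates, std_errors, k=2.0):
    """Label each estimate by whether it lies more than k standard errors
    from the median of all estimates."""
    median = statistics.median(estimates.values())
    flags = {}
    for name, estimate in estimates.items():
        diff = estimate - median
        if abs(diff) <= k * std_errors[name]:
            flags[name] = "within band"
        elif diff > 0:
            flags[name] = "above"
        else:
            flags[name] = "below"
    return flags


# Invented gain estimates for six teachers, each with standard error 3.0.
estimates = {"T1": -9.0, "T2": -1.5, "T3": 0.0, "T4": 1.0, "T5": 2.5, "T6": 8.0}
std_errors = {name: 3.0 for name in estimates}
for name, flag in flag_distinguishable(estimates, std_errors).items():
    print(name, flag)
```

[With these made-up numbers only the two teachers at the extremes fall
outside the band; the four in between cannot be told apart from the median.]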

=========================================================================
Date:         Mon, 19 Dec 1994 19:58:04 CST
From:         Rick Garlikov 
 
In response to Anne Pemberton who thought the TCAP testing schedule was
once every three years.  No, it is every year, in the spring.  TVAAS
incorporates three years of data for all its assessments.
=========================================================================
Date:         Tue, 20 Dec 1994 07:04:59 -0600
From:         Sherman Dorn 
 
Please accept my apologies for the prior post:  I was unaware that
the mailer would turn the attachment into gibberish.
 
The following are my thoughts about TVAAS and accountability after
reading the articles and manuscripts available about it and
after a lengthy exchange with a staff member of the Tennessee
Value-Added and Research Center.  The following are largely
comments about the use of TVAAS for policy purposes (i.e., the
evaluation of school personnel and programs).  They are not
comments about the research utility of the TVAAS database.
 
My concerns about TVAAS-as-evaluation focus around the following
five issues:
 
1.  The state legislature has confused the technocratic tool of
TVAAS-as-evaluation with the fundamentally political task of
program evaluation.
 
2.  The state legislature locks program evaluation into a set of
centrally-calculated, rather than grassroots-developed, statistics.
 
3.  The state legislature locks educational program evaluation into
summative, rather than formative, evaluation.
 
4.  The creation of the TVAAS locks the state into a testing program
(the Comprehensive Test of Basic Skills) that gives little
useful feedback to teachers.
 
5.  The state legislature has discriminated against individuals
with disabilities in the exclusion of students eligible for special
education from the TVAAS estimates of teacher effects.
 
The following will discuss each issue in turn.
 
1.  The creation of TVAAS at the heart of educational evaluation
demonstrates the legislature's confusion about the fundamentals
of evaluation.
 
When the Tennessee legislature enacted TVAAS, it tried to put a
statistical model (value added assessment) at the heart of
evaluation.  Essentially, it instructs the state board of
education to gauge whether schools meet the "required rate
of progress" on TVAAS in order to be
in compliance with state policies (i.e., that schools should have
a "mean gain for each measurable academic subject within each
grade greater than or equal to the gain of the national norms"
[Tennessee Code Annotated §49-1-601(b)]).  If schools
don't meet this rate of progress, the commissioner of education
can place school systems on probation and can, if the rate of
progress doesn't come up to speed, remove local board members
and the local superintendent.
 
Similarly, the legislature allows TVAAS estimates of teacher
effects to be used for formal personnel evaluation after three
years of data exist.
 
Now, while I think that statistics are an important part of
policymaking when used correctly, I do *not* think that it is
intelligent to put a single mechanism at the heart of such
decisions.  Fundamentally, decisions about competence in
administration (and teaching) are political.  I suspect that
any such decisions in Tennessee will eventually become
a matter of political judgment, but even in that case educators
will then see TVAAS through cynical eyes:  "They said it was
the numbers that are all, but look at who they decided to
remove."  It is much better to avoid policies that can easily
be seen as hypocritical.  We have enough of that in education
as it is.  If we are to have evaluation of school systems and
teachers, it must be the *educators* who see the evaluation as
legitimate.  Will TVAAS encourage that?  I'm not so sure; I have
yet to see evidence of teachers' seeing TVAAS as acceptable in
Nashville area schools.  One principal told me that TVAAS made
her school look good, but she didn't think it was a legitimate
tool of evaluation.
 
And I even have some doubts about whether the judgments will
be judgments and not just rote decisions based on cut-offs.
Case-in-point:  the state established a few hundred thousand
dollars which the state board was supposed to distribute this
year to schools who met certain goals, one of which included
a 10% "dropout rate" or less.  One
of the schools it gave an award to was Hume-Fogg Magnet School
in Nashville, which had an amazingly low dropout rate.  Guess what?
Magnet schools don't have dropouts; they send problem students
back to their home schools.  So setting up the criterion led
to an unfair advantage for magnet schools, but the board decided
it still had to follow the guidelines because they had established
the goals explicitly.  This may happen with TVAAS and
probation decisions as well.
 
In short, setting a statistical mechanism at the heart of
evaluation ignores the basic political fundament of evaluation
and policy decisions.  In the long run, it is not wise.
 
2.  TVAAS as established by the state locks Tennessee into
centrally-created, rather than grassroots-developed, statistics.
 
Essentially, the Tennessee Value-Added and Research Center
does the statistical analysis and presents the
information to the state by itself.  With some added "help"
by the media.  (The Nashville Tennessean this year, by the
way, printed two versions of the rank-ordering and scores of
schools this fall.  They had to print it a second time because
they screwed up a lot of the scores the first time.  Imagine
what that does to teachers reading the paper on a Saturday,
and then reading their school's scores the next day as different.)
These numbers are not derivable by educators themselves and --
more to the point -- I have yet to meet a teacher who understands
the information presented to him or her.  This may be because
I have a limited sample, or maybe it's biased towards teachers
who don't have the proper information, or maybe teachers sense
my skepticism and wouldn't speak to me if they really like TVAAS,
but I suspect that it's because the numbers are created far
away from them.  They're not tangible in the way that a
test graded by them, or a count of correct answers, would be.
Again, this erodes the legitimacy of evaluation in the eyes
of teachers -- and it's a problem with all centrally-administered
tests.
 
In addition, the central creation of statistics also erodes the
possibility of grassroots political activism to reform schools.
If the core of Tennessee's reform system were -- to create a
thought experiment -- the mandatory publishing by each local
system of certain facts, and manuals distributed to local
organizations and citizens of how to calculate their own
statistics on a home computer, how might educational reform
be different?  The voters might *vote out of office* board
members who didn't move to change things.
 
In reality, statistics are produced both by central administrations
and locally.  There should be a healthy mix, however, and the
weighting of things towards central administration in TVAAS
is unhealthy.
 
Let me give a contrast between Chicago school reform and
the removal of school officials in East St. Louis.  In Chicago,
a broad alliance of people, from businesses to civil rights
activists, were fed up with schools.  They used a mix of
statistical sources -- from standardized test scores to
a report on high school dropouts created by Aspira, Inc.
(a Hispanic civil rights organization) to demand first the
ouster of the superintendent and then the wholesale reform
of schooling in Chicago.  The state legislature made the
*explicitly political* decision to reform and restructure
Chicago's public schools through a statute.  The result (as far
as the legitimacy of the reforms is concerned):  the primary
opponents of the reform were the school principals and
administrators, many of whom lost jobs (and for the principals,
lifetime tenure) through the reforms.
 
In East St. Louis, the state department of education decided to
remove school officials in this largely-African American
city for incompetence.  In contrast with Chicago, the move
was opposed by LOTS of different groups locally, with a
commonly-voiced suspicion that some racial politics were
involved in the decision to remove officials in this particular
school system.  The lesson, I think, is that you need pressure
from both grassroots activists and others allied with them to
create successful movements for school reform in the political
arena.  And I don't think you can get that grassroots activism
without grassroots creation of statistics.  People have to
*feel* they know what's going on in schools.  Numbers created
far from them, which they cannot replicate, aren't enough.
If they only have such numbers, the debate can turn superficial
very quickly, when opponents can only spout numbers at each
other instead of talking about what the numbers mean.  In Chicago,
a local civil rights group did its own research and created its
own, alternative, dropout statistics.  That is a far cry from
what the state legislature has done in Tennessee.
 
3.  The legislature has locked Tennessee into an evaluative model
that is summative rather than formative.
 
When a state uses annual tests as part of an accountability
mechanism, it is creating a summative evaluation system, where
you make a decision about a program as a whole [do we remove the
superintendent], instead of formative, where you can use the information
to make programmatic decisions [what do we do now?].  The problem
with annual tests is that, when you get the information back,
the kids are gone.  The teachers can't do anything to change
their teaching methods for that group of kids.
 
(Okay, I suppose annual tests are formative if you're only looking
at the style of the individual teacher.  It is *not* formative,
however, if you're looking at adjustments for the benefit of
specific children.)
 
In contrast, there *are* formative evaluation systems that let
teachers make instructional decisions many times during the year.
Let me contrast TVAAS (or the CTBS, on which it's based) with
two systems I'm familiar with, one in math and one in reading.
The following table describes how often students take the
test, whether the test provides information about subskills, and
how many tests are needed before you can make a teaching change
inductively:
 
            Frequency   Subskills?   Minimum to change
TVAAS/TCAP  annually       no            unknown
Math        weekly         yes               4
Reading     2x week        no                8
 
In the case of the *truly* formative evaluation systems, you can
make a teaching change every month based on substantive, real data.
If you want to make a teaching change on a particular group of kids
based on TCAPs, you can't.  In order to change teaching methods,
you either need information about subskills during the year that's
comparative across time, or frequent tests, or (preferably) both.
 
And that leads me to ...
 
4.  The state legislature has locked the state into a testing system
with little useful feedback.
 
Put simply, the state currently uses the Comprehensive Test of
Basic Skills as its standardized test (or part of it, the part that's
used for TVAAS).  The only breakdown you can get from the test is
by subject area:  scores in math computation, math application,
reading, social sciences, science (and maybe language arts and
writing, I forget exactly).  But if you're a teacher and wondering
if your students aren't picking up division, you're out of luck.
 
And I suspect that, for a variety of reasons, such tests (without
any item information coming back to teachers) will continue to be
used in Tennessee, because the state is very keen on test security
(and it interprets that broadly to include item information on
feedback) and because the test used for TVAAS will always be
standardized.
 
5.  The current TVAAS discriminates against individuals with
disabilities.
 
This discrimination comes in two forms:  one is the explicit
exclusion of individuals with disabilities from the estimation of
teacher effects (and that's in the statute); the other comes from
the frequent exclusion of individuals with disabilities from
taking the standardized tests at all, and the implicit exclusion
of individuals with severe disabilities from tests that they
really can't take (and thus are excluded from any measure that
would describe their progress from year to year).
 
First, this is a direct denial of benefits to individuals with
disabilities.  The state has decided that evaluation is a critical
function of state government, as important to educational improvement as
activities like building construction and textbooks.  Thus, we should
analyze the investment in this evaluation system in the same way
we do for building construction and textbooks.  Would it be acceptable
for the state to create buildings that were only used by nondisabled
students?  Or for the state to purchase instructional materials that
are only used or usable by nondisabled students?  Absolutely not.
That would be a clear violation of section 504 of the Rehabilitation
Act of 1973.
 
Second, the exclusions create an incentive to overidentify students
as disabled.  If a teacher is worried about a student's progress,
and knows that the student's progress will be part of the teacher's
evaluation unless the child is certified in special education,
the exclusion creates an additional incentive to refer the child
rather than make one more attempt to accommodate her or his needs.
(This will probably happen for children who are, in the opinion of
the teacher, borderline.)
 
Third, the exclusions create an incentive to remove disabled students
from regular classrooms.  If a teacher thinks that a student is
somewhat disruptive, and that the student's scores are unimportant
anyway for TVAAS (or that the student probably won't take the
tests), he or she may push
to get the child out of the classroom "so that the other children
can learn."  This sort of action would clearly violate the right
of disabled children to be in the "least restrictive environment"
where they can be accommodated.  Again, this would probably happen to
children who teachers think are borderline acceptable as far as
presence in the classroom is concerned.
 
Fourth, the exclusions create an incentive for regular education
teachers to ignore the instructional needs of disabled students.
If students are either unlikely to take the test or are by fiat
excluded from the teacher's TVAAS responsibilities, the teacher
may very well concentrate on those children "who matter."  This
could easily be a self-fulfilling prophecy as far as students
with mild disabilities are concerned:  regular teacher thinks child probably
won't take the tests, teacher doesn't pay attention to child, child
learns so little, and loses so much confidence, that the teachers
convince the parents to exclude the child from the tests at the end
of the year.
 
As far as these issues go, the state has at least three possible
ways to eliminate this discrimination:  (a) it could eliminate
standardized tests and the TVAAS; (b) the gain scores of students
who don't take the test could be arbitrarily set to zero; or (c) the
state could invest in a parallel system of teacher supervision, as
heavily funded as TVAAS, for individuals with disabilities.
 
Note that the elimination of the statutory exclusion of individuals with
disabilities from TVAAS would *not* eliminate the discriminatory
effects.  First, some children will never take the standardized tests
because of the severity or nature of their disabilities.  Second,
the exclusion of some children from testing at all retains all
the perverse incentives described above.
 

 
=========================================================================
Date:         Wed, 4 Jan 1995 16:21:37 CST
From:         Rick Garlikov 
 
      This is *MY* response to a previous post; it is NOT a response
      from TVAAS.      Rick Garlikov
 
Sherman Dorn said, then subsequently explained and argued for,
the following:
 
"My concerns about TVAAS-as-evaluation focus around the following
five issues:
"1.  The state legislature has confused the technocratic tool of
TVAAS-as-evaluation with the fundamentally political task of
program evaluation.
"2.  The state legislature locks program evaluation into a set of
centrally-calculated, rather than grassroots-developed,
statistics.
"3.  The state legislature locks educational program evaluation
into summative, rather than formative, evaluation.
"4.  The creation of the TVAAS locks the state into a testing
program (the Comprehensive Test of Basic Skills) that gives
little useful feedback to teachers.
"5.  The state legislature has discriminated against individuals
with disabilities in the exclusion of students eligible for
special education from the TVAAS estimates of teacher effects."
 
Regarding (1):  I am not happy with the language used to express
what Sherman's good examples show he means; and I will return to that
after discussing what I understand him to mean.  First, I don't
think the state has at this point confused the issue, though
there is a reasonable danger, in ways Sherman points out, that
they will.  TVAAS has already said in a previous post that their
results are supposed to be used as INDICATORS that something is
very good, or very wrong, about a school/district/teacher, not
conclusive evidence.  That leaves room for the kind of evidence
to be given about the merit of schools/districts/teachers which
IS reasonable to educators -- a concern that Sherman expressed in
explaining (1).  Keeping this system from degenerating into a
system of rote cut-offs based on TVAAS numbers without use of
judgment is crucial.  But surely it should be possible to prevent
such degeneration; and safeguards should be put into place to
make certain it does not occur.
 
Sherman says "If we are to have evaluation of school systems and
teachers, it must be the *educators* who see the evaluation as
legitimate."  I want to make certain this is not ambiguous; while
educators must see evaluations as legitimate, they should not be
the only ones who do.  If educators are the sole arbiters of
whether they are doing a good job or not, and which of them is
competent or not, without having to give reasons that make sense
to anyone else, I fear we would end up with the same kinds of
problems we have with the AMA policing its own --which they tend
not to do very much unless there is the most egregious and
publicized ineptness, negligence, or malpractice.  My experience
with "ethics" or "licensing" boards in governmental,
professional, and business organizations is that they end up with
a set of guidelines they follow that are superficially ethical or
professional sounding, but which, at any meaningful level,
frequently have very little to do with normal notions of
competence or morality.  In fact, much of the public wants to
know why some administrators cannot seem to identify incompetent or
excellent teachers when left to do so on their own; and why some teachers
are able to get degrees at all.  While the criteria for competence in
education certainly need to be reasonable to educators, they
should also be reasonable to people who are not educators.
 
The reason why I am unhappy with the language of (1) is that I
don't think program evaluation is fundamentally a political
decision, though I think programs often are judged on political
grounds (i.e., power, perhaps coupled with ideology), and perhaps
even more often appear to be.  That is true when evaluations are
not reasonable or when they do not respond to valid reasons given
in opposition to them.  But normally, when people act in good
faith to evaluate a program, they are trying to be reasonable,
not simply asserting power.  The fact that some programs are
judged purely on political (power and ideological) grounds, and
are therefore not really being reasonably evaluated at all,
should not lead us to say that the proper way to do evaluations
is politically.  The proper way to do evaluations is based on
reasons and judgment, not power and blind ideology.
 
Moreover, I thought TVAAS was established to judge how well
policy goals are being met, not how worthy those policy goals
are.  From what has been said so far, TVAAS tries to ascertain
how well certain subject content areas --as prescribed outside of
TVAAS-- are being taught, not whether those subject content areas
are important or whether "knowing" them in certain ways is
important.  It IS important that legislators, administrators, and
the public understand that these other issues ARE crucial, and
that TVAAS evaluations are at best no more meaningful than what
they are (legislated to be) based on.  The media should never
report mere numbers or ratings -- though they will.
 
Regarding (2), I think what Sherman says is true and important
here, but MORE SO if instead of speaking only about the
calculation of "statistics", the operating principle was about
the expression of "reasons".  Although statistics can give
evidence, not all evidence is statistical in nature.  Plus, as
Sherman points out, debates in terms of "numbers" alone, rather
than the reasonable meaning of those numbers, can quickly turn
superficial.  In Sherman's examples from the two different school
districts, it seems to me the important features were whether the
reasons given by a centralized authority were agreed with by the
local residents, and vice versa.  Without some sort of consensus,
the issues do become politicized and embattled, turning on a
struggle for power, rather than reasonable and peaceful
resolution.  I think that this way of understanding (2) links it
closely with (1).
 
Regarding (3), I may be misunderstanding the point of TVAAS, but
I thought it WAS established to give summative evaluations about
teaching competence/progress, not formative evaluations about
improving instruction.  So that even if TVAAS cannot be used to
improve instruction or pinpoint students' specific learning
difficulties, that is not a point against it.  It is important to
identify problems or good resources, even if that identification
does not pinpoint a remedy for the problems.  The profession should
try to provide the remedy, I would presume, not TVAAS.
 
Teachers are free to give tests which allow formative evaluations
and instructional corrections.  There is no reason the state
needs to be giving these tests for teachers.  The mission of
TVAAS seems to have nothing to do with this aspect of teaching,
so I don't see that as an actual objection against TVAAS.
 
Regarding (4), Sherman says,
            "the state currently uses the Comprehensive
            Test of Basic Skills as its standardized test
            (or part of it, the part that's used for
            TVAAS).  The only breakdown you can get from
            the test is by subject area:  scores in math
            computation, math application, reading,
            social sciences, science (and maybe language
            arts and writing, I forget exactly).  But if
            you're a teacher and wondering if your
            students aren't picking up division, you're
            out of luck."
 
But again, the purpose of TVAAS, I presume, is to evaluate how
well teachers/schools/districts have done that aspect of their
jobs involved in teaching content of whatever sort is measured on
the Comprehensive Test of Basic Skills.  And surely, shouldn't a
teacher be able to figure out whether his/her students are
"picking up division" or not without the state's having to do
that for him/her?  I take it that TVAAS is not about helping
teachers/schools/districts do their jobs, but about suggesting,
in some attempted objective way, to everyone how well or poorly
they have done it.
 
=========================================================================
Date:         Wed, 4 Jan 1995 20:12:39 -0600
From:         Sherman Dorn 
 
>      This is *MY* response to a previous post; it is NOT a response
>      from TVAAS.      Rick Garlikov
 
And here I was, worried that no one had read that post.
 
Regarding my concern that TVAAS will be used as a rigid evaluation
tool:
 
>First, I don't
>think the state has at this point confused the issue, though
>there is a reasonable danger, in ways Sherman points out, that
>they will.  TVAAS has already said in a previous post that their
>results are supposed to be used as INDICATORS that something is
>very good, or very wrong, about a school/district/teacher, not
>conclusive evidence.
 
With respect to the Value Added and Research Center (VARC) staff
member who posted that information, I beg to disagree, based on the
law.  The relevant sections of Tennessee's education code (49-1-601 through
49-1-610, in the 1994 Supplement to Tennessee Code Annotated)
clearly say that systems which perform subpar for several years are
subject to probation.  The code does mention dropout rates (undefined)
as another criterion, and pulling local officials is at the discretion of the
state commissioner of education, but there is no doubt what is the
central force here:  TVAAS statistics.
 
To be fair to the incoming governor, a Nashville paper reported
yesterday that he was intending not to follow the provisions of the
code regarding the removal of local educational officials.  Of course,
with the reputation local papers have for absolute accuracy, I'll
still wait and see what happens.  (But see what happens if the report
is true:  the governor would be explicitly ignoring a provision of the
education code.  Great.)
 
>Sherman says "If we are to have evaluation of school systems and
>teachers, it must be the *educators* who see the evaluation as
>legitimate."  I want to make certain this is not ambiguous; while
>educators must see evaluations as legitimate, they should not be
>the only ones who do.
 
Agreed.  But this makes it very clear that evaluation is a political
task, in its underlying assumptions if nothing else.
 
>The reason why I am unhappy with the language of (1) is that I
>don't think program evaluation is fundamentally a political
>decision, though I think programs often are judged on political
>grounds (i.e., power, perhaps coupled with ideology), and perhaps
>even more often appear to be.
 
The fact that actions are political does not make them bad.  You can decide,
for example, to fire Ronald Jones on the basis that Jones' lowest-performing
students didn't improve their ability to read at all during ten years in his
class; but that has at its basis the political judgment that a teacher must pay
attention to lower-performing students.  This decision may anger parents of
higher-performing students who loved Jones' teaching and how their children
did.  That's a political decision, not a technocratic one.
 
>Moreover, I thought TVAAS was established to judge how well
>policy goals are being met, not how worthy those policy goals
>are.
 
My contention is that TVAAS is an abdication of the legislature's
need to make policy goals.  It presumes that, by the installation of
TVAAS, Tennessee's schools will improve.  A technocratic
solution.
 
Moreover, my concern is that TVAAS freezes the state into its current
set of tests; it will be very difficult to change evaluation goals at this
point.  TVAAS is a policy flywheel.
 
Regarding summative versus formative evaluation:
 
>I thought it WAS established to give summative evaluations about
>teaching competence/progress, not formative evaluations about
>improving instruction.
 
Sanders and Horn's article in the Journal of Personnel Evaluation
in Education claims that one advantage of TVAAS is its formative
potential, and its instructional-methods neutrality.
 
>Teachers are free to give tests which allow formative evaluations
>and instructional corrections.  There is no reason the state
>needs to be giving these tests for teachers.  The mission of
>TVAAS seems to have nothing to do with this aspect of teaching,
>so I don't see that as an actual objection against TVAAS.
 
This is a very interesting claim.  I hope I am not misinterpreting,
but is Rick suggesting that the state should evaluate program
effectiveness to the extent of removing officials and firing teachers,
but not encourage them to use formative evaluation?
 
Moreover, I think the high-stakes nature of TVAAS will drive
other forms of evaluation from consideration, as teachers focus on
the annual tests in the spring.  Besides, there is
NO GUARANTEE that a specific form of formative evaluation,
even if it is tied to the curriculum, will mean students will
perform better on the tests used in TVAAS.  So it is, in fact,
unfair to tell teachers, "well, you go train yourself in formative
evaluation, use it to the best of your abilities, and hope that
the students then do better on our annual tests.  Oh, and no,
the evidence you accumulate in your classroom doesn't count
as much as the TVAAS."
 
In addition, formative evaluation is alien to most teachers, as far as
guiding instruction is concerned.  Getting teachers to use formative
evaluation would require a very different type of support structure,
and quite a bit of money.  It conflicts with TVAAS simply because
it would compete with the VARC for funding.  (And the next
governor has promised he won't raise taxes.)
 
Regarding the usefulness of feedback from the tests (and my claim that
TVAAS freezes the state into using these tests):
 
>But again, the purpose of TVAAS, I presume, is to evaluate how
>well teachers/schools/districts have done that aspect of their
>jobs involved in teaching content of whatever sort is measured on
>the Comprehensive Test of Basic Skills.  And surely, shouldn't a
>teacher be able to figure out whether his/her students are
>"picking up division" or not without the state's having to do
>that for him/her?
 
I'll defer here to William Webster, head of the evaluation unit in the Dallas
schools, who told a CREATE audience a few years ago that one of
the requirements of top-down evaluation of teachers was that teachers
be provided with explicit, helpful feedback.
 
>I take it that TVAAS is not about helping
>teachers/schools/districts do their jobs, but about suggesting,
>in some attempted objective way, to everyone how well or poorly
>they have done it.
 
This is precisely what it is, and I think Tennessee law
gives an unwise emphasis to one type of evaluation instrument.

=========================================================================
Date:         Wed, 4 Jan 1995 20:52:42 CST
From:         Rick Garlikov 
 
Here is TVAAS's response to Sherman Dorn's first post on the policy aspect
of TVAAS.
----------------------------Original message----------------------------
 
In response to Sherman Dorn's most recent critiques of the Tennessee
Value-Added Assessment System (TVAAS), the UT Value-Added Research and
Assessment Center offers these remarks as a complement to those presented
by Rick Garlikov:
 
On Tue, 20 Dec 1994, Sherman Dorn wrote:
 
> Please accept my apologies for the prior post:  I was unaware that
> the mailer would turn the attachment into gibberish.
>
> The following are my thoughts about TVAAS and accountability after
> reading the articles and manuscripts available about it and
> after a lengthy exchange with a staff member of the Tennessee
> Value-Added and Research Center.  The following are largely
> comments about the use of TVAAS for policy purposes (i.e., the
> evaluation of school personnel and programs).  They are not
> comments about the research utility of the TVAAS database.
>
> My concerns about TVAAS-as-evaluation focus around the following
> five issues:
>
> 1.  The state legislature has confused the technocratic tool of
> TVAAS-as-evaluation with the fundamentally political task of
> program evaluation.
 
TVAAS is a statistical model for program evaluation.  It is not a
"technocratic tool," as Dorn so colorfully phrases it.  It was not
developed by the State of Tennessee but by Dr. Bill Sanders, a
statistician, for the purpose of addressing problems previously
encountered in using student achievement data in educational assessment.
The State of Tennessee adopted it as the model for such assessment
because TVAAS was able to supply valid, reliable, unbiased data based on
student gains, data the State thought was important.  The Education
Improvement Act (EIA) allows the use of TVAAS data as a *component* of the
evaluation of educational entities--teachers, schools, and school
systems.  It specifically states that TVAAS may not be the only source
of data in such evaluations.  Teachers are further protected in several
ways.  First, since the law mandates the use of at least three years of
data for a formal evaluation (although fewer years may be used for
informational purposes), only tenured teachers will receive TVAAS reports
that can be used for assessment purposes.  Second, the EIA specifically
states that teachers may not be dismissed exclusively on the basis of
TVAAS data.  Third, teachers are evaluated by several different models
including performance assessment by local supervisors and principals.
These evaluations are mandated by the State.  Fourth, teachers may elect
to be evaluated intensively to achieve advanced Career Ladder status.
Career Ladder evaluations include dialogs, presentation of artifacts, and
repeated classroom observations.
        TVAAS data is currently available only for grades three through eight
in the subject areas of science, social studies, math, reading, and
language arts.  This means that many teachers do not have the advantage
of TVAAS data.  Subject-matter specific tests for grades nine through
twelve are now being developed by CTB/McGraw Hill in conjunction with
committees of Tennessee secondary school teachers who are devising
questions in their areas of expertise.  However, it is unlikely that
TVAAS will ever assess ALL teachers in ALL subjects.  Therefore, it will
never be the ONLY means of assessing teachers or schools.  TVAAS merely
provides measures of student academic gains, surely a useful component of
any educational assessment system.
        As for Dorn's projections of what "might" happen, we see no basis
in reality for such dire prophecies.  The link he attempts to establish
between the problems associated with fairly rewarding schools for
improvements in the dropout rate and TVAAS is nonexistent.
        Finally, we disagree with Dorn on the political nature of
assessment.  Assessment should be a scientific process rather than a
political one.  Although setting the criteria for assessment is a
political act, assessment itself must be scientific in orientation,
centered on the real meanings of the most common of assessment
terms--reliability, validity, and fairness.
 
> 2.  The state legislature locks program evaluation into a set of
> centrally-calculated, rather than grassroots-developed, statistics.
 
The State of Tennessee has a vested interest in monitoring the health of
one of its largest and most important endeavors--the education of its
children.  It has chosen TVAAS as a means of carrying out its oversight
responsibilities because TVAAS fairly estimates the academic growth of
students, an indicator recognized as important by teachers, parents,
local administrators, community leaders, and state officials, alike.
Other indicators--dropout rate, attendance, promotion rate, and
graduation rate--are also used by the state for assessment purposes.  As
stated above, a variety of methods are used state-wide for the evaluation
of teachers.
        TVAAS is a sophisticated statistical process that is beyond the
capacity of many to grasp mathematically.  However, TVAAS data is
reported in a manner that is comprehensible to any educator willing to
look at the reports and graphs supplied, with explanations, specifically
for the purpose of rendering the data USEFUL. Judging by reports from
many schools and systems across Tennessee, TVAAS data is now extensively
used for curriculum planning and development.
 
> 3.  The state legislature locks educational program evaluation into
> summative, rather than formative, evaluation.
 
For further clarification of Mr. Dorn's point, we quote him from a
paragraph in which he discusses this point, further on in this same
composition.  Mr. Dorn writes:
        "When a state uses annual tests as part of an accountability mechanism,
it is creating a summative evaluation system, where you make a decision
about a program as a whole [do we remove the superintendent], instead of
formative, where you can use the information to make programmatic
decisions [what do we do now?].  The problem with annual tests is that,
when you get the information back, the kids are gone.  The teachers can't
do anything to change their teaching methods for that group of kids.
        "(Okay, I suppose annual tests are formative if you're only looking at
the style of the individual teacher.  It is *not* formative, however, if
you're looking at adjustments for the benefit of specific children.)"
 
        First, we would direct those following this topic to Rick
Garlikov's excellent discussion on this point.  We add only a few comments:
        The State of Tennessee considers TVAAS to be a model for both
summative and formative assessment.  Education is an ongoing effort, and
the end of a school year is simply a point in a continuum.  TVAAS is
used, as stated under (2) above, to assess how well students are
progressing under current practices.  If they are making expected gains,
we assume that the teacher, school, or system is providing effective
instruction.  If not, then adjustments should be made.  When systems are
doing very poorly indeed, they are placed under supervision and are more
closely monitored by the state.  While under probation, they are expected
to make progress toward acceptable standards--a formative process
directed by many indicators, one of which is student gain.
        TVAAS provides breakdowns of student gain data to schools and
school systems by school, grades, subjects, and achievement levels of
student (this last, upon request from the Department of School
Accountability).  With all this information, schools and systems can
easily pinpoint problems and successes and make specific policy decisions
based upon this knowledge. The data provide guidance as to "What we do now."
        The State of Tennessee leaves the day-to-day adjustments in
teaching strategies to the classroom teacher.  In the last four years,
Tennessee has conducted a concentrated effort to deregulate its schools,
placing great emphasis on site-based management and school-based
decision-making.  In dispensing with the rules and regulations that have,
in the past, served to dictate how schools may operate, Tennessee has
placed the responsibility for program development, allocation of
resources, the organization of schools, teaching strategies, and many
other things in the hands of local administrators.  Because this is so,
TVAAS was adopted to monitor the educational process, not the "style," as
Dorn puts it.  The "style" in which education is conducted is not
prescribed.  The process of education can be completely individualized.
TVAAS merely assesses the value added, the product of educational
efforts, in the grades and subjects for which we have appropriate
linear measures.  As Garlikov points out, it was never designed to be
otherwise.  And as we have repeatedly stated, no one assessment model will
suffice for all legitimate assessment purposes.
 
> 4.  The creation of the TVAAS locks the state into a testing program
> (the Comprehensive Test of Basic Skills) that gives little
> useful feedback to teachers.
>
        As we have repeatedly told Dorn, TVAAS does not lock the state into
any testing program whatever.  TVAAS can use any assessment data that
provides appropriate linear measures.  In other words, TVAAS requires
scalable data because it assesses progress over time.
        TCAP was in place before TVAAS was ever adopted, and state-wide
annual tests have been used in Tennessee, as in other states, for years.
TVAAS simply utilizes data in a new way, furnishing far more useful
information to teachers and schools than was ever possible in the past.
 
> 5.  The state legislature has discriminated against individuals
> with disabilities in the exclusion of students eligible for special
> education from the TVAAS estimates of teacher effects.
>
Please see Dorn's discussion of this item, below.
 
        First, Dorn refuses to acknowledge, as we have repeatedly told him,
that low-achieving students are at least as likely to achieve appropriate
gains as their higher-scoring classmates (based upon three years of
state-wide data), so there is no logical reason to exclude them from the
tests.
 
        Second, we find it most disturbing that Dorn attributes such
unethical behavior to Tennessee teachers as to ignore students in need
of their tutelage and to connive to have their parents classify them as
special education and to encourage them to avoid testing.
 
        Third, we have suggested to Dorn that if he has any real
knowledge of such occurrences that he refer them to the Department of
School Accountability.  There is a mechanism (to use Dorn's
terminology) in place to insure that special education and other students
are not exploited in this manner.
 
        Fourth, Dorn is obviously unaware that the funding for TVAAS
is minuscule and that supervisory personnel are already available in every
system for the purpose of overseeing teacher conduct and, in particular,
special education programs.  Perhaps he believes that they are possessed
of the same iniquities he attributes to teachers and cannot be trusted
with their appointed duties.
 
        In contrast, it is our supposition that teachers and
administrators are dedicated to the same goal that the State of
Tennessee espouses: the improvement of education for all students in
order to maximize the potential of each of them.  We sincerely hope
that TVAAS is a means by which this goal can be more readily achieved.
 
We appreciate the opportunity to respond to these issues.
 
=========================================================================
Date:         Wed, 4 Jan 1995 23:12:35 -0600
From:         Sherman Dorn 
 
The UT Value-Added Research and Assessment Center wrote a lengthy response
to my long comments several weeks back about the Tennessee Value-Added
Assessment System.  (Please accept my apologies for the incorrect name of the
center in other posts.)  I will try to confine my remarks to central issues and
the matter of potentially discriminatory effects on children with disabilities.
 
Regarding evaluation and my claim of its political assumptions:
 
>        we disagree with Dorn on the political nature of
>assessment.  Assessment should be a scientific process rather than a
>political one.  Although setting the criteria for assessment is a
>political act, assessment itself must be scientific in orientation,
>centered on the real meanings of the most common of assessment
>terms--reliability, validity, and fairness.
 
This is the core of our disagreement.  TVAAS assumes that
assessment should fundamentally be a scientific process.  I
assume that evaluation's political nature will outweigh any claim to
positivistic science, *especially* if it's asserted as such.
 
(Note:  this does not mean that I am anti-empirical, or that VARAC
staff are against using other materials for evaluation.  This is an
argument about the orientation of evaluation.)
 
Regarding the origins of statistics:
 
>The State of Tennessee has a vested interest in monitoring the health of
>one of its largest and most important endeavors--the education of its
>children.
 
The question here is NOT whether the state should produce any statistics,
but whether its creation of statistics should dominate the debate.  I think it's
much healthier for public debate to include statistics from various sources and
levels of government and often from outside government as well.  I think
the Education Improvement Act puts TVAAS into an overwhelmingly
dominant position.
 
Re:  my claim of potentially discriminatory effects on individuals with
disabilities:
 
>so there is no logical reason to exclude them from the tests.
 
Information on low-achieving students as a group is NOT information about
individuals with disabilities.  Moreover, my concerns deal with the incentives
for flawed, stressed educators to try to slough off some of their responsibility
 
>        Second, we find it most disturbing that Dorn attributes such
>unethical behavior to Tennessee teachers as to ignore students in need
>of their tutelage and to connive to have their parents classify them as
>special education and to encourage them to avoid testing.
 
Most unethical behavior in institutional settings occurs not because of conscious
and evil conniving but because the circumstances and work habits combine
to create incentives for unethical behavior.  Besides, the existence of
accountability frameworks such as TVAAS presumes the irrationality and
untrustworthiness of teachers (else why would we need to hold them
accountable?).
 
>        Third, we have suggested to Dorn that if he has any real
>knowledge of such occurrences that he refer them to the Department of
>School Accountability.  There is a mechanism (to use Dorn's
>terminology) in place to insure that special education and other students
>are not exploited in this manner.
 
Such evidence would have to be very clear; the hearsay testimony from
a non-tenure track academic about the supposed intent to pressure
parents into removing their children from testing would not cut it.
However, I do not have to make a hard case of specific
instances in order to make a pretty good case that Tennessee
educators tend to exclude kids with disabilities from testing.
According to the National Center on Educational Outcomes, fewer than
50% of students with disabilities in Tennessee participated in state
testing in 1992.  Some states have much, much higher participation
rates.  My conclusion:  Tennessee schools just don't try hard to include
students with disabilities in testing.
 
>        Fourth, Dorn is obviously unaware that the funding for TVAAS
>is minuscule
 
Compared to the entire education budget, certainly.  However, there is
NO such comparable accountability system with quantifiable tracking of
student outcomes for children with disabilities; when looked at in that
frame, individuals with disabilities receive the short end of this system that
the state legislature has put at the heart of educational reform in Tennessee.
If quantifiable tracking is good for nondisabled students, why doesn't
a system exist that will accommodate the requirements of students with
disabilities?
 
This is a question for the state legislature; I assume that VARAC staff
would have no objections to such a parallel system that overlaps TVAAS,
and that they have a great incentive for Tennessee to put in place preventive
measures against discriminatory effects.  Regardless of our differences over
educational evaluation, I assume we can agree on this.

=========================================================================
Date:         Wed, 4 Jan 1995 23:10:05 -0600
From:         Alan Davis 
 
I find that I agree with all of Sherman Dorn's criticisms of the TVAAS
evaluation system.  All evaluation is political, because the process and
findings of evaluation affect power and resources.  The interests and
assumptions of some are reflected in the questions the system is
designed to answer and the test data selected to answer them; the
interests and assumptions of others (teachers, in particular) are
not.  The fact that the system seems to rely on objective evidence of
learning and is statistically sophisticated does not change the
essentially political nature of the process; by cloaking the enterprise
in the garb of science, teachers may be doubly intimidated and policy
makers impressed, but nonetheless you have here a summative evaluation
conducted far from the site of local decision making that is likely, in
my view, to have pernicious effects despite the very sincere intentions
of its designers to be fair.
 
I am open to the possibility that TVAAS may be a valuable tool for
research.  I believe that good teaching is teaching that results in
significant learning for a broad spectrum of students.  At the elementary
level, I will go so far as to agree that the sort of standardized
multiple choice measures designed by CTB will detect improvements in
reading comprehension and mathematical conceptual and problem solving
understandings that most of us want for our children.  I have used
regression residuals from hierarchical linear models of learning outcome
measures regressed on pretests and SES to select individual teachers to
study in my own research on effective teaching.  The main problem, I
believe, is that the multiple choice measures used in Tennessee are at the
outset poor proxies for the tasks we really want kids to be able to
perform in school, and when we attach consequences for teachers around
scores on these measures, very bad consequences are likely to result.
 
Note first that performance on the measures valued by TVAAS is not
valued by students.  The tests are not part of any instructional unit.
No grade is associated with performance on them.  Students have no
intrinsic or extrinsic motivation to perform well on them, especially if
they don't like mental puzzles for their own sake.
 
At the same time, students' performance on the tests may have
consequences for teachers.  At the very least, someone is forming
judgments about what teachers are good and what teachers are not so good
based upon their students' performances on these tests.  Linda McNeil,
Lorrie Shepard, Mary Lee Smith, and others have documented the behavior
of teachers under these circumstances.  They concentrate on raising the
scores rather than teaching to the broader goals of instruction, and the
strategies they hit upon to raise scores involve adjusting instruction to
model the format, content, and mindset of the multiple choice tests,
sometimes to the point of teaching actual passages and items.
 
The problem, which I do not believe can be avoided, is that the validity
of tests is inseparable from the uses to which they are put.  A test that
may be valid as a research tool to discover effective teaching will lose
its validity when teachers suspect that the outcome has consequences for
themselves.
 
=========================================================================
Date:         Fri, 6 Jan 1995 19:22:04 CST
From:         Rick Garlikov 
Subject:      TVAAS/Harvey Goldstein
 Harvey Goldstein stated on Dec. 19, 1994, "..it turns out that
these [standard errors] are typically so large that you cannot make
any statistically significant comparisons between most of your
schools...only those at opposite extremes of a ranking.  Is this
also the case in Tennessee?  If so what do you do about it when
reporting?"

  From TVAAS: 
     Below are listed the mean gains for math with their standard
errors for schools within one of the larger school systems in
Tennessee.  These means are three year averages and were calculated
from the TVAAS mixed model process.  This should give an idea of
the sensitivity of the process.
 
            INTERMEDIATE SCHOOLS
 
        GRADE   SCHOOL    MEAN   STD. ERR
                RANK      GAIN
 
           3        1   71.596     4.931
           3        2   71.205     3.714
           3        3   67.038     2.624
           3        4   66.876     3.641
           3        5   62.734     3.427
           3        6   62.574     6.906
           3        7    62.14      5.17
           3        8   62.032     3.628
           3        9   61.713     3.534
           3       10   59.062      4.04
           3       11   58.096     3.337
           3       12   57.262     4.849
           3       13    57.22     2.321
           3       14   56.909     5.552
           3       15   55.062     3.866
           3       16   54.775     3.301
           3       17     53.8       2.4
           3       18   53.553     4.459
           3       19   53.452     2.813
           3       20    52.06     2.074
           3       21    51.24     1.411
           3       22   50.853     2.909
           3       23   49.298     3.716
           3       24   48.654     3.134
           3       25   48.269     4.603
           3       26   47.889     3.141
           3       27   47.884     3.714
           3       28   47.864     2.931
           3       29   47.574     5.398
           3       30   46.382      3.73
           3       31   46.281     1.664
           3       32   44.722     3.622
           3       33   43.843     2.533
           3       34   43.272     3.417
           3       35   42.965      2.47
           3       36   42.907     2.443
           3       37   41.844     2.803
           3       38   41.645     2.772
           3       39   41.018     1.859
           3       40   40.079     2.649
           3       41   39.492     2.958
           3       42   38.662     4.633
           3       43   37.784     2.807
           3       44    35.84     3.517
           3       45   33.597     4.541
           3       46    32.46     5.394
           3       47   31.586     3.868
           3       48   30.907     3.329
 
 
                 MIDDLE SCHOOLS
 
           7        1   25.894     0.971
           7        2   23.758     0.861
           7        3   23.134      1.17
           7        4   22.988     1.275
           7        5   22.738     1.347
           7        6    18.51     1.215
           7        7   18.185      0.99
           7        8   16.831     1.071
           7        9   16.268     1.075
           7       10   15.843     1.374
           7       11   15.394     1.055
           7       12   15.357     1.077
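
One rough way to read the table against Harvey Goldstein's question is to ask, for any two schools, whether the gap between their mean gains is large relative to the combined standard errors. The sketch below does this for a handful of rows copied from the table. It is an illustration only, not the TVAAS procedure: it assumes the two estimates are independent (estimates from a mixed model may not be), it uses 1.96 as an assumed two-sided 5% critical value, and it makes no correction for the many pairwise comparisons a full ranking implies. The function names are hypothetical.

```python
import math

# A few (mean gain, standard error) pairs copied from the table above,
# keyed by school rank within each grade.
GRADE_3 = {1: (71.596, 4.931), 2: (71.205, 3.714),
           24: (48.654, 3.134), 48: (30.907, 3.329)}
GRADE_7 = {1: (25.894, 0.971), 2: (23.758, 0.861),
           5: (22.738, 1.347), 6: (18.51, 1.215)}


def z_for_difference(a, b):
    """z statistic for the gap between two (mean, std_err) estimates.

    ASSUMPTION: the two estimates are independent, so their variances add.
    """
    (mean_a, se_a), (mean_b, se_b) = a, b
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)


def clearly_differ(a, b, critical=1.96):
    """True if the gap exceeds the assumed two-sided 5% critical value."""
    return abs(z_for_difference(a, b)) > critical


if __name__ == "__main__":
    comparisons = (
        ("grade 3", GRADE_3, [(1, 2), (1, 24), (1, 48)]),
        ("grade 7", GRADE_7, [(1, 2), (5, 6)]),
    )
    for label, table, pairs in comparisons:
        for first, second in pairs:
            z = z_for_difference(table[first], table[second])
            separated = clearly_differ(table[first], table[second])
            verdict = "differ" if separated else "cannot be separated"
            print(f"{label}: rank {first} vs rank {second}: "
                  f"z = {z:.2f} ({verdict})")
```

Under these assumptions, neighboring schools such as grade 3 ranks 1 and 2 cannot be told apart, while schools far apart in the ranking (ranks 1 and 48) can.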
 
 
 
     Additionally, there was a question concerning how TVAAS deals
with missing scores.  We will write a longer response to this
question later.  But, briefly, it is more like the analysis of
repeated measures.  However, we do include all of the scores among
subject-grade combinations.  This is certainly sufficient and
avoids the issues and problems associated with imputations.
=========================================================================
Date:         Sat, 7 Jan 1995 04:43:10 CST
From:         Rick Garlikov
 
Sherman Dorn, Alan Davis, and Mark Fetler all make important and cogent
points about the nature of assessment and assessment tools.  But I am
arguing that characterizing these points by the term "political" is
a mistake, and it is a mistake that severely weakens their message and the
force behind that message.
 
      There should be no doubt that assessment is not purely a matter of
science, and that it is both a value-laden enterprise, and one subject to
misleading scientific/mathematical measurements and the use of those
measurements.  But these issues are not "political" issues. ...
      By using the word "political" to characterize the nature of the
technical aspects of TVAAS, or any other scientific/mathematical/statistical
sorts of assessment tools/indicators, you undermine your credibility with
a government and public who believe, and I think rightly so, that surely
there are some sorts of objectively recognizable contributions that teachers
make or don't make to students' educations.  To claim that all teacher
evaluation, by its very nature, is "political" SEEMS to make the claim that
there are no objective standards for discussing/judging teaching; and it makes
it sound as though you are arguing that any judgment of teaching ability is
worthless and unnecessary -- that no teacher is REALLY good or bad; they are
all just popular or unpopular, so there is no fair way to judge any of them.
        It seems to me that what you want to claim, especially if you go
public, or lobby legislators with your points, is that there are many
aspects, not only to good teaching, but even to the teaching of content,
that are not reflected in the tests that TVAAS analyzes -- for the
reasons that you give, and that the studies you cite show.  And that
though there is merit in the statistics that TVAAS produces (assuming that
is true), those particular statistics are not what necessarily determine a
teacher's value or contribution, any more than any one sports statistic
determines a player's value or contribution.  People will understand
and accept that (at least as a possibility).  They will not accept your
saying that scientific measurement or statistics is political any more than
they would accept your saying that "batting average" in baseball is merely
political.  Neither of these things is "political" in any normal sense of the
word.  You are stretching the use of the word; and in doing so, you are
losing most of your audience since most people will neither understand nor
share your meaning.  And they will think you are saying there can be no
criteria of any sort for legitimately distinguishing good from bad teaching.
 

From:         Rick Garlikov 
 
The example Sherman gives of the firing/keeping of a teacher who
teaches high-achievers well and low-achievers poorly as being a
political decision seems to me to conflate a number of
possibilities here.  It is a political decision IF the wishes and
needs of the aggrieved party are ignored simply because they are
powerless to cause anything of consequence to the people making
the decision.  But it is not political if the teacher is replaced
by a teacher who is able to help both groups of students.  Nor is
it political if a way is found to utilize the skills of the
current teacher while also doing something to improve his/her
skills with the other students, or to divide teaching assignments
in other ways so that both sets of students get teachers who can
help them the most.  Not every decision is a political decision --
even if some of the results are the same as a purely political
decision would be.
 
Now, since I don't understand what exactly Sherman is attributing
to me with regard to the question of TVAAS's impeding the use of
formative testing in teaching, I think there is some sort of
misunderstanding.  So let me try again.
     Sherman had said that since TVAAS and the test it utilized
did not give formative feedback in, say, students' long-division
skills,  it did not help teachers know whether their students
were "getting" long division at a time that knowledge would be
helpful to better teaching/learning.  My response was meant to
say that this did not matter --as long as TVAAS did nothing to
actually impede good teaching --in this case, presumably, the use
of formative testing by a teacher in order to figure out whether
students were understanding or correctly doing long-division or
not.  The state does not need to train/foster/force/guide good
teaching techniques; it just needs not to impede their usage.  By
analogy, serving good lunchroom food to teachers does not improve
their skill at learning whether their students are improving at
long-division either; but there are other reasons for serving
good food.  Similarly there are reasons for TVAAS apart from
whether it serves to help teachers identify --as they are going
along-- whether students are learning or not.
      If the test that TVAAS uses measures skill in long-division,
then presumably teachers will have a self-interest in teaching
long division well, and in learning to use whatever methods will
do that most reasonably.  But the state, and certainly TVAAS,
does not need to explain to teachers what those methods are.  Especially if
TVAAS is pointing out which teachers seem to be using good
methods.  An environment does not need to be helpful in ALL ways
as long as it is not harmful in important ways.
      Now, there will undoubtedly be teachers, just as there are
students, who will try to find some shortcut, easy method to get
good superficial results even if that means not doing what is
best in the long run.  But that seems to me to be a system
problem only if the test or the statistical use of the test is so
bad as to allow or reward, and therefore encourage, that.  TVAAS
is even supposed to be able to pick this out, not foster it.  If
Sherman has some reason to believe the test or the statistical
package encourages shortcut, easy-way-out "teaching", then that
is important to know.  But his claim was that testing for ability
in long-division at a time too late to correct inability, somehow
precluded teachers from giving such formative testing in their
everyday teaching.  That just seems clearly false to me.  The two
things are at most unrelated, but I would suspect that generally
something like TVAAS --if the test used is not vulnerable to
teaching "toward" in a superficial way-- would encourage teachers
to want to learn the best teaching methods, not the easiest bad
ones.  For Sherman's point to be successful, it seems there needs
to be some evidence that TVAAS and the test it uses do, or likely will,
adversely affect how teachers teach.  The fact that some testing is
high stakes testing, and the fact that some high stakes testing is
counterproductive gives reason to be initially concerned; but it does not
give evidence that TVAAS will actually be high stakes testing or that, if it
is, it will impede rather than enhance good teaching.
      Finally, it seems to me that the concern that TVAAS
statistics will be misunderstood and misused by administrators,
the legislature, the media, and the public, is best remedied by
informing those groups in ways they understand and can use; not
by trying to eliminate one kind of indicator from being used at
all.  Surely, though they resist it often, the media and the
legislature can be made to understand some things -- and
one of them is that wise judgment needs to be based on all sorts
of available evidence and not just on one statistic.  As I argued
in regard to my baseball analogy of a previous post, ordinary
people do understand, when examples and reasons are given, that
one statistic alone is not likely to indicate the overall value
of a person's contribution to his/her profession.  This is not a
difficult concept to get even a legislator or a newspaper editor
to understand, if it is explained in the right way.  The
legislature may even be persuaded to require certain kinds of
clearly visible, audible, and understandable information about
how to use TVAAS statistics whenever those statistics are
reported in the media.
      I think there are some important concerns about this
enterprise that have not yet been raised, and I intend to raise
them soon if no one else does; but I think they can be remedied
too.  I am not trying to defend TVAAS or the current use of it
against all charges; but I think that, if the statistical
methodology really is reliable and there are tests it can use
which really do give some indication of what students
collectively (or on average) have learned in a content area from
a teacher, and if this information can be gathered relatively
inexpensively and quickly, it is important to use the program and
to simply be vigilant in keeping it from being misused.


From:         Sherman Dorn 
 
In Rick's response recently:
 
>For Sherman's point to be successful, it seems there needs
>to be some evidence that TVAAS and the test it uses do, or likely will,
>adversely affect how teachers teach.  The fact that some testing is
>high stakes testing, and the fact that some high stakes testing is
>counterproductive gives reason to be initially concerned; but it does not
>give evidence that TVAAS will actually be high stakes testing or that, if it
>is, it will impede rather than enhance good teaching.
 
This is an interesting counterpoint to the VARAC's argument that assessment
should be scientific:  don't we require "treatments" to be VERIFIED as both
harmless and beneficial before we let them loose on the general public?
There was no such experimentation with TVAAS (unless people consider it an
ethical experiment to subject several hundred thousand children to the unknown
effects of an untried evaluation system).
 
In reality, of course, there is little opportunity for such experimentation
in policy, and it would misrepresent the purpose of policy (which is often to
create NEW systems -- and they're going to be new for someone at some point).
But I am very concerned by the switch in burden of proof -- Rick is suggesting
that despite evidence of some disturbing effects of high-stakes testing, we
should let systems go ahead unless someone else can demonstrate concrete
effects.  This represents a rather high standard for opposing a new policy --
after all, we can't demonstrate concrete effects until the policy is in place.
 
Also, let me answer Rick's questions about formative versus summative
evaluation with an example:  Suppose Renee Jones uses formative evaluation in
her fourth-grade class for math and reading, and she responds in appropriate
ways (changing instruction when it seems appropriate, either for the entire
class or parts of the class).
the kids know everything fairly well, though three kids have only partially
mastered long division by the time of the annual tests.  Later in the year,
TVAAS gives her mediocre scores in math.  How is she supposed to respond to
this?  Should she decide that she should have concentrated on long division?
Should she concentrate on other skills?  Should she junk the formative
evaluation system, since it obviously didn't help her?  Remember, she has only
three chances to improve her TVAAS scores in math before they can be used in
personnel evaluation, and she has to make the decision for the ENTIRE year.
And, the test for TVAAS doesn't give any extra information.
 
Now, I don't know whether TVAAS is consistent with other measures of school
achievement -- thus far, the only study I am aware of that preceded the state's
imposition of TVAAS was a mid-1980s test of the feasibility of the mixed-
model methodology.  I know of no empirical comparisons between TVAAS
estimates of teacher effects and other measures of student gain.  So let's
choose another, more likely scenario:
 
Renee Jones does *not* use formative evaluation currently.  For the past several
years, she has tried various measures to "improve test scores," as her
principal is constantly saying to the staff at her school.  None have worked, and
as Tennessee started to use the CTBS, her frustration has just increased with the
lack of concrete feedback and the test scores that are at variance with her
admittedly subjective judgment of the children's competence.  Now comes
TVAAS, and the stakes have taken a quantum leap upwards for her.  The first
set of TVAAS scores for her as a teacher this year looks lousy.  She's very
nervous, and doesn't really know what to do.  She's considering several options:
adopting a math program that's patterned on "Hooked on Phonics," with moderately
expensive tapes but all planned out for the teacher; working with other teachers
at her school in an "experiential math curriculum" (which they haven't yet designed);
spending most of her math time in the computer lab to try to teach using the
CAI software she can spend her allotted class material funds for; or investing the
time to learn a formative evaluation of math (which requires a lot of time and m
test scoring and the tests).  Again, of these FOUR options, she
has to choose ONE for a single year.  Also remember, she has a personal history
of trying various measures to improve her test statistics and of the test scores
being at variance with her own judgment of students' competence.
 
It is very likely, in my personal contacts with teachers and knowledge of
research and gut instinct, that Renee Jones will dismiss the formative
evaluation system because (a) she has no reason to believe it will improve her
test scores any more than anything else she has either tried or is considering
trying; and (b) it is resource-intensive, more so than other possibilities.
In this case (and this is assuming teachers are exposed to the idea of
formative evaluation), the high-stakes environment of TVAAS can drive a
teacher away from a way of checking kids' performance that allows MORE
flexibility for teachers.  That is why I described TVAAS as in competition
with formative evaluation, and why I believe states should instead be
explicitly supportive of it.
___________


From:         Alan Davis 

Rick,
Let me try to disentangle some of the arguments about the "political"
nature of the TVAAS evaluation system.
 
My point that evaluation is a political act was not meant to be a
criticism of TVAAS, or to argue that evaluation shouldn't occur.  It was
instead a recognition of the fact that when you decide to evaluate a
publicly funded activity, such as teaching, the act of evaluation has
consequences for the prestige, power, and resources of various actors.
When one thinks about the information generated in the process of
conducting an evaluation, one needs to distinguish the question, "Is this
information accurate?"  from the question, "Is this evaluation valid?"
The latter question requires one to examine the interpretations that will
be given to the data, the interests that will be served, and the
consequences associated with providing limited data addressing a limited
range of questions.
 
There are several problems with the validity of an evaluation system that
compares teachers on the basis of the measured learning of students.
1.  The system only credits learning measured on the test.  It tells
nothing about curiosity, work habits, or the social climate of the classroom.
2.  The test is very limited.  It correlates only moderately with the
quality of students' written work, the ability of students to tell about
what they have read, or the ability of students to conduct
investigations, for example.
3.  Students have little motivation to do well on the test.  It is external
to anything else they do in school.
4.  Once teachers become aware that they are being compared on the basis
of test scores, they are pressured to bring the scores up.  To do this,
they are likely to (a) stop spending time on things that are not on the
test -- things like art, music, science projects, written stories and
reports, and class discussions; (b) provide students more practice answering
all sorts of things in the format of the test -- multiple choice; (c)
teach the particular spelling words and vocabulary words that appear on
the test in place of ones that are not; and (d) grow to hate the test and
respond in a mode of resistance/compliance rather than in a pursuit of
professional excellence.
 
Rick, as one who has in the past argued vehemently against the validity
of multiple-choice assessment to evaluate student learning on this very
list, I have a hard time understanding why you are intrigued by the use
of the same tests to evaluate teachers.  No one attempts to reduce the
contribution of doctors or psychologists to a numerical score.  Teaching
is every bit as multi-faceted and context-specific, yet policy makers
want to find a way to reduce teaching to a bottom line.  I believe there
is no way to do this that is actually beneficial to the education of
children in the long run.
 


From:         Rick Garlikov 

Alan,
    I agree with the concerns you list.  And I have others along those
lines, that I will express.
     What I disagree with is calling what those concerns address
the "political nature" of evaluation.  (My mail got scrambled today,
and I sent a response to Brian's post before getting yours; but I
think it addresses this again.  My point is not against your concerns;
it is against calling the problems you discuss "political".  I don't
think that helps you make your case with anyone except those who happen
to use "political" in this same, unusual way.)
       TVAAS has made it quite clear that they are only providing one
indicator of possible teaching problems or excellence.  And they are
apparently going to some lengths to keep tests from being predictable
in certain ways.  Sherman feels there is reason to believe that in spite
of TVAAS intentions and explanations, forces are at work in the legislature
and in ed administrative offices to misuse the results, considering them
alone; and that teachers will be pressured into teaching in whatever ways
they can "to the tests."  Part of the reason for my wanting to discuss TVAAS
was to see if there were not some ways of articulating what is going on,
what should be going on, and what is likely to be going on, in ways that
will be useful to the teachers and people of Tennessee to try to make
sure there are pressures to have the "right" things going on regarding
TVAAS and teacher evaluation in general.  I AM concerned there is not
enough of the right sort of information being given teachers, administrators,
the media, and the public.  But I think that can all be addressed and
remedied.
       I never thought teachers should be evaluated solely by means of
a multiple choice test of their students.  But I think it may be a good
indicator of where to look for good or terrible teaching to confirm
what TVAAS shows.  When I was an academic adviser to freshmen and
sophomores at Michigan, incoming freshmen were given Benno Fricke's
OAIS test, which measured a number of things which were reported to
advisers.  I never saw one of those tests be wrong on a student, BUT I
also never took any of those tests as a sure thing.  They were merely
to be used as an indicator of student academic interest areas and
of students' motivation and emotional maturity levels.  Low scores were
simply red flags to be alert to certain kinds of possible problems
developing.  As an adviser I wanted to see other evidence and talk with
the students to see whether the OAIS scores might be significant or not.
        The same sort of thing should be done with TVAAS, and there is no
reason that cannot be made clear to everyone.  The OAIS scores were very
helpful to good counselors and just because bad counselors might misuse
them, that does not mean the test should not be given and scores reported.
         The issues that need to be guarded against are those you bring up.
But I think that can be done.  And, if so, then TVAAS can be helpful in
those cases where administrators in the past seem unable to detect or
remedy problem areas on their own.  It would be very difficult for a
superintendent to continue to tell everyone it is not his system's
fault they have poor GAIN scores, if other systems similar to his in all
other respects have good GAIN scores.  Further, one of the things I like
about the TVAAS approach is that culturally advantaged districts with
high achievement scores will not necessarily show the best GAIN scores --and
in areas of Alabama I am familiar with, that would be a great thing, since
many of the supposedly better systems rest on their students' backgrounds
rather than the school districts' contributions to their education.  They
don't help students gain all that much; and that is important to point out.
      Doctors and hospitals ARE partly assessed in certain kinds of numerical
scores --mortality rates being one important one.  But again, these are
used as indicators, not sole factors.
      I am not only opposed to grading students by multiple choice tests;
I am opposed to grading them at all; and to grading them by means of any
formal tests.  I am opposed to grading teachers too.  But TVAAS is NOT
INTENDED to grade teachers or districts; it is intended to point to
probable problems (and probable successes).  I believe in assessment,  not
grades.  And I believe that people have strengths and weaknesses which
can be assessed in various ways.  I try to assess my students' strengths
and weaknesses so that I can help remedy their weaknesses, or so that I
at least present the right sorts of material in the right ways --stuff they
can assimilate and deal with in some feasible ways.
        Education is one of the professions that has seemed rather weak in
setting or maintaining high standards for itself.   Many good teachers
understand the incompetence of some of their colleagues who nonetheless
receive tenure.  Many parents, for good reasons, as well as those parents
with bad reasons, think certain teachers are really incompetent.  There
seems to be a need for some sort of independent indication of who is
doing a good job and who is not -- teachers and administrators.
        Finally, one of the problems I have about testing students summatively
is that it is not quite clear to me that a given test will be very reliable
for a given student on a given day, for reasons I detailed a year ago in a
long series of posts.  But, what TVAAS is doing is weighing the results of
hundreds of student tests involving a given teacher.  It seems to me that
averages will cancel out a number of individual factors, so that teachers
will get some sort of average score, rather than an individual score.  I
don't want to give a kid a grade based primarily on a final exam; but I think
I should be able to judge how well *I* have taught from the collective nature
of 100 or 200 exams.  When the median on the midterm in my second semester
calculus course (when I was a freshman) was 30, out of 84 possible points,
the teachers of the 1500 students in the course knew something was wrong
with how they had all taught that first part of the term.  But it was not
clear that any one student's test score was definitely reflective of what
s/he had learned or could do.
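
The intuition that averages "cancel out" individual factors can be made concrete with the usual standard-error-of-the-mean formula, sd / sqrt(n). The sketch below is only an illustration under stated assumptions: the spread of 15 points for a single student's score is a made-up figure, not a TVAAS or CTBS value, and the formula applies only to errors that are independent from student to student. Anything that shifts a whole class's scores the same way (for example, a test poorly matched to what was taught) is not reduced by averaging.

```python
import math


def standard_error_of_mean(score_sd, n_students):
    """Standard error of the average of n independent scores."""
    return score_sd / math.sqrt(n_students)


if __name__ == "__main__":
    ASSUMED_SD = 15.0  # hypothetical spread of one student's score; not from TVAAS
    for n in (1, 25, 100, 200):
        # The uncertainty of the class average shrinks as the class grows.
        print(n, round(standard_error_of_mean(ASSUMED_SD, n), 2))
```

With 100 students the uncertainty in the average is one tenth of the uncertainty in a single student's score, which is why a class-level figure can be informative even when no individual score is.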
    Anyway, what I am concerned about is ways of verifying TVAAS as a good
indicator of good, bad, or average teaching; seeing what its limitations
are in terms of what it can even identify probabilistically; and then
trying to figure out how to state this so clearly and forcefully that
legislators, media, and the public cannot misconstrue what the TVAAS results
indicate, and how they need to be confirmed or disconfirmed.  All this so
that teachers do not feel threatened by the test itself, nor encouraged to
try to take teaching shortcuts that merely shortchange kids.  It may be
that this cannot be done.  But I don't accept that as the STARTING point
in some a priori fashion.
=========================================================================
From:         Rick Garlikov 
Subject:      Re: TVAAS and Dorn further

In response to Sherman's points.  I was not as clear as I should have
been about the "burden of proof" he discusses.  There are things that
TVAAS has done or tried to do to eliminate the kinds of problems high-stakes
tests cause.  Further, they have tried to be clear in their literature
(though I think this could be much clearer yet) that TVAAS should not be
high-stakes conclusive testing, but merely an indicator of problems.  The
question ought to be whether they have succeeded in keeping the tests
used for evaluation from being high stakes tests or from being able to be
predictably taught "to".  It is not a question of burden of proof; it is a
question of whether the whole program is relevantly similar to the kinds of
high stakes tests, that do cause problems, documented in the literature.
 
As to "Renee Jones", it seems to me the rational thing for her to do would
be to find out who has good test results and go talk with them about what
they are doing that seems to work so well.  It may be they do teach better
than she, and that they can teach her how to do it.  It may be that they
use some trick she feels is unconscionable and short-changes kids.  If so,
TVAAS needs to be aware of this trick that skews their results.  And they
need to pay attention to the person reporting it.
    But it may also be that Renee Jones just isn't a very good math teacher
and isn't likely to ever become very good at it.  Maybe she doesn't really
understand math very well herself.  What then?  Do you want her to keep
teaching math?
     Finally, it boggles my mind that teachers --many teachers-- have no
clue about what I guess is called "formative assessment" which to me just
means doing whatever you can to try to monitor your students' knowledge
and understanding as you go.  I have gone over this ground before, but I
just cannot see how one can be trained as a teacher, certified as a teacher,
hired as a teacher, tenured as a teacher if one does not have at least
some intuitive idea that one needs to monitor what kids are "getting" from
one's instruction so that one can see whether modifications are necessary.
Surely it does not take "resources" to teach this to ed students or for the
state to expect it of teachers.
      Of course there is some question whether what is tested is appropriate --
e.g., long division or whether long division ought to be part of the fourth
grade curriculum.  But TVAAS is set up to test whatever content area is
specified by the curriculum; it is not their job to determine curriculum.
I myself tend to suspect that the math curriculum is not too difficult for
kids if it is taught correctly --which to me involves both practice AND deep
understanding.  But I am not certain primary grade teachers will have
sufficient understanding themselves.  I think finding the proper resources
to teach math well will involve something other than finding resources for
teaching about "formative assessment".  It is about bringing people into
classrooms who have enough math understanding to be able to teach math
decently, even if they have the right teaching techniques and tools.
      Rick Garlikov (dems042@uabdpo.dpo.uab.edu)
=========================================================================

From:         Rick 

Contrary to what Sherman says, teaching is done publicly.  I count kids
as being an audience that sees what is going on.  And parents can observe,
and also talk with teachers.
 
The important point he mentioned, however, is about ballplayers
negotiating the stats that will be relevant to their contract bonuses, etc.
That is only true within relevant limits -- generally the stats used have
to be something that seem relevant to accepted productivity.  Once in a
while some screwy stat clause gets in and actually occurs, but generally
the stat clause is about some obviously relevant achievement.  And though
teachers individually don't have the same luxury of contract negotiations
baseball players do, they often have great collective clout both politically,
and professionally --professionally where they help set standards that
schools, teachers, and districts should be measured against.  Although
teachers sometimes fall victim to having irrelevant standards placed on
their profession by outsiders, that is not always the case; and it doesn't
have to be the case at all, I don't think.

From:         "Fetler, Mark" 
 
     Rick, you said -
 
     By using the word "political" to characterize the nature of the
     technical aspects of TVAAS, or any other
     scientific/mathematical/statistical sorts of assessment
     tools/indicators, you undermine your credibility with
     a government and public who believes, and I think rightly so, that
     surely there are some sorts of objectively recognizable contributions
     that teachers make or don't make to students' educations.  To claim
     that all teacher evaluation, by its very nature, is "political" SEEMS
     to make the claim that there are no objective standards for
     discussing/judging teaching; and it makes it sound as though you are
     arguing that any judgment of teaching ability is worthless and
     unnecessary -- that no teacher is REALLY good or bad; they are all
     just popular or unpopular, so there is no fair way to judge any of
     them.
     ---
     In reply-
 
     Politics is not intrinsically good or bad. It is just the practice of
     resolving disputes or making decisions about the distribution of
     limited resources. I think what distinguishes evaluation from many
     other types of more basic or pure research is that evaluation is
     intended to serve policy makers. The decision to use evaluation is
     political, particularly if it affects policy decisions or the support
     for those decisions. Of course, there are always moral, ethical, and
     relationship issues around the exercise of power. However, often one's
     perception of these issues is colored by the degree of benefit
     received. All this is true, in my opinion, even if the evaluation
     conforms to the most rigid technical standards and is "cleaner than a
     hound's tooth."
 
     As to objective measures of teachers' contributions. I would submit
     that a really well designed, administered, scored, and reported
     achievement test is an objective measure -- of students' performance
     in answering certain questions. How that measure relates to teaching
     ability requires additional information, for example, about the
     students' abilities, opportunity to learn, etc.
 

From:         Alan Davis 

Rick,
First, I want to concur with Fetler and Stecher regarding the meaning of
"political."  You wrote that "In politics, people often use data to
deceive" and went on to describe how politicians might intentionally
mislead people through a selective interpretation or mis-interpretation
of objective data.  But as Mark and Brian pointed out, those of us who
swim in the seas of policy analysis do not think of "political" as either
good or bad.  The term refers to processes affecting the ability to
influence decisions -- processes having to do with power, in other
words.  Evaluation cannot escape being part of the political
process.  Even the decision to evaluate has political consequences, even
if no one ever reads the evaluation.
 
You long to introduce information into a politically neutral world, it
seems to me, when you write, "Anyway, what I am concerned about is ways
of verifying TVAAS as a good indicator of good, bad, or average teaching;
seeing what its limitations are in terms of what it can even identify
probabilistically, and then trying to figure out how to state this so
clearly and forcefully that legislators, media, and the public cannot
misconstrue what the TVAAS results indicate, and how they need to be
confirmed or disconfirmed.  All this so that teachers do not feel
threatened by the test itself, nor encouraged to try to take teaching
shortcuts that merely shortchange kids."
 
It cannot be.  As soon as I publish a list of teachers with their
adjusted gain scores and my "clear and forceful" caveats about what these
mean, some school board member will say, "These teachers at the bottom of
this list should be put on probation" and a good many teachers will feel
threatened by the test itself.  I cannot conceive of any evaluative use
of this information that would not put teachers under pressure.
 

From:         Rick Garlikov 

I have a few questions about TVAAS results, and the way scores
are used and reported.
 
      According to TVAAS results, "student gains were not related
to the ability or achievement levels of the students when they
entered the classroom."  There are a number of possible causes
for this, some good, some not so good.  And I wonder whether
TVAAS has any evidence which causes operate, and to what extent.
      First of all, the evidence is somewhat counter-intuitive for
individual students at least, if not for groups of students.  It
would seem that a "bright" kid with a good background in
language, reading, math, etc., curiosity, and motivation would
learn a lot more than a kid with a disadvantaged background who
is also perhaps a bit slower to pick up new concepts or
understanding.  Is the TVAAS result dependent in some way on
averages for groups rather than direct comparisons of the
"brightest high achievers" versus an average or "below average
student"?  That is, do a lot of bright high achievers lose
motivation or slow down on gains for some other reason, in order
to pull the "high-achiever" average down?
      Or are you saying that with equal or really good teachers
lower ability/achievement students can gain in some sense
actually or proportionally as much as higher ability/achievement
students?  That a student who comes into third grade being facile
with multiplication tables will not progress much further into
math (e.g., division, long division, factoring, combining
fractions through lowest common denominators, etc.) than a
student who enters the third grade with no good grasp of either
multiplication facts or an understanding of what multiplication
is about?
      Of course, one, I think bad, way this can happen is if high
ability/achievement kids are ignored and simply not taught to
their potentials, while teaching time, energies, and lessons are
devoted primarily to lower ability/achievement students.  This is
the way many teachers teach, in fact, and it is what some
school districts seem to promote.  Higher ability/achiever
students are left to learn whatever they do on their own
essentially, so their "gain" relative to their potential gain is
low.  This philosophy often stems from a reaction to an also bad,
previous, philosophy where resources were channeled in helping
the PERCEIVED best and brightest academically while letting other
students slide more than would be helpful.
      Which brings me to the question of whether the notion of
"norms" does not serve as somewhat of a false standard for
students of all levels of ability/achievement.  Shouldn't the
"standard" reflect, not some sort of mean, but some sort of
reasonable potential, as gauged perhaps by what the best
teachers, schools, or districts do, not what the average does?
So that if one school or district consistently has significantly
higher gain scores (along with other evidence that school is
doing much for its students) with, say, the same relative budget,
isn't that evidence such teaching is possible and that it ought
to be the goal to aim for?  Perhaps research evidence might
demonstrate an even higher level to be reasonably possible under
the typical school conditions.  Shouldn't there be some sort of
distinction made about how far from what is possible a school,
district, or teacher does, not just how far from the average they
do?
      Of course, anyone can compute how far below the current best
they are in regard to these particular gain scores, by
subtracting; but I am not certain people bother to do that, or
that if they do, they care about the result.  If it were a
reported number, it might be more of an incentive for schools to
improve instead of trying to maintain or achieve merely the
status quo average.
      I raise this issue because I am concerned about
teachers/parents/administrators having unreasonably low
expectations.  I am aware that there may be "curve setters" based
on additional resources or circumstances not likely to be
replicable in other systems.  I am not after a standard that is
out of reach, but the highest reasonably, or probably, attainable
standards.

From:         Gene Glass 
Subject:      TVAAS, Bright Kids, Gains, Fairness.
 
On Tue, 10 Jan 1995 14:28:52 CST Rick Garlikov said:
>I have a few questions about TVAAS results, and the way scores
>are used and reported.
>      According to TVAAS results, "student gains were not related
>to the ability or achievement levels of the students when they
>entered the classroom."  There are a number of possible causes
>for this, some good, some not so good.  And I wonder whether
>TVAAS has any evidence which causes operate, and to what extent.
>      First of all, the evidence is somewhat counter-intuitive for
>individual students at least, if not for groups of students.  It
>would seem that a "bright" kid with a good background in
>language, reading, math, etc., curiosity, and motivation would
>learn a lot more than a kid with a disadvantaged background who
>is also perhaps a bit slower to pick up new concepts or
>understanding.
 
  Indeed, Rick's sense that something is not right here is quite
  understandable.
 
  I originally asked TVAAS (whoever it is we are talking to) how it could be
  fair to compare the gains in achievement of students from one teacher to the
  next if the abilities (on average) of the students in the classes differed
  substantially. To this question, TVAAS long ago gave the incredible answer
  that it was an empirical fact that in their data there is no difference
  in the gains of bright children and slow children. Rick quoted the
  relevant sentences above.
 
  This artefact is not because bright and slow children actually learn at
  the same rates; it is because the TVAAS system of calculating gains
  surely uses a form of least-squares estimation that forces the pre-year
  measures to be uncorrelated with the post-year measures.
 
  I can take a group of IQ equal 120 kids who gain 1.5 years in grade
  equivalent units in a school year and a group of 80 IQ kids who gain
  .75 grade equivalent units in a school year and calculate the
  residualized gain score for the entire group and as sure as God made
  little green apples, the "gain" score will be perfectly uncorrelated
  with the pretest score. This is no paradox, nor does it make much sense
  as a measure of true gain.
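
The arithmetic behind this claim can be checked with a small simulation. What follows is only a minimal sketch: the pupil counts, starting levels, and noise values are invented for illustration, and the residualized gain is computed by ordinary least squares, which is what Glass supposes TVAAS does rather than anything TVAAS has confirmed. The names `pearson` and `residualized_gain` are hypothetical helpers.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residualized_gain(pre, post):
    """Residuals from the least-squares regression of post on pre."""
    mx, my = statistics.fmean(pre), statistics.fmean(post)
    sxy = sum((a - mx) * (b - my) for a, b in zip(pre, post))
    sxx = sum((a - mx) ** 2 for a in pre)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(pre, post)]


random.seed(0)
pre, post = [], []
# Invented groups: (mean starting grade equivalent, mean true gain in a year).
for start, gain in ((4.5, 1.5), (3.0, 0.75)):
    for _ in range(200):
        p = random.gauss(start, 0.5)
        pre.append(p)
        post.append(p + random.gauss(gain, 0.3))

raw_gain = [b - a for a, b in zip(pre, post)]
resid_gain = residualized_gain(pre, post)

r_raw = pearson(pre, raw_gain)      # raw gains track the pretest
r_resid = pearson(pre, resid_gain)  # residualized gains cannot, by construction

print(f"correlation of pretest with raw gain:          {r_raw:.3f}")
print(f"correlation of pretest with residualized gain: {r_resid:.3e}")
```

In this sketch the raw gains are built to differ between the two groups, yet the residualized gain is uncorrelated with the pretest up to rounding error. That zero is a property of least-squares residuals, not a finding about how the pupils learned.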
 
  I continue to believe that TVAAS has no fair way of equating the teachers
  who are compared with respect to the ability of their pupils.
 
  Rick, you are right. TVAAS did not give an answer that was responsive to the
  question, and thus succeeded for a time in finessing an important criticism
  of the system.
=========================================================================

From:         Rick Garlikov 

Alan Davis said:
>It cannot be.  As soon as I publish a list of teachers with their
>adjusted gain scores and my "clear and forceful" caveats about what these
>mean, some school board member will say, "These teachers at the bottom of
>this list should be put on probation" and a good many teachers will feel
>threatened by the test itself.  I cannot conceive of any evaluative use
>of this information that would not put teachers under pressure.
     You are saying school board members will ignore the caveats and will
not be challenged by anyone in doing so.  I would think that if the caveats
ARE clear and forceful, it is likely there would be a strong challenge if
school boards were ignorant enough to act in the way you predict.
 
>
>On the other hand, I don't think any of us who are complaining about
>TVAAS would argue that teaching should not be evaluated.  How best to do
>that is an important discussion to have.  I am arguing that the
>evaluation should not involve students' gains on standardized tests, even
>though I believe that student learning is what teaching is mostly about.
        That is an argument worth pursuing; and I hope we can pursue that.
As I understand the reasons for that though, they are that in high stakes
testing, teachers tend to teach "to" the tests, and in doing so take
shortcuts, leave out other important elements of education, etc.  But
as I said in response to Sherman, since TVAAS and/or TCAP has taken steps
(1) to discourage teaching to the test by changing the test each year, and
(2) to prevent this from being high stakes testing, by making clear it is
only one form of indicator or evidence about the quality of teaching, and
that in content areas alone, it seems to me the question is whether they
have succeeded or not, or could.
    Why do you think they have not succeeded or that they cannot succeed?

From:         Rick Garlikov 
 
Mark,
     TVAAS claims to be able to isolate learning from the sorts of
factors you mention --ability, opportunity to learn, etc.  The question is
whether this claim will stand up.  But it is not that TVAAS has ignored
the sorts of factors you mention; and it is not that they were unaware
of their potential influence.  The only question in THIS regard is whether
they have satisfactorily accounted for these factors or not.

From:         Sherman Dorn 
 
Rick Garlikov writes:
 
>    You can call looking for, and trying to compile and describe, relevant
>data a political act; but that just confuses a whole bunch of things under
>a designation that applies in only the most superficial way, if it can be
>said to apply at all.
 
Rick, am I incorrect in reading your argument as follows?  "It is unfair to
call statistics political when people try to gather them in good faith;
things that are political are manipulated."
 
I think this is a narrow reading of politics, and it misses the point that often
actions have political assumptions precisely WHEN people act in good faith.
Good faith judgment, in my view, is irrelevant to whether there is a politics
of statistics.  William Alonso and Paul Starr edited a book, _The Politics of
Numbers_, several years ago, which discussed the production of government
statistics, their political assumptions, and implications.  I remember some
discussion of the political assumptions of some government statistics, but
I don't recall either a conspiracy theory or a description of much crass
political motivation.



From:         Greg Camilli 
 
I am reposting an earlier message regarding TVAAS. I hope these
questions will fetch responses, though I suspect the TVAAS staff
are pretty busy.
 
First, it was indicated earlier that the norm-referenced
items used by TCAP are somehow related to the CTBS/4.
In this regard, I'm wondering if items are sampled from
the CTBS, or whether new items are being written at every
assessment. The latter is suggested by the phrase
"fresh, non-redundant, equivalent tests."
 
Second, a number of different metrics are available from CTB:
is TVAAS using the IRT (developmental) score scale which
was established with national norms? Is there a document I
could read to get the specifics of how the tests are constructed
and normed? If so, it would really save a lot of time.
 
Third, I'm wondering if my interpretation of the missing data
procedure is correct. I understand that multiple imputation
does not affect the values of "effect" coefficients. The models
and methods used to obtain these bypass the estimation of individual
scores. However, multiple imputation increases the standard errors
of the estimated coefficients -- this simply reflects the notion
that with less than complete data, less is known about the parameter.
Thus, multiple imputation is a post hoc adjustment to estimation.
However, with multiple imputation one can always produce a
"complete" set of data (where imputed values have replaced missing
values) to expedite reporting or secondary analyses. (Please do not
hesitate to correct errors in this formulation.)
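
For readers unfamiliar with the procedure Camilli describes, the sketch below shows the standard combining rules for multiple imputation (Rubin's rules): the pooled estimate is the average of the estimates from the completed data sets, and the pooled variance adds a between-imputation component to the average within-imputation variance. The five estimates and standard errors are invented, `pool_rubin` is a hypothetical helper name, and nothing here is taken from TVAAS, whose staff describe their own handling of missing scores differently elsewhere in this discussion.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine m completed-data analyses with Rubin's rules.

    estimates: the effect estimate from each imputed data set
    variances: the squared standard error from each imputed data set
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # spread across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5


# Invented results from five imputed data sets.
estimates = [10.2, 9.8, 10.5, 10.1, 9.9]
std_errors = [1.0, 1.1, 0.9, 1.0, 1.0]

pooled, pooled_se = pool_rubin(estimates, [s ** 2 for s in std_errors])
average_within_se = statistics.fmean(s ** 2 for s in std_errors) ** 0.5

print(f"pooled estimate:      {pooled:.3f}")
print(f"pooled std. error:    {pooled_se:.3f}")
print(f"within-only std. err: {average_within_se:.3f}")
```

The pooled standard error is never smaller than the within-only figure, which matches the point that less is known about the parameter when data are incomplete.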
 
Finally, I hope that Harvey Goldstein can find time to study and
comment on the standard errors reported to us by Rick. Based on his
previous post I expected to see larger values. BTW, I've been
wondering what the standard errors mean. Usually, I have in mind
that a sample is drawn from a population, and an effect (say gain
score) is estimated from the sample data. The standard error then
conveys how precise this estimate is (much like the "margin of
error" that pollsters use). For TVAAS, what are the sample and
population?
 
 

From:         Rick Garlikov 
 
 Sherman Dorn said:
>Rick, am I incorrect in reading your argument as follows?  "It is unfair to
>call statistics political when people try to gather them in good faith;
>things that are political are manipulated."
 
No, that is not what I mean.  I am not talking about whether it is fair or
unfair to call something political.  I am merely trying to capture the
standard sense of distinguishing political acts from non-political ones.
Politics tends to involve promoting an ideology or vested interest, often
through some sort of position of power.  It does not have to involve deceit;
and I am sorry I mentioned that aspect of it before.  Politics is generally
distinguished from (attempted) dispassionate, reasoned, unbiased judgment.
 
This distinction is embodied in such dichotomies as "political" vs "non-
political" appointments; or in saying that the courts are supposed to be
a non-political part of government.  Or in distinguishing between political
trials and other sorts of trials, even though the non-political trials may
also have ramifications for many people's lives.  It is supposedly a bad
thing to "politicize" the process of appointing Supreme Court justices, and
Democrats and Republicans both often agree someone is qualified to serve and
is not a political nominee.  Of course, many supposed
non-political things are actually political; and many politically appointed
judges tend to ignore their own political views --often disappointing those
who appointed them to the bench.  The difference has to do with whether one
is pursuing a bias or whether one is open to evidence and reason in some
at least relatively unbiased way.
      The kind of things you, Mark, Brian, and Alan are discussing are
value-laden, of course, but being value-laden is not necessarily the same
as being political.  A scientist trying to find an answer to some problem
may have to choose among a number of research avenues; and s/he may make
an educated guess of some sort about which might be the most beneficial
to pursue.  Ordinarily, merely in the confines of the lab, that kind of
choice is not a political choice, even though it may involve expenditures
of funds, directions of lots of people, etc.  It is not the same kind of
thing as choosing to pursue a line of research because that is where
government funding is easy to secure or because it leads to promotion
and tenure at the university, fame, or whatever.  These latter I take to
be political considerations about what research to pursue.
     My point is simply that the average citizen makes a distinction
between "political" decisions and "non-political" ones, and even though
there may be some reason to argue his distinctions are sometimes fuzzy or
erroneous, still there seems to be some sort of clear-cut cases; and if
you say something is "political" in policy-analysts' technical sense of the
term that is not political in the ordinary sense of the term, people will
tend to dismiss what you are saying.


From:         Leslie McLean 
 
The purpose of this post is to focus on two aspects of the TVAAS that I feel
have received too little attention: validity and standard errors.
This is not to say that the political nature of any evaluation is not
important or to take anything away from the discussion of formative vs.
summative evaluation.
My other point will be that we are getting mixed messages as to the purpose
of TVAAS.  Since I started to put this collection together, Greg Camilli has
posted some related questions.
 
What follows is a selection from several postings concerning the TVAAS
((Alan Davis, Rick Garlikov, TVAAS, ...)), with comments from Les
McLean added.  Les's comments are enclosed in double parentheses: ((...)).
Emphasis added by Les is signaled by   *** ... ***
 
((Selections from Alan Davis's post of Wed. January 4.  He amplified these
on Jan. 9))
"The main problem, I believe, is that the multiple choice measures used in
Tennessee are at the outset poor proxies for the tasks we really want kids to
be able to perform in school, and when we attach consequences for teachers
around scores on these measures, very bad consequences are likely to
result.
 
Note first, that performance on the measures valued by TVAAS is not
valued by students.  The tests are not part of any instructional unit.
No grade is associated with performance on them.  Students have no
intrinsic or extrinsic motivation to perform well on them, especially if
they don't like mental puzzles for their own sake.  *** The problem, which
I do not believe can be avoided, is that the validity of tests is inseparable
from the uses to which they are put.  A test that may be valid as a research
tool to discover effective teaching will lose its validity when teachers
suspect that the outcome has consequences for themselves. ***"
 ((emphasis added--and stressed Jan. 9))
 
 
((Date: Wed, 4 Jan 1995 20:52:42 CST
From: Rick Garlikov))
 
"TVAAS has already said in a previous post that their
results are supposed to be used as INDICATORS that something is
very good, or very wrong, about a school/district/teacher, not
conclusive evidence.  That leaves room for the kind of evidence
to be given about the merit of schools/districts/teachers which
IS reasonable to educators -- a concern that Sherman expressed in
explaining (1).  ((Emphasis in original))
 
.....From what has been said so far, TVAAS tries to ascertain
how well certain subject content areas --as prescribed outside of
TVAAS  ***are being taught,***  not whether those subject content areas
are important or whether "knowing" them in certain ways is
important."
 
 ((Emphasis added--this is the clearest statement I could find that they
equate quality of teaching with scores on standardized tests, scores that
have been scaled and transformed in unknowable ways--see remark from TVAAS
below.))
 
((Rick goes on to make it very clear,)) ".....I may be misunderstanding the
point of TVAAS, but I thought it WAS established to give summative
evaluations about teaching competence/progress, not formative evaluations
about improving instruction. I take it that TVAAS is not about helping
teachers/schools/districts do their jobs, but about suggesting, in some
attempted objective way, to everyone how well or poorly they have done
it."
 
((But then he seems to change his mind,)) ".....TVAAS is a statistical
model for program evaluation.  It is not a "technocratic tool," as Dorn so
colorfully phrases it."  ((When most people refer to program evaluation,
they include content choice, teaching materials and teaching methods, and
they distinguish this from 'teaching competence/progress'.  Given the
complexity of the 'model', Sherman Dorn ought to be allowed to refer to it
as a 'technocratic tool'.))
 
 ....It was not developed by the State of Tennessee but by Dr. Bill Sanders,
a statistician, for the purpose of addressing problems previously
encountered in using student achievement data in educational assessment.
 
The State of Tennessee adopted it as the model for such assessment
because TVAAS was able to supply valid, reliable, unbiased data based on
student gains, data the State thought was important.
 
.....TVAAS merely provides measures of student academic gains, surely a
useful component of any educational assessment system.
.....TVAAS is a sophisticated statistical process ***that is beyond the
capacity of many to grasp mathematically.***  However, TVAAS data is
reported in a manner that is comprehensible to any educator willing to
look at the reports and graphs supplied, with explanations, specifically
for the purpose of rendering the data USEFUL. Judging by reports from
many schools and systems across Tennessee, TVAAS data is now
extensively used for curriculum planning and development.
 
 ((Yes, the mathematics is beyond most people's capacity--including most
of the officials in Tennessee.  How much beyond cannot be determined until
they publish a full technical report--such as the ones ETS publishes about
its surveys.  The term, "mixed model" is quite inadequate to describe the
scaling and multilevel model fitting that is going on.  BUT--the data are
being used for 'curriculum planning and development'?????  THIS
BRINGS ME TO MY MAIN POINT: what evidence is there that the "data"
resulting from these complex statistical manipulations can be interpreted as
INDICATORS of competent teaching--much less the quality of the
curriculum?  From the TVAAS, we learn of an even more ambitious
outcome--they provide schools with gain scores FOR INDIVIDUAL
STUDENTS.  See what they say:))
 
 TVAAS provides breakdowns of student gain data to schools and
school systems by school, grades, subjects, and achievement levels of
student (this last, upon request from the Department of School
Accountability).  With all this information, schools and systems can
easily pinpoint problems and successes and make specific policy decisions
based upon this knowledge. The data provide guidance as to "What we do
now."
 
......As we have repeatedly told Dorn, TVAAS does not lock the state into
any testing program whatever.  TVAAS can use any assessment data that
provides appropriate linear measures.  In other words, TVAAS requires
scalable data because it assesses progress over time.
 
((Appropriate linear measures? ... scalable data?  To anyone familiar with
test theory since Lord and Novick (1968), these phrases suggest a very
considerable locking in, even given Geoff Masters's extensions to IRT.
The numbers the TVAAS are reporting back to schools are a long, long way
from the item responses of the students to these multiple choice questions
that Alan Davis quite rightly characterises as, "poor proxies for the tasks
we really want kids to be able to perform in school".  Even if we accept the
items as valid indicators, we are entitled to ask for some careful checks
that the scaled and imputed and estimated scores really do, "provide
guidance as to 'What we do now'".
        According to Sherman Dorn (8 Jan. Post), ))
 
"Now, I don't know whether TVAAS is consistent with other measures of
school achievement -- thus far, the only study I am aware of that preceded
the state's imposition of TVAAS was a mid-1980s test of the feasibility of
the mixed-model methodology.  I know of no empirical comparisons
between TVAAS estimates of teacher effects and other measures of student
gain."
 

 
             ((Concerning the size of the standard errors))
 
Harvey Goldstein stated on Dec. 19, 1994, "..it turns out that
these [standard errors] are typically so large that you cannot make
any statistically significant comparisons between most of your
schools...only those at opposite extremes of a ranking.  Is this
also the case in Tennessee?  If so what do you do about it when
reporting?"
 
     Below are listed the mean gains for math with their standard
errors for schools within one of the larger school systems in
Tennessee.  These means are three year averages and were calculated
from the TVAAS mixed model process.  This should give an idea of
the sensitivity of the process.
 
((A very few of the results are listed
below for reference--the heading (Grade ...) has been edited to make sense.
I echo Greg Camilli's wonderment at the size of these 'Mean std. Err.'s.
Including all data from several years and using imputation procedures for
the (inevitably) missing numbers all lead to larger standard errors.))
 
                INTERMEDIATE SCHOOLS
 
        GRADE     RANK     GAIN     MEAN STD. ERR.
 
          3         1     71.596        4.931
          3         2     71.205        3.714
          3         3     67.038        2.624
                ........................
 
 
                   MIDDLE SCHOOLS
 
          6         1     22.624        1.043
          6         2     20.176        0.824
          6         3     15.602        1.07
          6         4     13.943        1.099
          6         5     13.152        1.286
          6         6     12.521        1.051
          6         7     11.146        0.961
          6         8      9.897        1.194
          6         9      9.362        1.34
          6        10      8.465        0.998
                ........................
 
          8        12     13.55         6.605
          8        13     13.261        1.738
          8        14     11.08         0.98
                ........................
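
Goldstein's question about which schools can be told apart can be put to the grade 6 figures listed above. The sketch below is one rough way to do so, not the method used by TVAAS or by Goldstein: it treats the reported means as independent, approximately normal estimates, takes the reported standard errors at face value, and makes no allowance for multiple comparisons. The helper name `z_difference` is hypothetical.

```python
import math

# (rank, mean gain, standard error) for grade 6, copied from the listing above.
grade6 = [
    (1, 22.624, 1.043),
    (2, 20.176, 0.824),
    (3, 15.602, 1.07),
    (4, 13.943, 1.099),
    (5, 13.152, 1.286),
    (6, 12.521, 1.051),
    (7, 11.146, 0.961),
    (8, 9.897, 1.194),
    (9, 9.362, 1.34),
    (10, 8.465, 0.998),
]


def z_difference(a, b):
    """Difference in mean gain divided by the standard error of the difference.

    Assumes the two estimates are independent, so their variances add.
    """
    (_, gain_a, se_a), (_, gain_b, se_b) = a, b
    return (gain_a - gain_b) / math.hypot(se_a, se_b)


# Compare each school with the next one down the ranking.
adjacent = [(a[0], b[0], z_difference(a, b)) for a, b in zip(grade6, grade6[1:])]
for rank_a, rank_b, z in adjacent:
    print(f"rank {rank_a:2d} vs rank {rank_b:2d}: z = {z:5.2f}")

# Compare the two ends of the listed ranking.
z_top_bottom = z_difference(grade6[0], grade6[-1])
print(f"rank  1 vs rank 10: z = {z_top_bottom:5.2f}")
```

Under these assumptions, an absolute z above roughly 1.96 corresponds to a difference significant at the two-sided 5 percent level. Whether the reported standard errors themselves can be trusted is a separate question that this calculation cannot answer.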
 
((The problem for those of us who have calculated, pondered and puzzled
over such results as these, in national and international assessments,
is that
the reported standard errors are unbelievable (impossibly small).  We can't
say they are wrong, of course, because we lack the details of the
calculations, but Harvey Goldstein has analyzed at least as much data and
written several books and taken the lead in multilevel modeling (sometimes
called, by others, hierarchical linear modeling), and his informed and
experienced "opinion" is not to be taken lightly.  The standard errors remind
me of those Richard Wolfe found faulty in the first International Assessment
of Educational Progress--the fault being that the estimates of error
failed to include all the components reasonable people agree should be
included.  Moreover, the Std. Errors above are clearly proportionate to the
mean scores, not a desirable outcome.  There must be at least one error (three
lines from the bottom of those displayed above).  I, too, will leave to later a
comment on the statement below from TVAAS, except to say that whatever
it is they do is not "certainly sufficient":
 
  "...Additionally, there was a question concerning how TVAAS deals
with missing scores.  We will write a longer response to this
question later.  But, briefly, it is more like the analysis of
repeated measures.  However, we do include all of the scores among
subject-grade combinations.  This is certainly sufficient and
avoids the issues and problems associated with imputations."
   - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 

From:         Leslie McLean 
 
   Your questions are much appreciated, Rick, as they bring out unclear
points and (we hope) useful elaborations.
   1. The problem with pairing student achievement with teacher
competence is that the teacher, or even two teachers in succession, can
only build on what the students bring to the teacher's class, and then
they only add a small amount to the total experience of the students.  In
short, it is simply not fair to judge a teacher by what the students
learn in a year unless you are able to bring in lots of contextual
information, not all of it quantifiable.  This argument has been made in
many places and many ways, so I only espouse it.  I conducted a
province-wide (read: state-wide) survey of student achievement in
mathematics and English in 1981 (mostly constructed-response items,
thousands of them) in Ontario, and I produced distributions of classroom
mean achievement, suggesting that as many as 20 percent of teachers were
in need of some help.  BUT my contract specified that the teachers not be
"identifiable" (not just not identified--not identifiable), a sensible
provision in view of the large-scale nature of the survey--a sample, but
covering a larger area than Tennessee, and at least as diverse.  We are
not unaware of statistical developments here in the Great White
North--just skeptical when the mathematics cannot be understood except by
a half-dozen people who have not been in a classroom in more than 20 years.
2. What I am arguing is that whether the "math is sound" is not a matter
of correct calculations.  These multi-step scaling and estimation
procedures involve important substantive decisions at every
step--decisions often made by well-meaning programmers or statisticians
unfamiliar with the context from which the item responses came.  The
programmers and statisticians cannot ask the educators because the math
is so abstruse.  In summary--the numbers TVAAS reports are so far
removed from the test items that students answer that NO ONE can be
very sure the numbers correspond to any reality.
   It is this technical gap we worry about and ask to be bridged, and we
see no bridge.  Until there is a bridge, we cannot discuss how
appropriate the process is.
=========================================================================

From:         Leslie McLean 
Subject:      Les stands corrected
 
  Apologies to Rick for attributing to him some text from TVAAS.  I knew
when I was putting it together that this was a danger, and I was not
careful enough.  Mea Culpa.


From:         Rick Garlikov 
Subject:      Your TVAAS post
 
  Les,
I was glad to see your post.  Just wanted to make one small correction
first.   You accidentally confused one of my own posts with one that
was passed on by me from TVAAS.  I am in the odd position of having
my name and return address on all the posts they send me to forward, so
some of them are bound to be seen as MINE by accident.  The part right
after where you said "((But he seems to have changed his mind)) ....TVAAS
is a statistical model for...." is from them, not me.
 
However, I am not quite sure about the point you want to make regarding
the apparent inappropriateness of assessing teachers by their students'
progress on certain kinds of multiple choice types of tests.  Ignoring for
the time what "program evaluation" tends to cover normally, isn't there some
point in trying to assess how well students are in general learning the
specific kinds of content "knowledge" the State has already said they want
taught?  I see TVAAS as supposedly trying to measure how well schools, teachers
and districts have done in helping kids "gain" increased knowledge or skills
over the course of a year.  I cannot comment on the reliability or soundness
or meaningfulness of the mathematics of all this, but it seems to me that
if the math is sound, that the concerns Alan, you, and Sherman have are not
with TVAAS as much as it is with the kinds of tests they are being asked to
evaluate and report on.  TVAAS seems to me to have been quite clear about
the possible limitations of those tests in terms of "general education" of
students and other aspects of schooling.  Theirs seems to be a quite
narrow focus; and one that is reasonable.  The problem is in making that
either a prime matter of assessment of teachers, or if Alan and Sherman
are right, ANY part of the assessment of teachers.  I tend to think it
is a reasonable PART of the assessment of teachers.  But it surely should
not be the only aspect of teacher evaluation; and TVAAS says that it should
not.    Do you disagree with their and my view of this?  Do you think there
is no place for standardized test results about certain parts of the
curriculum in seeking INDICATIONS of teachers' competence in those areas?

From:         Gene Glass 
Subject:      Re: TVAAS and norms -Reply
 
On Fri, 13 Jan 1995 07:55:49 -0600 Richard Swerdlin said:
>    Relatedly, an elementary principal in East St. Louis, Illinois
>once told me that he felt guilty over doctoring achievement test
>results (Stanford Achievement Test).  He was fearful of possible
>criticism, if results were below expectations.  Occasionally such a
>case does surface in various parts of the country.
>In conclusion, it is possible that the Texas news item is applicable
>elsewhere too.
 
   More than merely possible...a predictable consequence of these "high stakes"
  systems. When a large district in Texas--some 10 years ago--instituted a
  scheme of paying principals bonuses (up to $15,000) for test score gains,
  a couple of principals were discovered to have doctored their results.
  Large gains on the target ("high stakes") test were common while no gains
  on closely related tests were also common throughout the district.
     These are demeaning situations in which to place educators.
 
     Would those who administer the TVAAS program be willing to stake their
 yearly raises on a "significant" score gain on National Assessment for
 the state of Tennessee? Or does this kind of "accountability" only apply
 to those under us?
 
     The effects of these high stakes arrangements on teachers and
 curriculum have been much studied, and they are as one would expect:
 debasing the curriculum (as judged by teachers and educators), transformation
 of inquiry into "drill and kill," teaching the test and sometimes worse.
 
     These are difficult matters to talk about because the effects of these
 programs place many teachers in circumstances in which they act
 unprofessionally. Then the politicians feel that their suspicions that
 teachers are unprofessional hacks are confirmed and they institute even
 further demeaning controls.
 
     It is a sign of the political powerlessness of teachers that these
 "high stakes" testing systems exist. They are not applied to lawyers; they
 are not applied to physicians; they are not even applied to college
 professors.
 
     And on this question of power plays. Does anyone else feel demeaned as
 I do by this arrangement that TVAAS has sucked us into for discussing
 these matters with them? We don't speak directly to them; our messages
 are carried to them as if to the great Wizard of OZ because they are too
 busy to be bothered by discussion--as if we who work here are mere slugs
 with nothing better to do with our time than nettle these great
 benefactors of society. There is greater combined statistical expertise
 among the participants in this forum than on the TVAAS staff, and those
 who are on the TVAAS staff have a professional responsibility to the
 discipline that extends beyond intermittent puffery and bluster.
 

From:         Harvey Goldstein 
 
I see that Les McLean and one or two others have taken up my
query about how well schools can be separated taking the
estimated standard errors into account. I don't yet know how
the standard errors have been calculated, but based upon a
table Sandra Horn sent me, I would say that the results (e.g.
for grade 3 based upon a 3 year average in one of the larger
school systems) are in line with our own results. What you do
(roughly) is multiply each standard error by about 1.5, use
this to place an interval (i.e. +-1.5 s.e.'s) about each gain
estimate and judge whether two schools are significantly
different at the 5% level by whether or not the intervals
overlap. Most intervals do! BUT if you average over 3 years
then you get smaller standard errors so fewer do. A particular
problem with averaging over 3 years is that the data are even
more out of date than using the latest 1 year data since they
refer to a cohort who started x years earlier where x = years
between intake measure and output measure + 2. Is such
historical data of great use? One should at least be measuring
trends over time. My own view is that value added procedures
may have some use as a crude screen to detect highly
'overperforming' or 'underperforming' schools etc but are not
diagnoses - such definitive judgements can only be made by
studying what goes on in schools in more detail.
A paper is shortly to appear in J. Royal Statist. Soc. A,
which gives details of the interval setting procedure outlined
above. Anyone who wants an advance copy please send me their
FAX number.
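
[Illustrative sketch, not part of the original post. It applies the
rough procedure Harvey describes (put an interval of about +-1.5
standard errors around each gain and see whether two intervals overlap)
to the first three grade 6 rows of the TVAAS table forwarded earlier in
this log. The function names and the choice of rows are assumptions made
for illustration. One way to see where a multiplier near 1.5 comes from,
assuming two independent estimates with equal standard errors: their
difference is significant at the 5% level when it exceeds
1.96 * sqrt(2) s.e., which matches non-overlap of intervals of
half-width 1.96 / sqrt(2), about 1.39 s.e.]

```python
from itertools import combinations

# (rank, gain, standard error) for the first three grade 6 middle schools
# in the table forwarded from TVAAS earlier in this log.
schools = [
    (1, 22.624, 1.043),
    (2, 20.176, 0.824),
    (3, 15.602, 1.07),
]

MULTIPLIER = 1.5  # the rough figure quoted in the post; an assumption here


def interval(gain, se, k=MULTIPLIER):
    """Return (low, high) = gain minus/plus k standard errors."""
    return (gain - k * se, gain + k * se)


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def separable(s1, s2, k=MULTIPLIER):
    """True when the two schools' intervals do NOT overlap."""
    _, gain1, se1 = s1
    _, gain2, se2 = s2
    return not intervals_overlap(interval(gain1, se1, k), interval(gain2, se2, k))


for s1, s2 in combinations(schools, 2):
    verdict = "differ" if separable(s1, s2) else "cannot be separated"
    print(f"rank {s1[0]} vs rank {s2[0]}: {verdict}")
```

[On these three rows the intervals for ranks 1 and 2 overlap, while
rank 3 is separated from both. This is only a screen of the kind
described in the post, not a substitute for the interval procedure in
the forthcoming paper.]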
 

From:         Leslie McLean 
 
This is an ASCII file, output from MINITAB, but it might arrive a mess or
be so chopped up as to be unreadable.  If you can read it, you see a
simple analysis of the gains and standard errors Rick sent us from
TVAAS.  My observation that the gains were proportional to the standard
errors does NOT seem to be true--within grades.  If you lump all grades
together, the correlation is over 0.5, but within grades (the correct
plot, IMHO) it is essentially zero.  Grade six shows a substantial
NEGATIVE  correlation, but there are only 12 observations.
   What are these standard errors anyway? In a separate post to me, Greg
Camilli points out that if all students are tested, then the "sampling
error" has to be zero.  What we need to make sense of this is, as I have
said already, a technical report.  How are they modelling the error in
their multilevel models?  What explanatory variables do they use? (Rick
says that TVAAS says they allow for OTL ...) Do they include covariance
terms? Is the "standard error" an estimate of measurement error? Just how
much data is missing?  etc... etc... etc...
 
 
 MTB > note     DATA ON GAINS AND STANDARD ERRORS BY GRADE FROM TVAAS
                A=Grade 3, B=Grade 4 ..., F=Grade 8
 
  stderr  -
          -          F                                A
          -
       6.0+
          -              B                        A
          -               C       A         A        A
          -             B    B        A     A     A         A
          -           B           A   B B       A
       4.0+          C     B  2 C2     B    B B  A A
          -              2      C   A CB 2AAAA       2A  A A
          -                 C B2 4C BBB     3B   A A
          -             CCCC C2CBBCCBA423   A A A        A
          -         C  CB  C 2 B2CC 3C  B3  B   A A
       2.0+                222  3C2B  BA       A
          -          F  F C CCCCC     B    A  A
          -   D  DDDD346EFF5  C  C 2
          -       D2  F E2  EE C
          -
            +---------+---------+---------+---------+---------gain
            0        15        30        45        60
 
 MTB > boxplot 'stderr'     (I believe at least one outlier is an
                             error--the one from Grade 8)
 
 
                      -------------
                ------I     +     I--------------      * *
                      -------------
           +---------+---------+---------+---------+---------stderr
         0.0       1.5       3.0       4.5       6.0
 
 MTB > info c11-c22   (The data, unstacked by grade--outliers in)
 
 Column   Name          Count
 C11      gain3            48
 C12      err3             48
 C13      gain4            48
 C14      err4             48
 C15      gain5            48
 C16      err5             48
 C17      gain6            12
 C18      err6             12
 C19      gain7            12
 C20      err7             12
 C21      gain8            14
 C22      err8             14
 
 MTB > plot 'err3' 'gain3'
 
          -
  err3    -
          -                                      *
          -
       6.0+
          -                                *
          -       *               *             *
          -              *        *        *              *
          -         *                   *
       4.0+       *                      *   *
          -           *      * ** **            2*   *   *
          -      *                **     *  *
          -             *** 2     *  * *             *
          -                  2*         *  *
       2.0+                *          *
          -                     *    *
          -
          -
            ----+---------+---------+---------+---------+-----gain3
               30        40        50        60        70
 
 MTB > corr 'err3' 'gain3'
 
 Correlation of err3 and gain3 = 0.188
 
 MTB > plot 'err6' 'gain6'
 
      1.40+
          -     *            *
  err6    -
          -                         *
          -
      1.20+                   *
          -               *
          -
          -                           *  *
          -                        *                   *
      1.00+                *
          -                     *
          -
          -
          -                                       *
      0.80+
          -
          -
            --------+---------+---------+---------+---------+-gain6
                  5.0      10.0      15.0      20.0      25.0
 
 MTB > corr 'err6' 'gain6'
 
 Correlation of err6 and gain6 = -0.598
 
 MTB > plot 'err8' 'gain8'
 
  err8    -
          -             *
          -
       6.0+
          -
          -
          -
          -
       4.0+
          -
          -
          -
          -
       2.0+
          -            *                *
          -              *  *        2*           *   *   *
          -   *                  *                *
          -
            --------+---------+---------+---------+---------+-gain8
                 12.5      15.0      17.5      20.0      22.5
 
 MTB > note   NOT MUCH POINT IN CALCULATING PEARSON R HERE, EH?
 
 MTB > copy 'err8' 'gain8' c51 c52;
 SUBC> omit 'err8'=6.0:8.0.           (get rid of the outlier)
 MTB > name c51 'err8trim'
 MTB > name c52 'gain8trm'
 MTB > plot c51 c52
 
      1.80+
          -            *
  err8trim-
          -
          -
      1.50+
          -                             *
          -
          -                                           *   *
          -
      1.20+                           *
          -                          *            *
          -              *  *        *
          -
          -   *                  *
      0.90+
          -                                       *
          -
            --------+---------+---------+---------+---------+-gain8trm
                 12.5      15.0      17.5      20.0      22.5
 
 MTB > corr c51 c52
 
 Correlation of err8trim and gain8trm = 0.039
 
 MTB > NOTE    STILL SEEMS TO BE AN OUTLIER (A DIFFERENT ONE)
 MTB > note       NOT PURSUED.
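
[Illustrative sketch, not part of the original post. It shows how a
correlation computed with all grades lumped together can differ from the
correlation within a grade, using only the grade 3 and grade 6 rows of
the TVAAS table reproduced earlier in this log. Those are just 3 and 10
rows, not the 48 and 12 observations in the MINITAB run above, so the
numbers will not match the MINITAB output. The names `rows_by_grade`,
`pearson`, `pooled_r` and `within_r` are assumptions for illustration.]

```python
from math import sqrt

# (gain, standard error) pairs copied from the table rows reproduced
# earlier in this log. Only a few rows per grade were posted there.
rows_by_grade = {
    3: [(71.596, 4.931), (71.205, 3.714), (67.038, 2.624)],
    6: [
        (22.624, 1.043), (20.176, 0.824), (15.602, 1.07), (13.943, 1.099),
        (13.152, 1.286), (12.521, 1.051), (11.146, 0.961), (9.897, 1.194),
        (9.362, 1.34), (8.465, 0.998),
    ],
}


def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


# Correlation with the grades lumped together.
pooled = [pair for rows in rows_by_grade.values() for pair in rows]
pooled_r = pearson(pooled)

# Correlation computed separately within each grade.
within_r = {grade: pearson(rows) for grade, rows in rows_by_grade.items()}

print(f"all grades lumped together: r = {pooled_r:.3f}")
for grade, r in within_r.items():
    print(f"within grade {grade}: r = {r:.3f}")
```

[On these few rows the lumped correlation is strongly positive because
grade 3 has both larger gains and larger standard errors than grade 6,
while the grade 6 rows taken alone give a negative correlation.]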
 

From:         Rick Garlikov 
Subject:      TVAAS responses
 
From TVAAS:
...
All this is to say that I am looking at about a dozen posts regarding
TVAAS tonight, alone.  I don't know whether we can take on the
philosophical discussions, simply because we don't have the time, but there
are others among you who offer alternate viewpoints, eloquently.  I
honestly think we are only going to have the time to write responses
about the model, what it does and how it does it, although I will provide
all the answers I can about how TVAAS is being used by teachers,
administrators, and others.
 
In closing, I would like to direct you to our article, "The Tennessee
Value-Added Assessment System (TVAAS):  Mixed-Model Methodology in
Educational Assessment," in the _Journal of Personnel Evaluation in
Education_ 8:299-311, 1994.  It may answer some of your questions about
the model.
 
Thanks for your interest in TVAAS and for your patience.



From:         Sherman Dorn 
 
The following are excerpts from the 1992 Tennessee state act creating
TVAAS and the relevant sections ($$, if you'll excuse the ASCII
notation):
 
        $49-1-601.
 
        (a) There shall be
performance goals for each school district which shall include, but not
be limited to, determinations based on the current status of each local
school system as determined through the value added assessment
provided for in $$ 49-1-603 -- 49-1-608.
 
        (b) The goal is for all school districts to have mean gain for
each measurable academic subject within each grade greater than
or equal to the gain of the national norms.
 
        (c) If school districts do not have mean rates of gain equal
to or greater than the national norms based upon the TCAP tests
(or tests which measure academic performance which are deemed
appropriate), each school district is expected to make statistically
significant progress toward that goal.  The rate of progress within
each grade and academic course, necessary to maintain compliance
with $$ 49-1-209, $$49-1-210, and this part, will be established after
two (2) years of consecutive testing with tests adopted for each
grade and subject, as provided in $$49-1-603 -- 49-1-608.  Schools
or school districts which do not achieve the required rate of progress
may be placed on probation as provided in $49-1-602.  If national
norms are not available then the levels of expected gain will be set
upon the recommendation of the commissioner with the approval
of the state board.
 
[$$49-1-209, -210 deal with administrative oversight by the
state commissioner and board of education; they do not specify
directly anything about TVAAS or specific items of accountability.]
 
        (d) All schools within all school districts are expected to
maintain appropriate levels of school attendance and dropout rates.
The 1991-1992 school year is the base year for measuring
levels of attendance and dropout rates.  Schools which do not
maintain appropriate levels, as set by the state board on the
recommendation of the commissioner, may be placed on probation,
as provided in $49-1-602.
 
        (e) There is a rebuttable presumption that if a school or school
district has not achieved the goals pursuant to subsection (c) or
maintained attendance and dropout rates pursuant to subsection
(d), it is out of compliance with the requirements of $$49-1-209,
$$49-1-210, and this part and subject to probation as
provided for in $49-1-602.
 
        $49-1-602.  ...
 
[lots of administrative detail about probation, then]
 
        (c) ... If after two (2) consecutive years a system remains
on probation, the commissioner is authorized to recommend to
the state board that both the local board of education and the
superintendent be removed from office.  If the state
board concurs with the recommendation, the commissioner shall
order the removal of some or all of the board members and/or
superintendent and shall declare a vacancy in the office or offices....
 
[procedures for filling the vacancies follow here]
 
        $49-1-603.
 
        (a) Value added assessment means:
 
        (1) A statistical system for educational outcome
assessment which uses measures of student learning to
enable the estimation of teacher, school, and school district
statistical distributions; and
 
        (2) The statistical system will use available and
appropriate data as input to account for differences in
prior student attainment, such that the impact which the
teacher, school and school district have on the educational
progress of students may be estimated on a student
attainment constant basis.  The impact which a teacher,
school, or school district has on the progress, or lack
of progress, in educational advancement or learning of a
student is referred to hereafter as the "effect" of the
teacher, school, or school district on the educational
progress of students.
 
        (b) The statistical system shall have the
capability of providing mixed model methodologies which
provide for best linear unbiased prediction for the teacher,
school and school district effects on the educational
progress of students.  It must have the capability of
adequately providing these estimates for the
traditional classroom (one (1) teacher teaching
multiple subjects to the same group of students),
as well as team taught groups of students or other
teaching situations, as appropriate.
 
        (c) The metrics chosen to measure
student learning must be linear scales covering
the total range of topics covered in the approved
curriculum to minimize ceiling and floor effects.
These metrics should have strong relationship
to the core curriculum for the applicable grade
level and subject.
 
        $49-1-604.
 
 [This section refers to several published articles, including
one co-authored by Sanders in American
Statistician, February 1991.]
 
        $49-1-605.
 
[This section provides for annual estimates
of school and district effects.]
 
        $49-1-606.
 
        (a) On or before July 1, 1995, and
annually thereafter data from the TCAP tests,
or their future replacements, will be used to provide
an estimate of the statistical distribution of
teacher effects on the educational progress of
students within school districts for grades three (3)
through eight (8).  A specific teacher's effect on
the educational progress of students may not be used
as a part of formal personnel evaluation until data
from three (3) complete academic years are
obtained.  Teacher effect data shall not be retained for
use in evaluation for more than the most recent
five (5) years.  A student must have been
present for more than one hundred fifty (150) days of
classroom instruction per year or seventy-five (75) days
of classroom instruction per semester before that student's
record is attributable to a specific teacher.  Records
from any student who is eligible for special education
services under federal law will not be used as part of the
value added assessment.
 
        (b) The estimates of specific teacher effects
on the educational progress of students will not be a
public record, and will be made available only to the
specific teacher, the teacher's appropriate administrators
as designated by the local board of education, and school
board members.
 
[The following sections detail additional matters regarding
test security and requiring fresh test items annually.]
___________

From:         Sherman Dorn 
Subject:      Re: Laying down the law
 
>Please let there be a historian, investigative journalist, snooper or
>other seeker after truth who documents how this legislation came to be
>enacted in its present form!
 
I am an historian, but I have not had time to research this.  (I
moved to TN in late 1993, well after the relevant events.)
 
I gather (and the TVAAS folks can correct me) that Governor Ned
McWherter decided to enact a wholesale reform of educational
finances (including a plan to dramatically increase spending on
education over several years) in 1992.  The state senator who
chaired the education committee (Albright is his name, I believe,
though he was defeated in the primary last year) was very active
in inserting the TVAAS language into the bill as an accountability
mechanism.  My assumption is that he, and other legislators,
didn't want to spend a lot more money without some check on
the "product."  Very reasonable; I just disagree with how they
went about it.
 
More detailed materials are at the state archives downtown in
Nashville.  Someday I may have a chance to consult the
committee and whole-body legislative records.

From:         Rick Garlikov 
Subject:      Re: TVAAS legislation
 
From TVAAS:
 
----------------------------Original message----------------------------
We would like to thank Sherman Dorn for providing sections of the
legislation regarding educational accountability in Tennessee.  We would
like to comment on a few sections.  We are deleting the sections upon
which we have no comment to save your time and space but refer you to
Dorn's original post for the balance.
 
 
On Sat, 14 Jan 1995, Sherman Dorn wrote:
 
> The following are excerpts from the 1992 Tennessee state act creating
> TVAAS and the relevant sections ($$, if you'll excuse the ASCII
> notation):
>
>         $49-1-601.
>
>         (a) There shall be
> performance goals for each school district which shall include, but not
> be limited to, determinations based on the current status of each local
> school system as determined through the value added assessment
> provided for in $$ 49-1-603 -- 49-1-608.
>
Please note that the performance goals are not limited to TVAAS findings.
 
>         (b) The goal is for all school districts to have mean gain for
> each measurable academic subject within each grade greater than
> or equal to the gain of the national norms.
>
>         (c) If school districts do not have mean rates of gain equal
> to or greater than the national norms based upon the TCAP tests
> (or tests which measure academic performance which are deemed
> appropriate), each school district is expected to make statistically
> significant progress toward that goal.  The rate of progress within
> each grade and academic course, necessary to maintain compliance
> with $$ 49-1-209, $$49-1-210, and this part, will be established after
> two (2) years of consecutive testing with tests adopted for each
> grade and subject, as provided in $$49-1-603 -- 49-1-608.  Schools
> or school districts which do not achieve the required rate of progress
> may be placed on probation as provided in $49-1-602.  If national
> norms are not available then the levels of expected gain will be set
> upon the recommendation of the commissioner with the approval
> of the state board.
 
Please note that, in this and in all other sections of the law pertaining
to placing schools and systems on probation, the language is permissive,
not directive.  Schools and systems MAY be placed on probation;
superintendents and school board members MAY be removed.  There is no
language in the law that states that these actions MUST take place.
>
> [$$49-1-209, -210 deal with administrative oversight by the
> state commissioner and board of education; they do not specify
> directly anything about TVAAS or specific items of accountability.]
>
>         (d) All schools within all school districts are expected to
> maintain appropriate levels of school attendance and dropout rates.
> The 1991-1992 school year is the base year for measuring
> levels of attendance and dropout rates.  Schools which do not
> maintain appropriate levels, as set by the state board on the
> recommendation of the commissioner, may be placed on probation,
> as provided in $49-1-602.
>
>         (e) There is a rebuttable presumption that if a school or school
> district has not achieved the goals pursuant to subsection (c) or
> maintained attendance and dropout rates pursuant to subsection
> (d), it is out of compliance with the requirements of $$49-1-209,
> $$49-1-210, and this part and subject to probation as
> provided for in $49-1-602.
>
>         $49-1-602.  ...
>
> [lots of administrative detail about probation, then]
>
>         (c) ... If after two (2) consecutive years a system remains
> on probation, the commissioner is authorized to recommend to
> the state board that both the local board of education and the
> superintendent be removed from office.  If the state
> board concurs with the recommendation, the commissioner shall
> order the removal of some or all of the board members and/or
> superintendent and shall declare a vacancy in the office or offices....
>
Please note all of the safe-guards in this section.  Both the
Commissioner of Education and the State Boards of Education would have to
agree to this very drastic action.  It cannot happen "automatically."
 
=========================================================================

From:         Harvey Goldstein 
 
Re Les Mclean's message of 13th about standard errors.
He quotes Camilli as stating that the standard errors
given are 'sampling errors' and that if all students are
tested then these are zero. I am confused! The usual
standard errors quoted in this context are those
relating to the accuracy of the estimated school effects
where there is a conceptually infinite population of
students of whom those measured (whether they are all
those in the school at a particular time or not) are a
random sample. If they are not this then what are they?
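
[Illustrative sketch, not part of the original post. It contrasts the
two readings of "standard error" at issue here, under a simple-random-
sampling assumption that neither poster states in these terms. With an
effectively infinite population the standard error of a mean is
s / sqrt(n). With a finite population of size N the usual finite
population correction multiplies this by sqrt(1 - n/N), which is zero
when everyone is tested (n = N). The scores below are hypothetical, and
`se_mean` is a name invented for illustration. Nothing here is claimed
to be how TVAAS computes its standard errors.]

```python
from math import sqrt
from statistics import stdev


def se_mean(values, population_size=None):
    """Standard error of the mean of `values`.

    population_size=None treats the population as infinite.
    Otherwise the finite population correction sqrt(1 - n/N) is applied.
    """
    n = len(values)
    se = stdev(values) / sqrt(n)  # stdev uses the n - 1 denominator
    if population_size is None:
        return se
    if population_size < n:
        raise ValueError("population cannot be smaller than the sample")
    return se * sqrt(1 - n / population_size)


# Hypothetical gain scores for the five students in one small class.
scores = [12.0, 15.0, 9.0, 14.0, 10.0]

print("infinite population:   ", se_mean(scores))
print("population of 50:      ", se_mean(scores, population_size=50))
print("all five are the class:", se_mean(scores, population_size=5))
```

[The same five numbers give three different standard errors depending
only on what population one says they came from, which is the point on
which the two readings part company.]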
 
 


From:         Gene Glass 
 
On Mon, 16 Jan 1995 15:10:34 -0500 Greg Camilli said:
 Harvey Goldstein raised questions about what Greg Camilli said
 about "standard errors" thusly:
 
>>Re Les Mclean's message of 13th about standard errors.
>>He quotes Camilli as stating that the standard errors
>>given are 'sampling errors' and that if all students are
>>tested then these are zero. I am confused! The usual
>>standard errors quoted in this context are those
>>relating to the accuracy of the estimated school effects
>>where there is a conceptually infinite population of
>>students of whom those measured (whether they are all
>>those in the school at a particular time or not) are a
>>random sample. If they are not this then what are they?
 
    Greg answered with a hypothetical conversation between an educator
 and a statistician. I think Greg exposed some key problems with this notion
 of standard errors, and it is no more a problem with TVAAS than it is a
 problem with most applications of inferential statistics in education.
 
    Harvey asks, in effect, what is wrong with regarding standard errors as
 being measures of the accuracy of samples as representations of
 "conceptually infinite populations" from which the samples might
 "conceivably have been drawn at random."
 
    After more than thirty years of calculating, deriving, explaining and
 publishing "standard errors" and their ilk, I have come to the conclusion
 that I don't know what they mean and I doubt seriously that they mean
 anything like what they are portrayed as meaning.
 
    Consider this:  if the population to which inference is made is one
 that is conceptually like the sample, then the population is just the
 sample writ large and the "standard error" is much larger than it ought
 to be. If you show me 25 adolescent largely Anglo-Saxon boys who love sports
 and ask me the population from which they could conceivably have been
 sampled, I'll conceive of an "infinite" population of such boys. If no
 population has actually been sampled and all I know about the situation
 before me is the sample, then I will conceive of a population like the sample.
 This is surely the very opposite of inference and standard errors are
 surely beside the point.
 
  Consider something even more troubling: I present you with a sample--
 Florida, Alabama, Tennessee, South Carolina. N=4. I calculate the
 state high school graduation rates, average them and calculate a standard
 error. What is the population? States in the Southern U.S.? Fine; that's
 certainly conceivable, even if not "infinite." But suppose that
 someone else conceives of "States in the U.S." Well, that's conceivable
 too. But it is surely ridiculous to think that these four states can be used
 to infer to both of these conceivable populations with equal accuracy
 (standard errors). Or to make matters worse, suppose that I suddenly
 produce a fifth "state": Alberta. Now it raises the question whether
 the conceptual population is "geo-political units in North America"--
 or the entire Western Hemisphere.
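
 The arithmetic behind this complaint can be sketched in a few lines of
 Python. The graduation rates below are made-up placeholders, not real state
 figures. The point is only that the conventional formula takes the sample
 as its sole input, so it returns the same number whichever population one
 chooses to conceive.

```python
import math
from statistics import mean, stdev

# Hypothetical graduation rates (percent); illustrative values only.
rates = {
    "Florida": 70.0,
    "Alabama": 60.0,
    "Tennessee": 65.0,
    "South Carolina": 75.0,
}

values = list(rates.values())
sample_mean = mean(values)
# Conventional standard error of the mean: sample SD divided by sqrt(n).
standard_error = stdev(values) / math.sqrt(len(values))

# The conceived population never enters the calculation.
for population in ("Southern U.S. states", "all U.S. states"):
    print(f"{population}: mean = {sample_mean:.1f}, SE = {standard_error:.2f}")
```

 Both lines of output carry the same mean and the same standard error,
 which is the difficulty described above.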
 
    I can't imagine that there is much wisdom in attaching a number accurate
 to two decimal places when we can't even be certain whether it is referring
 to an "inference" to the Southern U.S., North America or the Western
 Hemisphere.
 
    Now, if you think I am playing with your head and will suggest a way
 out of this dilemma that rescues the business of statistical inference
 for us, let me assure you that I have no solution. In spite of the fact
 that I have written stat texts and made money off of this stuff for some
 25 years, I can't see any salvation for 90% of what we do in inferential
 stats. If there is no ACTUAL probabilistic sampling (or randomization)
 of units from a defined population, then I can't see that standard errors
 (or t-tests or F-tests or any of the rest) make any sense.
 
    Does any of this apply to TVAAS? Just this. If one is worried about
 "stability" (in any of the many senses in which the word could be
 interpreted) then why not simply compare teachers' scores across all years
 for which data are available? That would answer in very straightforward
 ways whether the ranking of teachers jumps around wildly for whatever
 reasons or is relatively steady.
   (I hasten to add that I don't approve of such things as ranking teachers
 with respect to their students' test scores.)
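
 A minimal sketch of such a comparison, assuming one gain score per teacher
 per year. The scores are invented for illustration, and the Spearman rank
 correlation is one possible stability measure, not something TVAAS is known
 to report.

```python
from statistics import mean

def ranks(values):
    """Rank values from 1 (lowest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical gain scores for the same five teachers in three years.
gains = {
    1992: [4.1, 2.0, 3.3, 5.0, 1.2],
    1993: [3.8, 2.5, 3.0, 4.6, 1.9],
    1994: [1.0, 4.9, 2.2, 3.1, 4.0],
}

years = sorted(gains)
stability = {
    (a, b): spearman(gains[a], gains[b]) for a, b in zip(years, years[1:])
}
for pair, rho in stability.items():
    print(pair, round(rho, 2))
```

 A correlation near 1 between adjacent years means the ranking is steady; a
 value near zero or below means it jumps around.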


From:         Harvey Goldstein 
 
Well, I enjoyed Greg Camilli's imaginary conversation, but of
course the reality is that standard errors are not things
statisticians invented to make life difficult. Most
non-statisticians have little difficulty in understanding that
if you only have a measurement on 1 student there ain't much
to be said about the rest. The bigger the sample the more
confident you become that what you have observed is a good
guide to what you would get on repeated samples with also
suitably large numbers...assuming of course that you adopted a
sensible randomly based sampling strategy.
Now we come to the philosophical bit. Social statisticians are
pretty much forced to adopt the notion of a 'superpopulation'
when attempting to generalise the results of an analysis. If
you want to be strict about things then the relationship you
discovered between parental education and student achievement
back in 1992 from a sample of 50 elementary schools in Florida
can only give you information about the physically real
population of Florida schools in 1992. Usually we are not
interested just in such history, but in rather more general
statements that pertain to schools now and in the future...we
may be wrong of course and that is why we strive to replicate
over time and place etc. BUT the point is that, getting back
to value added estimates for a school, if we want to make a
general statement about an institution we do have to make some
kind of superpopulation assumption....what we happen to
observe for the students we have studied is a reflection of
what the school has done, and would have done, for a bunch of
students, given their measured characteristics such as initial
achievement. The more students we measure the more accurate we
can be and that's why we need an estimate of uncertainty
(standard error).
 


From:         Harvey Goldstein 
 
Gene Glass also takes me to task on standard errors and raises
the interesting question of when a sample should be considered
as having a reference population and when not. There is no
general answer...it depends on what you want to do.
As I said in my response to Greg, I cannot easily see how you
can have empirical social science without assuming that the
units (people, schools etc) you happen to have measured are
representative (in the usual statistical sense) of a (yes)
hypothetical population whose members exhibit relationships
you want to estimate. Such populations must (I think) be
hypothetical because they have to embrace the present and
future as well as the past when the data were collected. The
issue is therefore the general philosophical issue and not a
statistical one - statisticians simply try to provide tools
for making inferences about such populations.
 

From:         Greg Camilli 
 
Harvey responded to a previous post with the following:
 
>Well, I enjoyed Greg Camilli's imaginary conversation, but of
>course the reality is that standard errors are not things
>statisticians invented to make life difficult. Most
>non-statisticians have little difficulty in understanding that
>if you only have a measurement on 1 student there ain't much
>to be said about the rest. The bigger the sample the more
>confident you become that what you have observed is a good
>guide to what you would get on repeated samples with also
>suitably large numbers...assuming of course that you adopted a
>sensible randomly based sampling strategy.
 
Smaller is better, I agree. Another issue is whether it is the
correct standard error, and still another is whether the SE
has a meaningful referent. If the sample consists of all kids
in the system, how can imagining a larger group possibly
create more information? If I want to understand the behavior
of my three cars (I wish), how would it benefit me to imagine
I had a fourth? IMO, this is not a statistical issue at all.
Population has always been a heuristic device.
 
>Now we come to the philosophical bit. Social statisticians are
>pretty much forced to adopt the notion of a 'superpopulation'
>when attempting to generalise the results of an analysis. If
>you want to be strict about things then the relationship you
>discovered between parental education and student achievement
>back in 1992 from a sample of 50 elementary schools in Florida
>can only give you information about the physically real
>population of Florida schools in 1992. Usually we are not
 
Your option is to generalize to physically unreal schools.
 
>interested just in such history, but in rather more general
>statements that pertain to schools now and in the future...we
>may be wrong of course and that is why we strive to replicate
>over time and place etc. BUT the point is that, getting back
>to value added estimates for a school, if we want to make a
>general statement about an institution we do have to make some
>kind of superpopulation assumption....what we happen to
>observe for the students we have studied is a reflection of
>what the school has done, and would have done, for a bunch of
>students, given their measured characteristics such as initial
>achievement. The more students we measure the more accurate we
>can be and that's why we need an estimate of uncertainty
>(standard error).
 
I think you said in your ensuing post that this really isn't
a statistical issue. Generalizing beyond known populations
is risky business, and requires more than statistical knowledge.
This was the focus of the long and interesting dialogue between
Cronbach and Campbell. Standard errors have something to do with
the precision of estimates. Perhaps they convey something about
how well a model fits certain data. You might want to argue, on this
basis, that the model is likely (or not) to generalize; but model
fit at one instant does not *logically imply* model fit one second
later.
 
The standard errors will apparently be used to measure whether
statistically significant progress is being made by schools that
fail to meet the standard (whatever that turns out to be), so it
is important to be clear about what SEs mean. I find it fascinating
that they are being used as policy tools with legal implications.
In this regard, it is important to understand what drives the SEs.
I'm guessing that missing data will add to SEs (it really would be
helpful if the TVAAS staff would respond), and am sure that unit
size will decrease SEs. Thus, standard errors
will typically be smaller for districts than for schools than for
teachers than for students. As far as I can tell, only certain
districts are required to make statistically significant progress;
this may turn out to be a pretty easy criterion to satisfy.
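
The dependence on unit size can be illustrated with the textbook formula
for the standard error of a mean, SE = SD / sqrt(n). This is a sketch only:
the standard deviation and the unit sizes are assumed values, and the
standard errors TVAAS reports come from a mixed model, not from this formula.

```python
import math

def standard_error_of_mean(sd, n):
    """Textbook standard error of a mean of n independent scores."""
    return sd / math.sqrt(n)

score_sd = 20.0  # assumed standard deviation of student gain scores

# Assumed numbers of students contributing to each kind of unit.
unit_sizes = {"student": 1, "teacher": 25, "school": 400, "district": 10000}

standard_errors = {
    unit: standard_error_of_mean(score_sd, n) for unit, n in unit_sizes.items()
}
for unit, se in standard_errors.items():
    print(f"{unit:8s} n = {unit_sizes[unit]:5d}  SE = {se:.2f}")
```

Under these assumptions a gain that is far too small to detect for a single
teacher can be "statistically significant" for a district.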
 

From:         Leslie McLean 
Subject:      The law demands std. errors
 
The discussion of standard errors has gotten so involved that a look at
the Tennessee legislation should tell us where standard errors are
needed and what interpretations reasonable people ought to be able to
put on them.  Below, the text is from Sherman Dorn's post and Les
McLean's comments are in upper case letters.
 
 
        (b) The goal is for all school districts to have mean gain for
each measurable academic subject within each grade greater than
or equal to the gain of the national norms.
     HOW WILL ANYONE DECIDE WHETHER THE MEAN GAIN
IS GREATER THAN OR EQUAL TO THE GAIN OF THE
NATIONAL NORMS?  PUBLICATION OF "STANDARD
ERRORS" MUST MEAN THAT AN ERROR BOUND WILL BE
ESTABLISHED AROUND THE NATIONAL NORMS--PERHAPS
1.5 TIMES THE MEDIAN STD. ERROR PER GRADE--ONE
"HARVEY", OR 2.0 STD. ERRORS--ONE "DORN".
 
        (c) If school districts do not have mean rates of gain equal
to or greater than the national norms based upon the TCAP tests
(or tests which measure academic performance which are deemed
appropriate), each school district is expected to make statistically
significant progress toward that goal.
OK, GANG, THE VEIL IS LIFTED FROM OUR EYES--THERE IS
NO SUCH THING AS "STATISTICALLY SIGNIFICANT
PROGRESS" WITHOUT STANDARD ERRORS AND THE
ASSUMPTION OF SAMPLES FROM SOME POPULATION.
 
Schools or school districts which do not achieve the required rate of
progress may be placed on probation as provided in §49-1-602.  If
national norms are not available then the levels of expected gain will
be set upon the recommendation of the commissioner with the
approval of the state board.
YO, COMMISH! I DO NOT ENVY YOU YOUR TASK.
 
        (a) Value added assessment means:
 
        (1) A statistical system for educational outcome
assessment which uses measures of student learning to
enable the estimation of teacher, school, and school district
statistical distributions; and
 
        (2) The statistical system will use available and
appropriate data as input to account for differences in
prior student attainment, such that the impact which the
teacher, school and school district have on the educational
progress of students may be estimated on a student
attainment constant basis.  The impact which a teacher,
I COULD WRITE A RATIONALE FOR A "STATISTICAL
SYSTEM" THAT DID NOT NEED STANDARD ERRORS, GIVEN
THAT THEY TEST ALL THE STUDENTS.  IT WOULD
CONTAIN CAREFUL, MODERN DESCRIPTIVE STATISTICS
THAT WOULD GLADDEN JOHN TUKEY'S HEART.
 
 
        (a) On or before July 1, 1995, and
annually thereafter data from the TCAP tests,
or their future replacements, will be used (NOTICE THE 'WILL'--
THE LANGUAGE IS NOT JUST PERMISSIVE HERE) to provide
an estimate of the statistical distribution of
teacher effects on the educational progress of
students within school districts for grades three (3)
through eight (8).
HERE WE ARE AGAIN--THESE GAINS ARE TO BE
INTERPRETED AS "TEACHER EFFECTS".  PEACE, TVAAS,
BUT I DO NOT BELIEVE THAT ANYONE'S MODELS AND
TECHNIQUES ARE YET GOOD ENOUGH TO ISOLATE THE
TEACHER EFFECT FROM ALL THE OTHER EFFECTS ON
STANDARDIZED TEST SCORES IN SCHOOLS WITH ALL
THEIR COMPLEXITY.
     NEXT TO THIS CONCERN--IT IS A CONCERN ABOUT
VALIDITY AND IS NOT VAGUE OR COMPLEX--THE
DEFINITION AND ESTIMATION OF STANDARD ERRORS IS
TOO SMALL A MATTER TO TAKE OUR TIME.


From:         Gene Glass 
 
     I quite agree with Les McLean that set over against questions of the
 validity of TVAAS "teacher effects" the matter of "standard errors"
 is of lesser importance.
 
     Here's a validity question. I have asked it twice here--so that TVAAS
 could address it, but have not seen an answer.
 
 
      A passage from the law makes it highly likely that what TVAAS does
  is measure pre-year achievement and partial (via some form of least-squares
  estimation) it out of post-year achievement for each teacher's class.
  Although this makes the resulting "gains" uncorrelated with pre-year
  achievement, it does not make them uncorrelated with pre-year ability--
  ability and achievement not being the same thing.
 
      Consequently, teachers working with classes of different average ability
   cannot be said to be evaluated fairly.
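
   The point can be illustrated with a small simulation. Everything in it is
   an assumption made for the sketch: the way ability, pre-year and post-year
   scores are generated is invented, and the single-predictor least-squares
   adjustment stands in for whatever TVAAS actually does.

```python
import random
from statistics import mean

random.seed(0)

def corr(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

n = 2000
# Assumed model: ability drives the pre-year score AND the growth during the year.
ability = [random.gauss(0, 1) for _ in range(n)]
pre = [a + random.gauss(0, 1) for a in ability]
post = [p + 0.5 * a + random.gauss(0, 0.5) for p, a in zip(pre, ability)]

# Partial pre-year achievement out of post-year achievement by least squares.
mx, my = mean(pre), mean(post)
slope = sum((x - mx) * (y - my) for x, y in zip(pre, post)) / sum(
    (x - mx) ** 2 for x in pre
)
intercept = my - slope * mx
adjusted_gain = [y - (intercept + slope * x) for x, y in zip(pre, post)]

r_pre = corr(adjusted_gain, pre)          # zero by construction
r_ability = corr(adjusted_gain, ability)  # remains clearly positive
print(f"corr(adjusted gain, pre-year score) = {r_pre:.3f}")
print(f"corr(adjusted gain, ability)        = {r_ability:.3f}")
```

   In this setup the adjusted gains are uncorrelated with the pre-year score
   yet still favor students of higher ability, so a teacher's result would
   depend in part on who is in the class.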
 
 
   (This short discussion doesn't even raise very serious matters of
    "errors of measurement" in any pre measure that will also lead to
    a breakdown in "equating" the teachers at the start of the year. Nor
    does it raise questions of how a third grade teacher can teach
    skills or understandings that won't come to fruition on tests until
    the fourth and fifth grade. Does the TVAAS model really detect this
    and allocate proper credit to the source of the knowledge? And so on
    through similar concerns about validity.)


From:         Sherman Dorn 

The TVAAS staff noted, accurately, that the 1992 legislation is inclusive rather
than prescriptive in how it describes the role of TVAAS in evaluation.
Nothing forbids the state from using additional statistics as part of
evaluation, either on the system level or for personnel evaluation.
Nonetheless,
I think it is clear from the text of the law that TVAAS was always
intended to serve a central (and, from my reading, THE central) role in
program evaluation, and that the threat of probation and public censure
from TVAAS is the major stick in the accountability system in
Tennessee.  Consider these aspects of the legislation:
 
        (1) Only value added assessment is rigorously defined in the
        legislation, through citations to peer-reviewed
        literature and definitions of threshold levels for students'
        inclusion in teacher effects.  (By contrast, the law talks about
        attendance and dropout rates without ever defining them, or
        how to apportion responsibility for students.)
 
        (2) The only defined standard in the law (that of meeting
        national norm gains where available) is directly related to
        TVAAS.
 
        (3) Only three topics of measurement (TVAAS, attendance,
        and dropping out) were mentioned in the legislation.
 
Now, it may be that school systems, and the Board of Education and
new Commissioner of Education, will create all sorts of additional
statistics to be part of program and personnel evaluation.  But from
where I sit on this drizzly night, I think that's unlikely -- and, moreover,
the legislation as it stands currently sends a very powerful signal,
one that reinforces the high-stakes nature of the current set of
annual tests in Tennessee.


From:         Harvey Goldstein 
 
Les McLean's comments have inspired some more
thoughts.
In the simplest value added model, an outcome score is
regressed on an input score so that generally each
school will have a different regression line - perhaps
with varying slopes but in the basic model with parallel
slopes so that schools can then be ranked on the
resulting regression intercepts. (The actual analysis is
a bit more complex but this simple model captures the
essence). We find, typically, that the variation among
these intercepts is relatively small compared to the
residual variation of student scores about the
regression lines for each school (5% - 30% depending on
which educational system you are studying). In
addition, the regression itself will account for
quite a lot of the variation in outcome...maybe as
much as 50-60%. This means that there is a substantial
remaining variation (among students) unaccounted for and
it is this residual variation which determines the
standard error values. Thus, e.g. if this residual
variation was zero, we would exactly predict each
school's (relative) mean and the standard error of that
prediction would be zero. This would mean also that once
we knew each student's input score (and anything else we
were able to put into our regression model) and the
school that student was in we would have a perfect
prediction of the student's outcome. Of course, we are
nowhere near that situation and it is this uncertainty
about the individual prediction that translates into
uncertainty about the school mean (think of the mean
roughly as the average of the student residuals about
the regression line for each school). If you took
another bunch of students with exactly the same set of
intake scores you would NOT therefore expect to get the
same set of outcome scores - this is what the
uncertainty implies - nor the same mean for the school.
In the absence of being able to predict with certainty
we have to postulate some underlying value for each
school's mean (otherwise we are pretty well lost) which
we can think of as the limit of a series of conceptual
allocations of students to the school. Thus an estimate
of uncertainty, conventionally supplied by calculating
the appropriate standard error, is important if you want
to make any inference about whether the underlying means
are different and, more importantly, to set limits
(confidence intervals e.g.) around the estimated
difference for any two schools or around the difference
between a school's estimate and some national norm.
Hence my original remark some time ago that when you did
just that you found that most institutions could not
statistically be separated, and I suspect also for TVAAS
that very many cannot statistically be separated from a
National norm, whether they are actually above or below
it.
It would be good to hear from the TVAAS people on this
issue.
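
The simplest model described above (parallel slopes, one intercept per
school) can be sketched as follows. This is not the TVAAS mixed model: the
data are simulated, every parameter value is an assumption, and the schools
are treated as fixed effects in an ordinary analysis of covariance.

```python
import math
import random
from statistics import mean

random.seed(1)

# Simulated data; the slope, school effects and spreads are all assumptions.
true_effect = {"A": -2.0, "B": 0.0, "C": 2.0}
data = {}
for school, effect in true_effect.items():
    rows = []
    for _ in range(60):
        intake = random.gauss(50, 10)
        outcome = 10 + 0.8 * intake + effect + random.gauss(0, 8)
        rows.append((intake, outcome))
    data[school] = rows

def school_means(rows):
    return mean(x for x, _ in rows), mean(y for _, y in rows)

# Common (parallel) slope from the pooled within-school sums of squares.
sxx = sxy = 0.0
for rows in data.values():
    mx, my = school_means(rows)
    sxx += sum((x - mx) ** 2 for x, _ in rows)
    sxy += sum((x - mx) * (y - my) for x, y in rows)
slope = sxy / sxx

# Residual variance of student scores about the school regression lines.
rss = 0.0
for rows in data.values():
    mx, my = school_means(rows)
    rss += sum(((y - my) - slope * (x - mx)) ** 2 for x, y in rows)
n_total = sum(len(rows) for rows in data.values())
residual_variance = rss / (n_total - len(data) - 1)

grand_x = mean(x for rows in data.values() for x, _ in rows)

# Adjusted school mean (outcome at the overall mean intake) with its SE.
results = {}
for school, rows in data.items():
    mx, my = school_means(rows)
    adjusted = my - slope * (mx - grand_x)
    se = math.sqrt(
        residual_variance * (1 / len(rows) + (mx - grand_x) ** 2 / sxx)
    )
    results[school] = (adjusted, se, adjusted - 1.96 * se, adjusted + 1.96 * se)

for school, (adjusted, se, low, high) in sorted(results.items()):
    print(f"School {school}: adjusted mean {adjusted:.1f}, SE {se:.2f}, "
          f"95% interval ({low:.1f}, {high:.1f})")
```

The standard error of each adjusted mean is driven by the residual variance
and the number of students, as described above. Comparing the printed
intervals shows directly which schools can and cannot be separated.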
 
 

From:         Rick Garlikov 

Since Sherman and others read the Tennessee law he quoted the way
they do, it IS extremely important for the state of Tennessee to
make clear that is NOT the way it should be read.
 
That being said, it is either through different experiences or
just a mere quirk of psychology, but *I* don't get the same
meaning out of what Sherman quoted as he does, and as others seem
to.  Even TVAAS, in their commentary on the law, did not point
out what I think is the single most important word --
"rebuttable".  I don't have the language available at this time,
but the phrase was something like "It is the rebuttable
presumption that a school district should have [average gain
scores...]..." and that if they don't, AND don't seem to be able
to AND if there doesn't seem to be any compelling reason (i.e.
rebuttal) why they don't, THEN ....
 
Further, the consequence then is that the state superintendent
(or whoever -- I forget) MAY recommend to the state school board
that people be removed from offices, and the State School Board
has to concur with that recommendation.
 
If Tennessee is anything like Alabama in these sorts of matters,
it would take an act of God for all these things to happen in a
way that would actually end up in somebody's removal --
regardless of how abysmal their district's test scores are.
There would be a bunch of excuses given that would count as
"rebuttal" whether they were or not.
... 
 
So, not only do I not see the language's saying what Sherman
sees, but I don't see the POLITICAL reality's being what Sherman
sees, though I understand his expectations sometimes are met --
and though I see that WITHOUT explicit and REPEATED policy
statements to the contrary there can be various sorts of
pressures to have tests drive the curriculum and instruction in
undesirable ways.  Where Sherman and I disagree is whether tests
will drive the curriculum even if there is a clear policy
STATEMENT to the contrary.

From:         Leslie McLean 
 
  First: Please dissociate names of persons from std. errors (Harveys
and the like).  When I think of it, I wouldn't like to be named for an
error either.  Consider this one:  -1 = One Les.
 
Second: Harvey Goldstein's exposition on standard errors (17 Jan,
"Standard Errors: yet again") may have been more than some wanted, but
I found it instructive and thought-provoking.  If you deleted without
reading, reconsider--it gets at the heart of the matter of TVAAS.
   While still wanting to retain the concept of the sample from some
(unspecified) population, Harvey's main lesson for us was to highlight
the crucial role of the model adopted by the statistician in
estimating scores--gain scores, in the case of TVAAS.  A model is a
formula that the statistician considers a reasonable try at relating
the desired quantity, the 'gain' in achievement (not directly
measurable because of nuisances such as social class and prior
learning) to aspects of schooling, such as teacher competence.
 
   Advised by statisticians with wide experience outside of education
(and maybe in education--we have not been told), the policy-makers
decide to give the statisticians their head and to accept their
estimate of 'gain', knowing that the formula will be complex and the
procedures well beyond the understanding of all but a very few.  The
statisticians make a persuasive case that their formula and their
procedures will provide the policy-makers with an estimate of gain
that will distinguish the bad teachers from the poor from the average
from the good from the excellent.  "National norms" are invoked,
unspecified, but responsibility given to the Commissioner of Education
to provide norms if the national government lets the side down.
   All this tedious repetition is needed to give a context for Harvey
Goldstein's description of standard errors.  In essence (correction,
Harvey, please, if needed) the errors are S&E, not SE--errors of
Specification & Estimation, not of sampling.  A 'specification' error
is made when our model, our formula, does not accurately link the
target (the gain) with the data (the item responses or scale scores
plus proxies for prior learning and social class and the like).  We
ALWAYS make a specification error--the only question is how large.  If
we limit ourselves, as in the TVAAS, to linear models, and we try to
estimate gains across big, complex societies such as states, the error
can be huge--and there is no consensus on how to estimate the size of
the error.  Here is a source of error.
   Even though they do not sample students and schools, sampling
cannot be avoided--people are absent, times of testing vary, the tests
cannot possibly cover all the content (hence content sampling), items
are omitted, test booklets get lost, some teachers do not cover the
material on the test, ..., and so on and so on and so on.  This is why
we do not use a very simple formula:
       Gain = (Avg. score end - Avg. score beginning)
After all, when we test everyone, and when the goal is to measure
gains by these students this year in these places with these teachers,
who needs an error term?  With well-constructed tests, the measurement
errors will cancel out when we calculate school and class means.
Oh--there is measurement error in individual pupil scores, but we can
report that (from the test publisher's manual) and besides, these
scores don't count in the student's grade--the teacher does not get
them in time, and even if they do they do not use them.
 
   Ok, so I seem to have lost the tenuous thread of the argument--NOT
SO! We have learned over the years that the simple formula is more
likely to mislead than to lead--to distort our view of gain rather
than to clarify it.  Raw score comparison tables (called 'League
Tables' in the UK, after the rankings of sports teams), however
compelling they seem, are statistically invalid, immoral, racist,
sexist and stupid.  Apart from those few flaws, they are fine.  But
would Tennessee put up with such poor procedures?  Not on your
life--scaling, imputation, hierarchical linear models and prayer are
brought into play.  Here is another source of error.
... 
   Pick up the thread again, the two of you who are still reading.  All
this talk of standard errors and models and politics keeps coming back
to one key aspect: VALIDITY.  Do those numbers represent gains in
achievement? The formulas and procedures are complex enough that
evidence is needed.   Even if they do, how accurate are they--and I
mean how much do they tell us about better learning, class-by-class,
teacher-by-teacher; or has the TVAAS traded in science for voodoo?
Without a better explanation, the use of these scores to label
teachers as competent or incompetent seems a lot like sticking pins in
dolls.
 
  It is possible to validate the numbers--but it would take a lot of
thinking, a lot of hard work and maybe 0.01 of the budget of TVAAS.


From:         Greg Camilli 
Harvey Goldstein wrote: 
>In the absence of being able to predict with certainty
>we have to postulate some underlying value for each
>school's mean (otherwise we are pretty well lost) which
>we can think of as the limit of a series of conceptual
>allocations of students to the school.
 
I think we're lost when we accept statistical inferences
based on data that weren't observed, and moreover, do not
exist conceptually. If "all the students in the school"
doesn't really have that meaning, then we are playing a
game with language.
 
>Thus an estimate
>of uncertainty, conventionally supplied by calculating
>the appropriate standard error,is important if you want
>to make any inference about whether the underlying means
>are different and, more importantly, to set limts
>(confidence intervals e.g.) around the estimated
>difference for any two schools or around the difference
>between a school's estimate and some national norm.
>Hence my original remark some time ago that when you did
>just that you found that most institutions could not
>statistically be separated, and I suspect also for TVAA
>that very many cannot statistically be separated from a
>National norm, whether they are actually above or below
 
If we can get away from the superpopulation for a moment, we
can begin to analyze what drives the standard error. It
certainly isn't sampling error; nonetheless, it is a quantity
that exists in a real sense. As you've implied above, SEs
have something to do with model fit. Thus, we should be
interested in those things that cause models to fit more
loosely to the data. District size is certainly one factor;
but correlation of effects within the model will also inflate
SEs. Effects like teachers within schools, teachers with
school, schools with district might be some examples. As Gene
implies, separating these effects may take some doing.
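
The inflation from correlated effects can be illustrated with a standard
result for ordinary least squares with two predictors: if the predictors
correlate r, the sampling variance of each coefficient is multiplied by
1 / (1 - r^2) relative to the uncorrelated case. Whether the TVAAS mixed
model behaves in the same way is an assumption of this sketch.

```python
import math

def variance_inflation(r):
    """Variance inflation factor for two predictors with correlation r."""
    if abs(r) >= 1:
        raise ValueError("effects are perfectly confounded; no separate estimate")
    return 1.0 / (1.0 - r ** 2)

for r in (0.0, 0.5, 0.9, 0.99):
    vif = variance_inflation(r)
    # The standard error grows by the square root of the variance inflation.
    print(f"r = {r:4.2f}  variance x {vif:6.2f}  SE x {math.sqrt(vif):5.2f}")
```

As the teacher and school effects become harder to tell apart (r near 1),
the standard errors of both grow without limit.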
 

From:         Greg Camilli 
Rick G.:
>  Greg says that his main concern is "how testing programs shape
>school culture and curricula."  Greg, cannot school culture and curricula
>also shape testing programs?  And couldn't the dominant cause and effect
>relationship flow in that direction, more than in the direction of the
>tests driving curriculum and instruction?  If not, why not?
 
For a number of reasons, testing programs usually affect schools, and not
the reverse. First, testing programs are *intended* to shape curricula.
Second, large-scale tests are mainly written to assess only skills that cut
across curricula, perhaps with the message "what should be taught." Thus,
the individual character of schools and districts can get lost. Maybe you
mean "But isn't it possible that they affect testing programs?" to
which I would answer maybe it happens somewhere, in some superpopulation,
but not where I come from. Perhaps in an ideal world.
 
Maybe you ask "Wouldn't it be possible to change testing programs so that
they address the needs of schools" to which I would reply "Yes." But I
hasten to add that there is a distinction between local and global
testing. I'm sure a lot of really good district assessment programs are
in place. If you imply that far from being agents of change, global
testing programs may themselves need to be reformed, then I would also
agree.
 

From:         Rick Garlikov 
Subject:      Response from Bill Sanders, TVAAS
 
    From TVAAS and Bill Sanders.
 
Regarding TVAAS:
 
Several recent queries have dealt with the model(s) used in TVAAS.
Anyone interested in learning specifically how the model--in particular,
the teacher model--is defined and how the estimates are obtained,
there is an explicit definition in the paper cited in an earlier post,
"The Tennessee Value-Added Assessment System (TVAAS): Mixed-Model
Methodology in Educational Assessment" (Sanders and Horn, _Journal
of Personnel Evaluation in Education_ 8:299-311, 1994) on pages 305-309.
For those of you interested in how standard errors are obtained, property
2(c) on page 306 of the same article details that information.  If you
are unable to obtain a copy of this article, you may write to the UT
Value-Added Research and Assessment Center, P.O.Box 1071, Knoxville, TN
37901-1071.
 
To Gene Glass:  I hope that reading this article will give you a more
accurate picture of what we are doing.  After you have looked at it, you
may want to restate your questions.
 
To Leslie McLean:  Your plots of standard errors as calculated make no
sense.  Middle schools in the example school system we provided have more
students than intermediate schools in almost every case.  Thus, their
standard errors tend to be lower.  Middle schools also have smaller
expected nominal gains.  Therefore, your attempt to show a relationship
over grades is nonsense.
 
To Harvey Goldstein:  Thank God for you and your insights.  I am glad to
learn that you in the UK are obtaining comparable sensitivities using
similar approaches.
        As evidenced by the estimated mean gains and the relatively small
standard errors, clearly many schools can be distinguished as deviations
from the average school within a district.  Even though this is not the
objective of TVAAS, it does show relative sensitivity.
 
To those of you worrying about conclusions reached from a specific type
of test, let me share with you recent findings (as yet unpublished)
resulting from the merging of data from assessment instruments other
than TCAP  into the master database at UTVARAC.
 
We have recently completed merging the 10th grade PLAN (previously known
as the PACT) and the 12th grade ACT scores for the last three years and
the Tennessee Writing Assessment data into the master database.  This
database is now comprised of more than three million records.  What we
have found is that the differences among school systems in the scores of
10th and 12th graders are huge, even after holding 8th grade achievement
level constant. Further, the findings are relatively consistent
regardless of the test data used.
 
Additionally, Writing Assessment data is being analyzed in conjunction
with the more traditional forms of testing to evaluate how much unique
information is available from the writing assessment.  This work will
continue, and we will be writing and reporting more of it in the
next several months.
 
To those of you who are concerned about the understanding of the TVAAS
reports by Tennessee educators, let me share observations based on scores
of phone calls and numerous presentations across our state.  Educators'
understanding and attitudes vary widely.  In those systems in which
superintendents, supervisors of instruction, and/or principals have
worked to learn and share the diagnostic value of the analyses,
positive attitudes and progressive plans are leading to improved
academic growth opportunities for thousands of Tennessee youngsters.
 
As educators strive to improve gains, they have identified the following
practices as some of the primary impediments to be overcome:
 1) excessive re-teaching;
 2) failure to communicate over grades;
 3) the enormous effects of building change
  (see Sanders, W. L., et al. (1994). Effects of Building Change on
  Indicators of Student Academic Growth, _Evaluation Perspectives_. 4.3-7);
 4) "lock-stepping" instruction to the detriment of many high achieving
  students, etc. etc.
 
Finally, for those of you who have complained about the lack of a timely
response to technical matters relating to TVAAS, if you could have
forwarded a magic cure for flu to me two weeks ago, then my responses
would have been far more punctual.
 
As I stated in my original response to Dorn, we will respond to
legitimate criticism from all of you to the best of our ability.  That
still stands.  However, I do not feel it is necessary for me to rewrite
and post to this forum those things which have been previously published
and rigorously reviewed. We will furnish citations, instead.
 

From:         Greg Camilli 
 
Les,
 
I think your distinction between SE and S&E is a clear and
elegant statement. It is a must-read for anyone interested in
how statistical models are likely to behave in policy contexts.
I'd like to throw in two additional cents:
 
1. Because some statistical models are complex, and understood by few,
   it is ironic that this initially evokes more (rather than less)
   credibility. The downside (or upside depending on how you look at
   it) is that when a small crack in the model's facade appears,
   the public and policy makers can be very unforgiving. A relatively
   small equating anomaly in a New Jersey state test nearly caused
   the demise of the testing program. Moreover, when such a crack can
   be patched, it creates an atmosphere in which technical personnel
   are *less* motivated to diagnose future problems.
 
   I think TVAAS is certain to encounter a related problem with its
   "linear metric." How is it, the press may ask, that gains are so
   much larger in the earlier than the later grades? Does this
   mean that students aren't learning very much in high school?
   Moreover, because the standard errors are likely to be different
   across districts, larger districts might have to achieve smaller
   gains to be consistent with the law. Does this imply different
   standards for different districts? (I recognize that larger
   districts have to pull up more kids to achieve a SE's worth of
   gain -- but I'm not sure this type of argument would wash since
   a SE may be only a baby step toward the national average.)
 
2. The "natural" sample that exists on any given day does, I suppose,
   give rise to a superpopulation of the sort that Harvey Goldstein
   writes of. However, this is not the population about which most
   people think when evaluating gains since, as Bill Hunter points
   out, it is not a random sample from the school's student body.
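
   The point in item 1 about standard errors and district size can be
   illustrated with a small sketch. It assumes that the standard error
   of a district's mean gain is simply sd / sqrt(n), as under simple
   random sampling; this is an assumption for illustration and not the
   way TVAAS computes its standard errors. The student-level standard
   deviation and the district sizes are hypothetical.

```python
import math


def se_of_mean_gain(student_sd, n_students):
    """Standard error of a mean gain, assuming simple random sampling."""
    return student_sd / math.sqrt(n_students)


STUDENT_SD = 40.0  # hypothetical SD of individual gain scores (scale points)
districts = {"small": 100, "large": 10000}  # hypothetical enrolments

for name, n in districts.items():
    se = se_of_mean_gain(STUDENT_SD, n)
    # A gain "worth" two standard errors is far smaller in the large district.
    print(f"{name:5s} n={n:6d}  SE={se:5.2f}  two SEs={2 * se:5.2f}")
```

   Under these assumptions a gain of one standard error is ten times
   smaller, in scale points, for the district with 100 times as many
   students, which is the sense in which a rule stated in SE units
   could imply different standards for different districts.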
 
 

From:         Bill Hunter 
 
Per Greg C.:
 
> 2. The "natural" sample that exists on any given day does, I suppose,
>    give rise to a superpopulation of the sort that Harvey Goldstein
>    writes of. However, this is not the population about which most
>    people think of when evaluating gains since, as Bill Hunter points
>    out, it is not a random sample from the school's student body.
 
 
I need to clarify a bit.  I think it is not the case that a
sample of convenience "gives rise to" or "implies" a population
of any sort (unless one chooses to regard the sample _as_ a
population).  As far as I can tell this thinking is exactly
backwards--samples derive their meaning and existence from
populations: I cannot see that the reverse order has any meaning
at all.  I also question the utility of Harvey G.'s conception of
such samples as samples from a population in time.  This _might_
make sense in a time/space  of great stability, but I see little
reason to believe that children four or five years from now will
have experiences of the world (especially the world of
information) that are comparable to those of children of today (or
five years past).  The kinds of changes that required revision and
re-norming of intelligence tests every 15 or 20 years half a
century ago now take place in five years or less--probably about
the same time scale that would be required to conscientiously
develop and renorm the test.
 
Moreover,  I think it is not just that such a sample is not a
random sample from some _specific_ population (as Greg suggests
above), but that it is not a random sample of ANY population for
two reasons:
1) the process of selection did not ensure equal and independent
likelihood of selection for all members of the population and,
more importantly,
2) no population was specified (to which the above process was
not applied)
 
(Sorry about the double negative.  I'll have to stop watching Seinfeld.)

From:         Leslie McLean 
Subject:      Error plots: clarification
 
On January 18, Bill Sanders wrote (via Rick Garlikov--and along
with many other topics):
 
>To Leslie McLean:  Your plots of standard errors as calculated make
>no sense.  Middle schools in the example school system we
>provided have more students than intermediate schools in almost
>every case.  Thus, their standard errors tend to be lower.  Middle
>schools also have smaller expected nominal gains.  Therefore, your
>attempt to show a relationship over grades is nonsense.
 
It was indeed the point I was making--that the plot (or
correlation) over grades made no sense.  That is why I argued that the
within-grade correlations were the ones to look at--and that they were
around 0.0.
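
A minimal sketch of the within-grade versus pooled comparison described
above. The numbers are hypothetical and were chosen only so that the
pattern is visible: two school types that differ in both standard error
and mean gain, with no relationship between the two inside either type.
They are not the values from the example school system.

```python
import math


def pearson(x, y):
    """Pearson correlation computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical schools: standard error of the gain, and the mean gain.
se_intermediate, gain_intermediate = [5.0, 6.0, 7.0], [31.0, 30.0, 31.0]
se_middle, gain_middle = [2.0, 3.0, 4.0], [11.0, 10.0, 11.0]

r_intermediate = pearson(se_intermediate, gain_intermediate)
r_middle = pearson(se_middle, gain_middle)
# Pooling the two school types mixes a between-group difference into r.
r_pooled = pearson(se_intermediate + se_middle,
                   gain_intermediate + gain_middle)

print(f"within intermediate: {r_intermediate:.2f}")
print(f"within middle:       {r_middle:.2f}")
print(f"pooled over grades:  {r_pooled:.2f}")
```

In this constructed case the within-group correlations are zero while
the pooled correlation is large, purely because the groups differ in
size and in expected gain.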
   BTW, if means in a table are based on widely different Ns, you
would do your readers a good turn to say so, don't you think?  Your
remark that "middle schools also have smaller expected nominal gains" is
ambiguous and interesting.  In what sense "expected"; in what sense
"nominal"?

From:         Sherman Dorn 
 
>The Tennessee's 1992 Educational Improvement Act lists several
>references pertaining to the technical aspects of TVAAS's "mixed
>model methodology" for value-added learning assessment.  In addition
>to the citation coauthored by Professor Sanders (posted 1/18/95) are
>the following:
 
According to both Sanders & Horn (1994) and McLean et al. (1991), the
Henderson article is the root of the mixed model Sanders and McLean
developed.  (I also think Sanders mentioned this in a previous post.)
I would hope that someone else, preferably with some more appropriate
statistical background than I, would take the time to read the Sanders &
Horn, McLean et al. (American Statistician, 1991), and perhaps Henderson
or the SAS Stat chapter on their PROC MIXED in order to help us here.
 
Conceptually, I gather the rationale for using a mixed model on gain scores
is as follows:  using raw scores to evaluate teachers is unfair because of
the initial capacity of students to perform on tests.  Using just gain
scores is better,
but one needs to view various "effects" as random because of (a) individual
students' varying responses to a teacher, and (b) measurement error.
According to Sanders and Horn, the mixed model solves the problem of
regression to the mean and gain scores through the viewing of effects as
random.  They also claim that the calculus behind TVAAS is able to
use partially-censored student records, eliminating the problems with
complete-data analysis when, for example, Joey had the flu on the day his
school was conducting the math tests and thus all scores for him would be
thrown out.
 
My statistical concerns about this are less with the model (though I've tried
to puzzle through the matrix algebra) than with how it connects
with real-world schooling and learning.  Gene noted a long time ago his
supposition that one would expect low-performing
students on tests to have lower gains than students who initially perform
highly on the tests.  TVAAS folks have vigorously disputed this, claiming
that their records indicate low-performing students can expect, on average,
equivalent gains to higher-performing students.  I am assuming, though I
have not asked, that this information is individual student scores,
rather than an ecological view through teacher or school effects measured
against average initial scale scores.
 
My concern is the use of gain in SCALE scores, which are
derived from a cross-sectional 1989 norming of the test used for
TVAAS -- the CTBS.  To put it simply, the third grade norming
population is not the equivalent of the second grade norm population
aged a year.  (Not only do age cohorts' experiences differ, but flunking
students and migration change the age-grade composition of grades.)
Thus, we really don't know whether a scale score of, say, 700 on the
third-grade CTBS would be the same thing as if the 1989 second grade
had been tested a year later in the third grade.  Probably both the mean and
the standard deviation would be different, which means that scale scores
would undergo some linear transformation.  For this reason, as well
as Gene's concern, I have suggested to UTVARAC staff that they
use previous year scores as an independent variable, with this year's
scores as the dependent variable, rather than the gain score approach.
According to them, adding a variable and rerunning the entire system
would not be very onerous at all.  Thus far, UTVARAC staff have not
responded to this suggestion.
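
A minimal sketch of the difference between the two approaches, using
hypothetical scale scores constructed so that this year's score is an
exact linear function of last year's with a slope below 1. It is an
illustration of the suggestion above and not a description of the
TVAAS mixed model. The gain-score approach in effect fixes that slope
at 1; the regression approach estimates it.

```python
def ols_fit(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical scores: current = 0.8 * prior + 170 for every student.
prior = [600, 620, 640, 660, 680, 700]
current = [650, 666, 682, 698, 714, 730]

# Approach 1: simple gain scores (implicitly assumes a slope of 1).
gains = [c - p for c, p in zip(current, prior)]

# Approach 2: prior score as the independent variable.
slope, intercept = ols_fit(prior, current)
residuals = [c - (intercept + slope * p) for c, p in zip(current, prior)]

print("gains:    ", gains)
print("slope:    ", round(slope, 3), " intercept:", round(intercept, 1))
print("residuals:", [round(r, 6) for r in residuals])
```

With these made-up numbers the gains fall steadily as the prior score
rises, although every student follows the same rule, whereas the
residuals from the regression are all zero. Any linear rescaling
between grade norms is absorbed by the estimated slope and intercept.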

=========================================================================
Sherman Dorn: 
Now, as an historian, I would be remiss if I didn't point out
that Tennessee already had high-stakes testing well before
1992 and the Educational Improvement Act.  And TVAAS
is not, in itself, a test.  It is a statistical system for analyzing
test scores.  My concern is that TVAAS' existence will
cement in place any mutual distrust between teachers and
policymakers that currently exists, and exacerbate
organizational problems that TVAAS cannot respond to.  I
should note, also, that with Lamar Alexander as governor in
the 1980s, Tennessee was at the forefront of the earlier wave
of school reform, with higher graduation standards, the
career ladder program, and the first wave of high-stakes
testing.  As with the others, I doubt that TVAAS will
facilitate much improvement, in its current form at the
center of educational evaluation.
___________
Sherman Dorn
Vanderbilt University
dornsj@ctrvax.vanderbilt.edu
=========================================================================

From:         Gene Glass 
Subject:      Glass Replies to William Sanders about TVAAS
 
  On Wednesday, January 18th, William Sanders wrote the
  following in response to my repeated requests that he
  address, among other questions, the matter of how the TVAAS
  deals with ability differences in students who happen to
  appear in particular teachers' classes:
 
  "To Gene Glass:  I hope that reading this article will give
  you a more accurate picture of what we are doing.  After you
  have looked at it, you may want to restate your questions."
  (Sanders)
 
     I don't appreciate being fobbed off in this manner and
  I am in an even more atrabilious mood for having traipsed to
  the library to copy the article referred to (Journal of
  Personnel Evaluation in Education 8:299-311, 1994). It
  contains no adequate information on the question I asked.
  Indeed, it contains six superficial pages on teacher
  evaluation and a four-page statement of a completely
  unextraordinary mixed effects linear model. (I too, Mr.
  Sanders, can write mixed effects models--Glass & Hopkins,
  Statistical Methods in Education & Psychology, 2nd Edition,
  1984, pp. 465-474; so don't imagine that I am intimidated
  when someone flashes four pages of equations in my face
  instead of being responsive to my inquiry.)
 
     The question is whether the TVAAS system takes adequate
  account of differences in student, and hence, school class
  ability (as measured by generally accepted ABILITY tests).
  An examination of Sanders and Horn's article reveals clearly
  that it does not; I repeat, it does not. In my opinion, the
  implementation of the system is unfair to teachers for this
  reason, and very likely other reasons.
 
     The S&H article makes a few references to student
  ability:
 
  1. An early 1980s application of the method "rendered the
  following findings: .... 5. Student gains were not related
  to the ability or achievement levels of the students when
  they entered the classroom." (p. 300) The documentation for
  this peculiar finding (which if taken seriously would, of
  course, imply that bright and dull students make the same
  achievement gains in a school year--a palpable and self-
  contradictory absurdity) is an unpublished report (Working
  Paper No. 199) from the UT College of Business
  Administration (McLean & Sanders, 1984). (It is an irony
  in need of clarification that the McLean with whom Sanders
  has collaborated on some statistics articles is not the
  McLean (Les) who has here taken serious issue with the TVAAS
  approach.) I don't consider unpublished working papers as
  adequate documentation for so extraordinary an assertion.
  They are unreviewed and unpublished. So reference to the
  article is no more helpful than is repeating the
  unsubstantiated assertion made weeks ago here that student
  ability is unrelated to teacher effects in the TVAAS system.
  (How, I might ask, does TVAAS imagine that the positive
  correlation between ability and achievement arises in the
  world if more able students do not learn at a faster rate
  than less able ones?)
 
  2. In the section of the paper entitled "Problems of using
  Student Achievement Data in Educational Assessment," Sanders
  and Horn write: "Since random assignment of students to
  teachers is usually not practiced and seldom is possible,
  simple means of class achievement test scores are seriously
  biased by many factors other than teacher influences that
  affect student learning. Travers (1981) listed (1) teachers
  influences, (2) parental influences, (3) genetic endowment,
  (4) other school influences, and (5) availability of
  materials as being some of the most important factors that
  determine the rate of student learning." (p. 304) The
  reference to Travers (1981) is to a chapter (referenced by
  page numbers or title) in the Handbook of Teacher Evaluation
  edited by Jason Millman, not "Millmen" as Sanders and Horn
  report.
 
  3. Sanders and Horn go on to write at the bottom of page 304
  that "Obviously, any system that will fairly and reliably
  assess the influences of teachers on student learning must
  partition teacher effects from these and other factors." The
  reference to "these and other factors" is to both
  Travers's list and Bingham et al.'s unremarkable assertion
  that student test scores are influenced by many things
  including family characteristics, personal characteristics
  and the like. They go on: "However, it is a hopeless
  impossibility for any school system to have all the data for
  each child in appropriate form to FILTER (emphasis added)
  all of these confounding influences via traditional
  statistical analysis." (p. 304)
     We learn two things here: 1) that Sanders and Horn will
  soon tell us that they are not going to employ any measures
  of these student background characteristics in their system
  and 2) that they regard their statistical analysis as more than
  merely "traditional."
 
  4.  The next paragraph delivers the news: "Using a different
  approach, the three studies conducted by Sanders indicate
  that these influences can be FILTERED (emphasis added again)
  without having to have direct measures of all of the
  concomitant variables. By focusing on measures of academic
  gain, each student serves as his or her own 'control'--or in
  other words, each child can be thought of as a 'blocking
  factor' that enables the estimation of school system,
  school, and teacher effects on the academic gain with the
  need for few, if any, of the exogenous variables." (p. 305)
  This is an incredible and patently ridiculous claim. It says
  that "gain scores" gotten by correcting posttest achievement
  scores via least-squares mixed model estimates from pretest
  achievement scores can be safely assumed to control for
  "exogenous" (merely, in this context, "not measured")
  variables such as social class, race, intelligence, culture
  and many others. It is said that pretest achievement scores
  will be the "filter" through which these exogenous
  influences operate on the posttest scores, hence, the
  student background characteristics can be ignored. If this
  is an assumption of the system, it is clearly contestable
  and nearly certainly false; if it is a belief about what
  accounts for variation in achievement scores, it is
  incompetent.
 
  5.  They go on: "In an attempt to partition the teacher and
  school effects from the partial confounding with class
  ability level, the well-known linear model techniques of
  analysis of covariance and ordinary multiple regression have
  been suggested by Millman (1981) and others. The obvious
  intent was to adjust differences that exist among students
  to enable a fairer evaluation of teachers. However, if these
  simple approaches are applied, and even if all of the
  concomitant data were available, still unanswered is the
  well-known problem of regression to the mean of the teacher
  effects that would provide unfair rankings of teachers with
  varying quantities of student achievement records." (p. 305)
  Millman (spelled correctly this time) is cited but the
  bibliography contains only the citation "Millman, J (ed.).
  (1981). Handbook of Teacher Evaluation. Beverly Hills:
  SAGE." I presume that Sanders and Horn are referring to a
  specific chapter in the Handbook that Millman edited, though
  they don't cite one. I also assume that the chapter in
  question is in fact Millman's own chapter on using student
  test scores to evaluate teachers. When Millman and Darling-
  Hammond edited the Second Handbook on Teacher Evaluation,
  Jay Millman asked me to write the chapter on Using Student
  Test Scores to Evaluate Teachers. I will be happy to send an
  email copy of this chapter to anyone who requests it. As
  those who know me might guess, I had nothing good to say
  about the practice. But back to Sanders and Horn--the last
  sentence quoted above beginning with "However, ..." is
  simply unintelligible to me. What is clear, however, is that
  far from lacking "all of the concomitant data," the TVAAS
  has NONE of the concomitant data.
 
     It is clear in the Sanders and Horn piece that they
  believe that their mixed model estimation procedure solves
  problems of unmeasured exogenous variables (ability, social
  class, race and the like) and provides a fair comparison
  among teachers: "If the problem (of estimating teacher
  effects) is viewed not as a fixed-effects problem but rather
  as a mixed-model problem with both fixed and random effects,
  then much established theory and methodology exist that
  offer solutions to many of the problems that have been cited
  as reasons for not doing educational outcome assessment from
  student achievement data." (p. 305) This is absolute
  nonsense. If information on intelligence of the class of
  students, their family life and the like is not measured and
  included in the model, it does not somehow magically appear
  compliments of solving the normal equations. The only way
  that such background influences can be assumed to enter the
  TVAAS system is through the "filter" of pretest scores; to
  think that this "filter" is sufficient to correct for the
  background characteristics is simple fantasy.
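
  The argument above can be put in the form of a small constructed
  example. It assumes, for illustration only, that gains depend on an
  unmeasured ability variable; whether they do is exactly the point in
  dispute, and the coefficient, class rosters and names below are
  hypothetical. Both teachers are given identical effects by
  construction, and students are not randomly assigned.

```python
from statistics import mean

BASE_GAIN = 30.0
GAIN_PER_ABILITY_POINT = 0.5  # hypothetical: growth depends on ability
TEACHER_EFFECT = {"A": 0.0, "B": 0.0}  # equally effective by construction

# Hypothetical classes; ability is a deviation from the overall average
# and is NOT available to the analyst.
classes = {
    "A": [10, 12, 8, 14, 6],
    "B": [-10, -12, -8, -14, -6],
}


def simulated_gain(ability, teacher):
    """Gain for one student under the assumed generating rule."""
    return BASE_GAIN + GAIN_PER_ABILITY_POINT * ability + TEACHER_EFFECT[teacher]


# Mean gain per class: what an analysis of gains alone would see.
mean_gain = {
    teacher: mean(simulated_gain(a, teacher) for a in abilities)
    for teacher, abilities in classes.items()
}
apparent_difference = mean_gain["A"] - mean_gain["B"]

print(mean_gain, "apparent teacher difference:", apparent_difference)
```

  Under this generating rule the two classes differ in mean gain even
  though the teachers do not differ at all, and nothing in the gain
  scores themselves reveals that the difference comes from the omitted
  variable.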
 
 
     It is clear from reading the TVAAS responses to
  questions posed here over the past few weeks and from
  reading their latest published exposition of the method that
  they have no appreciation of the validity question
  whatsoever. Nor do they appear to have studied and taken
  into account the psychometric literature on these problems,
  which is abundant and well-known: Harris, C.W.(ed.) (1963)
  Problems in Measuring Change. Univ of Wisconsin Press;
  Cronbach, L.J and Furby, L. (1970). How do we measure "change"--
  or should we? Psychological Bulletin, Vol. 74, 68-80.
  Cronbach, L.J. (1982) Designing evaluations of educational and
  social programs. Jossey-Bass.  These are a minimum set of
  references for grappling with these problems.
 


From:         Greg Camilli 
 
Communication does seem to be a problem lately. Rick doesn't understand
our differences (or those with Tom and John). I don't understand them,
and Rick's latest communication is even more baffling to me. I also don't
understand Harvey's position on superpopulations, though doubtlessly
generations of statisticians assume this as axiomatic. (As for William
Sanders, I wouldn't classify his post as serious communication at all. I
will, however, read the articles in good faith.)
 
In a strange way, Rick and Harvey are saying something similar. Harvey
talks about superpopulations; these are entities that don't exist, except in
the imagination. Yet it is contended that it is a "reality in the sense that
further batches of students are samples from it. How else would you make
sense of anything?" A lot of people have sought to answer this question,
among them Allan Birnbaum, who paraphrased the likelihood principle as
the "irrelevance of outcomes not actually observed." He went on to write
of the "immediate and radical consequences for the everyday practice as
well as the theory of informative inference." As for the superpopulation, it
exists in one's mind as a vehicle for generalization. But generalization itself
requires more worldly knowledge. For example, consider the standard
error of a statistic calculated from a poll during an election. You might say a
population exists, but only for a limited amount of time. Experience with
the rate of change in public sentiment (and the way the question is asked) is
required for a valid generalization. Happily, however, we are in full
agreement on the role of specification error, as masterfully  articulated by
Les. (Because William Sanders thanked God for Harvey, one assumes he
also agrees on this point. I'm also thankful for Harvey's participation.)
 
Rick has a world in mind where he suspects that the rest of us presume
that "of course teachers will try to psych out the test, and they ought to do
that for their own personal protection." Some of us  "might even think that
this is the best way to get good scores on such tests or that it is natural for
many teachers to believe that -- even if they feel it is an improper way to
teach. I have been assuming that it is NOT the best way to get test scores,
and it is NOT the best way to teach." And further "In any case, it seems to
me to be a dishonorable and a very strange thing for a profession as a group to
acquiesce to a bad policy in such a way that they spend more effort coping
with it than trying to correct it. You make it sound like the legislature, the
media, and the public do not care whether tests are predictable or not. I
don't think it is idealistic to think that the public's or media's
interest is that unconcerned about this whole thing."
 
Rick, the educational profession cares that teachers do not teach to tests
because it dilutes the content of the curriculum. Most people in testing
programs also believe this. It seems to me that you are arguing with
someone who thinks tests should be used to dumb down the curriculum. I
believe as you believe, but not all you believe. I also think tests can be good
or bad, taught to or not taught to, and that some may cheat while others
struggle to maintain integrity. And I think that one can label anything as
political.  It's a term that is often used in lieu of a sensible explanation of
how something has come to pass, and what can be done about it. Most
people think this way. (There, I've done it, I've created a superpopulation.)
 
When I read your messages, it strikes me that you are reflecting on how
people are thinking and the language they are using, rather than the
content of the messages.  The blanket characterization about educators is
out of line, in my opinion. I think we are trying to expose and correct bad
policies, that's what this discussion is about, and it is not without
precedent. In 1988, John Cannell published an article "Nationally Normed
Elementary Achievement Testing in America's Public Schools: How All 50
States Are Above the National Average." Mr. Cannell is a doctor (MD)
not a measurement specialist. This article (whose gist is given in the title)
stirred widespread attention inside and outside the academic community.
The effect he noted is now widely called the "Lake Wobegon" effect.
Anyone familiar with this article knows the public is interested in good
testing practice. The problem has been researched extensively, in both
universities and testing companies. The results are clear: testing
programs, whether the test is good or bad, can have unintentionally
harmful effects.
 
Perhaps the communication problem results from a "type" mismatch of our
arguments based on experience with your arguments based on a more
technical sorting of language and logical form. Finally, I'm not bemoaning
TVAAS. I don't yet know how well the program works, but I do ask for
information and I am skeptical. Moreover, the TVAAS staff can probably
learn more from our skepticism than warm congratulations on a job well
done. They too are involved in a sorting of language (say test scores) and
logical form (say statistical model) and could benefit from our experience.
 



From:         Scriven@AOL.COM
 
Aha! At last the battle has been joined. I look forward to a response by Bill
S. However, I do hope that both parties, or at least the spectators, can keep
in mind the need to go forward with the basic question: is there a feasible
method to get at least an approximate estimate of the extent to which
teachers are contributing to student learning?
 
Whether Bill has got it or not, there's a worthwhile task here--getting the
best feasible model. If it's not very good, that isn't important. As long as
it's better than failing to bring in outcomes to teacher evaluation, and not
too expensive, then it's worth having. Remember that even now no teacher is
at risk on the basis of the TVAAS results alone: there has to be other
confirmatory data. (Sure, we need to look at whether that gets biased by
knowledge of the statistics results, but if not, it can easily be done
independently.)
 
So I hope we can keep our collective eyes on the goal of getting an idea of
whether any errors here are correctable to the point where we have a useful
device for correcting the appalling alternative of judging teaching without
reference to outcomes.
 
Michael Scriven



From:         Greg Camilli 
Subject:      CTB Scale scores
 
I thought that some of you might want to take
a look at some statistics regarding the metric
of the scores that TVAAS uses. Below, I've given the
mean, median and standard deviation of the IRT metric
for fall reading comprehension as reported in the CTBS/4
Technical Bulletin 1 (1989). (I hope this isn't too far
out of date.)
 
Grade    mean    median     STD
 
1        473     481        84.3
2        593     606        81.1
3        652     657        59.6
4        685     694        53.6
5        707     714        48.6
6        725     730        43.8
7        733     738        43.6
8        745     750        43.1
9        760     764        38.6
10       770     774        39.6
11       776     780        38.2
12       780     782        38.0
 
If you plot these data by grade, some interesting possibilities
emerge. For example, one wonders why students below average
gain as much as students above average. The explanation I see
is that there is much less room for growth at higher grade
levels, but this is a function of the scoring metric. A
transformation of scale might lead to different results.
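
For anyone who would rather compute than plot, a minimal sketch using
only the figures in the table above. The "standardized gain" here is
simply the difference in adjacent grade means divided by the lower
grade's standard deviation; that choice of denominator is an
assumption made for illustration, and these are cross-sectional norm
differences, not gains for the same students.

```python
# (grade, mean, median, sd) for fall reading comprehension, as posted.
ctbs = [
    (1, 473, 481, 84.3),
    (2, 593, 606, 81.1),
    (3, 652, 657, 59.6),
    (4, 685, 694, 53.6),
    (5, 707, 714, 48.6),
    (6, 725, 730, 43.8),
    (7, 733, 738, 43.6),
    (8, 745, 750, 43.1),
    (9, 760, 764, 38.6),
    (10, 770, 774, 39.6),
    (11, 776, 780, 38.2),
    (12, 780, 782, 38.0),
]

mean_gains = []
standardized_gains = []
for lower, upper in zip(ctbs, ctbs[1:]):
    gain = upper[1] - lower[1]            # difference in grade means
    mean_gains.append(gain)
    standardized_gains.append(gain / lower[3])  # in lower-grade SD units
    print(f"grade {lower[0]:2d} -> {upper[0]:2d}: "
          f"gain {gain:4d}  ({gain / lower[3]:.2f} SD)")
```

The difference between adjacent grade means drops from 120 scale
points between grades 1 and 2 to 4 points between grades 11 and 12,
and the standard deviation shrinks by more than half over the same
range.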
 



From:         Sherman Dorn 
 
Michael Scriven writes:
 
>Whether Bill has got it or not, there's a worthwhile task here-getting the
>best feasible model. If it's not very good, that isn't important. As long as
>it's better than failing to bring in outcomes to teacher evaluation, and not
>too expensive, then it's worth having.
 
This assumes that (a) there is no alternative to TVAAS to bring outcomes
into teacher evaluation, and (b) those of us who criticize the present place
of TVAAS in Tennessee official policy [and the potential for similar
statistical systems elsewhere] are therefore against judging teachers by
what students learn.  Neither is true.  (Counterexample:  go read
the issue of Exceptional Children, vol 52 [1986], on formative evaluation.)
 
Also, the state of Tennessee has spent millions over the past ten years,
including the money spent to develop TVAAS, in having students complete
annual high-stakes tests.  I would call that rather expensive.
 
Furthermore, the state of Tennessee has, through its promotion of high-stakes
testing, demonstrated to its teachers the profound mistrust policymakers
have of them, further eroding the legitimacy in teachers' eyes of ALL
attempts to judge them by students' outcomes.  I would call that rather
expensive.
 
>Remember that even now no teacher is
>at risk on the basis of the TVAAS results alone: there has to be other
>confirmatory data. (Sure, we need to look at whether that gets biased by
>knowledge of the statistics results, but if not, it can easily be done
>independently.)
 
What you've written reads to me as the following:  "I don't care if
TVAAS is a bad model.  It's a model.  Besides, it's a model that doesn't
really matter.  So let's go on working with it."  This is beginning to
sound like the kettle defense.  ("I never touched the kettle; I only
borrowed it; it was broken when I first took it.")  I hope that's not what
you meant.



From:         Leslie McLean 
Subject:      Validity, again (no apology)
 
   As I have come to understand the TVAAS, it goes like this:
 
   These numbers are measures of gain in student achievement over a
school year, and our PROCEDURE entitles you to read them as measures of
teacher competence--the larger they are, the better the teacher, and v.v.
 
   Many of us, based on lots of experience, are extremely skeptical of
this claim, and so far the explanations have only increased our
skepticism.  The procedure is SO questionable (from test content to
scaling method to model-fitting and adjustment) that any interpretation
such as I have given above should be withheld pending supportive
evidence--talk about "meaning and values in measurement and evaluation"!
 
  As many posts have argued, the TVAAS numbers become the only evidence
to get attention, not because the TVAAS folk say so but because the sheer
size and bulk and prestige of the enterprise drive out other measures.
Messick ended his speech with a quotation from one Liam Hudson (1972),
from his book, The Cult of the Fact (p. 125).  Hudson argued that social
science:
     ...should be pictured not as a society of good men and true,
     harbouring the occasional malefactor, but rather, as one in
     which everyone is searching for sense; in which differences
     are largely those of temperament, tradition, allegiance and
     style; and in which transgression consists not so much in a
     clean break with professional ethics, as in an unusually high-
     handed, extreme or self-deceptive attempt to promote one
     particular view of reality at the expense of all others.
 
 
More strenuous efforts would appear to be required to avoid such a
transgression in Tennessee.



From:         Leslie McLean 
Subject:      More on validity (with apologies)
 
     All the posts about statistics, including Gene Glass's response to
Bill Sanders, may have left some of you wondering whether it all
matters, or what it means.  Michael Scriven reminds us that "outcomes"
(by which he means what students learn) must not be left out of the
data when the competence of teachers is assessed.  He gives us a
version of what I heard Patrick Suppes say many years ago, "Anything
worth doing is worth doing badly".  Suppes (a mathematical logician by
trade) was quite serious, but his context was the very early days of
computer-assisted instruction (remember CAI?).  The context today is
VERY different--high-stakes testing programs with real consequences
for teachers and school officials.
    Peace, Michael, but I do not think we can advocate any measure of
teacher effectiveness that has not been tested and validated against
several different ways of judging teacher competence.  Patrick Suppes
was working at the leading edge of research, with no serious
consequences for anyone if he was wrong.  He made many contributions,
and  quite a bit of money, and no one should be critical of his
entrepreneurial spirit and his creativity.  Neither should we confuse
the context of the late 1960s with the legislated accountability of
the middle 90s.  Oh, I know, you work with Dan Stufflebeam all the
time and you know all about the middle 90s.  Your post does not
reflect this knowledge.  (As you will see, I do not believe that we
yet have a measure, or set of measures, valid enough to include in the
evaluation of teacher competence.)
   So what about all this statistics talk? Gene Glass has indicated
his reservations about TVAAS's "mixed models" (perhaps I understate
the case).  What ALL the regulars on Edpol should consider (those that
have not already done so) is that statistics has moved on well beyond
"mixed models", in response to the complexity of social
groupings--ESPECIALLY students within classes within schools within
districts within states (within countries!).  Since Gene has listed
his excellent book (with Hopkins), let me cite my own contribution to
the post-mixed-model literature, an application of multilevel models
(with the essential contribution of Harvey Goldstein): McLean, et al.,
(1988) The reliability of the oral examination in internal medicine.
Journal of the Royal College of Physicians and Surgeons of Canada, ...
...).  The study is in education (of medical specialists) and is in
the mainstream of generalizability theory.
 
There have been two main channels of development and two major
tributaries to better understanding of school achievement:
   1. Murray Aitkin and Nick Longford's early work, parallel to Harvey
Goldstein's (e.g., 1986, J. Royal Stat. Soc. B, 149: 1-43).
Tributary.
   2. Harvey's book, the first, Multilevel Models in Educational and
Social Research.  New York: Oxford U. Press, 1987.  Harvey's group at
the Univ. of London Institute of Education offers computer software
(ML2 and ML3) and training.  Mainstream.
   3. Bryk and Raudenbush, at about the same time: Bryk et al. (1986)
An introduction to HLM: Computer program and user guide.  University
of Chicago.  (HLM: Hierarchical Linear Models--see previous Edpol
posts).  B&R and Co. offer computer software for fitting two-level
models and do extensive training.  Mainstream.
   4. Nick Longford and Murray Aitkin's later work, extending their
earlier models and offering computer programs.  Tributary.
 
   So what? We have been led to believe that the TVAAS utilises the
latest developments in statistical models in order to produce
estimates of gain that truly reflect what students are learning.  Yet
the published sources cited fail to support this claim--in fact the
sources fall quite far short.  Either Dr. Sanders and his associates
are hiding their light under a bushel or their claims are not
supported--these seem to me the only alternatives.  Throughout the
interchange on Edpolyan (I hesitate to call it a discussion, given the
reticence of the TVAAS folk), a nagging suspicion has grown among
those of us who struggle with test scores and models thereof that the
TVAAS structure is a house of cards.  Dr. Sanders will no doubt regard
this as another bit of "nonsense" from me, but his contributions have
yet to add to the "sense".  Given the salience of the system in
Tennessee, we all hope to have the sense revealed to us.


 


From:         Scriven@AOL.COM

Of course, anything worth doing should be done as well as we can, and I'm
counting on you, Gene, and Bill S to get the best measure we can. But it
seems to me you're talking unrealistic standards, Les, when you say that we
don't yet "have a measure... valid enough to include in the evaluation of
teacher competence." One must look at the validity of the measures we
currently use in order to see whether TVAAS is "valid enough".
 
Short of finding a teacher in flagrante delicto with a student, what we use
is a sorry bunch of variables, ranging from the process variables observed on
a visit to the classroom, some of them vaguely correlated with learning gains
via process/outcome research (none of these are legitimate), through reports
from parents, noise heard through the classroom walls, all the way to
evidence of enrolment in post-grad studies.
 
I think TVAAS still looks like a better addition to that pile than most of
the stuff in it at the moment.
 
In any case, I'm suggesting that we should judge it in terms of whether it
enables us to do better, not just point out that it doesn't do as well as
would be desirable.


From:         Scriven@AOL.COM
Subject:      Re: Dorn on Scriven
 
No, Sherman, what I'm saying isn't much like your version of it. If someone
has a better student-outcome measure of teacher merit than TVAAS, let's show
it's better, not just point out imperfections in TVAAS. I know of many
allegedly better efforts, but none that hold up under the kind of heavy fire
that TVAAS is getting here. If you have a better one in mind, try explaining
it here and let's see. In any case, whether you do or not, the standard for
criticism of it and TVAAS has to be whether it's better than nothing in the
outcome dimension, not whether it's flawless.
 
As to your claims that (i) I'm assuming that all critics of TVAAS  are
"against judging teachers by what students learn" and that (ii) in pointing
out that there's a safety net for teacher evaluation against errors in the
TVAAS model I'm saying the TVAAS is of no value, your logic in imputing them
to me seems pretty far-fetched. In any case, I did not and do not intend any
such implications.
 
I'm trying to get us to keep reasonable standards in mind, not ideal ones;
that doesn't seem to be an effort that deserves the kind of overkill attack
you've launched on the suggestion.
 
By the way, I thought we were trying to avoid being condescending to each
other on this board. It seems to be pretty condescending to say to me "Go
read X" (and you'll see how wrong your assumptions are), rather than give
reasons here and now. I could throw references at you, too, but I try to get
the points across with a summary of reasons.



From:         Sherman Dorn 
Subject:      TVAAS
 
Michael Scriven writes:
 
>If someone
>has a better student-outcome measure of teacher merit than TVAAS, let's show
>it's better, not just point out imperfections in TVAAS. I know of many
>allegedly better efforts, but none that hold up under the kind of heavy fire
>that TVAAS is getting here. If you have a better one in mind, try explaining
>it here and let's see. In any case, whether you do or not, the standard for
>criticism of it and TVAAS has to be whether it's better than nothing in the
>outcome dimension, not whether it's flawless.
 
There are two pieces to this "better than nothing in the outcome dimension."
One piece is what you have explicitly described -- trying to include information
about student learning in teacher evaluation.  Gene's and Les' criticisms get
at that, and Gene has described at least one alternative.
 
A second piece, however, deals with the legitimacy of the tool within
school cultures.  If something (or a set of different somethings) is so bad that
it poisons the atmosphere for further attempts to include student outcomes
in teacher evaluation, then, yes, I'd argue that it can be possible to
be much worse than nothing in the outcome dimension.  It is very difficult
from the outset to get teachers to pay attention to all their students, and
the umpteenth attempt at accountability will just be treated as crying wolf.
 
The basics of formative evaluation, or curriculum-based assessment, as
described in special education journals and as I've described a few months
ago, is for a teacher to test students frequently and make instructional
decisions based on whether an individual student is meeting a pre-specified
goal.  Yes, this depends on competent teaching, appropriate selection of
goals (from what I gather, many teachers are timid about goals), and tests
that are decent and relatively easy to conduct and score appropriately.
 
But there are several advantages:  the information on the student is
gathered frequently, in a situation that, after a few times, should
be immune to practice effects, can be adjusted to be sensitive to student
progress for very low-performing children, is much more likely to be
seen as legitimate information by teachers than annual test scores,
and can be used for evaluation of the teacher (who should be selecting
appropriate goals, assessing appropriately, and making decisions in
response to student progress or lack thereof).  This type of assessment
for students was developed in a context of individualized programming
(for special education students), but there are also forms appropriate
to whole class assessment, and you could make parallel decision rules --
create tests that ask students to perform tasks related to curriculum
for the entire year, assess frequently, and respond instructionally to
what you see.  (And, at the teacher evaluation level, see if the teacher
is responding to the information.)
 
>As to your claims... I did not and do not intend any
>such implications.
 
I will certainly accept that, and mea culpa if it read as too personal a comment.
It did, however, seem rather strange to be arguing that poor methods are
better than nothing in a policy context.  Maybe I could accept that if TVAAS
were used as evaluation in a single or a few school systems.  But this is
something that affects several hundred thousand children, and thousands of
teachers.
 
In response to Les McLean, Michael Scriven writes:
 
>Short of finding a teacher in flagrante delicto with a student, what we use
>is a sorry bunch of variables, ranging from the process variables observed on
>a visit to the classroom, some of them vaguely correlated with learning gains
>via process/outcome research (none of these are legitimate), through reports
>from parents, noise heard through the classroom walls, all the way to
>evidence of enrolment in post-grad studies.
 
As an historian, I see the causal relationship here in a very different way.
Yes, teachers are not evaluated on what children learn, but that's not primarily
because we have bad outcomes measures.  We have bad outcomes measures
because schools are not designed to pay attention to outcomes.  Even when
standardized testing spread in the 1980s, it was frequently not designed to
assess teachers but rather punish students for not learning (e.g., the proficiency
test I had to pass in California in order to graduate).  Daniel Calhoun writes,
in THE INTELLIGENCE OF A PEOPLE, of the ways in which our views
of intelligence, and ways of judging what children have learned, have typically
been a way of blaming children (or adolescents or adults) for not learning.
List subscribers here in the past week have volunteered anecdotes of how
tests supposedly designed to help children by making teachers accountable
instead have created pressures which (I believe) have no business in a school.
I don't think it's because people have not yet come up with the perfect
statistical tools to turn those test results into good outcomes measures.
I think it's because schools have a tendency to sort and blame when the
heat's on.  With this tendency, I'd rather not rely on something like the
TVAAS to get teachers to pay attention to students.


From:         Rick Garlikov 
Subject:      Re: Validity, again (no apology)

Les,
    TVAAS has given at least one non-mathematical argument to demonstrate
the reasonableness of their approach; that is that their numbers match
the evaluations supervisors make.  And although Harvey doesn't see any
purpose or point in my previous request about this matter, it seems to
me that one of the implications of TVAAS's claims is that they could
quite accurately predict the test scores of students, given sufficient
information about that student's past performance relative to his/her
classmates, and given the information how those classmates have done on
a test the student in question does not take, or has not yet taken.  I
suggested a trial in which TVAAS makes such predictions based on tests that
are taken but withheld from them until after the predictions.  Harvey thinks
this is nonsense.  But it seems to me to be a way of demonstrating whether
TVAAS has some sort of statistical power or not, without arguing about the
mathematics.  As I said in my first post about all this, in science math
is only a guide, not a proof of how the real world behaves.  Even in
physics, they have to do the experiment to try to confirm, not that the math
is right, but that it is the right math.   I think Gene would agree with
this in those cases where he thinks math is applicable at all.  And, like
Gene, I would agree that there are far more important aspects to evaluating
teachers than what can be described or computed mathematically.  But TVAAS
agrees with that.  Their focus is only about one aspect of teaching.
They are, to use my baseball analogy, only computing batting averages; they
are not claiming batting averages are the only measure of the value of a
baseball player.  And they are using every means at their disposal to try to
make that point abundantly clear.
    But with regard to the math, I think a non-math means of demonstrating the
reasonableness of the method is perhaps more important than mathematical
arguments about the math itself.  Especially for those, like me, who
have no way of following the arguments Gene, Harvey, Greg, Les, et al,
can make, but which don't get to the heart of whether even the most
impeccable math applies or not.


From:         Sandra P Horn 
Subject:      Response to Glass from Wm. Sanders
 
Response to Glass's Jan. 19 post from William Sanders:
 
Sanders:
  "To Gene Glass:  I hope that reading this article will give
  you a more accurate picture of what we are doing.  After you
  have looked at it, you may want to restate your questions."
 
Glass:
     "I don't appreciate being fobbed off in this manner and
  I am in an even more atrabilious mood for having traipsed to
  the library to copy the article referred to (Journal of
  Personnel Evaluation in Education 8:299-311, 1994)."
 
Sanders: There was certainly no intent on my part to fob you or
anyone else.  What has been disturbing to me (and still is), is how
you or anyone else could assume to know what we are doing in TVAAS
without examining relevant materials and then write criticisms based upon
your own assumptions in such declarative tones.  Now that at least you
have looked at the paper to which we referred, we can begin to discuss
your criticisms from the perspective of the model(s) which we are using.
 
Glass:
"It is said that pretest achievement scores will be the "filter"
through which these exogenous influences operate on the posttest
scores, hence, the student background characteristics can be
ignored. If this is an assumption of the system, it is clearly
contestable and nearly certainly false; if it is a belief about
what accounts for variation in achievement scores, it is
incompetent."
 
Glass:
"The only way that such background influences can be assumed to
enter the TVAAS system is through the "filter" of pretest scores;
to think that this "filter" is sufficient to correct for the
background characteristics is simple fantasy."
 
Sanders:
Observe that we fit the model (including all teachers over all
subjects over all grades simultaneously within each school system)
to the entire observational vector available for each student with
the appropriate variance-covariance structure within the r-matrix.
By so doing, one can view each student vector to be "like" an
incomplete block with the analogy to the analysis of incomplete
block designs.  After obtaining teacher and school effects this
way, what justifies our claim that most of the socio-economic
confoundings have been filtered?
 
     We do ex post facto analyses relating the teacher or school
effects to variables that have been accepted by some (at least) to
proxy socio economic status.  We have found the following based
upon the state-wide analysis.  There is no relationship between the
school effects and: (1) the percentage of students receiving free
and reduced lunches in the school; (2) the racial composition of
the student body; (3) the location of the building as to urban,
suburban and rural.
 
     Also, the correlations between the school effects and the
school means are extremely low, as can be seen in the following
table.  {No, we do not have a state-wide measure of student ability.
However, in our early studies we found that including one
contributed virtually nothing}.
 
  Simple coefficients of correlation between school effects (math)
and the mean for each school (3 year averages).
 
      Grade       r     N (no. of schools)
        3       .056        898
        4       .081        888
        5       .101        865
        6       .163        673
        7       .105        509
        8       .078        510
 
With relationships of this magnitude, it certainly is easy to show
numerous examples of schools across the entire spectrum of mean
achievement scores who have obtained excellent gains.
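
One way to read the table above is to square each r, which gives the
proportion of variance that school effects share with school mean
achievement.  The sketch below does only that arithmetic on the r and N
values as posted.  The variable and function names are illustrative
assumptions and are not part of TVAAS, and a small r-squared by itself
does not settle the validity questions raised in this discussion.

```python
# Correlations (r) and school counts (N) copied from the table as posted.
# Keys are grades; names here are illustrative, not TVAAS terminology.
school_effect_vs_mean = {
    3: (0.056, 898),
    4: (0.081, 888),
    5: (0.101, 865),
    6: (0.163, 673),
    7: (0.105, 509),
    8: (0.078, 510),
}


def shared_variance(r):
    """Proportion of variance shared by two variables with correlation r."""
    return r * r


# r squared for each grade, keyed by grade.
shared = {
    grade: shared_variance(r)
    for grade, (r, _n_schools) in school_effect_vs_mean.items()
}

for grade, r2 in shared.items():
    print(f"Grade {grade}: r^2 = {r2:.3f} ({r2:.1%} of variance shared)")
```

On these figures the largest value is for grade 6, where the squared
correlation is under 3 percent.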
 
     The statements that we have made are not mere assumptions but
rather come from the results of the data analysis!! However, if
needed, the modeling process which we deploy does not in any way
preclude the use of other covariables. Let me restate that, to date,
we have not found the need for these additional variables.  If, in
the future, using other test data, a need is found to ensure
fairness, then the models will be expanded to include them.
 
Glass says, "How, I might ask, does TVAAS imagine that the positive
correlation between ability and achievement arises in the
world if more able students do not learn at a faster rate than less
able ones?"
 
     One of the major findings from the state-wide analysis of the
data is that most (not all) of our school systems do not have the
curricular stretch to allow the highest achieving students to
express gain commensurate with the gain obtained by the average and
below average students.  However, in other systems, we find this to
be not true!  In those systems, students of all achievement levels
are making satisfactory gains.
 
Glass:
It is clear from reading the TVAAS responses to questions posed
here over the past few weeks and from reading their latest
published exposition of the method that they have no appreciation
of the validity question whatsoever.
 
That is a heavy charge.  Let me share several additional findings.
 
1. As part of the second study (data from Blount County), the
principals were asked to forecast whether each teacher would
profile in the top, mid or bottom third of the Blount County
distribution.  The principals forecasted about 90% of the bottom
profiling teachers, and could distinguish between the top and mid
groups of math teachers; however, they could not distinguish
between the top and mid reading and language arts teachers.
 
2.  As was mentioned in the most recent post, we now have merged
the writing assessment data into the master data base.  Even though
the studies are not complete, it appears that we could substitute
this data for the language arts data from TCAP without appreciably
changing the rankings of schools within a system.
 
3. We have data from the 10th grade TCAP and from the 10th grade
PLAN tests.  The data could be interchanged with virtually no
change in the rankings of systems and schools.
 
4. When we have had direct knowledge of change in educational
practice, then we have observed change in the effects.  For
instance, Knox County, which has a middle school system, has had
severe retardation in the gains for 6th grade (the first year of
the middle school).  This past year a major effort was launched to
improve communication between feeder schools and receiving schools
such that instruction could be provided earlier in the school year
commensurate with where the feeder schools had left off the
previous year.  After this effort, the Knox County 6th grade gains
improved appreciably.
 
6.  One of the administrators at Tennessee State University (TSU)
gave us a list of several schools in which they had assisted in
developing a "hands on approach" to teaching science.  In all of
these schools, we found that the cumulative gains (gains summed
over grades) were greater than 100% cumulative norm gain.
 
7.  For those schools that are offering pre-algebra in the seventh
grades, their gains are considerably higher than those which are
not.
 
     Does any of this speak to validity?
 
 
 
Cost
 
Some have been addressing this question.  Let me mention a cost
that needs to have the most consideration.  That is the cost of
denying human beings a chance to be competitive in later life
because they were unfortunate enough to have attended schools and
systems that were not offering a competitive academic program.
 
To me the purpose of all formal education is to provide as many
human beings as possible, as many choices in life as possible.  As
I mentioned in the previous post, we found that tremendous
difference exists among our school systems in mean ACT scores, when
considering students that were at the same place academically as
measured with the 8th grade TCAP tests.  For example, considering
only the top quartile of eighth graders, the top two or three
systems will have mean ACT math scores of 27 while other system
means will be 18.  Why?  Because most of these low averaging
systems offer no accelerated math programs, no AP courses,
etc., etc.   TVAAS is not just about profiling teachers, but rather
is an attempt to measure those influences that are either
accelerating or impeding the academic progress of populations of
students.
 
     Many can quibble about our models, can quibble about the tests
and testing, can argue that the taking of the blood sample will do
more harm to the patient than the illness.  However, any reasonable
analysis of the totality of the data will confirm that this process
is fair and indeed has begun to bring pressure on school officials
to improve the educational process.
 
     No system, school or teacher is being asked to do more in
Tennessee than is presently being done by educators working under
similar circumstances. The extreme variation in effectiveness that
exists among systems, schools and teachers is THE problem in public
education in Tennessee.  Attempts to attribute these findings solely to
the testing or the statistical methodology will require extreme
contortions of logic to evade the mountains of evidence to the contrary.
 
William L. Sanders
Director and Professor
University of Tennessee Value-Added Research and Assessment Center



From:         Sandra P Horn 
Subject:      Re: Les McLean on TVAAS
 
There have been several messages similar to this one that originated from
Les McLean.  I would like to respond to it as a member of the TVAAS team,
but not as a spokesperson for the team or for the UT Value-Added Research
and Assessment Center.
 
On Sat, 21 Jan 1995, Leslie McLean wrote:
 
> Gene Glass's hypothetical procedure for local use of test scores in teacher
> evaluation ended with a modest claim--for a modicum of validity.  The
> post arrived just as I finished re-reading an ancient manuscript--Samuel
> Messick's presidential address to Div. 5 of APA in August 1974, "The
> Standard Problem: Meaning and Values in Measurement and Evaluation".  My
> it has stood up well! It contained reference to an even more ancient
> scroll, Lee Cronbach's writing on validity (with Meehl), going back to 1955
> (and thus more than 40 years old).  The phrase that deserves emphasis
> here in 1995 is that we do not validate tests, but "AN INTERPRETATION OF
> DATA ARISING FROM A SPECIFIED PROCEDURE".  (That's from Cronbach's
> chapter in Ed. Measurement, edited by R.L. Thorndike, 1971, p. 447).
>
>    As I have come to understand the TVAAS, it goes like this:
>
>    These numbers are measures of gain in student achievement over a
> school year, and our PROCEDURE entitles you to read them as measures of
> teacher competence--the larger they are, the better the teacher, and v.v.
>
Then you have not been reading the posts we have sent you, or you are
selectively ignoring whole sections.  TVAAS uses student data
longitudinally over at least three years.  So far, reports have only been
issued for schools and systems.  Teacher reports will be issued for the
first time this year.  Teachers are only asked to attain normal gains for
their students.  Standard errors are employed so that a range of scores
fall within the definition of "normal gain."  We are working closely with
teacher and administrator representatives to ensure the privacy of
individual teachers and to develop meaningful, useful reports for them.
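
The statement above that standard errors define a range of scores counted
as "normal gain" can be illustrated with a minimal sketch.  Everything
specific in it is an assumption made for illustration: the multiplier k,
the function name, and the example numbers are hypothetical, and the
posts do not say how wide a band TVAAS actually uses.

```python
def classify_gain(estimated_gain, norm_gain, std_error, k=2.0):
    """Label a gain estimate relative to the norm gain.

    The estimate counts as "normal gain" unless the interval
    estimated_gain +/- k * std_error lies entirely above or entirely
    below norm_gain.  The multiplier k is an assumed value, not one
    stated anywhere in the discussion.
    """
    lower = estimated_gain - k * std_error
    upper = estimated_gain + k * std_error
    if lower > norm_gain:
        return "above normal gain"
    if upper < norm_gain:
        return "below normal gain"
    return "normal gain"


# Hypothetical numbers, in scale-score points.
print(classify_gain(20.0, 18.0, 1.5))  # interval 17.0 to 23.0 covers 18.0
print(classify_gain(25.0, 18.0, 1.5))  # interval 22.0 to 28.0 is above 18.0
print(classify_gain(12.0, 18.0, 1.5))  # interval 9.0 to 15.0 is below 18.0
```

Under a rule of this kind, a larger standard error widens the band, so
fewer estimates are labeled as different from normal gain.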
 
 
>    Many of us, based on lots of experience, are extremely skeptical of
> this claim, and so far the explanations have only increased our
> skepticism.  The procedure is SO questionable (from test content to
> scaling method to model-fitting and adjustment) that any interpretation
> such as I have given above should be withheld pending supportive
> evidence--talk about "meaning and values in measurement and evaluation"!
>
We are attempting a dialogue with those of you interested in TVAAS.  We
didn't realize there were so "many" of you, since most of the posts
have come from a handful of people.  If you have questions, we continue
to attempt to address them.  What is it you find questionable,
specifically?  Asserting that the model is questionable certainly should
be withheld pending supportive evidence.  Talk about "meaning and values
in responsible discourse"!
 
>   As many posts have argued, the TVAAS numbers become the only evidence
> to get attention, not because the TVAAS folk say so but because the sheer
> size and bulk and prestige of the enterprise drive out other measures.
 
These "many posts" have originated from a very few individuals who may
not be representative voices.  TVAAS has not by any means driven out
other forms of assessment.  The Tennessee Writing Assessment has just
recently come on line.  Performance evaluation of teachers continues,
apace, and is the primary means of teacher evaluation in Tennessee.  As
we have stated repeatedly, TVAAS currently assesses the effects of
educational practice on the outcomes students portray in five subject
areas in grades 3 through 8, hardly the bulk of education in Tennessee.
 
 
> Messick ended his speech with a quotation from one Liam Hudson (1972),
> from his book, The Cult of the Fact (p. 125).  Hudson argued that social
> science:
>      ...should be pictured not as a society of good men and true,
>      harbouring the occasional malefactor, but rather, as one in
>      which everyone is searching for sense; in which differences
>      are largely those of temperament, tradition, allegiance and
>      style; and in which transgression consists not so much in a
>      clean break with professional ethics, as in an unusually high-
>      handed, extreme or self-deceptive attempt to promote one
>      particular view of reality at the expense of all others.
>
>
> More strenuous efforts would appear to be required to avoid such a
> transgression in Tennessee.
>
This is the part I consider insulting.  Categorically, we have never
broken with professional ethics.  We communicate with educators on every
level, within our state and across America, Edpolyan being only one
example, seeking input, reaction, responsible criticism, and an
understanding of what educators need in order to most effectively use
TVAAS data.
        TVAAS was created for the betterment of education.  That is its
purpose.  There are no ulterior motives, and no one here stands to profit
from some demonic plot to crucify teachers.  Nor are we self-deceiving.
We are fully cognizant of how achievement data have been misused by media
and others in the past.  TVAAS is an attempt to change that.  Could it be
that perhaps you are stuck in the old paradigm, that because this model
uses data from standardized tests that you are making assumptions based
on past usage rather than openly examining this new way of analyzing and
reporting outcomes?
        We have remained civil when those with whom we seek to communicate sink
to such off-handed insults as these and when others have accused us of
everything from arrogance to statistical ignorance rather than question
their own assumptions or attempt to work collegially toward solutions.
        It is not high-handed to refuse to engage in this type of low
insinuation.  Furthermore, it is not high-handed when our responses do
not come fast enough to satisfy your schedule.  We have tried to explain
why there are delays but, repeatedly, this is interpreted as a
personal issue rather than a logistical one.  Is there a reason why we
must waste time in this manner?
         I am not willing to abandon my belief in the ability of people
of good will to solve society's problems.  After all, that's what brought
me to the TVAAS team in the first place.  I just find it very
discouraging to find so little good will.  It never occurred to me that
this discussion would devolve into paradigm wars.
        We have two discussions going on here:  the philosophical aspects
of educational assessment and the statistical propriety of the TVAAS
model, both of which are vital.  I don't foresee much progress being made
on either front unless we base the discussions on belief in the good
will of our partners in seeking solutions to the problems in education
and, in the absence of evidence to the contrary, in the competence of
our fellow discussants in their respective fields.
        Perhaps this bickering is normal and I just don't know how to play this
game.  Forgive me.  I am new and perhaps naive.  But I, for one, find it
counterproductive and saddening whenever, in any circumstances, people do
not treat each other with respect, especially when egos preclude the open
exchange and consideration of ideas.
 
Sandra Horn


From:         Sandra P Horn 
Subject:      Re: Camilli's q's re: TVAAS
 
Dear Greg,
 
Sorry it has taken so long to get back to you on these important
questions.  In regards to the CTBS/4, which is used in the Tennessee
Comprehensive Assessment Program (TCAP) tests that provide the scaled
data for TVAAS:
 
1.  You asked us some time ago about how the TCAP was scored.  It is
scored by pattern (IRT) scoring in accordance with CTB's three-parameter
statistical model.
 
2.  You also asked what form of the CTBS/4--Benchmark, Battery, or
Survey--was used.  The full battery is used for the math, language
arts and reading tests.  Science and social studies are tested with the
survey versions.
 
3.  There has been considerable discussion on the issue of "teaching to
the test."  John Covaleski points out that the degree to which a test is
capable of being "psyched out"--by this, we understand that he means
"predicted"--by stakeholders has a great deal to do with how valid the
results will be.  We agree that this is true.
        The current TCAP tests, both the norm-referenced subtests and the
criterion-referenced subtests, are comprised of 70% new items every year,
the remaining 30% being drawn from a bank of repeat items from all
previous years' tests.  CTB certifies that the new items are equivalent
to the items they replace in scoring properties.  The purpose of
requiring new items is specifically to discourage the educationally
indefensible--and ineffective--practice of teaching to the test.  Those of
us involved with TVAAS anticipate that TVAAS will show that students who
are fortunate enough to have teachers who develop concepts and encourage
the discovery of connections do far better on all forms of assessment
than those under teachers who waste instructional time trying to "psych
out" the test.  If, indeed, this turns out to be the case when the
teacher reports are issued later this year, perhaps the preferred
strategy for improving test scores will more often center around improving
instructional strategies that meet the needs of each individual student,
and the teaching to the test strategy will be abandoned in the face of
evidence of its ineffectiveness.
 
We appreciate your continued interest in TVAAS.
 
Sandra Horn
 


From:         Alan Davis 
 
Rick recently argued that teaching to high stakes tests probably would
not have adverse consequences if the tests were good and if teachers
"knew that it is good teaching that gives the best chance for improved
test scores."
 
One of the principles of high stakes testing, documented by Linda McNeil
in Texas and supported by other research (see Eva Baker at UCLA) is that
things that are tested push out curriculum that is not tested.  So when
Jane Armstrong (Education Commission of the States) and I visited school
districts in Virginia in a study of elementary science instruction,
teachers in schools where standardized test scores were low in Reading
and Math informed us that they didn't teach much science at all, because
science either wasn't tested or the science scores weren't of much
interest to parents.  About the same time, the Michigan Association of
Science Teachers pressured the state to start testing students in science
as part of the Michigan Educational Assessment Program (MEAP) so that
teachers would begin teaching science -- it had been shoved out by
reading and math.
 
The consequence of this is that everyone wants their subject tested, but
to test everything that is important for kids to do in schools becomes a
matter of overkill.  Most will agree on the central importance of reading
and math, but most will also agree that kids should learn other things in
school that are less readily measured.  In high stakes states, reading
and math (and language arts, defined by the tests as editing sentences
containing mistakes) push out other instruction.
 


From:         Leslie McLean 
Subject:      Hommage a Sandra P. Horn
 
  Bravo, Sandra Horn, for your spirited responses on behalf of the TVAAS!
No insult was ever intended in my postings, but since insult was felt, please
let me apologize.
  I was particularly grateful for your clarification that you had not yet
reported scores by teacher as gains-attributed-to-teacher-competence, but
that you would shortly do so.  Some of us (perhaps even you) forget
sometimes that this is a listserv discussion and not a debate in the
Tennessee legislature.  We anticipate events--always a risky venture--but
that is the way of the scholar--however bothersome it may be to the
people who are on line for actions and justifications every minute.  As
unlikely as it may seem, Sandra, I have been the person who had to draft
responses for the Minister (read: Chief elected official) on occasion.  I
would like to share with you sometime my reply to the school district
official who wrote to the Minister to complain about our "Perception
Bag", an unstructured set of curriculum materials on
perceptions--including bits of paper soaked in scents--a sick drunk, a
new Buick, ...
   Please let no one ever read my posts as suggesting that the TVAAS
staff are not concerned with the welfare of teachers and students.  You
have made your case well and I would call attention to it.
 
   I not only read your posts, I save them (most of them)!  For example,
here is an excerpt:
Sanders:
Observe that we fit the model (including all teachers over all
subjects over all grades simultaneously within each school system)
to the entire observational vector available for each student with
the appropriate variance-covariance structure within the r-matrix.
By so doing, one can view each student vector to be "like" an
incomplete block with the analogy to the analysis of incomplete
block designs.  After obtaining teacher and school effects this
way, what justifies our claim that most of the socio-economic
confoundings have been filtered?
 
     We do ex post facto analyses relating the teacher or school
effects to variables that have been accepted by some (at least) to
proxy socio economic status.  We have found the following based
upon the state-wide analysis.  There is no relationship between the
school effects and: (1) the percentage of students receiving free
and reduced lunches in the school; (2) the racial composition of
the student body; (3) the location of the building as to urban,
suburban and rural.
 
     Also, the relationship between the school effects and the
school means are extremely low as can be seen in the following
table.  {No we do not have state-wide a measure of student ability.
However, in our early studies we found that the inclusion
contributed virtually nothing}.
 
  Simple coefficients of correlation between school effects (math)
and the mean for each school (3 year averages).
 
      Grade       r     N (no. of schools)
        3       .056        898
        4       .081        888
        5       .101        865
        6       .163        673
        7       .105        509
        8       .078        510
 
With relationships of this magnitude, it certainly is easy to show
numerous examples of schools across the entire spectrum of mean
achievement scores who have obtained excellent gains.
 
     The statements that we have made are not mere assumptions but
rather come from the results of the data analysis!!  [end of clip from
TVAAS post]
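
[Editor's sketch, not part of the original post.  One way to read the
size of the correlations in the clip above is to square them, which gives
the share of variance that school effects and school means have in
common.  The r and N values are copied from the table; the function name
is invented for this illustration, and Python is used only as a
calculator.]

```python
# Correlations between school effects (math) and school means, by grade,
# copied from the table in the quoted clip: grade -> (r, number of schools).
correlations = {
    3: (0.056, 898),
    4: (0.081, 888),
    5: (0.101, 865),
    6: (0.163, 673),
    7: (0.105, 509),
    8: (0.078, 510),
}


def shared_variance_pct(r):
    """Percent of variance two variables share linearly: 100 * r squared."""
    return 100.0 * r * r


for grade, (r, n_schools) in correlations.items():
    pct = shared_variance_pct(r)
    print(f"grade {grade}: r = {r:.3f}, r^2 = {pct:.1f}% (N = {n_schools})")
```

On this reading the largest figure, for grade 6, comes to under 3 percent
of shared variance at the school level.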
 
   Thank you for the reports of your data analysis, which support what I
noted, that you are concerned about teachers and students. It appears
that some of us have not clearly communicated our concerns, since our
concerns antedate all the data analysis reported above.  We are
concerned, and our concerns are not laid to rest, by descriptions such
as, "we fit the model simultaneously ...over all grades ... to the entire
student vector".  How long must a student be in a teacher's class in
order to be counted amongst that teacher's "achievements"?  (Or do
students not move in and out of classes in Tennessee?  In many Ontario
classrooms, there are more than 50% "ins and outs" in a year.)  And you
fit the model over three years!  I agree--the incomplete block
design/metaphor is apt, and please forgive some of us if we sometimes
lose patience and feel "fobbed off" by simplistic explanations.
   Please do accept, Sandra, that those few of us who are trying to stay
in communication with you on these very important but technical issues
are people who also care about teachers and students--and colleagues who
are charged with implementing complex systems.  Some of us have likely been
at it longer than you have; not that this makes us wiser, but it does
give us a fairly wide context.  We listen to wise colleagues such as
Michael Scriven who remind us that teacher evaluation must somehow
include consideration of student achievement.  Our context, however,
includes some distressing and scepticism-inducing experiences with
systems that begin with IRT scaling of multiple choice items by
three-dimensional models and carry on to models of exceptional
computational complexity to results said to have simple interpretations.
If you could share our experiences, I would hope you would not feel an
insult at our questions.  Let us keep up the dialogue.



From:         Sandra P Horn 
Subject:      TVAAS and moving students
 
Dear Les,
 
Thanks for your kind post.  It made me feel a lot better.  I have always
assumed that all of us were participating here with the common aim of
hashing out direction and means for insuring the best education possible
for our children.  It's a point I felt needed to be made again, though--
that mutual respect and open inquiry are the means by which our aim can be
achieved.
 
As to your question on how moving students affect teacher reports, the
answer is that students must be in a teacher's classroom a minimum of 150
days during the school year in order to be represented in that teacher's
cohort.  So not only are teachers protected from being held accountable
for students who enter late,  they are also not held accountable for
students with excessive absences (students have a 180-day school year).
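
[Editor's sketch, not part of the original post.  The attendance rule
described above can be written as a simple filter.  The 150-day minimum
and the 180-day year come from the post; the student records and the
function name are hypothetical.]

```python
MIN_DAYS_IN_CLASS = 150   # minimum stated in the post
SCHOOL_YEAR_DAYS = 180    # length of the school year stated in the post


def in_teacher_cohort(days_in_class):
    """True if a student spent enough days in the class to count for the teacher."""
    return days_in_class >= MIN_DAYS_IN_CLASS


# Hypothetical students: (label, days spent in this teacher's classroom).
students = [("A", 180), ("B", 150), ("C", 149), ("D", 60)]
cohort = [label for label, days in students if in_teacher_cohort(days)]

print("counted for the teacher:", cohort)
print("days a student can miss and still count:",
      SCHOOL_YEAR_DAYS - MIN_DAYS_IN_CLASS)
```

Under this reading, late entry and absences are treated alike: only the
number of days actually spent in the classroom matters.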
 
Please continue to ask.
 


From:         Gene Glass 
Subject:      Why TVAAS Exists.
 
       Yesterday, Sandra Horn wrote:
 
>   TVAAS was created for the betterment of education.  That is its
> purpose. There are no ulterior motives, and no one here stands to profit
> from some demonic plot to crucify teachers.  Nor are we self-deceiving.
 
       We can certainly credit the motives of Sandra Horn and William Sanders
  without accepting this somewhat oversimplified explanation of what caused
  TVAAS to be. Those of us who have studied the politics of similar efforts
  across the past 25 years have reason to doubt that TVAAS is solely an
  expression of a popular will to better education--even if it is far short
  of a profit-making daemonic plot.
 
       An alternative account of the creation of TVAAS holds that it was
  proposed by traditional foes of public-education funding in the Tennessee
  Legislature as the compromise required to pass career ladder legislation.
  The career ladder legislation provided big financial benefits for teachers.
  "If teachers are going to get big payoffs like that, then they better
  prove they are adding value to students' lives." Such is the political
  rhetoric of accountability.
 
       I think it is fair to say that the legislative genesis of TVAAS
  bore all the markings of suspicion toward teachers and schools and hostility
  toward increased education expenditures. This doesn't doom it to fail
  statistically or politically, but it is relevant context for understanding
  how it will operate, how its strengths and weaknesses will be seen and
  what its future is.
 


From:         William Robert Saffold 
Subject:      More TVAAS Questions
 
I have been "lurking" for some time now, but I have yet to see many of my
basic questions about TVAAS answered, or addressed sufficiently.  I would
love to hear (on-list) from the TVAAS staff or anyone else who can help me
understand exactly what is going on with value-added assessment.  Following
are some observations and questions:
 
1.  Dr. Sanders claims that scores are consistent from year to year, but
the variations in gains (or losses) are actually quite large.  Following
are a few actual district report cards in various subjects.  I have chosen
small, medium, and large districts just to show the variation in gains.  I
have included the explanatory material at the end of this section:
 
___________________________________________________________________
Hollow Rock--Bruceton Special School District--804 students
 
Language--Estimated Means
 
Grade           2       3       4       5       6       7       8
USA Norm     667.0   696.0   707.0   724.0   739.0   749.0   757.0
1991:        681.2   705.7   706.4   729.9   734.5   745.5   757.1
1992:        698.5   701.6   723.4   737.6   757.1   738.0   763.0
1993:        692.0   693.6   708.7   732.0   741.0   753.4   760.4
1994:        673.9   695.2   711.3   725.9   737.3   761.9   763.5
 
Language--Estimated Gains
 
Grade           3       4       5       6       7       8       %CUMGAIN
USA Norm:     29.0    11.0    17.0    15.0    10.0     8.0
1992:         20.4    17.7    31.2    27.2     3.5    17.5      130.6
1993:         -4.9     7.1     8.7     3.4    -3.7    22.3       36.5
1994:          3.2    17.7    17.2     5.3    20.9    10.1       82.6
 
3 Yr Avg       6.2R*  14.1G   19.1G   12.0R    6.9R   16.6G      83.2
Std Error      2.8     2.5     2.1     2.0     2.2     2.3
 
______________________________________________________________________
Campbell County School District--6,410 students
 
Science--Estimated Means
 
Grade           2       3       4       5       6       7       8
USA Norm:    655.0   690.0   709.0   732.0   745.0   756.0   765.0
1991:        663.7   688.5   697.0   708.5   723.9   733.1   752.6
1992:        672.9   694.7   707.5   719.5   724.0   743.2   761.2
1993:        660.5   683.8   709.6   713.5   738.5   740.6   755.8
1994:        673.6   698.0   706.6   723.1   716.4   736.0   757.0
 
Science--Estimated Gains
 
Grade           3       4       5       6       7       8       %CUMGAIN
USA Norm:     35.0    19.0    23.0    13.0    11.0     9.0
1992:         30.9    19.0    22.5    15.5    19.3    28.1      123.0
1993:         10.0    14.9     6.0    19.1    16.6    12.5       72.8
1994:         37.5    22.7    13.5     3.0    -2.6    16.4       82.4
 
3 Yr Avg      26.5R*  18.9Y   14.0R*  12.5Y   11.1G   19.0G      92.7
Std Error      1.5     1.3     1.3     1.2     1.2     1.1
 
____________________________________________________________________
Chattanooga City School District--20,159 students
 
Social Studies--Estimated Means
 
Grade           2       3       4       5       6       7       8
USA Norm:    652.0   691.0   713.0   735.0   745.0   749.0   761.0
1991:        664.7   686.1   717.3   738.9   736.7   737.8   754.2
1992:        669.1   693.9   718.4   743.9   740.8   750.5   756.7
1993:        645.8   690.4   721.4   726.2   729.3   751.5   761.2
1994:        642.8   674.7   705.9   733.6   739.6   747.4   754.0
 
Social Studies--Estimated Gains
 
Grade           3       4       5       6       7       8       %CUMGAIN
USA Norm:     39.0    22.0    22.0    10.0     4.0    12.0
1992:         29.3    32.1    26.5     1.9    13.5    18.9      112.0
1993:         21.0    27.2     7.6   -14.5    10.6    10.5       57.3
1994:         28.5    15.1    12.1    13.2    17.6     2.4       81.6
 
3 Yr Avg      26.3R*  24.8G   15.4R*   0.2R*  13.9G   10.6R*     83.6
Std Error      0.9     0.7     0.6     0.6     0.6     0.6
____________________________________________________________________
Mixed Model Analysis using Scale Scores from Norm-Referenced Section of the
TCAP
 
G=  Green Zone:  Estimated mean gain equal to or greater than national norm
Y=  Yellow Zone:  Gain below national norm by 1 std error or less
R=  Red Zone:  Below norm by more than 1, but no more than 2, std errors
R*=  Ultra-Red:  Below norm by more than 2 std errors
 
NG=  Negative Gain:  no percent-of-norm calculated
 
Slight variations noticed in this year's estimates when compared to last
year's estimates are a function of the fine tuning of the methodology.  The
scores for previous years will change slightly as the most recent data are
incorporated, refining previous estimates.
 
[I HAVE TYPED THIS ALMOST VERBATIM FROM THE 1994 TVAAS REPORT--I hope that
I have typed all the numbers correctly.  Apparently, there is a
computerized copy of the data available, but the TVAAS staff have fixed it
so that no data can be manipulated or copied electronically.  I would love
to hear their explanation/justification for this.]
___________________________________________________________________
 
Dr. Sanders' position seems to be that all variations are the result of
educational practice (unless someone else proves otherwise).  The huge
variations from year to year would seem to undercut his position.  If the
same teachers are teaching (presumably) in much the same way each year,
what accounts for the variation in gains?  Could the measurement instrument
have anything to do with it?  Or could the model just be misspecified?
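
[Editor's sketch, not part of the original post.  The Hollow Rock--Bruceton
Language figures above can be recomputed from the estimated means.  Two
things here are assumptions inferred from the printed numbers, not
documented TVAAS procedure: that a grade's gain is this year's mean for
that grade minus last year's mean for the grade below, and that %CUMGAIN
is the summed gains as a percent of the summed USA norm gains.  The zone
rule follows the legend quoted above.  All function names are invented.]

```python
# Estimated means for Language, Hollow Rock--Bruceton, grades 2 through 8,
# copied from the report card quoted above.
means = {
    1991: [681.2, 705.7, 706.4, 729.9, 734.5, 745.5, 757.1],
    1992: [698.5, 701.6, 723.4, 737.6, 757.1, 738.0, 763.0],
    1993: [692.0, 693.6, 708.7, 732.0, 741.0, 753.4, 760.4],
    1994: [673.9, 695.2, 711.3, 725.9, 737.3, 761.9, 763.5],
}
norm_gain = [29.0, 11.0, 17.0, 15.0, 10.0, 8.0]   # USA norm gains, grades 3-8
std_error = [2.8, 2.5, 2.1, 2.0, 2.2, 2.3]        # printed std errors, grades 3-8


def cohort_gains(year):
    """Assumed rule: grade g gain = this year's grade g mean minus last year's grade g-1 mean."""
    now, before = means[year], means[year - 1]
    return [now[i + 1] - before[i] for i in range(len(now) - 1)]


def cum_gain_pct(gains):
    """Assumed reading of %CUMGAIN: summed gains as a percent of summed norm gains."""
    return 100.0 * sum(gains) / sum(norm_gain)


def zone(avg_gain, norm, se):
    """Apply the G / Y / R / R* legend to one three-year average gain."""
    shortfall = norm - avg_gain
    if shortfall <= 0:
        return "G"
    if shortfall <= se:
        return "Y"
    if shortfall <= 2 * se:
        return "R"
    return "R*"


gains = {year: cohort_gains(year) for year in (1992, 1993, 1994)}
cum_pct = {year: cum_gain_pct(g) for year, g in gains.items()}
avg_gain = [sum(gains[y][i] for y in gains) / 3 for i in range(6)]
zones = [zone(a, n, s) for a, n, s in zip(avg_gain, norm_gain, std_error)]

for year in gains:
    print(year, [round(g, 1) for g in gains[year]], round(cum_pct[year], 1))
print("3 yr avg", [round(a, 1) for a in avg_gain], zones)
```

Run as written, the recomputed gains agree with the printed gains to
within 0.1, and the zone letters match the printed row (R*, G, G, R, R,
G).  The small differences presumably reflect rounding in the printed
means.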
 
2.  Dr. Sanders (through Sandra Horn) claims that:
 
"When we have had direct knowledge of change in educational practice, then
we have observed change in the effects.  For instance, Knox county, which
has a middle school system, has had severe retardation in the gains for 6th
grade (the first year of middle school).  This past year a major effort was
launched to improve communication between feeder schools and receiving
schools such that instruction could be provided earlier in the school year
commensurate with where the feeder schools had left off the previous year.
After this effort, the Knox County 6th grade gains improved appreciably."
 
The following are gain scores for 6th graders in Knox County:
 
                Math    Reading         Language        SocS    Science
USA Norm        19.0    18.0            15.0            10.0    13.0
1992:           12.5    15.1            14.1             1.1     5.3
1993:            9.7     7.8             5.2            -7.0    12.1
1994:           16.6    13.9             4.3            13.2     6.3
 
3 Yr Avg        12.9    12.3             7.9             2.5     7.9
Std Error        0.3     0.3             0.3             0.4     0.4
 
Sixth graders experienced drops in gains in 1993, and recoveries in some
gains in 1994--but improvement in gains scores is not even across all
subjects tested.  Dr. Sanders provides nothing more than anecdotal evidence
to prove that any increase in gain scores is due to educational effort--and
he fails to mention the drops in language arts and science.  Comparisons
between gain scores in 1992 and 1994 show virtually no overall improvement.
Gain scores in 1993 look more like the exception than the rule.  The
change in gain scores could be, among many other potential factors, merely
natural year-to-year variations.
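
[Editor's sketch, not part of the original post.  The comparisons made in
the paragraph above can be tabulated directly from the Knox County table.
The gain scores are copied from the post; the unweighted mean across
subjects is an ad hoc summary chosen for this illustration, not a TVAAS
statistic, and the function names are invented.]

```python
subjects = ["Math", "Reading", "Language", "SocS", "Science"]

# Knox County 6th grade gain scores, copied from the table above.
knox_gains = {
    1992: [12.5, 15.1, 14.1, 1.1, 5.3],
    1993: [9.7, 7.8, 5.2, -7.0, 12.1],
    1994: [16.6, 13.9, 4.3, 13.2, 6.3],
}


def change(from_year, to_year):
    """Per-subject change in gain score between two years."""
    pairs = zip(subjects, knox_gains[from_year], knox_gains[to_year])
    return {s: round(later - earlier, 1) for s, earlier, later in pairs}


def mean_gain(year):
    """Unweighted mean gain across the five subjects (an ad hoc summary)."""
    return sum(knox_gains[year]) / len(subjects)


change_93_94 = change(1993, 1994)
change_92_94 = change(1992, 1994)
dropped_93_94 = [s for s, d in change_93_94.items() if d < 0]

print("1993 -> 1994:", change_93_94)
print("1992 -> 1994:", change_92_94)
print("subjects that dropped 1993 -> 1994:", dropped_93_94)
for year in knox_gains:
    print(year, "mean gain across subjects:", round(mean_gain(year), 2))
```

The subjects flagged as dropping from 1993 to 1994 are Language and
Science, the two named in the paragraph above.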
 
3.  I keep reading that TVAAS differentiates between district, school, and
teacher effects on student learning.  Does the student gain at the
classroom level equal the teacher's effect?  Does the student gain at the
school level equal the school's effect?  Does the student gain at the
system level equal the system's effect?
 
4.  Dr. Sanders admits that we don't even have a measure of student
ability.  How can we know that a student's innate ability has no effect on
gain?  (And if ability--or lack thereof--doesn't matter, why are special
education students excluded from teacher assessment?)  Dr. Sanders believes
that low-achieving students (who may or may not be low ability students)
can make "satisfactory gains" consistent with the national norm gain---but
a low-ability student may not be able to do so, at least at the same rate
as a higher-ability student (and the implicit assumption in the TVAAS model
seems to be that all students should learn at the same rate--the national
norm gain).  Thus teachers might have to work much harder with a group of
low-ability students in order to achieve acceptable gain scores.  This
discriminates against teachers with low-ability (as opposed to
low-achieving) students.
 
5.  Will the TVAAS staff please provide information about the correlation
between the results of the norm-referenced test and the
criterion-referenced test?  It would help to know how well the
norm-referenced test matches the Tennessee curriculum.  We could have some
objective evidence that the norm-referenced test sufficiently matches the
curriculum if it can be shown that students score roughly the same on the
two tests---at all grade levels and in all subjects.
 
Thanks,
Bob Saffold
Vanderbilt Institute for Public Policy Studies
Saffolwr@ctrvax.vanderbilt.edu


From:         Sandra P Horn 
Subject:      To Saffold Re: More TVAAS Questions
 
I have only this minute read the post from which this excerpt is taken, so
I can only reply to one item at this time and it is the one below:
 
On Wed, 25 Jan 1995, William Robert Saffold wrote:
 
>
> 2.  Dr. Sanders (through Sandra Horn) claims that:
>
> "When we have had direct knowledge of change in educational practice, then
> we have observed change in the effects.  For instance, Knox county, which
> has a middle school system, has had severe retardation in the gains for 6th
> grade (the first year of middle school).  This past year a major effort was
> launched to improve communication between feeder schools and receiving
> schools such that instruction could be provided earlier in the school year
> commensurate with where the feeder schools had left off the previous year.
> After this effort, the Knox County 6th grade gains improved appreciably."
>
> The following are gain scores for 6th graders in Knox County:
>
>                 Math    Reading         Language        SocS    Science
> USA Norm        19.0    18.0            15.0            10.0    13.0
> 1992:           12.5    15.1            14.1             1.1     5.3
> 1993:            9.7     7.8             5.2            -7.0    12.1
> 1994:           16.6    13.9             4.3            13.2     6.3
>
> 3 Yr Avg        12.9    12.3             7.9             2.5     7.9
> Std Error        0.3     0.3             0.3             0.4     0.4
>
> Sixth graders experienced drops in gains in 1993, and recoveries in some
> gains in 1994--but improvement in gains scores is not even across all
> subjects tested.
 
You are absolutely correct, and the fault is mine that this example was
included in Bill Sanders' post.
 
When Bill wrote out his response to questions regarding validity, he
forwarded it to me for a read-through.  He also sent another post that
told me to remove the item regarding Knox County schools because
upon checking the figures, he had found that the findings were not
consistent across all subjects, although improvements had occurred in some.
However, there were two items regarding Knox County, and I removed the
other one.  I then sent the post to Edpolyan without returning it to Bill
for checking.  I just assumed I'd deleted the right thing.  Upon reading
this post, I discovered my mistake.
 
As for why these "anecdotal" examples were provided, it was in answer to
those who find statistical validity "invalid" in terms of educational
assessment.
 
Saffold raises several questions that will require detailed responses.
We will check through his figures to be sure they are correct, although I
assume they are.  However, I am sometimes mistaken in my assumptions (see
above), so we will check them for typos.  We will attempt to answer each
point in detail.
 
I apologize to all of you and to Bill Sanders for my mistake.
 
Sandra Horn



From:         Greg Camilli 
Subject:      Re: Validity, again (no apology)
 
Rick Garlikov and Michael Scriven have suggested on more than
one occasion that any test scores are fallible; surely that is
no reason to abandon them in the assessment process. The problems
with the TVAAS program may be
 
1. The scale of the scores may have anomalies (more variance in
grade 1 than in grade 12),
 
2. The model may not take into account student background (e.g.,
family income, neighborhood, preschool, parents' education),
 
3. The model and its technical properties (e.g., standard errors)
may be understood by only a very few,
 
4. *Some* test scores may be corrupted by the unscrupulous, others
less wittingly,
 
5. The testing program may encourage some to teach the test rather
than the educational skills,
 
6. The population actually tested may not be well understood (e.g.,
the effects of absences and exemptions).
 
When I think about it, the differences between this program and many
other assessment programs are not that great, especially because the
high stakes provisions seem to be loosely coupled with legal
consequences. I don't remember reading in the material that Sherman
provided whether there are graduation consequences for seniors
(like there are in other assessment programs.) However, I have
reacted to some of the more extravagant claims made for the model.
 
Suppose it were the case that many of these claims were untrue (e.g.,
disentangling teacher and student effects). And suppose we agreed
that test scores reflected the actual level of attainment for a
group of children that we would describe at length. Thus, a child
is tracked, and an unadjusted growth curve is plotted, or a teacher
is tracked. We don't pretend the information is unconfounded with
uncontrollable sources of variation.
 
Given this scenario, I have a couple of policy questions:
 
1. Is this information useful? If so, how?
 
2. Is it what the "public" or state administrators want? (As opposed to
what you, the reader, want.)
 
3. Are the *model's* claims essential for its role as a high stakes
test?
 
I ask these questions in good faith; they arose from consideration
of TVAAS, but extend far beyond. Can *large-scale* testing programs
be used to evaluate teachers and students?
 
 



From:         William Robert Saffold 
Subject:      TVAAS & Legitimate Inquiry
 
Kathy Bolland's response to my post underscores the reason I have refrained
from entering the discussion until now.  Debates of this nature always seem
to end up forcing people into one of two positions:  either they deny that
there are any valid and/or cost-effective means of evaluating teachers, or
they accuse those who raise questions about assessment of trying to let
poor teachers off the hook.  I categorically reject this false dichotomy.
Raising legitimate questions about a new and poorly-understood model of
assessment hardly strikes me as an attempt to "kill the messenger."
Failing to raise basic questions about model construction and
interpretation of results strikes me as an abdication of our responsibility
as educators and citizens.
 
My point about the TVAAS and student abilities was this:  There are
differences in student abilities and those differences have an impact on
value-added scores.  If the TVAAS does not take student abilities into
account, then it is more favorably disposed to some teachers than to
others.  If students were randomly assigned to teachers, this point might
be moot--but we know that this is not the case.  If you were a classroom
teacher, would you want to take the risk that the TVAAS was unfairly
weighted against you?  Or would you raise the questions in hopes that the
issue might be remedied?  I choose the latter course.
 
Both Kathy and Rick seem to believe that I reject the value-added scores
themselves as indicators of student gain.  This is not so.  What concerns
me is that gains (or losses) at the classroom level might be ascribed
solely to teachers--without any acknowledgment that there are other factors
at work here.  Of *course* it is important to know that low-ability kids
(or high-ability kids for that matter) are not reaching their
potential--but is it reasonable to assume that *all* children, regardless
of their natural abilities, will achieve the same amount of gain (i.e., the
national norm gain) in each subject each year?  And is it reasonable to
tell all teachers, regardless of the types of children they teach, that
each and every class of children must meet or exceed that national norm
gain or they have failed as teachers?  The answer to both questions may be
"Yes"--but I (and, I suspect, many others) would like to see some evidence
to support this position.
 
I do not see these questions as an attempt to shift the blame for poor
student academic gains away from teachers and on to the model itself.
Quite the contrary--I think the model may very well be incredibly useful.
As I told Sandra Horn in an off-list post, I am sincerely interested in the
TVAAS.  My questions are meant to enlighten both myself and others as to
the way the model really works.  I hope they will be regarded in that
light.
 

From:         Sherman Dorn 
Subject:      TVAAS
 
I've been sitting on this for a few days, in part because of scheduling and a
cold, but also because I've been wanting to think about a few posts.  Rick
Garlikov writes:
 
>        (3) if administrators and the media and public cannot be made to
>understand how to use these numbers as an indicator for this one narrow
>aspect of schooling, I don't see how the kind of thing Gene and Sherman
>describes is possible for them to apprehend and use at all.  It seems to
>me that what TVAAS is advocating is even easier for, say, administrators
>to understand than what Gene or Sherman is advocating.
 
This bewilders me.  In part, this is because I think what I wrote and what
Gene Glass wrote were very simple and easy to understand.  Maybe
that's my bias.  But in addition, from a political standpoint, simplified
statistical explanations may very well encourage oversimplified policies.
People who don't understand TVAAS numbers except as "well,
they mean something, and here's the ranking of your school" WILL
take the numbers as absolute, which is what NO ONE ON THIS LIST
has advocated.  When TVAAS numbers were provided the print media
this year, Nashville papers printed the rankings of schools without any
indication of which differences in rankings meant something, even to the
TVAAS statisticians.
 
In response to Gene Glass, William Sanders writes:
 
>     We do ex post facto analyses relating the teacher or school
>effects to variables that have been accepted by some (at least) to
>proxy socio economic status.  We have found the following based
>upon the state-wide analysis.  There is no relationship between the
>school effects and: (1) the percentage of students receiving free
>and reduced lunches in the school; (2) the racial composition of
>the student body; (3) the location of the building as to urban,
>suburban and rural.
 
This is provocative, but it is ecological reasoning.  You cannot extrapolate
from an analysis of system characteristics to student characteristics.
To demonstrate the irrelevance of a variable to TVAAS, UTVARAC
should see if adding individual characteristics such as race, sex,
family income, or the presence of a disability would change the rankings
of effects on various levels (teacher, school, or system).  Or, if one
merely wants to look at previous student scores, UTVARAC should be
able to calculate bivariate plots/regression equations for each grade, pairing
up the scale scores from the previous year to the current year (for
example, fourth-grade math computation test scale scores to those
students' math computation test scale scores for the prior year, when
they were in third grade).  What do the plots look like?  What are the
unstandardized regression coefficients?  If Sanders is right, the plots
should look very linear and (as importantly) the regression slope should
be just about 1 (i.e., kids with high prior scores should be gaining
the same as kids with low prior scores).
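
[Editor's sketch, not part of the original post.  The check proposed
above, regressing current-year scale scores on prior-year scores and
looking at the unstandardized slope, can be illustrated on synthetic
data.  Every score below is made up for the illustration; no TVAAS data
are used.  The least-squares fit is written out by hand so that nothing
beyond plain Python is needed.]

```python
def ols_fit(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic prior-year (third grade) scale scores, hypothetical values.
prior = [640 + 5 * i for i in range(25)]
# Small deterministic wiggle standing in for measurement noise.
wiggle = [(-1) ** i * 2 for i in range(25)]

# Case 1: every student gains about 20 points whatever the starting score.
uniform_gain = [p + 20 + w for p, w in zip(prior, wiggle)]
# Case 2: gains shrink as the prior score rises (high scorers gain less).
shrinking_gain = [p + 20 - 0.3 * (p - 640) + w for p, w in zip(prior, wiggle)]

slope_uniform, _ = ols_fit(prior, uniform_gain)
slope_shrinking, _ = ols_fit(prior, shrinking_gain)

print("slope when gains do not depend on prior score:", round(slope_uniform, 3))
print("slope when high scorers gain less:", round(slope_shrinking, 3))
```

A slope near 1 corresponds to gains that do not depend on where a student
started; a slope clearly below 1 means students with high prior scores
are gaining less than students with low prior scores.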
 
Later in the same post, Sanders writes:
 
>     The statements that we have made are not mere assumptions but
>rather come from the results of the data analysis!! However, if
>needed, the modeling process which we deploy does not in any way
>preclude the use of other covariables. Let me restate that, to date,
>we have not found the need for these additional variables.  If, in
>the future, using other test data, a need is found to insure
>fairness, then the models will be expanded to include them.
 
This assurance seems rather unusual for someone producing
publicly-consumed statistics.  In cases of controversy or doubt, the
most reputable statistical bureaus in the country produce alternate
sets of statistics with explanations of their assumptions.  For example,
the Census Bureau provides alternative population projections
(labeled simply high, medium, and low) for the near-term future,
and explains what assumptions led to each series.  I believe that the
Congressional Budget Office also provides alternative budget
projections, and the Labor Department provides scores of
alternative statistics one can choose from.  Even the Department
of Education, in its annual report on dropout/graduation
statistics, is beginning to include alternatives to its
three dropout "rates."
 
I see no reason why TVAAS could not produce alternative
analyses with different variables added.  It seems like the
responsible thing to do, since this is one of the most controversial
issues related to TVAAS.
 
Later yet, in response to Glass' question about general concerns with
validity, Sanders made several claims about TVAAS:
 
>2.  As was mentioned in the most recent post, we now have merged
>the writing assessment data into the master data base.  Even though
>the studies are not complete, it appears that we could substitute
>this data for the language arts data from TCAP without appreciably
>changing the rankings of schools within a system.
 
>3. We have data from the 10th grade TCAP and from the 10th grade
>PLAN tests.  The data could be interchanged with virtually no
>change in the rankings of systems and schools.
 
These two bits of information suggest that the relative rankings would not
depend on the tests mentioned.  It still assumes that these high-stakes tests
are the appropriate venues for judging student performance.
 
>4. When we have had direct knowledge of change in educational
>practice, then we have observed change in the effects.
 
This is a rather broad generalization.  To confirm the thesis that effects
improve if and only if practice changes, you would need a definition of
"change in educational practice" and would then have to ask the question
of every school in the state.  And even this statement
is not precisely true.  For when William Saffold questioned the evidence of
this in Knox County, Sandra Horn acknowledged the discrepancy:
 
>When Bill wrote out his response to questions regarding validity, he
>forwarded it to me for a read-through.  He also sent another post that
>told me to remove the item regarding Knox County schools because
>upon checking the figures, he had found that the findings were not
>consistent across all subjects, although improvements had occurred in some.
 
Please correct me if I am wrong about this:  Bill Sanders writes that they have
observed changes in effects whenever they "have had direct knowledge of
change in educational practices."  Knox County is one of those systems
where UTVARAC has knowledge of such changes.  Knox County does
not show such uniform improvement.  Sanders asked Horn to delete
the Knox County data from the response but, and again correct me if
I am misreading this, he did not revise the broader claim to reflect his
new reading of the data.
 
Considering that he acknowledges that Knox County does not fit his
prior statement, I would like to know what his current claim about
this form of validity is.
___________

From:         Leslie McLean 

Reply to William Sanders regarding hierarchical linear models (and
multilevel linear and non-linear models) and his kind invitation to visit
Knoxville.
   Alas, Dr. Sanders, it is mid-crunch-term and I am up to my fingertips
in students, on- and off-line.  Your invitation is much appreciated, and
I hope some of your fellow correspondents on EDPOLYAN will be able to
take you up on your generous offer.  As soon as classes end here
(mid-April--eat your heart out American colleagues), I'm off for an
isolated cottage on Prince Edward Island (off Canada's East Coast) to
finish a book about teachers and teaching with my dear friend and
colleague, Prof. Johan Aitken.  I hope you will keep up the dialogue with
Harvey Goldstein and others who can appreciate the work you are doing.  I
have read the paper in J. Pers. Eval. in Educ. and agree with Gene that
it does not answer (as it could not) all the important questions.  I have
not seen the Amer. Statistician article.  My position is still this:
  When it comes to the validity of numbers reported to schools, teachers
and students (not to speak of local newspapers, who often get the numbers
whether you send them copies or not):
  1.  The details are everything: items, scaling, regression models,
hierarchical models, treatment of missing data, rules for
inclusion/exclusion, reporting formats and details (with caveats, whether
ignored or not), follow-up meetings in districts/schools to discuss
interpretations and releases to the press.
  2.  The content of the tests is important, but not as important as the
way it is tested: the pedagogy suggested by the form(s) of high-stakes
tests.  NOWHERE is this more important than in the testing of language,
and nowhere are the tests weaker than in this area.  Most standardized
(read 'published') tests CANNOT (and hence do not) test
competence/proficiency in first language according to the best current
pedagogy.  In short, the "Reading" scores predict other reading scores
but not whether students can read their textbooks--and especially not
whether they can and do read books/newspapers/magazines.  The scores and
their non-linear scaled transformations tell us something about the level
of literacy of students, but they do not, and cannot, tell teachers how
to help the students who attain low scores.  Since I make this claim, I
make the corollary claim that gain scores are equally unhelpful to
teachers.
   Whether they are of any use to officials remains to be seen,
and TVAAS has said some are making good use of them.  It would be
wonderful to know some details about the improvements brought about by
the knowledge of the gains reported by TVAAS. It is devoutly to be wished
that the Tennessee writing assessment (I may have the name wrong) is not
eclipsed by the TVAAS.  Sanders and Co. may even wish to campaign on
behalf of their "competitors" in order to help keep assessments in
perspective.
 
 I offer some simple rules in this enormously complex enterprise:
 
   1. ANYTHING that demeans and diminishes respect for teachers is BAD.
   2. Test scores published in newspapers demean and diminish respect for
      teachers, though this is almost never the purpose of the publication.
   3. The validity of a test score diminishes by the cube of the
      distance of the test constructor from the classroom.
   4. The credibility of a derived achievement test score increases by the
      square of the number of terms in its predictive equation, multiplied
      by 10 times the number of times "exp" appears in the formula (whether
      direct or implied by 'log').
   5. The correlation between credibility and validity is -0.14161828.
         Corollary: The credibility of scores arrived at by item-response
                    models is approaching infinity so this may diminish the
                    accuracy of the above correlation estimate.
   6. The demand for published test scores is insatiable, esp. IRT scores.
   7. Discussion on public forums such as EDPOLYAN is pointless.
----------------------------------------------------------------

From:         Harvey Goldstein 

Returning after a few days' absence I find another 100 or
so messages about TVAA which I will slowly work my way
through. However a few people have referred to the
'mixed model' used by TVAA so it might be worth trying
some clarification of this (though when I thought I was
trying to be helpful about standard errors it seemed to
create more confusion than clarification!)
The mixed model, elaborated by Henderson, Harville and
others and the subject of the American Statistician
paper (1991) by (Robert) McLean et al is in essence just
the ordinary multiple regression model where the
coefficients are allowed to have (random) distributions
across the units defined in the model. Thus, e.g., if we
have a model with students and schools identified, we
can take a coefficient in the model, say a pretest score
coefficient and specify that it varies randomly across
schools and then estimate its variance. If we have
several of these we can also estimate covariances among
them. I include here the general linear model extension
to ordinary multiple regression which allows factors,
such as gender, or, say, type of school to be included
(as dummy variables). Then you could see whether the
gender coefficient (representing boy-girl difference)
varies from school to school. In the TVAA case the mixed
models of interest are known as hierarchical models or
multilevel models because, as I have described in my
example, they model a nested or hierarchical structure
where students are grouped within schools and you have
between student and between school variation. As I
understand it (and the TVAA group and I are
corresponding about the technical details) the basic
TVAA model is what is known as a repeated measures model
where the same students are repeatedly measured and you
formally model measurements grouped within students,
themselves grouped within schools. This is a 3-level
structure. The advantage of this formulation is that you
can isolate the influences at each level and study the
factors which may explain the variations. As a
by-product you can also estimate 'value added' scores,
for example for teachers or schools - together with
standard errors(!).
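 
A minimal sketch of the idea of separating between-school from
between-student variation, assuming a balanced two-level design and
simulated scores.  It uses a simple ANOVA (method-of-moments) estimator,
not the likelihood-based procedures used by the multilevel packages, and
the function name variance_components and all numbers are hypothetical.

```python
import random

def variance_components(groups):
    """ANOVA (method-of-moments) estimates for a balanced two-level design.

    groups: list of equal-length lists, one list of student scores per school.
    Returns (between_school_variance, within_school_variance).
    """
    n_schools = len(groups)
    n = len(groups[0])
    school_means = [sum(g) / n for g in groups]
    grand_mean = sum(school_means) / n_schools
    ms_between = (
        n * sum((m - grand_mean) ** 2 for m in school_means) / (n_schools - 1)
    )
    ms_within = sum(
        (y - m) ** 2 for g, m in zip(groups, school_means) for y in g
    ) / (n_schools * (n - 1))
    # The between-school mean square contains within-school noise; remove it.
    between = max(0.0, (ms_between - ms_within) / n)
    return between, ms_within

# Hypothetical data: 200 schools of 30 students, school SD 5, student SD 20.
rng = random.Random(1)
schools = []
for _ in range(200):
    school_effect = rng.gauss(0, 5)
    schools.append([500 + school_effect + rng.gauss(0, 20) for _ in range(30)])

between, within = variance_components(schools)
print(f"between-school variance = {between:.1f}, within-school = {within:.1f}")
```
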
In the mid 1980's a number of investigators (Aitkin and
Longford, Bryk and Raudenbush, my group in London) worked
out efficient computational procedures that allow very
complex and large datasets to be analysed efficiently
and have led to important new insights. At least 3
software packages exist and the big groups (eg SAS)
are also now getting in on the act. Ita Kreft did a
review in 1990 and has an update due out soon in the
American Statistician of these packages. The Journal of
Educational and Behavioral Statistics is about to
produce an issue on this also and there is a large
literature about, with applications in education and
elsewhere, including a few expository texts. If anyone
is interested I can supply some introductory references,
and the recent book by Bryk and Raudenbush would be a
good start (Hierarchical Linear Models, Sage, 1992). The
McLean et al article is not in my view a good
introduction and also doesn't mention any of the
important multilevel literature since 1986.
In short, 'mixed models' are neither obscure nor really
difficult for anyone with a basic understanding of
multiple regression - and there are efficient, publicly
available software packages which are being used by
large numbers of quantitative social, biological and
medical scientists. Multilevel models have come to be
recognised as the basic statistical technique in school
effectiveness studies and there is a growing literature
there too.
 
----------------------------------------------------------------
From:         Sandra P Horn 
Subject:      W. R. Saffold's Questions

William L. Sanders offers these responses to W. Robert Saffold's
questions about TVAAS:
 
Saffold states:
    "I HAVE TYPED THIS ALMOST VERBATIM FROM THE 1994 TVAAS
REPORT--   I hope that I have typed all the numbers correctly.
Apparently, there is a computerized copy of the data available,
but the TVAAS staff have fixed it so that no data can be
manipulated or copied electronically.  I would love to hear their
explanation/justification for this."
 
     These files are available in ASCII format for easier
transfer to other computers.  The ASCII formatted data is
available on request from the UTVARAC.  Many school systems have
already availed themselves of this opportunity.
 
  Saffold:
  "Dr. Sanders' position seems to be that all variations are the
result of educational practice (unless someone else proves
otherwise).  The huge variations from year to year would seem to
undercut his position.  If the same teachers are teaching
(presumably) in much the same way each year, what accounts for
the variation in gains?"
 
     There are 138 school systems in our state.  Each system gets
a report for each of five subjects. Of the 690 reports, there
certainly are many examples of considerable difference from year
to year.  Yes, I attribute most of the differences (above and
beyond the 'noise' in the entire process) to changes in
educational practice.  Some of the changes may be known by the
locals, some may be easily identified, while others may be much
more subtle and much harder to identify.  Let me share some of
the reports which I have received from educators within systems
that have been addressing some of the more obvious differences.
 
Among other things, they attribute the variations to
 
1. Less than desirable communication across grade levels.
        Pretend that two years ago a system/school had retarded
gains in math from second to third grade.  Assume that the high
achieving students in the third grade were permitted to progress
at a more rapid pace.  But assume that the fourth grade
instruction was 'locked-in' and was not extended from where the
third-grade faculty 'left-off'.  The gain from third to
fourth could show some dramatic change between adjacent years.
 
2. Changes in the structure of the school day.
         Consider a school in which last year the gains in science were low
but the gains in language arts were quite high.  If the school decides,
as a result of this, to reallocate time so that more time is spent on
science at the expense of another subject area, both subjects may show a
change from one year to the next.
 
These are just two of many examples of practices that could result
in large changes between adjacent years.  However, the school and system
main effects are considerably larger than the school*year and
system*year interactions.
 
Saffold asks:
  "Could the measurement instrument have anything to do with it?"
 
The test forms are different but equivalent each year.  The
distributions for each subject and grade over the five years of
testing are extremely similar.  The means have gone up slightly
in some subjects, but the over-all variances are virtually
identical.  These distributions are based upon the 55-56,000
records obtained from each grade-subject combination each year.
The simple r's are about .7 between scores in adjacent grades for
each subject.
 
"Or could the model just be misspecified?"
   No model is perfect.  However, I can demonstrate that the
school and teacher effects are not related to indicators of ses
and prior and post levels of achievement.
 
 
3.  Saffold notes,  "I keep reading that TVAAS differentiates between
district, school, and teacher effects on student learning.  Does the
student gain at the classroom level equal the teacher's effect?"
 
The teacher model is:
 
Y(ijklm) = mu(ijk) + Year*grade*subject*teacher(ijkl) + e(ijklm)
where  i= ith year j= jth grade k=kth subject l=lth teacher
       e(ijklm)=mth student score within i,j,k,l
 
var Y  is ZGZ' + R
 
In this specific model the mu(ijk) is deemed to be the fixed part
of the equation, while year*grade*subject*teacher is the random
part.
 
The general form of the mixed model equations that we use is:
 -                                      -   -   -     -             -
 |  X'*INV(R)*X   X'*INV(R)*Z           |   | b |     | X'*INV(R)*Y |
 |                                      |   |   |  =  |             |
 |  Z'*INV(R)*X   Z'*INV(R)*Z + INV(G)  |   | u |     | Z'*INV(R)*Y |
 -                                      -   -   -     -             -
 
 The G-matrix contains the covariances among teachers over
grades,subjects and years.
 
The R-matrix contains the covariances among student scores over
all years, subjects and grades.
 
The Y-vector contains the scale scores (not gains) for all
students over all subjects over all years and over all grades for
a school system.
 
The X-matrix contains all fixed effects, either continuous or
discrete variables.
 
The Z-matrix contains all random effects.
 
THE B-VECTOR IS BLUE IF G AND R ARE KNOWN.
THE U-VECTOR IS BLUP IF G AND R ARE KNOWN.
  Harville has called the u's 'the realized value of the random
variable', a label that I personally like.
 
If G and R are estimated from the data, then B and U are often
referred to as empirical BLUE  and BLUP.  These estimations are
usually completed with REML, or as in our case with close
approximations for G and R for computing reasons.
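 
A minimal sketch of the mixed model equations for a toy case, not the
TVAAS implementation.  It assumes a single fixed mean, one random effect
per teacher, G = var_u * I and R = var_e * I with both variances treated
as known; the actual G and R described in this post are far richer.  The
scores, the variances, and the function names are hypothetical.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def mixed_model_solve(scores_by_teacher, var_u, var_e):
    """Build and solve the equations for y = mu + u(teacher) + e.

    With G = var_u * I and R = var_e * I, multiplying both sides by var_e
    turns the INV(G) block into (var_e / var_u) * I.
    Returns (mu_hat, [u_hat for each teacher]).
    """
    q = len(scores_by_teacher)
    lam = var_e / var_u
    counts = [len(s) for s in scores_by_teacher]
    totals = [float(sum(s)) for s in scores_by_teacher]
    A = [[0.0] * (q + 1) for _ in range(q + 1)]
    A[0][0] = float(sum(counts))                      # X'X
    for j in range(q):
        A[0][j + 1] = A[j + 1][0] = float(counts[j])  # X'Z and Z'X
        A[j + 1][j + 1] = counts[j] + lam             # Z'Z + scaled INV(G)
    rhs = [sum(totals)] + totals                      # X'y stacked on Z'y
    solution = solve(A, rhs)
    return solution[0], solution[1:]

# Hypothetical scores for three teachers with unequal class sizes.
scores = [[52, 55, 49, 60], [40, 44], [58, 61, 63, 57, 66, 59]]
mu_hat, u_hat = mixed_model_solve(scores, var_u=25.0, var_e=100.0)
print("mu_hat =", round(mu_hat, 2))
print("u_hat  =", [round(u, 2) for u in u_hat])
```

In this toy case the first element of the solution plays the role of b
and the remaining elements play the role of u.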
 
Upon inspection, it is easy to see that the number of equations
to be solved for a large system like Memphis will number into the
tens of thousands.
 
The U's sum to zero for each grade and each subject.  These are
'shrinkage' estimates and give a direct measure of each teacher's
effect as the deviation from the system mean for each subject
each year.  By choosing estimable functions correctly, these
estimates can be scaled as 'gains'.  Several points: with little
information the estimates are 'pulled' close to the average.  This
adds considerable protection against someone having an unfair
estimated effect due to chance; in fact, an estimate cannot be
judged different from average unless the effect is extreme and
until considerable information is accumulated.  The sensitivity
of this process accrues because we are using the considerable
correlation structure that accumulates by simultaneously
considering all records for each student and fitting all teacher
effects simultaneously (a comment that McLean felt that I was
fobbing him with).
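 
A minimal sketch of the 'shrinkage' behavior described above, under
strong simplifying assumptions that are not the TVAAS model: one random
effect per teacher, a known system mean, and known variances var_u
(between teachers) and var_e (between students).  The function name and
all numbers are hypothetical.

```python
import math

def shrunken_effect(raw_deviation, n_students, var_u, var_e):
    """Shrink a teacher's raw mean deviation toward zero.

    raw_deviation: class mean minus the (assumed known) system mean.
    Returns (shrunken estimate, standard error of prediction).
    """
    lam = var_e / var_u
    k = n_students / (n_students + lam)   # shrinkage factor, between 0 and 1
    estimate = k * raw_deviation
    std_error = math.sqrt(var_u * (1.0 - k))
    return estimate, std_error

# The same raw deviation of +10 points, backed by more and more records.
for n in (2, 10, 50, 200):
    est, se = shrunken_effect(10.0, n, var_u=25.0, var_e=100.0)
    print(f"n = {n:3d}: estimate = {est:5.2f}, standard error = {se:4.2f}")
```

With few records the estimate stays close to zero and its standard error
stays large; both change only as information accumulates.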
 
(Note: We encode Z in various ways to accommodate various forms
of teaching, i.e., self-contained classroom, departmentalized
instruction, team teaching, teaching across grades, changing
assignments over time, etc. Also, we have developed a procedure
that we call our stacked-block concept, which I won't go into
here, that adds considerable improvement to the sensitivity of the
estimates).
 
(Second note: To all of you HLM modelers, I am fully aware of
these models.  I recognize that they are a sub-set of the general
form of mixed model theory and methods.  I don't agree with a
recent post of McLean that there is a new theory and methodology
that surpasses mixed models.  They have been developed to deal
with a different set of problems when covariables are included at
different places on the hierarchy).
 
Saffold asks, "Does the student gain at the school level equal the school's
effect?  Does the student gain at the system level equal the
system's effect?"
 
The school model is like the teacher model except that the school is
included as a fixed effect.
 
4. Saffold says, "Dr. Sanders admits that we don't even have a measure of
student ability."
 
We don't have a DIRECT measure of student ability.  However,
the evidence strongly supports the view that, by including all of
the covariance structure, an unbiased measure of the teacher
effect is obtained.  The state-wide analysis clearly shows that
the group of students in Tennessee which are not making as much
gain in most of our systems is NOT the low achieving students.
Rather it is the HIGH achieving students.  Some of my colleagues
are nearly through with a manuscript that will document this
fact.
 
 
Saffold asks,  "(And if ability--or lack thereof--doesn't matter, why are
special education students excluded from teacher assessment?)"
 
  That part of the EIA was suggested by others.
 
 
Saffold says, "Dr. Sanders believes that low-achieving students (who may
or may not be low ability students) can make "satisfactory
gains" consistent with the national norm gain---but a
low-ability student may not be able to do so, at least at the
same rate as a higher-ability student (and the implicit
assumption in the TVAAS model seems to be that all students
should learn at the same rate--the national norm gain)."
 
No. The implicit assumption is that all students eligible to
take the TCAP tests can achieve gains at least equal to the
norm gain if they are provided effective instruction from where
they enter the classroom, academically.
 
 
Saffold says, "Thus teachers might have to work much harder with a group
of low-ability students in order to achieve acceptable gain
scores. This discriminates against teachers with low-ability
(as opposed to low-achieving) students."
 
  I find it interesting that you and others have expressed
concern about teachers of low achieving (or low ability)
students being penalized by the TVAAS process. In fact, most of
the concerns that we have received have come from teachers,
principals, etc. within systems and schools with a
disproportionate number of high achieving students.  In fact,
many educators who are working with populations of students with
lower achievement (and lower abilities) have expressed to me that
they are delighted with TVAAS because for the first time there is
some documentation for the public to see that their students are
making creditable progress.
 
5. Saffold asks,  "Will the TVAAS staff please provide information about
the correlation between the results of the norm-referenced test
and the criterion-referenced test?"
 
     We have not done a thorough analysis of this.  The state-
wide system criterion referenced data are available from State
Testing.
 
    "It would help to know how well the norm-referenced test
matches the Tennessee curriculum."
 
     That question has been recently raised and a thorough review
of that issue has been completed.  The correlation is extremely
high in all subject areas.  If you want more information, I would
contact someone on the staff of the State Board of Education.
 
We hope that this answers the points you have raised.
 



From:         Sherman Dorn 
Subject:      Assessment
 
In a post responding to me on the comparative reform questions I've raised,
Rick Garlikov raised an important question about assessment:
 
>     Finally, Sherman, in previous posts, does not seem to expect
>the same problems with internal assessments that he tends to
>expect from external ones.  I find that odd.  As I said in a
>previous post, if it is acceptable for sixth grade teachers to
>assess what their new students have been taught previously, why
>is it not acceptable for the state or for private assessment
>enterprises to monitor as students go along, in order to report
>to parents?
 
I see nothing fundamentally wrong with reporting progress data to parents.
I see everything wrong with the way that the state legislature in Tennessee
established TVAAS.  The devil is in the details in this matter.
 
Consider, for example, whether TVAAS establishes a high-stakes environment,
and whether teachers see it as a relatively non-pressured form of feedback
(which is how UTVARAC staff argue it should be used ideally).  The
legislature created TVAAS' legal framework in a way that made it very clear
to teachers what the point was.  In 1991 and early 1992, then-governor Ned
McWherter was trying to craft a major education financing reform (including
a state income tax to fund it) in response to a successful finance equity
suit brought by districts around the state.  There is quite a bit of
evidence that TVAAS was seen as the accountability "bait" to get legislators
to go along with the tax hike.  The state consulted with Bill Sanders about
TVAAS, and the chief sponsors of the legislation quickly accepted an
amendment by state senate education chair Ray Albright to include TVAAS in
the bill.
 
The state commissioner of education at the time, Charles Smith, stated
publicly that value-added assessment would give Tennessee the best
accountability mechanism in the nation, and it would be part of a
carrot-and-stick approach to education reform.  Lamar Alexander (then
Secretary of Education) approved of McWherter's bill precisely because of
TVAAS as an accountability mechanism.  Several legislators tried to amend
the bill to remove everything EXCEPT TVAAS from the reform bill.  Others
tried to amend the bill to include explicit cut-offs at which administrators
would be removed or to make teacher effect estimates public.  (Sanders is on
record as having opposed the latter type of amendments.)
 
In the end, McWherter did not get his state income tax, but did get a
half-cent hike in the state sales tax and most of what he proposed plus TVAAS.
 
All of this was reported in the Tennessee Education Association newsletter
at the time.  I don't think it could have escaped notice among most
public school teachers that many legislators may have voted for the
increased funding for schools only because there was TVAAS attached, or what
the proposed amendments represented.