
Part of the collection "Performance Assessment in Educational Leadership Programs," edited by James Berry and Ronald Williamson (National Council of Professors of Educational Administration).

 

Staging the Performance: The Challenge of Authenticity in Educational Leadership Performance Assessment

Module by: David Anderson

Summary: This chapter focuses on the unique challenges of performance assessment in the field of educational leadership. In teacher preparation programs, authentic assessment is challenging but doable. If you want to know if someone can teach, you should not ask them to write an essay about what lesson they would create given a set of materials and learning objectives. You should actually give them the materials, have them teach, and assess that performance. In educational leadership, true performance assessment is much more challenging. If we want to know if someone can lead, it is difficult to find a situation where she/he can lead and then have trained assessors evaluate her/his performance in that situation. And what products could be captured from that performance? This chapter addresses these challenges by outlining four interconnected phases for developing a performance assessment, involving twelve steps: design (prioritizing the skill domain, choosing an assessment activity, choosing a product from that activity, brainstorming the characteristics of a good product, formulating those characteristics into a rubric, and setting benchmarks for the rubric), evaluation (piloting the assessment, and applying the rubric to a small number of students), implementation (training assessors, and applying the rubric to all students), and program development (evaluating the results, and returning to step 1). A specific example of these steps is provided, and the role of performance assessment in reconsidering the domain of knowledge, skills, and dispositions for educational leadership is discussed.

Note:

This module has been peer-reviewed, accepted, and sanctioned by the National Council of Professors of Educational Administration (NCPEA) as a significant contribution to the scholarship and practice of education administration. In addition to publication in the Connexions Content Commons, this module is published in the International Journal of Educational Leadership Preparation, Volume 4, Number 4 (October – December 2009). Formatted and edited in Connexions by Theodore Creighton, Virginia Tech.

Introduction

Concerns about the quality of public education have always been a part of the history of the US, but they have been intensifying over the past couple of decades. With these concerns, the demands for accountability have grown: first in K-12 and now in post-secondary education (Fritschler et al., 2008). Accountability refers to reviewing the quality of educational programs and holding them to standards of quality through a series of sanctions and/or rewards.

Appropriately, the quality standards that drive accountability in education have been grounded in student learning. So, across all the academic and professional fields, there has been a tremendous amount of work defining what students should know and be able to do. In the field of educational leadership, this domain of knowledge and skills is encapsulated by the Educational Leadership Constituent Council (ELCC) standards (Wilmore, 2002).

The move toward program accountability in education has shifted from a focus on standards (what students should know and be able to do) to a focus on the challenges of implementing performance activities that can be assessed to show student progress in meeting those standards. In professional training programs, these assessment challenges are significant. To understand why, let’s examine the challenges of assessment more closely.

The Challenge of Assessment: Assessing Performance

In professional fields in higher education, where practitioner training is the focus, the emphasis (quite appropriately) is more on the “able to do” than on the “to know.” Knowledge is important, but for someone training for a specific professional role, applying that knowledge is particularly important.

Educators have a long history of assessing knowledge, but a relatively short history of assessing the application of knowledge. Knowledge assessment has traditionally been done through paper-and-pencil tests, preferably standardized multiple-choice tests if the results will be used for high-stakes decisions, such as accountability decisions (Stiggins, 1994). The field of psychometrics has constructed an impressive and sophisticated array of theories and tools to develop valid and reliable assessments of knowledge, including the so-called higher-order thinking skills (Linn et al., 1991).

Assessing application of knowledge has been far murkier. This area is known as performance assessment. Performance assessment gets its name from the belief that, if a faculty member wants to measure how well a student applies core knowledge, the student needs to actually apply that knowledge in a performance activity, and the faculty member needs to capture that performance and evaluate it.

Learning to drive can serve as an example. The goal of a good driver’s education class is to produce competent drivers. This involves helping student drivers master a body of knowledge (how to operate a car, rules of the road, etc.) as well as actually applying that knowledge to successfully drive a car. If the instructor wants to assess a student driver’s competence, she would give two types of assessment: a knowledge assessment (How well does the student driver understand the rules of the road?) and a performance assessment (How well can the student driver actually drive a car?). These two types of assessment complement each other because a traditional knowledge test can adequately sample across a broad domain of issues (which is impossible to address in an actual driving test), whereas a driving test can clearly indicate whether the student can actually drive.

The challenge is that any high-stakes assessment, including performance assessment, must meet certain quality characteristics. These characteristics are often called the APPLE criteria: Administratively feasible, Professionally credible, Publicly acceptable, Legally defensible, and Economically affordable (Nyirenda, 1994). These criteria are typically applied to large-scale, high-stakes assessments, but there are equivalent quality characteristics for classroom-level assessments: 1) Purpose and Impact (How will the assessment be used and how will it impact instruction and the selection of curriculum?); 2) Validity/Authenticity (Does it measure what it intends to measure? Does it allow students to demonstrate both what they know and are able to do?); 3) Fairness (Is the assessment biased towards any group of students?); 4) Reliability (Does the assessment measure skills consistently regardless of situation and/or assessor?); 5) Significance (Does the assessment address content and skills that are valued by and reflect current thinking in the field?); and 6) Efficiency (Is the assessment reasonably easy and cost-effective to complete and score?).

In order to address these quality characteristics, assessment developers typically follow a set of steps (Herman et al., 1992; Stiggins, 1994). There are four interconnected phases for developing a performance assessment, involving twelve steps:

Phase 1: Design

This initial phase involves identifying the knowledge and skills that a student should possess and developing a set of activities or products that provide an opportunity for students to demonstrate their mastery. The following steps are included in this phase:

1) List and prioritize the knowledge, skills, and dispositions from the curriculum. (Standards)

2) Choose an activity that will allow students to demonstrate mastery of each set of knowledge, skills, or dispositions. (Assessment Activity)

3) Choose a product that captures this activity (an essay, a video, artifacts that identify the assessment object).

4) Brainstorm characteristics of a good product. (Assessment Criteria)

5) Formulate those characteristics into a rubric, including both holistic and analytic rubrics. (Rubric formation) Typically, assessment developers will brainstorm characteristics of a performance that hits the target, one that misses the target (since missing the target is not necessarily just the absence of the “hit the target” behaviors), and one that reflects developmental behaviors in between. This provides a three-level rubric; a five-level rubric can simply allow the assessor to place performances between the levels. In any case, any number of levels can be developed through a brainstorming session.

6) Set benchmarks for assessment—what is adequate for program accountability purposes? (Setting Performance Standards)
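The rubric and benchmark produced by steps 5 and 6 can be pictured as a small data structure. The following is only a minimal sketch under stated assumptions: the class name `AnalyticRubric`, the level labels, the placeholder descriptors, and the choice of the middle level as the benchmark are all hypothetical illustrations, not part of the framework described above.

```python
from dataclasses import dataclass

# Hypothetical level labels, ordered from lowest to highest.
LEVELS = ("misses target", "developing", "hits target")


@dataclass
class AnalyticRubric:
    """One analytic rubric: a criterion with one descriptor per level."""
    criterion: str
    descriptors: dict  # level label -> descriptor text
    benchmark: str     # lowest level counted as adequate (step 6)

    def meets_benchmark(self, level: str) -> bool:
        """Return True if the awarded level is at or above the benchmark."""
        return LEVELS.index(level) >= LEVELS.index(self.benchmark)


# Placeholder content; a real rubric would carry the brainstormed descriptors.
rubric = AnalyticRubric(
    criterion="illustrative criterion",
    descriptors={level: "placeholder descriptor" for level in LEVELS},
    benchmark="developing",  # assumption: the middle level is adequate
)

print(rubric.meets_benchmark("hits target"))
print(rubric.meets_benchmark("misses target"))
```

Keeping the benchmark separate from the descriptors reflects the distinction between describing levels of performance (step 5) and deciding which level is adequate for accountability purposes (step 6).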

Phase 2: Evaluation

During this phase designers use the performance assessment with students in order to determine whether the activity appropriately measures the knowledge and skills identified earlier. Two additional steps are included.

7) Pilot the performance assessment with some students.

8) Evaluate student work, using the rubric. Choose a set of student work samples that illustrate each level of each rubric. (Identifying Anchor performances) These anchor performances, along with the rubric, are invaluable tools for students to understand the target they are expected to hit.

Phase 3: Implementation

The third phase focuses on implementing the performance assessment. Assessors are trained and the scoring rubric is broadly used to assess student work. Two steps are included in this phase.

9) Train assessors using the anchor performances to ensure inter-rater reliability.

10) Apply the rubric to all appropriate students.
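Step 9 names inter-rater reliability as the goal of assessor training but does not prescribe a statistic for it. One common choice, used here as an assumption rather than as the chapter's method, is Cohen's kappa, which corrects the raw agreement between two assessors for the agreement expected by chance. The scores below are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same work samples."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(rater_a)
    # Proportion of samples on which the two raters gave the same level.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's level frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical level throughout
    return (observed - expected) / (1 - expected)


# Invented rubric levels (1-3) awarded to six anchor samples.
scores_a = [1, 2, 3, 3, 2, 1]
scores_b = [1, 2, 3, 2, 2, 1]
kappa = cohens_kappa(scores_a, scores_b)
print(round(kappa, 2))  # 0.75
```

A value near 1 suggests the assessors are applying the rubric consistently; a low value would be a signal to revisit the anchor performances together before scoring all students.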

Phase 4: Program Development

This final phase involves examining data about the assessment to identify patterns that may indicate issues of fairness in implementation. It also includes examining again whether the performance activity appropriately measures the knowledge and skills identified in Phase 1. Two final steps are included in this phase.

11) Examine the results—look for patterns in the results. Are there achievement gaps between males and females? Between minority and majority students? If so, there may be an issue of fairness in an assessment. Re-evaluate the assessment in light of any achievement gaps, and modify accordingly.

12) Return to the original standard. Do the performances give any new insight into the appropriateness and significance of the standard and how it is worded? (Feedback loop between Standards and Assessments)
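The pattern check in step 11 can be made concrete by comparing rubric scores across two groups of students. The sketch below uses a standardized mean difference (Cohen's d with a pooled standard deviation); this statistic, the function name `standardized_gap`, and the sample scores are assumptions chosen for illustration, since the steps above do not specify how gaps should be measured.

```python
from statistics import mean, stdev


def standardized_gap(scores_a, scores_b):
    """Standardized mean difference (Cohen's d) between two groups."""
    n_a, n_b = len(scores_a), len(scores_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    # Pool the two sample variances, weighted by degrees of freedom.
    pooled_var = ((n_a - 1) * stdev(scores_a) ** 2
                  + (n_b - 1) * stdev(scores_b) ** 2) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("scores have no variation")
    return (mean(scores_a) - mean(scores_b)) / pooled_var ** 0.5


# Invented rubric levels (1-3) for two hypothetical student groups.
group_a = [3, 2, 3, 2]
group_b = [2, 1, 2, 1]
gap = standardized_gap(group_a, group_b)
print(round(gap, 2))  # 1.73
```

A large gap does not by itself show that the assessment is unfair; it flags where to look more closely at the directions, the rubric, and samples of student work, as step 11 recommends.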

A Common Misconception: An Essay is a Performance Assessment

Many of the earliest forms of high-stakes performance assessments (that met the criteria listed above) were in the area of writing (Weigle, 2002). The problem with this history is that many educators continue to confuse written essays with performance assessments. A written essay is only a performance assessment product in two cases: 1) if it is an assessment of writing; or 2) if it is capturing an actual performance in written form. Many so-called performance assessments are neither.

In the field of professional educator training, a classic approach to performance assessment is the infamous in-box activity. In this approach, a student is given a case scenario (with various hypothetical artifacts at his disposal and a pressing problem facing him) and asked how he would address this situation given the constraints and tools available. He then writes an essay describing a response to this situation, and must support his response with references to theoretical frameworks provided in the course. This is a fine learning activity, but it does not constitute a performance assessment in the truest sense.

To illustrate why, we can return to our driver’s education example. The in-box activity is the equivalent of asking students to imagine that they are at a four-way stop where they are one of three cars that reached the intersection at the same time: what will they do? Although answering this question correctly (with reference to the appropriate driving rule) is valuable, it is not a substitute for actually watching a student reach a four-way stop in that situation. Why? Because applying knowledge in a real situation is always more complex than such a simplified scenario allows. Even if a student understands the four-way stop rule, will she/he correctly process the information that occurs in the real situation? Will she/he correctly observe all the cars as they approach the intersection and focus on the timing? Will she/he be distracted by additional (unforeseen) factors, such as the pedestrian trying to cross the road, or the car riding close behind the rear bumper? Applying knowledge in the context of real situations is the essence of true performance assessment.

Authenticity is the Key Challenge

In teacher preparation programs, authentic assessment is challenging but doable (Doherty et al., 2002). If an administrator wants to know if someone can teach, he should not ask that person to write an essay about what lesson to create, given a set of materials and learning objectives. The administrator should actually give the teacher a set of materials, have him teach, and assess that performance.

However, there are numerous practical problems with performance assessment in teacher preparation programs. It is often impractical for education professors to observe all of their students teaching in actual school environments. (This can be done by student-teaching supervisors and mentors, but not by each professor in every class). Besides, research has shown that direct observation (using a checklist or rubric) is often not the most valid and reliable method of evaluation.

So, the solution is to capture the performance in a set of products: a videotape, a written analysis, and samples of student work (or other artifacts from the actual performance). These can then be evaluated. Research has shown that evaluation of a set of products often has more validity and reliability than direct observation (because of the ability of the assessor to carefully review and reflect on the products, which is impossible to do in real time).

In educational leadership, true performance assessment is much more challenging. If leadership programs want to know if someone can lead, it is difficult to create and/or find a situation where she/he can lead and then have trained assessors evaluate her/his performance in that situation. And what products could be captured from that performance?

The Quasi Performance Assessment: Staging

In educational leadership, the challenge is to create staged situations in which reasonably authentic performances can be obtained. So, what are the strategies for staging?

One approach is to create an activity that requires the student to take a leadership role and actually perform leadership responsibilities, even if the activity is not embedded directly in the core activities of the school or district. For example, any activity that requires consensus building could demonstrate leadership skills.

Another approach is to have a student participate in an authentic activity by helping the actual school leader perform this activity. This would provide limited performance opportunities, but it would allow the student to experience and critically analyze actual performances in a very focused and intimate fashion.

An Example: Designing a Performance Assessment in Educational Leadership

Let’s return to the twelve steps and apply them to one hypothetical example in an educational leadership curriculum. This will be a fairly simple example for the purpose of illustration.

Step 1: Although ELCC has identified the domain of knowledge and skills required of a practitioner, every program must prioritize these standards and decide where and how often they will assess each standard. In the program at Leadership University, the faculty put a great deal of emphasis on participative, democratic leadership. A key part of this is consensus building. So, the faculty designates two faculty members (who teach a course that incorporates consensus building) to create a performance assessment that addresses consensus building, which is considered a central part of a specific ELCC standard. The faculty members decide to define consensus building (the knowledge, skills, and dispositions) using a framework that identifies the various behavior types often seen in group dynamics as well as techniques for bringing these types together into consensus.

Step 2: The two faculty developers decide on an appropriate activity: they will ask each student to find a meeting in their organization where an important, but controversial, item will be discussed. They will then ask each student to approach the normal facilitator of the group, and ask if she/he could first observe an initial discussion around the topic, and then facilitate a consensus-building dialogue around this item. Next, the developers will ask each student to observe the beginning discussion using an observation tool that captures the behavior types of each participant at the meeting. Then, based on this observation tool, the student would (at least try to) apply the appropriate consensus-building techniques. Satisfied with this activity, the faculty developers write descriptions and directions for the students. Note: This is an authentic task because each student is actually using the consensus-building techniques in a real situation. It is somewhat staged because the meeting participants will realize that this is a student assignment, and this might affect the authenticity of their engagement in the process. Also, consensus building often requires numerous discussions, but this on-going process is not reflected in the assignment.

Step 3: It is decided that students will submit their observation sheets, as well as essays documenting the techniques they used, the effects of those techniques, and recommendations for further consensus-building activities. Each essay would be reviewed by one of the other participants from each student’s meeting for accuracy and completeness.

Step 4: Given the consensus-building framework, the faculty decide that the observation sheet should identify a range of behavior patterns across multiple participants in the meeting. The essay should address multiple techniques by clearly identifying each technique and how it was used in the discussion, as well as analyzing strengths and weaknesses.

Step 5: Then, the faculty visualize an exemplary student performance and brainstorm descriptors of that performance, including the number of behaviors and techniques integrated into the performance and analysis. Next, they visualize a performance that misses the target for a variety of reasons, and they brainstorm descriptors of that performance. Finally, they imagine a performance in the middle and brainstorm appropriate descriptors at that level. In all three cases, these descriptors include adjectives that capture a sense of quality, not just quantity of the behaviors and techniques. The faculty members notice that the descriptors fall into two broad categories: observation and analysis of behaviors, and selection and analysis of techniques. They split the descriptors into these two groups to create two analytic rubrics.

Step 6: The faculty decide that the middle levels of each analytic rubric demonstrate an adequate mastery of the skills to meet the ELCC standard. The top levels go beyond the expectations of the ELCC standard, and give room for further growth for the more advanced students completing the performance. They create a table that captures the three levels of both analytic rubrics, including all the descriptors.

Step 7: The faculty decide on an appropriate instructional activity which will help students prepare for this assessment: a simulation in small groups. They then pilot the instructional activity and assessment in one section of the course. They decide that, even though the assessment may not yet meet the rigorous standards for external accountability (e.g., the APPLE criteria), the design process to this point ensures that it meets their standards for supporting high quality teaching and learning. In other words, it will be a good learning experience for the students, and the instructor of this pilot course will be able to grade the assessment in a fair and appropriate manner. Since the assessment is part of the students’ grades, the students in the pilot will give the assessment appropriate effort. (Note: If the assessment is used this way in the piloting, it should be used this way in the actual implementation for validity and reliability purposes.)

Step 8: Next, the faculty developers choose samples of student work, and each developer assesses the samples individually. Then they meet, compare their grades, and discuss and resolve any differences in assessment until they have identified at least two examples of student work at each level for both rubrics. These are the anchor papers.

Step 9: The two faculty members arrange a meeting for all of the faculty who teach this particular course (and a few other interested faculty). At this meeting, the two faculty developers distribute the description and directions for completing the activity (which were given to students) as well as the rubric (which was also given to the pilot students). Next, the two developers share half the samples of student work, and ask each faculty member to individually assess the samples with the rubric. Then, as a group, the faculty discuss their assessments until they agree on what each sample should receive. The developers then distribute the second half of the samples to check that there is consistency among all the assessors. Any re-categorization of anchor papers is discussed and resolved.

Step 10: All faculty members (both full and part-time) who teach sections of the course integrate the assessment into their sections. The faculty are given freedom in deciding how to incorporate the assessment into their grading practices.

Step 11: After a year of giving the assessments, the faculty sit down and examine the results. They notice that students working in urban environments are getting consistently lower scores. In examining samples of student work, they realize that these students are dealing with much larger groups and are trying to evaluate, in depth, more individuals. The faculty decide to change the directions by asking students to focus on specific influential participants to do their analysis. This helps with the validity of the assessment since it reflects appropriate consensus-building skills. They also notice that one adjunct professor gives consistently higher scores. So, they sit down with this instructor and assess samples of student work together. During the subsequent discussion, they identify a misunderstanding on the part of the instructor as to the definition of one behavior and associated technique.
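The observation in step 11 that one instructor gives consistently higher scores can be surfaced with a simple comparison of each assessor's average against the overall average. This is a minimal sketch with invented data; the function name `rater_offsets`, the rater labels, and the 0.5 threshold are hypothetical choices, not values from the example.

```python
from collections import defaultdict
from statistics import mean


def rater_offsets(records):
    """Map each rater to (rater's mean score - mean score of all records)."""
    by_rater = defaultdict(list)
    for rater, score in records:
        by_rater[rater].append(score)
    overall = mean(score for _, score in records)
    return {rater: mean(scores) - overall
            for rater, scores in by_rater.items()}


# Invented (rater, rubric level) pairs; "C" scores higher than the others.
records = [("A", 2), ("A", 2), ("B", 2), ("B", 1), ("C", 3), ("C", 3)]
offsets = rater_offsets(records)

# Hypothetical threshold: flag raters more than half a level above average.
lenient = [rater for rater, off in offsets.items() if off > 0.5]
print(lenient)  # ['C']
```

A flagged offset may reflect a genuinely stronger section rather than lenient scoring, which is why the faculty in the example follow up by assessing samples of student work together with the instructor.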

Step 12: Having done this assessment in many sections, they convene another meeting of all the instructors who have given the assessment. At this meeting, the instructors indicate that consensus building needs to be tied to group communication skills, and they suggest a slightly different framework that incorporates both areas. With this new definition, they return to step 2 and make appropriate modifications.

Clearly, these steps reflect an ongoing process. One difficulty is that as rubrics change, it is difficult to examine longitudinal data. However, faculty can balance the need for revision with the need for longitudinal analysis as they see fit.

The Final Solution: A Portfolio of Assessments

Comprehensive accountability requires multiple types of assessments: performance assessments, knowledge assessments, and hybrid assessments. A program also needs to assess a broad domain of standards (knowledge, skills, and dispositions), and try to assess each of these characteristics through multiple assessments. But how is this done?

The best approach is to create a programmatic portfolio with course-embedded assessments. This requires horizontal and vertical articulation (scope and sequence) across the courses in the entire program. Portfolios are most valuable when the course-embedded assessments can reveal a student’s progress (or lack thereof) over time. To achieve this, the assessments can be created in a way that reflects skill building over time (Koretz et al., 1993; Calfee & Freedman, 1996; Cummings et al., 2008).

Yet, this is difficult in the field of educational leadership, since most programs cannot afford to use a cohort model where courses are taken in a particular sequence. Practitioners need the flexibility to take courses in any order, given the need to balance both personal and professional roles. However, well-designed performance assessments demand thick description and rich analysis from students. These in-depth student work samples can still reflect student growth if assessors add written comments that pinpoint individual student needs, and subsequent assessors check on these needs.

The Future: Learning From the 12th Step—the Feedback Loop of Assessments to Standards

In all assessment processes, there is a strong feedback loop between the standards and the assessments, as captured in the 12th step of this framework. Standards are written to capture a vision of what students should be able to do. But assessment is where the “rubber really hits the road”—the faculty do not really understand and appreciate how a standard is phrased and framed until it is operationally defined by its corresponding assessment. And in assessing it, they more concretely see the skills in action and can re-evaluate the programmatic vision.

Although this example focuses on a very specific area, this process of revision should be happening more broadly in the field of educational leadership. The ELCC standards are thoughtfully developed and comprehensive, but what do they look like in practice, given the profession’s experience with performance assessment to date? What are individual programs learning in this regard?

Overall, the field of educational leadership is just beginning its journey towards a standards-driven performance assessment-based curriculum. This will be a process of trial-and-error, but it may well lead to stronger programs with significantly higher credibility with the public and the rest of academia.

Since assessment must be aligned with instruction, the use of performance assessments tends to orient instruction towards constructivist activities, where students are highly engaged in constructing their own knowledge.

References

Brown, S. and Knight, P. (1994). Assessing Student Learning in Higher Education. Boston: Kluwer Academic Publishers.

Calfee, R. C., & Freedman, S. W. (1996). Classroom Writing Portfolios: Old, New, Borrowed, Blue. In R. C. Calfee & P. Perfumo (Eds.), Writing Portfolios in the Classroom. Mahwah, NJ: L. Erlbaum Associates.

Cummings, R., Maddux, C. D., & Richmond, A. (2008). Curriculum-embedded performance assessment in higher education: maximum efficiency and minimum disruption. Assessment & Evaluation in Higher Education, 33(6), 599-605.

Doherty, R. William, R. Soleste Hilberg, Georgia Epaloose, and Roland G. Tharp (2002). "Standards performance continuum: development and validation of a measure of effective pedagogy." The Journal of Educational Research 96.2 (Nov-Dec 2002): 78(13)

Fritschler, A. Lee, Paul Weissburg, and Phillip Magness. "Growing government demands for accountability vs. independence in the university. (PERSPECTIVES)(Critical essay)." Liberal Education 94.4 (Fall 2008): 40(8). 

Herman, J.L., Aschbacher, P.R., & Winters, L. (1992). A practical guide to alternative assessment. Alexandria, VA: Association for Supervision and Curriculum Development.

Heubert, Jay P, and Hauser, Robert M. (1999). High Stakes: Testing for Tracking, Promotion, and Graduation. ED439151.

Koretz, D., Stecher, B., Klein, S., McCaffrey, D., & Deibert, E. (1993). Can portfolios assess student performance and influence instruction? The 1991-92 Vermont experience (December, 1993). Washington, DC: RAND Institute on Education and Training and Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student Testing.

Linn, R. (1987). Accountability: The comparison of educational systems and the quality of test results. Educational Policy, 1 (2), 181-198.

Linn, R.L., Baker, E.L., & Dunbar, S.B. (1991, November). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20 (8), 15-21.

Linn, Robert (1994). Performance Assessment: Policy Promises and Technical Measurement Standards. Educational Researcher, 23(9), 4-14.

Nyirenda, Stanley (1994). Assessing highly accomplished teaching: Developing a metaevaluation criteria framework for performance-assessment systems for national certification of teachers. Journal of Personnel Evaluation in Education, 8(3), 313-327.

Ruppert, Sandra S. (1994). Charting Higher Education Accountability: A Sourcebook on State-Level Performance Indicators. Educational Resources Information Center. ED375789.

Stiggins, R. J. (1994). Student-centered classroom assessment. New York: Merrill.

Weigle, S. C. (2002). Assessing writing. Cambridge, UK: Cambridge University Press.

Wiggins, Grant. "Creating tests worth taking." Educational leadership 49(8).

Wilmore, Elaine L. (2002). Principal Leadership: Applying the New Educational Leadership Constituent Council (ELCC) Standards. Thousand Oaks: Corwin Press.
