Cuyamaca College Library  


Interrater Reliability

Whenever you use humans as a part of your measurement procedure, you have to worry about whether the results you get are reliable or consistent. People are notorious for their inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We daydream. We misinterpret.

Interrater Reliability definition

So how do we determine whether two observers are being consistent in their observations? We have chosen to implement a test for inter-rater reliability. Inter-rater reliability is the extent to which two or more individuals (raters) agree when they rate the same event.

Data Collection procedures

The data will be used to evaluate the quality of library reference services. The Reference Survey Card consists of two forms: Form A, which focuses on the student's experience during a reference desk interview, and Form B, which focuses on the librarian's experience during the same interview. The student completes Form A and the librarian completes Form B immediately after the reference desk interview.

For the ratings of the students and the librarians to be considered “credible” and thus usable to draw conclusions, we must first test the degree to which the students and librarians agree about the student’s experience with the reference librarian. To test the reliability of ratings, we compute an inter-rater reliability coefficient. It is generally accepted that an inter-rater reliability coefficient of .75 or higher suggests that the ratings are reliable. The closer the coefficient is to 1.0 (a perfect match), the more reliable the ratings are. For example, consider the following data:

 

Question: I am now better able to construct a successful search statement in order to find information.

Scale: 5 = Strongly Agree, 1 = Strongly Disagree

| Interaction Event | Student Rating | Librarian Rating |
|---|---|---|
| 1 | 5 | 5 |
| 2 | 4 | 4 |
| 3 | 4 | 4 |
| 4 | 3 | 5 |
| 5 | 5 | 3 |
| 6 | 3 | 4 |
| 7 | 5 | 5 |
| 8 | 5 | 5 |
| 9 | 3 | 2 |
| 10 | 2 | 2 |
| Total | 39 | 39 |
| Mean | 3.9 | 3.9 |

| Inter-Rater Reliability | Significance Level | Interpretation |
|---|---|---|
| .61 | .031 | Marginal agreement about statement (significant result) |

 

In this first case, the reliability of the ratings is considered low (.61) even though the student and librarian mean ratings are the same (3.9 on a 5-point scale).
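To see how identical means can hide disagreement, the first data set can be compared row by row. The Python sketch below is only an illustration: the variable names are made up for this example, and the ratings are copied from the first example table.

```python
# Ratings copied from the first example table (Interaction Events 1-10).
student = [5, 4, 4, 3, 5, 3, 5, 5, 3, 2]
librarian = [5, 4, 4, 5, 3, 4, 5, 5, 2, 2]

# Both raters end up with the same mean rating ...
student_mean = sum(student) / len(student)
librarian_mean = sum(librarian) / len(librarian)

# ... but a row-by-row comparison shows where the two ratings differ.
pairs = list(zip(student, librarian))
exact_matches = sum(1 for s, l in pairs if s == l)
differing_events = [i + 1 for i, (s, l) in enumerate(pairs) if s != l]

print(f"Student mean: {student_mean:.1f}, librarian mean: {librarian_mean:.1f}")
print(f"Exact matches: {exact_matches} of {len(pairs)}")
print(f"Events with different ratings: {differing_events}")
```

The means are equal, yet the two raters give different ratings in four of the ten interaction events, which is why the mean alone cannot be used to judge agreement.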

 

 

Question: I am now better able to construct a successful search statement in order to find information.

Scale: 5 = Strongly Agree, 1 = Strongly Disagree

| Interaction Event | Student Rating | Librarian Rating |
|---|---|---|
| 1 | 5 | 5 |
| 2 | 4 | 4 |
| 3 | 4 | 4 |
| 4 | 5 | 5 |
| 5 | 3 | 3 |
| 6 | 3 | 4 |
| 7 | 5 | 5 |
| 8 | 5 | 5 |
| 9 | 3 | 2 |
| 10 | 2 | 2 |
| Total | 39 | 39 |
| Mean | 3.9 | 3.9 |

| Inter-Rater Reliability | Significance Level | Interpretation |
|---|---|---|
| .92 | Less than .001 | Excellent agreement about statement (highly significant result) |

In this second case, the reliability of the ratings is considered high (.92) even though the student and librarian mean ratings (3.9) are the same as in the first case. The difference is in the level of agreement. You can see that in the second set of data the student and librarian ratings are the same in almost every row (8 of 10), and that is what we are looking for in the data. It doesn’t matter whether the student “Strongly Agrees” or “Strongly Disagrees,” only that the librarian and the student agree on the result of the interaction and thus provide similar ratings.
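The page does not say which reliability coefficient was used. As one possible illustration, the sketch below assumes a Pearson correlation between the student and librarian ratings; the function name `pearson_r` and the variable names are hypothetical. Under that assumption the second data set gives about .92, matching the reported value, while the first gives about .58, slightly below the reported .61, so the reported figures may come from a different statistic or a different calculation.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of ratings."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Ratings copied from the two example tables (Interaction Events 1-10).
case1_student = [5, 4, 4, 3, 5, 3, 5, 5, 3, 2]
case1_librarian = [5, 4, 4, 5, 3, 4, 5, 5, 2, 2]
case2_student = [5, 4, 4, 5, 3, 3, 5, 5, 3, 2]
case2_librarian = [5, 4, 4, 5, 3, 4, 5, 5, 2, 2]

r_case1 = pearson_r(case1_student, case1_librarian)
r_case2 = pearson_r(case2_student, case2_librarian)

print(f"Case 1: r = {r_case1:.2f}")
print(f"Case 2: r = {r_case2:.2f}")
```

Either way, the comparison with the .75 threshold is the same: the first data set falls below it and the second is well above it.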

Expected Survey outcomes
We plan to distribute 500 reference survey cards, which will track 500 reference desk interviews. The student is required to fill out the top half of the card and the librarian will fill out the bottom half. Our expected outcome is that in 350 of the 500 interviews the student and the librarian will check the same category. In this case, the percent agreement would be 70% (350 ÷ 500).
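Percent agreement is the number of interviews in which both raters check the same category, divided by the total number of interviews. A minimal sketch of that arithmetic follows; the helper names `percent_agreement` and `pairs_needed` are hypothetical and chosen only for this example.

```python
import math

def percent_agreement(matching_pairs, total_pairs):
    """Share of interviews (in percent) where both raters chose the same category."""
    return 100 * matching_pairs / total_pairs

def pairs_needed(total_pairs, target_percent):
    """Smallest number of matching pairs that reaches the target percent agreement."""
    return math.ceil(total_pairs * target_percent / 100)

expected = percent_agreement(350, 500)
needed_for_75 = pairs_needed(500, 75)

print(f"350 matching pairs out of 500: {expected:.0f}% agreement")
print(f"Matching pairs needed for 75% agreement: {needed_for_75}")
```

With 500 cards, 350 matching pairs correspond to 70% agreement, and 375 matching pairs would be needed to reach 75%.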