MSc-IT Study Material
June 2010 Edition

Computer Science Department, University of Cape Town

Summary of Evaluation

In this unit we have looked at several different evaluation techniques. These were classified according to who the participants in the evaluation were – potential users, trained evaluators, or models of cognition. Many more differences between the evaluation techniques have been alluded to throughout this unit. This section draws out these differences so that you get a better idea of how to use the different techniques and when each is appropriate. Preece (1995) identified several differences between evaluation techniques, which are discussed in the remainder of this section.

Purpose of the Evaluation

As mentioned in the introduction, Preece identified four main purposes for doing evaluation. Clearly the purpose of the evaluation dictates the kind of evaluation we should perform and the data we would want to collect. Both issues are outlined below:

  • Engineering towards a target – asking whether the system we have designed is good enough yet. The question we have to ask ourselves here is what target we are engineering towards. If it is expressed in terms of user satisfaction, for example, then we might employ a technique such as user studies, where we can get qualitative information about whether the user is satisfied with the system.

  • Comparing designs – to identify which designs are best for a given set of tasks. In general, most evaluation approaches can give us some comparative evaluation, but quantitative results are usually the easiest to use for comparisons, e.g. the results produced by user experimentation, some user studies, and GOMS.

  • Understanding the real world – working out how well the design would work in the real world such as an office or other workplace. Approaches such as GOMS and user experimentation do not give us information about how the system will fit into the real world. Other approaches such as user studies (where the system could actually be evaluated in the real world site) would be more appropriate for such questions.

  • Checking conformance to a standard – whether the system meets standards that have been set. Again we need to consider what the standards are that we are trying to conform to. If the standard requires, for instance, that a task be performed within a specific time limit, then we could use an approach such as GOMS to predict the task completion time (a small worked sketch follows this list).
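
To make the last point more concrete, the sketch below shows how a Keystroke-Level Model (KLM) prediction, the simplest member of the GOMS family, might be computed for a short task. The operator times used are the commonly cited average values; the task breakdown itself is invented purely for illustration and is not taken from this unit.

    # Illustrative Keystroke-Level Model (KLM) prediction of expert task time.
    # Operator times are commonly cited averages in seconds; the task breakdown
    # below is a made-up example, not a prediction from this unit.
    OPERATOR_TIMES = {
        "K": 0.28,  # press a key or button (average skilled typist)
        "P": 1.10,  # point at a target with the mouse
        "H": 0.40,  # move the hand between keyboard and mouse
        "M": 1.35,  # mentally prepare for the next step
    }

    def predict_task_time(operators):
        """Sum the operator times for a sequence of KLM operators."""
        return sum(OPERATOR_TIMES[op] for op in operators)

    # Hypothetical task: save a document via the File menu.
    # M (decide), H (reach for mouse), P K (open the menu), P K (click Save).
    task = ["M", "H", "P", "K", "P", "K"]
    print(f"Predicted expert completion time: {predict_task_time(task):.2f} s")

If a standard required this task to be completed within, say, five seconds, the prediction of roughly 4.5 seconds would suggest the design conforms – bearing in mind that GOMS style predictions apply only to error-free expert performance.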

Stage of System Development

As we discussed at the start of this unit, performing evaluation throughout a system's development provides us with a much more flexible process in which potentially time-consuming problems are identified throughout development. The stage at which evaluation is performed will determine which kinds of evaluation techniques are appropriate. At the very beginning of the process, interviews and questionnaires are probably the most appropriate, as they can give us plenty of information to inform the design. Once some more rigorous designs have been developed we might try using GOMS or ICS to compare designs on specific issues such as task completion time or difficulty to learn. If we have more time and effort available, more grounded results (i.e. not just predictions) could be obtained from user studies, heuristic evaluation, cognitive walkthrough, and user observation. In some cases, involving users very early in the design process may be problematic, as they may be put off by the incomplete nature of designs at that stage.

Type of Data

As discussed throughout this unit there are two main kinds of data which can be collected from evaluations: quantitative (numerical values such as times or attitude ratings) and qualitative (such as opinions). These two types of data provide answers to different kinds of questions and, moreover, are generated by different kinds of evaluation. The key is to select the evaluation technique which produces the kind of data necessary to answer the questions being asked. The table below lists the kinds of data we might expect to be able to get from the different kinds of evaluation we have discussed in this unit; a small sketch after the table illustrates the distinction between the two types of data.

Technique                    | Quantitative data | Qualitative data
User observation             | X                 | X
Interviews                   |                   | X
Questionnaires               | X                 |
User experimentation         | X                 |
Cognitive walkthrough        |                   | X
Heuristic evaluation         |                   | X
Perception based evaluation  | X                 |
GOMS                         | X                 |
ICS                          | X                 |
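
As a rough illustration of this distinction (not part of the original unit, and using invented values), the sketch below shows the two kinds of record an evaluation session might produce: numerical task times that can be averaged and compared directly, and free-text comments that must be grouped into themes by the analyst before any conclusions can be drawn.

    from statistics import mean, stdev

    # Quantitative data: task completion times in seconds from a user experiment.
    # The values are invented for illustration.
    completion_times = [42.1, 38.5, 51.0, 44.7, 39.9]
    print(f"Mean time: {mean(completion_times):.1f} s "
          f"(sd {stdev(completion_times):.1f} s)")

    # Qualitative data: verbatim opinions from interviews or user observation.
    # These cannot be averaged; the analyst groups them into themes instead.
    comments = [
        "I couldn't find the save option at first.",
        "The menus felt cluttered.",
        "Saving was not where I expected it to be.",
    ]
    themes = {
        "findability of saving": [comments[0], comments[2]],
        "visual clutter": [comments[1]],
    }
    for theme, quotes in themes.items():
        print(f"{theme}: {len(quotes)} supporting comment(s)")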

Review Question 6

In the summary of this unit we discussed four different purposes of evaluation. Complete the table below to indicate which techniques are appropriate or not for the purposes given (mark a tick or cross).

Technique                    | Engineering towards a target | Comparing designs | Understanding the real world | Checking conformance to standard
User observation             |                              |                   |                              |
User feedback                |                              |                   |                              |
Cognitive walkthrough        |                              |                   |                              |
Heuristic evaluation         |                              |                   |                              |
Perception based evaluation  |                              |                   |                              |
GOMS                         |                              |                   |                              |
ICS                          |                              |                   |                              |

Answer at the end of the chapter.

Considerations

Different kinds of evaluation require different amounts of time and effort, different numbers of people, and different equipment. It is important to consider whether a particular technique is appropriate for the resources available and the stage of development.

We then need to consider how valid the data we have collected is. This refers to whether the data collected is suitable for the purpose of the experiment. If we were to try to use GOMS to predict the time it would take a novice user to complete tasks using a system we would get invalid results as GOMS is designed to predict task completion times for expert users, not novices.

Our data might be valid, but then we have to ask ourselves whether it is reliable. That is, whether we can expect the same results consistently. Clearly, for model based evaluation this depends on the way in which we use the models. Similarly, when performing evaluator based evaluation such as heuristic evaluation, it depends on how well the evaluators are trained and whether we have enough of them to produce reliable results. When users are involved, well designed experiments should give reliable results, whereas user observation tends to give quite unreliable results. We might be particularly concerned with reliability when considering whether the system meets certain standards – meeting standards needs to be shown by reliable evaluations.
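
One common way to check whether evaluator based results are reliable is to measure how far independent evaluators agree with one another. The sketch below computes simple percentage agreement and Cohen's kappa for two hypothetical heuristic evaluators judging whether each of ten screens violates a heuristic; the judgements are invented for illustration.

    # Illustrative reliability check: agreement between two heuristic evaluators.
    # Each list records whether the evaluator judged a screen to violate a
    # heuristic (1) or not (0). The judgements are invented.
    evaluator_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
    evaluator_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

    n = len(evaluator_a)
    observed = sum(a == b for a, b in zip(evaluator_a, evaluator_b)) / n

    # Chance agreement for Cohen's kappa, from each evaluator's marginal rates.
    p_a = sum(evaluator_a) / n
    p_b = sum(evaluator_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)

    kappa = (observed - expected) / (1 - expected)
    print(f"Observed agreement: {observed:.0%}, Cohen's kappa: {kappa:.2f}")

A low kappa would suggest that more evaluators, or better training, is needed before the findings are treated as reliable.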

Finally, we need to be aware of the biases that we may be introducing into the results of our evaluations. Preece identified two main sources of bias:

  • Selective data gathering – concentrating on certain aspects of the situation and not taking into consideration others which may also be important.

  • Manipulation of the evaluation situation – for instance, asking leading questions in interviews rather than letting interviewees formulate their own answers.