The empirical evaluation of software tools and processes is a relatively new topic in Software Engineering education. It is also quite broad, so we only touch upon it at a high level.
I have an exam shortly on the topic so I’ll be going over some sample questions in preparation.
The first questions are of an introductory level.
Question 1.1
You are working as a software developer in a small organisation. Your manager is concerned that the software module you are working on may be more complex than the one that it is to replace in one of the organisation’s products. He asks you to measure the complexity of the two modules (the existing one and the one you are developing) and to give him a ratio of complexity values.
Is this a meaningful question? If so, how would you generate the ratio he is asking for, otherwise, why is it not meaningful?
It is extremely difficult to measure the attributes of software, let alone one as ambiguous as complexity. Even more obvious measurements, such as lines of code, can be difficult to interpret with any level of precision: were comments counted, was whitespace counted? Further to this, software has both static and dynamic properties. For example, one can analyse the static source code, but there is also the dynamic run-time behaviour of the code, which may or may not involve different properties and attributes.
The ambiguous nature of the phrase “software complexity” is the problem with this question. To measure anything successfully, one needs to determine fitness for purpose, i.e. one has to identify the criteria that are appropriate to the evaluation of a particular system and then determine how best to assess these.
Is the manager suggesting that the new module is larger (in lines of code) than the old module? If this is what the manager believes indicates complexity, has he considered comments or whitespace? Perhaps the old module was a poorly commented mess, whereas the replacement module conforms to a strict commenting convention (perhaps using JavaDoc or something similar) that, while improving the readability of the module, increases the number of lines of code (if comments are being counted). This may be what is leading to the sense of added complexity.
Alternatively, is the manager referring to an aspect of complexity that is more concrete than simply lines of code? Perhaps he feels that the time and/or space complexity of the new module is greater than that of the old one. Time and space complexity describe how the (typically worst-case) running time and memory usage of the module grow with the size of its input. It seems unlikely that the manager is referring to this kind of software complexity, as it would have required some analysis of the code to come to any firm conclusions.
The question is not meaningful in its current state. It is open to wide misinterpretation because of its ambiguity and is unlikely to result in any satisfactory evaluation of the two modules. No fitness for purpose is indicated or suggested by the manager, which could result in the developer performing the evaluation subjectively to ensure that his own module is shown to be less complex.
Question 1.2
Three students, Kevin, Kylie and Keith, are working on their final year projects, all of which involve developing some form of software product using Java. When asked by their supervisors to say how large their programs currently are, they each add together the sizes of the files containing their code:
- Kevin counts all lines in each of the files, including empty lines and comments, using the Unix wc command.
- Kylie counts only those lines in each file that contain code and ignores any lines that are comments or white space.
- Keith counts the number of actual statements (declarations and executables) in each file.
Which of them is right? What arguments are there to support each of their decisions?
Consider the following two code fragments:
    ## print hello forever
    while (true)
    {
        print "hello\n";
    }

    while (true) {
        print "hello\n";
    }
Both fragments would execute in exactly the same way; however, the first fragment of code is 5 lines long (including its comment line) whereas the second is a mere 3 lines long. This highlights one of the problems with Lines of Code (LoC) as a measure of code. Even ignoring comments, personal coding style may affect the count, as will the language the code is implemented in. Java, for example, is a particularly verbose language compared to more succinct languages like Perl or Ruby.
None of the students is necessarily wrong or right; it is merely important to note that LoC is often a poor indication of software attributes. It cannot be used to compare code implemented in different languages, and it cannot even necessarily be used to compare code written in the same language by different people, due to their individual styles.
Also, if one did wish to use LoC as an evaluative measure of code, it should be specified whether to include whitespace and comments or not (or a mixture) so one could more fairly evaluate the code.
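As a rough illustration of how much the counting convention matters, here is a minimal Python sketch that applies the three students' conventions to the two fragments above. The function names are made up for this example, lines beginning with # are assumed to be comments, and counting semicolons is only a crude stand-in for Keith's statement count (it ignores the while statement itself, for instance).

```python
# The two fragments from the text, kept as raw strings so that the
# "\n" inside the print statement is not turned into a real newline.
FRAGMENT_A = r'''## print hello forever
while (true)
{
print "hello\n";
}'''

FRAGMENT_B = r'''while (true) {
print "hello\n";
}'''


def is_comment(line: str) -> bool:
    """Assumption: a comment is any line whose first non-space character is '#'."""
    return line.strip().startswith("#")


def count_all_lines(source: str) -> int:
    """Kevin's convention: every line counts, as with `wc -l`."""
    return len(source.splitlines())


def count_code_lines(source: str) -> int:
    """Kylie's convention: skip blank lines and comment-only lines."""
    return sum(
        1 for line in source.splitlines()
        if line.strip() and not is_comment(line)
    )


def count_statements(source: str) -> int:
    """Keith's convention, approximated by counting ';' on non-comment lines."""
    return sum(
        line.count(";") for line in source.splitlines()
        if not is_comment(line)
    )


for name, fragment in (("first", FRAGMENT_A), ("second", FRAGMENT_B)):
    print(
        name,
        count_all_lines(fragment),
        count_code_lines(fragment),
        count_statements(fragment),
    )
```

Under these assumptions the first fragment comes out at 5, 4 and 1 and the second at 3, 3 and 1. Only the statement-style count is unaffected by brace placement and comments, which is one argument in favour of Keith's approach, although a real statement counter would need a proper parser for the language in question.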
Question 1.3
The Better Browser Builder (BBB) company is developing a new web server that they have named Tecumseh (a famous non-Apache Indian Chief). When running on a MS Windows platform, the beta-version of their server manages an average time of 40 minutes between failures, while when running on a Linux platform, it runs for up to six hours between failures.
Is it meaningful to speak of a value for mean time between failure for the server itself?
The mean time between failure is an attempt to measure an external attribute of the server: its reliability. Unfortunately there are two problems with the statement as presented.
Firstly, context. MS Windows and Linux are two completely different operating systems, so the code base would have to be altered somewhat to port it between them. It is therefore unfair to compare the server software running on two different underlying operating systems. It would appear that the software is better on Linux than on Windows, but we have no idea how well other software compares on either platform. Perhaps their main competitor on Windows can only manage 20 minutes between failures, whereas their Linux rival can manage 60 days. In that case we could say that the Windows version was in fact the more reliable relative to its competition, even though at first glance it appears less reliable than the Linux version. It may also be that Linux itself is a more reliable operating system, and that this alone is why the server appears more reliable when running on Linux.
Secondly, what does mean time between failure actually tell us about the reliability of the server? The reliability of a web server is likely to depend on a number of external factors, such as hardware and website usage. We have no way of knowing that the software was run on identical machines with the same power. It could be that the Windows version was run on an old machine with 256MB of RAM compared to 4GB of RAM for the Linux system, which would likely affect the reliability of the server. Also, we have no idea of the load under which the two versions of the software were tested. It could be that when the software was running on Linux, there were few or no visitors to the website for a long period of time until suddenly there was a huge surge that caused the server to crash. That wouldn’t show that the server was more reliable; it would simply show that the server was not tested at full load.
Again there is a lack of fitness for purpose: the mean time between failure is simply too ambiguous and is open to many interpretations.
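To make the ambiguity concrete, here is a small Python sketch using invented uptime figures. The numbers are purely hypothetical and were picked only so that the Windows mean is 40 minutes and the longest Linux run is six hours, matching the wording of the question, which quotes an average for one platform but an "up to" figure for the other. It computes the mean per platform, the longest run per platform, and a single pooled mean for "the server itself".

```python
from statistics import mean

# Hypothetical minutes of uptime between consecutive failures.
# These values are made up for illustration; they are not measurements.
uptimes_minutes = {
    "windows": [35, 45, 40],
    "linux": [60, 360, 120],
}

# Mean time between failure, computed separately for each platform.
mean_per_platform = {
    platform: mean(runs) for platform, runs in uptimes_minutes.items()
}

# The longest single run: this is what an "up to" figure reports.
longest_per_platform = {
    platform: max(runs) for platform, runs in uptimes_minutes.items()
}

# One pooled figure for "the server itself", ignoring the platform.
all_runs = [run for runs in uptimes_minutes.values() for run in runs]
pooled_mean = mean(all_runs)

for platform in uptimes_minutes:
    print(
        platform,
        "mean:", mean_per_platform[platform],
        "longest:", longest_per_platform[platform],
    )
print("pooled mean:", pooled_mean)
```

With these invented numbers the Windows mean is 40 minutes, the Linux mean is 180 minutes (with a longest run of 360), and the pooled mean is 110 minutes. The pooled figure describes neither platform and would shift if the mix of Windows and Linux runs changed, which is why a single value for the server itself says very little.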
A more meaningful measure of the server's reliability and performance would come from thorough, objective stress testing of the software on both platforms, presenting figures such as: with 1000 concurrent requests, the server was able to handle 150 requests/second. This would be more meaningful and easier to compare with other server software (though it would still depend on the complexity of the pages being served).
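A figure like requests/second could be gathered with something along the lines of the Python sketch below. It is only an outline under stated assumptions: measure_throughput and fake_request are hypothetical names, and fake_request merely sleeps for a millisecond in place of a real HTTP call so that the example runs without a server. In a real test it would be replaced by a function that sends one request to the server under test and raises an exception on failure.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(send_request, total_requests, concurrency):
    """Call send_request total_requests times using `concurrency` threads.

    Returns (successes, failures, requests_per_second), where the rate is
    the total number of attempts divided by the wall-clock time taken.
    """
    def attempt(_):
        # A request counts as failed if the callable raises an exception.
        try:
            send_request()
            return True
        except Exception:
            return False

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = list(pool.map(attempt, range(total_requests)))
    elapsed = time.perf_counter() - start

    successes = sum(outcomes)
    failures = total_requests - successes
    return successes, failures, total_requests / elapsed


def fake_request():
    """Stand-in for a real HTTP request (assumption: about 1 ms of work)."""
    time.sleep(0.001)


successes, failures, rate = measure_throughput(
    fake_request, total_requests=200, concurrency=20
)
print(f"{successes} ok, {failures} failed, {rate:.0f} requests/second")
```

Running the same measurement with the same request mix, hardware and concurrency on both platforms would give figures that can be compared directly, and reporting the failure count alongside the rate keeps reliability and performance visible together.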

