Monday, May 18, 2009

Testing Heuristics and Static Analysis at ICSE 09

I had a very critical talk yesterday with two grad students from Queen's, whom I unfortunately can only identify as Jackie and Brahm. They had a difficult time accepting the immediate usefulness/significance of the study I was proposing, and that's fine: they came up with a few useful suggestions that I can leverage, primarily to keep a narrow focus and to always be aware of the result I'm trying to generate. Two interesting opinions they shared with me are below:

Are the good testers simply those who are most familiar with the SUT (software under test)? This seems entirely plausible (in my own work experience, I was a vastly better tester of software that I had written, or at least software written by someone on my team). If so, then the correlation between programming experience and software testing ability would be much weaker if we invent the SUT for our subjects to test. However, I believe I have some empirical evidence to the contrary, in the form of a man named Scott Tindal, one of the most skilled QA engineers I've ever had the pleasure of working with. Scott was of such caliber that he could almost single-handedly test every product in the OpenText/Hummingbird line. An interesting configuration of our study may be to compare developers testing their own code against developers testing code they've never seen before.

Will the heuristics extracted with this study be the same as those already in use by static analysis tools like FindBugs?
It appears that determining whether there is a correlation between the techniques that experienced testers use and the heuristics employed by static analysis tools such as FindBugs could be an interesting topic. If there is no correlation between the two, might it be possible to integrate the testers' heuristics into SA suites for a significant improvement? If there is a correlation, why haven't more testers adopted static analysis to aid in their efforts? Is it a matter of the usability of SA tools? Is it that the formalism used to express the SA heuristics isn't understood by everyday testers, so they aren't aware of what SA has to offer? I plan to talk further about this with Nathaniel Ayewah of the University of Maryland, who works on the FindBugs team.

ICSE 09 - Day Three

Andreas Zeller, professor at Saarland University in Germany, apart from being one of the nicest people I've met so far this week, also had some very positive feedback on my research idea. He agreed with its premise, and that research in this area would be valid. One of his first questions after I finished presenting my idea was about incentive for participation. I thought he was asking about how we would convince students and pros to give up their time to participate in our study, but this wasn't the case. He recounted results from Saarland when teaching students to test their code. Traditional methods obviously didn't work effectively, so they switched tracks. Assignments were handed out with a spec and a suite of unit tests that the students could use to validate their work as they progressed. Once nightly, a second set of tests was run by the instructor, the source of which was kept secret, and the students were informed in the morning of how many tests they passed and failed. To grade the assignment, a third set of (secret) tests was run. Students begin with a mark of 100%. The first failing test reduces their mark to 80%, the second to 70%, the third to 60%, and the grading continues in a similarly harsh fashion. The first assignment in the course is almost universally failed, as students drastically underestimate how thorough they need to be with their tests. Subsequent assignments are much better: the students focus strongly on testing, and indeed collaborate by sharing test cases and testing each other's assignments to eliminate as many errors as possible before the submission date. Once the students have the proper motivation to test, they eagerly consume any instruction in effective testing techniques. Andreas found that simple instruction in JUnit and maybe TDD was all that was required; the rest the students figured out for themselves. This kind of self-directed learning is encouraging, but the whole situation makes me think that these students may be working harder, not smarter, to test their software. It may be possible that with instruction not only in the basic operation of an xUnit framework but also in things like effective test case selection, they could reach similar test coverage or defect discovery rates while expending less frantic effort.
Thought: drop the first assignment from the course's final mark, as it is not used for evaluation as much as it is a learning experience (of course, don't tell the students this, otherwise the important motivational lesson will be lost).
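To fix the scheme in my memory, here's a minimal sketch of the grading as I understood it. Andreas only described the first three deductions, so the continued 10-points-per-failure penalty (and the floor at zero) is my own assumption.

```python
# A minimal sketch of the grading scheme as described above: full marks with
# no failing (secret) tests, 80% after the first failure, and, by my own
# assumption about "similarly harsh", a further 10-point deduction for each
# additional failing test, floored at zero.
def assignment_grade(failing_tests: int) -> int:
    if failing_tests == 0:
        return 100
    return max(80 - (failing_tests - 1) * 10, 0)


if __name__ == "__main__":
    for failures in range(6):
        print(f"{failures} failing tests -> {assignment_grade(failures)}%")
```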
In addition to this, Andreas, like the other folks I've spoken to so far, emphasized the need to keep my study narrowly focused on a single issue, and to control the situation in a way that enables precise measurements to be made both before and after implementing the curriculum changes we hope to elicit from the study, to accurately determine the improvements (if any exist). I am beginning to see a loose correlation between a researcher's viewpoint regarding empiricism in SE research (positivist on the right, constructivist on the left) and the amount of emphasis they place on narrowness of focus. Oftentimes, the people who advocate exploratory qualitative studies also recommend wider bands of observation while conducting those studies. This is likely an overgeneralization, and I apologize in advance to anyone who may disagree with it.

ICSE 09 - Day Two

Day Two of ICSE 09 was another great success! It began with a rousing keynote by Tom Ball of Microsoft Research fame. He described the early beginnings of version control and the emergence of mining techniques for these version systems, and continued the narrative through to the latest version of Visual Studio and the tool Microsoft calls CRANE, which suggests to developers areas of the code that should be examined given that they are changing some other area. Pretty cool. I got to talk to him one-on-one about it over lunch.

On the previously blogged march through Stanley Park, I had a talk with Tom Ostrand from AT&T Labs in New Jersey. One thing I wanted to talk about (besides pitching my research idea) was leveraging the version history and the reports generated from a test suite to predict bugs in target code, using MSR techniques to augment the existing tools, such as mining deltas and bug trackers. This topic came up in the morning's MSR session, but apparently Tom missed it (making me look much smarter than can be measured in reality). He seemed interested in the possibility, but of course there are barriers to getting it going. Primarily, even assuming that the dev team for the software we're analyzing is writing tests and putting them into the VCS, the error and coverage reports almost certainly aren't in there, making it difficult to do versioned-history analysis over them. I thought that maybe, given the source code of the target software and the test code (both at some version), the MSR tool could build the code and execute the tests to create the needed reports. This would likely be extremely difficult to get going in the field, however, and it would make MSR mining, which is already an extremely expensive, long-running process, even slower.
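To make the idea a little more concrete for myself, here's a rough sketch of what regenerating those reports over the history might look like. The repository path and the `make test` command are placeholders, and a real tool would have to cope with broken builds, flaky tests, and the enormous running time.

```python
# A rough sketch of the idea, not a real MSR tool: walk a project's revision
# history, rebuild and run its test suite at each revision, and log the
# results so they can later be mined alongside the usual VCS data.
import csv
import subprocess

REPO = "/path/to/target/repo"  # hypothetical local clone of the target project


def git(*args: str) -> str:
    return subprocess.run(
        ["git", "-C", REPO, *args],
        capture_output=True, text=True, check=True,
    ).stdout


def tests_pass() -> bool:
    # Placeholder: assumes the project exposes a `make test` target.
    return subprocess.run(["make", "-C", REPO, "test"],
                          capture_output=True).returncode == 0


def main() -> None:
    original = git("rev-parse", "--abbrev-ref", "HEAD").strip()
    revisions = git("rev-list", "--reverse", "HEAD").split()
    with open("test_history.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["revision", "tests_pass"])
        for rev in revisions:
            git("checkout", "--quiet", rev)
            writer.writerow([rev, tests_pass()])
    git("checkout", "--quiet", original)  # restore the original branch


if __name__ == "__main__":
    main()
```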

Sunday, May 17, 2009

Farewell to the Borkenstocks

The time has finally come to retire the Canadian Tire brand cork-bottomed sandals, so lovingly referred to as the 'Borkenstocks'. After a season in the closet and a plane ride to Vancouver, the sandals got their first taste of summer today, as they accompanied me on a walk through Stanley Park. The results were as follows: blisters and discomfort.

Feedback on my Research Proposal

Over the last 24 hours I've had the chance to talk to a few individuals about my research idea. It usually goes as follows: I give my 30-second pitch, and the mark stares at me blankly. Then I elaborate, and the mark gets it. They tell me first a problem they see, and then what they'd like to see come out of the results. In a couple of cases, the mark got particularly excited about the outcome. I've summarized below.

Chris Bird
Chris has a strong background in empirical software engineering, with particular emphasis on qualitative exploratory studies. As such, the quantitative analysis aspects of my pitch fell on deaf ears, but he was extremely interested in the in-lab observation sessions I was proposing. He felt that 5 or 6 actionable recommendations that might come out of observing professionals would be invaluable. He suggested that I find an older Microsoft Research study in which they trained new hires by having them monitor a screen shared by a senior developer for some amount of time (probably a few days), and then had the new hire ask the senior questions at a later date while reviewing the video logs. I like this idea because it allows us to elicit the information from the developer directly, instead of trying to infer it ourselves, and since we're not bothering them during the initial testing session, we wouldn't be affecting their performance. This isn't without problems, though. Primarily, there is the risk that the subjects aren't always sure why they do the things they do, and so are likely to invent reasons, or invent foresight where none necessarily exists. Interesting idea, though. Also, should we interview students in the same way? On one hand, the students likely don't have any special insights that we can leverage (assuming they are less effective at testing than pros). On the other hand, it may illuminate areas of misconception or misunderstanding that we could address in future curriculum changes.

Jim Cordy
Jim is a professor at Queen's, and was my instructor in my 4th-year compilers course. After he heard my pitch, he had a warning about an effect he had seen in his industrial work, one that comes from a generational difference in the training of developers. Developers who were trained more than 15 or 20 years ago had a delay between changing the source code and seeing the results of program execution on the order of hours; new developers are used to delays on the order of minutes. Also, the current state of the art in debugging relies on interactive debuggers, which were either unavailable or unreliable in earlier days. This has led 'old-school' programmers to a) rely heavily on source code inspection and b) insert enormous amounts of instrumentation (debugging statements) when running the program becomes required. In comparison, new-generation developers often use smaller amounts of instrumentation, relying on quick turnaround times to find the cause of errors. In Jim's experience, the old-school programmers were orders of magnitude more effective (in terms of bugs found or solved per hour) than younger programmers. If this is in fact the case, it should be an effect we can see if we recruit subjects trained during that earlier era.

Saturday, May 16, 2009

ICSE 09 - Day One

We're wrapping up the first day of talks here at ICSE 2009. I've talked my way into the Mining Software Repositories (MSR) workshop. Here's a quick breakdown of some noteworthy points:

Keynote: Dr. Michael McAllister, Director of Academic Research Centers for SAP Business Objects
An hour-and-a-half-long talk in which he sold BI to the masses. He spoke a lot about integrating data silos and providing an integrated, unified view of the data to business-level decision makers. Also some interesting anecdotes on how BI helped fight SARS. He forgot what OLAP stands for (Online Analytical Processing). Kind of concerning. This talk made me (after spending 2+ years working for a BI company) want a running example of what BI is and how it is used in the context of an expanding organization, from the point before any computational logistics assistance is required through to a Walmart-sized operation. Most examples I've seen start with a huge, complex organization, complete with established silos, and then install things like supply chain management, repository abstraction, customer relations management, document management, etc.


Mining Git Repositories - presented the difficulties in mining data out of Git (or DVCSs in general) as opposed to traditional centralized systems like SVN. Noteworthy items include a high degree of branching and the lack of a 'mainline' of development.

Universal VCS - by looking for identical files in the repos of different projects, a single unified version control view is established for nearly all available software. Developed by creating a spider program that crawled the repos of numerous projects, downloading metadata and inferring links where appropriate.
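Roughly, the core trick (as I understood it) could be sketched like this; the mirrored repository paths are made up, and the real system did far more with metadata than just hashing files.

```python
# A toy version of the linking idea: hash every file in a set of locally
# mirrored repositories and treat identical digests as links between
# projects. The mirror paths below are hypothetical.
import hashlib
from collections import defaultdict
from pathlib import Path

REPOS = [Path("/mirrors/project-a"), Path("/mirrors/project-b")]  # hypothetical


def digest(path: Path) -> str:
    return hashlib.sha1(path.read_bytes()).hexdigest()


def shared_files(repos):
    owners = defaultdict(set)  # digest -> names of repos containing that file
    for repo in repos:
        for path in repo.rglob("*"):
            if path.is_file():
                owners[digest(path)].add(repo.name)
    # Keep only content that appears in more than one repository.
    return {d: names for d, names in owners.items() if len(names) > 1}


if __name__ == "__main__":
    for d, names in shared_files(REPOS).items():
        print(d[:12], "shared by", ", ".join(sorted(names)))
```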

MapReduce - use idle PCs for quick, pluggable clustering to chug away at MapReduce problems. Look @ Google MapReduce and Hadoop.
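As a reminder to myself of the programming model, here's the canonical word-count example collapsed into plain Python; a real job would of course distribute the map and reduce phases across many machines (e.g. with Hadoop).

```python
# The canonical MapReduce example (word count), run in-process purely to
# illustrate the map -> shuffle -> reduce structure.
from collections import defaultdict


def map_phase(document):
    for word in document.split():
        yield word.lower(), 1  # emit (key, value) pairs


def shuffle(pairs):
    grouped = defaultdict(list)  # group all values by key
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    return key, sum(values)  # combine the counts for a single word


if __name__ == "__main__":
    docs = ["mining software repositories", "mining map reduce problems"]
    pairs = (pair for doc in docs for pair in map_phase(doc))
    for word, count in (reduce_phase(k, v) for k, v in shuffle(pairs).items()):
        print(word, count)
```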

Alitheia Core - a software engineering research platform: a plugin framework for performing operations on heterogeneous repositories. You can define a new metric by implementing an interface, and then evaluate the metric against all repositories in the framework. Look @ SQO-OSS.
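I don't know the actual interface, but the plugin idea presumably looks something like the sketch below; every name in it is invented for illustration, and the real framework is written in Java rather than Python.

```python
# Not the actual Alitheia Core API; just a sketch of the plugin idea as I
# understood it: a metric is a small class implementing a common interface,
# and the framework evaluates it over every repository it knows about.
from abc import ABC, abstractmethod


class Repository:
    def __init__(self, name, files):
        self.name, self.files = name, files


class Metric(ABC):
    @abstractmethod
    def measure(self, repository: Repository) -> float:
        """Compute this metric for a single repository."""


class FileCountMetric(Metric):
    def measure(self, repository: Repository) -> float:
        return float(len(repository.files))


def evaluate(metric: Metric, repositories) -> dict:
    # The framework's job: run one metric across every registered repository.
    return {repo.name: metric.measure(repo) for repo in repositories}


if __name__ == "__main__":
    repos = [Repository("a", ["x.c", "y.c"]), Repository("b", ["main.py"])]
    print(evaluate(FileCountMetric(), repos))
```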

Many of these talks produced tools for mining repositories, but with no greater purpose than that. When asked about this ('mining for the sake of mining'), none of the authors seemed to see it as a problem. The conclusion of the discussion, however, was that this lack of purpose is a problem, and that the professional community should be surveyed to find out what needs they have for mining repos.

Research extensions/ideas:
The third MSR session today focused heavily on defect prediction. After 3 or 4 methods of mining VCS systems to predict buggy code were shown off, each improving prediction accuracy by 4% or 5%, the discussion boiled down to this: "What do managers/developers want in these reports to help them do their jobs?" Obviously, the room full of academics didn't have a definitive answer. One gentleman asked the question I had written down, which was: has anyone used the history and coverage of a software's test suite in combination with data from the VCS as a defect predictor (in theory, heavily tested areas are less likely to contain bugs)? I found this particularly interesting. I also began to wonder: if we had one of these defect prediction reports, would it improve a developer's ability to find bugs, and if so, to what extent? Would it be measurable in the same way as I intend to measure testing ability with students and professionals?
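To pin down what I mean by combining test-suite data with VCS data, here's a toy risk score; the formula and the numbers are entirely made up, just something to think with.

```python
# A toy illustration of the question I asked: combine how often a file
# changes (from the VCS) with how well the test suite covers it, on the
# assumption that heavily tested, rarely changed code is less likely to be
# buggy. The scoring formula and the input data are invented for the example.
def risk_score(commit_count: int, coverage: float) -> float:
    # More churn raises risk; more coverage (0.0 to 1.0) lowers it.
    return commit_count * (1.0 - coverage)


if __name__ == "__main__":
    files = {
        "parser.c": {"commit_count": 40, "coverage": 0.30},
        "scheduler.c": {"commit_count": 12, "coverage": 0.85},
        "util.c": {"commit_count": 5, "coverage": 0.10},
    }
    for name, stats in sorted(files.items(),
                              key=lambda kv: risk_score(**kv[1]),
                              reverse=True):
        print(f"{name:12s} risk = {risk_score(**stats):.1f}")
```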

Stay tuned for more info (and pictures of beautiful Vancouver)!