Monday, March 16, 2015

Artificial Intelligence in Software Engineering

It is widely agreed that software projects are growing in size and complexity. Handling this complexity is the core problem of creating better software, and it is no longer manageable with a text editor alone. A modern software engineer uses a range of assisting tools, usually bundled in an IDE, from code completion and compiler notifications to software metrics and issue management systems. However, the goal of creating good software within budget boundaries remains hard to achieve, hence there is a need for more “intelligent” tools, and that is where the broad field of artificial intelligence comes on stage.

In this article, I will give a short (and very incomplete) overview of the problems researchers try to tackle by utilising methods from the area of AI. Among the diverse methods used in the field of artificial intelligence, there are three core disciplines that might help in engineering better software:

Finding patterns in large amounts of data

A software project does not consist of source code artifacts alone, but also of huge requirement documents, architectural diagrams, tons of issues filed in some bug-tracking system, test-case specifications and corresponding test data, documentation and so forth. Hence, some software projects can be regarded as a big collection of data; in any case, a collection of projects, e.g. all accessible open source projects, can definitely be seen as big data in which one can search for interesting patterns. Machine learning and data mining are two subfields of AI that might help the software engineer by surfacing knowledge already contained in existing artifacts. Predicting occurrences of bugs based on bug-tracking data and software metrics is one example. Another is intelligent code recommendation based on the large amount of existing code in open source repositories.

Understanding natural language

One subfield of AI is natural language processing, whose goal is to make the machine understand the meaning of natural language. Especially the understanding of text-based information like requirement specifications or bug descriptions holds the potential of assisting the software engineer in applications such as linking requirements to source code or finding duplicated bug reports. Over the last decades, methods in this area have evolved from simply comparing documents by term frequencies to inferring which topics a text is about, making use of concepts from linear algebra and probability theory.

Biology-inspired search algorithms

Several problems in computer science are hard to solve, i.e. there is no exact solution that scales well when the input size increases. Often, however, an approximate solution is good enough to work with and much more efficient to compute. In the area of evolutionary computing, principles from biological evolution, like gene inheritance and mutation, are used to design algorithms that find good solutions by executing a large number of computationally simple steps, which in the end is faster than computing an exact solution, especially when the problem size increases. Evolutionary computing can be divided into two sub-areas, genetic algorithms and genetic programming. The goal of the former is to find a concrete solution for a problem (e.g. TSP), whereas the goal of the latter is to find a program, described by its syntax tree, that in turn solves a problem. Release planning and automated bug-fixing are only two areas of research where these kinds of algorithms are frequently used.

In the following I will present some active research topics which I find particularly interesting. The cited references can serve as starting points for getting more details. I sorted the topics by the typical software development stages:

 

Requirements / Planning

 

Release Planning

In release planning, the goal is to select a set of requirements to be implemented for the next release of the software such that customer satisfaction is maximized while staying within given budget/time boundaries. Furthermore, each customer has a different importance for the company, and requirements are usually not independent of each other. In large-scale requirement scenarios this problem is hard to solve, hence intelligent search methods are required to find good (approximate) solutions. Often, genetic algorithms are used for this purpose, as in [1] and [2].
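To make this concrete, here is a minimal fitness sketch for such a search, not taken from [1] or [2]: a candidate plan is encoded as a boolean vector over all requirements, all names and the encoding are illustrative, and requirement interdependencies are ignored for brevity.

// Hypothetical fitness of a candidate release plan: selected[i] == true means
// requirement i is scheduled for the next release.
double releasePlanFitness(boolean[] selected, double[] value, double[] cost, double budget) {
    double totalValue = 0;
    double totalCost = 0;
    for (int i = 0; i < selected.length; i++) {
        if (selected[i]) {
            totalValue += value[i];   // e.g. importance-weighted customer satisfaction
            totalCost += cost[i];     // estimated implementation effort
        }
    }
    // plans exceeding the budget are infeasible and get the worst possible fitness
    return totalCost <= budget ? totalValue : 0.0;
}

A genetic algorithm would then evolve such boolean vectors, with crossover mixing two plans and mutation flipping the selection of single requirements.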

 

Resource-constrained Project Scheduling

The goal of project scheduling is to assign tasks to available resources while minimizing costs. In software engineering this translates into assigning implementation tasks to developers with certain skill sets. This is a classic problem from operations research and is known to be NP-hard. Among the proposed approximation algorithms, genetic algorithms again seem promising [3], but swarm intelligence methods like ant colony optimization have also been investigated in this area [4]. See [5] for a detailed survey of proposed solutions.
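As a rough illustration, not taken from [3], [4] or [5], a schedule could be encoded as a task-to-developer assignment whose fitness is the resulting makespan; precedence constraints and skill matching are deliberately left out of this sketch.

// assignment[t] holds the index of the developer that task t is assigned to.
// The makespan is the workload of the busiest developer; a search algorithm
// would try to minimize this value.
double makespan(int[] assignment, double[] taskDuration, int developerCount) {
    double[] load = new double[developerCount];
    for (int t = 0; t < assignment.length; t++) {
        load[assignment[t]] += taskDuration[t];
    }
    double max = 0;
    for (double l : load) {
        max = Math.max(max, l);
    }
    return max;
}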

 

 

Implementation

 

Code Recommendations

Code completion is a frequently used tool, not only for writing code faster but also for exploring the functions an API provides. However, it is usually based statically on the used libraries, which has some drawbacks. First, when the API is huge, the developer gets overwhelmed by the number of possible functions in a certain context, and time is wasted searching for the desired call. Second, the proper use of a function in a certain context is still not clear, so developers end up in trial-and-error workflows when working with new APIs. To make the system more intelligent, one can compute recommendations based on what other developers have already written in similar situations. The underlying data can be gathered from open source repositories. In [6] a system is proposed which encodes the context of a variable by the methods in which it is used and by the methods previously called on that variable. The idea is then to find a number of similar code snippets in the code base by means of a modified k-Nearest-Neighbors approach and provide corresponding recommendations. The research conducted in [6] is also the basis for the popular Code Recommenders plugin in the Eclipse IDE.
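To sketch the core idea (this is not the actual implementation of [6] or of the Code Recommenders plugin), one could represent each usage context as the set of methods already called on a variable and compare contexts as follows:

// Jaccard similarity between two usage contexts; the feature extraction from
// source code is omitted here.
double contextSimilarity(Set<String> contextA, Set<String> contextB) {
    Set<String> intersection = new HashSet<String>(contextA);
    intersection.retainAll(contextB);
    Set<String> union = new HashSet<String>(contextA);
    union.addAll(contextB);
    return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
}

A k-Nearest-Neighbors recommender would rank all snippets in the code base by this similarity and propose the methods most frequently called in the top-k matches.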

 

 

Testing / Bug-Fixing

  

Automated Bug-fixing

Once a bug is spotted by means of a failing test case, the idea of automated bug-fixing is to let the machine provide a suitable patch such that all test cases are green again. One way to do this is a genetic programming approach in which a lot of (initially nearly random) patches are tried and evaluated by running all test cases (see [7]). The best (but still not perfect) of these patches are then combined and evaluated again, and this process is repeated several times. Together with some random mutations, this strategy eventually finds a suitable patch for the bug under consideration. Additionally, the patch size is included in the evaluation to drive the search towards smaller solutions. In [8] an algorithm is proposed which does not only mutate the code until a bug is fixed, but simultaneously evolves the unit tests based on how many bugs they find. This results in a competitive co-evolution in which the mutated programs try to catch up with the mutated unit tests.
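A minimal sketch of such a fitness evaluation could look as follows; the types Patch, Program and TestCase are hypothetical placeholders for whatever infrastructure applies a candidate patch and runs the test suite, and this is not the actual implementation from [7].

// Hypothetical fitness of a candidate patch: reward passing test cases and
// penalize large patches so the search is driven towards small fixes.
double patchFitness(Patch patch, Program original, List<TestCase> tests) {
    Program patched = patch.applyTo(original);     // hypothetical API
    int passed = 0;
    for (TestCase test : tests) {
        if (patched.run(test).isGreen()) {         // hypothetical API
            passed++;
        }
    }
    return passed - 0.1 * patch.editCount();       // the size-penalty weight is made up
}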

 

Bug Prediction

Classification is one of the basic tasks in machine learning: given some input data, determine which class it belongs to by making use of relations learned from training examples. Predicting bugs can be seen as a classification problem. Trained on previous occurrences of bugs and on software metrics describing the quality of the buggy piece of code, a model can be constructed that predicts bugs, or at least points to code snippets that have a higher chance of being buggy. Many of the typical machine learning approaches, like Naïve Bayes, Support Vector Machines or Logistic Regression, are suitable for this problem. [9] gives an overview of the research conducted in this area and compares the proposed methods.
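As a toy example of the prediction step (training is omitted), a logistic-regression-style model can score a module by its software metrics; all metric values, weights and the bias below are made up for illustration.

// Metrics of one module: lines of code, cyclomatic complexity,
// number of past bug fixes, code churn (made-up values).
double[] metrics = {450, 23, 4, 120};
// Weights and bias as they might come out of a training phase (made-up values).
double[] weights = {0.002, 0.08, 0.35, 0.004};
double bias = -3.0;

double z = bias;
for (int i = 0; i < metrics.length; i++) {
    z += weights[i] * metrics[i];
}
double bugProbability = 1.0 / (1.0 + Math.exp(-z));   // logistic function
boolean flagAsBugProne = bugProbability > 0.5;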

 

 

Maintenance

 

Bug Management

In bug management, or bug triage, existing bug reports are curated by assigning a suitable severity class, mapping issues to the right developers, or detecting and marking duplicate entries. When many users contribute bug reports, their maintenance becomes a time-consuming job. Intelligent tools can assist in categorizing new bug reports by learning from the existing ones. One can think of recommendations for severity classes based on the terms used in the bug description [10, 11]. Typical classification algorithms like Support Vector Machines can learn that terms like “crash” or “NPE” usually indicate severe bugs. Since the data in these scenarios is unstructured text, the usual steps from information retrieval, like tokenization, stop-word removal and stemming, need to be performed beforehand to achieve good classification results.
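The following sketch shows how such a preprocessing step might look before the resulting term frequencies are handed to a classifier; the stop-word list is truncated, stemming is left out, and bugDescription is just a placeholder string.

// Tokenize a bug description, drop stop words and count term frequencies.
// A real pipeline would additionally apply stemming (e.g. a Porter stemmer).
Set<String> stopWords = new HashSet<String>(Arrays.asList("a", "an", "the", "is", "on", "when"));
Map<String, Integer> termFrequencies = new HashMap<String, Integer>();
for (String token : bugDescription.toLowerCase().split("\\W+")) {
    if (token.isEmpty() || stopWords.contains(token)) {
        continue;
    }
    Integer count = termFrequencies.get(token);
    termFrequencies.put(token, count == null ? 1 : count + 1);
}
// A classifier such as an SVM trained on such vectors can then learn that
// terms like "crash" or "npe" correlate with severe bugs.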
When the goal is to find duplicated bug reports, considering only the used terms and their frequencies might not be enough, as two reports may describe the same bug with completely different terms. Hence, the hidden semantic structure of a bug description (comparable to the topic of a text) needs to be revealed and used for comparison. This is investigated in [12] by using Latent Dirichlet Allocation to model the mapping between the terms of a bug report and the topics they belong to in a probabilistic manner.


Traceability

In software engineering, traceability information describes how different artifacts are related to each other. For example, a requirement specification can be linked to the implementing source code, which in turn can be linked to a corresponding test case. One major advantage of maintaining this information is the possibility to perform impact analyses and thereby estimate the cost of a change in question. However, maintaining these links can be time-consuming in large software projects. There are several research proposals on how to alleviate this burden by automatically discovering relations between software artifacts [13, 14]. The current trend is to use advanced methods from information retrieval, like Latent Dirichlet Allocation, to represent an artifact in terms of the topics it is about [15]. Traceability links are then proposed based on the similarity of artifacts in their topic representation.
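A small sketch of the comparison step: two artifacts, each represented as a topic distribution (e.g. obtained from an LDA model), are compared by cosine similarity, and a link is proposed when the similarity exceeds some threshold. The topic extraction itself is not shown and the threshold is arbitrary.

// Cosine similarity between two artifacts in topic space.
double cosineSimilarity(double[] topicsA, double[] topicsB) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < topicsA.length; i++) {
        dot += topicsA[i] * topicsB[i];
        normA += topicsA[i] * topicsA[i];
        normB += topicsB[i] * topicsB[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Propose a traceability link if the artifacts are similar enough.
boolean proposeLink(double[] requirementTopics, double[] codeTopics) {
    return cosineSimilarity(requirementTopics, codeTopics) > 0.7;   // threshold chosen arbitrarily
}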


This was only a very incomplete listing of research topics where methods of artificial intelligence (and especially its subfields machine learning, evolutionary computing and information retrieval) are used to help create better software. But the goal here is just to give a broad overview and some starting reading material for interested readers, who might also want to take a look at the RAISE workshop.




References


[1] Greer, D., & Ruhe, G. (2004). Software release planning: An evolutionary and iterative approach. Information and Software Technology, 46, 243–253.

[2] Zhang, Y., Harman, M., & Mansouri, S. (2007). The multi-objective next release problem. GECCO '07: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, 1129–1136.

[3] Hartmann, S. (2002). A self-adapting genetic algorithm for project scheduling under resource constraints. Naval Research Logistics, 49(5), 433–448. 


[4] Merkle, D., Middendorf, M., & Schmeck, H. (2002). Ant colony optimization for resource-constrained project scheduling. IEEE Transactions on Evolutionary Computation, 6(4), 333–346.

[5] Kolisch, R., & Hartmann, S. (2006). Experimental investigation of heuristics for resource-constrained project scheduling: An update. European Journal of Operational Research, 174, 23–37.


[6] Bruch, M., Monperrus, M., Mezini, M., & Briand, L. (2009). Learning from examples to improve code completion systems. Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE '09), 213.

[7] Weimer, W., Forrest, S., Le Goues, C., & Nguyen, T. (2010). Automatic program repair with evolutionary computation. Communications of the ACM, 53, 109.

[8] Arcuri, A., & Yao, X. (2008). A novel co-evolutionary approach to automatic software bug fixing. 2008 IEEE Congress on Evolutionary Computation, CEC 2008, 162–168.

[9] Hall, T., Beecham, S., Bowes, D., Gray, D., & Counsell, S. (2011). A Systematic Review of Fault Prediction Performance in Software Engineering. IEEE Transactions on Software Engineering, 38(6), 1276–1304.

[10] Lamkanfi, A., Demeyer, S., Soetens, Q. D., & Verdonckz, T. (2011). Comparing mining algorithms for predicting the severity of a reported bug. Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, 249–258.


[11] Tian, Y., Lo, D., & Sun, C. (2012). Information retrieval based nearest neighbor classification for fine-grained bug severity prediction. Proceedings - Working Conference on Reverse Engineering, WCRE, 215–224.

[12] Nguyen, A. T., Nguyen, T. T., Nguyen, T. N., Lo, D., & Sun, C. (2012). Duplicate bug report detection with a combination of information retrieval and topic modeling. Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (ASE 2012), 70.

[13] Spanoudakis, G., & Zisman, A. (2005). Software traceability: a roadmap. Handbook of Software Engineering and Knowledge Engineering (Vol. III, pp. 1–35). Retrieved from http://soi.city.ac.uk/~gespan/hseke05.pdf

[14] Oliveto, R., Gethers, M., Poshyvanyk, D., & De Lucia, A. (2010). On the equivalence of information retrieval methods for automated traceability link recovery. IEEE International Conference on Program Comprehension (ICPC 2010).

[15] Asuncion, H. U., Asuncion, A. U., & Taylor, R. N. (2010). Software traceability with topic modeling. 2010 ACM/IEEE 32nd International Conference on Software Engineering, 1(May), 95–104.

Sunday, December 28, 2014

A Genetic Algorithm for Solving TSP


In this blog post I want to give a brief introduction to genetic algorithms and show how this class of algorithms can be used to find good solutions for the traveling salesman problem (TSP). The presented algorithms are written in Java and can be found in my github repository. At the end, I will briefly compare the genetic algorithm approach with an exact solver in terms of computation time and optimality.

Genetic algorithms in a nutshell

Genetic algorithms are approximation algorithms that imitate biological evolution to find good solutions for problems that are typically hard to solve exactly. The general structure of a genetic algorithm is shown in the following code snippets. The algorithm works on a set of (non-optimal) problem solutions, called the population, and tries to find better solutions by intelligently combining the existing ones in a sequence of evolution steps until a termination condition is reached. In a first step, the population is usually initialized with random solutions to the problem. In each evolution step, the next generation is created as follows:

1. Selection

Parents for recombination (also called crossover) are selected by a selection procedure, mostly based on the fitness of the individuals. The goal of this procedure is to ensure that better solutions are more likely to be chosen for recombination, thereby applying selection pressure (less fit individuals die out).

2. Crossover/Recombination

The previously selected parents are merged into one or more new individuals, the offspring. This procedure highly depends on the chosen representation of the problem's solutions. The goal is to preserve useful parts of both parents in the resulting offspring.

3. Mutation

The resulting offspring is mutated with some probability by randomly modifying small parts of it. In this way the search space is explored further and the algorithm gets a chance to escape local optima and find the global one.

4. Replacement

The last step replaces the old generation with the newly created one. Typically the best individual(s) of the old generation are preserved in the new one in order not to lose their genetic material (this is called elitism).

The evolution step is repeated until a termination condition is reached. This can be a fixed number of generations or the observation that the population did not improve during the last few generations.


GeneticAlgorithm<T> ga = new GeneticAlgorithm<T>();
Population<T> pop = new Population<T>(POP_COUNT);
ga.initialize(pop);                 // fill the population with random solutions
while (!terminate()) {
    pop = ga.evolve(pop);           // create the next generation from the current one
}
T solution = pop.getFittest();      // best individual found
Code Snippet 1. Outer loop of genetic algorithm
 
public Population<T> evolve(Population<T> oldPopulation) {
    Population<T> newPopulation = new Population<T>(oldPopulation.size());
    for (int i = 0; i < newPopulation.size(); i++) {
        // selection: pick two parents, favouring fitter individuals
        T parent1 = getSelection().execute(oldPopulation);
        T parent2 = getSelection().execute(oldPopulation);
        // crossover: combine both parents into a new offspring
        T offspring = getCrossOver().execute(parent1, parent2);
        // mutation: randomly modify small parts of the offspring
        getMutation().execute(offspring);

        newPopulation.setIndividual(i, offspring);
    }
    // replacement: merge old and new generation (e.g. keep the elite) and
    // return the result so that the caller can continue with the next generation
    return getReplacement().execute(oldPopulation, newPopulation);
}
Code Snippet 2. Evolution step

 

A genetic algorithm for solving TSP

Finding a suitable genetic algorithm for the TSP is an active research area which has resulted in several proposals for appropriate selection, crossover and mutation operators. In this section I describe the solution representation and the operators I have chosen to find good solutions in a reasonable time.

Representation

A solution of the TSP can be represented by an ordered list of unique cities where two subsequent cities (as well as the last and the first one) are connected by an edge, thus forming a Hamiltonian cycle. This representation is termed path representation and was chosen to facilitate an efficient implementation of the used operators.
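For illustration, a tour in path representation can simply be an array of city indices, and its quality is the length of the corresponding cycle. This is a sketch and not necessarily the exact classes used in the repository.

// A tour visiting cities 0 -> 3 -> 1 -> 4 -> 2 and back to 0.
int[] tour = {0, 3, 1, 4, 2};

// Length of the Hamiltonian cycle described by a tour, given a distance matrix.
double tourLength(int[] tour, double[][] dist) {
    double length = 0;
    for (int i = 0; i < tour.length; i++) {
        int from = tour[i];
        int to = tour[(i + 1) % tour.length];   // wrap around to close the cycle
        length += dist[from][to];
    }
    return length;
}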

Population

The population consists of a number of individuals which represent solutions using the above-mentioned path representation. The population is initialized with completely random cycles.

 

Selection

As selection procedure, deterministic tournament selection with a tournament size of 3 is used. In tournament selection, a fixed number of individuals (here 3) is drawn randomly from the population; they then compete against each other in a tournament. In deterministic tournament selection the fittest individual (here the shortest cycle) wins.
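A sketch of this operator, reusing the tourLength helper from above and assuming that the Population type offers a getIndividual(index) accessor in addition to the methods shown earlier:

// Deterministic tournament selection with tournament size 3: draw three random
// individuals and return the one with the shortest cycle.
int[] tournamentSelection(Population<int[]> population, double[][] dist, Random random) {
    int[] best = null;
    for (int i = 0; i < 3; i++) {
        int[] candidate = population.getIndividual(random.nextInt(population.size()));
        if (best == null || tourLength(candidate, dist) < tourLength(best, dist)) {
            best = candidate;
        }
    }
    return best;
}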

 

Crossover

As crossover procedure, the Sequential Constructive Crossover (SCX) proposed in [1] is implemented. The following figure depicts the recombination process.


The SCX operator takes two parents and creates a single offspring by taking edge costs into account. Starting from an initial city, it searches for the next not-yet-used successor city in each of the two parents. In the example, a successor for city 7 is requested. In the cycle denoted by parent 1, it is city 4; in parent 2, it is city 5, since city 1 is already used in the created offspring. The city that is nearer to city 7 is then chosen as the successor in the offspring. This procedure is repeated until the offspring is complete. In addition, each created offspring is optimized with the 2-opt local search algorithm to further improve the solution.
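A simplified sketch of the construction step is shown below; it deviates from the exact operator in [1] (for example in how the fallback successor is chosen) and omits the subsequent 2-opt optimization.

// SCX-style crossover: repeatedly pick, from both parents, the first not-yet-used
// city following the current one and keep the nearer of the two candidates.
int[] scxOffspring(int[] parent1, int[] parent2, double[][] dist) {
    int n = parent1.length;
    int[] offspring = new int[n];
    boolean[] used = new boolean[n];
    int current = parent1[0];
    offspring[0] = current;
    used[current] = true;
    for (int i = 1; i < n; i++) {
        int candidate1 = nextUnusedSuccessor(parent1, current, used);
        int candidate2 = nextUnusedSuccessor(parent2, current, used);
        int next = dist[current][candidate1] <= dist[current][candidate2] ? candidate1 : candidate2;
        offspring[i] = next;
        used[next] = true;
        current = next;
    }
    return offspring;
}

// First unused city that follows 'city' in the given parent tour (searching cyclically).
int nextUnusedSuccessor(int[] parent, int city, boolean[] used) {
    int pos = 0;
    while (parent[pos] != city) {
        pos++;
    }
    for (int k = 1; k < parent.length; k++) {
        int candidate = parent[(pos + k) % parent.length];
        if (!used[candidate]) {
            return candidate;
        }
    }
    throw new IllegalStateException("tour already complete");
}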

 

Mutation

As mutation operator, edge inversion is used with a mutation rate of 50%. In edge inversion, two cities are selected randomly and the order of all cities between them is reversed. The rather high mutation rate of 50% was determined by several tests and compensates for the greediness of the SCX operator and the resulting loss of population diversity.
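A sketch of the inversion move on the path representation; positions are drawn uniformly, and the repository may handle segment boundaries differently.

// Edge inversion: choose two random positions and reverse the order of all
// cities between them (inclusive).
void invertSegment(int[] tour, Random random) {
    int i = random.nextInt(tour.length);
    int j = random.nextInt(tour.length);
    int from = Math.min(i, j);
    int to = Math.max(i, j);
    while (from < to) {
        int tmp = tour[from];
        tour[from] = tour[to];
        tour[to] = tmp;
        from++;
        to--;
    }
}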

 

Replacement

As replacement procedure, all individuals are replaced by the newly created ones, apart from the best individual, the elite. This strategy is termed elitism.

 

Termination

In the conducted tests, the algorithm terminates when it fails to improve the best solution within 30 consecutive generations. Hence, the algorithm runs as long as the solution keeps improving within a reasonable time. Increasing this window might yield better results, but only at the expense of computation time.
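Expressed in code, the outer loop from the first snippet could be driven by a simple stagnation counter. This sketch reuses the tourLength helper from above and assumes the individuals are tours in path representation; the bookkeeping in the repository may differ.

int stagnantGenerations = 0;
double bestLength = Double.MAX_VALUE;
while (stagnantGenerations < 30) {
    pop = ga.evolve(pop);
    double currentLength = tourLength(pop.getFittest(), dist);
    if (currentLength < bestLength) {
        bestLength = currentLength;   // improvement found: remember it and reset the window
        stagnantGenerations = 0;
    } else {
        stagnantGenerations++;        // no improvement in this generation
    }
}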

Comparison with exact solver

To evaluate the quality of the presented genetic algorithm, I compare it with an exact TSP solver using the Held-Karp algorithm, as implemented in [2], on two sample instances taken from the TSPLIB benchmark library provided by the University of Heidelberg.

Benchmark    Exact Solver                      Genetic Algorithm (average over 10 runs)
             Cycle Cost   Computation Time     Cycle Cost            Computation Time
att48        33523.71     10 sec               33546.76 (+0.07%)     1 sec
kroD100      21294.29     241 sec              21416.59 (+0.57%)     13 sec

As the results show, for a problem with 100 cities the genetic algorithm finds, on average, solutions that are 0.57% off the optimum while taking 13 seconds on my machine. The exact solver, however, needs roughly 4 minutes! Due to the different complexity classes of the algorithms, this gap grows rapidly with the problem size.

Next steps

The described algorithms can be found in my github repository. However, the presented solution can still be improved by using more intelligent operators. One problem is the high greediness of the SCX operator, which decreases the population diversity and hence makes the search tend to converge to a local optimum. Using a high mutation rate (here 50%) is only one countermeasure. Better results may be achieved with operators that adapt themselves depending on the population's diversity, for example by increasing the mutation rate and decreasing the selection pressure (through different selection strategies) when convergence is observed.

Feel free to play around with the code. You will find some JUnit tests that trigger the execution of the algorithm under different configurations, as well as a simple graphical user interface.

References

[1] Ahmed, Z. H. Genetic Algorithm for the Traveling Salesman Problem using Sequential Constructive Crossover Operator. Department of Computer Science, Al-Imam Muhammad Ibn Saud Islamic University, P.O. Box No. 5701, Riyadh-11432, Kingdom of Saudi Arabia.
[2] http://stackoverflow.com/questions/7159259/optimized-tsp-algorithms