Analyzing the Change-Proneness of APIs and web APIs
DISSERTATION for the degree of doctor at Delft University of Technology, by authority of the Rector Magnificus Prof. ir. K.C.A.M. Luyben, chairman of the Board for Doctorates, to be defended in public on Wednesday, 7 January 2015 at 12:30 by Daniele ROMANO, Master of Science in Computer Science, University of Sannio, born in Benevento, Italy.

This dissertation has been approved by the promotors:
Prof. dr. A. van Deursen
Prof. dr. M. Pinzger

Composition of the doctoral committee:
Rector Magnificus, chairman
Prof. Dr. A. van Deursen, Delft University of Technology, The Netherlands, promotor
Prof. Dr. M. Pinzger, University of Klagenfurt, Austria, promotor
Prof. Dr. G. Antoniol, École Polytechnique de Montréal, Canada
Dr. Alexander Serebrenik, Eindhoven University of Technology, The Netherlands
Dr. Cesare Pautasso, University of Lugano, Switzerland
Prof. Dr. Ir. D.M. van Solingen, Delft University of Technology, The Netherlands
Prof. Dr. Frances Brazier, Delft University of Technology, The Netherlands

This work was carried out as part of the Re-Engineering Service-Oriented Systems (ReSOS) project. This project was partially funded by the NWO-Jacquard program and supported by Software Improvement Group and KPMG.

Copyright © 2014 by Daniele Romano

Cover image by Craig S. Kaplan, University of Waterloo.

"Pain is inevitable. Suffering is optional." Haruki Murakami

Acknowledgments

On the 10th of October 2010 I had my job interview at the Software Engineering Research Group (SERG) of Delft University of Technology. It was an unbelievably sunny and warm day and I immediately fell in love with the research performed at the SERG group, the people, the city of Delft, and the Dutch sun. After 4 amazing years I can say that the only thing I was wrong about was thinking "Dutch weather is not that bad". When I started my research in November 2010 all other expectations became reality. I am really happy to have spent the last 4 years in such a competitive research group with amazing people.

I wish to thank all those who have supported me on this journey, starting from Martin Pinzger and Arie van Deursen who gave me the opportunity to pursue this PhD.

First of all, I would like to thank my supervisor Martin Pinzger, whose guidance went far beyond my expectations. He always gave me his honest and professional guidance in performing scientific research. I am very thankful for his enthusiastic and human approach that made him not only a good supervisor but also a great friend. Thanks a lot Martin! All the time we spent together discussing research or simply enjoying leisure time has been important for my professional and private life and now it is part of me. I will never forget it. Also, I will never forget the only time when you were not able to guide me (on the top of an Austrian hill). That was funny!

Furthermore, I would like to thank all my colleagues who have always provided me with valuable feedback that has been important to improve the quality of my research. Especially, I would like to thank Andy Zaidman for his unending willingness to help me as well as anyone in the group. Thanks Andy! You are on my list of the best people I have met in my entire life.

Finally, I would like to thank all my friends and my family who always distracted me from my dedication to my research activities.
This has been really important even though I have not always been able to disconnect my mind from my research. Especially, I want to thank my family who accepted my willingness to move abroad and all its consequences. Claudio, Grazia, Maria Elena, Guido, nonna Elena, zio Tonino, I love you all a lot! I am sure that one day I will regret having spent part of my life abroad to pursue my professional goals instead of spending it with you. Thanks a lot for accepting it. We are a great family and the geographical distance will never change anything.

Delft, November 2014
Daniele Romano

Contents

Acknowledgments
1 Introduction
  1.1 Services
  1.2 Approach and Research Questions
  1.3 Research Method
  1.4 Related Work
  1.5 Origin of Chapters
2 Change-Prone Java Interfaces
  2.1 Interface Metrics
  2.2 The Approach
  2.3 Empirical Study
  2.4 Discussion
  2.5 Related Work
  2.6 Conclusions and Future Work
3 Change-Prone Java APIs
  3.1 Data Collection
  3.2 Empirical Study
  3.3 Threats to Validity
  3.4 Related Work
  3.5 Conclusion and Future Work
4 Fine-Grained WSDL Changes
  4.1 Related Work
  4.2 WSDLDiff
  4.3 Study
  4.4 Conclusion & Future Work
5 Dependencies among Web APIs
  5.1 Applications
  5.2 Related Work
  5.3 Study Context
  5.4 Approach
  5.5 Implementation
  5.6 Experiments
  5.7 Conclusion & Future Work
6 Change-Prone Web APIs
  6.1 Background
  6.2 Research Questions and Approach
  6.3 Online Survey
  6.4 Quantitative Analysis
  6.5 Discussion
  6.6 Related Work
  6.7 Concluding remarks
7 Refactoring Fat APIs
  7.1 Problem Statement and Solution
  7.2 Genetic Algorithm
  7.3 Random and Local Search
  7.4 Study
  7.5 Threats to Validity
  7.6 Related Work
  7.7 Conclusions and Future Work
8 Refactoring Chatty web APIs
  8.1 Problem Statement and Solution
  8.2 The Genetic Algorithm
  8.3 Study
  8.4 Related Work
  8.5 Conclusion & Future Work
9 Conclusion
  9.1 Contributions
  9.2 The Research Questions Revisited
  9.3 Recommendations for Future Work
  9.4 Concluding Remarks
Bibliography
Summary
Samenvatting
Curriculum Vitae

1. Introduction

Several years of research on software maintenance have produced numerous approaches to identify and predict change-prone components in a software system. Among others, source code metrics and heuristics to detect antipatterns and code smells have been widely validated as indicators of changes. However, these indicators have been mainly proposed and validated for object-oriented systems. There is still the need to define and validate indicators of changes for systems implemented in other programming paradigms, such as the service-oriented one.

In recent years there has been a tendency to adopt Service-Oriented Architectures (SOAs) [Josuttis, 2007] in companies and government organizations for two main reasons. First, SOAs allow companies to organize and use distributed capabilities (i.e., services) that may be under the control of different organizations or different departments within the same organization [Brown and Hamilton, 2006]. Second, organizations benefit from the loose coupling between clients and services.

However, clients and services are still coupled, and changes in the services can negatively impact their clients and entire systems. The dependencies removed in SOAs are the dependencies between clients and the underlying technologies used to implement services. Clients and services are still coupled through function coupling and data structure coupling [Daigneau, 2011]. In fact, clients depend 1) on the functionalities implemented by services (i.e., function coupling) and 2) on the data structures that a service's instance receives and returns (i.e., data structure coupling). Both are specified in the service's interface, which we refer to as web API throughout this thesis. For this reason web APIs are considered contracts between clients and service providers and they should remain as stable as possible [Daigneau, 2011; Murer et al., 2010]. However, like any other software component, services evolve to satisfy changing or new functional and non-functional requirements.
In this PhD research we investigate quality indicators that can highlight change-prone web APIs. Web APIs can be split into two main categories: SOAP/WSDL (WS-*) APIs and REST APIs [Pautasso et al., 2008]. In this dissertation we focus on SOAP/WSDL APIs [Alonso et al., 2010]. First, we investigate which indicators can highlight change-prone APIs. Changes in the implementation logic can cause changes in the web APIs, especially when legacy APIs are made available through web APIs. Then, we analyze indicators that highlight change-prone web APIs. Finally, based on design practices that can cause changes in APIs and web APIs, we propose techniques to automatically refactor them.

In this introductory chapter, we first present services, their history, and the importance of designing and implementing stable web APIs (Section 1.1). In Section 1.2 we present the research approach, the research questions, and the contributions of this PhD thesis. In Section 1.3 we show the research method used to answer our research questions. Section 1.4 discusses the related work. Finally, we present the outline of this thesis and the peer-reviewed publications on which the chapters of this thesis are based (Section 1.5).

1.1 Services

The term service has been introduced to refer to software functions that carry out business tasks. Business tasks include tasks such as providing access to files and databases, performing functions like authentication or logging, bridging technological gaps, etc. Services can be implemented using many technologies that range from the older CORBA and DCOM to the newer REST and SOAP/WSDL technologies.

Services have become popular and are widely used to ease the integration of heterogeneous systems. In fact, the main goal of services is to share business tasks across systems that run on different hardware platforms (e.g., Linux, Windows, Mac OS, Android, iOS) and are implemented through different software frameworks and programming languages (e.g., Java, .NET, Objective-C).

1.1.1 Software Integration with Web Services

The benefits of using services instead of other software components to ease integration are well discussed in the book Service Design Patterns by Daigneau [Daigneau, 2011].

Objects have been the first components used for integrating business tasks across different software systems [Daigneau, 2011]. An object (e.g., a Java class) can encapsulate business functions or data and it can be reused in different software systems. To reuse an object, developers instantiate it and access its business tasks by invoking its methods. The main problem of objects is that it is challenging to reuse them in software systems implemented with different programming languages.

To overcome this problem, component technologies have been proposed. Components are deployable binary software units that can be easily integrated into software systems implemented in different programming languages. The business tasks encapsulated into them are accessible through binary interfaces that describe their methods, attributes, and events, as shown in Figure 1.1. Unlike with objects, developers do not have access to the internals of the components but only to their interfaces. The interfaces, however, are described through platform-specific languages (e.g., the Microsoft Interface Definition Language).

Figure 1.1: Components are reused through platform-specific interfaces. Taken from [Daigneau, 2011].
While reusing components within systems implemented in different programming languages is easy, developers are now constrained to reuse components on specific platforms (e.g., Microsoft computing platforms).

To address this problem, objects have been deployed on servers, allowing clients to access their business tasks by invoking their methods remotely (Figure 1.2). Distributed objects can be reused by different software systems independently of the platforms on which clients and objects are deployed. The most popular technologies to invoke distributed objects are CORBA, DCOM, Java Remote Method Invocation (RMI), and .NET Remoting. As shown in Figure 1.2, a client invokes the remote object through a proxy. The proxy forwards the invocation to a stub that is deployed on the distributed object's server. Then, the stub is responsible for invoking the distributed object.

Figure 1.2: Distributed objects invoked over a network by their clients. Taken from [Daigneau, 2011].

This design pattern has its drawbacks as well. First, the implementation is not easy for developers. The serialization and deserialization of the messages exchanged is not standardized. As a consequence, the design pattern works well if both client and server use the same technologies to create the channel. Otherwise, technical problems can arise frequently. Other problems are due to the fact that servers maintain state between client calls, which can be extremely expensive. Maintaining state requires implementing proper techniques to perform load balancing effectively, and it can degrade server memory utilization as the number of clients increases.

Web services have been conceived to solve the aforementioned problems of local objects, components, and distributed objects. They provide a standard means of interoperating between different software applications, running on a variety of platforms and/or frameworks, and are based on "stateless" interactions in the sense that the meaning of a message does not depend on the state of the conversation [W3C, 2004]. The W3C has defined a web service as a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards.

To achieve high portability between different platforms, the W3C defined a Web Services Architecture Stack based on XML languages that offer standard, flexible and inherently extensible data formats. Among these languages, SOAP (Simple Object Access Protocol) and WSDL (Web Services Description Language) are the core languages to invoke a web service and to describe its interface. SOAP is a protocol that specifies the data structures of the messages exchanged with the web services and auxiliary data structures to represent other information, such as header information or error information that occurred while processing the message. WSDL describes the web services' interface in terms of 1) operations exposed in web services, 2) addresses or connection endpoints to web services, 3) protocols to bind web services, and 4) operations and messages to invoke web services. Note that WSDL interfaces can be mapped to any implementation language, platform, object model, or messaging system.
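As a minimal illustration of this mapping, the sketch below shows a hypothetical Java service interface annotated with the standard JAX-WS annotations; a SOAP stack can derive from it a WSDL description in which each annotated Java method becomes a WSDL operation and its parameters and return value become the input and output messages. The service and operation names are invented for this example and do not appear in the thesis.

import javax.jws.WebMethod;
import javax.jws.WebParam;
import javax.jws.WebService;

// Hypothetical service endpoint interface: a JAX-WS stack can generate a WSDL
// document from it, with one wsdl:operation per annotated Java method.
@WebService(name = "OrderService", targetNamespace = "http://example.org/orders")
public interface OrderService {

    // Becomes a WSDL operation whose input message carries the two parameters
    // and whose output message carries the returned order identifier.
    @WebMethod
    String placeOrder(@WebParam(name = "customerId") String customerId,
                      @WebParam(name = "productCode") String productCode);

    // Becomes a second WSDL operation with its own input and output messages.
    @WebMethod
    String orderStatus(@WebParam(name = "orderId") String orderId);
}

The binding protocol and the endpoint address, i.e., how and where these operations can be invoked, are contributed by the deployment configuration rather than by the interface itself, which is what allows the same WSDL contract to be served by different implementations.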
As a consequence, a WSDL interface is a contract between web service providers and their clients that hides the implementation behind the web services.

The architectural style of SOAP/WSDL web services is also known as the RPC (Remote Procedure Call) API style, highlighting the fact that clients invoke procedures over a network. However, the W3C defined another architectural style for web services, called Resource API. According to this style, web services expose resources (e.g., text files, media files) and not actions as in the RPC API style. Clients can access and manipulate these resources through representations (e.g., XHTML, XML, JSON). When a client receives a resource representation from a web service, it receives the current state of the resource. If it sends a representation of the resource to a web service, it possibly alters the state of the resource. For this reason this architectural style is also known as Representational State Transfer, or REST APIs [Fielding, 2000]. Resource APIs use HTTP as application protocol. Specifically, PUT is used to create or update resources, GET is used to retrieve a resource representation, and DELETE removes a resource. For a detailed comparison between SOAP/WSDL web services and REST APIs we refer the reader to the work by [Pautasso et al., 2008].

1.1.2 Change-Proneness of Web APIs

Using web services allows software engineers to reduce the coupling between distributed components and, hence, eases the integration among such components. Web services eliminate the dependencies between the clients and the underlying technologies used by a web service. Eliminating dependencies on technologies reduces the coupling, but it does not completely decouple clients and web services. There are still four different levels of coupling, namely the function coupling, the data structure coupling, the temporal coupling, and the URI coupling [Daigneau, 2011]. For more details on coupling in service-oriented systems we refer the reader to the work by [Pautasso and Wilde, 2009].

First of all, clients invoke web services to execute a business task (i.e., RPC API) or to retrieve, update, create, or delete a resource (i.e., Resource API). Clients depend indirectly on the business logic implemented by web services. This coupling is called function coupling. Second, the clients depend on the data structures used to invoke a web service and to receive the results of the invocations. These data structures are defined in the API of web services, which we refer to as web API throughout this book. This dependency is also known as data structure coupling. Third, clients and web services are coupled through temporal coupling. This level of coupling indicates that the web service should be operational when a client invokes it. Finally, clients are coupled to the web services' URIs (i.e., URI coupling).

As a consequence, clients depend on the implementations, the web APIs, the reliability, and the URIs of web services. Changes to these four factors are problematic for clients and they can break them.

In this PhD thesis we investigate the change-proneness of APIs and web APIs, focusing on SOAP/WSDL APIs [Alonso et al., 2010]. We decided to focus on web APIs, and hence on data structure coupling, because they are considered contracts between clients and web services specifying how they should interact. One of the key factors for deploying successful web APIs is assuring an adequate level of stability. Changes in a web API might break the clients' systems, forcing the clients to continuously adapt their systems to new versions of the web API. For this reason, assessing the stability of web APIs is key to reduce the likelihood of continuous updates.
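To make the notion of a breaking change concrete, consider the following sketch with invented names (it is not taken from the thesis): between the two versions of the interface an operation is renamed and a parameter type is changed, so a client generated against the first contract no longer compiles and the messages it produces no longer match the new contract.

// Version 1 of a hypothetical web API, as a client sees it after code generation.
interface CustomerServiceV1 {
    String getCustomer(String customerId);
}

// Version 2 of the same web API: the operation has been renamed and the
// parameter type has changed. Both are breaking changes in the sense of
// Daigneau [2011], because existing clients must be updated.
interface CustomerServiceV2 {
    String findCustomer(long customerId);
}

// A client written against version 1 no longer works with version 2:
// the operation it invokes does not exist anymore and the argument type differs.
class ReportingClient {
    String fetchCustomer(CustomerServiceV1 service) {
        return service.getCustomer("C-42"); // must be rewritten for version 2
    }
}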
1.1.3 Performance of Web APIs

During the frequent discussions with industry, practitioners kept repeating that performance issues are one of the causes that lead web APIs to be changed. Web APIs are invoked over a network and, hence, the latency can be significantly higher than when calling a similar API deployed on the same machine as the client. When a client invokes a method of a web API, the request must be serialized into a stream of bytes, transmitted over a network, deserialized on the server side, and dispatched to the web service. The same steps must be executed when the web service returns the results back to the client. As a consequence, designers should take care to design a web API that can execute a use case with the lowest number of messages exchanged between clients and web services. To reduce the latency, designers should usually prefer web APIs that exchange a few chunky messages instead of many smaller messages [Daigneau, 2011]. In this way they can avoid chatty conversations that increase the latency.

To better understand this problem, consider the redesign of web APIs adopted at Netflix [Jacobson, 2012] and shown in Figure 1.3. At the beginning of its history, Netflix had adopted a one-size-fits-all (OSFA) REST API approach to provide its services to its clients. This approach is shown in Figure 1.3a. According to this approach there is a unique REST API invoked by all the different clients. To satisfy the requirements of all clients, this API requires a large number of interactions with clients, which must invoke the API multiple times to execute a use case, as shown by the colored arrows in Figure 1.3a. Among other issues, this approach degrades the performance since network calls are expensive.

Figure 1.3: Web APIs redesign at Netflix [Jacobson, 2012]. (a) One-size-fits-all (OSFA) REST API approach at Netflix: each client has to invoke the single REST API multiple times. (b) Each client invokes its specifically designed REST API once, reducing the chattiness and improving the latency.

To overcome this problem, engineers at Netflix have adopted a new approach (shown in Figure 1.3b), reducing the latency in some cases by several seconds. In this new approach, the clients make a single chunky request (black arrow in Figure 1.3b) to their specific endpoint, designed to handle the requests of a specific client. As a consequence, each different client has its own web API with which it interacts with a single request per use case. These ad hoc web APIs communicate locally with a fine-grained Java API. The functionality of this API is similar to the original REST API. However, its fine-grained methods are invoked locally, while the clients perform only a single remote request. Thanks to this new approach, engineers at Netflix have reduced the chattiness of their web API and considerably improved the latency.

This story shows the relevance of designing web APIs with an adequate granularity. Inadequate granularity can cause performance issues that force web APIs to be changed.
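The difference between the two interaction styles can be sketched as follows; the interfaces and types are hypothetical and only illustrate the idea, they are not Netflix's actual APIs.

// Chatty style: one fine-grained remote call per piece of data.
// Rendering a single screen costs several network round trips.
interface ChattyMovieApi {
    String movieTitle(long movieId);
    double movieRating(long movieId);
    java.util.List<Long> similarMovies(long movieId);
    String memberName(long memberId);
}

// Chunky style: one coarse-grained remote call per use case.
// The same screen costs a single network round trip.
interface ChunkyMovieApi {
    HomeScreen homeScreen(long memberId);
}

// Coarse-grained response assembled on the server side, close to the data.
class HomeScreen {
    String memberName;
    java.util.List<String> recommendedTitles;
    java.util.List<Double> ratings;
}

In the chunky variant the fine-grained calls still exist, but they are made locally on the server side, which is the role played by the fine-grained Java API in the Netflix redesign described above.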
1.2 Approach and Research Questions

The work presented in this PhD thesis is part of the work performed within the ReSOS (Re-Engineering Service-Oriented Systems) project. ReSOS began in November 2010 and is aimed at improving the quality of service-oriented systems. However, the term quality is generic and includes many different quality attributes (e.g., reliability, efficiency, security, maintainability). Based on the literature (e.g., [Daigneau, 2011]) and on the frequent discussions with our industrial partners (KPMG, http://www.kpmg.com/, and SIG, http://www.sig.eu/en/) and collaborators, we found that the stability of web APIs is crucial for designing and maintaining service-oriented systems. As discussed in the previous section, web services became a popular means to integrate software systems that may belong to different organizations. As a consequence, web APIs are considered contracts [Daigneau, 2011; Murer et al., 2010] for integrating systems and they should stay as stable as possible.

Based on these discussions with our industrial partners and collaborators, we set up a research approach consisting of the following research tracks:

• Track 1: Analysis of change-prone APIs
• Track 2: Analysis of change-prone web APIs
• Track 3: Refactoring of change-prone web APIs

1.2.1 Track 1: Change-Prone APIs

Each web service is implemented by an implementation logic that is hidden from its clients behind its web API. Changes to the implementation logic can be propagated and affect the web API. Among all the software units composing the implementation logic, APIs are likely to be mapped directly into web APIs. This scenario happens especially when a legacy API is made available through a web service. For this reason, in the first track we analyze the change-proneness of APIs, where we refer to an API as the set of public methods declared in a software unit.

To perform this study we use existing techniques to mine software repositories and to extract changes performed in the APIs. We then analyze whether there is a correlation between the amount of changes an API undergoes and the values of source code metrics and/or the presence of antipatterns. The outcome of this track will consist of a set of quality indicators (e.g., heuristics and software metrics) that can highlight change-prone APIs and assist software engineers in designing stable APIs. In our context, these indicators are particularly useful to check the stability of APIs when they are mapped directly to web APIs. To investigate change-prone APIs, we first focus on the following research question:

Research Question 1: Which software metrics do indicate change-prone APIs?

We investigate this research question in Chapter 2 by empirically investigating the correlation between source code metrics and the number of fine-grained source code changes performed in the interfaces of ten Java open-source systems. Moreover, we use the metrics to train prediction models used to predict change-prone Java interfaces.

In Chapter 3 we answer our second research question:

Research Question 2: What is the impact of antipatterns on the change-proneness of APIs?

Previous studies showed that classes with antipatterns change more frequently than classes without antipatterns.
In Chapter 3 we answer this research question by extending these studies and taking into account fine-grained source code changes extracted from 16 Java open-source systems. In particular we investigate: (1) whether classes with antipatterns are more change-prone than classes without; (2) whether the type of antipattern impacts the change-proneness of Java classes; and (3) whether certain types of changes are performed more frequently in classes affected by a certain antipattern. Performing this analysis we retrieve the set of antipatterns that are most strongly correlated with changes performed in APIs.

1.2.2 Track 2: Change-Prone Web APIs

The second track consists of investigating the change-proneness of web APIs through the analysis of their evolution. This analysis can help us in identifying bad design practices that can increase the probability that a web API will be changed in the future. To perform this study we detect and extract changes performed in web APIs. This task is performed by a tool that compares two subsequent versions of a web API and extracts changes taking into account the syntax of the web API specification. In this way we can extract the type of a change performed in the interface, as classified in [Leitner et al., 2008]. Knowing the type of a change is particularly useful for two reasons. First, we can see which element is affected by the change and how it changes. Second, we can classify the changes depending on the impact they can have on the clients. In fact, changes can be divided into breaking changes and non-breaking changes depending on whether web service client developers need to update their code or not [Daigneau, 2011].

Once we are able to extract and classify changes, we investigate heuristics and software metrics that can be used as indicators of change-prone web APIs. Similar to Track 1, we then investigate the correlation between them and the changes performed in the web API. To perform such an analysis, we first need a tool to extract fine-grained changes among different versions of a web API. Then, such an analysis might require a tool to track the dependencies among web APIs. As already described in Section 1.1.2, even though services are loosely coupled, they are still coupled through function and data structure coupling. Coupling can be a good quality indicator in service-oriented systems, as has already been shown for systems implemented in other programming paradigms. We expect that a service with a higher incoming and outgoing coupling can show a higher response time. However, measuring coupling in service-oriented systems is more challenging than for systems implemented in other paradigms. This is mainly due to the dynamic and distributed nature of service-oriented systems.

Besides coupling, we analyze other attributes that can affect change-proneness, such as cohesion. We argue that a web API should be cohesive to prevent changes in the future. A web API with low cohesion can hamper the comprehension of the web API, resulting in lower reusability. Moreover, a web API with different responsibilities can be a bottleneck that can affect response time because of the different clients invoking it.
The outcome of this study consists of a set of heuristics and metrics that can assist software engineers in designing web APIs that are less change-prone. To perform this track, first we implement a tool to extract fine-grained changes between different versions of web APIs and we answer the following research question: Research Question 3: How can we extract fine-grained changes among subsequent versions of web APIs? We answer this research question in Chapter 4 by proposing a tool called WSDLDiff able to extract fine-grained changes from subsequent versions of a web API defined in WSDL. In contrast to existing approaches, WSDLDiff takes into account the syntax of WSDL and extracts the WSDL elements affected by changes and the types of changes. We show a first study aimed at analyzing the evolution of web APIs using the fine-grained changes extracted from the subsequent versions of four real world WSDL APIs. Based on the results of this study web service subscribers can highlight the most frequent types of changes affecting a WSDL API. This information is relevant to assess the risk associated to the usage of web services and to subscribe to the most stable ones. As second step in Track 2 we propose a portable approach to infer the dynamic dependencies among web services at run time answering the following research question: Research Question 4: How can we mine the full chain of dynamic dependencies among web services? We answer this research question in Chapter 5 by proposing an approach able to extract dynamic dependencies among web services. The approach is based on vector clocks, originally conceived and used to order events in distributed environments. We use vector clocks to order web service executions and to infer causal dependencies among web services. We show the feasibility of the approach by implementing it into the Apache CXF framework3 and instrumenting SOAP messages. Moreover, we show two experiments to investigate the impact of the approach on the response time. 3 http://cxf.apache.org 12 Chapter 1. Introduction Finally, we conclude Track 2 and investigate the change-proneness of web APIs answering the following research question: Research Question 5: What are the scenarios in which developers change web APIs with low internal and external cohesion? We address this research question in Chapter 6. We present a qualitative and quantitative study of the change-proneness of web APIs with low external and internal cohesion. The internal cohesion measures the cohesion of the operations (also referred as methods) declared in a web API. The external cohesion measures the extent to which the operations of a web API are used by external consumers (also called clients). First, we report on an online survey to investigate the maintenance scenarios that cause changes to web APIs. Then, we define an internal cohesion metric and analyze its correlation with the changes performed in ten well known WSDL APIs. The goal of the study is to provide several insights into the interface, method, and data-type changeproneness of web APIs with low internal and external cohesion. The choice of focusing on internal and external cohesion, instead of other attributes, is based on our previous and related work and discussed in Chapter 6. 1.2.3 Track 3: Refactoring Web APIs Track 1 and Track 2 give useful insights into the change-proneness of APIs and web APIs. Based on the findings of these tracks in Track 3 we investigate techniques to assist software engineers in refactoring change-prone web APIs. 
Finally, we conclude Track 2 and investigate the change-proneness of web APIs answering the following research question:

Research Question 5: What are the scenarios in which developers change web APIs with low internal and external cohesion?

We address this research question in Chapter 6. We present a qualitative and quantitative study of the change-proneness of web APIs with low external and internal cohesion. The internal cohesion measures the cohesion of the operations (also referred to as methods) declared in a web API. The external cohesion measures the extent to which the operations of a web API are used by external consumers (also called clients). First, we report on an online survey to investigate the maintenance scenarios that cause changes to web APIs. Then, we define an internal cohesion metric and analyze its correlation with the changes performed in ten well-known WSDL APIs. The goal of the study is to provide several insights into the interface, method, and data-type change-proneness of web APIs with low internal and external cohesion. The choice of focusing on internal and external cohesion, instead of other attributes, is based on our previous and related work and is discussed in Chapter 6.

1.2.3 Track 3: Refactoring Web APIs

Track 1 and Track 2 give useful insights into the change-proneness of APIs and web APIs. Based on the findings of these tracks, in Track 3 we investigate techniques to assist software engineers in refactoring change-prone web APIs. Among all change-prone indicators found in Track 1 and Track 2, we focus on external cohesion and we define techniques to refactor web APIs with low external cohesion. We focus on this attribute because it highlights both change-prone APIs (Track 1) and change-prone web APIs (Track 2). Web APIs, and in general APIs, with low external cohesion can be refactored through the Interface Segregation Principle [Martin, 2002]. As a consequence, as a first step in this track, we use search based software engineering techniques to refactor APIs with low external cohesion, answering the following research question:

Research Question 6: Which search based techniques can be used to apply the Interface Segregation Principle?

We answer this research question in Chapter 7. We formulate the problem of applying the Interface Segregation Principle as a multi-objective clustering problem and we propose a genetic algorithm to solve it. We evaluate the capability of the proposed genetic algorithm with 42,318 public Java APIs whose clients' usage has been mined from the Maven repository. The capability of the genetic algorithm is then compared with the capability of other search based approaches (i.e., random and simulated annealing approaches).

The last part of this track consists of refactoring fine-grained web APIs (i.e., chatty APIs). As discussed in Section 1.1.3, fine-grained APIs can be changed over time to improve the performance and to reduce the number of remote invocations. In this part we answer the following research question:

Research Question 7: Which search based techniques can transform a fine-grained API into multiple coarse-grained APIs reducing the total number of remote invocations?

In Chapter 8 we answer this research question by proposing a genetic algorithm that mines the clients' usage of web service operations and suggests Façade web services whose granularity reflects the usage of each different type of client. These Façade web services can be deployed on top of the original web service and they become contracts for the different types of clients, satisfying the Consumer-Driven Contracts pattern [Daigneau, 2011]. According to this pattern, the granularity of a web API, in terms of exposed operations, should reflect the clients' usage.

1.2.4 Contributions

The contributions of this PhD research can be summarized as follows:

• A set of validated quality indicators, comprising metrics and heuristics, to highlight change-prone APIs;
• A set of validated quality indicators, comprising metrics, heuristics, techniques, and tools to highlight change-prone web APIs;
• A tool to mine fine-grained changes between different versions of a web API;
• An approach to infer the dynamic dependencies among web services at run time;
• An approach to refactor web APIs, and in general APIs, with low external cohesion applying the Interface Segregation Principle;
• An approach to refactor fine-grained web APIs into coarse-grained web APIs with a lower number of required remote invocations.

1.3 Research Method

Our research has been done in close collaboration with our industrial partners and collaborators, following an industry-as-laboratory approach [Potts, 1993]. The involvement of the industry in our research is crucial to address challenges faced by practitioners and to develop techniques and tools capable of assisting them in solving real world problems.
Frequent discussions allowed us to focus on their main problems and agree on sustainable solutions. As a consequence, all the problems addressed in this thesis have arisen from these discussions with the industrial parties. This step has been particularly useful to define the aforementioned research questions and the directions of our research (i.e., the change-proneness of APIs and web APIs).

To answer our research questions we used different research methods. Research questions 1, 2, and 5 are aimed at validating indicators that highlight change-prone APIs and web APIs. They have been mainly answered by performing quantitative studies based on mining software repositories techniques [Kagdi et al., 2007] and using statistics [Sheskin, 2007] and machine learning techniques [Witten and Frank, 2005]. We performed these studies analyzing open source systems from different domains. The reason behind the choice of these systems is two-fold. First, industrial parties are reluctant to release their systems' repositories and to allow public discussions about them. Second, using open source systems allows other researchers to compare our findings with theirs and also to verify and extend our work. Whenever the available data was not enough to draw statistical conclusions (i.e., Research Question 5), we followed a mixed-methods approach [Creswell and Clark, 2010], which is a combination of quantitative and qualitative methods. In this case the results of statistical tests are complemented with an online survey [Floyd J. Fowler, 2009].

The remaining research questions are aimed at validating approaches to analyze service-oriented systems (i.e., research questions 3 and 4) and to refactor APIs and web APIs (i.e., research questions 6 and 7). Whenever the available data was not enough to validate these approaches (i.e., research questions 4 and 7), we used synthetic data and performed controlled experiments [Wohlin et al., 2000]. The approaches used to refactor APIs and web APIs have been implemented and evaluated with state-of-the-art search-based techniques [Harman et al., 2012].

1.4 Related Work

In this section we present an overview of related work, while the main chapters of this PhD thesis provide more details.

Many studies (e.g., [Perepletchikov et al., 2010; Moha et al., 2012; Rotem-Gal-Oz, 2012; Král and Zemlicka, 2007]) propose quality indicators for service-oriented systems. However, these indicators have been poorly validated, mainly because of the lack of availability of such systems.

Perepletchikov et al. [2010, 2006] defined a set of cohesion and coupling metrics for service-oriented systems. They analyzed cohesion in the context of web services and proposed four different types of cohesion metrics for measuring analyzability [Perepletchikov et al., 2010]. Furthermore, they proposed three different coupling measures for web services and they showed their impact on maintainability [Perepletchikov et al., 2006].

The most recent work on web services antipatterns has been proposed by Moha et al. [2012]. They proposed an approach to specify and detect an extensive set of antipatterns that encompass concepts like granularity, cohesion and duplication. Their tool is capable of detecting the most popular web services antipatterns defined in the literature. Besides these antipatterns, they specified three more antipatterns, namely: bottleneck service, service chain and data service.
Bottleneck service is a web service used by many web services and affected by a high incoming and outgoing coupling that can affect response time. Service chain appears when a business task is achieved by a long chain of consecutive web service invocations. Data service is a web service that performs simple information retrieval or data access operations, which can affect the cohesion.

Rotem-Gal-Oz [2012] defined the knot antipattern as a set of web services with low cohesion which are tightly coupled. This antipattern can cause low usability and high response time. The sand pile, defined by Král and Zemlicka [2007], appears when many fine-grained web services share common data that may be available through a web service affected by the data service antipattern. Cherbakov et al. [2006] proposed the duplicate service antipattern, which affects services sharing similar methods and can cause maintainability issues.

Dudney et al. [2002] defined a set of antipatterns for J2EE applications. Among these we investigate the multi service, tiny service and chatty service antipatterns. The multi service is a service that provides different business operations that have low cohesion and can affect availability and response time. Tiny services are small web services with few methods that are used together. This antipattern can affect the reusability of such services. Finally, the chatty service antipattern affects services that communicate with each other exchanging small amounts of data. This antipattern can affect the response time.

All the aforementioned studies suggest and detect antipatterns for designing web APIs, but they do not investigate the effects of these antipatterns on change-proneness and do not suggest techniques to refactor web APIs.

1.5 Origin of Chapters

The chapters of this thesis have been published before as peer-reviewed publications or are under review. As a consequence they are self-contained and, hence, they might contain some redundancy in the background, motivation, and implication sections. The author of this thesis is the main contributor of all chapters and all publications have been co-authored by Martin Pinzger. The following list provides an overview of these publications:

• Chapter 2 was published in the 27th International Conference on Software Maintenance (ICSM 2011) [Romano and Pinzger, 2011a].
• Chapter 3 was published in the 19th Working Conference on Reverse Engineering (WCRE 2012) [Romano et al., 2012].
• Chapter 4 was published in the 19th International Conference on Web Services (ICWS 2012) [Romano and Pinzger, 2012].
• Chapter 5 was published in the 4th International Conference on Service Oriented Computing and Application (SOCA 2011) [Romano et al., 2011].
• Chapter 6 is currently under review and published as a technical report [Romano et al., 2013].
• Chapter 7 was published in the 30th International Conference on Software Maintenance and Evolution (ICSME 2014) [Romano et al., 2014].
• Chapter 8 was published in the 10th World Congress on Services (Services 2014) [Romano and Pinzger, 2014].

2. Change-Prone Java Interfaces

Recent empirical studies have investigated the use of source code metrics to predict the change- and defect-proneness of source code files and classes. While results showed strong correlations and good predictive power of these metrics, they do not distinguish between interface, abstract or concrete classes.
In particular, interfaces declare contracts that are meant to remain stable during the evolution of a software system, while the implementation in concrete classes is more likely to change. This chapter aims at investigating to which extent the existing source code metrics can be used for predicting change-prone Java interfaces. We empirically investigate the correlation between metrics and the number of fine-grained source code changes in interfaces of ten Java open-source systems. Then, we evaluate the metrics to calculate models for predicting change-prone Java interfaces. Our results show that the external interface cohesion metric exhibits the strongest correlation with the number of source code changes. This metric also improves the performance of prediction models to classify Java interfaces into change-prone and not change-prone. (This chapter was published in the 27th International Conference on Software Maintenance (ICSM 2011) [Romano and Pinzger, 2011a].)

2.1 Interface Metrics
2.2 The Approach
2.3 Empirical Study
2.4 Discussion
2.5 Related Work
2.6 Conclusions and Future Work

Software systems are continuously subjected to changes. Those changes are necessary to add new features, to adapt to a new environment, to fix bugs, or to refactor the source code. However, the maintenance of software systems is also risky and costly. Several approaches have been developed to optimize the maintenance activities and reduce the costs. They range from automated reverse engineering techniques to ease program comprehension to prediction models that can help identify the change- and defect-prone parts in the source code. Developers should focus on understanding these change- and defect-prone parts in order to take appropriate counter measures to minimize the number of future changes [Girba et al., 2004].

Many of these prediction models have been developed using source code metrics, such as by Briand et al. [2002], Subramanyam and Krishnan [2003], and Menzies et al. [2007]. While those prediction models showed good performance, they work on file and class level. None of them takes the kind of class into account, whether it is a concrete class, abstract class, or interface that is change- or defect-prone. We believe that changes in interfaces can have a stronger impact than changes in concrete and abstract classes, and should therefore be treated separately. Interfaces are meant to represent contracts among modules and logic units in a software system. For this reason, they are supposed to be more stable to avoid contract violations and to reduce the effort to maintain a software system.

In this chapter, we focus on Java interfaces and investigate the predictive power of various source code metrics to classify Java interfaces into change-prone and not change-prone. Concerning the source code metrics, we take into account (1) the set of metrics defined by Chidamber and Kemerer [1994]; (2) a set of metrics to measure the complexity and the usage of interfaces; and (3) two metrics to measure the external cohesion of Java interfaces.
The number of fine-grained source code changes (#SCC), as introduced by Fluri et al. [2007], is used to distinguish between change-prone and not change-prone interfaces.

We selected the Chidamber and Kemerer (C&K) metrics suite because it is widely used and has been validated by several approaches, such as [Rombach, 1987], [Li and Henry, 1993], and [Basili et al., 1996]. The two external cohesion metrics are Interface Usage Cohesion (IUC) and a clustering metric. These metrics are meant as heuristics to indicate violations of the Interface Segregation Principle (ISP) as described by Martin [2002]. We believe that the violation of the ISP can impact the maintenance of interfaces and the software system as a whole. The complexity and usage metrics for interfaces have been added to provide a broader set of interface metrics for our study.

To investigate our claim, we perform an empirical study with the source code and versioning data of ten Java open source systems, namely eight plugin projects from the Eclipse platform, Hibernate2, and Hibernate3. In the study, we address the following two research hypotheses:

• H1: IUC has a stronger correlation with the #SCC of interfaces than the C&K metrics
• H2: IUC can improve the performance of prediction models to classify Java interfaces into change- and not change-prone

The results show that most of the C&K metrics perform well for predicting change-prone concrete and abstract classes but are limited in predicting change-prone Java interfaces, therefore confirming our claim that interfaces need to be treated separately. The IUC metric exhibits the strongest correlation with #SCC of Java interfaces and proves to be an adequate metric to compute prediction models for classifying Java interfaces.

The remainder of this chapter is organized as follows. Section 2.1 discusses the C&K metrics and their effectiveness when used for measuring the size and complexity of interfaces. We furthermore introduce the IUC metric and several other interface complexity and usage metrics. Section 2.2 describes the approach used to measure the metrics and to mine the fine-grained source code changes from versioning repositories. The empirical study and results are presented in Section 2.3. Section 2.4 discusses the results and threats to validity. Related work is presented in Section 2.5. We draw our conclusions and outline directions for future work in Section 2.6.

2.1 Interface Metrics

In this section, we present the set of source code metrics used in our empirical study. We furthermore discuss their applicability to measure the size, complexity, and cohesion of Java interfaces. We then present the IUC metric and motivate its application to predict change-prone interfaces. At the end of the section, we list additional metrics to measure the complexity and the usage of interfaces. Those metrics are meant to provide further validation of the predictive power of the IUC metric.

2.1.1 Object-Oriented Metrics & Interfaces

Among the existing product metrics [Henderson-Sellers, 1996], we focus on the object-oriented metrics introduced by Chidamber and Kemerer [1994]. They have been widely used as quality indicators of object-oriented software systems. These metrics are:
• Coupling Between Objects (CBO)
• Lack of Cohesion Of Methods (LCOM)
• Number Of Children (NOC)
• Depth of Inheritance Tree (DIT)
• Response For Classes (RFC)
• Weighted Methods per Class (WMC)

We selected the C&K metrics mainly because prior work demonstrated their usefulness for building models for change prediction, e.g., [Li and Henry, 1993], [Zhou and Leung, 2007], as well as defect prediction, e.g., [Basili et al., 1996]. In the following, we briefly describe each metric and discuss its application to interfaces.

Coupling Between Objects (CBO)

The CBO metric represents the number of data types a class is coupled with. More specifically, it counts the unique number of reference types that occur through method calls, method parameters, return types, exceptions, and field accesses. If applied to interfaces, this metric is limited to method parameters, return types and exceptions, leaving out method calls and field accesses.

Lack of Cohesion Of Methods (LCOM)

The LCOM metric counts the number of pairwise methods without any shared instance variable, minus the number of pairwise methods that share at least one instance variable. More precisely, the LCOM metric revised in [Henderson-Sellers et al., 1996] is defined as:

$LCOM = \frac{\frac{1}{a}\sum_{j=1}^{a}\mu(A_j) - m}{1 - m}$

where $a$ represents the number of attributes of a class, $m$ the number of methods, and $\mu(A_j)$ the number of methods which access each attribute $A_j$ of a class. Perfect cohesion is defined as all methods accessing all variables, in which case the value of LCOM is 0. In contrast, if all methods do not share any instance variable, the value of LCOM is 1.

The LCOM metric is not applicable to interfaces since interfaces do not contain logic and, consequently, attribute accesses. For instance, the commercial metric tool Understand (http://www.scitools.com/) outputs either 0 or 1 as value of LCOM for an interface. The value 1 denotes that the interface also contains the definition of constant attributes, otherwise the value for LCOM is 0. This limits the use of LCOM for computing prediction models.

Weighted Methods per Class (WMC)

WMC is the sum of the cyclomatic complexities of all methods declared by a class. Formally, the metric is defined as:

$WMC = \sum_{i=1}^{n} c_i$

where $c_i$ is the cyclomatic complexity of the $i$-th method of a class. In the case of Understand, this metric corresponds to the Number Of Methods (NOM), since the complexity of each method declared in an interface is 1. In the case of the Metrics tool (http://metrics.sourceforge.net/), this metric is always 0 for interfaces. This limits the predictive power of this metric for predicting change-prone interfaces.

Number Of Children (NOC)

The NOC metric counts the number of directly derived classes of a class or interface. Even though this metric is sound for interfaces, we argue that its application for predicting change-prone interfaces is limited. The main reason is that interfaces inherit only the type definition (i.e., sub-typing), while abstract classes and concrete classes also inherit the business logic.

Depth of Inheritance Tree (DIT)

The DIT metric denotes the length of the longest path from a sub-class to its base class in an inheritance structure. The idea behind the usage of DIT as a change-proneness indicator is that classes contained in a deep inheritance structure are more likely to change (e.g., changes in a super-class cause changes in its sub-classes). Similar to NOC, we believe that this metric is more useful for abstract and concrete classes than for interfaces.

Response For Classes (RFC)

The RFC metric counts the number of local methods (including inherited methods) of a class. This metric remains valid for interfaces, but it is close to the WMC metric since the only added information is the count of the inherited methods.
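To make the discussion above more tangible, consider the following invented Java interface; the metric values in the comments are only a rough reading of the definitions above and are not taken from the chapter.

import java.util.List;

// Invented example interface used only to illustrate the metric definitions above.
public interface AccountRepository {

    // Parameters, return types, and thrown exceptions contribute to CBO
    // (here, roughly: Account, String, List, and IllegalStateException).
    Account findById(String accountId) throws IllegalStateException;

    List<Account> findAll();

    void store(Account account);
}

// Plain data holder, only needed to make the example self-contained.
class Account {
}

// Rough metric reading for AccountRepository under the definitions above:
// - WMC/NOM: 3 declared methods, each counted with cyclomatic complexity 1.
// - LCOM: degenerate, since an interface declares no logic and no attribute accesses.
// - NOC and DIT: determined solely by the type hierarchy (implementing classes and
//   extended interfaces), not by any inherited business logic.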
In summary, while most of the C&K metrics are adequate metrics for abstract and concrete classes, they are not as powerful for interfaces. Moreover, these metrics fall short in expressing the cohesion of interfaces; therefore we introduce the two external cohesion metrics as presented in the following section.

2.1.2 External Cohesion Metrics of Interfaces

Developers should not design fat interfaces, that is, interfaces whose clients invoke different methods. This problem has been formalized in the Interface Segregation Principle (ISP) described by Martin [2002]. The ISP states that fat interfaces need to be split into smaller interfaces according to the clients of an interface. Any client should only know about the set of methods provided by an interface that are used by the client. In the literature, the lack of conformance to the ISP is mainly associated with a higher risk for clients to change when an interface is changed. To the best of our knowledge there exists no empirical evidence that underlines this association.

In order to measure the violation of the ISP, we use two cohesion metrics: the external cohesion metric for services called Service Interface Usage Cohesion (SIUC), taken from Perepletchikov et al. [Perepletchikov et al., 2007, 2010], and a clustering metric. In the following, we refer to the SIUC metric as Interface Usage Cohesion (IUC) because we apply it in the context of object-oriented systems. The metric is defined as:

$IUC(i) = \frac{\sum_{j=1}^{n} \frac{used\_methods(j,i)}{num\_methods(i)}}{n}$

where $j$ denotes a client of the interface $i$; $used\_methods(j,i)$ is the function which computes the number of methods defined in $i$ and used by the client $j$; $num\_methods(i)$ returns the total number of methods defined in $i$; and $n$ denotes the number of clients of the interface $i$.

The external cohesion defined by Perepletchikov et al., and hence the IUC metric, states that there is a strong external cohesion if every client uses all methods of an interface. We argue that interfaces with strong external cohesion (the value of IUC is close to one) are less likely to change. On the other hand, when there is a high lack of external cohesion (the value of IUC is close to zero), the interface is more likely to change due to the larger number of clients.

Consider the example in Figure 2.1a that shows an interface for providing bank services. The service is used by two different clients, namely the Professional Client and the Student Client. The two clients share only one interface method, namely the method accountBalance(). Since this method is shared by two different clients, it is more likely to change to satisfy the requirements of the different clients. The design of the BankServices interface does not conform to the ISP. The value of IUC for this interface is $\frac{3/4 + 2/4}{2} = 5/8$.

Figure 2.1: An example of lack of external cohesion. (a) Different clients share a method. (b) Different clients do not share any methods.
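A hypothetical Java rendering of the interface in Figure 2.1a could look as follows (the method names are chosen to match the example; the exact signatures are not given in the chapter), with the IUC value worked out in the comments.

// Fat interface of Figure 2.1a: the two client types use different subsets of it.
interface BankServices {
    double accountBalance(String accountId);       // used by both clients
    void requestLoan(String accountId);            // professional client only
    void openBusinessAccount(String owner);        // professional client only
    void requestStudentDiscount(String accountId); // student client only
}

// The professional client uses 3 of the 4 methods: 3/4.
class ProfessionalClient {
    void run(BankServices bank) {
        bank.accountBalance("P-1");
        bank.requestLoan("P-1");
        bank.openBusinessAccount("ACME");
    }
}

// The student client uses 2 of the 4 methods: 2/4.
class StudentClient {
    void run(BankServices bank) {
        bank.accountBalance("S-1");
        bank.requestStudentDiscount("S-1");
    }
}

// IUC(BankServices) = (3/4 + 2/4) / 2 = 5/8, as computed in the text.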
Another heuristic to measure the external cohesion is the ClusterClients(i) metric. This metric counts the number of clients of an interface i that do not share any method with another client. Higher values for this metric indicate lower cohesion. For the interface in Figure 2.1a the value of ClusterClients is 0, and for the interface in Figure 2.1b the value is 2. We use this metric to investigate whether the contribution of the shared methods, as computed by the IUC metric, is relevant to predict change-prone interfaces.

2.1.3 Complexity and Usage Metrics for Interfaces

In addition to the object-oriented metrics, we validate the IUC metric against several other metrics defined to measure the complexity and usage of an interface. The complexity metrics are:

• NOM(i): counts the number of methods declared in the interface i;
• Arguments(i): counts the total number of arguments of the declared methods in the interface i;
• APP(i): measures the mean size of method declarations of an interface i and is equal to Arguments(i) divided by NOM(i), as defined by Boxall and Araban [2004].

The usage metrics are:

• Clients(i): counts the number of distinct classes that invoke the interface i;
• Invocations(i): counts the number of static invocations of the methods declared in the interface i;
• Implementing_Classes(i): counts the number of direct classes that implement the interface i.

2.2 The Approach

In this section, we illustrate the approach used to extract the fine-grained source code changes, to measure the metrics, and to perform the experiments aimed at addressing our research hypotheses. Figure 2.2 shows an overview of our approach that consists of three stages: (A) in the first stage we check out the source code of the projects from their versioning repositories and we measure the source code metrics; (B) we then compute the number of SCC from the versioning data for each class and interface; (C) finally we use the metrics and the number of SCC to perform our experiments with the PASW Statistics (http://www.spss.com/software/statistics/) and RapidMiner (http://rapid-i.com/content/view/181/196/) toolkits.

2.2.1 Source Code Metrics Computation

The first step of the process consists of checking out the source code of each project from the versioning repositories. The source code of each project is then parsed with the Evolizer Famix Importer, belonging to the Evolizer tool set (http://www.evolizer.org/). The parser extracts a FAMIX model that represents the source code entities and their relationships [Tichelaar et al., 2000]. Figure 2.3 shows the core of the FAMIX meta model. The model represents inheritance relationships among classes, the methods belonging to a class, the attributes accessed by a method, and the invocations among methods. For more details we refer the reader to [Tichelaar et al., 2000].

After obtaining the FAMIX model, the next step consists of measuring the source code metrics of classes and interfaces. We use the Understand tool to measure the C&K metrics. We decided to use the Understand tool because, in our view, it provides the most precise measurement of these metrics for interfaces. We use the FAMIX model to measure the external cohesion, complexity, and usage metrics of interfaces. For example, to measure the Invocations(i) metric we count the number of invocation objects in the FAMIX model that point to a method of the interface i.
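The following Python fragment sketches this kind of model query under the assumption that the extracted invocations are available as simple (client class, target interface, invoked method) tuples; it illustrates the counting only and is not the Evolizer/FAMIX API.

    # Hypothetical flat representation of FAMIX invocation objects.
    invocations = [
        ("ReportGenerator", "BankServices", "accountBalance"),
        ("ReportGenerator", "BankServices", "transfer"),
        ("StudentPortal",   "BankServices", "accountBalance"),
    ]

    def invocations_of(interface, invs):
        return sum(1 for client, target, method in invs if target == interface)

    def clients_of(interface, invs):
        return len({client for client, target, method in invs if target == interface})

    print(invocations_of("BankServices", invocations))  # 3
    print(clients_of("BankServices", invocations))      # 2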
2.2.2 SCC Extraction

The first step of the SCC extraction stage consists of retrieving the versioning data from the repositories (e.g., CVS, SVN, or GIT), for which we use the Evolizer Version Control Connector [Gall et al., 2009]. The versioning repositories provide log entries that contain information about revisions of files that belong to the system under analysis. For each log entry, it extracts the revision number, the revision timestamp, the name of the developer who checked in the revision, the commit message, the total number of lines modified (LM), and the source code.

In the second step, we use ChangeDistiller [Gall et al., 2009] to extract the fine-grained source code changes (SCC) from the various source code revisions of each file. ChangeDistiller implements a tree differencing algorithm that compares the Abstract Syntax Trees (ASTs) of all directly subsequent revisions of a file. Each change represents a tree edit operation that is required to transform one version of the AST into the other. In this way we can track fine-grained source code changes down to the statement level. Based on this information we count the number of fine-grained source code changes (#SCC) for each class and interface over the selected observation period.
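To illustrate the output of this stage, the sketch below aggregates per-class change counts from a hypothetical list of (class, change type) records; the record layout and the change-type names are assumptions made for the example and do not reflect the ChangeDistiller output format.

    from collections import Counter

    # Hypothetical fine-grained changes extracted between two revisions.
    changes = [
        ("org.example.BankServices", "PARAMETER_INSERT"),
        ("org.example.BankServices", "RETURN_TYPE_CHANGE"),
        ("org.example.AccountImpl",  "STATEMENT_INSERT"),
    ]

    # #SCC per class/interface over the observation period.
    scc = Counter(cls for cls, change_type in changes)
    print(scc["org.example.BankServices"])  # 2
    print(scc["org.example.AccountImpl"])   # 1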
2.2.3 Correlation and Prediction Analysis

We use the collection of metric values and the #SCC of each class and interface as input to our experiments. First, we use the PASW Statistics tool to perform a correlation analysis between the source code metrics and the #SCC. Then, we use the RapidMiner tool to analyze the predictive power of the source code metrics to discriminate between change- and not change-prone interfaces. We perform a series of classification experiments with different machine learning algorithms, namely: Support Vector Machine, Naive Bayes Network, and Neural Nets. The next section details the empirical study.

2.3 Empirical Study

The goal of this empirical study is to evaluate the possibility of using the IUC metric for predicting change-prone interfaces and to highlight the limited predictive power of the C&K metrics. The perspective is that of a researcher, interested in investigating whether the traditional object-oriented metrics are useful to predict change-prone interfaces. The results of our study are also interesting for quality engineers who want to monitor the quality of their software systems using an external cohesion metric for interfaces. The context of this study consists of ten open-source systems, widely used in both the academic and the industrial community. These systems are eight plug-ins from the Eclipse platform (http://www.eclipse.org/) and the Hibernate2 and Hibernate3 systems (http://www.hibernate.org/). Eclipse is a popular open source system that has been studied extensively by the research community (e.g., [Businge et al., 2010], [Businge et al., 2013], [Businge, 2013], [Bernstein et al., 2007], [Nagappan et al., 2010], [Zimmermann et al., 2007], and [Zimmermann et al., 2009]). Hibernate is an object-relational mapping (ORM) library for the Java language.

Table 2.1 shows an overview of the dataset used in our empirical study. #Files is the number of unique Java files, #Interfaces is the number of unique Java interfaces, #Rev is the total number of Java file revisions, and #SCC is the number of fine-grained source code changes performed within the given time period (Time).

Table 2.1: Dataset used in the empirical study

Project                 #Files  #Interfaces  #Rev   #SCC   Time [M,Y]
Hibernate3              970     165 (17%)    30774  34960  Jun04-Mar11
Hibernate2              494     69 (14%)     13584  22960  Jan03-Mar11
eclipse.debug.core      188     97 (52%)     8295   11670  May01-Mar11
eclipse.debug.ui        793     129 (16%)    41860  55259  May01-Mar11
eclipse.jface           381     105 (28%)    22136  27041  Sep02-Mar11
eclipse.jdt.debug       469     140 (30%)    11711  33895  Jun01-Mar11
eclipse.team.core       172     44 (26%)     3726   4551   Nov01-Mar11
eclipse.team.cvs.core   189     25 (13%)     12343  23311  Nov01-Mar11
eclipse.team.ui         293     45 (15%)     20183  32267  Nov01-Mar11
eclipse.update.core     274     71 (26%)     7425   25617  Oct01-Mar11

In this study, we address the following two research hypotheses:

• H1: IUC has a stronger correlation with the #SCC of interfaces than the C&K metrics
• H2: IUC can improve the performance of prediction models to classify Java interfaces into change- and not change-prone

We first perform an initial analysis of the extracted information, in terms of the number of changes and in terms of metric values. Figure 2.4 shows the box plots of the #SCC of Java classes and interfaces mined from the versioning repositories of each project. The results show that on average the number of changes involving Java classes is at least one order of magnitude higher than the number involving Java interfaces. This result is not surprising since interfaces can be considered contracts among modules and, in general, among logic units of a system.

Figure 2.5 shows the values of the C&K metrics for classes and interfaces over all ten projects. The values of the CBO metric are in general lower for interfaces, since it counts only the number of reference types in the parameters, return types, and thrown exceptions of the method signatures. The values of the RFC metric are higher for classes than for interfaces. Also the values of the DIT metric are in general higher for classes than for interfaces. Analyzing the LCOM we can notice that Java classes have a low median LCOM and hence a high cohesion. On the other hand, interpreting the LCOM of interfaces we can state that most of them do not expose any attributes in their body. In fact, the Understand tool registers an LCOM of 0 when there are no attribute declarations, and 1 if there are some.
The values of WMC confirm the assumptions made in Section 2.1.1 about the loss of meaning of this metric when applied to interfaces. In fact, the values of WMC correspond exactly to the values of NOM (Number of Methods). As expected, we registered higher values of NOC for interfaces than for classes. This is due to the number of implementing classes that are counted as children by Understand.

2.3.1 Correlation between metrics and #SCC

The next step in our study aims at investigating the correlation between the metrics and the #SCC mined from the versioning repositories. We used the Spearman rank correlation analysis to identify highly-correlated metrics. Spearman compares the ordered ranks of the variables to measure a monotonic relationship. In contrast to the Pearson correlation, the Spearman correlation does not make assumptions about the distribution, the variances, and the type of the relationship [S. Weardon and Chilko, 2004]. A Spearman value of +1 or -1 indicates a high positive or high negative correlation, whereas 0 indicates that the variables under analysis do not correlate at all. Values greater than +0.5 and lower than -0.5 are considered to be substantial; values greater than +0.7 and lower than -0.7 are considered to be strong correlations.

To test the hypothesis H1, we performed two correlation analyses: (1) we analyze the correlation among the C&K metrics and the #SCC of Java classes and Java interfaces. An insignificant correlation of the C&K metrics for interfaces is a precondition for any further analysis of the interface complexity and usage metrics. (2) We explore the extent to which the interface cohesion, complexity, and usage metrics correlate with #SCC.
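For illustration, such a rank correlation between a metric and #SCC can be computed with SciPy as sketched below; the values are made up, and the study itself used PASW Statistics rather than this code.

    from scipy.stats import spearmanr

    # Hypothetical per-interface values of one metric and of #SCC.
    iuc_values = [0.95, 0.80, 0.40, 0.30, 0.70, 0.20, 0.60, 0.10]
    scc_counts = [1,    2,    14,   20,   7,    25,   3,    30]

    rho, p_value = spearmanr(iuc_values, scc_counts)
    print(f"rho = {rho:.3f}, p = {p_value:.4f}")  # a strong negative correlation for this toy data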
Table 2.2 lists the results of the correlation analysis between the C&K metrics and #SCC for classes and interfaces in each project. The heading Xc indicates the correlation of the metric X with the #SCC of classes, and Xi the correlation with the #SCC of interfaces.

Table 2.2: Spearman rank correlation between the C&K metrics and the #SCC computed for Java classes and Java interfaces (** marks significant correlations at α=0.01, * marks significant correlations at α=0.05, values in bold mark a significant correlation)

Project                 CBOc      CBOi      NOCc      NOCi      RFCc      RFCi
Hibernate3              0.590**   0.535**   0.109**   0.029     0.338**   0.592**
Hibernate2              0.352**   0.373**   0.134**   0.065     0.273**   0.325**
eclipse.debug.core      0.560**   0.484**   -0.025    0.105     0.431**   0.486**
eclipse.debug.ui        0.566**   0.216*    0.087*    0.033     0.291**   0.152
eclipse.jface           0.570**   0.239*    0.257**   0.012     0.516**   0.174**
eclipse.jdt.debug       0.502**   0.512**   0.154**   0.256**   0.132     0.349**
eclipse.team.core       0.453**   0.367*    0.180*    0.102     0.435**   0.497**
eclipse.team.cvs.core   0.655**   0.688**   0.347**   -0.013    0.407**   0.738**
eclipse.team.ui         0.532**   0.301*    0.152**   -0.003    0.382**   0.299*
eclipse.update.core     0.649**   0.499**   0.026     -0.007    0.364**   0.381**
Median                  0.563     0.428     0.143     0.031     0.373     0.365

Project                 DITc      DITi      LCOMc     LCOMi     WMCc      WMCi
Hibernate3              -0.098**  0.058     0.367**   0.103     0.617**   0.657**
Hibernate2              0.156**   -0.010    0.269**   0.006     0.455**   0.522**
eclipse.debug.core      0.065     0.232*    0.564     0.337     0.600**   0.597**
eclipse.debug.ui        0.473**   0.324**   0.626**   0.214*    -0.048    0.131
eclipse.jface           0.173**   0.103     0.563**   0.320**   0.754**   0.137
eclipse.jdt.debug       0.089     -0.049    0.237**   0.238**   0.668**   0.489**
eclipse.team.core       0.060     0.243     0.335**   0.400     0.561**   0.451**
eclipse.team.cvs.core   0.145     0.618**   0.477**   0.610**   0.753**   0.744**
eclipse.team.ui         0.039     -0.103*   0.493**   0.395**   0.595**   0.299*
eclipse.update.core     0.007     0.146     0.326**   0.482**   0.735**   0.729**
Median                  0.023     0.124     0.422     0.328     0.608     0.505

The first important result is that only the metrics CBOc and WMCc have a substantial correlation with the #SCC of Java classes, since their median correlation is greater than 0.5. In five projects out of ten WMCc exhibits a substantial correlation, and in three cases the correlation is strong. Similarly, the CBOc metric shows a substantial correlation in eight cases but no strong correlations. The other metrics do not show a significant correlation with the #SCC. The median correlation values of the C&K metrics applied to interfaces are significantly lower. Among the six metrics, WMCi exhibits the strongest correlation with #SCC. It shows three substantial and two strong correlations. CBOi shows a substantial correlation for three projects.

We applied the same correlation analysis to the interface complexity and usage metrics defined in Section 2.1.3. We report the results in Table 2.3.

Table 2.3: Spearman rank correlation between the interface complexity and usage metrics and #SCC (** marks significant correlations at α=0.01, * marks significant correlations at α=0.05, values in bold mark a significant correlation)

Project                 IUCi       Clientsi   Invocationsi   ClustersClientsi
Hibernate3              -0.601**   0.433**    0.544**        0.302**
Hibernate2              -0.373**   0.104      0.165          0.016
eclipse.debug.core      -0.682**   0.327**    0.317**        0.273**
eclipse.debug.ui        -0.508**   0.498**    0.497**        0.418**
eclipse.jface           -0.363**   0.099      0.205*         0.106**
eclipse.jdt.debug       -0.605**   0.471      0.495**        0.474**
eclipse.team.core       -0.475**   0.278      0.261          0.328*
eclipse.team.cvs.core   -0.819**   0.608**    0.557**        0.369
eclipse.team.ui         -0.618**   0.270      0.290          0.056
eclipse.update.core     -0.656**   0.656**    0.677**        0.606**
Median                  -0.605     0.327      0.317          0.328

Project                 ImplementingClassesi   Argumentsi   APPi      NOMi
Hibernate3              0.021                  0.668**      0.450**   0.657**
Hibernate2              0.054                  0.531**      0.288**   0.522**
eclipse.debug.core      0.070                  0.298**      0.125     0.597**
eclipse.debug.ui        0.139                  0.128        -0.022    0.131
eclipse.jface           0.063                  0.207*       0.110     0.137
eclipse.jdt.debug       0.223                  0.474**      0.361**   0.489**
eclipse.team.core       0.102                  0.241        0.138     0.451**
eclipse.team.cvs.core   -0.037                 0.614**      0.383     0.744**
eclipse.team.ui         -0.003                 0.144        -0.107*   0.299*
eclipse.update.core     -0.095                 0.433**      0.278     0.729**
Median                  0.063                  0.365        0.208     0.505

IUCi is the only metric that exposes a substantial correlation with the #SCC of interfaces. This metric shows a median correlation value of -0.605, having a substantial correlation in six projects and a strong correlation in one project. The negative correlation is due to the nature of the metric and means that the IUCi value is inversely proportional to the #SCC. More precisely, the stronger the external cohesion (values of IUCi close to one), the less frequently an interface changes. Concerning the other metrics, NOMi shows the strongest correlation with the #SCC. This result is not surprising since the more methods are declared in an interface, the more likely the interface is to change. Surprisingly, neither the number of clients nor the number of invocations results in a substantial correlation with the #SCC. The Argumentsi metric correlates only in three projects out of ten, while APPi shows a correlation only for one project. The ClustersClientsi metric shows a substantial correlation only in one project.
Therefore we conclude that the contribution of the number of methods shared among different clients is relevant for the correlation analysis. The weakest correlation is exhibited by the ImplementingClassesi metric.

Based on this result we can accept H1. Among the selected metrics, the IUCi metric exhibits the strongest correlation with the #SCC of interfaces. This result confirms our belief that the violation of the Interface Segregation Principle can impact the robustness of interfaces.

2.3.2 Prediction analysis

To test the research hypothesis H2, we analyzed whether the IUC metric can improve prediction models to classify interfaces into change-prone and not change-prone. We performed a series of classification experiments with three different machine learning algorithms. Prior work [Lessmann et al., 2008] showed that some machine learning techniques perform better than others, even though the authors state that performance differences among classifiers are marginal and not necessarily significant. For that reason we used the following classifiers: Support Vector Machine (LibSVM), Naive Bayes Network (NBayes), and Neural Nets (NN), as provided by the RapidMiner toolkit.

For each project, we binned the interfaces into change-prone and not change-prone using the median of the #SCC per project:

interface = change-prone, if #SCC > median; not change-prone, otherwise

First, we trained the machine learning algorithms using the following object-oriented metrics: CBO, RFC, LCOM, WMC. We selected these metrics because they showed the strongest correlation with the #SCC. We refer to this set of metrics as OO. Next, the training is performed using the OO metrics plus the IUC metric. We refer to this set of metrics as IUC.

In order to evaluate the classification models, we use the area under the curve statistic (AUC). In addition, we report the precision (P) and recall (R) of each model. AUC represents the probability that, when randomly choosing a change-prone and a not change-prone interface, the trained model assigns a higher score to the change-prone interface [Green and Swets, 1966]. We trained the models using 10-fold cross-validation and we considered models with an AUC value greater than 0.7 to have adequate classification performance [Lessmann et al., 2008].

Table 2.4 reports the results obtained with the NBayes learner. The results show that the median AUC is higher when we include the IUC metric. Moreover, for each project we obtained an adequate performance (AUC > 0.7) with the IUC set. Only for two projects (JDT Debug and Team UI) out of ten did we register a better performance for the OO metrics. Using the LibSVM (see Table 2.5) and the NN (see Table 2.6) classifiers we obtained similar results. With LibSVM, in eight projects the IUC metric set outperformed the OO metric set. Using NN, in seven projects out of ten the IUC metric set outperformed the OO metric set. The median values of the Precision and Recall show similar results for most of the projects. In several projects, however, the Precision and Recall are affected by the lack of information about interfaces (i.e., a high percentage of interfaces did not change during the observed time period). For instance, in the eclipse.jface project the number of interfaces that did not change is 81% (85 out of 105). The result is that the prediction model computed with the NN learner showed a Precision and Recall of 0.
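As an illustrative counterpart to this setup (a sketch only: scikit-learn stands in for the RapidMiner learners, and the feature matrix is randomly generated rather than taken from the studied projects), the median-based binning and the cross-validated AUC can be expressed as follows:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC

    rng = np.random.default_rng(42)
    X = rng.random((60, 5))                 # e.g., CBO, RFC, LCOM, WMC, IUC per interface
    scc = rng.permutation(60)               # hypothetical #SCC per interface
    y = (scc > np.median(scc)).astype(int)  # 1 = change-prone, 0 = not change-prone

    for name, clf in [("NBayes", GaussianNB()), ("SVM", SVC(probability=True))]:
        auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
        print(name, round(auc, 3))          # around 0.5 here, since the features are random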
Table 2.4: AUC, Precision and Recall using Naive Bayes Network (NBayes) with OO and IUC to classify interfaces into change-prone and not change-prone. Bold values highlight the best AUC value per project.

Project                 AUC_OO  P_OO   R_OO   AUC_IUC  P_IUC  R_IUC
eclipse.team.cvs.core   0.55    90     75     0.75     92.6   83.33
eclipse.debug.core      0.75    93     38     0.79     94.1   55.23
eclipse.debug.ui        0.66    63.81  40.33  0.72     69     41
hibernate2              0.745   78.62  32.02  0.807    84.22  85.33
hibernate3              0.835   88.61  57.92  0.862    82.8   56.31
eclipse.jdt.debug       0.79    69.67  47.67  0.738    77.71  45.38
eclipse.jface           0.639   50     28.33  0.734    53.85  48.33
eclipse.team.core       0.708   68.75  48.13  0.792    58.33  43.33
eclipse.team.ui         0.88    85     70     0.8      78.95  75
eclipse.update.core     0.782   67.49  46.5   0.811    81.19  61.67
Median                  0.747   74.14  47.08  0.791    80.07  55.77

To investigate whether the differences between the AUC values of the OO and IUC metric sets are significant, we performed the Related-Samples Wilcoxon Signed-Ranks Test. The results of the test show a significant difference at α=0.05 for the median AUC obtained with the Support Vector Machine (LibSVM). The difference between the medians obtained with NBayes and NN was not significant. Based on these results we can partially accept the hypothesis H2. The additional information provided by the IUC metric can improve the median performance of the prediction models by up to 9.2%. The Wilcoxon test confirmed this improvement for the LibSVM learner, however not for the NBayes and NN learners. This result highlights the need to analyze a wider dataset in order to provide a more precise validation.
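For illustration, the paired test can be reproduced with SciPy on the per-project AUC values of Table 2.5 (LibSVM); this sketch is not the statistics procedure used in the thesis, but it applies the same test to the same pairs.

    from scipy.stats import wilcoxon

    # AUC per project with the OO metric set and with OO + IUC (Table 2.5, LibSVM).
    auc_oo  = [0.692, 0.806, 0.71, 0.735, 0.64, 0.741, 0.607, 0.617, 0.74, 0.794]
    auc_iuc = [0.811, 0.828, 0.742, 0.708, 0.856, 0.82, 0.778, 0.608, 0.883, 0.817]

    stat, p = wilcoxon(auc_oo, auc_iuc)
    print(f"W = {stat}, p = {p:.3f}")  # p < 0.05: the improvement is significant for this learner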
2.3.3 Summary of Results

The results of our empirical study can be summarized as follows:

The IUC metric shows a stronger correlation with the #SCC of interfaces than the C&K metrics. With a median Spearman rank correlation of -0.605, the IUC shows a stronger correlation with the #SCC of Java interfaces than the C&K metrics. Only the WMC metric shows a substantial correlation in five projects out of ten, with a median value of 0.505; hence we accepted H1.

The IUC metric improves the performance of prediction models to classify change- and not change-prone interfaces. The models trained with the Support Vector Machine (LibSVM) and NBayes using the IUC metric set outperformed the models computed with the OO metric set in eight out of ten projects. Using the NN learner, the models of seven projects showed better performance with the IUC metric set. This improvement in performance is significant for the models trained with the Support Vector Machine (LibSVM), however not for the other two learners. Therefore, we partially accepted H2.

Table 2.5: AUC, Precision and Recall using Support Vector Machine (LibSVM) with OO and IUC to classify interfaces into change-prone and not change-prone. Bold values highlight the best AUC value per project.

Project                 AUC_OO  P_OO   R_OO   AUC_IUC  P_IUC  R_IUC
eclipse.team.cvs.core   0.692   55.61  54.2   0.811    90.91  83.33
eclipse.debug.core      0.806   82.61  46     0.828    89.47  52.5
eclipse.debug.ui        0.71    75     21.33  0.742    80.83  26.8
hibernate2              0.735   70     40     0.708    66.76  45
hibernate3              0.64    52     33.45  0.856    82.4   73.36
eclipse.jdt.debug       0.741   67.17  56.24  0.82     68.56  58.33
eclipse.jface           0.607   66.67  45     0.778    72     62
eclipse.team.core       0.617   66.67  45     0.608    58.33  45
eclipse.team.ui         0.74    73.33  70     0.883    83.33  75
eclipse.update.core     0.794   86.67  56.83  0.817    81     64.17
Median                  0.722   68.58  45.5   0.814    80.91  60.16

Table 2.6: AUC, Precision and Recall using Neural Nets (NN) with OO and IUC to classify interfaces into change-prone and not change-prone. Bold values highlight the best AUC value per project.

Project                 AUC_OO  P_OO   R_OO   AUC_IUC  P_IUC  R_IUC
eclipse.team.cvs.core   0.8     71.43  71.43  0.8      87.5   100
eclipse.debug.core      0.85    80     80     0.875    91.67  70
eclipse.debug.ui        0.748   79.33  44.67  0.766    78.05  58.5
hibernate2              0.702   53.85  50     0.747    50     45
hibernate3              0.874   83.17  69.52  0.843    78.49  69.05
eclipse.jdt.debug       0.77    73.39  63.24  0.762    80.5   58.05
eclipse.jface           0.553   0      0      0.542    0      0
eclipse.team.core       0.725   53.33  50     0.85     61.11  63.33
eclipse.team.ui         0.65    83.33  75     0.75     78.95  75
eclipse.update.core     0.675   70     58.33  0.744    78.33  56.67
Median                  0.736   72.41  60.78  0.764    78.41  60.69

2.4 Discussion

This section discusses the implications of our results and the threats to validity.

2.4.1 Implications of Results

The implications of the results of our study are interesting for researchers, quality engineers and, in general, for developers and software architects. The results of our study can be used by researchers interested in investigating software systems through the analysis of source code metrics. Studies based on source code metrics should take into account the nature of the entities that are measured. This can help to obtain more accurate results. Quality engineers should consider the possibility of enlarging their metric suite. In particular, the set of metrics should include specific metrics for measuring the cohesion of interfaces, such as the IUC metric. The C&K metrics are limited in measuring this cohesion of interfaces. Finally, developers and software architects should use the IUC metric to measure the conformance to the ISP. Our results showed that low IUC values, indicating a violation of the ISP, can increase the effort needed to maintain software systems.

2.4.2 Threats to Validity

We consider the following threats to validity: construct, internal, conclusion, external, and reliability validity.

Threats to construct validity concern the relationship between theory and observation. In our study, this threat can be due to the fact that we measured the metrics on the last version of the source code. Previous studies in the literature also used metrics collected from a single release (e.g., [Mauczka A., 2009], [Alshayeb and Li, 2003]). We mitigated this threat by collecting the metrics from the last release, since this release reflects the history of a system. Nevertheless, we believe that further validation with metrics measured over time (i.e., from different releases) is desirable.

Threats to internal validity concern factors that may affect an independent variable. In our study, the independent variables (values of the metrics and #SCC) are computed using deterministic algorithms (provided by the Understand and Evolizer tools) that always deliver the same results.

Threats to conclusion validity concern the relationship between the treatment and the outcome. Wherever possible, we used proper statistical tests to support our conclusions for the two research questions.
We used the Spearman correlation, which does not make any assumption about the underlying data distribution, to test H1. To address H2 we selected a set of three machine learning techniques. Further techniques can be applied to build predictive models, even though previous work [Lessmann et al., 2008] states that performance differences among classifiers are not significant.

Threats to external validity concern the generalization of our findings. In our study, this threat can be due to the fact that eight out of ten projects stem from the Eclipse platform. Therefore, the generalizability of our findings and conclusions should be verified for other projects. Nevertheless, we considered systems of different sizes and different roles in the Eclipse platform. Eclipse has been widely used by the scientific community and we can compare our findings with previous work. Moreover, we added two projects from Hibernate. As a matter of fact, any result from empirical work is in general threatened by the bias of its datasets [Menzies et al., 2007].

Threats to reliability validity concern the possibility of replicating our study and obtaining consistent results. The analyzed systems are open source systems and hence publicly available; the tools used to implement our approach (Evolizer and ChangeDistiller) are available from the reported web sites.

2.5 Related Work

In this section, we discuss previous work related to the usage of change prediction models to guide and understand the maintenance of software systems.

Rombach was among the first researchers to investigate the impact of software structure on maintainability aspects [Rombach, 1987], [Rombach, 1990]. He focused on comprehensibility, locality, modifiability, and reusability in a distributed system environment, highlighting the impact of the interconnectivity between components.

In the literature, several approaches used source code metrics to predict change-prone classes. Khoshgoftaar and Szabo [1994] presented an approach to predict maintenance measured as lines changed. They trained a regression model and a neural network using size and complexity metrics. Li and Henry used the C&K metrics to predict maintenance in terms of lines changed [Li and Henry, 1993]. The results show that these metrics can significantly improve a prediction model compared to traditional metrics. In 2009, Mauczka et al. measured the relationship of code changes with source-level software metrics [Mauczka A., 2009]. This work focuses on evaluating the C&K metrics suite against failure data. Zhou et al. [2009] used three size metrics to examine the potentially confounding effect of class size on the associations between object-oriented metrics and change-proneness. A further validation of the object-oriented metrics was provided by Alshayeb and Li [2003]. This work highlights the capability of those metrics in two different iterative processes. The results show that the object-oriented metrics are effective in predicting design efforts and source lines modified in the short-cycled agile process. On the other hand, they are ineffective in predicting the same aspects in the long-cycled framework process.

Object-oriented metrics were not only successfully applied for maintenance but also for defect prediction. Basili et al. [1996] empirically investigated the suite of object-oriented design metrics as predictors of fault-prone classes.
Subramanyam and Krishnan [2003] validated the C&K metrics suite in determining software defects. Their findings show that the effects of those metrics on defects vary across the data sets from two different programming languages, C++ and Java.

Besides the correlation between metrics and change-proneness, other design practices have been investigated in correlation with the number of changes. Khomh et al. [2009] investigated the impact of classes with code smells on change-proneness. They showed that classes with code smells are more change-prone than classes without, and that specific smells are more correlated than others. Penta et al. [2008] developed an exploratory study to analyze the change-proneness of design patterns and the kinds of changes occurring to classes involved in design patterns.

A complementary branch of change prediction is the detection of change couplings. Shirabad et al. [2003] used a decision tree to identify files that are change coupled. Zimmermann et al. [2004] developed the ROSE tool that suggests change-coupled source code entities to developers. They are able to detect coupled entities on a fine-grained level. Robbes et al. [2008] used fine-grained source changes to detect several kinds of distinct logical couplings between files. Canfora et al. [2010] use multivariate time series analysis and forecasting to determine whether a change that occurred on a software artifact was consequentially related to changes on other artifacts.

Our work is complementary to the existing work since (1) we explore limitations of the C&K metrics in predicting change-prone Java interfaces; and (2) we investigate the impact of the ISP violation, as measured by the IUC metric, on the change-proneness of interfaces.

2.6 Conclusions and Future Work

Interfaces declare contracts that are meant to remain stable during the evolution of a software system, while the implementation in concrete classes is more likely to change. This leads to a different evolutionary behavior of interfaces compared to concrete classes. In this chapter, we empirically investigated this behavior with the C&K metrics that are widely used to evaluate the quality of the implementation of classes and interfaces. The results of our study with eight Eclipse plug-in projects and two Hibernate projects showed that:

• The IUC metric shows a stronger correlation with #SCC than the C&K metrics when applied to interfaces (we accepted H1)
• The IUC metric can improve the performance of prediction models in classifying Java interfaces into change-prone and not change-prone (we partially accepted H2)

Our findings provide a starting point for studying the quality of interfaces and the impact of design violations, such as the ISP, on the maintenance of software systems. In particular, the acceptance of the hypothesis H1 implies that engineers should measure the quality of interfaces with specific interface cohesion metrics. Software designers and architects should follow the interface design principles, in particular the ISP. Furthermore, researchers should consider distinguishing between classes and interfaces when investigating models to estimate and predict change-prone interfaces.

In future work, we plan to evaluate the IUC metric with more open source and also commercial software systems. Furthermore, we plan to analyze the performance of our models taking into account releases (i.e., train the model with a previous release to predict the change-prone interfaces of the next release).
Another direction of future work is to apply our models to other types of systems, such as Component Based Systems (CBS) and Service Oriented Systems (SOS), in which interfaces play a fundamental role.

Figure 2.2: Overview of the data extraction and measurement process: (A) source code metrics computation, (B) SCC extraction, (C) correlation and prediction analysis.

Figure 2.3: Core of the FAMIX meta model [Tichelaar et al., 2000]: classes, subclasses and superclasses connected by Inheritance relationships, with Attribute and Method entities linked through BelongsToClass, Access, and Invocation relationships.

Figure 2.4: Box plots of the #SCC of interfaces and classes per project.

Figure 2.5: Box plots of the C&K metric values for classes and interfaces measured over all selected projects.

3. Change-Prone Java APIs

Antipatterns are poor solutions to design and implementation problems which are claimed to make object-oriented systems hard to maintain. Recent studies showed that classes with antipatterns change more frequently than classes without antipatterns. In this chapter, we detail these analyses by taking into account fine-grained source code changes (SCC) extracted from 16 Java open source systems. In particular we investigate: (1) whether classes with antipatterns are more change-prone (in terms of SCC) than classes without; (2) whether the type of antipattern impacts the change-proneness of Java classes; and (3) whether certain types of changes are performed more frequently in classes affected by a certain antipattern. Our results show that: (1) the number of SCC performed in classes affected by antipatterns is statistically greater than the number of SCC performed in classes with no antipattern; (2) classes participating in the three antipatterns ComplexClass, SpaghettiCode, and SwissArmyKnife are more change-prone than classes affected by other antipatterns; and (3) certain types of changes are more likely to be performed in classes affected by certain antipatterns; for example, API changes are likely to be performed in classes affected by the ComplexClass, SpaghettiCode, and SwissArmyKnife antipatterns. (This chapter was published in the 19th Working Conference on Reverse Engineering (WCRE 2012) [Romano et al., 2012].)

3.1 Data Collection
3.2 Empirical Study
3.3 Threats to Validity
3.4 Related Work
3.5 Conclusion and Future Work

Over the past two decades, maintenance costs have grown to more than 50% and up to 90% of the overall costs of software systems [Erlikh, 2000]. To help reduce the cost of maintenance, researchers have proposed several approaches to ease program comprehension and to identify change- and bug-prone parts of the source code of software systems.
These approaches include source code metrics (e.g., [Mauczka A., 2009]) and heuristics to assess the design of a software system (e.g., [Posnett et al., 2011; Khomh et al., 2012; Thummalapenta et al., 2010]).

Recently, Khomh et al. analyzed the impact of antipatterns on the change-proneness of software units [Khomh et al., 2012]. Antipatterns [Brown et al., 1998] are "poor" solutions to design and implementation problems, in contrast to design patterns [Gamma et al., 1995], which are "good" solutions to recurring design problems. Antipatterns are typically introduced in software systems by developers lacking adequate knowledge or experience in solving a particular problem, or by misapplying design patterns. Coplien and Harrison [2005] described an antipattern as "something that looks like a good idea, but which back-fires badly when applied". Previous studies, such as ours [Khomh et al., 2012], support this description by showing that software units, i.e., classes, affected by antipatterns are more likely to undergo changes than other units.

Existing literature proposes many different antipatterns, such as the 40 antipatterns described by Brown et al. [1998]. Furthermore, antipatterns occur in large numbers and affect large portions of some software systems. For instance, we found that more than 45% of the classes in the systems studied in [Khomh et al., 2012] contained at least one antipattern. Because of the diversity and the large number of antipatterns, support is needed, for instance for software engineers, to identify the risky classes affected by antipatterns that lead to errors and increase development and maintenance costs. For this, we need to obtain a deeper understanding of the change-proneness of different antipatterns and the types of changes occurring in classes affected by them. Providing this deeper understanding is the main objective of this chapter.

In this chapter we investigate the extent to which antipatterns can be used as indicators of changes in Java classes. The goal of this study is to investigate which antipatterns are more likely to lead to changes and which types of changes are likely to appear in classes affected by certain antipatterns. In contrast to existing studies (i.e., [Khomh et al., 2009, 2012]), the approach of our study is based on the analysis of fine-grained source code changes (SCC) mined from version control systems [Fluri et al., 2007; Gall et al., 2009]. This approach allows us to analyze the types of changes performed in classes affected by a particular antipattern, which was not possible with previous approaches. Moreover, we take into account the significance of the change types [Fluri and Gall, 2006] and we filter out irrelevant change types (e.g., changes to comments and copyrights), which account for more than 10% of all changes in our dataset.

Using the data of fine-grained source code changes and antipatterns, we aim at providing answers to the following three research questions:

• RQ1: Are Java classes affected by antipatterns more change-prone than Java classes not affected by any antipattern? This research question is aimed at replicating the previous study [Khomh et al., 2012] with fine-grained source code changes (SCC).

• RQ2: Are Java classes affected by certain types of antipatterns more change-prone than Java classes affected by other antipatterns – i.e., does the type of antipattern impact change-proneness?
The results from this research question can assist software engineers in identifying the risky classes affected by antipatterns.

• RQ3: Are particular types of changes more likely to be performed in Java classes affected by certain types of antipatterns? The results of this question will assist software engineers in prioritizing antipatterns that need to be resolved to prevent certain types of changes in a system, for example changes in the method declarations of a class exposing a public API.

To answer our research questions, we perform an empirical study with data extracted from 16 Java open-source software systems. Our main outcomes are:

• The number of SCC performed in classes affected by antipatterns is statistically greater than the number of SCC performed in other classes.

• Classes affected by ComplexClass, SpaghettiCode, and SwissArmyKnife are more change-prone than classes affected by other antipatterns.

• Changes in APIs are more likely to appear in classes affected by the ComplexClass, SpaghettiCode, and SwissArmyKnife antipatterns; methods are more likely to be added/deleted in classes affected by ComplexClass and SpaghettiCode; changes in executable statements are likely in AntiSingleton, ComplexClass, SpaghettiCode, and SwissArmyKnife; and changes in conditional statements and else-parts are more likely in classes affected by SpaghettiCode.

These findings suggest that software engineers should consider detecting and resolving instances of certain antipatterns to prevent certain types of changes. For instance, they should resolve instances of ComplexClass, SpaghettiCode, and SwissArmyKnife to prevent frequent changes in the APIs.

The remainder of this chapter is organized as follows. Section 3.1 describes the approach used to mine fine-grained source code changes and to detect Java classes participating in antipatterns. The study design and our findings are presented in Section 3.2. Section 3.3 discusses threats to the validity of the results of our study. Section 3.4 presents related work. We draw our conclusions and outline directions for future work in Section 3.5.

3.1 Data Collection

In this section, we describe the approach used to gather the data needed to perform our study. The data consist of the fine-grained source code changes (SCC) performed in each Java class along the history of the systems under analysis, and the type and number of antipatterns in which a class participates during its evolution. Figure 3.1 shows an overview of our approach consisting of 4 steps. In the following we describe each step in detail.

3.1.1 Importing Versioning Data

The first step concerns retrieving the versioning data for the Java classes from the version control systems (e.g., CVS, SVN or GIT). To perform this step we use the Evolizer Version Control Connector (EVCC) [Gall et al., 2009], belonging to the Evolizer tool set (http://www.evolizer.org/). For each class EVCC fetches and parses the log entries from the versioning repository. Per log entry, EVCC extracts the revision number, the revision timestamp, the name of the developer who checked in the revision, the commit message, the total number of lines modified, and the source code. This information, plus the source code of each Java class revision, is stored in the Evolizer repository.

3.1.2 Fine-Grained Source Code Changes Extraction

In the second step, ChangeDistiller [Fluri et al., 2007] is used to extract the fine-grained source code changes (SCC) between subsequent versions of a Java class.
ChangeDistiller first parses the source code of the two subsequent versions of a Java class and creates the corresponding Abstract Syntax Trees (ASTs). Second, the two ASTs are compared using a tree differencing algorithm that outputs the differences in the form of the tree-edit operations add, delete, update, and move. Next, each edit operation for a given node in the AST is annotated with the semantic information of the source code entity it represents and is classified as a specific change type based on a taxonomy of code changes [Fluri and Gall, 2006]. For instance, the insertion of a node representing an else-part in the AST is classified as an else-part insert change type. The result is a list of change types between two subsequent versions of each Java class, which is stored in the Evolizer repository.

Figure 3.1: Overview of the approach to extract fine-grained source code changes and antipatterns for Java classes: (1) versioning data importer, (2) fine-grained source code changes extractor, (3) antipatterns detector, (4) data preparation.

3.1.3 Antipatterns Detection

The third step of our approach is detecting the antipatterns that occur in Java classes. This is achieved by DECOR (Defect dEtection for CORrection) [Moha et al., 2008a,b, 2010]. DECOR provides a domain-specific language to describe antipatterns through a set of rules (e.g., lexical, structural, internal, etc.) and an algorithm to detect antipatterns in Java classes.

We use the predefined specifications of antipatterns and run DECOR on the different source code releases of the systems under analysis. Among the antipatterns detectable with DECOR we select the following twelve antipatterns:

• AntiSingleton: A class that provides mutable class variables, which consequently could be used as global variables.

• Blob: A class that is too large and not cohesive enough, that monopolises most of the processing, takes most of the decisions, and is associated to data classes.

• ClassDataShouldBePrivate (CDSBP): A class that exposes its fields, thus violating the principle of encapsulation.

• ComplexClass (ComplexC): A class that has (at least) one large and complex method, in terms of cyclomatic complexity and LOCs.

• LazyClass (LazyC): A class that has few fields and methods (with little complexity).

• LongMethod (LongM): A class that has a method that is overly long, in terms of LOCs.

• LongParameterList (LPL): A class that has (at least) one method with a too long list of parameters with respect to the average number of parameters per method in the system.

• MessageChain (MsgC): A class that uses a long chain of method invocations to realise (at least) one of its functionalities.

• RefusedParentBequest (RPB): A class that redefines inherited methods using empty bodies, thus breaking polymorphism.

• SpaghettiCode (Spaghetti): A class declaring long methods with no parameters and using global variables. These methods interact too much using complex decision algorithms. This class does not exploit, and prevents the use of, polymorphism and inheritance.

• SpeculativeGenerality (SG): A class that is defined as abstract but that has very few children, which do not make use of its methods.
• SwissArmyKnife (Swiss): A class whose methods can be divided into disjoint sets, thus providing many different unrelated functionalities.

Per release, we obtain a list of detected antipatterns for each Java class. We chose this subset of antipatterns because (1) they are well described by Brown et al. [1998], (2) they appear frequently in the different releases of the systems under analysis, and (3) they are representative of design and implementation problems with data, complexity, size, and the features provided by Java classes. Moreover, they allow us to compare our findings with those of a previous study [Khomh et al., 2012].

3.1.4 Data Preparation

In this step, the fine-grained source code changes are grouped and linked with the antipatterns. ChangeDistiller currently supports more than 40 types of source code changes that cover the majority of modifications to entities of object-oriented programming languages [Fluri and Gall, 2006]. We group these change types into five categories. Grouping them facilitates the analysis of the contingency between different types of changes and the interpretation of the results. The different categories are shown in Table 3.1 together with a short description of each category.

Table 3.1: Categories of source code changes [Giger et al., 2011].

Category  Description
API       Changes that involve the declaration of classes (e.g., class renaming and class API changes) and the signature of methods (e.g., modifier changes, method renaming, return type changes, changes of the parameter list).
oState    Changes that affect object states of classes (e.g., field addition and deletion).
func      Changes that affect the functionality of a class (e.g., method addition and deletion).
stmt      Changes that modify executable statements (e.g., statement insertion and deletion).
cond      Changes that alter condition expressions in control structures and the modification of else-parts.

Per Java class revision we count the number of changes for each category. Per Java class we compute the sum for each change type category over the Java class revisions between two subsequent releases k and k+1. Finally, for each Java class we add the number of antipatterns detected in the Java class at release k. We did not normalize the number of changes in classes by the number of lines of code, because we wanted our results to be comparable to previous studies. Furthermore, one of the previous studies [Khomh et al., 2012] has shown that size alone is not the dominating factor affecting the change-proneness of classes with antipatterns. The resulting list contains for each release k a list of Java classes with the number of detected instances of the twelve antipatterns at release k plus the number of fine-grained changes per change type category that occurred between the two subsequent releases k and k+1. The analyses performed on these data will be described in the next section.
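A minimal sketch of this grouping step follows; the change-type names are illustrative and cover only a small subset of the ChangeDistiller taxonomy, and the mapping shown here is an assumption made for the example.

    from collections import Counter, defaultdict

    CATEGORY = {
        "CLASS_RENAMING": "API", "RETURN_TYPE_CHANGE": "API", "PARAMETER_INSERT": "API",
        "ATTRIBUTE_INSERT": "oState", "ATTRIBUTE_DELETE": "oState",
        "METHOD_INSERT": "func", "METHOD_DELETE": "func",
        "STATEMENT_INSERT": "stmt", "STATEMENT_DELETE": "stmt",
        "CONDITION_EXPRESSION_CHANGE": "cond", "ELSE_PART_INSERT": "cond",
    }

    # Hypothetical fine-grained changes between releases k and k+1.
    changes = [("Foo", "METHOD_INSERT"), ("Foo", "STATEMENT_INSERT"),
               ("Foo", "STATEMENT_INSERT"), ("Bar", "ELSE_PART_INSERT")]

    per_class = defaultdict(Counter)
    for cls, change_type in changes:
        if change_type in CATEGORY:           # comment/copyright changes are left out
            per_class[cls][CATEGORY[change_type]] += 1

    print(dict(per_class))  # {'Foo': Counter({'stmt': 2, 'func': 1}), 'Bar': Counter({'cond': 1})}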
3.2 Empirical Study

The goal of this empirical study is to investigate the association between antipatterns and the change-proneness of Java classes. We performed the empirical study with 16 open-source systems from different domains, implemented in Java and widely used in the academic and industrial communities. Table 3.2 shows an overview of the dataset. #Files denotes the number of Java files in the last release, #Releases denotes the number of releases analyzed, #SCC denotes the number of fine-grained source code changes in the given time period (Time), and #SCC' denotes the number of fine-grained source code changes without counting changes performed in comments and copyrights. In total, changes due to comment and copyright modifications account for approximately 11% of all changes (i.e., 64021 out of 585614). This high percentage highlights the necessity of filtering out changes related to comments and copyrights in order to avoid biasing the results.

Table 3.2: Dataset used in our empirical study.

System                  #Files  #Releases  #SCC    #SCC'   Time [M,Y]
argo                    1716    9          97767   79414   Oct02-Mar09
hibernate2              494     10         26099   23638   Jan03-Mar11
hibernate3              970     20         37271   34440   Jun04-Mar11
eclipse.debug.core      188     12         7600    6555    May01-Mar11
eclipse.debug.ui        793     22         40551   37306   May01-Mar11
eclipse.jface           381     17         14072   11789   Sep02-Mar11
eclipse.jdt.debug       469     16         14983   13647   Jun01-Mar11
eclipse.team.core       172     6          2318    1790    Nov01-Mar11
eclipse.team.cvs.core   189     11         13070   11544   Nov01-Mar11
eclipse.team.ui         293     13         9787    8948    Nov01-Mar11
jabref                  1996    30         41665   37983   Dec03-Oct11
mylyn                   1288    17         67050   63601   Dec06-Jun09
rhino                   184     8          14795   13693   May99-Aug07
rapidminer              2061    4          9899    9277    Oct09-Aug10
vuze                    3265    29         119138  113570  Dec06-Apr10
xerces                  710     20         69549   54398   Dec00-Dec12

Table 3.3 shows the number of antipatterns detected by DECOR in the first and last releases of the analyzed systems. Basically, all systems contain instances of most of the 12 antipatterns. In particular, rapidminer and vuze contain the largest numbers of antipatterns, which is not surprising since they are also the largest systems in our sample set. According to our numbers, the antipatterns LongMethod (LongM), MessageChain (MsgC), and RefusedParentBequest (RPB) occur most frequently, while SpaghettiCode (Spaghetti), SpeculativeGenerality (SG), and SwissArmyKnife (Swiss) occur less frequently. Overall, the frequency of antipatterns and changes allows us to investigate the three research questions stated at the beginning of this chapter. The raw data used to perform our analysis are available on our web site (http://swerl.tudelft.nl/twiki/pub/DanieleRomano/WebHome/WCRE12rawData.zip). In the following, we state the hypotheses, explain the analysis methods, and report on the results for each research question.

3.2.1 Investigation of RQ1

The goal of RQ1 is to analyze the change-proneness of Java classes affected by antipatterns, compared to the change-proneness of classes not affected by antipatterns. We address RQ1 by testing the following two null hypotheses:

• H1a: The proportion of classes changed at least once between two releases is not different between classes that are affected by antipatterns and classes not affected by antipatterns.

• H1b: The distribution of SCC performed in classes between two releases is not different for classes affected by antipatterns and classes not affected by antipatterns.

Analysis Method

For investigating H1a we classify the Java classes of each system and release k as change-prone if there was at least one change between the two subsequent releases (k and k+1). Otherwise they are classified as not change-prone. This binary variable (we refer to it as change-proneness(k,k+1)) denotes the dependent variable. As independent variable we also use a binary variable that denotes whether a Java class is affected by at least one antipattern in a given release k. We refer to this variable as antipatterns(k).
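For illustration, the construction of these two binary variables and of the resulting 2x2 contingency table can be sketched as follows; the per-class records are hypothetical.

    # Hypothetical per-class data for one release k:
    # (class, number of antipatterns at k, #SCC between k and k+1)
    classes = [("A", 2, 14), ("B", 0, 0), ("C", 1, 0), ("D", 0, 3), ("E", 3, 7)]

    def contingency_table(classes):
        table = [[0, 0],   # row 0: antipatterns(k) = True  -> [changed, not changed]
                 [0, 0]]   # row 1: antipatterns(k) = False -> [changed, not changed]
        for _, num_antipatterns, scc in classes:
            row = 0 if num_antipatterns > 0 else 1
            col = 0 if scc > 0 else 1
            table[row][col] += 1
        return table

    print(contingency_table(classes))  # [[2, 1], [1, 1]]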
Table 3.3: Number of antipatterns detected with DECOR in the first and last releases of the analyzed systems (each cell reports first release - last release).

System                  #Antisingleton  #Blob    #CDBSP   #ComplexC  #LazyC  #LongM
argo                    352-3           26-169   136-51   56-195     16-53   172-354
hibernate2              113-104         34-37    33-17    30-37      5-3     56-72
hibernate3              176-232         52-75    31-50    58-8       9-12    121-194
eclipse.debug.core      1-22            7-14     0-12     1-8        0-9     5-22
eclipse.debug.ui        18-146          13-70    0-70     11-50      0-22    30-176
eclipse.jface           8-25            7-22     6-32     5-13       6-22    22-60
eclipse.jdt.debug       17-44           26-27    1-74     30-33      8-42    68-78
eclipse.team.core       1-12            2-7      1-10     1-5        0-4     8-33
eclipse.team.cvs.core   9-64            1-21     2-6      1-21       0-0     17-79
eclipse.team.ui         9-64            1-21     2-6      1-21       0-0     17-79
jabref                  12-139          10-136   8-400    9-144      1-126   21-365
mylyn                   4-70            43-101   61-174   43-83      2-16    132-300
rhino                   16-18           5-11     4-18     9-19       4-9     11-33
rapidminer              11-19           130-161  145-203  152-156    10-15   450-568
vuze                    179-145         199-282  189-270  138-193    29-215  381-473
xerces                  10-22           8-59     14-134   13-44      6-21    29-96

System                  #LPL     #MsgC    #RPB      #Spaghetti  #SG    #Swiss
argo                    195-334  130-197  65-513    22-1        9-34   3-4
hibernate2              34-19    51-101   93-97     15-4        2-1    0-0
hibernate3              48-74    157-236  123-202   9-12        3-8    3-9
eclipse.debug.core      0-18     3-6      0-11      0-1         1-1    0-2
eclipse.debug.ui        25-41    6-53     6-73      3-8         2-24   0-7
eclipse.jface           19-45    22-34    5-14      0-2         7-21   0-2
eclipse.jdt.debug       37-40    78-80    80-82     3-3         1-2    1-1
eclipse.team.core       0-26     1-15     0-7       0-1         3-10   0-0
eclipse.team.cvs.core   1-51     4-45     0-13      0-1         2-10   0-0
eclipse.team.ui         1-51     4-45     0-13      0-1         2-10   0-0
jabref                  2-169    2-332    2-295     1-16        0-17   0-1
mylyn                   43-66    98-135   34-165    2-0         12-35  1-1
rhino                   9-8      15-51    3-7       0-0         0-2    0-1
rapidminer              214-270  583-674  781-1068  1-1         12-28  3-1
vuze                    217-295  514-773  476-637   22-16       21-27  35-70
xerces                  16-130   19-99    3-37      2-1         5-4    10-11

Next, we use the Fisher's exact test [Sheskin, 2007] to test for each release k whether there is an association between antipatterns(k) and change-proneness(k,k+1) of classes. We then use the odds ratio (OR) [Sheskin, 2007] to measure the probability that a Java class will be changed between two releases (k and k+1) if it is affected by at least one antipattern in release k. The OR is defined as

OR = \frac{p/(1-p)}{q/(1-q)}

and it measures the ratio of the odds p of an event occurring in one group (i.e., the experimental group) to the odds q of it occurring in another group (i.e., the control group). In this case, the event is a change in a Java class, the experimental group is the set of classes affected by at least one antipattern, and the control group is the set of classes not affected by any antipattern. An OR equal to 1 indicates that a change can appear with the same probability in both groups. ORs greater than 1 indicate that the change is more likely to appear in a class affected by at least one antipattern. ORs less than 1 indicate that classes not affected by antipatterns are more likely to be changed.
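A small sketch of this per-release test follows, with SciPy in place of the statistics package used in the thesis; the 2x2 counts are invented for illustration.

    from scipy.stats import fisher_exact

    #                       changed  not changed
    table = [[120, 30],   # classes with at least one antipattern
             [ 80, 90]]   # classes without antipatterns

    odds_ratio, p_value = fisher_exact(table)
    print(f"OR = {odds_ratio:.2f}, p = {p_value:.4f}")  # OR = 4.50: changes are more likely
                                                        # in classes with antipatterns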
Concerning H1b we use the Mann-Whitney test to analyze for each release k whether there is a significant difference in the distributions of #SCC(k,k+1) performed in Java classes affected by antipatterns and in Java classes not affected by any antipattern. We apply the Cliff’s Delta d effect size [Grissom and Kim, 2005] to measure the magnitude of the difference. Cliff’s Delta estimates the probability that a value selected from one group is greater than a value selected from the other group. It ranges from +1, if all selected values from one group are higher than the selected values in the other group, to -1, if the reverse is true; 0 expresses two overlapping distributions. The effect size is considered negligible for d < 0.147, small for 0.147 ≤ d < 0.33, medium for 0.33 ≤ d < 0.47, and large for d ≥ 0.47 [Grissom and Kim, 2005]. We chose the Mann-Whitney test and Cliff’s Delta effect size because the values of the SCC per class are non-normally distributed. Furthermore, the different levels (small, medium, and large) facilitate the interpretation of the results. The Cliff’s Delta effect size has been computed with the orddom package4 available for the R environment.5
4 http://cran.r-project.org/web/packages/orddom/index.html
5 http://www.r-project.org/
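The following sketch restates the definition of Cliff’s Delta for two samples of per-class SCC counts. It is a plain illustration of the formula d = (#{x > y} - #{x < y}) / (n*m), not the orddom implementation used in the study.

```java
// Minimal sketch of Cliff's Delta over all pairs drawn from the two groups.
public final class CliffsDelta {

    public static double compute(int[] sccGroupA, int[] sccGroupB) {
        long greater = 0;
        long smaller = 0;
        for (int x : sccGroupA) {
            for (int y : sccGroupB) {
                if (x > y) greater++;
                else if (x < y) smaller++;
            }
        }
        long pairs = (long) sccGroupA.length * sccGroupB.length;
        // d close to +1: values in group A dominate; close to 0: overlapping distributions
        return (greater - smaller) / (double) pairs;
    }
}
```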
Results

The odds ratios computed to test H1a are summarized in Table 3.4, which shows for each system the total number of releases (#Releases) and the number of releases with a p-value for the Fisher’s exact test smaller than 0.01 and an odds ratio greater than 1 (OR>1). The results show that, except for three systems (eclipse.team.cvs.core, jabref and rhino), in most of the analyzed releases Java classes affected by at least one antipattern are more change-prone than other classes. In total, for 190 out of 244 releases (≈82%), classes affected by at least one antipattern are more change-prone. These results allow us to reject H1a and accept the alternative hypothesis that Java classes affected by antipatterns are more likely to be changed than classes not affected by them.

Table 3.4: Total number of releases (#Releases) and number of releases for which Fisher’s exact test and OR show a significant association between change-proneness and antipatterns in Java classes.

System | #Releases | Fisher p-value<0.01 & OR>1
argo | 9 | 9
hibernate2 | 10 | 10
hibernate3 | 20 | 19
eclipse.debug.core | 12 | 8
eclipse.debug.ui | 22 | 20
eclipse.jface | 17 | 16
eclipse.jdt.debug | 16 | 16
eclipse.team.core | 6 | 4
eclipse.team.cvs.core | 11 | 5
eclipse.team.ui | 13 | 9
jabref | 30 | 3
mylyn | 17 | 17
rhino | 8 | 2
rapidminer | 4 | 4
vuze | 29 | 29
xerces | 20 | 19
Total | 244 | 190

Table 3.5 shows the p-values of the Mann-Whitney tests and the values of the Cliff’s Delta d effect size for testing H1b. Only in 18 releases (≈7%) is there no significant difference (Mann-Whitney p-value ≥ 0.01) between the distributions of SCC performed in classes affected by antipatterns and in other classes. In the other 226 releases (≈93%) the difference is significant (Mann-Whitney p-value < 0.01). Concerning the effect size, we found that this difference is small (0.147≤d<0.33) in 102 releases (≈42%), medium (0.33≤d<0.47) in 26 releases (≈11%), large (d≥0.47) in 9 releases (≈4%), and negligible (d<0.147) in 89 releases (≈36%). Based on these results we reject H1b and accept the alternative hypothesis that in most cases Java classes with antipatterns undergo more changes during the next release than classes that are free of antipatterns.

Table 3.5: p-values of the Mann-Whitney (M-W) tests and Cliff’s Delta d showing the magnitude of the difference between the distribution of SCC in classes affected and not affected by antipatterns. The four effect-size columns count releases with M-W p-value<0.01; the last column counts releases with M-W p-value≥0.01.

System | #Releases | d≥0.47 | 0.33≤d<0.47 | 0.147≤d<0.33 | d<0.147 | M-W p≥0.01
argo | 9 | 0 | 1 | 6 | 2 | 0
hibernate2 | 10 | 0 | 1 | 6 | 3 | 0
hibernate3 | 20 | 0 | 3 | 7 | 10 | 0
eclipse.debug.core | 12 | 4 | 2 | 4 | 1 | 1
eclipse.debug.ui | 22 | 0 | 0 | 14 | 8 | 0
eclipse.jface | 17 | 0 | 0 | 12 | 4 | 1
eclipse.jdt.debug | 16 | 0 | 1 | 8 | 5 | 2
eclipse.team.core | 6 | 0 | 1 | 3 | 0 | 2
eclipse.team.cvs.core | 11 | 1 | 3 | 4 | 3 | 0
eclipse.team.ui | 13 | 1 | 4 | 3 | 1 | 4
jabref | 30 | 0 | 3 | 11 | 16 | 0
mylyn | 17 | 0 | 2 | 9 | 6 | 0
rhino | 8 | 2 | 0 | 0 | 0 | 6
rapidminer | 4 | 0 | 0 | 0 | 4 | 0
vuze | 29 | 0 | 2 | 7 | 20 | 0
xerces | 20 | 1 | 3 | 8 | 6 | 2
Total | 244 | 9 | 26 | 102 | 89 | 18

Based on these findings we can answer RQ1: Java classes affected by antipatterns are more change-prone than other classes. The results confirm the findings of the previous study [Khomh et al., 2012], this time taking into account the type of changes and filtering out non source code changes such as changes to indentations and comments.

3.2.2 Investigation of RQ2

The goal of RQ2 is to test whether certain antipatterns lead to more changes in Java classes than other antipatterns. The basic idea is to assist software engineers in identifying the most change-prone classes affected by antipatterns; these classes should be resolved first. We address RQ2 by testing the following null hypothesis:

• H2: The distribution of SCC is not different for classes affected by different antipatterns.

Analysis Method

As dependent variable we use the number of SCC performed in a class between two releases, #SCC(k,k+1). As independent variable we use a binary variable for each antipattern that denotes whether a class is affected by that particular antipattern. To test H2 we use the Mann-Whitney test and Cliff’s Delta d effect size over all releases of a system. We selected all releases per system since some releases had too few data points (e.g., there have been only 6 SCC between releases 1.6R3 and 1.6R4 of Rhino). The orddom package used to compute Cliff’s Delta d is not optimized for very big data sets. Therefore, in cases of systems with more than 5000 data points (i.e., more than 5000 classes experiencing changes over the revision history), we randomly sampled 5000 data points 30 times and computed the average of the obtained Cliff’s Delta values. This sampling allows us to compute Cliff’s Delta values for each system with a confidence level of 99% and a confidence interval of 0.004, which is a very precise estimate.
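The repeated random sampling described above can be sketched as follows. The sketch assumes the CliffsDelta helper from the earlier sketch, samples with replacement for simplicity (the exact sampling procedure of the study may differ), and uses the sample size (5000) and number of repetitions (30) mentioned in the text.

```java
// Minimal sketch of estimating Cliff's Delta on large systems via repeated sampling.
import java.util.Random;

public final class SampledCliffsDelta {

    public static double estimate(int[] groupA, int[] groupB, int sampleSize, int repetitions) {
        Random random = new Random(42); // fixed seed only to make the sketch reproducible
        double sum = 0;
        for (int i = 0; i < repetitions; i++) {
            sum += CliffsDelta.compute(sample(groupA, sampleSize, random),
                                       sample(groupB, sampleSize, random));
        }
        return sum / repetitions; // average of the sampled Cliff's Delta values
    }

    private static int[] sample(int[] values, int size, Random random) {
        int n = Math.min(size, values.length);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = values[random.nextInt(values.length)];
        }
        return out;
    }
}
```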
Table 3.6: Cliff’s Delta d effect sizes of cases for which Mann-Whitney shows a significant difference (p-value<0.01), or NA otherwise. Values marked with * denote the largest difference per system. For systems with more than 5000 data points we applied random sampling.

System | #AS | #Blob | #CDBSP | #ComplexC | #LazyC | #LongM
argo | 0.311 | 0.098 | 0.331 | 0.226 | -0.012 | 0.192
hibernate2 | 0.143 | 0.112 | 0.193 | 0.500 | NA | 0.149
hibernate3 | 0.171 | 0.086 | 0.064 | 0.386 | -0.110 | -0.172
eclipse.debug.core | 0.553 | 0.352 | 0.419 | 0.889* | NA | 0.544
eclipse.debug.ui | 0.169 | 0.299 | 0.150 | 0.454 | 0.147 | 0.231
eclipse.jface | 0.461* | NA | NA | 0.411 | NA | 0.266
eclipse.jdt.debug | 0.277 | 0.182 | 0.078 | 0.485 | 0.103 | 0.250
eclipse.team.core | 0.422 | 0.433 | NA | 0.581* | NA | 0.33
eclipse.team.cvs.core | 0.026 | 0.374 | 0.085 | 0.723* | NA | 0.331
eclipse.team.ui | 0.290 | 0.293 | 0.212 | 0.395 | NA | 0.265
jabref | 0.089 | 0.001 | 0.019 | 0.094 | NA | 0.072
mylyn | -0.020 | 0.150 | 0.177 | 0.388* | NA | 0.192
rhino | 0.276 | NA | 0.393 | 0.119 | NA | 0.067
rapidminer | 0.051 | 0.060 | -0.001 | 0.141 | NA | 0.051
vuze | 0.151 | 0.076 | 0.079 | 0.211 | NA | 0.121
xerces | 0.302 | 0.104 | 0.044 | 0.541 | NA | 0.269
Median | 0.223 | 0.131 | 0.117 | 0.403 | 0.045 | 0.211

System | #LPL | #MsgC | #RPB | #Spaghetti | #SG | #Swiss
argo | 0.148 | 0.248 | 0.035 | 0.354 | 0.030 | 0.528*
hibernate2 | 0.347 | 0.250 | -0.032 | 0.262 | NA | 0.654*
hibernate3 | 0.169 | 0.170 | 0.016 | 0.191 | NA | 0.662*
eclipse.debug.core | 0.691 | 0.289 | 0.435 | NA | 0.298 | 0.650
eclipse.debug.ui | 0.169 | 0.227 | NA | 0.377 | 0.009 | 0.514*
eclipse.jface | NA | 0.385 | NA | NA | NA | NA
eclipse.jdt.debug | 0.295 | 0.137 | 0.051 | 0.361 | NA | 0.919*
eclipse.team.core | 0.107 | 0.315 | NA | NA | 0.373 | NA
eclipse.team.cvs.core | 0.172 | 0.329 | NA | NA | NA | NA
eclipse.team.ui | 0.163 | 0.187 | NA | 0.642* | 0.183 | NA
jabref | 0.044 | 0.042 | -0.006 | 0.356 | NA | 0.966*
mylyn | 0.232 | 0.228 | 0.063 | NA | NA | NA
rhino | 0.025 | 0.100 | NA | 0.928* | NA | NA
rapidminer | 0.080 | 0.051 | -0.002 | NA | NA | 0.600*
vuze | 0.106 | 0.140 | -0.021 | 0.308* | 0.028 | 0.213
xerces | 0.327 | 0.122 | 0.036 | 0.153 | 0.307 | 0.565*
Median | 0.169 | 0.207 | 0.025 | 0.355 | 0.183 | 0.625

Results

Table 3.6 shows the values of the Cliff’s Delta d effect size for which the p-value of the Mann-Whitney test is significant (p-value<0.01). NA denotes a Mann-Whitney p-value greater than 0.01, for which Cliff’s Delta is not computed. The results of the Mann-Whitney tests show that, except for LazyClass and SpeculativeGenerality (SG), the distributions of SCC performed in classes affected by a specific antipattern are different from the distributions of SCC performed in classes not affected by that antipattern. According to the median values for Cliff’s Delta shown in the last row of Table 3.6, this difference is large for SwissArmyKnife (Swiss), medium for 2 antipatterns (0.33≤d<0.47), small for 5 antipatterns (0.147≤d<0.33), and negligible for 4 antipatterns. Note that for classes affected by LazyClass and SG the Mann-Whitney test was significant in only 4 and 7 systems, respectively. Looking at the marked values we can see that classes affected by the ComplexClass (ComplexC), SpaghettiCode (Spaghetti) and SwissArmyKnife (Swiss) antipatterns are more change-prone than classes affected by any other antipattern. More specifically, in 8 systems out of 16 the Cliff’s Delta effect size is highest for classes affected by SwissArmyKnife. In 4 systems the Cliff’s Delta effect size is highest for classes affected by ComplexClass. In the other 3 systems the highest effect size is for classes affected by SpaghettiCode. Only in one system, namely eclipse.jface, does the Antisingleton antipattern show the highest value for Cliff’s Delta. Based on these results we reject H2 and conclude that, among all classes, the classes affected by the ComplexClass, SpaghettiCode, and SwissArmyKnife antipatterns are the most change-prone. These results detail the findings in [Khomh et al., 2012] by highlighting three antipatterns that are more change-prone than the other antipatterns.
Moreover, the new findings allow us to advise software engineers to focus on detecting instances of these three change-prone antipatterns and fixing them first.

3.2.3 Investigation of RQ3

To address RQ3, we analyze the relationship between different antipatterns and different types of changes. The goal is to further assist software engineers by verifying whether a particular type of change is more likely to be performed in classes affected by a specific antipattern. This knowledge can help engineers to avoid or fix certain antipatterns leading to changes that impact large parts of the rest of a software system, such as changes in the method declarations of a class that exposes a public API. We answer RQ3 by testing the following null hypothesis:

• H3: The distributions of different types of SCC performed in classes affected by different antipatterns are not different.

Analysis Method

To test H3 we categorize the changes mined with ChangeDistiller into five different categories as listed in Table 3.1. As dependent variables we use the change type categories representing the number of SCC that fall in each category. As for H2, the independent variables are the set of binary variables that denote whether a class is affected by a specific antipattern or not. We test the difference in the distributions of SCC per category using the Mann-Whitney test and compute the magnitude of the difference with the Cliff’s Delta d effect size. In order to have enough data about each change type category we use the data from all systems as input for this analysis. Similar to H2, we use the random sampling approach for computing Cliff’s Delta and report the mean effect size of the 30 random samples.

Results

Table 3.7 lists the results of this analysis. The marked values denote differences that are at least small according to Cliff’s Delta. They show that changes in the class and method declarations (API) are more likely to appear in classes affected by the ComplexClass, SpaghettiCode and SwissArmyKnife antipatterns. Changes in the functionalities (func) are likely in classes affected by the ComplexClass and SpaghettiCode antipatterns. Changes in the execution statements (stmt) are likely to appear in classes affected by the Antisingleton, ComplexClass, SpaghettiCode and SwissArmyKnife antipatterns. Finally, changes in the condition expressions and else-parts (cond) are more frequent in classes affected by the SpaghettiCode antipattern. Based on these results we reject H3 and conclude that classes affected by different antipatterns undergo different types of changes.

3.2.4 Manual Inspection

To further highlight the relationship between antipatterns and change-proneness we manually inspected several classes affected by antipatterns that have been resolved. For these classes we analyzed the number of changes before and after the removal of the antipatterns. The analysis clearly shows that when classes are affected by an antipattern they undergo a considerably higher number of changes. Consider, for instance, the class org.apache.xerces.StandardParserConfiguration from the Xerces system. This class was affected by the ComplexClass antipattern until release 2.0.2. Before release 2.0.2, the class underwent on average 64.5 changes per release. The average number of changes decreased to 5.2 after the antipattern was removed. Furthermore, the average number of API changes decreased from 2 to 0.07.
Table 3.7: Cliff’s Delta d effect sizes of cases for which Mann-Whitney shows a significant difference (p-value<0.01), or NA otherwise. Values marked with * denote an effect size that is at least small (d > 0.147).

Antipattern | API | oState | func | stmt | cond
#Antisingleton | 0.131 | 0.080 | 0.084 | 0.157* | 0.080
#Blob | 0.077 | 0.048 | 0.057 | 0.077 | 0.035
#CDBSP | 0.038 | 0.031 | 0.019 | 0.051 | 0.028
#ComplexC | 0.213* | 0.144 | 0.153* | 0.252* | 0.138
#LazyC | -0.043 | NA | -0.040 | NA | -0.020
#LongM | 0.073 | 0.042 | 0.053 | 0.140 | 0.059
#LPL | 0.095 | 0.060 | 0.076 | 0.146 | 0.081
#MsgC | 0.075 | 0.045 | 0.054 | 0.120 | 0.058
#RPB | 0.001 | -0.001 | -0.002 | 0.100 | 0.001
#Spaghetti | 0.207* | 0.126 | 0.149* | 0.308* | 0.178*
#SG | 0.029 | -0.001 | NA | 0.007 | 0.100
#Swiss | 0.150* | 0.109 | 0.142 | 0.245* | 0.136

As another example, consider the views.memory.AddMemoryBlockAction class from the eclipse.debug.ui system. This class was affected by the SpaghettiCode antipattern until release 3.2. The average number of changes decreased from 79.83 to 1.5 after release 3.2. Moreover, the average number of cond changes decreased from 2.67 to 0.1.

3.2.5 Implications of Results

In summary, we see two main implications of our results that concern software engineers and researchers. Concerning researchers, our results provide a deeper insight into the effects of antipatterns on the change-proneness of Java classes. First, we confirmed the results from [Khomh et al., 2012], this time taking into account the type of changes (see RQ1). Second, we identified three antipatterns, namely ComplexClass, SpaghettiCode and SwissArmyKnife, that lead to change-prone classes (see RQ2). Third, and most importantly, we showed that certain antipatterns lead to certain types of changes (see RQ3). This helps to focus our research on a sub-set of antipatterns, namely the most change-prone ones.

Regarding software engineers, the results of our study have several implications. In particular, the results for RQ2 and RQ3 show that software engineers should focus on detecting and resolving the three antipatterns ComplexClass, SpaghettiCode and SwissArmyKnife. Classes affected by these antipatterns turned out to be the most change-prone ones; therefore, resolving instances of these antipatterns helps to prevent changes in their APIs. API changes in particular should be prevented because they can have a significant impact on the implementation of the other parts of a software system. For instance, consider the scenario in which APIs are made available through web services. The responsible software engineers want to assure the robustness of these classes to minimize the possibility of breaking the clients of the web services. Based on the results of our study they can use DECOR to detect instances of the ComplexClass, SpaghettiCode and SwissArmyKnife antipatterns in the set of API classes. These are the antipatterns they should resolve first in order to reduce the probability that APIs are changed and, hence, that clients are broken.

3.3 Threats to Validity

This section discusses the threats to validity that can affect the results of our empirical study.

Threats to construct validity concern the relationship between theory and observation. In our study, this threat can be due to the fact that we considered SCC performed between two subsequent releases. However, the effects of antipatterns can manifest themselves after the next immediate release, whenever the class affected by antipatterns needs to be changed.
We mitigated this threat by testing all the hypotheses taking into account all the SCC performed after a release for which we obtained similar results. Threats to internal validity concern factors that may affect an independent variable. In our study, both the independent and dependent variables are computed using deterministic algorithms (implemented in ChangeDistiller and DECOR) delivering always the same results. Threats to conclusion validity concern the relationship between the treatment and the outcome. To mitigate these threats our conclusions have been supported by proper statistical tests, in particular by non-parametric tests that do not require any assumption on the underlying data distribution. Threats to external validity concern the generalization of our findings. Every result obtained through empirical studies is threatened by the bias of their datasets [Menzies et al., 2007]. To mitigate these threats we tested our hypotheses over 16 open-source systems of different size and from different domains. Threats to reliability validity concern the possibility of replicating our study and obtaining consistent results. We mitigated these threats by providing all the details necessary to replicate our empirical study. The systems under analysis are open-source and the source code repositories are publicly available. 3.4. Related Work 59 Moreover, we published on-line the raw data to allow other researches to replicate our study and to test other hypotheses on our dataset. 3.4 Related Work In this section, we discuss the related literature on antipatterns in relation to software evolution. Code Smells/Antipatterns Detection Techniques. The first book on antipatterns in object-oriented development was written in Webster [1995]. The book made several contributions on conceptual, political, coding, and qualityassurance problems. Fowler [1999] defined 22 code smells, suggesting where developers should apply refactorings. Mantyla [2003] and Wake [2003] proposed classifications for code smells. Brown et al. [1998] described 40 antipatterns, including the Blob, the Spaghetti Code, and the MessageChain. These books provide in-depth views on heuristics, code smells, and antipatterns, and are the basis of all approaches to detect (semi-)automatically code smells and antipatterns, such as DECOR [Moha et al., 2010] used in this study. Several approaches to specify and detect code smells and antipatterns exist in the literature. They range from manual approaches, based on inspection techniques [Travassos et al., 1999], to metric-based heuristics [Marinescu, 2004; Munro, 2005; Oliveto et al., 2010], using rules and thresholds on various metrics or Bayesian belief networks [Khomh et al., 2011]. Some approaches for complex software analysis use visualization [Dhambri et al., 2008; Simon et al., 2001]. Although visualization is sometimes considered as an interesting compromise between fully automatic detection techniques, which are efficient but loose track of the context, and manual inspections, which are slow and subjective [Langelier et al., 2005], visualization requires human expertise and is thus time-consuming. Sometimes, visualization techniques are used to present the results of automatic detection approaches [Lanza and Marinescu, 2006; van Emden and Moonen, 2002]. This previous work significantly contributed to the specification and detection of antipatterns. The approach used in this study, DECOR, builds on this previous work. Code Smells/Antipatterns and Software Evolution. 
Deligiannis et al. [Ignatios et al., 2003, 2004] proposed the first quantitative study of the relation between antipatterns and software quality. They performed a controlled experiment with 20 students on two software systems to understand the impact of Blobs on the understandability and maintainability of software systems. The results of their study suggested that Blob classes considerably affect the evolution of design structures, in particular the use of inheritance. Bois et al. [2006] showed that the decomposition of Blob classes into a number of collaborating classes using refactorings can improve comprehension. Abbes et al. [2011] conducted three experiments, with 24 subjects each, to investigate whether the occurrence of antipatterns affects the understandability of systems by developers during comprehension and maintenance tasks. They concluded that although the occurrence of one antipattern does not significantly decrease developers’ performance, a combination of two antipatterns significantly impedes developers’ performance during comprehension and maintenance tasks. Li and Shatnawi [2007] investigated the relationship between the probability of a class to be faulty and some antipatterns based on three versions of Eclipse and showed that classes with the antipatterns Blob, Shotgun Surgery, and Long Method have a higher probability to be faulty than other classes. Olbrich et al. [2009] analyzed the historical data of Lucene and Xerces over several years and concluded that classes with the antipatterns Blob and Shotgun Surgery have a higher change frequency than other classes, with Blob classes featuring more changes. However, they did not investigate the kinds of changes performed on the antipatterns. Using Azureus and Eclipse, we investigated the impact of code smells on the change-proneness of classes and showed that, in general, the likelihood for classes with code smells to change is very high [Khomh et al., 2009]. In [Khomh et al., 2012] we also investigated the relation between the presence of antipatterns and the change- and fault-proneness of classes. We found that classes participating in antipatterns are significantly more likely to be subject to changes and to be involved in fault-fixing changes than other classes. Furthermore, we also investigated the kind of changes, namely structural and non-structural changes, experienced by classes with antipatterns. Structural changes are changes that alter a class interface, while non-structural changes are changes to method bodies. We found that in general structural changes are more likely to occur in classes participating in antipatterns. The main difference with this work is that we detailed the changes into 40 types of source code changes classified into 5 change type categories. This detailed information about changes allowed us to analyze which antipatterns lead to which types of source code changes. Also, this work was performed with more systems, namely 16, compared to previous work which was done with only 4 systems.

3.5 Conclusion and Future Work

Antipatterns have been defined to denote poor solutions to design and implementation problems. Previous studies have shown that classes affected by antipatterns are more change-prone than other classes. In this chapter we provide a deeper insight into which antipatterns lead to which types of changes in Java classes.
We analyzed the change-proneness of these classes taking into account 40 types of fine-grained source code changes (SCC) extracted from the version control repositories of 16 Java open-source systems. Our results show that: • Classes affected by antipatterns change more frequently along the evolution of a system, confirming previous findings (see RQ1). • Classes affected by the ComplexClass, SpaghettiCode and SwissArmyKnife antipatterns are more likely to be changed than classes affected by other antipatterns (see RQ2). • Certain antipatterns lead to certain types of source code changes, such as API changes are more likely to appear in classes affected by the ComplexClass, SpaghettiCode and SwissArmyKnife antipatterns (see RQ3). Our results have several implications on software engineers and researchers. Regarding researchers our results suggest to focus our efforts on understanding a subset of antipatterns that lead to change-prone classes or changes with a high impact on the other parts of a software system. Concerning software engineers, our results provide strong evidence to use antipatterns detection tools, such as DECOR, to detect and resolve ComplexClass, SpaghettiCode and SwissArmyKnife antipatterns. Resolving them shows to be beneficial in terms of preventing source code changes, such as API changes, that impact other parts of a system. In future work, we plan to perform a more extended qualitative analysis of antipatterns. We also plan to enlarge our data set and analyze industrial software systems. Another direction of future work is to analyze the types of changes performed when antipatterns are introduced and when they are resolved. These analysis are needed to further estimate the development and maintenance costs caused by antipatterns. . 4 Fine-Grained WSDL Changes In the service-oriented paradigm web service interfaces are considered contracts between web service consumers and providers. However, these interfaces are continuously evolving over time to satisfy changes in the requirements and to fix bugs. Changes in a web service interface typically affect the systems of its consumers. Therefore, it is essential for consumers to recognize which types of changes occur in a web service interface in order to analyze the impact on his/her systems. In this chapter we propose a tool called WSDLDiff to extract fine-grained changes from subsequent versions of a web service interface defined in WSDL. In contrast to existing approaches, WSDLDiff takes into account the syntax of WSDL and extracts the WSDL elements affected by changes and the types of changes. With WSDLDiff we performed a study aimed at analyzing the evolution of web services using the fine-grained changes extracted from the subsequent versions of four real world WSDL interfaces. The results of our study show that the analysis of the fine-grained changes helps web service consumers to highlight the most frequent types of changes affecting a WSDL interface. This information can be relevant for web service consumers who want to assess the risk associated to the usage of web services and to subscribe to the most stable ones.1 4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2 WSDLDiff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.3 Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.4 Conclusion & Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 
78

Over the last decades, the evolution of software systems has been studied in order to analyze and enhance the software development and maintenance processes. Among other applications, the information mined from the evolution of software systems has been applied to investigate the causes of changes in software components [Khomh et al., 2009; Penta et al., 2008].
1 This chapter was published in the 19th International Conference on Web Services (ICWS 2012) [Romano and Pinzger, 2012].

Software engineering researchers have developed several tools to extract information about changes from software artifacts [Fluri et al., 2007; Tsantalis et al., 2011; Xing and Stroulia, 2005a] and to analyze their evolution. In service-oriented systems, understanding and coping with changes is even more critical and challenging because of the distributed and dynamic nature of services [Papazoglou, 2008]. In fact, service providers do not necessarily know the service consumers and how changes to a service can impact the existing service clients. For this reason service interfaces are considered contracts between providers and consumers and they should be as stable as possible [Erl, 2007]. On the other hand, services are continuously evolving to satisfy changes in the requirements and to fix bugs. Recognizing the types of changes is fundamental for understanding how a service interface evolves over time. This can help service consumers to quantify the risk associated with the usage of a particular service and to compare the evolution of different services with similar features. Moreover, detailed information about changes allows software engineering researchers to analyze the causes of changes in a service interface.

In order to analyze the evolution of WSDL2 interfaces, Fokaefs et al. [2011] propose a tool called VTracker. This tool is based on the Zhang-Shasha tree edit distance [Zhang and Shasha, 1989], comparing WSDL interfaces as XML3 documents. However, VTracker does not take into account the syntax of WSDL interfaces. As a consequence, their approach outputs only the percentage of added, changed and removed XML elements. We argue that this information is inadequate to analyze the evolution of WSDL interfaces without manually checking the types of changes and the WSDL elements affected by changes. Moreover, their approach of transforming a WSDL interface into a simplified representation can lead to the detection of multiple changes when there has been only one change.

In this chapter we propose a tool called WSDLDiff that compares subsequent versions of WSDL interfaces to automatically extract the changes. In contrast to VTracker, WSDLDiff takes into account the syntax of WSDL and of XSD,4 used to define data types in a WSDL interface. In particular, WSDLDiff extracts the types of the elements affected by changes (e.g., Operation, Message, XSDType) and the types of changes (e.g., removal, addition, move, attribute value update). We refer to these changes as fine-grained changes. The fine-grained changes extraction process of WSDLDiff is based on the UMLDiff algorithm [Xing and Stroulia, 2005a] and has been implemented on top of the Eclipse Modeling Framework (EMF).5 With WSDLDiff we performed a study aimed at analyzing the evolution of web services using the fine-grained changes extracted from subsequent versions of four real world WSDL interfaces.
2 http://www.w3.org/TR/wsdl
3 http://www.w3.org/XML/
4 http://www.w3.org/XML/Schema
We address the following two research questions: • RQ1: What is the percentage of added, changed and removed elements of a WSDL interface? • RQ2: Which types of changes are made to the elements of a WSDL interface? The study shows that different WSDL interfaces are affected by different types of changes highlighting how they are maintained with different strategies. While in one case mainly Operations were added continuously, in the other three cases the data type specifications were the most affected by changes. Moreover, we found that in all four WSDL interfaces under analysis there is a type of change that is predominant. From this information web service consumers can be aware of the frequent types of changes when subscribing to a web service and they can compare the evolution of web services that provide similar features in order to subscribe to the most stable web service. The remainder of this chapter is organized as follows. In Section 4.1 we report the related work and we discuss the main differences with our work. Section 4.2 describes the WSDLDiff tool and the process to extract fine-grained changes implemented into it. The study and results are presented in Section 4.3. We draw our conclusions and outline directions for future work in Section 4.4. 4.1 Related Work Fokaefs et al. [2011] analyzed the evolution of web services using a tool called VTracker. This tool is based on the Zhang-Shasha’s tree edit distance algorithm [Zhang and Shasha, 1989], which calculates the minimum edit distance between two trees. In this study the WSDL interfaces are compared as XML files. Specifically the authors created an intermediate XML representation to reduce 5 http://www.eclipse.org/modeling/emf/ 66 Chapter 4. Fine-Grained WSDL Changes the verbosity of the WSDL specification. In this simplified XML representation, among other transformations, the authors trace the references between messages parameters (Parts) and data types (XSDTypes) and they replace the references with the data types themselves. The output of their analysis consists of the percentage of added, changed and removed elements among the XML models of two WSDL interfaces. There are two main differences between our work and the approach proposed by Fokaefs et al. First, we compute the changes between WSDL models taking into account the syntax of WSDL and XSD and, hence, extracting the type of the elements affected by changes (e.g., Operation, Message, XSDType) and the types of changes (e.g., removal, addition, move, attribute value update). For example, WSDLDiff extracts differences in the order of the elements only if it is relevant, such as changes in the order of Parts defined in a Message. Our approach is aware of irrelevant order changes, such as changes in the order of XSDTypes defined in the WSDL types definition. This allows us to analyze the evolution of a WSDL interface only looking at the changes without manually inspecting the XML coarse-grained changes. Second, WSDLDiff does not replace the references to data types with the data types themselves. This transformation can lead to the detection of a change in a data type multiple times while there has been only one change. Wang and Capretz [2009] proposed an impact analysis model based on service dependency. The authors analyze the service dependencies graph model, service dependencies and the relation matrix. Based on this information they infer the impact of the service evolution. However, they do not propose any technique to analyze the evolution of web services. 
Aversano et al. [2005] proposed an approach to understand how relationships between sets of services change across service evolution. Their approach is based on formal concept analysis. They used the concept lattice to highlight hierarchy relationships and to identify commonalities and differences between services. While the work proposed by Aversano et al. consists of extracting relationships among services, our work focuses on the evolution of single web services using fine-grained changes. As future work the two approaches can be integrated to correlate different types of changes with the different relationships.

In the literature several approaches have been proposed to measure the similarity of web services (e.g., [Liu et al., 2010] [Plebani and Pernici, 2009]). However, these approaches compute the similarity amongst WSDL interfaces to assist the search and classification of web services and not to analyze their evolution.

Concerning model differencing techniques, the approach proposed by Xing et al. [Xing and Stroulia, 2005a] [Xing and Stroulia, 2005b] is most relevant for our work. In fact, their algorithm to infer differences among UML6 diagrams has been implemented by EMF Compare,7 which we used to implement our tool WSDLDiff. The authors proposed the UMLDiff algorithm for detecting structural changes between the designs of subsequent versions of object oriented systems, represented through UML diagrams. This algorithm has later been adapted in EMF Compare to compare models conforming to any arbitrary metamodel and not only UML models [Brun and Pierantonio, 2008].

Several approaches have been proposed to classify changes in service interfaces. For instance, Feng et al. [2011] and Treiber et al. [2008] have proposed approaches to classify the changes of web services taking into account their impact on different stakeholders. These classifications can easily be integrated in our tool to classify the different fine-grained changes extracted along the evolution of a web service.

As can be deduced from the overview of related work, there currently does not exist any tool for extracting fine-grained changes amongst web services. In this chapter, we present such a tool based on the UMLDiff algorithm [Xing and Stroulia, 2005a].

4.2 WSDLDiff

In this section, we illustrate the WSDLDiff tool used to extract the fine-grained changes between two versions of a WSDL interface. Since the tool is based on the Eclipse Modeling Framework, we first present an overview of this framework and then we describe the fine-grained changes extraction process implemented by WSDLDiff. A first prototype of WSDLDiff is available on our web site.8

4.2.1 Eclipse Modeling Framework

The Eclipse Modeling Framework (EMF) is a modeling framework that lets developers build tools and other applications based on a structured data model. This framework provides tools to produce a set of Java classes from a model specification and a set of adapter classes that enable viewing and editing of the models. The models are described by meta models called Ecore. As part of the EMF project, there is the EMF Compare plug-in. It provides comparison and merge facilities for any kind of EMF Models through a framework that is easy to use and to extend to compare instances of EMF Models.
6 http://www.uml.org/
7 http://www.eclipse.org/emf/compare/
8 http://swerl.tudelft.nl/twiki/pub/DanieleRomano/WebHome/WSDLDiff.zip
The Eclipse community already provides an Ecore meta model for WSDL interfaces, including a meta model for XSD, and tools to parse them into EMF Models. We use these features to parse and extract changes between WSDL interfaces as described in the following.

4.2.2 Fine-Grained Changes Extraction Process

Figure 4.1 shows the process implemented by WSDLDiff to extract fine-grained changes between two versions of a WSDL interface.

[Figure 4.1: The process implemented by WSDLDiff to extract fine-grained changes between two versions of a WSDL interface. The two WSDL versions are parsed (org.eclipse.wst.wsdl, org.eclipse.xsd) into WSDL Model1 and WSDL Model2 (Stage A), transformed by the XSD Transformer into WSDL Model1’ and WSDL Model2’ (Stage B), matched by the Matching Engine (org.eclipse.compare.match) into a Match Model (Stage C), and compared by the Differencing Engine (org.eclipse.compare.diff) into a Diff Model (Stage D).]

The process consists of four stages:

• Stage A: in the first stage we parse the WSDL interfaces using the APIs provided by the org.eclipse.wst.wsdl and org.eclipse.xsd projects. The output of this stage consists of the two EMF Models (WSDL Model1 and WSDL Model2) corresponding to the two WSDL interfaces taken as input (WSDL Version1 and WSDL Version2).

• Stage B: in this stage we transform the EMF Models corresponding to the XSD (contained by the WSDL models) in order to improve the accuracy of the fine-grained changes extraction process, as will be shown in Subsection 4.2.4. The output of this stage consists of the transformed models (WSDL Model1’ and WSDL Model2’).

• Stage C: in the third stage we use the Matching Engine provided by the EMF Compare framework to detect the nodes that match in the two models.

• Stage D: the Match Model produced by the Matching Engine is then used to detect the differences between the two WSDL models under analysis. This task is accomplished by the Differencing Engine, also provided by EMF Compare. The output of this stage is a tree of structural changes that reports the differences between the two WSDL models. The differences are reported in terms of additions, removals, moves and modifications of each element specified in the WSDL and in the XSD.

In the next subsection we first illustrate the strategies behind EMF Compare, describing the matching (Stage C) and differencing (Stage D) stages, and then we describe the XSD transformation (Stage B).

4.2.3 Eclipse EMF Compare

The comparison facility provided by EMF Compare is based on the work developed by Xing and Stroulia [2005a]. This work has been adapted to compare generic EMF Models instead of UML models as initially developed by Xing. The comparison consists of two phases: (1) the matching phase (Stage C in our approach) and (2) the differencing phase (Stage D in our approach). The matching phase is performed by computing a set of similarity metrics. These metrics are computed for two nodes while traversing the two models under analysis with a top-down approach. In the generic Matching Engine, provided in org.eclipse.compare.match and used in our approach, the set of metrics consists of four similarity metrics:

• type similarity: to compute the match of the types of two nodes;

• name similarity: to compute the similarity between the values of the attribute name of two nodes;

• value similarity: to compute the similarity between the values of other attributes declared in the nodes;

• relations similarity: to compute the similarity of two nodes based on the relationships they have with other nodes (e.g., children and parents in the model).

Once the matching phase has been completed, it produces a matching model consisting of all the entities that are matched in the two models. The matching model is then used in the differencing phase to extract all the differences between the two models. Specifically, the matching model is browsed by a Differencing Engine that computes the tree edit operations. These operations represent the minimum set of operations to transform one model into another model. They are classified into added, changed, removed and moved operations. For more details about the matching and differencing phases implemented by EMF Compare we refer the reader to [Brun and Pierantonio, 2008].
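To give an idea of what such a similarity metric can look like, the following sketch computes a name similarity as a normalized Levenshtein distance between the name attributes of two nodes. It is only an illustration of the kind of score a matching engine assigns to candidate node pairs, not the exact metric implemented by EMF Compare’s Matching Engine.

```java
// Illustrative name-similarity measure (normalized Levenshtein distance), not EMF Compare's.
public final class NameSimilarity {

    /** Returns a value in [0,1]: 1 for identical names, 0 for completely different ones. */
    public static double similarity(String left, String right) {
        int distance = levenshtein(left, right);
        int longest = Math.max(left.length(), right.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance / longest;
    }

    private static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }
}
```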
4.2.4 XSD Transformation

In an initial manual validation of EMF Compare on WSDL models we found that in a particular case the set of differences produced did not correspond to the minimum set of tree edit operations. The problem was due to the EMF Model used to represent the XSDs. For this reason we decided to add the XSD Transformer. To better understand the problem behind the original EMF Model and the solution adopted, consider the example shown in Figure 4.2.

[Figure 4.2: An example that shows the XSD transformation performed by the XSD Transformer in Stage B of the fine-grained changes extraction process. (a) Definition of an XSD element book whose sequence declares the elements author and title; (b) the original EMF Model, in which each XSDParticle is the parent of the XSDElement or XSDModelGroup it is associated with; (c) the transformed EMF Model, in which this parent-child relationship is inverted.]

Figure 4.2a shows an XSDElement book that consists of an XSDModelGroup (the element sequence) that contains two XSDElements (the elements author and title). Figure 4.2b shows the original EMF Model parsed by the WSDL Parser (Stage A in Figure 4.1). The EMF Model contains the nodes XSDParticle. These nodes are necessary to represent the attributes minOccurs, maxOccurs and ref for each XSDElement declared in an XSDModelGroup and for the XSDModelGroup itself. The XSDParticles in the original model are parents of the elements to which they are associated. This structure can lead to mistakes when the order of XSDElements within an XSDModelGroup changes. In this case, when the Matching Engine traverses the models, it can detect a match between XSDParticles that are associated to different XSDElements (e.g., a match between the XSDParticle of the element author and the XSDParticle of the element title). This match is likely because the values of the attributes minOccurs, maxOccurs and ref are set to their default values. When this match occurs the Matching Engine keeps traversing the model and it detects a mismatch when it traverses the children of the previously matched XSDParticles (e.g., a mismatch between the elements author and title).
As consequence, even if there are no differences among the models the Differencing Engine can produce the added XSDelement title, the added XSDelement author, the removed XSDelement title and the removed XSDelement author as changes. To overcome this problem, we decided to transform the EMF Model inverting the parent-child relationship in presence of XSDParticles as shown in Figure 4.2c. In the transformed models, the Matching Engine traverses the XSDParticles only when a match is detected between the XSDElements to which they are associated. Besides this problem, in one case, WSDLDiff reported the removed Part and added Part changes instead of the changed Part change when a Part was renamed. However for this study the two set of changes are equivalent. For 72 Chapter 4. Fine-Grained WSDL Changes this reason we have not considered it as a problem. Clearly, as part of our future work we plan to validate the fine-grained changes extraction process with a benchmark. 4.3 Study The goal of this study is to analyze the evolution of web services through the analysis of fine-grained changes extracted from subsequent versions of a WSDL interface. The perspective is that of web services consumers interested in extracting the types of changes that appear along the evolution of a web service. They can analyze the most frequent changes in a WSDL interface estimating the risk related to the usage of a specific element. The context of this study consists of all the publicly available WSDL versions of four real world web services, namely: • Amazon EC2: Amazon Elastic Compute Cloud is a web service that provides resizable compute capacity in the cloud. In this study we have analyzed 22 versions. • FedEx Rate Service: the Rate Service provides the shipping rate quote for a specific service combination depending on the origin and destination information supplied in the request. We analyzed 10 different versions. • FedEx Ship Service: the Ship Service provides functionalities for managing package shipments and their options. 7 versions out of 10 have been analyzed in this study. • FedEx Package Movement Information Service: the Package Movement Information Service provides operations to check service availability, route and postal codes between origin and destination. We analyzed 3 versions out of 4. For the sake of simplicity we refer to this service as FedEx Pkg. We chose these web services because they were previously used by Fokaefs et al. [2011]. The other web services analyzed by Fokaefs et al. [2011] (PayPal SOAP API9 and Bing Search10 ) have not been considered because the previous versions of the WSDL interfaces are not publicly available. For the same reasons not every version of the web services has been considered in our analysis. 9 https://www.paypalobjects.com/enUS/ebook/PPAPIReference/ architecture.html 10 http://www.bing.com/developers 4.3. Study 73 In Table 4.5 at the end of the chapter we report the size of the WSDL interfaces in terms of number of Operations, number of Parts, number of XSDElements and number of XSDTypes declared in each version. The size of the WSDL interfaces has been measured using the API provided by the org.eclipse.wst.wsdl and org.eclipse.xsd Eclipse Plug-in projects. The results reported in Table 4.5 show that the web services under analysis evolve differently. The number of Operations declared in the AmazonEC2 service is continuously growing and only in four versions does not change (version 5, 7, 22 and 23). 
The number of Operations declared in the other web services is more stable. Specifically, the FedEx Pkg service declares always 2 Operations. The FedEx Rate service declares 1 Operation in 9 versions out of 10 and 2 Operations in 1 version (version 3). Concerning the FedEx Ship service we can notice an increase in the number of Operations from version 1 to version 5. Then, the number of Operations decreases to 7 and it remains stable until the current version (version 10). To better understand the evolution of web services we used the WSDLDiff tool to extract the fine-grained changes from subsequent versions of the WSDL interfaces under analysis. In the next subsections we first show the types of changes extracted in this study and then we present the results of the study answering our research questions. Table 4.1: Number of added, changed and removed WSDL and XSD elements for each WSDL interface under analysis WSDL AmazonEC2 AmazonEC2 AmazonEC2 FedEx Rate FedEx Rate FedEx Rate FedEx Ship FedEx Ship FedEx Ship FedEx Pkg FedEx Pkg FedEx Pkg Type WSDL XSD Total WSDL XSD Total WSDL XSD Total WSDL XSD Total #Added 358 623 981 (≈80%) 3 236 239 (≈39%) 28 182 210 (≈38%) 0 0 0 (0%) #Changed 34 166 200 (≈16%) 1 295 296 (≈49%) 4 298 302 (≈55%) 0 6 6 (100%) #Deleted 46 5 51 (≈4%) 3 73 76 (≈12%) 8 28 36 (≈6%) 0 0 0 (0%) 74 Chapter 4. Fine-Grained WSDL Changes 4.3.1 Fine-Grained Changes The output of WSDLDiff consists of the set of edit operations. These operations are associated with the elements declared in the WSDL and XSD specifications. Among all the elements the following WSDL elements have been detected as affected by changes: BindingOperation, Operation, Message and Part. The XSD elements detected as affected by changes are: XSDType, XSDElement, XSDAttributeGroup and XSDAnnotation. These elements were affected by the following fine-grained changes: • XSD Element changes: consist of added XSDElements (XSDElementA), removed XSDElements (XSDElementR) and moved XSDElements (XSDElementM) within a declaration of an XSDType or an XSDElement. • Attribute changes: changes due to the update of an attribute value. Specifically we detected changes to the values of the attributes name (NameUpdate), minOccurs (MinOccursUpdate), maxOccurs (MaxOccursUpdate) and fixed (FixedUpdate). • Reference Changes: consists of changes to a referenced value (RefUpdate). • Enumeration Changes: changes of elements declared within an XSDEnumeration element. We detected added enumeration values (EnumerationA) and removed enumeration values (EnumerationR). For the sake of simplicity we have presented only the changes detected in our study. However WSDLDiff is able to detect changes to every element declared in the WSDL and XSD specifications. 4.3.2 Research Question 1 (RQ1) The first research question (RQ1) is: What is the percentage of added, changed and removed elements of a WSDL interface? To answer RQ1, for each type of element declared in the WSDL and XSD specifications, we counted the number of times they have been added, changed, or removed between every pair of subsequent versions of the WSDL interfaces under analysis. We present the results in three different tables. In Table 4.2 we report the number of added, changed and deleted WSDL elements while the added, changed and removed XSD elements are shown in Table 4.3. Table 4.1 summarizes the results showing the total number and the percentage 4.3. Study 75 of added, changed and deleted WSDL and XSD elements for each web service. 
The raw data with the changes extracted for each pair of subsequent versions is available on our web site.11 In Table 4.2 we omitted the number of added, changed and removed BindingOperations because they are identical to the number of added, changed and removed Operations. Moreover, the added and removed Parts do not include the Parts that were added and removed due to the additions and deletions of Messages. This choice allows us to highlight the changes in the Parts of existing Messages. Table 4.2: Number of added Operations (OperationA), changed Operations (OperationC), deleted Operations (OperationD), added Messages (MessageA), changed Messages (MessageC), deleted Messages (MessageD), added Parts (PartA), changed Parts (PartC) and deleted Parts (PartD) for each WSDL interface. Change Type AmazonEC2 FedEx Rate FedEx Ship FedEx Pkg OperationA 113 1 10 0 OperationC 0 1 0 0 OperationD 9 1 4 0 MessageA 218 2 16 0 MessageC 2 0 2 0 MessageD 10 2 2 0 PartA 27 0 2 0 PartC 34 0 0 0 PartD 27 0 2 0 Total 440 7 38 0 The results show that in all the web services the total number of deleted elements is a small percentage of the total number of changes (see Table 4.1). In particular, the percentage of deleted elements is approximately 4% for AmazonEC2, 12% for FedEx Rate and 6% for FedEx Ship. This result demonstrates that web service providers do not tend to delete existing elements. Concerning the number of added elements, the FedEx Rate and Ship services show approximately the same percentage (39% and 38%) while the AmazonEC2 service shows a percentage of approximately 80%. These percentages need to be interpreted taking into account the added, changed and removed WSDL and XSD elements. In fact, while the AmazonEC2 evolves con11 http://swerl.tudelft.nl/twiki/pub/DanieleRomano/WebHome/ ICWS12RQ1.pdf 76 Chapter 4. Fine-Grained WSDL Changes Table 4.3: Number of added XSDTypes (XSDTypeA), changed XSDTypes (XSDTypeC), deleted XSDTypes (XSDTypeD), added XSDElements (XSDElementA), changed XSDElements (XSDElementC), deleted XSDElements (XSDElementD), added XSDAttributeGroup (XSDAttributeGroupA) and changed XSDAttributeGroup (XSDAttributeGroupC) for each WSDL interface. Change Type AmazonEC2 FedEx Rate FedEx Ship FedEx Pkg XSDTypeA 409 234 157 0 XSDTypeC 160 295 280 6 XSDTypeD 2 71 28 0 XSDElementA 208 2 25 0 XSDElementC 1 0 18 0 XSDElementD 0 2 0 0 XSDAttributeGroupA 6 0 0 0 XSDAttributeGroupC 5 0 0 0 Total 791 604 508 6 tinuously adding 113 new Operations (see Table 4.2), the FedEx services are more stable with 1 new Operation added in FedEx Rate and 10 new Operations added in FedEx Ship. However, despite the few number of new Operations added in the FedEx services the number of added, changed and removed XSDTypes is high like in the AmazonEC2 service. This result lets us assume that the elements added in the FedEx services modify old functionalities and, hence, they are more likely to break the clients. Instead the AmazonEC2 is continuously evolving providing new Operations. This assumption is confirmed by the percentage of changed elements, that is lower in AmazonEC2 (about 16%) than in FedEx Rate and Ship (about 49% and 55%). Based on these results we can answer RQ1 stating that in all four web services the percentage of removed elements is a small percentage compared to the total number of added, changed and removed elements. Concerning the added elements the AmazonEC2 showed the highest percentage (≈80%) due to the high number of new WSDL elements added along its evolution. 
Instead the FedEx Rate and Ship services showed lower percentages (respectively about 39% and 38%). The percentage of changed elements is higher in the FedEx Rate and Ship services (respectively about 49% and 55%) compared to the approximately 16% of changed elements in AmazonEC2. Answering RQ1 we decided to omit the analysis of the FedEx Pkg service because the low number of changes and versions do not allow us to make any 4.3. Study 77 assumption. 4.3.3 Research Question 2 (RQ2) The second research question (RQ2) is: Which types of changes are made to the elements of a WSDL interface? In order to address RQ2 we focused on the changes applied to XSDTypes. In fact, among all the elements changed (802), 742 elements (approximately 92%) are XSDTypes (see Table 4.2 and 4.3). For each XSDType we extracted the fine-grained changes and we report the results in Table 4.4. We omitted to report the number of XSDAnnotation changes because they are not relevant for our study. The raw data with the changes extracted for each pair of subsequent versions is available on our web site.12 Table 4.4: Number of added XSDElements (XSDElementA), deleted XSDElements (XSDElementR), moved XSDElements (XSDElementM), updated attributes (NameUpdate, MinOccursUpdate, MaxOccursUpdate and FixedUpdate), updated references (RefUpdate), added enumeration values (EnumerationA) and removed enumeration values (EnumerationR) in the XSDTypes for each WSDL interface. Change Type AmazonEC2 FedEx Rate FedEx Ship FedEx Pkg XSDElementA 198 113 136 1 XSDElementD 11 47 49 3 XSDElementM 1 55 51 0 NameUpdate 11 20 8 0 MinOccursUpdate 17 33 39 0 MaxOccursUpdate 0 9 6 0 FixedValue 0 11 12 2 RefUpdate 9 80 273 0 EnumerationA 0 1141 926 2 EnumerationD 0 702 528 3 Total 247 2211 2028 11 The results show that the most frequent change along the evolution of the AmazonEC2 is the XSDElementA. In fact, it accounts for around 80% (198 changes out of 247) of the total changes. Concerning the FedEx Rate and FedEx 12 http://swerl.tudelft.nl/twiki/pub/DanieleRomano/WebHome/ ICWS12RQ2.pdf 78 Chapter 4. Fine-Grained WSDL Changes Ship services, the EnumerationA changes are the most frequent, accounting for approximately 51% (1141 changes out of 2211) and for 45% (926 changes out of 2028) of all changes. Adding the EnumerationD changes, we obtain approximately 83% (1843 changes out of 2211) and 71% (1454 changes out of 2028) of changes occurring in the enumeration elements. The results show that in 3 web services out of 4 there is a type of change that is predominant. Based on this result web services consumers can become aware of the most frequent types of changes affecting a WSDL interface. Like for RQ1, the small number of changes in the FedEx Pkg does not allow any valid conclusion. 4.3.4 Summary and implications of the results The changes collected in this study highlight how different WSDL interfaces evolve differently. This study with the WSDLDiff tool can help services consumers to analyze which elements are frequently added, changed and removed and which types of changes are performed more frequently. For example, a developer who wants to integrate a FedEx service into his/her application can learn that the specification of data types changes most frequently while Operations change only rarely (RQ1). In particular, the enumeration values are the most unstable elements (RQ2). Instead, an AmazonEC2 consumer can be aware that new Operations are continuously added (RQ1) and that data types are continuously modified adding new elements (RQ2). 
4.4 Conclusion & Future Work

In this chapter we proposed a tool called WSDLDiff to extract fine-grained changes between two WSDL interfaces. With WSDLDiff we performed a study aimed at understanding the evolution of web services by looking at the changes detected by our tool. The results of our study showed that the fine-grained changes are a useful means to understand how a particular web service evolves over time. This information is relevant for web service consumers who want 1) to analyze the most frequent changes affecting a WSDL interface and 2) to compare the evolution of different web services with similar features. From this information they can estimate the risk associated with the usage of a web service. The study presented in this chapter is the first study on the evolution of web services and we believe that our tool provides an essential starting point.

As future work, first we plan to investigate metrics that can be used as indicators of changes in WSDL elements. For instance, in our work shown in Chapter 2, we found an interesting correlation between the number of changes in Java interfaces and the external cohesion metric defined for services by Perepletchikov et al. [2010]. With our tool to extract fine-grained changes we performed a similar study with WSDL interfaces, which will be shown in Chapter 6. Next, we plan to classify the changes retrievable with WSDLDiff, integrating and possibly extending the works proposed by Feng et al. [2011] and Treiber et al. [2008]. Finally, we plan to investigate the co-evolution of the different web services composing a service oriented system. With WSDLDiff we can highlight web services that evolve together and, hence, violate the loose coupling property. This analysis can help us to investigate the causes of web service co-evolution and techniques to keep their evolution independent.

Table 4.5: Number of Operations, Parts, XSDElements and XSDTypes declared in each version of the WSDL interfaces under analysis

WSDL         Ver.   Operations   Parts   XSDElements   XSDTypes
AmazonEC2    2      14           28      28            60
AmazonEC2    3      17           34      34            75
AmazonEC2    4      19           38      38            81
AmazonEC2    5      19           38      38            81
AmazonEC2    6      20           40      40            87
AmazonEC2    7      20           40      40            85
AmazonEC2    8      26           52      52            111
AmazonEC2    9      34           68      68            137
AmazonEC2    10     37           74      74            151
AmazonEC2    11     38           76      76            157
AmazonEC2    12     41           82      82            171
AmazonEC2    13     43           86      86            179
AmazonEC2    14     65           130     130           259
AmazonEC2    15     68           136     136           272
AmazonEC2    16     74           148     148           296
AmazonEC2    17     81           162     162           326
AmazonEC2    18     87           174     174           350
AmazonEC2    19     91           182     182           366
AmazonEC2    20     95           190     190           390
AmazonEC2    21     118          236     236           464
AmazonEC2    22     118          236     236           465
AmazonEC2    23     118          236     236           467
FedEx Rate   1      1            2       2             72
FedEx Rate   2      1            2       2             80
FedEx Rate   3      2            4       4             88
FedEx Rate   4      1            2       2             124
FedEx Rate   5      1            2       2             129
FedEx Rate   6      1            2       2             178
FedEx Rate   7      1            2       2             202
FedEx Rate   8      1            2       2             223
FedEx Rate   9      1            2       2             228
FedEx Rate   10     1            2       2             235
FedEx Ship   2      1            2       2             124
FedEx Ship   5      9            16      16            178
FedEx Ship   6      9            16      16            177
FedEx Ship   7      7            12      12            199
FedEx Ship   8      7            12      12            221
FedEx Ship   9      7            12      12            246
FedEx Ship   10     7            12      12            254
FedEx Pkg    2      2            4       4             20
FedEx Pkg    3      2            4       4             20
FedEx Pkg    4      2            4       4             20

5. Dependencies among Web APIs

Service Oriented Architecture (SOA) enables organizations to react to requirement changes in an agile manner and to foster the reuse of existing services. However, the dynamic nature of service oriented systems and their agility pose the challenge of properly understanding such systems.
In particular, understanding the dependencies among services is a non trivial task, especially if service oriented systems are distributed over several hosts belonging to different departments of an organization. In this chapter, we propose an approach to extract dynamic dependencies among web services. The approach is based on the vector clocks, originally conceived and used to order events in a distributed environment. We use the vector clocks to order service executions and to infer causal dependencies among services. We show the feasibility of the approach by implementing it into the Apache CXF framework and instrumenting the SOAP messages. We designed and executed two experiments to investigate the impact of the approach on the response time. The results show a slight increase that is deemed to be low in typical industrial service oriented systems.1 5.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.3 Study Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.4 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.7 Conclusion & Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 98 IT organizations need to be agile to react to changes in the market. As a consequence they started to develop their software systems as Software as a 1 This chapter was published in the 4th International Conference on Service Oriented Computing and Application (SOCA 2011) [Romano et al., 2011]. 81 82 Chapter 5. Dependencies among Web APIs Service (SaaS), overcoming the poor inclination of monolithically architected systems towards agility. Hence, the adoption of Service Oriented Architectures (SOAs) has become popular. In addition, SOA-based application development also aims at reducing development costs through service reuse. On the other hand, mining dependencies between services in a SOA is relevant to understand the entire system and its evolution over time. The distributed and dynamic nature of those architectures makes this task particularly challenging. In order to get an accurate picture of the dependencies within a SOA system a dynamic analysis is required. Using static analyses simply fails to cover important features of a SOA architecture, for example the ability to perform dynamic binding. To the best of our knowledge, existing technologies used to deploy a service oriented system do not provide tool to accurately detect the entire chain of dependencies among services. For instance, open source Enterprise Service Bus systems (e.g., MuleESB2 and ServiceMix3 ) are limited to detect only direct dependencies (i.e., invocation between pair of services). Such monitoring facilities are widely implemented through the wire tap and the message store patterns described by Hohpe and Woolf [2003]. Other tools, such as HP OpenView SOA Manager4 , allow the exploration of the dependencies, but they must explicitly be specified by the user [Basu et al., 2008]. 
In this chapter, we propose (1) an adaptation of our approach based on vector clocks [Romano and Pinzger, 2011b] to extract dynamic dependencies among web services deployed in an enterprise; (2) a non-intrusive, easy-toimplement and portable implementation and (3) an analysis of the impact of our approach on the performance. Vector clocks have originally been conceived and used to order events in a distributed environment [Mattern, 1989; Fidge, 1988]. We bring this technique to the domain of service oriented systems by attaching the vector clocks to SOAP messages and use them to order service executions and to infer causal dependencies. The approach has been implemented into the Apache CXF5 framework taking advantage of the Pipes and Filters pattern [Hohpe and Woolf, 2003]. Since this pattern is widely used in the most popular web service frameworks and Enterprise Service Buses, the approach can be implemented on other SOA 2 http://www.mulesoft.org/ http://servicemix.apache.org/ 4 http://h20229.www2.hp.com/products/soa/ 5 http://cxf.apache.org/ 3 5.1. Applications 83 platforms (e.g., Apache Axis26 and Mule ESB) in a similar manner. To analyze the impact of the approach on the performance of a system we investigate how the approach affects the response time of services. The results show a slight increase due to the increasing message size and the instrumented Apache CXF framework. To determine the impact on real systems a repository of 41 industrial systems is examined. Given the amount of services typically deployed within these industrial systems we do not expect a significant increase of the response time when using our approach. This chapter is structured as follows. In Section 5.1 we present the main applications of the proposed approach. In Section 5.2 we report the related work. In Section 5.3 we describe the context in which we plan to apply our study. In Section 5.4 we describe our approach to extract dynamic dependencies among web services. In Section 5.5 we propose an implementation of our approach. In Section 5.6 we report the first experiments and the obtained results. Finally, we conclude the chapter and present the future work in Section 5.7. 5.1 Applications In this section we discuss the main applications of our approach that we plan to perform in future work. 5.1.1 Quality attributes measurement Our approach can be used to build up dynamic dependency graphs. These graphs are commonly weighted, where the weights indicate the number of times a particular service is invoked or a particular execution path is traversed. The information contained in these graphs can help software engineers to measure important quality attributes (e.g., analyzability and changeability) for measuring maintainability of the system under analysis. For instance Perepletchikov et al. defined several cohesion and coupling metrics to estimate the maintainability and analyzability of service oriented systems [Perepletchikov et al., 2006, 2007, 2010]. In our work shown in Chapter 2, we found an interesting correlation between the number of changes in Java interfaces and the external cohesion metric defined by Perepletchikov et al. With our approach to extract dynamic dependencies among services we plan to perform similar studies to validate and improve those metrics by analyzing service oriented systems. More in general, our dynamic dependency analysis is a starting point to study the interactions among services in indus6 http://axis.apache.org/ 84 Chapter 5. 
Dependencies among Web APIs trial service oriented systems and to define anti-patterns that can affect the quality attributes required by a SOA. 5.1.2 Change Impact Analysis Besides the measurement of quality attributes our approach can be used to perform Change Impact Analyses (IA) on service oriented systems. Bohner et al. [Shawn A. Bohner, 1996; Bohner, 2002] defined the IA as the identification of potential consequences of a change, or the assessment of what needs to be modified to accomplish a change. They defined two techniques to perform IA, namely Traceability and Dependency. Wang and Capretz [2009] defined an IA approach for service oriented systems based on a service dependency graph. Our approach fits in with their work by adding a dynamic dependency graph. 5.2 Related Work The most recent work on mining dynamic dependencies in service oriented systems has been developed by Basu et al. [2008]. Basu et al. infer the causal dependencies through three dependencies identification algorithms, respectively based on the analysis of 1) occurrence frequency of logged message pairs, 2) distribution of service execution time and 3) histogram of execution time differences. This approach does not require the instrumentation of the system infrastructure. However, it is based on probabilities and there is still the need for properly setting the parameters of their algorithms to reach a good accuracy. Briand et al. [2006] proposed a methodology and an instrumentation infrastructure aimed at reverse engineering of UML sequence diagrams from dynamic analysis of distributed Java systems. Their approach is based on a complete instrumentation of the systems under analysis which in turn requires a complete knowledge of the system. Hrischuk and Woodside [2002] provided a series of requirements to reverse engineer scenarios from traces in a distributed system. However, besides the requirements, this work does not provide any approach to extract dependencies in a service oriented system. As can be deduced from the overview of related work there currently does not exist any accurate approach for inferring the dependencies amongst services. In this chapter, we present such an approach based on the concept of vector clocks. 5.3. Study Context 85 Figure 5.1: A sample enterprise with web services deployed in two departments 5.3 Study Context In this section we describe the context in which we plan to apply our study. The perspective is that of a quality engineer who wants to extract the dynamic dependencies among services within the boundaries of an enterprise. We refer to dependencies as message dependencies, according to which two services are dependent if they exchange messages. We furthemore refer to web services as services which are compliant to the following XML-standards: • WSDL7 (Web Services Description Language) which describes the service interfaces. • SOAP8 (Simple Object Access Protocol) widely adopted as a simple, robust and extensible XML-based protocol for the exchange of messages among web services. 7 8 http://www.w3.org/TR/wsdl http://www.w3.org/TR/soap/ 86 Chapter 5. Dependencies among Web APIs Finally, we assume that the enterprise provides a UDDI9 (Universal Description, Discovery, and Integration) registry to allow for the publication of services and the search for services that meet particular requirements. Our sample enterprise is composed of several departments (a sample enterprise with two departments is shown in Figure 5.1). 
Each department exposes some functionality as web services that can be invoked by web services deployed in other departments. Services deployed within the boundaries of the enterprise are called internal services. Services deployed outside the boundaries of the enterprise are called external services. We assume that hosts within the departments publish web services through an application server (e.g., JBoss AS10 or Apache Tomcat11 ) and web service engines (e.g., Apache Axis2 or Apache CXF). 5.4 Approach Our approach to extract dynamic dependencies among web services is based on the concept of vector clocks. In this section, we first provide a background on vector clocks after which we present our approach to order service executions and to infer dynamic dependencies among web services. 5.4.1 Vector Clocks Ordering events in a distributed system, such as a service oriented system, is challenging since the physical clock of different hosts may not be perfectly synchronized. The logical clocks were introduced to deal with this problem. The first algorithm relying on logical clocks was proposed by Lamport [1978]. This algorithm is used to provide a partial ordering of events, where the term partial reflects the fact that not every pair of events needs to be related. Lamport formulated the happens-before relation as a binary relation over a set of events which is reflexive, antisymmetric and transitive. Lamport’s work is a starting point for the more advanced vector clocks defined by Fidge and Mattern in 1988 [Fidge, 1988; Mattern, 1989]. Like the logical clocks, they have been widely used for generating a partial ordering of events in a distributed system. Given a system composed by N processes, a vector clock is defined as a vector of N logical clocks, where the ith clock is associated to the ith process. Initially all the clocks are set to zero. Every time a process sends a message, it increments its own logical clock, and it attaches 9 http://uddi.xml.org/ http://www.jboss.org/jbossas/ 11 http://tomcat.apache.org/ 10 5.4. Approach 87 the vector clock to the message. When a process receives a message, first it increments its own logical clock and then it updates the entire vector clock. The updating is achieved by setting the value of each logical clock in the vector to the maximum of the current value and the values contained by the vector received with the message. 5.4.2 Inferring dependencies among web services We conceive a vector clock (VC) as a vector/array of pairs (s,n), where s is the service id and n is number of times the service s is invoked. When an instance of the service s receives an execution request the vector clock is updated according to the following rules: • if the request does not contain a vector clock (e.g., a request from outside the system), the vector clock is created, and the pair (s,1) is added to it; • if the request contains a vector clock and a pair with service id s is already contained in the vector clock, the value of n is incremented by one; if not, the pair (s,1) is added to the vector. Once the vector clock is updated, its value is associated to the execution of service s and we label it VC(s). The vector clock is then stored in a database. 
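A minimal sketch of this receive-side bookkeeping is shown below, assuming a simple in-memory map of (s, n) pairs; the database storage and the SOAP transport are omitted, and all names are illustrative rather than taken from our actual implementation.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Sketch of a vector clock represented as a map from service id to invocation count n.
public class VectorClock {

    private final Map<String, Integer> pairs = new LinkedHashMap<>();

    // Applies the update rules when service 'serviceId' receives a request carrying
    // 'received' (null if the request comes from outside the system).
    public static VectorClock onReceive(String serviceId, VectorClock received) {
        VectorClock updated = new VectorClock();
        if (received != null) {
            updated.pairs.putAll(received.pairs);
        }
        // Add (s,1) if s is not yet present, otherwise increment its n by one.
        updated.pairs.merge(serviceId, 1, Integer::sum);
        return updated;
    }

    // Value of n for a given service id; a missing pair counts as 0.
    public int valueOf(String serviceId) {
        return pairs.getOrDefault(serviceId, 0);
    }

    // Service ids occurring in this clock.
    public Set<String> serviceIds() {
        return pairs.keySet();
    }

    @Override
    public String toString() {
        return pairs.toString();
    }
}

For instance, onReceive("GUI", onReceive("OA", null)) produces the clock [(OA,1),(GUI,1)] used in the working example below.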
Whenever an instance of the service s sends an execution request to another service x, the following actions are performed:

• if the service x is an internal service, the vector clock is attached to the outgoing message;

• if the service x is an external service, the pair (x,1) is added to the vector clock and the vector clock is stored in the database but not attached to the outgoing message.

From the set of vector clocks stored in the database, we can infer the causal order of the service executions. Given the vector clocks associated with the executions of service i and service j, VC(i) and VC(j), we can state that the execution of service i causes the execution of service j if VC(i) < VC(j), according to the following relation:

VC(i) < VC(j) ⟺ (∀x: VC(i)_x ≤ VC(j)_x) ∧ (∃x': VC(i)_{x'} < VC(j)_{x'})    (5.1)

where VC(i)_x denotes the value for n in the pair (x,n) of the vector clock VC(i). In other words, the execution of a service i causes the execution of a service j if and only if all the pairs contained in the vector VC(i) have values for n that are less than or equal to the corresponding values for n in VC(j), and at least one value for n is smaller. If all the corresponding pairs of the two vector clocks VC(i) and VC(j) contain the same values for n except one corresponding pair whose values for n differ exactly by 1, we state that there is a direct dependency (i.e., a direct call) between service i and service j. If a pair with id s is missing in a vector, its value for n is considered to be 0. Finally, to infer the dynamic dependencies among services, it is necessary to apply the binary relation in (5.1) to each pair of vector clocks whose values are stored in the database.

Figure 5.2: Example of a service oriented system to open a bank account

5.4.3 Working Example

Consider the example system from Figure 5.2, composed of six services inside the enterprise boundary, one external service and one client which triggers the execution. The system provides the services to open an account in a banking system. In this example, the client interested in creating an account needs to invoke the service OpenAccount. This service invokes the services GetUserInfo, Deposit and RequestCreditCard. These services invoke the service WriteDB to access a database. WriteDB first writes in a database and then, if its invocation has been triggered by RequestCreditCard, invokes NotifyUser which performs actions to notify the user. The external service TaxAuthority is invoked by GetUserInfo to inquire about fiscal information of the user.

The execution flow resulting from the invocation of the service OpenAccount is shown as a UML sequence diagram in Figure 5.3. The arrows in the diagram are labeled with the vector clocks associated with the execution of the invoked service. Vector clocks with superscripts mark vector clocks associated with different instances of the same service. When the OpenAccount (OA) service is invoked, there is no vector clock attached to the message, since the invocation request comes from outside (i.e., Client). Hence, a new vector clock (VC(OA)) is created with the single pair (OA,1) and it is stored in the database. Then the execution of the service OpenAccount triggers the execution of the service GetUserInfo (GUI). When this service is invoked, a new pair (GUI,1) is added to the vector clock, obtaining the new clock VC(GUI)=[(OA,1),(GUI,1)] that is stored in the database.
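Relation (5.1) and the direct-dependency check can be sketched as two methods on the hypothetical VectorClock class introduced in Section 5.4.2; only the comparison logic is illustrated, and the methods are assumed to live inside that class (Set and HashSet come from java.util).

// Additional methods for the VectorClock class sketched earlier.

// Happens-before test implementing relation (5.1): true if every component of
// this clock is <= the corresponding component of 'other' and at least one
// component is strictly smaller.
public boolean happensBefore(VectorClock other) {
    Set<String> ids = new HashSet<>(serviceIds());
    ids.addAll(other.serviceIds());
    boolean strictlySmaller = false;
    for (String id : ids) {
        int a = valueOf(id);           // missing pairs count as 0
        int b = other.valueOf(id);
        if (a > b) {
            return false;              // some component is larger: not ordered
        }
        if (a < b) {
            strictlySmaller = true;
        }
    }
    return strictlySmaller;
}

// Direct dependency (direct call): the two clocks differ in exactly one
// component, and that component differs by exactly 1.
public boolean directlyCauses(VectorClock other) {
    if (!happensBefore(other)) {
        return false;
    }
    Set<String> ids = new HashSet<>(serviceIds());
    ids.addAll(other.serviceIds());
    int differing = 0;
    for (String id : ids) {
        int diff = other.valueOf(id) - valueOf(id);
        if (diff == 1) {
            differing++;
        } else if (diff != 0) {
            return false;
        }
    }
    return differing == 1;
}

With this reading, VC(OA).happensBefore(VC(GUI)) and VC(OA).directlyCauses(VC(GUI)) both hold for the clocks in this example, while the clocks of OA and the WriteDB executions are related only indirectly.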
When the service GetUserInfo (GUI) invokes the external service TaxAuthority (TA), the vector clock is set to VC(TA)=[(OA,1),(GUI,1),(TA,1)] and is stored in the database. In this way we can infer dependencies to external services. Since TaxAuthority (TA) is an external service and we do not have control over external services, the vector clock is not attached to this message.

Consider the execution of the service WriteDB (WDB), and assume we want to infer all the services that depend on it. Since we have multiple invocations of the service WriteDB in the execution flow, the dependent services are all the services x whose vector clocks VC(x) satisfy the following boolean expression:

VC(x) < VC(WDB)' ∨ VC(x) < VC(WDB)'' ∨ VC(x) < VC(WDB)'''

These services are OpenAccount, GetUserInfo, Deposit and RequestCreditCard (see Figure 5.3).

Figure 5.3: Sequence diagram for opening a bank account. The arrows in the diagram are labeled with the vector clocks associated with the execution of the invoked service: VC(OA)=[(OA,1)], VC(GUI)=[(OA,1),(GUI,1)], VC(TA)=[(OA,1),(GUI,1),(TA,1)], VC(D)=[(OA,1),(D,1)], VC(RCC)=[(OA,1),(RCC,1)], VC(WDB)'=[(OA,1),(GUI,1),(WDB,1)], VC(WDB)''=[(OA,1),(D,1),(WDB,1)], VC(WDB)'''=[(OA,1),(RCC,1),(WDB,1)], and VC(NU)=[(OA,1),(RCC,1),(WDB,1),(NU,1)].

If we want to infer all the services that WriteDB depends on, we look for all the services x whose vector clocks VC(x) satisfy the following boolean expression:

VC(x) > VC(WDB)' ∨ VC(x) > VC(WDB)'' ∨ VC(x) > VC(WDB)'''

The sole service which WriteDB depends on is NotifyUser.

Consider the execution of the service OpenAccount (OA), and assume we want to infer the services that OpenAccount depends on directly. Those services are GetUserInfo (GUI), Deposit (D) and RequestCreditCard (RCC). Their vector clocks (VC(GUI), VC(D) and VC(RCC)) contain only one pair (respectively (GUI,1), (D,1) and (RCC,1)) with a value for n that is larger by exactly 1 than the corresponding value in the vector clock VC(OA). Among the services OA and WDB there are no direct dependencies because the vector clocks corresponding to the executions of WDB contain two pairs with different values for n.

The values for n in the example in Figure 5.3 are all equal to 1. However, they are needed to detect the presence of cycles along the execution flows. Assume that the NotifyUser service invokes the WriteDB service, introducing a cycle. In this case the vector clock associated with the second invocation of the service WriteDB is VC(WDB)=[(OA,1),(RCC,1),(WDB,2),(NU,1)].

5.5 Implementation

The implementation of the proposed approach should be non-intrusive, easy to implement and portable to different SOA platforms. Only if these properties hold can we be sure that the approach can be adopted in an industrial setting. In this section we propose an implementation that meets these requirements. The implementation requires three steps. First, the messages need to be instrumented to attach the vector clock data structure. Next, we need a technique to capture the incoming messages in order to retrieve the vector clock, update it and store its value in the database. Finally, the outgoing messages have to be captured to attach the updated vector clock to them.
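The remainder of this section details these steps in Apache CXF. As a preview, the sketch below shows the rough shape the two interceptors could take; the class names match those introduced later in this section, while the chosen phases, the header (de)serialization and the persistence call are simplifying assumptions rather than the exact implementation used in our experiments.

import org.apache.cxf.binding.soap.SoapMessage;
import org.apache.cxf.phase.AbstractPhaseInterceptor;
import org.apache.cxf.phase.Phase;

// Sketch: creates/updates the vector clock for every incoming request.
public class VectorClockInInterceptor extends AbstractPhaseInterceptor<SoapMessage> {

    public VectorClockInInterceptor() {
        super(Phase.PRE_INVOKE); // assumed phase: must run before the service logic
    }

    @Override
    public void handleMessage(SoapMessage message) {
        VectorClock received = readClockFromHeader(message);            // null if no header present
        VectorClock updated = VectorClock.onReceive(serviceId(message), received);
        store(updated);                                                  // persist VC(s) for later analysis
        message.getExchange().put(VectorClock.class, updated);           // keep it available for outgoing calls (simplified)
    }

    // Header parsing, service identification and persistence are omitted in this sketch.
    private VectorClock readClockFromHeader(SoapMessage message) { return null; }
    private String serviceId(SoapMessage message) { return "unknown"; }
    private void store(VectorClock clock) { }
}

// Sketch: attaches the current vector clock to outgoing messages to internal services.
// Invocations to external services (update and store instead of attach) are omitted.
class VectorClockOutInterceptor extends AbstractPhaseInterceptor<SoapMessage> {

    public VectorClockOutInterceptor() {
        super(Phase.PRE_PROTOCOL); // assumed phase for adding the SOAP header
    }

    @Override
    public void handleMessage(SoapMessage message) {
        VectorClock clock = message.getExchange().get(VectorClock.class);
        if (clock != null) {
            writeClockToHeader(message, clock); // serialize into the vc:VectorClock header
        }
    }

    private void writeClockToHeader(SoapMessage message, VectorClock clock) { }
}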
To instrument the messages we use the SOAP header element. This element is meant to contain additional information (e.g., authentication information) not directly related to the particular message. For example, after attaching the vector clock to the message sent from the service GetUserInfo to the service WriteDB (see Figure 5.3), the message contains the following header:

<soap:Envelope>
  <soap:Header>
    <vc:VectorClock>
      <vc:pair>
        <vc:s>OpenAccount</vc:s>
        <vc:n>1</vc:n>
      </vc:pair>
      <vc:pair>
        <vc:s>GetUserInfo</vc:s>
        <vc:n>1</vc:n>
      </vc:pair>
    </vc:VectorClock>
  </soap:Header>
  ...
</soap:Envelope>

Concerning the interception of the incoming and outgoing messages, we adopted a technique that relies on the Pipes and Filters [Hohpe and Woolf, 2003] architectural pattern. The Pipes and Filters pattern allows a larger processing task to be divided into a sequence of smaller, independent processing steps, called Filters, that are connected by channels, called Pipes. This pattern is widely adopted to process incoming and outgoing messages in web service engines and frameworks such as Apache Axis2 and Apache CXF. Those frameworks use Filters to implement the message processing tasks (e.g., message marshaling and unmarshaling) and they allow developers to easily extend the chains of Filters to further process messages. Since this pattern is widely used, even by Enterprise Service Bus platforms (e.g., MuleESB), we decided to use it to implement the logic needed to retrieve, update, store and forward the vector clocks. Instrumenting the services would be an alternative implementation approach. However, instrumentation is risky since it modifies the implementation and can introduce bugs.

To implement our approach we use the Apache CXF service framework. In Apache CXF the filters are called interceptors. Figure 5.4 shows the chains of interceptors between an Apache CXF Deployed Service and an Apache CXF Developed Consumer. When the consumer invokes a remote service, the Apache CXF runtime creates an outbound chain (Out Chain) to process the request. If the invocation starts a two-way message exchange, the runtime creates an inbound chain to process the response (omitted in Figure 5.4). When a service receives a request from a consumer, a similar process takes place. The Apache CXF runtime creates an inbound interceptor chain (In Chain) to process the request. If the request is part of a two-way message exchange, the runtime also creates an outbound interceptor chain (omitted in Figure 5.4).

Figure 5.4: The chains containing our vector clock interceptors between an Apache CXF Deployed Service and an Apache CXF Developed Consumer

In this implementation we add two interceptors. We add the VectorClockInInterceptor to the In Chain to update/create the vector clock value and store it in the database. To the Out Chain we add the VectorClockOutInterceptor to attach the vector clock to the outgoing message, or to update and store the vector clock in the case of invocations to external services. Those interceptors can be added dynamically to the chain of interceptors. This feature allows us to use our approach without re-deploying the system under analysis.

5.6 Experiments

To investigate the impact of our approach on the service response time we designed and executed two experiments.
The response time of a system can increase because the approach introduces two variables. First, we introduce two new filters in the Pipes and Filters pattern and the Apache CXF runtime is loaded with additional message processing tasks. Second, we introduce a new header element in the SOAP messages to attach the vector clock, which increases the size of the messages passed between services. We performed two experiments in which we measure the impact of the instrumented Apache CXF framework (Experiment 1) and the impact of the increasing size of the messages (Experiment 2) on the response time.

To perform our experiments the Apache CXF framework 2.4.1 is instrumented as described in the previous section. Tomcat 7.0.19 is used as application server and Hibernate 3 as Java persistence framework. On the hardware side, two platforms are connected through a 100 Mbit/s Ethernet connection:

• Platform 1: MacBook Pro 6.2, processor 2.66 GHz Intel Core i7, memory 4 GB DDR3, Mac OS 10.6.5.

• Platform 2: MacBook Pro 7.1, processor 2.4 GHz Intel Core 2 Duo, memory 4 GB DDR3, Mac OS 10.6.4.

Each platform uses a MySQL 5.1.53 (Community Edition) database to store the vector clock values for subsequent dependency extraction. Execution times are measured using the Java method System.currentTimeMillis(), which returns the current time in milliseconds (ms).

5.6.1 Experiment 1

In the first experiment we investigate the impact of the instrumented version of the Apache CXF framework on the response time. We implemented the example shown in Figure 5.2, deploying the services within the boundary on Platform 1 and the external service on Platform 2. We deployed the services within the system on one platform to achieve more accurate timing and eliminate the network overhead, which is not relevant for this experiment. Moreover, the implementation of each service contains only the logic needed to invoke other services. We measured the response time of the service OpenAccount in three different scenarios:

• NoClock: we executed the system without our vector clock approach.

• Clock: we executed the system with our vector clock approach.

• ClockNoDB: we executed the system with our vector clock approach without storing the vector clock values in the database.

Figure 5.5: Box plots of the response time in milliseconds obtained for Experiment 1 (scenarios Clock, ClockNoDB and NoClock).

For each scenario we executed the system 1000 times to minimize the influence of the operating system activities. Figure 5.5 shows the box plots of the response time measured for the three different scenarios, while the following table shows median and average values in milliseconds.

Scenario     Median (ms)   Average (ms)
NoClock      116.6         108
ClockNoDB    249.4         226
Clock        286.4         275

The results show that on average the difference in response time between the scenarios with and without vector clocks is 167 ms. The overhead due to the storage in the database using Hibernate 3 is on average 49 ms. The difference measured is relevant, but it is relative to a system which involves the execution of 7 services without any business logic. The impact of our approach can be lower in real systems since the increase in milliseconds introduced by the instrumented Apache CXF framework is expected to be a small percentage of the total response time when additional logic is also executed.
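The per-scenario measurement procedure can be sketched as follows; invokeOpenAccount() stands in for the actual SOAP invocation and is purely illustrative.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ResponseTimeBenchmark {

    // Placeholder for the actual SOAP call to the OpenAccount service.
    static void invokeOpenAccount() { /* illustrative stub */ }

    public static void main(String[] args) {
        int runs = 1000; // repeated to minimize the influence of operating system activities
        List<Long> samples = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.currentTimeMillis();
            invokeOpenAccount();
            samples.add(System.currentTimeMillis() - start);
        }
        Collections.sort(samples);
        double median = (samples.get(runs / 2 - 1) + samples.get(runs / 2)) / 2.0;
        double average = samples.stream().mapToLong(Long::longValue).average().orElse(0);
        System.out.printf("median=%.1f ms, average=%.1f ms%n", median, average);
    }
}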
5.6.2 Experiment 2

In the second experiment we investigate the impact of the increasing message size on the response time. We implemented the system shown in Figure 5.6. The system is composed of 12 web services that we labeled from 1 to 12. Each web service Service_i invokes Service_i+1, except the last service Service_12. The invocations among services are synchronous. To take the network overhead into account we deployed Service_i on Platform 1 if i is odd and on Platform 2 if i is even. Similarly to Experiment 1, the services' implementations do not contain any business logic except the logic needed to invoke the next service.

Figure 5.6: System deployed to perform Experiment 2

We measure the response time of the service Service_1 while increasing the vector clock size from 1 to 2000 pairs. The vector clock is added to the message sent to Service_1, which forwards the message to Service_2 until the last service of the execution flow is reached. For each vector clock size, this scenario is executed 1000 times to minimize the influence of the operating system activities. The vector clocks are not stored in the database in order to achieve more accurate time measures. Figure 5.7 shows the median and average of the measured response times for each vector clock size. As shown by the plot, the increasing size of the messages has a relevant impact on the response time. Basically, the more unique services are invoked along the execution flow, the higher the response time.

Figure 5.7: Average and median response time in milliseconds when increasing the vector clock size for Experiment 2

5.6.3 Summary of the results

Our experiments measured the impact of the approach on the response time. This impact is mainly due to the increasing size of the SOAP messages. The instrumentation of the CXF framework should be a minor issue for real systems. In order to validate that the increase in message size is not problematic in practice, we counted the number of services and operations in a set of industrial systems which use web services. These industrial systems have been previously analyzed by the Software Improvement Group12 and cover a wide range of domains. The following tables report the frequencies of the number of services and the number of operations within these systems:

12 http://www.sig.eu

#Services   #Systems
1-10        31
11-100      6
101-201     4

#Operations   #Systems
1-10          13
11-100        17
101-500       9
> 501         2

According to these results, applying our approach to extract dependencies in the biggest system (composed of 201 services) in our repository would lead to an increase of the response time of 140 ms in the worst case. This difference is significant for a system without any business logic, but we believe it is only a small percentage of the response time in real systems. In our future work we plan to investigate the impact of our approach on a subset of those systems.

5.7 Conclusion & Future Work

In this chapter, we presented a novel approach to extract dynamic dependencies among services using the concept of vector clocks. They allow the reconstruction of an accurate dynamic dependency graph from the execution of a service oriented system. We implemented our approach in the Apache CXF framework using the Pipes and Filters pattern. This pattern makes our approach portable to a wide range of SOA platforms, such as Mule ESB and Apache Axis2.
The information retrievable with our approach is of great interest for both researchers and developers of service-oriented systems. Amongst others, the dependencies can be used to study service usage patterns and anti-patterns. In addition, the information can be used to identify the potential consequences of a change or a failure in a service, also known in literature as change and failure impact analysis. As future work, we plan to apply our approach to extract dependencies in both open-source and industrial systems. The extracted graphs allows us to measure important quality attributes of the systems under analysis, such as changeability, maintainability and analyzability. Moreover, we plan to further investigate the impact of our approach on the response time of industrial systems. If the impact is significant, we plan to improve our approach to minimize the introduced overhead. . 6 Change-Prone Web APIs Several metrics have been proposed in literature to highlight change-prone software components in order to ease their maintainability. However, to the best of our knowledge, no such studies exist for web APIs (i.e., APIs exposed and accessible via networks) whose popularity has grown considerably over the last years. Web APIs are considered contracts between providers and consumers and stability is a key quality attribute of them. We present a qualitative and quantitative study of the change-proneness of web APIs with low external and internal cohesion. First, we report on an online survey to investigate the maintenance scenarios that cause changes to web APIs. Then, we define an internal cohesion metric and analyze its correlation with the changes performed in ten well known WSDL APIs. Our results provide several insights into the interface, method, and datatype change-proneness of web APIs with low internal and external cohesion. The results assist both providers and consumers in assessing the stability of web APIs, and provide directions for future research.1 6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6.2 Research Questions and Approach . . . . . . . . . . . . . . . . . . . . . 103 6.3 Online Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.4 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 6.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 6.7 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Over the last years software systems have grown significantly from isolated software systems to distributed software systems (e.g., service-oriented systems) [Neukom, 2004; Murer et al., 2010]. These systems consist of interconnected, distributed components that are implemented and deployed by 1 This chapter has been published as technical report [Romano et al., 2013]. 99 100 Chapter 6. Change-Prone Web APIs different organizations or different departments within the same organization. In these systems the distributed component’s API, referred to as web API throughout this chapter, is considered a contract between the component’s provider and its consumers [Erl, 2007]. One of the key factors for deploying successful web APIs, and in general APIs, is assuring an adequate level of stability [Erl, 2007; Daigneau, 2011; Vásquez et al., 2013]. 
Changes in a web API might break the consumers’ systems forcing them to continuously adapt to new versions of the web API [Papazoglou, 2008; Daigneau, 2011]. For this reason, assessing the stability of web APIs is key to reduce the likelihood of continuous updates. To reduce the effort and the costs to maintain software systems several approaches have been defined to identify change-prone software components [Girba et al., 2004; Khomh et al., 2009; Posnett et al., 2011; Penta et al., 2008]. Based on these studies software engineers can use quality indicators (e.g., software metrics, heuristics to measure antipatterns) that can estimate the components’ change frequency and can assist them in taking appropriate countermeasures. However, to the best of our knowledge, none of these studies investigates change indicators for web APIs. We believe that this is mainly due to the lack of publicly available web APIs with long histories which makes performing such studies challenging for reasearchers. Change-proneness indicators would bring relevant benefits for both providers and consumers. On the one hand, consumers can estimate the change-proneness of suitable web APIs available on the market and subscribe to the most stable one. On the other hand, providers want to publish stable web APIs to reduce the maintenance effort and to attract more consumers and, consequently, increase their profits. Among all the structural properties of web APIs (e.g., complexity and size), we believe that the cohesion can affect their change-proneness. Our intuition is based on our work shown in Chapter 2 and existing work [Perepletchikov et al., 2007, 2010]. In Chapter 2, we showed that, among the existing source code metrics, the external cohesion has the strongest correlation with the number of changes performed in Java interfaces. Moreover, Perepletchikov et al. [Perepletchikov et al., 2007, 2010] showed that the cohesion can affect the understandability and, consequently, the maintainability of web APIs. In this chapter, to assist both providers and consumers, we use a mixed method approach [Creswell and Clark, 2010] to analyze the impact of the internal and external cohesion on the change-proneness of web APIs. Internal cohesion measures the cohesion of the operations (also referred to as methods) of a web API. External cohesion measures the extent to which the oper- 6.1. Background 101 ations of a web API are used together by external consumers (also referred to as clients). In the first part of our study, we use an online survey to investigate 1) the interface and method level change-proneness of web APIs with low external cohesion and 2) the interface and data-type level change-proneness of web APIs with low internal cohesion. The results show the likelihood with which maintenance scenarios can cause changes in web APIs affected by low internal and external cohesion. The second part of our study consists of a quantitative analysis of the change-proneness of web APIs with low internal cohesion. We first introduce the Data Type Cohesion (DTC) metric to overcome the problem of the existing internal cohesion metrics. Based on frequent discussions with industrial partners and colleagues, we believe that the existing metrics should be improved because they do not take into account the cohesion among data types. We then analyze the change-proneness of ten public WSDL2 (Web Service Description Language) APIs investigating the correlation between our DTC metric and the number of changes performed in the WSDL APIs. 
The results show that the values for the DTC metric are correlated with the number of changes the WSDL APIs undergo to. The contributions of this chapter are: • insights into the likelihood of maintenance scenarios to cause changes in web APIs with low internal and external cohesion. • a new internal cohesion metric that takes into account the cohesion among data types to highlight change-prone WSDL APIs. • guidelines for researchers to investigate the method level of web APIs with low external cohesion and the data-type level of web APIs with low internal cohesion. 6.1 Background The concept of software cohesion has been widely investigated in the context of different programming paradigms [Briand et al., 1998; Counsell et al., 2006; Perepletchikov et al., 2007; Kumar et al., 2011; Zhao and Xu, 2004]. In this chapter we adhere to the classification defined by Perepletchikov et al. [Perepletchikov et al., 2007, 2010] who investigated the cohesion of web APIs. According to their classification there are 8 different levels of cohesion 2 http://www.w3.org/TR/wsdl 102 Chapter 6. Change-Prone Web APIs involving web APIs: coincidental, logical, temporal, communicational, external, implementation, sequential, and conceptual. In this chapter we focus on the external and communicational cohesion to which we refer as internal cohesion. The internal cohesion measures the cohesion of the operations (also referred as methods throughout the chapter) declared in a web API. Similar to the method cohesion (i.e., LCOM) defined by Chidamber et al. [Chidamber and Kemerer, 1991, 1994], the internal cohesion expresses the extent to which the operations belong together counting their common parameters. The external cohesion measures the extent to which the operations of a web API are used by external consumers (also called clients). In the next subsections, first, we present existing metrics proposed in literature to measure the external and internal cohesion. Then, we present the existing antipatterns in web APIs that result from low internal and external cohesion. 6.1.1 Cohesion Metrics To compute the external cohesion of a web API, Perepletchikov et al. [Perepletchikov et al., 2007, 2010] proposed the SIUC (Service Interface Usage Cohesion) metric. This metric computes the sum of operations invoked by each client normalized by the number of clients and operations declared in the web API. To the best of our knowledge there are no further studies proposing other metrics for measuring the external cohesion. Existing studies propose different metrics to measure the internal cohesion of web APIs. Perepletchikov et al. [Perepletchikov et al., 2007, 2010] proposed the SIDC (Service Interface Data Cohesion) metric, Sindhgatta et al. [Sindhgatta et al., 2009] proposed the LCOS (Lack of COhesion in Service) and the SFCI (Service Functional Cohesion Index). Even though their formulas differ, these metrics have in common that they only measure the degree to which operations use common messages without considering the cohesion of messages. For this reason, in this paper, we refer to these existing metrics as message-level metrics. 6.1.2 Antipatterns In literature different antipatterns for web APIs, and more in general for APIs, have been proposed [Moha et al., 2012; Rotem-Gal-Oz, 2012; Král and Zemlicka, 2007; Cherbakov et al., 2006; Dudney et al., 2002; Martin, 2002]. 
Among the proposed antipatterns two antipatterns in web APIs are symptoms of low internal and external cohesion: the Multiservice [Dudney et al., 2002] and the Fat [Martin, 2002] antipatterns. The Multiservice antipattern was originally conceived by Dudney et al. 6.2. Research Questions and Approach 103 <<webAPI>> CommerceAPI placeOrder() reserveInventory() generateInvoice() acceptPayment() validateCredit() getOrderStatus() cancelOrder() getPaymentStatus() Figure 6.1: Example of a Multiservice web API (symptom of low internal cohesion). It exposes operations related to five different business entities (i.e., Order, Inventory, Invoice, Payment, and Credit). [Dudney et al., 2002] and it is also known as God Object in literature [Moha et al., 2012]. A Multiservice web API exposes many operations that are related to different business entities. The CommerceAPI shown in Figure 6.1 is an example of a Multiservice API. This API exposes operations related to five different business entities: Order, Inventory, Invoice, Payment, and Credit. Such a web API ends up to be low internally cohesive because of the different entities encapsulated by it. As a consequence, many clients can invoke simultaneously its operations causing performance bottlenecks [Moha et al., 2012]. The Fat antipattern was proposed by Martin in [Martin, 2002]. This antipattern occurs in web APIs and other types of APIs, such as Java interfaces. A Fat web API is an API with disjoint sets of operations that are invoked by different clients and, hence, they show low external cohesion. The BankAPI shown in Figure 6.2 is an example of a Fat API. The Student and Professional clients invoke two disjoint sets of operations. Martin proposed the Interface Segregation Principle (ISP) to refactor such APIs. The ISP states that Fat APIs need to be split into smaller APIs according to its clients’ usage. Each smaller API should be specific to a client and each client should only know about the set of operations it is interested in. 6.2 Research Questions and Approach The change-proneness of web APIs is relevant to design and maintain large distributed information systems [Murer, 2011; Murer et al., 2010]. To better 104 Chapter 6. Change-Prone Web APIs <<webAPI>> BankAPI accountBalanceForStudent() requestLoanForStudent() requestInsuranceForStudent() requestInsuranceForPro() requestLoanForPro() accountBalanceForPro() Student Client <<uses>> Professional Client Figure 6.2: Example of a Fat web API (symptom of low external cohesion). The Student and Professional clients invoke disjoint set of operations. understand the importance of assuring stable web APIs consider the scenario shown in Figure 6.3. PaymentAPI1 provides 3 changes per month invokes PaymentAPI2 P Provider1 provides 1 change per month BankClient PaymentAPI3 9 changes per month Provider2 provides Provider3 Figure 6.3: Scenario in which the web API consumer BankClient subscribes to the most stable API PaymentAPI2 among three available web APIs. In this scenario, a web API consumer (i.e., BankClient) wants to use a web API to receive payments from its customers. On the market there are three different providers (i.e., Provider1, Provider2, and Provider3) each providing a payment API (i.e., PaymentAPI1, PaymentAPI2, and PaymentAPI3) that adhere to BankClient’s business and functional requirements. BankClient is interested in a stable web API to reduce the need to adapt its system(s). Therefore, he decides to monitor the evolution of the three web APIs for a certain time. 
After this time he can use the most stable API (i.e., PaymentAPI2) with the lowest 6.2. Research Questions and Approach 105 change frequency (i.e., 1 change per month). In a real world scenario, where time-to-market is important for gaining competitive advantage, BankClient typically does not have the time to monitor the stability of different web APIs. Moreover, the number of past changes might not be available and they might not be a good indicator for future changes. For instance, an API might have been refactored to improve its change-proneness. Furthermore, from the perspective of providers, they are interested in providing stable web APIs to increase the likelihood with which clients subscribe to their APIs and, consequently, increasing their profits. In this chapter, we investigate the relationship between internal and external cohesion and the change-proneness of web APIs. The results can assist web APIs consumers and providers in estimating the stability of web APIs. In the following, we motivate and state our research questions, as well as, outline our research approach. 6.2.1 External Cohesion and Change-Proneness Concerning web APIs with low external cohesion we want to investigate which scenarios are more likely to cause future changes in Fat web APIs. Moreover, we want to analyze the change-proneness of methods exposed in such web APIs. APIs with low external cohesion can have two different types of methods. Shared methods are methods invoked by all different clients. In Figure 6.4 requestInsurance() is a shared method since both the Student and Professional clients invoke it. Non-shared methods are methods invoked only by a specific client (e.g., the requestLoanForStudent() method in Figure 6.4). We believe that these two classes of methods can be changed for different reasons and knowing these reasons can give further insights into the change-proneness of web APIs with low external cohesion. To assist providers in evaluating the change-proneness of their web APIs with low external cohesion, we answer the following research question: • RQ1: What are the scenarios in which developers change web APIs with low external cohesion? In which cases do they change the shared and non-shared methods? We investigate the change-proneness on two different levels: interface level (i.e., change-proneness of a web API as a whole) and method level (i.e., change-proneness of the methods exposed by a web API). The results from this research question assist only providers because consumers typically do not have access to the information needed to measure the external cohesion (i.e., how other consumers invoke the API). 106 Chapter 6. Change-Prone Web APIs <<webAPI>> BankAPI accountBalanceForStudent() requestLoanForStudent() requestInsurance() requestLoanForPro() accountBalanceForPro() Student Client <<uses>> Professional Client Figure 6.4: Web API with low external cohesion where only the method requestInsurance() is shared by the two different clients Student Client and Professional Client. 6.2.2 Internal Cohesion and Change-Proneness Similar to external cohesion we investigate which scenarios are more likely to cause changes in Multiservice web APIs (i.e., web APIs with low internal cohesion). Furthermore, we analyze the change-proneness of the data types declared within a web API. 
This allows us to highlight the differences between the change-proneness of shared data types (i.e., data types referenced multiple times within a web API) and non-shared data types (i.e., data types referenced only once). To evaluate the change-proneness of web APIs with low internal cohesion, we answer the following research question:

• RQ2: What are the scenarios in which developers change web APIs with low internal cohesion? In which cases do they change the shared and non-shared data types?

We investigate the change-proneness on two different levels: interface level (i.e., change-proneness of a web API as a whole) and data-type level (i.e., change-proneness of the data types declared in a web API). Differently from RQ1, the results from RQ2 assist both providers and consumers. Both have access to the web API to measure the internal cohesion.

6.2.3 Internal Cohesion Metrics as Change Indicators

To make the results from RQ2 actionable in an industrial environment [Bouwers et al., 2013] a metric should be used to measure the internal cohesion. However, as shown in Section 6.1, the existing metrics are message-level metrics that do not consider the usage of data types to compose messages. To understand this drawback consider the two examples in Figure 6.5.

Figure 6.5: Example that shows the drawback of the existing message-level internal cohesion metrics SIDC, LCOS, and SFCI. (a) Two operations operation1 and operation2 that use the same message message1. (b) Two operations operation1 and operation2 that use different messages message1 and message2 which indirectly reference the same data types type2 and type3.

The web API shown in Figure 6.5a exposes two operations operation1 and operation2 that use the same message message1. The message-level metrics are capable of detecting the cohesion of this web API, but they fail when measuring the cohesion of the web API shown in Figure 6.5b. This API has two operations operation1 and operation2 that use different messages, namely message1 and message2. In this case the message-level metrics result in a low value of cohesion. However, message1 and message2 reference the same data types type2 and type3. We argue that the web API is cohesive because both type1 and type4 (referenced by message1 and message2, respectively) are complex data types composed of type2 and type3.

To overcome this problem Bansiya et al. [Bansiya and Davis, 2002] defined the CAMC (Cohesion Among Methods of Class) metric that measures the cohesion of object oriented classes. In this chapter we adapt the CAMC metric for web APIs, proposing the Data Type Cohesion (DTC) metric. For a web API s, DTC is computed as follows:

DTC(s) = ( ∑_{x,y ∈ Op(s)} Co(x, y) ) / |Op(s)|    (6.1)

where Op(s) represents the set of operations exposed in s. Co(x, y) is the cohesion between two operations x and y, and it is defined as:

Co(x, y) = ( ∑_{m,n ∈ MP(s)} Cdt(m, n) ) / |MP(s)|    (6.2)

where MP(s) is the set of all message pairs used by x and y; Cdt(m, n) is the cohesion between the messages m and n, computed as:

Cdt(m, n) = Com(m, n) / ( Com(m, n) + Uncom(m, n) )    (6.3)

where Com(m, n) represents the number of data types referenced by both messages m and n, and Uncom(m, n) is the number of data types referenced in only one of the two messages.
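A small sketch of how DTC could be computed from an interface model is shown below. The plain maps stand in for the parsed WSDL/XSD model, and the iteration over unordered operation pairs is one possible reading of Eq. (6.1); everything here is illustrative rather than the exact implementation used in our tool chain.

import java.util.*;

public class DtcMetric {

    // Eq. (6.1): sum of Co over operation pairs, normalized by the number of operations.
    static double dtc(Map<String, Set<String>> opToMessages,
                      Map<String, Set<String>> messageToTypes) {
        List<String> ops = new ArrayList<>(opToMessages.keySet());
        double sum = 0.0;
        for (int i = 0; i < ops.size(); i++) {
            for (int j = i + 1; j < ops.size(); j++) {
                sum += co(opToMessages.get(ops.get(i)),
                          opToMessages.get(ops.get(j)), messageToTypes);
            }
        }
        return ops.isEmpty() ? 0.0 : sum / ops.size();
    }

    // Eq. (6.2): average data-type cohesion over all message pairs used by x and y.
    static double co(Set<String> messagesOfX, Set<String> messagesOfY,
                     Map<String, Set<String>> messageToTypes) {
        double sum = 0.0;
        int pairs = 0;
        for (String m : messagesOfX) {
            for (String n : messagesOfY) {
                sum += cdt(messageToTypes.get(m), messageToTypes.get(n));
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : sum / pairs;
    }

    // Eq. (6.3): data types shared by both messages over shared plus unshared data types.
    static double cdt(Set<String> typesOfM, Set<String> typesOfN) {
        Set<String> common = new HashSet<>(typesOfM);
        common.retainAll(typesOfN);
        Set<String> union = new HashSet<>(typesOfM);
        union.addAll(typesOfN);
        int uncommon = union.size() - common.size();
        return union.isEmpty() ? 0.0 : (double) common.size() / (common.size() + uncommon);
    }

    public static void main(String[] args) {
        // The situation of Figure 6.5b: two operations, two messages sharing type2 and type3.
        Map<String, Set<String>> opToMessages = Map.of(
                "operation1", Set.of("message1"),
                "operation2", Set.of("message2"));
        Map<String, Set<String>> messageToTypes = Map.of(
                "message1", Set.of("type1", "type2", "type3"),
                "message2", Set.of("type4", "type2", "type3"));
        System.out.println("DTC = " + dtc(opToMessages, messageToTypes));
    }
}

Under this reading, the Figure 6.5b example yields Cdt(message1, message2) = 2/4 = 0.5 and hence a non-zero DTC, whereas a purely message-level metric would report no cohesion between the two operations.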
To investigate quantitatively the change-proneness of web APIs with low internal cohesion we answer the following research question: • RQ3: To which extent does the DTC metric highlight change-prone WSDL APIs? Which data types declared in a WSDL API are more changeprone? Similar to RQ2 we investigate the change-proneness on two different levels: interface level and data-type level. The results from RQ3 are useful for both, providers and consumers, interested in measuring the internal cohesion in order to highlight change-prone web APIs and change-prone data types declared by them. 6.2.4 Research Approach To answer our research questions we adopt a mixed method approach [Creswell and Clark, 2010]. First, we answer RQ1 and RQ2 with a qualitative analysis consisting of an online survey. Then, following an exploratory sequential approach [Creswell and Clark, 2010], we refine the results from RQ2 with a quantitative analysis aimed at answering RQ3. Note, we do not quantitatively refine the results from RQ1 because the needed information ( i.e., how consumers invoke web APIs) are not available. We present the study, the analysis methods and the results of the qualitative and quantitative analyses respectively in Section 6.3 and Section 6.4. 6.3. Online Survey 6.3 109 Online Survey To answer the first two research questions RQ1 and RQ2, we performed an online survey consisting of three different parts. The first part of the survey introduces the terminology that might be used differently in academia and in industry. The next questions are on the background of participants. In particular, we asked information about their current position within their institutions/companies and their background in the areas involving web APIs (i.e., service-oriented, cloud computing, WSDL APIs, and RESTful APIs). In the second part, the questions are aimed at investigating the changeproneness of four SOA antipatterns (i.e., Fat, Multiservice, Tiny, and SandPile antipatterns). In this chapter, we focus on and report the results about the Fat antipattern (RQ1) and the Multiservice antipattern (RQ2). We do not report the results about the Tiny and the SandPile antipatterns because they are not symptoms of web APIs with low external and/or internal cohesion. In fact, they are symptoms of inadequate granularity [Moha et al., 2012] and subject of our future work. In the third and final part of the survey, we asked participants to share their experiences with other design practices that can affect the change-proneness of web APIs and have not been covered by the survey. They are meant to draw directions for future work on this subject. Then, we asked questions to assess their prior knowledge about the antipatterns presented in the survey. Before publishing our survey, we conducted three rounds of pilots with five software engineering researchers with a strong background in qualitative analyses. In each round we refined the survey questions and its structure based on their feedback. This step was necessary to attract participants in completing the survey. The complete survey is available on our website.3 In the following subsections, we first present information about the participants and their background and, then, we answer our research questions RQ1 and RQ2. For each research question we present the data used, the analysis method, and the results to answer it. 6.3.1 Participation Our survey was opened on July 1st, 2013 and closed on July 31st. 
We forwarded the survey to our industrial partners and academics working on research topics related to web API development. Moreover, we advertised it in google groups related to web APIs. During this time we collected responses from 79 participants among which 47 (59.5%) completed the entire survey an3 http://goo.gl/f0gi17 110 Chapter 6. Change-Prone Web APIs swering all questions. Given that participants needed to answer 36 questions, investing on average approximately 40 minutes of their time, we consider it a good number of participants and a high rate of completion [Smith et al., 2013]. Among the 79 participants 44 work in industry, 30 in academia, and 5 in both academia and industry. Participants rated their background on a 5-point Likert scale ranging from absent, weak, medium, good, to strong. Participants, who answered the questions of the second and third part, have at least a good background in at least one of the following areas: service-oriented, cloud computing, WSDL APIs, and RESTful APIs. Few participants have an absent or weak background in any of these topics and quit the survey just after the background questions. Interestingly, 72.7% of the participants answered that they do not know any metric/quality indicators to estimate the change-proneness of web APIs. The most common indicators used by the remaining 27.3% are the response time and information about the changes (e.g., number of changes between two versions, number of operations changed, etc.). 6.3.2 External Cohesion and Change-Proneness To investigate the answer to RQ1, we analyzed the change-proneness of web APIs with low external cohesion on two different levels: interface level and method level. Interface Level Change-Proneness Focusing on the interface level, we asked the participants to rank six scenarios that can lead a Fat web API to be changed. As discussed in Section 6.2, this antipattern is a symptom of web APIs with low external cohesion. Table 6.1 shows the list of scenarios. We derived them from our frequent discussions with our industrial and academic partners and colleagues. Furthermore, we asked the participants to state additional scenarios in a text box. For each scenario, 53 participants ranked the likelihood on a 5-point Likert scale: 0 (Won’t change), 1 (Might change), 2 (Likely to change), 3 (Very likely to change), and 4 (Sure will change). We first used the non-parametric Kruskal-Wallis rank sum test [Kruskal and Wallis, 1952] to analyze whether there is a difference between the scenarios to cause changes. Kruskal-Wallis tests whether samples originate from the same distribution comparing three or more sets of scores (i.e., the values of the 5-point Likert scale) that come from different groups (i.e., the different scenarios). We used the non-parametric Kruskal-Wallis test because the 6.3. Online Survey 111 Table 6.1: Scenarios that cause changes in Fat web APIs (i.e., web APIs with disjoint sets of operations that are invoked by different clients indicating low external cohesion). Id Fat1 Fat2 Fat3 Fat4 Fat5 Fat6 A Fat API is changed because ... its clients have troubles in understanding it. having a specific method for each client would introduce clones and the API would become hard to be maintained it is a bottleneck for the performance of the system. It should be split into APIs specific for each different client (i.e., Interface Segregation Principle [Martin, 2002]). different developers work on the specific functionalities for the different clients. 
Fat5: if the functional requirements of a client change, the other clients will be affected as well. Fat6: test cases for all clients should pass before the API can be deployed.

distributions of scores given by the participants are ordinal and non-normally distributed. Moreover, this test has been designed to compare three or more distributions, in contrast to the non-parametric Mann-Whitney test [Lehmann and D'Abrera, 1975], which compares two distributions. Performing the Kruskal-Wallis rank sum test on the scores given to the different scenarios resulted in a p-value < 0.01. This shows that the given scenarios cause changes to Fat web APIs with different probabilities. The distributions of the scores given to the different scenarios are reported in Figure 6.6.

Figure 6.6: Likelihood ranges from 0 (Won't change) to 4 (Sure will change) for the scenarios causing changes in Fat APIs listed in Table 6.1.

To analyze these probabilities, we ranked the scenarios by their median and mean values. According to this ranking, Fat2 is the most likely scenario with a median value of 3. This means that a Fat web API is very likely to be changed to reduce the amount of clones and ease maintainability. The second most likely scenario is Fat1. According to its median score of 2, a Fat web API is likely to be changed to improve its understandability for the clients. The other four scenarios have median values equal to 1, indicating that they might force a Fat web API to be changed. To conclude and answer the first part of RQ1, we can state that: 1) Fat web APIs are very likely to be changed to ease maintainability and reduce clones; 2) they are also likely to be changed to improve understandability.

Method Level Change-Proneness

Addressing the second part of RQ1, we focus on analyzing the change-proneness of methods declared in a web API with low external cohesion. As described in Section 6.2, Fat web APIs expose two classes of methods: 1) methods invoked by a specific client (i.e., non-shared methods) and 2) methods invoked by different clients (i.e., shared methods). In our survey, we asked the participants to state which class of methods is more likely to be changed. Out of 60 participants who answered this question, 33 found that shared methods are more likely to be changed while 27 found that non-shared methods are more change-prone.

In addition, we asked the participants to motivate their choice by filling in a text box. To analyze their motivations we manually clustered the answers into groups using the card sort technique [Barker]. This technique consists of sorting the cards (i.e., the provided motivations in our case) into meaningful groups and abstracting hierarchies to deduce general categories. We mined two frequent groups of answers from their motivations. On the one hand, 16 out of 33 participants found that shared methods are more likely to be changed because they have to satisfy multiple requirements from different clients. On the other hand, 12 out of 27 participants found that non-shared methods are more change-prone because changing them affects fewer clients. To conclude and answer the second part of RQ1, we note that the participants have two different views on the change-proneness of methods. Even though their opinions differ, they do not conflict and they provide two useful insights:
Online Survey 113 • shared methods are changed when the requirements of their different clients evolve differently. • otherwise developers tend to change non-shared methods because the impact of a change is lower. 6.3.3 Internal Cohesion and Change-Proneness Similar to RQ1, we answer RQ2 analyzing the change-proneness of web APIs with low internal cohesion on two different levels: interface level and datatype level. Interface Level Change-Proneness The first part of RQ2 aims at investigating scenarios that can cause changes in Multiservice web APIs. As discussed in Section 6.1, this antipattern is a symptom of web APIs with low internal cohesion. Similar to before, we provided the participants with seven scenarios to be ranked on the same 5-point Likert scale. Table 6.2 lists the seven scenarios stemming from discussions with our industrial and academic partners. Furthermore, we asked them to state additional scenarios in a text box. 51 participants ranked these scenarios. To analyze the results we followed the same approach used for analyzing the Fat web APIs’ results before. First, we used the Kruskal-Wallis rank sum test to verify whether there is a statistical difference between the distributions of scores given to the different scenarios. The test resulted in a p-value<0.01 indicating that these scenarios cause changes to Multiservice web APIs with different probabilities. Then, we ranked the scenarios based on the median and mean values of their scores. The distributions of the scores given to the different scenarios are reported in Figure 6.7. The ranking shows that a Multiservice web API is very likely to be changed because of the different entities encapsulated by these web APIs (MS1 ). These changes can affect different clients even though they are not interested in the changed entity (MS2 ). Multiservice web APIs are also very likely to be changed to improve their understandability (MS7 ). Furthermore, the scenarios MS3 , MS4 , MS5 and MS6 are likely to cause changes. To conclude and answer the first part of RQ2, we can state that Multiservice web APIs are very likely to be changed because: 1) every time it changes many clients are affected (MS2 ); 2) the web API can change for different reasons caused by the different entities (MS1 ); and 3) understanding the web API is complicated for its clients (MS7 ). 114 Chapter 6. Change-Prone Web APIs Table 6.2: Scenarios that cause changes in Multiservice web APIs (i.e., APIs that expose many operations that are related to different business entities). Id MS1 MS2 MS3 MS4 MS5 MS6 MS7 A Multiservice API is changed because ... every business entity can change for different reasons (e.g., different evolving requirements). A new version should be published every time one of these entities changes. changes to the API affect many clients (even though they do not use the changed business entity). all the tests involving the different entities should pass before the entire web API is deployed. the number of invocations to the Multiservice web API is high due to the different business entities. proper pool tuning techniques are needed to achieve adequate performance due to the numerous clients. different developers work on different business entities. many business entities are exposed complicating the understanding of the API. Data-Type Level Change-Proneness Addressing the second part of RQ2, we focus on analyzing the change-proneness of data types declared in Multiservice web APIs with low internal cohesion. 
In our survey, we asked participants to select which of the two classes of data types is more change-prone: shared data types (referenced more than once in a web API) or non-shared data types (referenced only once). Out of 48 participants who answered this question, 30 (62.5%) found that non-shared data types are more likely to be changed, while 18 (37.5%) found that shared data types are more change-prone.

Figure 6.7: Likelihood ranges from 0 (Won't change) to 4 (Sure will change) for the scenarios causing changes in Multiservice APIs listed in Table 6.2.

In addition, participants motivated their answers by filling in a text box. Applying the card sort technique, we manually clustered their motivations into two common groups of answers. On the one hand, 12 out of 18 participants stated that shared data types are likely to be changed because they are used by different messages and/or data types that can force them to change. In other words, they have multiple causes to change. On the other hand, 8 out of 30 participants stated that non-shared data types are more change-prone because developers prefer to share stable data types that represent generic business abstractions. Similarly to the change-proneness of methods, the participants have two different opinions. However, they do not conflict and give two relevant insights into the change-proneness of data types:

• shared data types are changed when their operations evolve differently.
• otherwise developers tend to change non-shared data types because the impact of a change is lower.

6.4 Quantitative Analysis

The goal of the quantitative analysis is to provide an answer to RQ3, and consequently refine the results from RQ2. To reach this goal, we analyzed the correlation between the DTC cohesion metric and the number of changes performed in the different versions of ten public WSDL APIs. Table 6.3 lists the selected WSDL APIs from Amazon (http://aws.amazon.com), eBay (http://developer.ebay.com), and FedEx (http://www.fedex.com/us/web-services) with their basic characteristics. WSDL is a standard interface description language used by many service-oriented systems to describe the functionality offered by a web API. We selected these WSDL APIs because they have sufficiently long histories, as indicated by the increase in the number of operations and data types. Furthermore, they have been used and discussed in similar studies in prior research [Fokaefs et al., 2011]. Even though a bigger data set is desirable, having access to WSDL APIs with long histories is not a trivial task. Most of them are used in a closed environment allowing access only to registered customers.

Table 6.3: WSDL APIs selected for the quantitative analysis showing the name (WSDL_API), the number of versions (Vers), the number of operations in the first and last versions (Ops), and the number of data types in the first and last version (Types).

WSDL_API                      Vers   Ops       Types
AmazonEC2                     22     14-118    60-463
AmazonFPS                     3      29-27     19-18
AmazonQueueService            4      8-15      26-51
AWSECommerceService           5      23-23     35-35
AWSMechanicalTurkRequester    6      40-44     86-102
eBay                          5      156-156   897-902
FedExPackageMovement          4      2-2       15-15
FedExRateService              11     1-1       43-140
FedExShipService              8      1-7       74-166
FedExTrackService             5      3-4       29-33
6.4.1 Interface Level Change-Proneness of WSDL APIs

For analyzing the change-proneness of the selected WSDL APIs, we first computed the values of the DTC metric for each version of each WSDL API. Next, we extracted the changes between each pair of subsequent versions of a WSDL API. The changes were extracted using our WSDLDiff tool (presented in Chapter 4), which loads the specifications of two versions of a WSDL API and compares them using the differencing algorithm provided by the Eclipse EMF Compare plugin (http://www.eclipse.org/modeling/emf/). In particular, WSDLDiff extracts the types of the elements affected by changes (e.g., Operation, Message, Data Type) and the types of changes (e.g., removal, addition, move, attribute value update). With this, WSDLDiff is capable of extracting changes such as "a message has been added to an operation" or "the name of an attribute in a data type has been modified". We refer to these changes as fine-grained changes. Using WSDLDiff, for each version of a WSDL API we counted the number of fine-grained changes that occurred between the current and the previous version.

We used the Spearman rank correlation for computing the correlation between the values of the DTC metric and the number of changes. Spearman compares the ordered ranks of two variables to measure a monotonic relationship. We chose the Spearman correlation because it does not make assumptions about the distribution, the variances, and the type of the relationship [S.Weardon and Chilko, 2004]. A Spearman value (i.e., rho) of +1 or -1 indicates a high positive or high negative correlation, whereas 0 indicates that the variables under analysis do not correlate at all. Values greater than +0.3 or lower than -0.3 indicate a moderate correlation; values greater than +0.5 or lower than -0.5 are considered to be strong correlations [Hopkins, 2000].

The result of the Spearman correlation analysis shows that the DTC metric has a significant and moderate negative correlation, with a rho value equal to -0.361 (i.e., rho < -0.3) and a p-value equal to 0.007. Moreover, we computed the values of the existing message-level metrics (i.e., LCOS, SFCI, and SIDC) on the same WSDL APIs. We found that their values are always 0 or 1. For instance, the value of LCOS is 1 in 62 out of 73 versions and 0 in 11 versions. Manually analyzing the WSDL APIs, we noticed that this is due to their design. As shown by the example in Figure 6.5b, messages reference different data types that are used as wrappers to isolate the data type declarations from the declarations of the operations and messages. For the WSDL APIs under analysis, this result confirms that the existing metrics suffer from the problem explained in Section 6.2 and discussed in previous work [Bansiya and Davis, 2002]. We can conclude that DTC shows a moderate correlation, indicating that an increase in the internal cohesion is associated with a decrease in the number of changes.

6.4.2 Data-Type Level Change-Proneness of WSDL APIs

To detail these results, we investigated the change-proneness of shared and non-shared data types. For each data type in each version, we computed the number of times it is referenced in the WSDL API and the number of changes as extracted by our WSDLDiff tool. Next, we used Spearman to compute the correlation between these two metrics. Table 6.4 presents the results of this analysis.
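As an illustration of the correlation analysis described above, the following sketch computes the Spearman rank correlation with the SpearmansCorrelation class of Apache Commons Math. The DTC values and change counts below are made-up placeholders, not the values measured on the ten WSDL APIs.

```java
import org.apache.commons.math3.stat.correlation.SpearmansCorrelation;

public class DtcChangeCorrelation {
    public static void main(String[] args) {
        // One entry per WSDL API version: the DTC value of that version and the number of
        // fine-grained changes (as extracted by a tool such as WSDLDiff) to the next version.
        double[] dtc     = {0.62, 0.48, 0.55, 0.30, 0.71, 0.40};
        double[] changes = {   4,   12,    7,   25,    2,   15};

        double rho = new SpearmansCorrelation().correlation(dtc, changes);
        System.out.printf("Spearman rho = %.3f%n", rho); // a negative rho: higher cohesion, fewer changes
    }
}
```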
Looking at the p-values of the correlation analysis, we note that significant results were obtained for 5 WSDL APIs (i.e., p-value < 0.01). Among them, the values for three WSDL APIs show a strong correlation (i.e., rho < -0.5) and for one WSDL API they show a moderate correlation (i.e., rho < -0.3). These correlations indicate that the more a data type is referenced, the less change-prone it is.

Table 6.4: Results of the Spearman correlation analysis between the number of references and the number of changes of data types. Significant correlations (p-value < 0.01) are marked with an asterisk.

WSDL                          p-value   rho
AmazonEC2                     0.248     -0.048
AmazonFPS                     0.612     -0.104
AmazonQueueService            0.301     -0.130
AWSECommerceService           0.089      0.291
AWSMechanicalTurkRequester    0.000     -0.502 *
eBay                          0.638     -0.015
FedExPackageMovement          0.005     -0.512 *
FedExRateService              0.000     -0.418 *
FedExShipService              0.000      0.193 *
FedExTrackService             0.000     -0.559 *

Manually analyzing a sample set of shared data types, we found that they represent generic business entities or satellite data used by operations of the same domain. Hence, we assume that their requirements do not evolve differently. For instance, ClientDetail in FedExShipService is a shared data type referenced directly and indirectly on average 9 times by shipment operations that require information about the client. This data type encapsulates descriptive data about clients and it did not change across the releases. This result partially confirms the results of our survey, namely: shared data types are change-prone if referenced by operations with different requirements; otherwise, developers tend to change non-shared data types.

Based on these results, we can answer RQ3 stating that the DTC metric is able to highlight change-prone WSDL APIs. Moreover, we can partially confirm the insights of our participants about change-prone data types. However, to fully validate this result a bigger data set is needed. An ideal data set would consist of several WSDL APIs with long histories and from different domains or companies. This is needed to avoid that the results are WSDL or company specific. Unfortunately, as already discussed, getting access to these artifacts is challenging.

6.5 Discussion

In this section we summarize the results of our study and discuss the implications of the results and the threats to validity.

6.5.1 Summary of the Results

Summarizing the findings of our study, we found that Fat web APIs are very likely to be improved to reduce clones and ease maintainability, and they are likely to be changed to improve understandability (RQ1). Multiservice APIs are very likely to be improved because such a web API declares different business entities and a change in one entity typically affects all the clients. Similar to Fat web APIs, Multiservice APIs are also affected by understandability issues (RQ2). Analyzing the change-proneness of methods and data types, we found that both shared methods and shared data types are likely to be changed if they are shared by clients and operations with different requirements (RQ1 and RQ2). For instance, if two clients with different requirements invoke the same operations, these operations change every time one of the two clients' requirements change. Hence, they are more change-prone. If modification tasks are not driven by clients' or operations' requirements, then developers tend to modify non-shared operations and non-shared data types to keep the impact of a change low (RQ1 and RQ2).
To compute the internal cohesion and making the results of RQ2 actionable, useful metrics are needed [Bouwers et al., 2013]. This led us to introduce the DTC metric and to investigate its ability to highlight change-prone WSDL APIs. The quantitative study showed that DTC is able to highlight change-prone WSDL APIs. Moreover, we partially confirmed our survey participants’ insight: shared data types are change prone if they are referenced by operations with different requirements, otherwise non-shared data types are more likely to be changed (RQ3). 6.5.2 Implications of the Results The results of this study are useful for web API providers, web API consumers, and software engineering researchers. Providers & Consumers. Both, web API providers and consumers, can benefit from a new internal cohesion metric (DTC) that overcomes the problem of the message-level metrics. Using DTC they can measure the internal cohesion to estimate the interface level change-proneness of WSDL APIs (RQ3). Based on the metric values, consumers can select and subscribe to the most stable web API that shows the best internal cohesion, thereby reducing the risk 120 Chapter 6. Change-Prone Web APIs to continuously update their clients to new web API versions. Providers can use DTC to identify the set of most change-prone web APIs (with low values for DTC) that should undergo a refactoring. For example, in case of a Multiservice API, the provider should consider splitting the API into different web APIs each one encapsulating a different business entity. Providers. Furthermore, based on the values of the external cohesion metric they can estimate the change-proneness considering the maintenance scenarios likely to cause changes as suggested by our study. For instance, they can measure the SIUC metric as proposed by Perepletchikov [Perepletchikov et al., 2007, 2010] and refactor the web APIs with low values for external cohesion potentially affected by the Fat antipattern. They should refactor these APIs applying the Interface Segregation Principle described by Martin [2002]. According to this principle, Fat APIs should be split into different APIs so that clients only have to know about the methods they are interested in. Researchers. The results of this study are also valuable input to software engineering researchers. In this study we showed the impact of low external and internal cohesion on change-proneness of web APIs. As next step, researchers should investigate techniques for refactoring these kinds of web APIs. For instance, to the best of our knowledge there are no approaches able to apply the Interface Segregation Principle to refactor Fat APIs. Such an approach should mine the usage of a web API’ clients and, based on it, output the ideal sub APIs. This task is particularly challenging if a web API is invoked differently by many different clients. In general, the results of this study are a precious input for researchers interested in investigating the change-proneness of web APIs. Each maintenance scenario that causes changes in web APIs should be further investigated to further assist web API providers. 6.5.3 Threats to Validity In this study threats to construct validity concern the set of selected scenarios that we used to investigate changes in web APIs. This set is not complete. To mitigate this threat, we asked the participants of the survey to provide additional scenarios. Only three participants provided further scenarios. Hence, we cannot draw any statistical conclusion. 
Based on this result, we believe that we provided a good first set of scenarios that can be extended in future studies. With respect to internal validity, the main threat is the possibility that the structure of the survey could have affected the answers of participants. We mitigated this threat by randomly changing the order of the scenarios for each 6.6. Related Work 121 participant. While this randomization worked for the scenarios, the threat stemming from the order of the questions in our survey remains - participants could have gained knowledge from answering the earlier questions that could have affected the answers to latter questions. The threats to external validity have been mitigated thanks to our participants who work on software systems from different domains (e.g., banking systems, mobile applications, telecommunication systems, financial systems). Moreover, 18 participants are employed in international consulting companies with expertise in a wide range of software systems. Moreover, with regards to the quantitative analysis the set of WSDLs APIs should be enlarged in our future work to improve the generalization of the results. However, accessing WSDLs APIs with a long history is not an easy task. In fact, most of them are used in a closed environment and allowing access only to registered clients. 6.6 Related Work We identify three areas of related work: change-proneness, stability of APIs, and analysis of web APIs. Change-proneness. Khoshgoftaar and Szabo [1994] and Li and Henry [1993] were among the first researchers to investigate the impact of software structures on change-proneness. Khoshgoftaar et al. trained a regression model and a neural network using size and complexity metrics to predict change-prone components. The results show that the neural network is a stronger predictive model compared to the multiple regression model. Li et al. used the C&K metrics to predict maintenance effort improving the performance of prediction models. Girba et al. [2004] defined the Yesterday’s weather approach to predict change-prone classes based on values of metrics and the analysis of their evolution. Di Penta et al. [2008] showed that classes participating in antipatterns are more or less change-prone depending on the role they play in the antipattern. Khomh et al. [2009] investigated the impact of code smells on the change-proneness of Java classes. Their results show that classes affected by code smells are more change-prone and specific smells are more correlated than others to change-proneness. Zhou et al. [2009] examined the confounding effect of class size on the associations between metrics and change-proneness. They show that the size of a class is a relevant confounding variable to take into account to estimate its changeproneness. These studies represent a subset of existing work (e.g., [Tsantalis et al., 2005; Elish and Al-Khiaty, 2013]) that underlines the importance of our research on providing indicators for highlighting change-prone software components. However, no study exists that investigates such indicators for 122 Chapter 6. Change-Prone Web APIs highlighting change-prone web APIs. Stability of APIs. The stability of APIs is a well known problem in the research community. A recent study by Vásquez et al. [2013] shows that changeprone APIs negatively impact the success of Android apps. This work does not provide indicators for change-prone APIs but it shows the relevance of assuring an adequate stability of APIs. Recently, Raemaekers et al. 
[2012] analyzed the stability of third parties libraries using four metrics to show how third parties libraries evolve. Hou and Yao [2011] analyzed the evolution of AWT/Swing APIs and their findings show that the majority of the changes is performed in the early versions. Dig and Johnson [2006] analyzed four frameworks and one library finding that on average 80% of the API breaking changes are due to refactoring. Even though these studies show the relevance of investigating the stability of APIs there are few studies proposing metrics as indicators of change-prone APIs. In our previous work presented in Chapter 2, we investigated such indicators for interfaces. In Chapter 3 we analyzed the impact of antipatterns on the change-proneness of Java APIs. The results show that APIs are more change-prone if they participate in ComplexClass, SpaghettiCode, and SwissArmyKnife antipatterns. In Chapter 2 we showed that the external cohesion is the best performing metric to highlight and predict change-prone Java interfaces in the analyzed systems. Those studies were on Java APIs while the focus of this chapter is on web APIs analyzing metrics and antipatterns specifically defined for web APIs. Analyses of web APIs. In Chapter 4 we analyzed the evolution of four WSDL APIs. We proposed the WSDLDiff tool to extract automatically finegrained changes and we showed that it helps consumers in highlighting the most frequent changes in WSDL APIs. A similar analysis was performed by Fokaefs et al. [2011] in 2011. They manually extracted the changes from the different versions of the WSDL APIs. Several antipatterns for web APIs have been proposed in literature, however, none of them has been investigated to indicate change-prone web APIs. Moha et al. [2012] proposed an approach for specifying and detecting web API antipatterns. In their work they provide a complete and concise description of the most popular antipatterns. Perepletchikov et al. [2007] proposed five cohesion metrics, but an empirical evaluation of them for indicating change-prone APIs is missing. In their later study [Perepletchikov et al., 2010] they proposed three additional cohesion metrics and a controlled study. The results from this study show that the proposed metrics can help in predicting the analyzability of web APIs early in the software development life cycle, but not their stability. Our work is complementary to this existing work. Starting from the external and internal 6.7. Concluding remarks 123 cohesion defined by Perepletchikov et al. and the antipatterns described in Section 6.1, we present a qualitative and quantitative study of using cohesion metrics to indicate the change-proneness of web APIs. 6.7 Concluding remarks Assuring an adequate level of stability of web APIs is one of the key factors for deploying successful distributed systems [Erl, 2007; Daigneau, 2011; Vásquez et al., 2013]. While consumers want to rely on stable web APIs in order to prevent continuous updates of their systems, providers want to publish high quality web APIs in order to prevent such updates and to stay successful on the market. Previous work has shown that the cohesion of an API is an indicator for understandability and stability [Perepletchikov et al., 2007, 2010]. In this chapter, we extended this research to web APIs and investigated the relationship between internal and external cohesion and stability, measured as change-proneness. 
We first presented an online survey to rank a number of typical maintenance scenarios to improve web APIs affected by the Multiservice and Fat antipatterns, both symptoms of web APIs with low internal and external cohesion. The results narrow down the many possible scenarios to two scenarios for Fat APIs and three scenarios for Multiservice APIs in which changes are very likely to occur. Focusing on internal cohesion, we detailed these results in a quantitative study with ten public available web APIs specified in WSDL. Results showed that the DTC metric is able to highlight change-prone WSDL APIs. The results of our studies also open several directions for future work. Specifically, the method level and the data-type level change-proneness needs to be further investigated to better classify change-prone methods and data types. Furthermore, we plan to analyze the impact of granularity on the change-proneness of web APIs, for instance with the SandPile and Tiny antipatterns (both symptoms of APIs with inadequate granularity) [Moha et al., 2012]. . 7. Refactoring Fat APIs Recent studies have shown that the violation of the Interface Segregation Principle (ISP) is critical for maintaining and evolving software systems. Fat interfaces (i.e., interfaces violating the ISP) change more frequently and degrade the quality of the components coupled to them. According to the ISP the interfaces’ design should force no client to depend on methods it does not invoke. Fat interfaces should be split into smaller interfaces exposing only the methods invoked by groups of clients. However, applying the ISP is a challenging task when fat interfaces are invoked differently by many clients. In this chapter, we formulate the problem of applying the ISP as a multiobjective clustering problem and we propose a genetic algorithm to solve it. We evaluate the capability of the proposed genetic algorithm with 42,318 public Java APIs whose clients’ usage has been mined from the Maven repository. The results of this study show that the genetic algorithm outperforms other search based approaches (i.e., random and simulated annealing approaches) in splitting the APIs according to the ISP.1 7.1 7.2 7.3 7.4 7.5 7.6 7.7 Problem Statement and Solution Genetic Algorithm . . . . . . . . . Random and Local Search . . . . Study . . . . . . . . . . . . . . . . . Threats to Validity . . . . . . . . . Related Work . . . . . . . . . . . . Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 131 135 137 146 147 148 When designing interfaces developers should refactor fat interfaces [Martin, 2002]. Fat interfaces are interfaces whose clients invoke different subsets of their methods. Such interfaces should be split into smaller interfaces each one specific for a different client (or a group of clients). This principle has 1 This chapter was published in the in the 30th International Conference on Software Maintenance and Evolution (ICSME 2014) [Romano et al., 2014]. 125 126 Chapter 7. Refactoring Fat APIs been formalized by Martin [Martin, 2002] in 2002 and is also known as the Interface Segregation Principle (ISP). The rationale behind this principle is that changes to an interface break its clients. 
As a consequence, clients should not be forced to depend upon interface methods that they do not actually invoke [Martin, 2002]. This guarantees that clients are affected by changes only if they involve the methods they invoke. Recent studies have shown that violation of the ISP and, hence, fat interfaces can be problematic for the maintenance of software systems. First, in Chapter 2 we showed that such interfaces are more change-prone than nonfat interfaces. Next, Abdeen et al. [Abdeen et al., 2013] proved that violations of the ISP lead to degraded cohesion of the components coupled to fat interfaces. Finally, Yamashita et al. [Yamashita and Moonen, 2013] showed that changes to fat interfaces result in a larger ripple effect. The results of these studies, together with Martin’s insights [Martin, 2002], show the relevance of designing and implementing interfaces according to the ISP. However, to the best of our knowledge, there are no studies that propose approaches to apply the ISP. This task is challenging when fat interfaces expose many methods and have many clients that invoke differently their methods, as shown in [Mendez et al., 2013]. In this case trying to manually infer the interfaces into which a fat interface should be split is unpractical and expensive. In this chapter, we define the problem of splitting fat interfaces according to the ISP as a multi-objective clustering optimization problem [Praditwong et al., 2011]. We measure the compliance with the ISP of an interface through the Interface Cohesion Metric (IUC). To apply the ISP we propose a multi-objective genetic algorithm that, based on the clients’ usage of a fat interface, infers the interfaces into which it should be split to conform to the ISP and, hence, with higher IUC values. To validate the capability of the proposed genetic algorithm we mine the clients’ usage of 42,318 public Java APIs from the Maven repository. For each API, we run the genetic algorithm to split the API into sub-APIs according to the ISP. We compare the capability of the genetic algorithm with the capability of other search-based approaches, namely a random algorithm and a multi-objective simulated annealing algorithm. The goal of this study is to answer the following research questions: Is the genetic algorithm able to split APIs into sub-APIs with higher IUC values? Does it outperform the random and simulated annealing approaches? The results show that the proposed genetic algorithm generates sub-APIs with higher IUC values and it outperforms the other search-based approaches. 7.1. Problem Statement and Solution 127 These results are relevant for software practitioners interested in applying the ISP. They can monitor how clients invoke their APIs (i.e., which methods are invoked by each client) and they can use this information to run the genetic algorithm and split their APIs so that they comply with the ISP. The remainder of this chapter is organized as follows. Section 7.1 introduces fat APIs, the main problems they suffer, and formulates the problem of applying the ISP as a multi-objective clustering problem. Section 7.2 presents the genetic algorithm to solve the multi-objective clustering problem. Section 7.3 presents the random and local search (i.e., simulated annealing) approaches implemented to evaluate the capability of the genetic algorithm. The study and its results are shown and discussed in Section 7.4 while threats to validity are discussed in Section 7.5. Related work is presented in Section 7.6. 
We draw our conclusions and outline directions for future work in Section 7.7.

7.1 Problem Statement and Solution

In this section, first, we introduce fat APIs, their drawbacks, and the Interface Segregation Principle to refactor them. Then, we discuss the challenges of applying the Interface Segregation Principle to real world APIs. Finally, we present our solution to automatically apply the principle.

7.1.1 Fat APIs and Interface Segregation Principle

The Interface Segregation Principle (ISP) was originally described by Martin [Martin, 2002] and copes with fat APIs. Fat APIs are APIs whose clients invoke different sets of their methods. As a consequence, clients depend on interface methods that they do not invoke. These APIs are problematic and should be refactored because their clients can be broken by changes to methods which they do not invoke. To refactor fat APIs Martin [Martin, 2002] introduced the ISP. The ISP states that fat APIs need to be split into smaller APIs (referred to as sub-APIs throughout this chapter) according to their clients' usage. Any client should only know about the set of methods that it invokes. Hence, each sub-API should reflect the usage of a specific client (or of a class of clients that invoke the same set of methods).

To better understand the ISP consider the example shown in Figure 7.1. The API shown in Figure 7.1a is considered a fat API because the different clients (i.e., Client1, Client2, and Client3) invoke different methods (e.g., Client1 invokes only method1, method2, and method3 out of the 10 methods declared in the API). According to the ISP, this API should be split into three sub-APIs as shown in Figure 7.1b. These sub-APIs are specific to the different clients (i.e., Client1, Client2, and Client3) and, as a consequence, clients no longer depend on interface methods they do not invoke.

Figure 7.1: An example of applying the Interface Segregation Principle. (a) A fat API declaring method1() through method10(), with different clients (i.e., Client1, Client2, and Client3) invoking different sets of methods: clients depend on methods which they do not invoke. (b) The fat API is split into sub-APIs, each one specific for a client (SubAPI1 with method1()-method3() for Client1, SubAPI2 with method4()-method6() for Client2, and SubAPI3 with method7()-method10() for Client3): clients depend only on methods which they invoke.

7.1.2 Fat APIs and Change-Proneness

Fat APIs are also problematic because they change frequently. In Chapter 2 we showed empirically that fat APIs are more change-prone compared to non-fat APIs. In this work we used external cohesion as a heuristic to detect fat APIs. The external cohesion was originally defined by Perepletchikov et al. [Perepletchikov et al., 2007, 2010] for web APIs and it measures the extent to which the methods declared in an API are used by their clients. An API is considered externally cohesive if all clients invoke all methods of the API. It is not externally cohesive and is considered a fat API if the clients invoke different subsets of its methods. To measure the external cohesion we used the Interface Usage Cohesion metric (IUC) defined by Perepletchikov et al. [Perepletchikov et al., 2007, 2010]. This metric is defined as:

IUC(i) = \frac{\sum_{j=1}^{n} \frac{used\_methods(j, i)}{num\_methods(i)}}{n}
where j denotes a client of the API i; used_methods(j, i) is the function which computes the number of methods defined in i and used by the client j; num_methods(i) returns the total number of methods defined in i; and n denotes the number of clients of the API i. Note that the IUC values range between 0 and 1.

Consider the example shown in Figure 7.1. The FatAPI in Figure 7.1a shows a value of IUC_FatAPI = (3/10 + 3/10 + 4/10)/3 ≈ 0.33, indicating low external cohesion, which is a symptom of a fat API. The sub-APIs generated after applying the ISP (shown in Figure 7.1b) show higher external cohesion. They have the following values for IUC: IUC_SubAPI1 = (3/3)/1 = 1, IUC_SubAPI2 = (3/3)/1 = 1, and IUC_SubAPI3 = (4/4)/1 = 1.

In Chapter 2 we investigated to which extent the IUC metric can be used to highlight change-prone Java interface classes. The results showed that the IUC metric exhibits the strongest correlation with the number of source code changes performed in Java interface classes compared to other software metrics (e.g., the C&K metrics [Chidamber and Kemerer, 1994]). The IUC metric also improved the performance of prediction models in predicting change-prone Java interface classes. These results, together with Martin's insights [Martin, 2002] and the results of previous studies [Abdeen et al., 2013; Yamashita and Moonen, 2013], motivated us to investigate and develop an approach to refactor fat APIs using the ISP.

7.1.3 Problem

The problem an engineer can face in splitting a fat API is coping with API usage diversity. In 2013, Mendez et al. [Mendez et al., 2013] investigated how differently APIs are invoked by their clients. They provided empirical evidence that there is a significant usage diversity. For instance, they showed that Java's String API is used in 2,460 different ways by its clients. Clients do not invoke disjoint sets of methods (as shown in Figure 7.1a); rather, the sets of methods can overlap and can be significantly different. As a consequence, we argue that manually splitting fat APIs can be time consuming and error prone.

A first approach to find the sub-APIs consists in adopting brute-force search techniques. These techniques enumerate all possible sub-APIs and check whether they maximize the external cohesion and, hence, the value of the IUC metric. The problem with these approaches is that the number of possible sub-APIs can be prohibitively large, causing a combinatorial explosion. Imagine, for instance, adopting this approach to find the sub-APIs for the AmazonEC2 web API. This web API exposes 118 methods in version 23. The number of 20-combinations of the 118 methods in AmazonEC2 is equal to:

\binom{118}{20} = \frac{118!}{20! \, 98!} \approx 2 \times 10^{21}

This means that for evaluating all the sub-APIs with 20 methods the search would have to analyze at least 2 × 10^21 possible combinations, which can take several days on a standard PC. As a consequence, brute-force search techniques are not an adequate solution for this problem.

7.1.4 Solution

To overcome the aforementioned problems we formulate the problem of finding sub-APIs (i.e., applying the ISP) as a clustering optimization problem defined as follows. Given the set of n methods X = {X1, X2, ..., Xn} declared in a fat API, find the set of non-overlapping clusters of methods C = {C1, C2, ..., Ck} that maximizes IUC(C) and minimizes clusters(C), where IUC(C) computes the lowest IUC value of the clusters in C and clusters(C) computes the number of clusters.
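For illustration, the following sketch computes the IUC metric defined above and reproduces the values of the Figure 7.1 example (≈ 0.33 for the fat API and 1 for a sub-API). The class and method names (IucSketch, iuc) are illustrative assumptions, not part of the tooling used in this chapter.

```java
import java.util.*;

public class IucSketch {

    // IUC(i): average, over all clients, of the fraction of the API's methods each client invokes.
    static double iuc(Set<String> apiMethods, Collection<Set<String>> clientsUsage) {
        double sum = 0;
        for (Set<String> used : clientsUsage) {
            sum += (double) used.size() / apiMethods.size();  // used_methods(j, i) / num_methods(i)
        }
        return sum / clientsUsage.size();                      // divide by the number of clients n
    }

    public static void main(String[] args) {
        Set<String> fatApi = new TreeSet<>();
        for (int m = 1; m <= 10; m++) fatApi.add("method" + m);

        // The clients of Figure 7.1a invoke disjoint subsets of the fat API.
        List<Set<String>> clients = List.of(
                Set.of("method1", "method2", "method3"),               // Client1
                Set.of("method4", "method5", "method6"),               // Client2
                Set.of("method7", "method8", "method9", "method10"));  // Client3

        System.out.printf("IUC(FatAPI)  = %.2f%n", iuc(fatApi, clients)); // (0.3 + 0.3 + 0.4) / 3 = 0.33

        // After applying the ISP, each sub-API serves a single client and reaches IUC = 1.
        System.out.printf("IUC(SubAPI1) = %.2f%n",
                iuc(Set.of("method1", "method2", "method3"),
                    List.of(Set.of("method1", "method2", "method3"))));
    }
}
```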
In other words, we want to cluster the methods declared in a fat API into sub-APIs that show high external cohesion, measured through the IUC metric. This problem is an optimization problem with two objective functions, also known as multi-objective optimization problem. The first objective consists in maximizing the external cohesion of the clusters in C. Each cluster in C (i.e., a sub-API in our case) will have its own IUC value (like for the sub-APIs in Figure 7.1b). To maximize their IUC values we maximize the lowest IUC value measured through the objective function IUC(C). The second objective consists in minimizing the number of clusters (i.e., sub-APIs). This objective is necessary to avoid solutions containing as many clusters as there are methods declared in the fat API. If we assign each method to a different sub-API, all the sub-APIs would have an IUC value of 1, showing the highest external cohesion. However, such sub-APIs do not group together the methods invoked by the different groups of clients. Hence, the clients would depend on many sub-APIs each one exposing a single method. To solve this multi-objective clustering optimization problem we implemented a multi-objective genetic algorithm (presented in next section) that searches for the Pareto optimal solutions, namely solutions whose objective function values (i.e., IUC(C) and clusters(C) in our case) cannot be improved without degrading the other objective function values. 7.2. Genetic Algorithm 131 Moreover, to compare the performance of the genetic algorithm with random and local search approaches we implemented a random approach and a multi-objective simulated annealing approach that are presented in Section 7.3. 7.2 Genetic Algorithm To solve multi-objective optimization problems different algorithms have been proposed in literature (e.g., [Deb et al., 2000; Rudolph, 1998; Zitzler and Thiele, 1999]). In this chapter, we use the multi-objective genetic algorithm NSGA-II proposed by Deb et al. [Deb et al., 2000] to solve the problem of finding sub-APIs for fat APIs according to the ISP, as described in the previous section. We chose this algorithm because 1) it has been proved to be fast, 2) to provide better convergence for most multi-objective optimization problems, and 3) it has been widely used in solving search based software engineering problems, such as presented in [Deb et al., 2000; Yoo and Harman, 2007; Zhang et al., 2013; Li et al., 2013]. In the following, we first introduce the genetic algorithms. Then, we show our implementation of the NSGA-II used to solve our problem. For further details about the NSGA-II we refer to the work by Deb et al. [Deb et al., 2000]. Genetic Algorithms (GAs) have been used in a wide range of applications where optimization is required. Among all the applications, GAs have been widely studied to solve clustering problems [Hruschka et al., 2009]. The key idea of GAs is to mimic the process of natural selection providing a search heuristic to find solutions to optimization problems. A generic GA is shown in Figure 8.4. Different to other heuristics (e.g., Random Search, Brute-Force Search, and Local search) that consider one solution at a time, a GA starts with a set of candidate solutions, also known as population (step 1 in Figure 8.4). These solutions are randomly generated and they are referred to as chromosomes. Since the search is based upon many starting points, the likelihood to explore a wider area of the search space is higher than other searches. 
This feature reduces the likelihood to get stuck in a local optimum. Each solution is evaluated through a fitness function (or objective function) that measures how good a candidate solution is relatively to other candidate solutions (step 2). Solutions from the population are used to form new populations, also known as generations. This is achieved using the evolutionary operators. Specifically, first a pair of solutions (parents) is selected from the population through a selection operator (step 4). From these parents two offspring solutions are generated through the crossover operator (step 5). The crossover operator 132 Chapter 7. Refactoring Fat APIs Create initial population of chromosomes 1 2 Evaluate fitness of each chromosome 3 Max Iterations 4 Select next generation (Selection Operator) 5 Perform reproduction (Crossover operator) 6 Perform mutation (Mutation operators) 7 Output best chromosomes Figure 7.2: Different steps of a genetic algorithm. is responsible to generate offspring solutions that combine features from the two parents. To preserve the diversity, the mutation operators (step 6) mutate the offspring. These mutated solutions are added to the population replacing solutions with the worst fitness function values. This process of evolving the population is repeated until some condition (e.g., reaching the max number of iterations in step 3 or achieving the goal). Finally, the GA outputs the best solutions when the evolution process terminates (step 7). To implement the GA and adapt it to find the set of sub-APIs into which a fat API should be split we next define the fitness function, the chromosome (or solution) representation, and the evolutionary operators (i.e., selection, crossover, and mutation). 7.2.1 Chromosome representation To represent the chromosomes we use a label-based integer encoding widely adopted in literature [Hruschka et al., 2009] and shown in Figure 8.5. According to this encoding, a solution is an integer array of n positions, where n is the number of methods exposed in a fat API. Each position corresponds to a specific method (e.g., position 1 corresponds to the method method1() in Fig- 7.2. Genetic Algorithm 133 ure 7.1a). The integer values in the array represent the clusters (i.e., sub-APIs in our case) to which the methods belong. For instance in Figure 8.5, the methods 1,2, and 10 belong to the same cluster labeled with 1. Note that two chromosomes can be equivalent even though the clusters are labeled differently. For instance the chromosomes [1,1,1,1,2,2,2,2,3,3] and [2,2,2,2,3,3,3,3,1,1] are equivalent. To solve this problem we apply the renumbering procedure as shown in [Falkenauer, 1998] that transforms different labelings of equivalent chromosomes into a unique labeling. 1 2 3 4 5 6 7 8 9 10 1 1 2 3 2 4 5 3 6 1 Figure 7.3: Chromosome representation of our candidate solutions. 7.2.2 Fitness Functions The fitness function is a function that measures how good a solution is. For our problem we have two fitness functions corresponding to the two objective functions discussed in Section 7.1, namely IUC(C)) and clusters(C). IUC(C) returns the lowest IUC value of the clusters in C and clusters(C) returns the number of clusters in C. Hence, the two fitness functions are f1 =IUC(C) and f2 =clusters(C). While the value of f1 should be maximized the value of f2 should be minimized. 
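To illustrate the label-based encoding and the two fitness functions, the following sketch evaluates f1 (the lowest IUC value among the encoded sub-APIs) and f2 (the number of sub-APIs) for the example of Figure 7.1. It is a simplified stand-alone sketch, not the NSGA-II/JMetal implementation used in this chapter; in particular, it assumes that the IUC of a sub-API is computed only over the clients that invoke at least one of its methods.

```java
import java.util.*;

public class FitnessSketch {

    // f1: the lowest IUC value among the clusters (sub-APIs) encoded by the chromosome.
    // chromosome[m] = label of the cluster to which method m is assigned;
    // clientUsage[c][m] = true if client c invokes method m.
    static double minIuc(int[] chromosome, boolean[][] clientUsage) {
        Map<Integer, List<Integer>> byLabel = new HashMap<>();
        for (int m = 0; m < chromosome.length; m++)
            byLabel.computeIfAbsent(chromosome[m], k -> new ArrayList<>()).add(m);

        double min = 1.0;
        for (List<Integer> methods : byLabel.values()) {
            double sum = 0;
            int clientsOfCluster = 0;
            for (boolean[] client : clientUsage) {
                long used = methods.stream().filter(m -> client[m]).count();
                if (used > 0) {                                // only clients that invoke this sub-API
                    sum += (double) used / methods.size();
                    clientsOfCluster++;
                }
            }
            double iuc = clientsOfCluster == 0 ? 1.0 : sum / clientsOfCluster;
            min = Math.min(min, iuc);
        }
        return min;
    }

    // f2: the number of clusters (sub-APIs), to be minimized.
    static long clusters(int[] chromosome) {
        return Arrays.stream(chromosome).distinct().count();
    }

    public static void main(String[] args) {
        boolean[][] usage = {   // 3 clients x 10 methods, as in Figure 7.1a
            {true, true, true, false, false, false, false, false, false, false},
            {false, false, false, true, true, true, false, false, false, false},
            {false, false, false, false, false, false, true, true, true, true}};

        int[] fatApi = new int[10];                        // every method in a single cluster
        Arrays.fill(fatApi, 1);
        int[] split  = {1, 1, 1, 2, 2, 2, 3, 3, 3, 3};     // the ISP-compliant split

        System.out.println(minIuc(fatApi, usage) + " / " + clusters(fatApi)); // 0.33... / 1
        System.out.println(minIuc(split, usage)  + " / " + clusters(split));  // 1.0 / 3
    }
}
```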
Since we have two fitness functions, we need a comparator operator that, given two chromosomes (i.e., candidate solutions), returns the best one based on their fitness values. As comparator operator we use the dominance comparator as defined in NSGA-II. This comparator utilizes the idea of Pareto optimality and the concept of dominance for the comparison. Precisely, given two chromosomes A and B, the chromosome A dominates chromosome B (i.e., A is better than B) if 1) every fitness function value for chromosome A is equal or better than the corresponding fitness function value of the chromosome B, and 2) chromosome A has at least one fitness function value that is better than the corresponding fitness function value of the chromosome B. 7.2.3 The Selection Operator The selection operator selects two parents from a population according to their fitness function values. We use the Ranked Based Roulette Wheel (RBRW) that is a modified roulette wheel selection operator as proposed by Al Jadaan and Rajamani [2008]. RBRW ranks the chromosomes in the population by the fitness values: the highest rank is assigned to the chromosome with the best 134 Chapter 7. Refactoring Fat APIs fitness values. Hence, the best chromosomes have the highest probabilities to be selected as parents. 7.2.4 The Crossover Operator Once the GA has selected two parents (ParentA and ParentB) to generate the offspring, the crossover operator is applied to them with a probability Pc . As crossover operator we use the operator defined specifically for clustering problems by Hruschka et al. [2009]. In order to illustrate how this operator works consider the example shown in Figure 8.6 from [Hruschka et al., 2009]. The operator first selects randomly k (1≤k≤n) clusters from ParentA, where n is the number of clusters in ParentA. In our example assume that the clusters labeled 2 (consisting of methods 3, 5, and 10) and 3 (consisting of method 4) are selected from ParentA (marked red in Figure 8.6). The first child (ChildC) originally is created as copy of the second parent ParentB (step 1). As second step, the selected clusters (i.e., 2 and 3) are copied into ChildC. Copying these clusters changes the clusters 1, 2, and 3 in ChildC. These changed clusters are removed from ChildC (step 3) leaving the corresponding methods unallocated (labeled with 0). In the fourth step (not shown in Figure 8.6) the unallocated methods are allocated to an existing cluster that is randomly selected. The same procedure is followed to generate the second child ChildD. However, instead of selecting randomly k clusters from ParentB, the changed clusters of ChildC (i.e., 1,2, and 3) are copied into ChildD that is originally a copy of ParentA. 7.2.5 The Mutation Operators After obtaining the offspring population through the crossover operator, the offspring is mutated through the mutation operator with a probability Pm . This step is necessary to ensure genetic diversity from one generation to the next ones. The mutation is performed by randomly selecting one of the following cluster-oriented mutation operators [Falkenauer, 1998; Hruschka et al., 2009]: • split: a randomly selected cluster is split into two different clusters. The methods of the original cluster are randomly assigned to the generated clusters. • merge: moves all methods of a randomly selected cluster to another randomly selected cluster. • move: moves methods from one cluster to another. Both methods and clusters are randomly selected. 7.3. 
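As an illustration of the three cluster-oriented mutation operators listed above, the following sketch applies split, merge, and move to the label-based integer encoding. It is a simplified sketch under our own assumptions, not the JMetal-based implementation; the renumbering of labels after mutation is omitted.

```java
import java.util.*;

public class MutationSketch {
    static final Random RND = new Random();

    static int[] labels(int[] chromosome) {
        return Arrays.stream(chromosome).distinct().toArray();
    }

    // split: the methods of a randomly selected cluster are randomly reassigned
    // to either the original cluster or a newly created one.
    static void split(int[] c) {
        int[] ls = labels(c);
        int target = ls[RND.nextInt(ls.length)];
        int fresh = Arrays.stream(c).max().getAsInt() + 1;
        for (int i = 0; i < c.length; i++)
            if (c[i] == target && RND.nextBoolean()) c[i] = fresh;
    }

    // merge: moves all methods of a randomly selected cluster to another randomly selected cluster.
    static void merge(int[] c) {
        int[] ls = labels(c);
        if (ls.length < 2) return;
        int from = ls[RND.nextInt(ls.length)];
        int to;
        do { to = ls[RND.nextInt(ls.length)]; } while (to == from);
        for (int i = 0; i < c.length; i++)
            if (c[i] == from) c[i] = to;
    }

    // move: moves a randomly selected method to another randomly selected cluster.
    static void move(int[] c) {
        int[] ls = labels(c);
        if (ls.length < 2) return;
        int method = RND.nextInt(c.length);
        int to;
        do { to = ls[RND.nextInt(ls.length)]; } while (to == c[method]);
        c[method] = to;
    }
}
```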
Random and Local Search 135 ParentA 1 2 3 4 5 6 7 8 9 10 1 1 2 3 2 4 5 1 2 5 ParentB 1 2 3 4 5 6 7 8 9 10 4 2 1 2 3 3 2 1 2 4 1: copy ParentB into ChildC ChildC 4 2 1 2 3 3 2 1 2 4 2: copy clusters 2 and 3 from ParentA to ChildC ChildC 4 2 2 3 2 3 2 1 2 4 3: remove changed methods from B (i.e., 1,2,3) ChildC 4 0 2 3 2 0 0 0 2 4 4: unallocated objects are allocated to randomly selected clusters Figure 7.4: Example of crossover operator for clustering problems [Hruschka et al., 2009]. We implemented the proposed genetic algorithm on top of the JMetal2 framework that is a Java framework that provides state-of-the-art algorithms for optimization problems, including the NSGA-II algorithm. 7.3 Random and Local Search To better evaluate the performance of our proposed genetic algorithm we implemented a random algorithm and a local search algorithm (i.e., a multiobjective simulated annealing algorithm) that are presented in the following sub-sections. 7.3.1 Random Algorithm The random algorithm tries to find an optimal solution by generating random solutions. To implement the random algorithm we use the same solution representation (i.e., chromosome representation) used in the genetic algorithm described in Section 7.2. The algorithm iteratively generates a random solution and evaluates it using the same fitness functions defined for the genetic algorithm. When the maximum number of iterations is reached the best so2 http://jmetal.sourceforge.net 136 Chapter 7. Refactoring Fat APIs lution is output. This algorithm explores the search space randomly relying on the likelihood to find a good solution after a certain number of iterations. We use a random search as baseline because this comparison is considered the first step to evaluate a genetic algorithm [Sivanandam and Deepa, 2007]. 7.3.2 Multi-Objective Simulated Annealing As second step to evaluate the performance of our proposed genetic algorithm we implemented a local search approach. A local search algorithm (e.g., hill-climbing) starts from a candidate solution and then iteratively tries to improve it. Starting from a random generated solution the solution is mutated obtaining the neighbor solution. If the neighbor solution is better than the current solution (i.e., it has higher fitness function values) it is taken as current solution to generate a new neighbor solution. This process is repeated until the best solution is obtained or the maximum number of iterations is reached. The main problem of such local search approaches is that they can get stuck in a local optimum. In this case the local search approach cannot further improve the current solution. To mitigate this problem advanced local search approaches have been proposed like simulated annealing. The simulated annealing algorithm was inspired from the process of annealing in metallurgy. This process consists in heating and cooling a metal. Heating the metal alters its internal structure and, hence, its physical properties. On the other hand, when the metal cools down its new internal structure becomes fixed. The simulated annealing algorithm simulates this process. Initially the temperature is set high and then it is decreased slowly as the algorithm runs. While the temperature is high the algorithm is more likely to accept a neighbor solution that is worse than the current solution, reducing the likelihood to get stuck in a local optimum. At each iteration the temperature is slowly decreased by multiplying it by a cooling factor α where 0 < α < 1. 
When the temperature is reduced, worse neighbor solutions are accepted with a lower probability. Hence, at each iteration a neighbor solution is generated by mutating the current solution. If this solution has better fitness function values it is taken as the current solution. Otherwise it is accepted with a certain probability, called the acceptance probability. This acceptance probability is computed by a function based on 1) the difference between the fitness function values of the current and neighbor solutions and 2) the current temperature value.

To adapt this algorithm for solving our multi-objective optimization problem we implemented a Multi-Objective Simulated Annealing algorithm following the approach used by Shelburg et al. [2013]. To represent the solutions we use the same solution representation used in the genetic algorithm (i.e., label-based integer encoding). We generate the neighbor solutions using the mutation operators used in our genetic algorithm. We compare two solutions using the same fitness functions and dominance comparator of our genetic algorithm. The acceptance probability is computed as in [Shelburg et al., 2013] with the following function:

$$AcceptProb(i, j, temp) = e^{-\frac{|c(i, j)|}{temp}}$$

where $i$ and $j$ are the current and neighbor solutions, $temp$ is the current temperature, and $c(i, j)$ is a function that computes the difference between the fitness function values of the two solutions $i$ and $j$. This difference is computed as the average of the differences of each fitness function value of the two solutions according to the following equation:

$$c(i, j) = \frac{\sum_{k=1}^{|D|} \left( c_k(j) - c_k(i) \right)}{|D|}$$

where $D$ is the set of fitness functions and $c_k(j)$ is the value of the fitness function $k$ of the solution $j$. In our case the fitness functions are the IUC(C) and clusters(C) functions used in the genetic algorithm. Note that since this difference is computed as an average, it is important that the fitness function values are measured on the same scale. For this reason the values of the fitness function clusters(C) are normalized to the range between 0 and 1. For further details about multi-objective simulated annealing we refer to the work in [Nam and Park, 2000; Shelburg et al., 2013].
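As an illustration of the acceptance rule described above, the following is a small sketch that follows the two formulas; the method and variable names are ours and do not come from the thesis implementation. It assumes the fitness values of both solutions are already normalized to comparable scales.

```java
/**
 * Sketch of the acceptance step of a multi-objective simulated annealing
 * iteration, following the formulas above. Names are ours, not the thesis code.
 */
public class AcceptanceRule {

    /** Average difference c(i, j) over the fitness functions of two solutions. */
    static double averageDelta(double[] currentFitness, double[] neighborFitness) {
        double sum = 0.0;
        for (int k = 0; k < currentFitness.length; k++) {
            sum += neighborFitness[k] - currentFitness[k];
        }
        return sum / currentFitness.length;
    }

    /** AcceptProb(i, j, temp) = exp(-|c(i, j)| / temp). */
    static double acceptProb(double[] currentFitness, double[] neighborFitness, double temp) {
        return Math.exp(-Math.abs(averageDelta(currentFitness, neighborFitness)) / temp);
    }

    /** A worse neighbor is kept when a random draw falls below the acceptance probability. */
    static boolean acceptWorseNeighbor(double[] currentFitness, double[] neighborFitness,
                                       double temp, java.util.Random rnd) {
        return rnd.nextDouble() < acceptProb(currentFitness, neighborFitness, temp);
    }
}
```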
7.4 Study

The goal of this empirical study is to evaluate the effectiveness of our proposed genetic algorithm in applying the ISP to Java APIs. The quality focus is the ability of the genetic algorithm to split APIs into sub-APIs with higher external cohesion, which is measured through the IUC metric. The perspective is that of API providers interested in applying the ISP and in deploying APIs with high external cohesion. The context of this study consists of 42,318 public Java APIs mined from the Maven repository. In this study we answer the following research questions: Is the genetic algorithm able to split APIs into sub-APIs with higher IUC values? Does it outperform the random and simulated annealing approaches?

In the following, first, we show the process we used to extract the APIs and their clients' usage from the Maven repository. Then, we show the procedure we followed to calibrate the genetic algorithm and the simulated annealing algorithm. Finally, we present and discuss the results of our study.

7.4.1 Data Extraction

The public APIs under analysis and their clients' usage have been retrieved from the Maven repository (http://search.maven.org). The Maven repository is a publicly available data set containing 144,934 binary jar files of 22,205 different open-source Java libraries, which is described in more detail in [Raemaekers et al., 2013]. Each binary jar file has been scanned to mine method calls using the ASM Java bytecode manipulation and analysis framework (http://asm.ow2.org). The data set was processed on the DAS-3 supercomputer (http://www.cs.vu.nl/das3/) consisting of 100 computing nodes. To extract method calls we scanned all .class files of all jar files. Class files contain fully qualified references to the methods they call, meaning that the complete package name, class name, and method name of the called method is available in each .class file. For each binary file, we use an ASM bytecode visitor to extract the package, class, and method name of the callee.
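To illustrate this extraction step, the sketch below shows how an ASM class visitor could record the fully qualified callees found in a single .class file. It is a simplified illustration under our own naming, not the actual tool used to process the Maven data set.

```java
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Collects "owner.name" identifiers of all methods called in one .class file. */
public class CalleeCollector extends ClassVisitor {
    private final List<String> callees = new ArrayList<>();

    public CalleeCollector() {
        super(Opcodes.ASM5);
    }

    @Override
    public MethodVisitor visitMethod(int access, String name, String descriptor,
                                     String signature, String[] exceptions) {
        // Visit every method body and record each method call instruction.
        return new MethodVisitor(Opcodes.ASM5) {
            @Override
            public void visitMethodInsn(int opcode, String owner, String name,
                                        String descriptor, boolean isInterface) {
                callees.add(owner.replace('/', '.') + "." + name);
            }
        };
    }

    /** Parses the bytecode of one class and returns the called methods. */
    public static List<String> extract(InputStream classFile) throws IOException {
        CalleeCollector collector = new CalleeCollector();
        new ClassReader(classFile).accept(collector, ClassReader.SKIP_DEBUG);
        return collector.callees;
    }
}
```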
Once we extracted all calls from all .class files, we grouped together calls to the same API. As clients of an API we considered all classes declared in other jar files from the Maven repository that invoke public methods of that API. Note that different versions of the same class are considered different for both clients and APIs. Hence, if there are two classes with the same name belonging to two different versions of a jar file, they are considered different. To infer which version of a jar file a method call belongs to, we scanned the Maven build file (pom.xml) for dependency declarations.

In total we extracted the clients' usage for 110,195 public APIs stored in the Maven repository. We filtered out APIs not relevant for our analysis by applying the following filters:

• APIs should declare at least two methods.

• APIs should have more than one client.

• The IUC value of the APIs should be less than one.

After filtering out non-relevant APIs we ended up with a data set of 42,318 public APIs whose numbers of clients, methods, and invocations are shown by the box plots in Figure 7.5, where outliers have been removed for the sake of simplicity. The median number of methods exposed in the APIs under analysis is 4, while the biggest API exposes 370 methods. The median number of clients is 10, with a maximum of 102,445 (clients of the API org.apache.commons.lang.builder.EqualsBuilder; not shown in Figure 7.5). The median number of invocations to the APIs is 17, with a maximum of 270,569 (invocations to the API org.apache.commons.lang.builder.EqualsBuilder; not shown in Figure 7.5).

Figure 7.5: Box plots of the number of methods (#Methods), clients (#Clients), and invocations (#Invocations) for the public APIs under analysis. Outliers have been removed for the sake of simplicity.

7.4.2 GA and SA Calibration

To calibrate the GA and SA algorithms we followed a trial-and-error procedure with 10 toy examples. Each toy example consists of an API with 10 methods and 4 clients. For each of the 10 toy examples we changed the clients' usage. Then, we evaluated the IUC values output by the algorithms with different parameters. For each different parameter, we ran the algorithms ten times. We used the Mann-Whitney and Cliff's Delta tests to evaluate the difference between the IUC values output by each run. For the GA we evaluated the output with the following parameters:

• the population size was incremented stepwise by 10 from 10 to 200 individuals.

• the number of iterations was incremented stepwise by 1,000 from 1,000 to 10,000.

• the crossover and mutation probability were increased stepwise by 0.1 from 0.0 to 1.0.

We noticed a slower convergence of the GA only when the population size was less than 50, the number of iterations was less than 1,000, and the crossover and mutation probability was less than 0.7. Hence, we decided to use the default values specified in JMetal (i.e., a population of 100 individuals, 10,000 iterations, and a crossover and mutation probability of 0.9). Similarly, the output of the SA algorithm was evaluated with different values for the cooling factor. The cooling factor was incremented stepwise by 0.1 from 0.1 to 1.0. We did not register any statistically significant difference and we chose a starting temperature of 0.0003 and a cooling factor of 0.99965, as proposed in [Shelburg et al., 2013]. The number of iterations for the SA and RND algorithms is 10,000 to have a fair comparison with the GA.

7.4.3 Results

To answer our research questions, first, we compute the IUC value for each public API using the extracted invocations. We refer to this value as IUCbefore. Then, we run the genetic algorithm (GA), the simulated annealing algorithm (SA), and the random algorithm (RND) with the same number of iterations (i.e., 10,000). For each API under analysis, these algorithms output the set of sub-APIs into which the API should be split. Each sub-API will show a different IUC value. Among these sub-APIs we take the sub-API with the lowest IUC value, to which we refer as IUCafter. We chose the lowest IUC value because this gives us the lower boundary for the IUC values of the resulting sub-APIs.

Figure 7.6 shows the distributions of IUCafter values and the number of sub-APIs output by the different algorithms. The box plots in Figure 7.6a show that all the search-based algorithms produced sub-APIs with higher IUCafter values compared to the original APIs (ORI). The genetic algorithm (GA) produced sub-APIs that have higher IUCafter values than the original APIs (ORI) and the sub-APIs generated by the simulated annealing algorithm (SA) and by the random algorithm (RND). The second best algorithm is the random algorithm, which outperforms the simulated annealing. The higher IUCafter values of the genetic algorithm are associated with a higher number of sub-APIs, as shown in Figure 7.6b. These box plots show that the median number of sub-APIs is 2 for the genetic algorithm and the random algorithm. The simulated annealing generated a median number of 1 API, meaning that in 50% of the cases it kept the original API without being able to split it. We believe that the poor performance of the simulated annealing is due to its nature: even though it is an advanced local search approach, it is still a local search approach that can get stuck in a local optimum. To give a better view of the IUC values of the sub-APIs, we show the distributions of IUC values measured on the sub-APIs generated by the genetic algorithm in Figure 7.7. Min represents the distribution of IUC values of sub-APIs with the lowest IUC (i.e., IUCafter). Max represents the distribution of IUC values of sub-APIs with the highest IUC. Q1, Q2, and Q3 represent respectively the first, second, and third quartiles of the ordered set of IUC values of the sub-APIs.
Figure 7.6: IUC values and number of sub-APIs generated by the different search-based algorithms. (a) Box plots of IUC values measured on the original APIs (ORI) and IUCafter measured on the sub-APIs output by the genetic algorithm (GA), by the simulated annealing algorithm (SA), and by the random algorithm (RND). (b) Number of sub-APIs generated by the genetic algorithm (GA), the simulated annealing algorithm (SA), and the random algorithm (RND).

Figure 7.7: Box plots of IUC values measured on the sub-APIs output by the genetic algorithm (min, Q1, Q2, Q3, max). Outliers have been removed for the sake of simplicity.

The box plots in Figure 7.6 already give insights into the capability of the different search-based algorithms of applying the ISP. To provide statistical evidence of their capability we compute the difference between the distributions of IUCbefore and IUCafter generated by the different algorithms using the paired Mann-Whitney test [Mann and Whitney, 1947] and the paired Cliff's Delta d effect size [Grissom and Kim, 2005]. First, we use the Mann-Whitney test to analyze whether there is a significant difference between the distributions of IUCbefore and IUCafter. Significant differences are indicated by Mann-Whitney p-values ≤ 0.01. Then, we use the Cliff's Delta effect size to measure the magnitude of the difference. Cliff's Delta estimates the probability that a value selected from one group is greater than a value selected from the other group. Cliff's Delta ranges between +1, if all selected values from one group are higher than the selected values in the other group, and -1, if the reverse is true. A value of 0 expresses two overlapping distributions. The effect size is considered negligible for d < 0.147, small for 0.147 ≤ d < 0.33, medium for 0.33 ≤ d < 0.47, and large for d ≥ 0.47 [Grissom and Kim, 2005]. We chose the Mann-Whitney test and Cliff's Delta effect size because the distributions of IUC values are not normally distributed, as shown by the results of the Shapiro test. The Mann-Whitney test and Cliff's Delta effect size are suitable for non-normal distributions because they do not require assumptions about the variances and the types of the distributions (i.e., they are non-parametric tests).
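For reference, the following sketch shows how the (unpaired) Cliff's Delta and the magnitude thresholds above can be computed; the paired variant used in the study compares matched observations, but the underlying dominance idea is the same. The class name is ours.

```java
/**
 * Sketch of Cliff's Delta for two samples: the probability that a value from
 * one group exceeds a value from the other, minus the reverse probability.
 */
public final class CliffsDelta {

    static double compute(double[] groupA, double[] groupB) {
        long greater = 0;
        long smaller = 0;
        for (double a : groupA) {
            for (double b : groupB) {
                if (a > b) greater++;
                else if (a < b) smaller++;
            }
        }
        return (greater - smaller) / (double) (groupA.length * (long) groupB.length);
    }

    /** Thresholds reported by Grissom and Kim [2005], as used in the text. */
    static String magnitude(double d) {
        double abs = Math.abs(d);
        if (abs < 0.147) return "negligible";
        if (abs < 0.33)  return "small";
        if (abs < 0.47)  return "medium";
        return "large";
    }
}
```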
The results of the Mann-Whitney test and Cliff's Delta effect size are shown in Table 7.1.

Table 7.1: Mann-Whitney p-value (M-W p-value) and Cliff's delta between the distributions of IUCafter values measured on the sub-APIs generated by the genetic algorithm and measured on the original APIs (i.e., GA vs ORI) and on the sub-APIs generated by the simulated annealing (i.e., GA vs SA) and random algorithm (i.e., GA vs RND). The table reports the results for all the APIs under analysis (i.e., ALL) and for APIs with more than 2 methods (i.e., #Methods>2).

            | APIs       | M-W p-value | Cliff's delta | Magnitude
  GA vs ORI | ALL        | <2.20E-16   | 0.732         | large
            | #Methods>2 | <2.20E-16   | 1             | large
  GA vs SA  | ALL        | <2.20E-16   | 0.705         | large
            | #Methods>2 | <2.20E-16   | 0.962         | large
  GA vs RND | ALL        | <2.20E-16   | 0.339         | medium
            | #Methods>2 | <2.20E-16   | 0.463         | medium

The distribution of IUCafter values measured on the sub-APIs generated by the genetic algorithm is statistically different (M-W p-value < 2.20E-16) from the original IUC values (GA vs ORI). The Cliff's Delta is 0.732 if we consider all the APIs (ALL) and 1 if we consider only APIs with more than 2 methods (#Methods>2). In both cases the Cliff's delta is greater than 0.47 and, hence, the effect size is considered large. We obtained similar results comparing the distributions of IUCafter values of the sub-APIs generated by the genetic algorithm and by the simulated annealing algorithm (GA vs SA). The Mann-Whitney p-value is <2.20E-16 and the Cliff's delta is large (i.e., 0.705 for ALL and 0.962 for #Methods>2). The distributions of IUCafter values of the genetic algorithm and the random algorithm (GA vs RND) are also statistically different (M-W p-value < 2.20E-16). The effect size is medium (i.e., 0.339 for ALL and 0.463 for #Methods>2).

Moreover, from the results shown in Table 7.1 we notice that the Cliff's delta effect size is always greater when we consider only APIs with more than two methods. This result suggests that the effectiveness of the genetic algorithm, random algorithm, and simulated annealing algorithm might depend on the number of methods declared in the APIs, the number of clients, and the number of invocations. To investigate whether these variables have any impact on the effectiveness of the algorithms, we analyze the Cliff's Delta for APIs with increasing numbers of methods, clients, and invocations. First, we partition the data set by grouping together APIs with the same number of methods. Then, we compute the Cliff's Delta between the distributions of IUCbefore and IUCafter for each different group. Finally, we use the paired Spearman correlation test to investigate the correlation between the Cliff's Delta measured on the different groups and their number of methods. We use the same method to analyze the correlation between the Cliff's Delta and the number of clients and invocations. The Spearman test compares the ordered ranks of the variables to measure a monotonic relationship. We chose the Spearman correlation because the distributions under analysis are non-normal (normality has been tested with the Shapiro test). The Spearman test is a non-parametric test and, hence, it does not make assumptions about the distribution, the variances, and the type of the relationship [S.Weardon and Chilko, 2004].
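As a sketch of the correlation measure used here, Spearman's rho can be computed from the ranks of the paired observations; the simple closed form below assumes there are no ties (with ties, the Pearson correlation of the ranks is used instead). The class name is ours.

```java
import java.util.Arrays;

/**
 * Sketch of Spearman's rank correlation for two paired samples without ties:
 * rho = 1 - 6 * sum(d_i^2) / (n (n^2 - 1)), where d_i is the rank difference.
 */
public final class SpearmanRho {

    static double rho(double[] x, double[] y) {
        double[] rx = ranks(x);
        double[] ry = ranks(y);
        double sumSquaredDiff = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = rx[i] - ry[i];
            sumSquaredDiff += d * d;
        }
        int n = x.length;
        return 1.0 - (6.0 * sumSquaredDiff) / (n * ((double) n * n - 1));
    }

    /** 1-based ranks of the values (assumes no ties). */
    private static double[] ranks(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = Arrays.binarySearch(sorted, values[i]) + 1;
        }
        return result;
    }
}
```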
A Spearman rho value of +1 or -1 indicates a high positive or high negative correlation, whereas 0 indicates that the variables under analysis do not correlate at all. Values greater than +0.3 or lower than -0.3 indicate a moderate correlation; values greater than +0.5 or lower than -0.5 are considered to be strong correlations [Hopkins, 2000].

The results of the Spearman correlation tests are shown in Table 7.2.

Table 7.2: P-values and rho values of the Spearman correlation test to investigate the correlation between the Cliff's Delta and the number of methods, clients, and invocations. Corr indicates the magnitude of the correlations.

            | #Methods                    | #Clients                     | #Invocations
            | p-value   | rho    | corr   | p-value   | rho   | corr     | p-value   | rho   | corr
  GA vs ORI | 0.6243    | 0.070  | none   | 5.199E-13 | 0.446 | moderate | <2.20E-16 | 0.541 | strong
  GA vs SA  | 0.8458    | -0.028 | none   | 8.872E-12 | 0.429 | moderate | <2.20E-16 | 0.520 | strong
  GA vs RND | 8.127E-06 | 0.617  | strong | 2.057E-08 | 0.424 | moderate | 9.447E-14 | 0.477 | moderate

We notice that the Cliff's Delta between the distributions of IUCafter values of the genetic algorithm and the random algorithm (i.e., GA vs RND) increases with larger APIs. The Cliff's Delta effect sizes are strongly correlated (i.e., rho=0.617) with the number of methods (#Methods). This indicates that the more methods an API exposes, the more the genetic algorithm outperforms the random algorithm, generating APIs with higher IUC. Moreover, with an increasing number of clients (i.e., #Clients) and invocations (i.e., #Invocations) the Cliff's Delta between the distributions of IUCafter values of the genetic algorithm and the other search algorithms increases as well. This is indicated by rho values that are greater than 0.3.

Based on these results we can answer our research questions stating that 1) the genetic algorithm is able to split APIs into sub-APIs with higher IUC values and 2) it outperforms the other search-based algorithms. The difference in performance between the genetic algorithm and the random algorithm increases with an increasing number of methods declared in the APIs. The difference in performance between the genetic algorithm and the other search-based techniques increases with an increasing number of clients and invocations.

7.4.4 Discussions of the Results

The results of our study are relevant for API providers. Publishing stable APIs is one of their main concerns, especially if they publish APIs on the web. APIs are considered contracts between providers and clients and they should stay as stable as possible so as not to break clients' systems. In Chapter 2 we showed empirically that fat APIs (i.e., APIs with low external cohesion) are more change-prone than non-fat APIs. To refactor such APIs Martin [2002] proposed the Interface Segregation Principle (ISP). However, applying this principle is not trivial because of the large API usage diversity [Mendez et al., 2013]. Our proposed genetic algorithm assists API providers in applying the ISP. To use our genetic algorithm, providers should monitor how their clients invoke their API. For each client they should record the invoked methods in order to compute the IUC metric. This data is used by the genetic algorithm to evaluate the candidate solutions through the fitness functions, as described in Section 7.2. The genetic algorithm is then capable of suggesting the sub-APIs into which an API should be split in order to apply the ISP.

This approach is particularly useful to deploy stable web APIs. One of the key factors for deploying successful web APIs is assuring an adequate level of stability. Changes in a web API might break the consumers' systems, forcing them to continuously adapt their systems to new versions of the web API. Using our approach providers can deploy web APIs that are more externally cohesive and, hence, less change-prone, as shown in Chapter 2. Moreover, since our approach is automated, it can be integrated into development and continuous integration environments to continuously monitor the conformance of APIs to the ISP. Providers regularly get informed when and how to refactor an API.
However, note that the ISP takes into account only the clients' usage and, hence, the external cohesion. As a consequence, while our approach assures that APIs are externally cohesive, it currently does not guarantee other quality attributes (e.g., internal cohesion). As part of our future work we plan to extend our approach in order to take into account other relevant quality attributes.

7.5 Threats to Validity

This section discusses the threats to validity that can affect the empirical study presented in the previous section.

Threats to construct validity concern the relationship between theory and observation. In our study this threat can be due to the fact that we mined the API usage through a binary analysis. In our analysis we have used binary jar files to extract method calls. The method calls that are extracted from compiled .class files are, however, not necessarily identical to the method calls that can be found in the source code. This is due to compiler optimizations. For instance, when the compiler detects that a certain call is never executed, it can be excluded. However, we believe that the high number of analyzed APIs mitigates this threat.

With respect to internal validity, the main threat is the possibility that the tuning of the genetic algorithm and the simulated annealing algorithm can affect the results. We mitigated this threat by calibrating the algorithms with 10 toy examples and evaluating statistically their performance while changing their parameters.

Threats to conclusion validity concern the relationship between the treatment and the outcome. Wherever possible, we used proper statistical tests to support our conclusions. In particular we used non-parametric tests, which do not make any assumption on the underlying data distribution, which was tested against normality using the Shapiro test. Note that, although we performed multiple Mann-Whitney and Spearman tests, p-value adjustment (e.g., Bonferroni) is not needed as we performed the tests on independent and disjoint data sets.

Threats to external validity concern the generalization of our findings. We mitigated this threat by evaluating the proposed genetic algorithm on 42,318 public APIs coming from different Java systems. The invocations to the APIs have been mined from the Maven repository. These invocations are not a complete set of invocations to the APIs because they do not include invocations from software systems not stored in Maven. However, we are confident that the data set used in this chapter is a representative sample.

7.6 Related Work

Interface Segregation Principle. After the introduction of the ISP by Martin [2002] in 2002, several studies have investigated the impact of fat interfaces on the quality of software systems. In 2013, Abdeen et al. [2013] investigated empirically the impact of interfaces' quality on the quality of implementing classes. Their results show that violations of the ISP lead to degraded cohesion of the classes that implement fat interfaces. In 2013, Yamashita and Moonen [2013] investigated the impact of inter-smell relations on software maintainability. They analyzed the interactions of 12 code smells and their relationships with maintenance problems. Among other results, they show that classes violating the ISP manifest higher afferent coupling. As a consequence, changes to these classes result in a larger ripple effect. In Chapter 2, we showed that violations of the ISP can be used to predict change-prone interfaces.
Among different source code metrics (e.g., the C&K metrics [Chidamber and Kemerer, 1994]) we demonstrated that fat interfaces (i.e., interfaces showing a low external cohesion measured through the IUC metric) are more change-prone than non-fat interfaces. Moreover, our results showed that the IUC metric can improve the performance of prediction models in predicting change-prone interfaces. The results of this related work show the relevance of applying the ISP and motivated us in defining the approach presented in this chapter.

Search Based Software Engineering. Over the last years genetic algorithms, and in general search-based algorithms, have become popular to perform refactorings of software systems. The approach closest to ours has been presented by Praditwong et al. [2011] in 2011. The authors formulated the problem of designing software modules that adhere to quality attributes (e.g., coupling and cohesion) as a multi-objective clustering search problem. Similarly to our work, they defined a multi-objective genetic algorithm that clusters software components into modules. Moreover, they show that multi-objective approaches produce better solutions than existing single-objective approaches. This work influenced us in defining the problem as a multi-objective problem instead of a single-objective problem. However, the problem we solve is different from theirs: our approach splits fat APIs according to the ISP and uses different fitness functions.

Prior to this work [Praditwong et al., 2011], many other studies proposed approaches to cluster software components into modules (e.g., [Mitchell and Mancoridis, 2006; Mancoridis et al., 1999, 1998; Mitchell and Mancoridis, 2002; Mahdavi et al., 2003; Harman et al., 2005]). These studies propose single-objective approaches that have been shown by Praditwong et al. [2011] to produce worse solutions. To the best of our knowledge there are no studies that propose approaches to split fat APIs according to the ISP as proposed in this chapter.

7.7 Conclusions and Future Work

In this chapter we proposed a genetic algorithm that automatically obtains the sub-APIs into which a fat API should be split according to the ISP. Mining the clients' usage of 42,318 Java APIs from the Maven repository, we showed that the genetic algorithm is able to split APIs into sub-APIs. Comparing the resulting sub-APIs based on the IUC values, we showed that the genetic algorithm outperforms the random and simulated annealing algorithms. The difference in performance between the genetic algorithm and the other search-based techniques increases for APIs with an increasing number of methods, clients, and invocations. Based on these results API providers can automatically obtain and refactor the set of sub-APIs based on how clients invoke the fat APIs.

While this approach is already actionable and useful for API providers, we plan to further improve it in our future work. First, we plan to evaluate qualitatively the sub-APIs generated by the genetic algorithm. The higher IUC values guarantee that sub-APIs are more externally cohesive and, hence, better conform to the ISP. However, we have not yet investigated what developers think about the sub-APIs. Hence, we plan to contact developers and perform interviews to investigate the quality of these sub-APIs. Next, we plan to extend our approach taking into account other quality attributes, such as internal cohesion.
Finally, we plan to slightly modify the genetic algorithm to generate overlapping sub-APIs (i.e., sub-APIs that share common methods).

8 Refactoring Chatty web APIs

The relevance of the granularity of service interfaces and its architectural impact have been widely investigated in the literature. Existing studies show that the granularity of a service interface, in terms of exposed operations, should reflect its clients' usage. This idea has been formalized in the Consumer-Driven Contracts (CDC) pattern. However, to the best of our knowledge, no studies propose techniques to assist providers in finding the right granularity and in easing the adoption of the CDC pattern. In this chapter, we propose a genetic algorithm that mines the clients' usage of service operations and suggests Façade services whose granularity reflects the usage of each different type of client. These services can be deployed on top of the original service and they become contracts for the different types of clients, satisfying the CDC pattern. A first study shows that the genetic algorithm is capable of finding Façade services and outperforms a random search approach. (This chapter was published in the 10th World Congress on Services (Services 2014) [Romano and Pinzger, 2014].)

Chapter outline: 8.1 Problem Statement and Solution; 8.2 The Genetic Algorithm; 8.3 Study; 8.4 Related Work; 8.5 Conclusion & Future Work.

One of the key factors for deploying successful services is assuring an adequate level of granularity [Hohpe and Woolf, 2003; Daigneau, 2011; Murer et al., 2010; Haesen et al., 2008; Kulkarni and Dwivedi, 2008]. The choice of how operations should be exposed through a service interface can have an impact on both performance and reusability [Hohpe and Woolf, 2003; Murer et al., 2010]. This level of granularity is also known in the literature as functionality granularity [Haesen et al., 2008]. For the sake of simplicity we refer to it simply as granularity throughout this chapter. Choosing the right granularity is not a trivial task. On the one hand, fine-grained services lead their clients to invoke their interfaces multiple times, worsening the performance [Hohpe and Woolf, 2003; Daigneau, 2011]. On the other hand, coarse-grained services can reduce reusability because their use is limited to very specific contexts [Hohpe and Woolf, 2003; Daigneau, 2011]. To find a trade-off between fine-grained and coarse-grained services the Consumer-Driven Contracts (CDC) pattern has been proposed [Daigneau, 2011]. This pattern states that the granularity of a service interface should reflect its clients' usage, satisfying their requirements and becoming a contract between clients and providers. In the literature several studies have investigated the impact of granularity (e.g., [Hohpe and Woolf, 2003; Daigneau, 2011; Murer et al., 2010; Haesen et al., 2008; Kulkarni and Dwivedi, 2008]), have classified the different levels of granularity (e.g., [Haesen et al., 2008]), and have proposed metrics to measure them (e.g., [Khoshkbarforoushha et al., 2010; Alahmari et al., 2011]).
However, to the best of our knowledge, there are no studies proposing techniques to assist service providers in finding the right granularity and adopting the CDC pattern. This task can be expensive because many clients invoke a service interface in different ways. Providers should, first, analyze the usage of many clients and, then, design a service interface that satisfies all the clients' requirements.

In this chapter, we propose a genetic algorithm to assist service providers in finding the adequate granularity and adopting the CDC pattern. This algorithm mines the clients' usage of a service interface and retrieves Façade services [Krafzig et al., 2004] whose interfaces have an adequate granularity for each different type of client. These Façade services become contracts that reflect clients' usage, easing the adoption of the CDC pattern. Moreover, providers can deploy them on top of the existing service, making this approach actionable without modifying it. The contributions of this chapter are as follows:

• a genetic algorithm designed to infer Façade services from clients' usage that represent contracts with the different types of clients.

• a study to evaluate the capability of the genetic algorithm compared to the capability of a random search approach. The results show that the genetic algorithm is capable of finding Façade services and it outperforms the random search.

The remainder of this chapter is organized as follows. Section 8.1 presents the problem and the proposed solution. Section 8.2 shows the proposed genetic algorithm. Section 8.3 presents the study and its results, and discusses them. Related work is presented in Section 8.4, while in Section 8.5 we draw our conclusions and outline directions for future work.

8.1 Problem Statement and Solution

In this section, first, we introduce the problem of finding the adequate granularity of service interfaces, presenting the Consumer-Driven Contracts pattern. Then, we present our solution to address this problem.

8.1.1 Problem Statement

Choosing the adequate granularity of a service is a relevant task and a widely discussed topic [Hohpe and Woolf, 2003; Daigneau, 2011; Murer et al., 2010; Haesen et al., 2008; Kulkarni and Dwivedi, 2008]. On the one hand, fine-grained services can lead to service-oriented systems with inadequate performance due to an excessive number of remote calls [Hohpe and Woolf, 2003]. Consider for instance the fragment of a service interface to order an item shown in Figure 8.1. Figure 8.1a shows a fine-grained design for this service that exposes methods to set shipment and billing information for ordering an item. This design is efficient if the methods' invocation happens in a local environment (e.g., in a software system deployed on a single machine) [Hohpe and Woolf, 2003]. In a distributed environment (e.g., in a service-oriented system) a client needs to invoke three methods (i.e., setBillingAddress(), setShippingAddress(), and addPriorityShipment()) to set the needed information. This causes a significant communication overhead since three methods need to be invoked over a network. On the other hand, the coarse-grained OrderItem (shown in Figure 8.1b) exposes only one method (i.e., setShipmentInfo()) to set all the information related to the shipment and the billing. In this way clients invoke the service only once, reducing the communication overhead.
However, if the services are too coarse-grained they can limit reusability because their use will be limited to very specific contexts [Hohpe and Woolf, 2003; Murer et al., 2010; Daigneau, 2011]. In our example in Figure 8.1, the clients of the coarse-grained service (Figure 8.1b) are constrained to set the billing address and the shipping address, and to add the priority shipment details. The service is not suitable for contexts where, for instance, priority shipments are not allowed. Maintenance tasks are needed to adapt coarse-grained services to different contexts. Hence, finding the adequate granularity of a service requires finding a trade-off between having a too fine-grained or a too coarse-grained service. This allows publishing a service with an acceptable communication overhead and an adequate level of reusability.

Figure 8.1: An example of fine-grained and coarse-grained service interfaces to set the shipping and the billing data for ordering an item. (a) Fine-grained version of OrderItem exposing a different method for each piece of needed information (setBillingAddress(), setShippingAddress(), addPriorityShipment()). (b) Coarse-grained version of OrderItem exposing a single method (setShipmentInfo()) to set all the needed information.

To find such an adequate level of granularity the Consumer-Driven Contracts (CDC) pattern has been defined for service interfaces [Daigneau, 2011]. The CDC pattern states that a service interface should reflect its clients' needs through its granularity. In this way the service interface is considered a contract that satisfies the clients' requirements. Applying the CDC pattern is not a trivial task. A service usually has several clients with different requirements invoking its interface differently. To deploy a service with an adequate granularity (using the CDC pattern) providers should know all these requirements. Within an enterprise or a corporate environment providers know their clients and they can understand how clients expect to use a service. However, clients are usually not known a priori and they bind a service only after it has been published and advertised. Moreover, the number of clients and their different requirements can be huge and change over time.

8.1.2 Solution

Our solution to the aforementioned problem consists in applying a cluster analysis. This analysis consists in clustering the set of methods in such a way that methods in the same cluster are invoked together by the clients. The goal of our cluster analysis is to find clusters that minimize the number of remote invocations to a service. To better understand the cluster analysis for the granularity problem consider the example in Figure 8.2.

Figure 8.2: An example of a service interface to order an item for an e-commerce system. The fine-grained OrderItem exposes ten methods: 1-setBillingAddress(), 2-setShippingAddress(), 3-setPriorityShipment(), 4-addPaymentDetails(), 5-addWishCardType(), 6-addWishCardMsg(), 7-trackShipmentByApp(), 8-trackShipmentByEmail(), 9-trackShipmentBySMS(), 10-notifyArrivalTime(). The rectangles represent independent methods that are invoked by each client (Client1-Client4).
The OrderItem in Figure 8.2 extends the service shown in Figure 8.1a, exposing further methods to 1) add payment details (addPaymentDetails()), 2) add a wish card to an order (addWishCardType() and addWishCardMsg()), and 3) track the shipment (trackShipmentByApp(), trackShipmentByEmail(), trackShipmentBySMS(), and notifyArrivalTime()). Imagine this service has four clients (Client1, Client2, Client3, and Client4). These clients invoke different sets of independent methods, denoted in Figure 8.2 by rectangles (e.g., Client1 invokes setBillingAddress(), setShippingAddress(), and setPriorityShipment()). These methods are considered independent because the invocation of one method does not require the invocation of the other ones [Wu et al., 2013]. In total there are 13 remote invocations: 3 performed by Client1, 3 by Client2, 3 by Client3, and 4 by Client4. In this example we can retrieve three clusters (shown in Figure 8.3a) that minimize the number of remote invocations:

• Cluster1 (i.e., Shipment): consists of setBillingAddress(), setShippingAddress(), and setPriorityShipment().

• Cluster2 (i.e., WishCard): consists of addWishCardType() and addWishCardMsg().

• Cluster3 (i.e., TrackShipment): consists of trackShipmentByApp(), trackShipmentByEmail(), trackShipmentBySMS(), and notifyArrivalTime().

Once we know the clusters we can combine the fine-grained methods belonging to a cluster into a single coarse-grained method. These coarse-grained methods can be exposed through Façade services [Krafzig et al., 2004], as shown in Figure 8.3a.

Figure 8.3: Two possible refactorings of the service interface shown in Figure 8.2 using the proposed cluster analysis and the Façade pattern. Black arrows indicate local invocations while non-black arrows indicate remote invocations. (a) The Shipment, WishCard, and TrackShipment Façade services have been introduced; this design has 9 local invocations and 6 remote invocations. (b) The Shipment, Client2, and TrackShipment Façade services have been introduced; this design has 10 local invocations and 6 remote invocations.

Façade services (i.e., Shipment, WishCard, and TrackShipment in our example) have been defined to provide different views of lower-level services (i.e., OrderItem in our example). Since the invocations from Façade services to lower-level services are local invocations (shown with black arrows in Figure 8.3), the total number of remote invocations (shown with non-black arrows in Figure 8.3) has been reduced from 9 to 6. Moreover, adopting this design choice allows keeping the fine-grained OrderItem public, so that it can still be invoked by current clients without breaking their behavior.
Choosing the clusters that minimize the number of remote invocations can lead to multiple solutions. Imagine, for instance, that we change Cluster2 by adding the method addPaymentDetails(), as shown in Figure 8.3b. This cluster is optimal for Client2, which then has to perform only one remote invocation. However, Client3 can no longer invoke the Façade service associated with Cluster2 because it contains a method (i.e., addPaymentDetails()) in which it is not interested. The number of remote invocations is still equal to 6. At this point an engineer should decide which architectural design is more suitable for her specific domain. The decision might be influenced by three different factors:

• Cohesion of Façade services: the design in Figure 8.3a might be preferred because the WishCard service is more cohesive than the Client2 service, since it exposes related methods (methods related to the wish card concern).

• Number of local invocations: the design in Figure 8.3a might be preferred because it has 9 local invocations while the design in Figure 8.3b has 10 local invocations.

• Relevance of different clients: the service provider might want to give a better service (e.g., upon a higher registration fee) to Client2 and, hence, adopt the design in Figure 8.3b.

8.1.3 Contributions

In this chapter we propose a search-based approach to retrieve the clusters of methods that minimize the number of remote invocations. As explained previously, the methods belonging to the same cluster can be exposed through a Façade service whose granularity reflects clients' usage and, hence, satisfies the CDC pattern.

A first approach to find these clusters consists in adopting brute-force search techniques. These techniques consist of enumerating all possible clusters and checking whether they minimize the number of invocations. The problem of these approaches is that the number of possible clusters can be prohibitively large, causing a combinatorial explosion. Imagine, for instance, adopting this approach to find the right granularity of the AmazonEC2 web service. This web service exposes 118 methods in version 23. The number of 20-combinations of the 118 methods in AmazonEC2 is equal to:

$$\binom{118}{20} = \frac{118!}{20!\,98!} \approx 2 \times 10^{22}$$

This means that only evaluating all the clusters with 20 methods would require executing on the order of $10^{22}$ computer instructions, far more than a typical PC can execute in any reasonable amount of time. Moreover, we should evaluate clusters with sizes ranging from 2 to 118, causing the number of computer instructions to further increase.
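The exact count can be checked with a few lines of integer arithmetic; the sketch below (with class and method names of our choosing) computes the binomial coefficient above without overflow.

```java
import java.math.BigInteger;

/**
 * Back-of-the-envelope check of the combinatorial explosion discussed above:
 * the exact number of 20-element subsets of the 118 AmazonEC2 methods.
 */
public class ClusterCountCheck {

    /** Binomial coefficient C(n,k), computed iteratively with exact divisions. */
    static BigInteger binomial(int n, int k) {
        BigInteger result = BigInteger.ONE;
        for (int i = 1; i <= k; i++) {
            result = result.multiply(BigInteger.valueOf(n - k + i))
                           .divide(BigInteger.valueOf(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints a 23-digit number, i.e., on the order of 10^22 possible 20-method clusters.
        System.out.println(binomial(118, 20));
    }
}
```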
To solve this issue we propose a genetic algorithm (presented in Section 8.2) that, by mimicking the process of natural selection, finds optimal solutions (i.e., clusters that minimize the number of remote invocations) in acceptable time without requiring special hardware configurations (e.g., the use of supercomputers). Moreover, we perform a first study aimed at investigating the capability of the proposed approach in finding Façade services, which is presented in Section 8.3.

In this chapter we do not cover the problem of mining independent methods because it has already been the subject of related work [Wu et al., 2013] that can be integrated into our approach. Furthermore, related work [Wu et al., 2013] shows that 78.1% of the methods in their analyzed web services are independent. This percentage shows that most of the methods can be clustered into coarse-grained methods, further motivating the need to perform this task with a proper approach.

8.2 The Genetic Algorithm

Genetic Algorithms (GAs) have been used in a wide range of applications where optimization is required. Among all the applications, GAs have been widely studied to solve clustering problems [Hruschka et al., 2009]. GAs mimic the process of natural selection to provide a search heuristic able to solve optimization problems. A generic GA is shown in Figure 8.4 and consists of seven different steps.

Figure 8.4: Different steps of a genetic algorithm: 1) create the initial population of chromosomes; 2) evaluate the fitness of each chromosome; 3) check whether the maximum number of evaluations has been reached; 4) select the next generation (selection operator); 5) perform reproduction (crossover operator); 6) perform mutation (mutation operators); 7) output the best chromosomes.

In the first step, the GA creates a set of randomly generated candidate solutions (also known as chromosomes) called population (step 1 in Figure 8.4). In the second step, the candidate solutions are evaluated through a fitness function (step 2). This function measures the goodness of a candidate solution. Then, the population is evolved iteratively through evolutionary operators (steps 4, 5, and 6) until some conditions are satisfied (e.g., reaching the maximum number of fitness evaluations in step 3, or achievement of the goal). Each evolution iteration is performed through a selection operator (step 4), a crossover operator (step 5), and a mutation operator (step 6). The selection operator selects a pair of solutions (parents) from the population. The parents are used by the crossover operator to generate two offspring solutions (step 5). The offspring solutions are generated in such a way that they combine features from the two parents. The mutation operators (step 6) mutate the offspring in order to preserve diversity. The mutated solutions are added to the population, replacing the solutions with the worst fitness scores. Finally, the GA outputs the best solutions when the evolution process terminates (step 7).

To implement the GA and adapt it to find the set of clusters that minimize the number of remote invocations we have to define the fitness function, the chromosome (or solution) representation, and the evolutionary operators (i.e., selection, crossover, and mutation), which are shown in the following subsections.

8.2.1 Chromosome representation

The chromosomes are represented with a label-based integer encoding widely adopted in the literature [Hruschka et al., 2009] and shown in Figure 8.5. According to this encoding, a solution is represented by an integer array of n positions, where n is the number of methods exposed in a service. Each position corresponds to a specific method (e.g., position 1 corresponds to the method setBillingAddress() in Figure 8.2). The integer values in the array represent the cluster to which the methods belong. For instance, in Figure 8.5 the methods 1, 2, and 10 belong to the same cluster, labeled with 1. Note that two chromosomes can be equivalent even though the clusters are labeled differently. For instance, the chromosomes [1,1,1,1,2,2,2,2,3,3] and [2,2,2,2,3,3,3,3,1,1] represent the same clusters. To solve this problem we apply the renumbering procedure described in [Falkenauer, 1998], which transforms different labelings of equivalent clusterings into a unique labeling.
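One common way to implement such a renumbering, sketched below under our own naming, is to relabel clusters in the order of their first appearance, so that equivalent clusterings map to the same canonical array; the thesis follows the procedure of Falkenauer [1998], of which this is only an illustration.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of a renumbering step for label-based integer encodings:
 * clusters are relabeled in order of first appearance.
 */
public class Renumbering {

    static int[] canonical(int[] chromosome) {
        Map<Integer, Integer> mapping = new HashMap<>();
        int[] result = new int[chromosome.length];
        int nextLabel = 1;
        for (int i = 0; i < chromosome.length; i++) {
            Integer label = mapping.get(chromosome[i]);
            if (label == null) {
                label = nextLabel++;
                mapping.put(chromosome[i], label);
            }
            result[i] = label;
        }
        return result;
    }
}
```

For example, canonical([2,2,2,2,3,3,3,3,1,1]) yields [1,1,1,1,2,2,2,2,3,3], the same canonical form as the first chromosome above.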
Figure 8.5: Chromosome representation of our candidate solutions.
  Methods:    1 2 3 4 5 6 7 8 9 10
  Chromosome: 1 1 2 3 2 4 5 3 6 1

8.2.2 Fitness

The fitness function is a function that measures how "good" a solution is. Our fitness function counts, for each chromosome, the number of remote invocations needed by the clients. Imagine that the clients' usage information of Figure 8.2 is saved in the data set shown in Table 8.1. In this data set, each row contains the id of the client (i.e., ClientID) and the set of independent methods invoked by it (i.e., InvokedMethods). The InvokedMethods are sets of methods where each integer value corresponds to a different method in the service. We label the methods in the OrderItem (shown in Figure 8.2) from 1 to 10 depending on the order in which they appear in the service (e.g., setBillingAddress() is labeled with 1, setShippingAddress() is labeled with 2, etc.).

Table 8.1: Data set containing the independent methods invoked by each different client in Figure 8.2.
  ClientID | InvokedMethods
  Client1  | 1;2;3
  Client2  | 4;5;6
  Client3  | 5;6;7
  Client4  | 7;8;9;10

Once we have this data set, we compute the fitness function as the sum of the number of remote invocations required to invoke each InvokedMethods set in the data set. If the methods (or a subset of the methods) in an InvokedMethods set belong to a cluster containing no other methods, the methods in this cluster account for 1 invocation in total. Otherwise each different method accounts for 1. Consider for instance the chromosome [1,1,1,1,2,2,2,2,3,3]. This chromosome clusters together the methods 1, 2, 3, and 4 (i.e., cluster 1), the methods 5, 6, 7, and 8 (i.e., cluster 2), and the methods 9 and 10 (i.e., cluster 3). In this case the number of remote invocations to execute the InvokedMethods of Client1 (i.e., 1;2;3) is 3, because cluster 1 contains the method 4, which is not needed by Client1. Hence, Client1 cannot invoke the Façade service represented by the cluster labeled 1 and invokes the methods of the original service OrderItem. If we change the chromosome into [1,1,1,2,2,2,2,2,3,3], the total number of invocations for Client1 is equal to 1, because Client1 can execute the single operation declared in the Façade service represented by cluster 1. If the chromosome becomes [1,1,2,2,2,2,2,2,3,3], then the total number of remote invocations for Client1 is equal to 2: the client invokes the method of cluster 1 once to cover the methods 1 and 2, and then it invokes method 3 in the original service.
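To make the counting rule explicit, the following is a sketch of the fitness computation described above, under our own naming and simplified data structures (clients as sets of 1-based method indices); it is an illustration, not the thesis implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of the fitness function: the total number of remote invocations needed
 * by all clients for a given clustering. A client can use the Façade of a cluster
 * (one remote call) only if that cluster contains no methods the client does not
 * need; otherwise each needed method in that cluster is a separate remote call.
 */
public class RemoteInvocationFitness {

    /** chromosome[i] = cluster label of method i+1; each client is a set of needed methods. */
    static int countRemoteInvocations(int[] chromosome, List<Set<Integer>> clients) {
        // Build cluster label -> set of methods in that cluster.
        Map<Integer, Set<Integer>> clusters = new HashMap<>();
        for (int i = 0; i < chromosome.length; i++) {
            clusters.computeIfAbsent(chromosome[i], k -> new HashSet<>()).add(i + 1);
        }
        int total = 0;
        for (Set<Integer> needed : clients) {
            Set<Integer> touchedClusters = new HashSet<>();
            for (int method : needed) {
                touchedClusters.add(chromosome[method - 1]);
            }
            for (int label : touchedClusters) {
                Set<Integer> clusterMethods = clusters.get(label);
                if (needed.containsAll(clusterMethods)) {
                    total += 1;                       // one call to the Façade service
                } else {
                    for (int m : clusterMethods) {    // individual calls to the original service
                        if (needed.contains(m)) total++;
                    }
                }
            }
        }
        return total;
    }
}
```

Calling countRemoteInvocations with the chromosome [1,1,1,1,2,2,2,2,3,3] and only Client1's set {1, 2, 3} returns 3, matching the worked example above.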
Copying these clusters changes the clusters 1, 2, and 3 in ChildC. These changed clusters are removed from ChildC (step 3) leaving the corresponding methods unallocated (labeled with 0). In the forth step the unallocated methods are allocated to the cluster with the nearest centroid. The same procedure is followed to generate the second child ChildD. However, instead of selecting randomly k clusters from ParentB, the changed clusters of ChildC (i.e., 1,2, and 3) are copied into ChildD that is originally a copy of ParentA. ParentA ParentB 1 1 2 3 2 4 5 1 2 5 4 2 1 2 3 3 2 1 2 4 1: copy ParentB into ChildC ChildC 4 2 1 2 3 3 2 1 2 4 2: copy clusters 2 and 3 from ParentA to ChildC ChildC 4 2 2 3 2 3 2 1 2 4 3: remove changed methods from B (i.e., 1,2,3) ChildC 4 0 2 3 2 0 0 0 2 4 4: unallocated objects are allocated to randomly selected clusters Figure 8.6: Example of crossover operator for clustering problems [Hruschka et al., 2009]. 8.2.5 The Mutation Operators Finally, the offspring is mutated through the mutation operator with a probability Pm . This step ensures genetic diversity from one generation to the next ones. We perform the mutation selecting one of the following cluster-oriented mutation operators (randomly selected) [Falkenauer, 1998; Hruschka et al., 2009]: • split: a randomly selected cluster is split into two different clusters. The methods of the original cluster are randomly assigned to the generated clusters. 8.3. Study 161 • merge: moves all methods of a randomly selected cluster to another randomly selected cluster. • move: moves methods between clusters. Both methods and clusters are randomly selected. 8.2.6 Implementation We implemented the proposed genetic algorithm on top of the JMetal2 framework. JMetal is a Java framework that provides state-of-the-art algorithms for optimization problems. We calibrated the genetic algorithm as follows: • the population is composed by 100 chromosomes. The initial population is randomly generated; • the crossover and mutation probability is 0.9; • the maximum number of fitness evaluation (step 3 in Figure 8.4) is 100,000. 8.3 Study The goal of this study is to evaluate the capability of our approach in finding Façade services that minimize the number of remote invocations and reflect clients’ usage. The perspective is that of service providers interested in applying the Consumer-Driven Contracts pattern using Façade services with adequate granularity. In this study we answer the following research question: To which extent is the propose GA capable of identifying Façade services that minimize the number of remote invocations and reflect clients’ usage? In the following subsections, first, we present the analysis we performed to answer our research question. Then, we show the results and answer the research question. Finally, we discuss the results and the threats to validity of our study. 8.3.1 Analysis To answer our research question we run the genetic algorithm (GA) defined in Section 8.2 to find the Façade services for the working example shown in Figure 8.2. To measure the performance of our GA we register the number 2 http://jmetal.sourceforge.net 162 Chapter 8. Refactoring Chatty web APIs of GA fitness evaluations needed to find the Façade services shown in Figure 8.3a and Figure 8.3b. Also, we compare the GA with a random search (RS), in which the solutions are randomly generated but no genetic evolution is applied. 
Both the GA and RS are executed 100 times and the number of fitness evaluations required to find the Façade services are compared through statistical tests. We use a random search as baseline because this comparison is considered the first step to evaluate a genetic algorithm [Sivanandam and Deepa, 2007]. Comparisons with other search-based approaches (e.g., local search algorithms) will be subject of our future work. First, we use the Mann-Whitney test to analyze whether there is a significant difference between the number of fitness evaluations required by the GA and the ones required by the RS. Significant differences are indicated by Mann-Whitney p-values ≤ 0.01. Then, we use the Cliff’s Delta d effect size to measure the magnitude of the difference. Cliff’s Delta estimates the probability that a value selected from one group is greater than a value selected from the other group. Cliff’s Delta ranges between +1 if all selected values from one group are higher than the selected values in the other group and -1 if the reverse is true. 0 expresses two overlapping distributions. The effect size is considered negligible for d < 0.147, small for 0.147≤ d < 0.33, medium for 0.33≤ d < 0.47, and large for d ≥ 0.47. We chose the Mann-Whitney test and Cliff’s Delta effect size because they do not require assumptions about the variances and the types of the distributions (i.e., they are non-parametric tests). Moreover, to analyze the capability of the GA in finding Façade services for bigger services, we increase stepwise the number of methods declared in OrderItem keeping unchanged the original methods (i.e., 1-10), their clients, and the clients’ usage (as shown in Figure 8.2). In this way we enlarge the search space and we analyze whether the GA is able to find the same Façade services. For each different size of the OrderItem we perform the same analysis: 1) we execute 100 times the GA and RS, 2) we register the number of fitness evaluations needed for finding the Façade services shown in Figure 8.3, and 3) we perform the Mann-Withney and Cliff’s Delta test to analyze statistically the differences between the distributions. We increment the size of the service up to 118 methods, that is the size of the biggest WSDL interface (AmazonEC2) analyzed in our previous work shown in Chapter 4. 8.3.2 Results Table 8.2 shows the percentage of executions in which GA and RS find the right Façade services shown in Figure 8.3. The results show that, while the 8.3. Study 163 #Methods 10 11 12 13 14 15 16 118 GA 100% 100% 100% 100% 100% 100% 100% 100% RS 82% 70% 65% 35% 20% 10% 0% 0% Table 8.2: Percentage of successful executions in which GA and RS find the Façade services shown in Figure 8.3. GA is always capable of finding the Façade services, the capability of the RS decreases with an increasing number of methods. For services with 16 or more methods the RS is not capable to find the Façade services. The number of fitness evaluations required by the GA and RS are shown in the form of box plots in Figure 8.7. The median number of fitness evaluations for the OrderItem with 118 methods required by the GA (not shown in Figure 8.7) is equal to 5754 (with a median execution time of 295 seconds3 ). Comparing it to the median number of fitness evaluations for the service with 10 methods (i.e., 1049 fitness evaluations with a median execution time of 34.5 seconds) shows that GA scales well with an increasing number of methods. 
Moreover, the distributions of the number of fitness evaluations required by the GA and the RS are statistically different, as shown by the Mann-Whitney p-values (< 0.01) in Table 8.3. The magnitude of these differences is always large, as shown by the Cliff's Delta d values (= 1) in Table 8.3. All the distributions, except RS12 in Figure 8.7, are not normally distributed (normality has been tested with the Shapiro-Wilk test and a confidence level of 0.05). As a consequence the non-parametric tests used in our analysis are the most suitable for these distributions. Based on these results, we can answer our research question stating that the GA is capable of finding the Façade services and outperforms the RS approach.

Figure 8.7: Box plots showing the number of fitness evaluations (#Evaluations) required by GA and RS. GAX and RSX label the box plots for the OrderItem with X methods.

#Methods     10          11          12          13          14          15
MW p-value   < 2.2e-16   < 2.2e-16   < 2.2e-16   < 2.2e-16   < 2.2e-16   < 2.2e-16
Cliff d      1           1           1           1           1           1

Table 8.3: Mann-Whitney p-values (MW p-value) and Cliff's Delta d (Cliff d) between the distributions of #Evaluations required by the GA and RS.

8.3.3 Discussions

The results of this study show that the proposed GA, differently from the RS, is capable of assisting service providers in applying the Consumer-Driven Contracts pattern. Running the GA, providers can retrieve the Façade services that reflect the usage of their clients and minimize the number of remote invocations. Once the set of Façade services is retrieved, they should manually select the most appropriate Façade services as discussed in Section 8.1. These Façade services can be deployed on top of the existing service without modifying it, preserving the compatibility of existing clients. Furthermore, since this approach is semi-automatic, it can be executed over time to monitor the evolution of clients' usage. This allows service providers to co-evolve the granularity of their services, reflecting the evolving usage of their clients.

The main threats to validity that can affect our study are the threats to external validity. These threats concern the generalization of our findings. We evaluated our approach with a small working example. However, to the best of our knowledge, there are no available data sets that contain service usage information suitable for our analysis. In the literature, different data sets are available for research on QoS (e.g., [Al-Masri and Mahmoud, 2008; Zhang et al., 2010]). However, these data sets do not contain information about the operations invoked but only the service names and their URLs. As a consequence they are not suitable for our analysis.

8.4 Related Work

Granularity of services. The closest work to ours is the study developed by Jiang et al. [2011]. In this study the authors propose an approach to infer the granularity of services by mining the activities of business processes. The main idea consists of using frequent pattern mining algorithms to analyze the invocations to service interfaces. Our approach differs from theirs because it can mine the granularity of every kind of service and not only services involved in business processes.
Furthermore, we have not used the proposed frequent pattern mining algorithms because they require a special tuning of the support and confidence parameters, which is problem specific. Moreover, these parameters, together with other relevant details, are not reported in [Jiang et al., 2011], making a replication of this study impossible. To the best of our knowledge, there are no further studies aimed at inferring the right granularity of service interfaces. Related work has mostly proposed classifications for different levels of granularity and has investigated metrics for measuring the granularity.

Haesen et al. [2008] have proposed a classification of three service granularity types (i.e., functionality, data, and business value granularity). For each of these types they have discussed the impact on a set of architectural attributes (e.g., performance, reusability, and flexibility). In this chapter we adhered to their functionality granularity, which has been referred to as granularity for the sake of simplicity. Haesen et al. confirm that the functionality granularity can have an impact on both performance and reusability, as stated in [Hohpe and Woolf, 2003; Murer et al., 2010; Daigneau, 2011] and already discussed in Section 8.1.

Many other studies have investigated metrics to measure the granularity (e.g., [Khoshkbarforoushha et al., 2010; Alahmari et al., 2011]). For instance, Khoshkbarforoushha et al. [2010] measure the granularity appropriateness with a model that integrates four different metrics that measure: 1) the business value of a service, 2) the service reusability, 3) the service context-independency, and 4) the service complexity. Alahmari et al. [2011] proposed a set of metrics to measure the granularity based on internal structural attributes (e.g., number of operations, number of messages, complexity of data types). However, these studies are limited to measuring the granularity and do not provide suggestions on inferring the right granularity.

Refactoring through genetic algorithms. Over the last years genetic algorithms, and in general search-based algorithms, have become popular to perform refactorings of software artifacts. For instance, Ghannem et al. [2013] found appropriate refactoring suggestions using a set of refactoring examples. Their approach is based on an Interactive Genetic Algorithm, which enables interaction with users and integrates their feedback into a classic GA. Ghaith and Ó Cinnéide [2012] presented an approach to automate improvements of software security based on search-based refactoring. O'Keeffe and Ó Cinnéide [2008] have constructed a software tool capable of refactoring object-oriented systems. This tool uses search-based techniques to conform the design of a system to a given design quality model. These studies confirm that genetic algorithms are a useful technique to solve refactoring problems and satisfy desired quality attributes.

8.5 Conclusion & Future Work

In this chapter we have proposed a genetic algorithm to mine the adequate granularity of a service interface. According to the Consumer-Driven Contracts pattern, the granularity of a service should reflect its clients' usage. To adopt this pattern our genetic algorithm suggests Façade services whose granularity reflects the clients' usage. These services can be deployed on top of existing services, allowing an easy adoption of the Consumer-Driven Contracts pattern that does not require any modifications to existing services.
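As an illustration of how such a Façade service can be layered on top of the existing service, consider the following hypothetical sketch (service and method names are illustrative and not taken from the working example): the Façade exposes one coarse-grained operation that delegates locally to three fine-grained methods of the unchanged original service, so clients need a single remote invocation instead of three.

```java
// Hypothetical sketch of a Façade service deployed on top of an existing
// fine-grained service. The original service stays untouched; the Façade calls
// it locally and exposes one coarse-grained operation to remote clients.

// Fine-grained service as it already exists (names are illustrative).
interface OrderItemService {
    String getDescription(long itemId);
    double getPrice(long itemId);
    int getAvailability(long itemId);
}

// Coarse-grained result returned by the Façade in a single response.
class OrderItemSummary {
    final String description;
    final double price;
    final int availability;
    OrderItemSummary(String description, double price, int availability) {
        this.description = description;
        this.price = price;
        this.availability = availability;
    }
}

// Façade tailored to clients that always invoke the three methods together:
// one remote invocation instead of three, with local delegation inside.
class OrderItemFacade {
    private final OrderItemService delegate;
    OrderItemFacade(OrderItemService delegate) { this.delegate = delegate; }

    OrderItemSummary getSummary(long itemId) {
        return new OrderItemSummary(
                delegate.getDescription(itemId),   // local call
                delegate.getPrice(itemId),         // local call
                delegate.getAvailability(itemId)); // local call
    }
}
```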
Our approach is semi-automatic as discussed in Section 8.1. The genetic algorithm outputs different sets of Façade services that should be reviewed by providers. In our future work, first, we plan to further improve this approach to minimize the effort required from the user. Specifically, we plan to add parameters that can guide the search algorithm towards more detailed goals: giving more relevance to certain clients, satisfying other quality attributes (e.g., high cohesion of Façade services, low number of local invocations), etc. Then, we plan to compare our genetic algorithm with other search-based techniques (e.g., local search algorithms). Finally, we plan to improve the genetic algorithm so that it suggests overlapping Façade services, allowing a method to belong to different Façade services. However, an ad-hoc study is needed to investigate to what extent methods can be exposed through different Façade services, because this can be problematic for the maintenance of service-oriented systems.

9. Conclusion

The need to reuse existing software components has led to the emergence of a new programming paradigm called service orientation. According to service orientation, existing software systems (e.g., legacy systems) can be integrated with web services. The main goal of web services is to provide a standardized API that hides the technologies used to implement the legacy system. Other systems can reuse the business logic of legacy systems without knowing their implementation details, only binding to these APIs. As a consequence the coupling between integrated systems is reduced, as discussed in Chapter 1. However, such systems are still coupled through web APIs that specify the operations exposed by web services and the data structures needed to invoke them. These web APIs are considered contracts between web service providers and their clients and they should stay as stable as possible. Changes in the web APIs can break the client systems and damage their business.

In this dissertation we have focused on better understanding the change-proneness of APIs and web APIs. To that end, this work has investigated which indicators can be used to highlight change-prone APIs and web APIs, providing approaches to assist practitioners in refactoring them.

9.1 Contributions

The main contributions of this thesis can be summarized as follows:

• An external cohesion metric (i.e., IUC) capable of highlighting change-prone Java interfaces. We performed an empirical study aimed at investigating which software metrics can be used to highlight change-prone Java interfaces. We compared the capability of existing software metrics defined for object-oriented and service-oriented systems. Software metrics have been measured along the history of a software system and they have been correlated through statistical tests with the changes performed in the analyzed systems. These metrics have also been used to train prediction models aimed at predicting change-prone Java interfaces. The results of this study are useful for software engineers and software researchers. Software engineers can better measure the stability of their interfaces. This helps them in highlighting change-prone interfaces before they are bound to web APIs. Software researchers can use this first study to further investigate the change-proneness of Java interfaces and, more in general, of APIs.
• A set of antipatterns (i.e., ComplexClass, SpaghettiCode, and SwissArmyKnife) that highlight change-prone Java APIs. We performed an empirical study aimed at investigating which antipatterns can be used as indicators of changes in Java classes. We investigated which antipatterns are more likely to lead to changes and which types of changes are likely to appear in Java classes affected by certain types of antipatterns. Among other types of changes, we investigated changes that APIs undergo along their history and which antipatterns are more likely to cause these changes. As in the previous contribution, we measured the presence of antipatterns along the history of software systems and we statistically correlated them with the number and type of changes performed in the software systems. The perspective of this study is that of software engineers who want to estimate the stability of Java classes that participate in certain antipatterns. Among these antipatterns, they might be interested in antipatterns that cause changes to APIs. This is particularly relevant if APIs are bound to web APIs.

• An approach to mine dynamic dependencies among web services deployed in an enterprise. To that end, we used the vector clocks technique, originally conceived to order events in a distributed environment. We used this technique in the domain of web service systems by attaching the vector clocks to the header of SOAP messages. We modified the vector clocks' values along the execution of a service-oriented system and we used them to order service executions and to infer causal dependencies among the executions. The implementation of this approach is portable and it relies on well-known integration patterns. Moreover, we analyzed the impact of the attached vector clocks on the performance of a service-oriented system. This approach is useful for software engineers who want to monitor the dynamic chain of dependencies among web services, which might be useful for debugging and reverse engineering tasks.

• A tool called WSDLDiff that extracts fine-grained changes between different versions of WSDL APIs. Differently from existing approaches, our tool takes into account the syntax of WSDL and XSD, which are used to define operations and data structures in a WSDL API. This tool is useful for web service subscribers and researchers. WSDLDiff can be used by subscribers who want to analyze which elements are frequently added, changed, and removed in a WSDL API and which types of changes a WSDL API undergoes more frequently. Based on this information they can subscribe to the most stable WSDL APIs, reducing the likelihood that unstable APIs might break their systems. Researchers can use our tool to further investigate the change-proneness of WSDL APIs, automatically retrieving the fine-grained changes performed along their history.

• A set of maintenance scenarios that can affect web APIs with low internal and external cohesion. We performed an empirical study aimed at investigating the impact of internal and external cohesion on the change-proneness of web APIs. This analysis is performed using a mixed-method approach. First, we used an online survey to investigate the interface, method, and data-type level change-proneness of web APIs with low external and internal cohesion. The survey reports on maintenance scenarios that are likely to cause changes in such web APIs. Then, we analyzed the history of ten well-known WSDL APIs to investigate the impact of internal cohesion on the change-proneness.
Specifically, we introduced a new internal cohesion metric (DTC) and we statistically correlated the values of this metric with the fine-grained changes extracted with WSDLDiff from the WSDL APIs under analysis. The perspective of this study is that of web service providers, subscribers, and software researchers. Both web service providers and subscribers can benefit from the new metric to estimate the interface change-proneness of a WSDL API. Based on the values of the DTC metric, subscribers can subscribe to the most internally cohesive WSDL API to reduce the likelihood that changes break their systems. Providers can highlight WSDL APIs that should be refactored to avoid frequent changes. Moreover, they can estimate the change-proneness based on the value of external cohesion metrics (e.g., SIUC). Based on the internal and external cohesion, they can estimate the likelihood that certain maintenance scenarios cause changes to their APIs. Software researchers can also benefit from this first study on change-prone web APIs to further investigate the change-proneness of web APIs.

• An approach to automatically refactor APIs with low external cohesion, also known as fat APIs. We propose an approach to split fat APIs according to the Interface Segregation Principle (ISP). We defined the problem of splitting fat APIs as a multi-objective clustering optimization problem and we proposed a genetic algorithm to solve it. Based on the client usage of a fat API, the genetic algorithm infers the APIs into which it should be split to conform to the ISP and, hence, show a higher external cohesion. To validate the genetic algorithm we mined the clients' usage of 42,318 public Java APIs from the Maven repositories. We compared the capability of the genetic algorithm with the capabilities of other search-based techniques, namely a random approach and a multi-objective simulated annealing approach. The genetic algorithm is useful for software engineers who want to refactor APIs and web APIs with low external cohesion, which is symptomatic of change-prone APIs.

• An approach to refactor chatty web APIs. We proposed a genetic algorithm that assists web service providers in finding the right granularity for their web APIs. Based on the clients' usage of a web API, the genetic algorithm mines the Façade APIs that reflect the usage of the different clients according to the Consumer-Driven Contracts (CDC) pattern. These Façade APIs cluster together methods that are invoked together by the different clients, reducing the number of remote invocations. Façade APIs can be deployed on top of the original APIs, with which they communicate locally. As a consequence, the remote chattiness is reduced.

9.2 The Research Questions Revisited

In Chapter 1 we have formulated a set of high-level research questions whose answers can be found in the studies presented in the chapters of this dissertation. In this section we answer these high-level research questions based on the findings of our studies and we discuss them in the context of this PhD thesis.

9.2.1 Track 1: Change-Prone APIs

The first track of this PhD research was aimed at investigating indicators of changes for APIs that might be bound to web APIs.

Research Question 1: Which software metrics do indicate change-prone APIs?

This research question can be answered from the results of the study presented in Chapter 2. In this chapter we investigated the change-proneness
of Java interfaces, correlating the values of software metrics with the number of fine-grained changes performed in Java interfaces. As software metrics we selected the popular C&K metrics, already used to highlight change-prone Java classes, and a set of complexity and usage metrics defined for interfaces. The fine-grained changes have been extracted with ChangeDistiller, mining the software repositories of 10 well-known open-source Java projects. The results have shown that the Interface Usage Cohesion (IUC) metric exhibits the strongest correlation with the number of changes performed in Java interfaces. As a consequence, software engineers should design interfaces with high external cohesion (measured with the IUC metric) to avoid frequent changes. Low external cohesion is also known as a symptom of the violation of the Interface Segregation Principle (ISP). This principle was already popular amongst software practitioners before our study. However, our study provides first empirical evidence of the effects of the ISP on the stability of interfaces.

In the second part of this study we used prediction models (i.e., Support Vector Machine, Naive Bayes Network, and Neural Nets) to predict change-prone Java interfaces. First, we trained these models with the object-oriented metrics that showed the highest correlation with the number of fine-grained changes, namely CBO, RFC, LCOM, and WMC. Then, we added the IUC metric to these metrics. The results showed that when adding the IUC metric the precision and recall increased.

Based on these results, we can answer Research Question 1 stating that the IUC metric is the best metric in highlighting change-prone Java interfaces, as far as our research showed. This indicates that external cohesion is a required quality attribute to design stable interfaces. Interestingly, low external cohesion also highlights change-prone web APIs, as has been found in the study shown in Chapter 6. This result suggests that the clients' usage should be taken into account when we expose operations through an API.

Research Question 2: What is the impact of antipatterns on the change-proneness of APIs?

Previous studies have already shown the impact of antipatterns on the change-proneness of software artifacts. In the context of this PhD research we wanted to investigate whether antipatterns also impact the change-proneness of APIs. We can answer Research Question 2 with the results of the study reported in Chapter 3. In Chapter 3 we performed an empirical study to investigate 1) the impact of certain antipatterns on change-proneness and 2) the frequency of appearance of certain types of fine-grained changes in Java classes affected by certain antipatterns. The fine-grained changes have been extracted with ChangeDistiller from the repositories of 16 open-source Java projects. These changes have been clustered into 5 different categories depending on the entity of the change. Among these categories we defined a category that includes all changes performed on APIs (e.g., method renaming, changes of parameters, changes of return types). Besides extracting the changes performed in each class, we detected the list of antipatterns affecting each class with the DECOR tool [Moha et al., 2008a,b, 2010]. Based on this extracted data, we correlated the presence of certain antipatterns with the frequency of certain types of changes.
We showed empirically that changes to APIs are more likely to appear if APIs are affected by the ComplexClass, SpaghettiCode, and SwissArmyKnife antipatterns. These results allow us to answer Research Question 2 stating that these antipatterns have a greater impact on the change-proneness of APIs in the analyzed systems. Together with the results of the study shown in Chapter 2, they provide heuristics to detect change-prone APIs. If these APIs are made available through web services, engineers should resolve these antipatterns, and assure high external cohesion, to avoid frequent changes in the future.

9.2.2 Track 2: Change-Prone Web APIs

In the second track of this PhD research we focused on the change-proneness of web APIs. First, we defined two approaches to analyze service-oriented systems. Then, we analyzed the change-proneness, answering the research questions reported below.

Research Question 3: How can we extract fine-grained changes among subsequent versions of web APIs?

In Chapter 4 we have proposed the WSDLDiff tool to extract fine-grained changes from the history of WSDL APIs. The tool has been implemented on top of the Eclipse Modeling Framework (EMF). This framework allows parsing WSDL APIs into standardized models (i.e., Ecore models) that can be compared through its Matching and Differencing engines. Differently from previous work, our tool takes into account the syntax of the WSDL and XSD languages and outputs the elements affected by a change (e.g., XSDElement, WSDLMessage) and the type of change (i.e., addition, deletion, and modification).

In our first study (shown in Chapter 4) we used WSDLDiff to analyze the evolution of four well-known public WSDL APIs. The changes extracted in this study showed that WSDL APIs evolve differently and they do change frequently. This result further motivated us to investigate the change-proneness of web APIs. WSDLDiff is a useful tool that can help web service subscribers in analyzing which elements are frequently added, removed, and changed in a WSDL API. Based on this information they can subscribe to the most stable WSDL API to avoid continuously adapting their clients to new versions of a WSDL API. Researchers can benefit from this tool to further investigate the evolution of WSDL APIs. We can answer Research Question 3 stating that EMF provides a framework suitable for extracting changes between different versions of a WSDL API.

Research Question 4: How can we mine the full chain of dynamic dependencies among web services?

We can answer Research Question 4 based on the study presented in Chapter 5. We have reported on an approach to extract dynamic dependencies among web services based on vector clocks. We provided a non-intrusive, easy-to-implement, and portable implementation that relies on the well-known Pipes and Filters integration pattern. As a consequence this approach can be implemented in many enterprise service buses and web service frameworks such as Apache Axis2, Apache CXF, and MuleESB. This approach consists of attaching vector clocks to the headers of SOAP messages. When a web service is invoked, the vector clock is captured and updated, storing information about the invoked web service. Along the execution of a service-oriented system the vector clock stores the chains of invocations, which can be viewed at run-time or at a later time. This approach is particularly useful for reverse engineering and debugging service-oriented systems.
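As an illustration of the mechanism, the following simplified Java sketch shows the standard vector clock update rules applied to service invocations. The plumbing that reads and writes the clock from the SOAP header (e.g., a message handler in the Pipes and Filters chain) is omitted, and the class is only an approximation of the implementation described in Chapter 5, not a reproduction of it.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the vector clock carried in the SOAP header.
// Each entry maps a service identifier to the number of invocation events
// that service has observed so far.
class VectorClock {
    private final Map<String, Integer> clock = new HashMap<>();

    // Called by a service when it produces an event (e.g., before sending a request):
    // increment the service's own component.
    void tick(String serviceId) {
        clock.merge(serviceId, 1, Integer::sum);
    }

    // Called when a service receives a message: merge the incoming clock
    // (component-wise maximum) and then record the local receive event.
    void onReceive(String serviceId, Map<String, Integer> received) {
        received.forEach((id, value) -> clock.merge(id, value, Integer::max));
        tick(serviceId);
    }

    // The clock travels in the SOAP header; here it is simply exposed as a map.
    Map<String, Integer> snapshot() {
        return new HashMap<>(clock);
    }
}
```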
A first analysis of the overhead due to this approach showed that the extra overhead is negligible. To summarize, we can answer Research Question 4 stating that the existing vector clocks technique can be used to retrieve dependencies among web services and that its overhead is negligible.

Research Question 5: What are the scenarios in which developers change web APIs with low internal and external cohesion?

In Chapter 6 we have presented a mixed-method approach to investigate the change-proneness of web APIs with low internal and low external cohesion. The survey we performed gives insights into the maintenance scenarios that can lead such web APIs to change. Specifically, we can state that low externally cohesive web APIs change frequently to 1) improve understandability and 2) ease maintainability and reduce clones in the APIs. Low internally cohesive web APIs change frequently to 1) reduce the impact of changes on the many clients they have, 2) avoid that all the clients lead the APIs to change frequently, and 3) improve understandability. We complemented the findings of our survey by performing a quantitative analysis of low internally cohesive web APIs. First, we defined a new internal cohesion metric (DTC) to properly measure the internal cohesion. Then, we correlated the values of DTC with the number of changes performed in ten well-known WSDL APIs. The changes have been extracted with our WSDLDiff tool presented in Chapter 4. The results confirm that web APIs with low internal cohesion are more change-prone than internally cohesive web APIs.

9.2.3 Track 3: Refactoring Web APIs

The last track of this dissertation is dedicated to approaches to refactor change-prone web APIs.

Research Question 6: Which search-based techniques can be used to apply the Interface Segregation Principle?

Both studies presented in Chapter 2 and Chapter 6 showed that low externally cohesive APIs and web APIs are more change-prone. To refactor such APIs we presented an approach to apply the Interface Segregation Principle (ISP) in Chapter 7. We defined the problem of splitting fat APIs into smaller APIs specific for each client (i.e., ISP) as a multi-objective clustering optimization problem. To solve this problem we used two state-of-the-art search-based approaches, namely a genetic algorithm and a simulated annealing algorithm. The results of this study showed that the genetic algorithm is able to infer more externally cohesive APIs for 42,318 public APIs whose usage has been mined from the Maven repositories. This approach is useful for API and web API providers. To use our genetic algorithm, API providers should monitor how their clients invoke their API. This data is then used by the genetic algorithm to split the API into smaller APIs according to the ISP.

Research Question 7: Which search-based techniques can transform a fine-grained API into multiple coarse-grained APIs, reducing the total number of remote invocations?

As discussed in Section 1.1.3, fine-grained web APIs should be refactored into coarse-grained web APIs to avoid performance problems. In Chapter 8 we defined a genetic algorithm to infer coarse-grained Façade APIs from the clients' usage of a fine-grained API. The genetic algorithm looks for Façade APIs that cluster together the fine-grained methods of the original API. Fine-grained methods are clustered into a single coarse-grained method if they are invoked consecutively by the clients.
In this way the clients can invoke the coarse-grained methods in the Façade APIs, reducing the number of remote invocations. A first study showed that the genetic algorithm outperforms the random search technique and is always able to suggest the right Façade APIs for the working example shown in Chapter 8. The capability of the random approach decreases with larger fine-grained APIs. This approach can be used every time there is the need to reduce the chattiness of web APIs. In such cases the Façade APIs retrieved by the genetic algorithm can be deployed on top of the original APIs. This allows the clients to interact with the APIs with fewer invocations while keeping the original APIs.

9.3 Recommendations for Future Work

The work presented in this dissertation provides relevant insights into the change-proneness of web APIs. However, this is only a first step in this area of research, which certainly needs to be incrementally enriched and revised. In this section we present the recommendations for future work for each of the different tracks of this PhD project.

To investigate the change-proneness of APIs we have performed quantitative studies. These studies provide statistical evidence of heuristics to highlight change-prone APIs. This track should be enriched by performing qualitative analyses. These qualitative analyses should include questionnaires, surveys, and interviews allowing developers and engineers to further refine our findings. Moreover, it is desirable to perform a more extended quantitative analysis that analyzes software systems implemented in different programming languages and paradigms, also including commercial software systems.

The recommendations for the future work of Track 2 are threefold. First, a quantitative analysis of the change-proneness of low externally cohesive web APIs is desirable. This analysis should refine and revise the insights we collected in our survey. However, performing this analysis requires access to the clients' usage of web APIs, which might not be publicly available. As a consequence, this analysis should be performed in an industrial environment where this data is available.

A second important step to understand why web APIs change over time is understanding their purpose. Track 2 should be extended taking into account the web service typologies. Heuristics to classify web services into different typologies, as suggested by Krafzig et al. [2004], should be defined. We expect that some web service typologies change less frequently and for different reasons than others. For instance, the web API of a web service that is meant to bridge a technological gap would change only when the bridged technologies change. On the other hand, the interface of a web service that provides search functionalities can change every time the search criterion changes. To automatically classify web services we can analyze two sources of information. First, we can analyze the documentation that is usually available in natural language and published on websites. For instance, Google Maps web services are documented on their website (https://developers.google.com/maps/documentation/webservices/). The second source of information consists of the web API, which is composed of: 1) method declarations, 2) data types needed to invoke the methods and to retrieve the results, and 3) comments to ease the comprehension of a service interface.
To obtain relevant information from these two sources, future work should be based on information retrieval techniques, widely used in the software engineering community for similar purposes.

Finally, future work should investigate the change-proneness of REST APIs separately. In this dissertation we have focused on RPC APIs such as WSDL APIs. As discussed in Chapter 1, REST APIs are different because they are Resource APIs that expose resources through HTTP as application protocol. As a consequence, the operations they expose are fixed but the resource itself can change. Dedicated studies to investigate the change-proneness of resources are desirable to understand why REST APIs change.

As future work of Track 3, both the genetic algorithms presented in Chapter 7 and Chapter 8 can be further improved. For instance, the sub-APIs generated by the genetic algorithm in Chapter 7 expose disjoint sets of methods. These sub-APIs might expose overlapping sets of methods to show higher values of external cohesion. However, this causes the introduction of clones and further studies are needed to investigate how they impact other quality attributes such as maintainability.

9.4 Concluding Remarks

The work presented in this dissertation was aimed at investigating the change-proneness of APIs and web APIs. This work by no means covers all the aspects of change-prone APIs and web APIs nor provides a complete guideline on designing stable APIs and web APIs. However, we advanced the state-of-the-art in 1) validating software metrics (i.e., internal and external cohesion) that highlight change-prone APIs and web APIs, 2) analyzing service-oriented systems, and 3) refactoring fine-grained and low externally cohesive APIs and web APIs. Our contributions are aimed at giving new insights into the change-proneness of APIs and web APIs that allow the research community to further advance and refine our findings.

Bibliography

Marwen Abbes, Foutse Khomh, Yann-Gaël Guéhéneuc, and Giuliano Antoniol. An empirical study of the impact of two antipatterns, blob and spaghetti code, on program comprehension. In Tom Mens, Yiannis Kanellopoulos, and Andreas Winter, editors, CSMR, 15th European Conference on Software Maintenance and Reengineering, pages 181–190. IEEE Computer Society, 2011.

Hani Abdeen, Houari A. Sahraoui, and Osama Shata. How we design interfaces, and how to assess it. In ICSM, pages 80–89, 2013.

Omar Al Jadaan and Lakishmi Rajamani. Improved selection operator for ga. Journal of Theoretical and Applied Information Technology, 4(4), 2008.

Eyhab Al-Masri and Qusay H. Mahmoud. Investigating web services on the world wide web. WWW, pages 795–804, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-085-2.

Saad Alahmari, Ed Zaluska, and David C. De Roure. A metrics framework for evaluating soa service granularity. SCC, pages 512–519, Washington, DC, USA, 2011. ISBN 978-0-7695-4462-5.

Gustavo Alonso, Fabio Casati, Harumi Kuno, and Vijay Machiraju. Web Services: Concepts, Architectures and Applications. Springer Publishing Company, Incorporated, 1st edition, 2010. ISBN 3642078885, 9783642078880.

Mohammad Alshayeb and Wei Li. An empirical validation of object-oriented metrics in two different iterative software processes. Transactions on Software Engineering, 29:1043–1049, November 2003.

Lerina Aversano, Marcello Bruno, Massimiliano Di Penta, Amedeo Falanga, and Rita Scognamiglio.
Visualizing the evolution of web services using formal concept analysis. In IWPSE, pages 57–60, 2005. Jagdish Bansiya and Carl G. Davis. A hierarchical model for object-oriented design quality assessment. IEEE Trans. Softw. Eng., 28(1):4–17, January 2002. ISSN 0098-5589. I. Barker. What is information architecture? URL http://www.steptwo. com.au. Victor R. Basili, Lionel C. Briand, and Walcélio L. Melo. A validation of objectoriented design metrics as quality indicators. IEEE Trans. Software Eng., 22 (10):751–761, 1996. Sujoy Basu, Fabio Casati, and Florian Daniel. Toward web service dependency discovery for soa management. In Proceedings of the 2008 IEEE International Conference on Services Computing - Volume 2, pages 422–429, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3283-7-02. Abraham Bernstein, Jayalath Ekanayake, and Martin Pinzger. Improving defect prediction using temporal features and non linear models. In Ninth international workshop on Principles of software evolution: in conjunction with the 6th ESEC/FSE joint meeting, IWPSE ’07, pages 11–18, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-722-3. Shawn A. Bohner. Software change impacts - an evolving perspective. In ICSM, pages 263–272, 2002. Bart Du Bois, Serge Demeyer, Jan Verelst, Tom Mens, and Marijn Temmerman. Does god class decomposition affect comprehensibility? In Proceedings of the IASTED International Conference on Software Engineering, pages 346–355. IASTED/ACTA Press, 2006. Eric Bouwers, Arie van Deursen, and Joost Visser. Evaluating usefulness of software metrics: an industrial experience report. In Proceedings of the International Conference on Software Engineering, pages 921–930, 2013. Marcus A. S. Boxall and Saeed Araban. Interface metrics for reusability analysis of components. In Proceedings of the 2004 Australian Software Engineering Conference, ASWEC ’04, pages 40–, Washington, DC, USA, 2004. IEEE Computer Society. ISBN 0-7695-2089-8. BIBLIOGRAPHY 181 Lionel Briand, Walcelio Melo, and Juergen Wuest. Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Trans. Softw. Eng., 28:706–720, July 2002. Lionel C. Briand, John W. Daly, and Jürgen Wüst. A unified framework for cohesion measurement in object-oriented systems. Empirical Software Engineering, 3(1):65–117, July 1998. ISSN 1382-3256. Lionel C. Briand, Yvan Labiche, and Johanne Leduc. Toward the reverse engineering of uml sequence diagrams for distributed java software. IEEE Trans. Softw. Eng., 32:642–663, September 2006. ISSN 0098-5589. Peter F Brown and Rebekah Metz Booz Allen Hamilton. Reference model for service oriented architecture 1.0, 2006. William J. Brown, Raphael C. Malveau, Hays W. McCormikk III, and T.J. Mowbray. Anti Patterns: Refactoring Software, Architectures, and Projects in Crisis. Wiley, 1998. Cedric Brun and Alfonso Pierantonio. Model differences in the eclipse modelling framework. UPGRADE The European Journal for the Informatics Professional, IX:29–34, 2008. John Businge. Co-evolution of the eclipse SDK framework and its third-party plug-ins. In 17th European Conference on Software Maintenance and Reengineering, CSMR 2013, Genova, Italy, March 5-8, 2013, pages 427–430, 2013. John Businge, Alexander Serebrenik, and Mark van den Brand. An empirical study of the evolution of eclipse third-party plug-ins. 
In Proceedings of the Joint ERCIM Workshop on Software Evolution (EVOL) and International Workshop on Principles of Software Evolution (IWPSE), IWPSE-EVOL ’10, pages 63–72, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-01282. John Businge, Alexander Serebrenik, and Mark van den Brand. Analyzing the eclipse API usage: Putting the developer in the loop. In 17th European Conference on Software Maintenance and Reengineering, CSMR 2013, Genova, Italy, March 5-8, 2013, pages 37–46, 2013. Gerardo Canfora, Michele Ceccarelli, Luigi Cerulo, and Massimiliano Di Penta. Using multivariate time series and association rules to detect logical change coupling: An empirical study. In Proceedings of the 2010 IEEE International Conference on Software Maintenance, ICSM ’10, pages 1–10, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-1-4244-8630-4. 182 BIBLIOGRAPHY Luba Cherbakov, Mamdouh Ibrahim, and Jenny Ang. Soa antipatterns: the obstacles to the adoption and successful realization of service-oriented architecture, 2006. URL http://www.ibm.com/developerworks/ webservices/library/ws-antipatterns/. Shyam R. Chidamber and Chris F. Kemerer. Towards a metrics suite for object oriented design. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 197–211, 1991. Shyam R. Chidamber and Chris F. Kemerer. A metrics suite for object oriented design. Transactions on Software Engineering, 20(6):476–493, June 1994. ISSN 0098-5589. James O. Coplien and Neil B. Harrison. Organizational Patterns of Agile Software Development. Prentice-Hall, Upper Saddle River, NJ (2005), 1st edition, 2005. Steve Counsell, Stephen Swift, and Jason Crampton. The interpretation and utility of three cohesion metrics for object-oriented design. Transactions on Software Engineering and Methodology, 15(2):123–149, April 2006. ISSN 1049-331X. John W. Creswell and Vicki L.P. Clark. Designing and Conducting Mixed Methods Research. SAGE Publications, 2010. ISBN 9781412975179. Robert Daigneau. Service Design Patterns: Fundamental Design Solutions for SOAP/WSDL and RESTful Web Services. Pearson Education, 2011. ISBN 032154420X. Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and T. Meyarivan. A fast elitist non-dominated sorting genetic algorithm for multi-objective optimisation: Nsga-ii. In PPSN, volume 1917, pages 849–858, 2000. ISBN 3-540-410562. Karim Dhambri, Houari Sahraoui, and Pierre Poulin. Visual detection of design anomalies. In Proceedings of the 12 th European Conference on Software Maintenance and Reengineering, Tampere, Finland, pages 279–283. IEEE CS Press, April 2008. Danny Dig and Ralph E. Johnson. How do apis evolve? a story of refactoring. Journal of Software Maintenance, 18(2):83–107, 2006. BIBLIOGRAPHY 183 Bill Dudney, Joseph Krozak, Kevin Wittkopf, Stephen Asbury, and David Osborne. J2EE Antipatterns. John Wiley & Sons, Inc., New York, NY, USA, 1 edition, 2002. ISBN 0471146153. Mahmoud O. Elish and Mojeeb-Al-Rahman Al-Khiaty. A suite of metrics for quantifying historical changes to predict future change-prone classes in object-oriented software. Journal of Software: Evolution and Process, 25 (5):407–437, 2013. Thomas Erl. SOA Principles of Service Design (The Prentice Hall Service-Oriented Computing Series from Thomas Erl). Prentice Hall PTR, Upper Saddle River, NJ, USA, 2007. Len Erlikh. Leveraging legacy system dollars for e-business. IT Professional, 2 (3):17 –23, may/jun 2000. ISSN 1520-9202. Emanuel Falkenauer. Genetic Algorithms and Grouping Problems. 
John Wiley & Sons, Inc., New York, NY, USA, 1998. ISBN 0471971502. Zaiwen Feng, Keqing He, Rong Peng, and Yutao Ma. Taxonomy for evolution of service-based system. In SERVICES, pages 331–338, 2011. Colin J. Fidge. Timestamps in message-passing systems that preserve partial ordering. In Proceedings of the 11th Australian Computer Science Conference, pages 56–66, 1988. Roy Thomas Fielding. Architectural Styles and the Design of Network-based Software Architectures. PhD thesis, 2000. AAI9980887. Jr. Floyd J. Fowler. Survey Research Methods (4th ed.). SAGE Publications, Inc., 0 edition, 2009. Beat Fluri and Harald C. Gall. Classifying change types for qualifying change couplings. In Proceedings of the 14th IEEE International Conference on Program Comprehension, ICPC ’06, pages 35–45, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2601-2. Beat Fluri, Michael Wuersch, Martin PInzger, and Harald Gall. Change distilling: Tree differencing for fine-grained source code change extraction. IEEE Trans. Softw. Eng., 33:725–743, November 2007. Marios Fokaefs, Rimon Mikhaiel, Nikolaos Tsantalis, Eleni Stroulia, and Alex Lau. An empirical study on web service evolution. In Proceedings of the International Conference on Web Services, pages 49–56, 2011. 184 BIBLIOGRAPHY Martin Fowler. Refactoring – Improving the Design of Existing Code. AddisonWesley, 1st edition, June 1999. ISBN 0-201-48567-2. Harald C. Gall, Beat Fluri, and Martin Pinzger. Change analysis with evolizer and changedistiller. IEEE Softw., 26:26–33, January 2009. ISSN 0740-7459. Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design patterns: elements of reusable object-oriented software. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995. ISBN 0-201-63361-2. Shadi Ghaith and Mel Ó Cinnéide. Improving software security using searchbased refactoring. In SSBSE, pages 121–135, 2012. Adnane Ghannem, Ghizlane El-Boussaidi, and Marouane Kessentini. Model refactoring using interactive genetic algorithm. In SSBSE, pages 96–110, 2013. Emanuel Giger, Martin Pinzger, and Harald C. Gall. Comparing fine-grained source code changes and code churn for bug prediction. In Proceedings of the 8th Working Conference on Mining Software Repositories, MSR ’11, pages 83–92, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0574-7. Tudor Girba, Stéphane Ducasse, and Michele Lanza. Yesterday’s weather: Guiding early reverse engineering efforts by summarizing the evolution of changes. In Proceedings of the International Conference on Software Maintenance, pages 40–49, 2004. David M. Green and John A. Swets. Signal detection theory and psychophysics, volume 1. Wiley, 1966. Robert J. Grissom and John J. Kim. Effect sizes for research: A broad practical approach. Lawrence Earlbaum Associates, 2nd edition edition, 2005. Raf Haesen, Monique Snoeck, Wilfried Lemahieu, and Stephan Poelmans. On the definition of service granularity and its architectural impact. CAiSE, pages 375–389, Berlin, Heidelberg, 2008. ISBN 978-3-540-69533-2. Mark Harman, Stephen Swift, and Kiarash Mahdavi. An empirical study of the robustness of two module clustering fitness functions. In GECCO, pages 1029–1036, 2005. ISBN 1-59593-010-8. Mark Harman, S. Afshin Mansouri, and Yuanyuan Zhang. Search-based software engineering: Trends, techniques and applications. ACM Comput. Surv., 45(1):11:1–11:61, December 2012. ISSN 0360-0300. BIBLIOGRAPHY 185 Brian Henderson-Sellers. Object-oriented metrics: measures of complexity. 
Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1996. ISBN 0-13-2398729. Brian Henderson-Sellers, Larry L. Constantine, and Ian M. Graham. Coupling and cohesion (towards a valid metrics suite for object-oriented analysis and design). Object Oriented Systems, 3:143–158, 1996. Gregor Hohpe and Bobby Woolf. Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2003. ISBN 0321200683. Will G. Hopkins. A new view of statistics. Internet Society for Sport Science, 2000. Daqing Hou and Xiaojia Yao. Exploring the intent behind api evolution: A case study. In Proceedings of the Working Conference on Reverse Engineering, pages 131–140, 2011. Curtis E. Hrischuk and Murray C. Woodside. Logical clock requirements for reverse engineering scenarios from a distributed system. IEEE Trans. Softw. Eng., 28:321–339, April 2002. ISSN 0098-5589. Eduardo Raul Hruschka, Ricardo J. G. B. Campello, Alex A. Freitas, and André C. Ponce Leon F. De Carvalho. A survey of evolutionary algorithms for clustering. Trans. Sys. Man Cyber Part C, 39(2):133–155, March 2009. ISSN 1094-6977. Deligiannis Ignatios, Stamelos Ioannis, Angelis Lefteris, Roumeliotis Manos, and Shepperd Martin. A controlled experiment investigation of an object oriented design heuristic for maintainability. Journal of Systems and Software, 65(2), February 2003. Deligiannis Ignatios, Shepperd Martin, Roumeliotis Manos, and Stamelos Ioannis. An empirical investigation of an object-oriented design heuristic for maintainability. Journal of Systems and Software, 72(2), 2004. Daniel Jacobson. Embracing the differences : Inside the netflix api redesign. http://techblog.netflix.com/2012/07/embracingdifferences-inside-netflix.html, 2012. [Online; accessed May2014]. 186 BIBLIOGRAPHY Jinlei Jiang, Yongwei Wu, and Guangwen Yang. Making service granularity right: An assistant approach based on business process analysis. CHINAGRID, pages 204–210, Washington, DC, USA, 2011. ISBN 978-0-7695-44724. Nicolai Josuttis. Soa in Practice: The Art of Distributed System Design. O’Reilly Media, Inc., 2007. ISBN 0596529554. Huzefa Kagdi, Michael L. Collard, and Jonathan I. Maletic. A survey and taxonomy of approaches for mining software repositories in the context of software evolution. J. Softw. Maint. Evol., 19(2):77–131, March 2007. ISSN 1532-060X. Foutse Khomh, Massimiliano Di Penta, and Yann-Gael Gueheneuc. An exploratory study of the impact of code smells on software change-proneness. In Proceedings of the Working Conference on Reverse Engineering, pages 75– 84, 2009. Foutse Khomh, Stephane Vaucher, Yann-Gaël Guéhéneuc, and Houari Sahraoui. Bdtex: A gqm-based bayesian approach for the detection of antipatterns. Journal of Systems and Software, 84(4):559 – 572, 2011. ISSN 0164-1212. <ce:title>The Ninth International Conference on Quality Software</ce:title>. Foutse Khomh, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano Antoniol. An exploratory study of the impact of antipatterns on class change- and fault-proneness. Empirical Software Engineering, 17(3):243– 275, 2012. Taghi M. Khoshgoftaar and Robert M. Szabo. Improving code churn predictions during the system test and maintenance phases. In Proceedings of the International Conference on Software Maintenance, pages 58–67, 1994. Alireza Khoshkbarforoushha, R. Tabein, Pooyan Jamshidi, and Fereidoon Shams Aliee. Towards a metrics suite for measuring composite service granularity level appropriateness. 
In SERVICES, pages 245–252, 2010. ISBN 978-0-7695-4129-7. Dirk Krafzig, Karl Banke, and Dirk Slama. Enterprise SOA: Service-Oriented Architecture Best Practices (The Coad Series). Prentice Hall PTR, Upper Saddle River, NJ, USA, 2004. ISBN 0131465759. BIBLIOGRAPHY 187 Jaroslav Král and Michal Zemlicka. The most important service-oriented antipatterns. In Proceedings of the International Conference on Software Engineering Advances, page 29, 2007. WH Kruskal and WA Wallis. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260):583–621, 1952. Naveen N. Kulkarni and Vishal Dwivedi. The role of service granularity in a successful soa realization - a case study. In SERVICES I, pages 423–430. IEEE Computer Society, 2008. ISBN 978-0-7695-3286-8. Avadhesh Kumar, Rajesh Kumar, and P. S. Grover. Unified cohesion measures for aspect-oriented systems. International Journal of Software Engineering and Knowledge Engineering, 21(1):143–163, 2011. Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, 1978. Guillaume Langelier, Houari A. Sahraoui, and Pierre Poulin. Visualizationbased analysis of quality for large-scale software systems. In proceedings of the 20 t h international conference on Automated Software Engineering. ACM Press, Nov 2005. Michele Lanza and Radu Marinescu. Object-Oriented Metrics in Practice. Springer-Verlag, 2006. ISBN 3-540-24429-8. Erich Leo Lehmann and H.J.M D’Abrera. Nonparametrics : Statistical Methods Based on Ranks. Holden-Day Series in Probability and Statistics. Holden-Day New York Dusseldorf Johannesbourg, 1975. ISBN 0-07-037073-7. Philipp Leitner, Anton Michlmayr, Florian Rosenberg, and Schahram Dustdar. End-to-end versioning support for web services. In 2008 IEEE International Conference on Services Computing (SCC), pages 59–66, 2008. Stefan Lessmann, Bart Baesens, Christophe Mues, and Swantje Pietsch. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Trans. Softw. Eng., 34:485–496, July 2008. ISSN 0098-5589. Wei Li and Sallie M. Henry. Object-oriented metrics which predict maintainability. Technical report, Virginia Polytechnic Institute & State University, Blacksburg, VA, USA, 1993. 188 BIBLIOGRAPHY Wei Li and Raed Shatnawi. An empirical study of the bad smells and class error probability in the post-release object-oriented system evolution. Journal of Systems and Software, 80(7), 2007. Zheng Li, Yi Bian, Ruilian Zhao, and Jun Cheng. A fine-grained parallel multiobjective test case prioritization on gpu. In SSBSE, pages 111–125, 2013. Fangfang Liu, Yuliang Shi, Jie Yu, Tianhong Wang, and Jingzhe Wu. Measuring similarity of web services based on wsdl. In ICWS, pages 155–162, 2010. Kiarash Mahdavi, Mark Harman, and Robert M. Hierons. A multiple hill climbing approach to software module clustering. In ICSM, pages 315–324, 2003. ISBN 0-7695-1905-9. Spiros Mancoridis, Brian S. Mitchell, Chris Rorres, Yih-Farn Chen, and Emden R. Gansner. Using automatic clustering to produce high-level system organizations of source code. In IWPC, pages 45–52, 1998. ISBN 0-81868560-3. Spiros Mancoridis, Brian S. Mitchell, Yih-Farn Chen, and Emden R. Gansner. Bunch: A clustering tool for the recovery and maintenance of software system structures. In ICSM, pages 50–59, 1999. Henry B. Mann and Whitney D. R. On a test of whether one of two random variables is stochastically larger than the other. 
Annals of Mathematical Statistics, 18(1):50–60, 1947. Mika Mantyla. Bad Smells in Software - a Taxonomy and an Empirical Study. PhD thesis, Helsinki University of Technology, 2003. Radu Marinescu. Detection strategies: Metrics-based rules for detecting design flaws. In Proceedings of the 20 th International Conference on Software Maintenance, pages 350–359. IEEE CS Press, 2004. Robert C. Martin. Agile Software Development, Principles, Patterns, and Practices. Prentice-Hall, Inc, 2002. Friedemann Mattern. Virtual time and global states of distributed systems. In Parallel and Distributed Algorithms, pages 215–226. North-Holland, 1989. Bernhart M. Mauczka A., Grechenig T. Predicting code change by using static metrics. In Software Engineering Research, Management and Applications, pages 64–71, 2009. BIBLIOGRAPHY 189 Diego Mendez, Benoit Baudry, and Martin Monperrus. Empirical evidence of large-scale diversity in api usage of object-oriented software. In SCAM, pages 43–52, 2013. Tim Menzies, Jeremy Greenwald, and Art Frank. Data mining static code attributes to learn defect predictors. IEEE Trans. Softw. Eng., 33:2–13, January 2007. ISSN 0098-5589. Brian S. Mitchell and Spiros Mancoridis. Using heuristic search techniques to extract design abstractions from source code. In GECCO, pages 1375–1382, 2002. ISBN 1-55860-878-8. Brian S. Mitchell and Spiros Mancoridis. On the automatic modularization of software systems using the bunch tool. IEEE Trans. Software Eng., 32(3): 193–208, 2006. Naouel Moha, Yann-Gaël Guéhéneuc, Anne-Françoise Le Meur, and Laurence Duchien. A domain analysis to specify design defects and generate detection algorithms. In Proceedings of the Theory and practice of software, 11th international conference on Fundamental approaches to software engineering, FASE’08/ETAPS’08, pages 276–291, Berlin, Heidelberg, 2008a. SpringerVerlag. ISBN 3-540-78742-9, 978-3-540-78742-6. Naouel Moha, Amine Mohamed Rouane Hacene, Petko Valtchev, and YannGaël Guéhéneuc. Refactorings of design defects using relational concept analysis. In Proceedings of the 6th international conference on Formal concept analysis, ICFCA’08, pages 289–304, Berlin, Heidelberg, 2008b. SpringerVerlag. ISBN 3-540-78136-6, 978-3-540-78136-3. Naouel Moha, Yann-Gael Gueheneuc, Laurence Duchien, and Anne-Francoise Le Meur. Decor: A method for the specification and detection of code and design smells. IEEE Trans. Softw. Eng., 36(1):20–36, January 2010. ISSN 0098-5589. Naouel Moha, Francis Palma, Mathieu Nayrolles, Benjamin Joyen Conseil, Yann-Gael.Gueheneuc@polymtl.Ca Yann-Gael, Guéhéneuc, Benoit Baudry, and Jean-Marc Jézéquel. Specification and detection of soa antipatterns. In Proceedings of the International Conference on Service Oriented Computing, pages 1–16, Shanghai, China, 2012. Matthew James Munro. Product metrics for automatic identification of “bad smell" design problems in java source-code. In Proceedings of the 11 th International Software Metrics Symposium. IEEE Computer Society Press, September 2005. 190 BIBLIOGRAPHY Stephan Murer. 13 years of soa at credit suisse: Lessons learned-remaining challenges. In Proceedings of the European Conference on Web Services, page 12, Sept 2011. Stephan Murer, Bruno Bonati, and Frank Furrer. Managed Evolution - A Strategy for Very Large Information Systems. Springer, 2010. ISBN 3-642-01632-4. Nachiappan Nagappan, Andreas Zeller, Thomas Zimmermann, Kim Herzig, and Brendan Murphy. Change bursts as defect predictors. In ISSRE, pages 309–318, 2010. 
Dongkyung Nam and Cheol Hoon Park. Multiobjective simulated snnealing: a comparative study to evolutionary algorithms. International Journal of Fuzzy Systems, 2(2):87–97, 2000. Hans Neukom. Early use of computers in swiss banks. IEEE Annals of the History of Computing, 26(3):50–59, 2004. Mark O’Keeffe and Mel í Cinnéide. Search-based refactoring for software maintenance. J. Syst. Softw., 81(4):502–516, April 2008. ISSN 0164-1212. Steffen Olbrich, Daniela S. Cruzes, Victor Basili, and Nico Zazworka. The evolution and impact of code smells: A case study of two open source systems. In Third International Symposium on Empirical Software Engineering and Measurement, 2009. Rocco Oliveto, Foutse Khomh, Giuliano Antoniol, and Yann-Gaël Guéhéneuc. Numerical signatures of antipatterns: An approach based on b-splines. In Rafael Capilla, Rudolf Ferenc, and Juan Carlos Dueas, editors, Proceedings of the 14 th Conference on Software Maintenance and Reengineering. IEEE Computer Society Press, March 2010. Mike P. Papazoglou. The challenges of service evolution. In Proceedings of the international Conference on Advanced Information Systems Engineering, pages 1–15, 2008. Cesare Pautasso and Erik Wilde. Why is the web loosely coupled?: A multifaceted metric for service design. In Proceedings of the 18th International Conference on World Wide Web, WWW ’09, pages 911–920, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-487-4. Cesare Pautasso, Olaf Zimmermann, and Frank Leymann. Restful web services vs. "big"’ web services: Making the right architectural decision. In Proceedings of the 17th International Conference on World Wide Web, WWW ’08, pages 805–814, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-085-2. BIBLIOGRAPHY 191 Massimiliano Di Penta, Luigi Cerulo, Yann-Gaël Guéhéneuc, and Giuliano Antoniol. An empirical study of the relationships between design pattern roles and class change proneness. In Proceedings of the International Conference on Software Maintenance, pages 217–226, 2008. Mikhail Perepletchikov, Caspar Ryan, and Keith Frampton. Towards the definition and validation of coupling metrics for predicting maintainability in service-oriented designs. In OTM Workshops (1), pages 34–35, 2006. Mikhail Perepletchikov, Caspar Ryan, and Keith Frampton. Cohesion metrics for predicting maintainability of service-oriented software. In Proceedings of the International Conference on Quality Software, pages 328–335, 2007. ISBN 0-7695-3035-4. Mikhail Perepletchikov, Caspar Ryan, and Zahir Tari. The impact of service cohesion on the analyzability of service-oriented software. Transactions on Services Computing, 3(2):89–103, April 2010. ISSN 1939-1374. Pierluigi Plebani and Barbara Pernici. Urbe: Web service retrieval based on similarity evaluation. IEEE Trans. on Knowl. and Data Eng., 21:1629–1642, November 2009. ISSN 1041-4347. Daryl Posnett, Christian Bird, and Prem Dévanbu. An empirical study on the influence of pattern roles on change-proneness. Empirical Software Engineering, 16(3):396–423, June 2011. ISSN 1382-3256. Colin Potts. Software-engineering research revisited. IEEE Softw., 10(5):19– 28, September 1993. ISSN 0740-7459. Kata Praditwong, Mark Harman, and Xin Yao. Software module clustering as a multi-objective search problem. IEEE Trans. Software Eng., 37(2):264–282, 2011. Steven Raemaekers, Arie van Deursen, and Joost Visser. Measuring software library stability through historical version analysis. In Proceedings of the International Conference on Software Maintenance, pages 378–387, 2012. 
Steven Raemaekers, Arie van Deursen, and Joost Visser. The Maven repository dataset of metrics, changes, and dependencies. In MSR, pages 221–224, 2013.
Romain Robbes, Damien Pollet, and Michele Lanza. Logical coupling based on fine-grained change information. In Proceedings of the 2008 15th Working Conference on Reverse Engineering, pages 42–46, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3429-9.
Daniele Romano and Martin Pinzger. Using source code metrics to predict change-prone Java interfaces. In ICSM, pages 303–312, 2011a. ISBN 978-1-4577-0663-9.
Daniele Romano and Martin Pinzger. Using vector clocks to monitor dependencies among services at runtime. In Proceedings of the International Workshop on Quality Assurance for Service-Based Applications, QASBA '11, pages 1–4, 2011b. ISBN 978-1-4503-0826-7.
Daniele Romano and Martin Pinzger. Analyzing the evolution of web services using fine-grained changes. In ICWS, pages 392–399, 2012. ISBN 978-1-4673-2131-0.
Daniele Romano and Martin Pinzger. A genetic algorithm to find the adequate granularity for service interfaces. In 2014 IEEE World Congress on Services, Anchorage, AK, USA, June 27 - July 2, 2014, pages 478–485, 2014.
Daniele Romano, Martin Pinzger, and Eric Bouwers. Extracting dynamic dependencies between web services using vector clocks. In SOCA, pages 1–8, 2011.
Daniele Romano, Paulius Raila, Martin Pinzger, and Foutse Khomh. Analyzing the impact of antipatterns on change-proneness using fine-grained source code changes. In Proceedings of the Working Conference on Reverse Engineering, pages 437–446, 2012.
Daniele Romano, Maria Kalouda, and Martin Pinzger. Analyzing the impact of external and internal cohesion on the change-proneness of web APIs. Technical Report TUD-SERG-2013-018, Software Engineering Research Group, Delft University of Technology, 2013. URL http://swerl.tudelft.nl/twiki/pub/Main/TechnicalReports/TUD-SERG-2013-018.pdf.
Daniele Romano, Steven Raemaekers, and Martin Pinzger. Refactoring fat interfaces using a genetic algorithm. Technical Report TUD-SERG-2014-007, Software Engineering Research Group, Delft University of Technology, 2014. URL http://swerl.tudelft.nl/twiki/pub/Main/TechnicalReports/TUD-SERG-2014-007.pdf.
Dieter H. Rombach. A controlled experiment on the impact of software structure on maintainability. IEEE Trans. Softw. Eng., 13:344–354, March 1987. ISSN 0098-5589.
Dieter H. Rombach. Design measurement: Some lessons learned. IEEE Softw., 7:17–25, March 1990. ISSN 0740-7459.
Arnon Rotem-Gal-Oz. SOA Patterns. Manning Publications, 1st edition, 2012. ISBN 9781933988269.
Günter Rudolph. Evolutionary search for minimal elements in partially ordered finite sets. In Evolutionary Programming, volume 1447 of Lecture Notes in Computer Science, pages 345–353, 1998. ISBN 3-540-64891-7.
Shawn A. Bohner and R. S. Arnold. Software Change Impact Analysis. IEEE Computer Society Press, 1996.
Jeffery Shelburg, Marouane Kessentini, and Daniel R. Tauritz. Regression testing for model transformations: A multi-objective approach. In SSBSE, volume 8084 of Lecture Notes in Computer Science, pages 209–223, 2013. ISBN 978-3-642-39741-7.
David J. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC, 4th edition, 2007. ISBN 1584888148, 9781584888147.
Jelber Sayyad Shirabad, Timothy C. Lethbridge, and Stan Matwin. Mining the maintenance history of a legacy software system. In Proceedings of the International Conference on Software Maintenance, ICSM '03, pages 95–, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-1905-9.
Frank Simon, Frank Steinbrückner, and Claus Lewerentz. Metrics based refactoring. In Proceedings of the Fifth European Conference on Software Maintenance and Reengineering (CSMR'01), page 30. IEEE CS Press, 2001. ISBN 0-7695-1028-0.
Renuka Sindhgatta, Bikram Sengupta, and Karthikeyan Ponnalagu. Measuring the quality of service oriented design. In Proceedings of the International Joint Conference on Service-Oriented Computing, pages 485–499, Berlin, Heidelberg, 2009. Springer-Verlag.
S. N. Sivanandam and S. N. Deepa. Introduction to Genetic Algorithms. Springer Publishing Company, Incorporated, 1st edition, 2007. ISBN 354073189X, 9783540731894.
Edward Smith, Robert Loftin, Emerson Murphy-Hill, Christian Bird, and Thomas Zimmermann. Improving developer participation rates in surveys. In Proceedings of the International Workshop on Cooperative and Human Aspects of Software Engineering, pages 89–92, 2013.
Ramanath Subramanyam and M. S. Krishnan. Empirical analysis of CK metrics for object-oriented design complexity: Implications for software defects. IEEE Trans. Softw. Eng., 29:297–310, April 2003. ISSN 0098-5589.
S. Dowdy, S. Wearden, and D. Chilko. Statistics for Research. Probability and Statistics. John Wiley and Sons, 2004.
Suresh Thummalapenta, Luigi Cerulo, Lerina Aversano, and Massimiliano Di Penta. An empirical study on the maintenance of source code clones. Empirical Software Engineering, 15(1):1–34, 2010.
Sander Tichelaar, Stéphane Ducasse, and Serge Demeyer. FAMIX and XMI. In Proceedings of the Seventh Working Conference on Reverse Engineering (WCRE'00), WCRE '00, pages 296–, Washington, DC, USA, 2000. IEEE Computer Society. ISBN 0-7695-0881-2.
Guilherme Travassos, Forrest Shull, Michael Fredericks, and Victor R. Basili. Detecting defects in object-oriented designs: Using reading techniques to increase software quality. In Proceedings of the 14th Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 47–56. ACM Press, 1999.
Martin Treiber, Hong Linh Truong, and Schahram Dustdar. On analyzing evolutionary changes of web services. In ICSOC Workshops, pages 284–297, 2008.
Nikolaos Tsantalis, Alexander Chatzigeorgiou, and George Stephanides. Predicting the probability of change in object-oriented systems. IEEE Transactions on Software Engineering, 31(7):601–614, 2005. ISSN 0098-5589.
Nikolaos Tsantalis, Natalia Negara, and Eleni Stroulia. WebDiff: A generic differencing service for software artifacts. In ICSM, pages 586–589, 2011.
Eva van Emden and Leon Moonen. Java quality assurance by detecting code smells. In Proceedings of the 9th Working Conference on Reverse Engineering (WCRE'02). IEEE CS Press, October 2002.
Mario Linares Vásquez, Gabriele Bavota, Carlos Bernal-Cárdenas, Massimiliano Di Penta, Rocco Oliveto, and Denys Poshyvanyk. API change and fault proneness: A threat to the success of Android apps. In Proceedings of the ESEC/SIGSOFT Foundations of Software Engineering, pages 477–487, 2013.
W3C. Web services architecture. http://www.w3.org/TR/ws-arch/, 2004. [Online; accessed May 2014].
William C. Wake. Refactoring Workbook. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2003. ISBN 0321109295.
Shuying Wang and Miriam A. M. Capretz. A dependency impact analysis model for web services evolution. In ICWS, pages 359–365, 2009.
Bruce F. Webster. Pitfalls of Object-Oriented Development. M & T Books, 1st edition, February 1995. ISBN 1558513973.
Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005. ISBN 0120884070.
Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers, Norwell, MA, USA, 2000. ISBN 0-7923-8682-5.
Qian Wu, Ling Wu, Guangtai Liang, Qianxiang Wang, Tao Xie, and Hong Mei. Inferring dependency constraints on parameters for web services. In WWW, pages 1421–1432, 2013.
Zhenchang Xing and Eleni Stroulia. UMLDiff: An algorithm for object-oriented design differencing. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, ASE '05, pages 54–65, 2005a.
Zhenchang Xing and Eleni Stroulia. Analyzing the evolutionary history of the logical design of object-oriented software. IEEE Trans. Software Eng., 31(10):850–868, 2005b.
Aiko Fallas Yamashita and Leon Moonen. Exploring the impact of inter-smell relations on software maintainability: An empirical study. In ICSE, pages 682–691, 2013. ISBN 978-1-4673-3076-3.
Shin Yoo and Mark Harman. Pareto efficient multi-objective test case selection. In ISSTA, pages 140–150, 2007. ISBN 978-1-59593-734-6.
Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput., 18:1245–1262, December 1989.
Yilei Zhang, Zibin Zheng, and Michael R. Lyu. WSExpress: A QoS-aware search engine for web services. In ICWS, pages 91–98. IEEE Computer Society, 2010. ISBN 978-0-7695-4128-0.
Yuanyuan Zhang, Mark Harman, and Soo Ling Lim. Empirical evaluation of search based requirements interaction management. Information & Software Technology, 55(1):126–152, 2013.
Jianjun Zhao and Baowen Xu. Measuring aspect cohesion. In Proceedings of the Fundamental Approaches to Software Engineering, pages 54–68, 2004.
Yuming Zhou and Hareton Leung. Predicting object-oriented software maintainability using multivariate adaptive regression splines. J. Syst. Softw., 80:1349–1361, August 2007. ISSN 0164-1212.
Yuming Zhou, Hareton Leung, and Baowen Xu. Examining the potentially confounding effect of class size on the associations between object-oriented metrics and change-proneness. Transactions on Software Engineering, 35(5):607–623, 2009.
Thomas Zimmermann, Peter Weisgerber, Stephan Diehl, and Andreas Zeller. Mining version histories to guide software changes. In Proceedings of the 26th International Conference on Software Engineering, ICSE '04, pages 563–572, Washington, DC, USA, 2004. IEEE Computer Society. ISBN 0-7695-2163-0.
Thomas Zimmermann, Rahul Premraj, and Andreas Zeller. Predicting defects for Eclipse. In Proceedings of the Third International Workshop on Predictor Models in Software Engineering, PROMISE '07, pages 9–, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-2954-2.
Thomas Zimmermann, Nachiappan Nagappan, Harald Gall, Emanuel Giger, and Brendan Murphy. Cross-project defect prediction: A large scale experiment on data vs. domain vs. process. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE '09, pages 91–100, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-001-2.
Eckart Zitzler and Lothar Thiele. Multiobjective evolutionary algorithms: A comparative case study and the strength Pareto approach. IEEE Trans. Evolutionary Computation, 3(4):257–271, 1999.

Summary

Analyzing the Change-Proneness of APIs and web APIs

APIs and web APIs are used to expose existing business logic and, hence, to ease the reuse of functionality across multiple software systems. Software systems can use the business logic of legacy systems by binding to their APIs and web APIs. With the emergence of the service-oriented programming paradigm, APIs are exposed as web APIs that hide the technologies used to implement legacy systems. As a consequence, web APIs establish contracts between legacy systems and their consumers, and they should stay as stable as possible so as not to break consumers' systems.

This dissertation aims at better understanding the change-proneness of APIs and web APIs. To that end, we investigated which indicators can be used to highlight change-prone APIs and web APIs, and we provided approaches to assist practitioners in refactoring them. To perform this analysis we adopted a research approach consisting of three tracks: analysis of change-prone APIs, analysis of change-prone web APIs, and refactoring of change-prone APIs and web APIs.

Change-Prone APIs

Service-oriented systems are composed of web services. Each web service is implemented by an implementation logic that is hidden from its clients behind its web APIs. Over the history of a software system, the implementation logic can change, and such changes can propagate to and affect the web APIs. Among all the software units composing the implementation logic, APIs are likely to be mapped directly onto web APIs. This scenario is especially likely when a legacy API is made available through a web service.

In this first track we focused on analyzing the change-proneness of APIs (i.e., the set of public methods declared in a software unit). Among all the metrics we analyzed, we have shown that the Interface Usage Cohesion (IUC) metric is the most suitable metric to highlight change-prone Java interfaces. This result suggests that software engineers should design interfaces with high external cohesion (measured with the IUC metric) to avoid frequent changes.

Moreover, we analyzed the impact of specific antipatterns on the change-proneness of APIs. We showed empirically that changes to APIs are more likely to appear if the APIs are affected by the ComplexClass, SpaghettiCode, and SwissArmyKnife antipatterns. As a consequence, software engineers should refactor APIs affected by these antipatterns.

Change-Prone Web APIs

In the second track we analyzed the change-proneness of web APIs. First, we developed two tools to analyze software systems composed of web APIs. The first tool, called WSDLDiff, extracts fine-grained changes between subsequent versions of WSDL APIs. The second tool extracts the full chains of dependencies among web APIs at run time.

Second, we performed an empirical study to investigate which scenarios can cause changes to web APIs. We showed that low externally cohesive APIs change frequently to 1) improve understandability and 2) ease maintainability and reduce clones in the APIs. Low internally cohesive APIs change frequently to 1) reduce the impact of changes on the many clients they have, 2) prevent the many clients from driving frequent changes to the APIs, and 3) improve understandability. A minimal sketch of the usage-based cohesion idea underlying both tracks is given below.
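The sketch below is only an illustration of how a usage-based (external) cohesion score can be computed from the methods each client actually invokes, assuming the common IUC-style formulation that averages, over all clients, the fraction of the API's methods used by each client. It is not the dissertation's tooling; the class, method, and client names are hypothetical.

```java
import java.util.Map;
import java.util.Set;

/**
 * Illustrative sketch of a usage-based (external) cohesion score in the
 * spirit of the Interface Usage Cohesion (IUC) metric: for every client,
 * take the fraction of the API's methods it invokes, then average over
 * all clients. Values near 1 mean clients use the API uniformly; low
 * values hint at a "fat", change-prone interface.
 */
public final class UsageCohesionSketch {

    private UsageCohesionSketch() {
    }

    /**
     * @param apiMethods          all methods declared by the API
     * @param methodsUsedByClient for each client, the API methods it invokes
     * @return a value in [0, 1]; 0 if there are no methods or no clients
     */
    public static double iuc(Set<String> apiMethods,
                             Map<String, Set<String>> methodsUsedByClient) {
        if (apiMethods.isEmpty() || methodsUsedByClient.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (Set<String> used : methodsUsedByClient.values()) {
            long declaredAndUsed = used.stream()
                                       .filter(apiMethods::contains)
                                       .count();
            sum += (double) declaredAndUsed / apiMethods.size();
        }
        return sum / methodsUsedByClient.size();
    }

    public static void main(String[] args) {
        // Hypothetical API with four methods and two clients that each use
        // only part of it: (1/4 + 3/4) / 2 = 0.5, i.e. low external cohesion.
        Set<String> api = Set.of("create", "read", "update", "delete");
        Map<String, Set<String>> clients = Map.of(
                "ReportingClient", Set.of("read"),
                "AdminClient", Set.of("create", "update", "delete"));
        System.out.printf("IUC = %.2f%n", iuc(api, clients));
    }
}
```

Under this formulation, splitting the hypothetical API above into a read-only interface and an administrative interface would raise the score of each resulting interface to 1.0, which is the intuition behind the ISP-based refactoring approach described next.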
Moreover, we proposed a new internal cohesion metric (DTC) to measure the internal cohesion of WSDL APIs.

Refactoring APIs and Web APIs

Based on the results of the studies performed in the first and second track, we defined two approaches to refactor APIs and web APIs.

The first approach assists software engineers in refactoring APIs with low external cohesion based on the Interface Segregation Principle (ISP). We defined the problem of splitting low externally cohesive APIs into smaller APIs specific to each client (i.e., ISP) as a multi-objective clustering optimization problem. To solve this problem we proposed a genetic algorithm that outperforms other search-based approaches.

The second approach assists software engineers in refactoring fine-grained web APIs. These APIs should be refactored into coarse-grained web APIs to reduce the number of remote invocations and avoid performance problems. To achieve this goal we proposed a genetic algorithm that looks for Façade APIs that cluster together the fine-grained methods of the original API.

Conclusion

We believe that these results advance the state of the art in designing, analyzing, and refactoring software systems composed of web APIs (i.e., service-oriented systems) and provide the research community with new insights into the change-proneness of APIs and web APIs.

Daniele Romano

Samenvatting

Analyse van Veranderlijke Web APIs en APIs

APIs en web APIs helpen om bestaande business-logica aan te bieden en vereenvoudigen het hergebruik van functionaliteit in meerdere software systemen. Software systemen kunnen de business-logica van legacy systemen gebruiken door elkaars APIs en web APIs te verbinden. Met de opkomst van het service-georiënteerde programmeerparadigma worden APIs geëxposeerd als web APIs die de technologie waarmee de legacy systemen geïmplementeerd zijn verbergen. Als gevolg hiervan sluiten web APIs contracten af tussen legacy systemen en hun gebruikers en dienen ze zo stabiel mogelijk te zijn zodat ze de systemen van deze gebruikers niet kapot maken.

Het doel van dit proefschrift is om een beter begrip te krijgen van de veranderlijkheid van APIs en web APIs. Hiervoor hebben we onderzocht welke indicatoren gebruikt kunnen worden om APIs en web APIs met een hoge veranderlijkheid te identificeren en we hebben ontwikkelaars van methodes voorzien om deze APIs te herschrijven. Om deze analyse uit te voeren hebben we een onderzoeksmethode gehanteerd die is opgedeeld in drie delen: analyse van veranderlijke APIs, analyse van veranderlijke web APIs en het herschrijven van veranderlijke APIs en web APIs.

Veranderlijke APIs

Service-georiënteerde systemen zijn opgebouwd uit web services. Elke web service is geïmplementeerd met behulp van een logica die verborgen is voor zijn gebruikers door middel van web APIs. Gedurende de geschiedenis van een software systeem kunnen veranderingen in deze logica doorwerken naar web APIs. Van alle software componenten waaruit de implementatie logica bestaat, is de API de meest waarschijnlijke om direct gekoppeld te worden aan web APIs. Dit scenario komt vaak voor als een legacy API beschikbaar wordt gemaakt als web service.

In dit eerste deel van het onderzoek focusten wij op het analyseren van de veranderlijkheid van APIs (i.e., de set publieke methoden in een software component). We hebben aangetoond dat de Interface Usage Cohesion (IUC) metriek de meest geschikte metriek is om veranderlijke Java-interfaces te identificeren.
Dit resultaat suggereert dat software engineers interfaces met een hoge mate van externe cohesie (gemeten met de IUC metriek) zouden moeten ontwerpen om frequente veranderingen te vermijden.

Ook hebben we de impact op de veranderlijkheid van APIs van specifieke antipatronen geanalyseerd. We hebben empirisch aangetoond dat veranderingen van APIs waarschijnlijker zijn wanneer ze slachtoffer zijn van het ComplexClass, SpaghettiCode of SwissArmyKnife antipatroon. Daarom dienen software engineers APIs die geraakt worden door deze antipatronen te herschrijven.

Veranderlijke web APIs

In het tweede deel van het onderzoek hebben we de veranderlijkheid van web APIs geanalyseerd. Ten eerste hebben we twee tools ontwikkeld om software systemen die opgebouwd zijn uit web APIs te analyseren. De eerste tool, WSDLDiff, extraheert zeer kleine veranderingen tussen opeenvolgende versies van WSDL APIs. De tweede tool extraheert de volledige reeks van afhankelijkheden tussen web APIs tijdens run-time.

Daarnaast hebben we een empirische studie uitgevoerd om te onderzoeken welke scenario's veranderingen in web APIs kunnen veroorzaken. We hebben aangetoond dat APIs met een lage externe cohesie vaak veranderen om 1) de begrijpelijkheid te verbeteren en 2) onderhoud te vereenvoudigen en het aantal clones binnen de API te verkleinen. APIs met een lage interne cohesie veranderen vaak om 1) de impact van veranderingen op het grote aantal klanten dat ze hebben te verkleinen, 2) te vermijden dat veranderende eisen van klanten leiden tot veranderingen in de APIs en 3) om de begrijpelijkheid te verbeteren.

Daarnaast hebben we een nieuwe interne cohesie metriek (DTC) voorgesteld voor het meten van interne cohesie van WSDL APIs.

Het herschrijven van APIs en Web APIs

Gebaseerd op de resultaten van de studies uit het eerste en tweede deel van dit onderzoek hebben we twee methodes voor het herschrijven van APIs en web APIs gepresenteerd.

De eerste methode assisteert software engineers met het herschrijven van APIs met een lage externe cohesie en is gebaseerd op het Interface Segregation Principle (ISP). We hebben het probleem van het opdelen van APIs met een lage externe cohesie in kleinere APIs specifiek voor elke klant (i.e., ISP) gedefinieerd als een multi-objective clustering optimalisatie probleem. Om dit probleem op te lossen hebben we een genetisch algoritme voorgesteld dat beter presteert dan andere search-based methodes.

De tweede methode assisteert software engineers met het herschrijven van fine-grained web APIs. Deze APIs dienen herschreven te worden als coarse-grained APIs om het aantal aanroepen van buitenaf te verkleinen en hiermee performance problemen te vermijden. Om dit te bereiken hebben we een genetisch algoritme voorgesteld dat zoekt naar Façade APIs die de fine-grained methodes van de originele API samenvoegen.

Conclusie

We geloven dat deze resultaten de state-of-the-art in het ontwerpen, analyseren en herschrijven van software systemen die bestaan uit web APIs (i.e., service-georiënteerde systemen) vooruit helpen. Daarnaast bieden ze de onderzoeksgemeenschap nieuwe inzichten in de veranderlijkheid van APIs en web APIs.

Daniele Romano

Curriculum Vitae

Education

2010 – 2014: Ph.D., Computer Science, Delft University of Technology, Delft, The Netherlands. Under the supervision of prof. dr. M. Pinzger.

2007 – 2010: M.Sc., Computer Science, University of Sannio, Benevento, Italy. Master's thesis title: An Approach for Search Based Testing of Null Pointer Exceptions.
2001 – 2006: B.Sc., Computer Science, University of Sannio, Benevento, Italy. Bachelor's thesis title: Development and testing of a GUI tool for the creation and modification of nomadic applications.

Work Experience

2014 – present: Advisory IT Specialist / Continuous Delivery Product Owner, ING Nederland, Amsterdam, The Netherlands.

2010: Software Engineering Researcher, internship at École Polytechnique de Montréal, Canada.

2005 – 2007: Java and SOA Software Developer as a freelancer, Benevento, Italy.

2006: Software Engineering Researcher, RCOST (Research Centre On Software Technology), Benevento, Italy.