Subsections


9.2 Component Repository Systems

Research on reusable component repository systems is abundant. They differ from each other mainly in the retrieval mechanisms adopted. These systems can be divided into three groups according to the aspect of programs on which the retrieval mechanisms are based.

9.2.1 Concept-Based Component Repository Systems

The retrieval mechanism used in CodeBroker is similar to those of reusable component repository systems that use free-text indexing. GURU [Maarek et al., 1991] indexes components based on their textual documentation. Etzkorn and Davis have tried to use header comments (similar to doc comments in CodeBroker) to index legacy object-oriented programs [Etzkorn and Davis, 1997b]. Comments and identifier names are also used for indexing in [Girardi and Ibrahim, 1995] and [DiFelice and Fonzi, 1998]. Michail and Notkin have demonstrated the possibility of using identifier names only to find similar reusable components for comparison [Michail and Notkin, 1999].

Free-text indexing is easy both for setting up a component repository and for programmers to formulate reuse queries. Despite its simplicity, empirical studies have found that it performs no worse, in terms of retrieval effectiveness, than other more delicate, effort-consuming repository systems [Frakes and Pole, 1994,Mili et al., 1997b]. Nevertheless, free-text indexing-based reuse systems do not directly support shortening the conceptual gap in query formulation.

One attempt to bridge the conceptual gap is to use structured representations and knowledge bases. Both CodeFinder [Henninger, 1997] and LaSSIE [Devanbu et al., 1991] use frames to represent reusable components. Frames in CodeFinder are connected by an associative network whose links have weights to reflect the semantic relationships among components. Searching relevant components is supported by spreading activation. Frames in LaSSIE are structured into hierarchical, taxonomic categories by human experts. To ease the difficulty of creating the frame representations of components, ROSA [Girardi and Ibrahim, 1995] applies natural language processing techniques to automate the process. However, sentences that can be processed are very limited. The multiple faceted classification schema [Prieto-Diaz, 1991] is another format of structured representations. Reusable components are represented with multiple facets, each of which is described with a term. A conceptual distance graph has to be constructed to reflect the semantic relationships among terms. AIRS is a system that combines multiple facets and the frame-based approach [Ostertag et al., 1992].

Structured representation-based systems are labor intensive in creating representations of components and knowledge bases.

9.2.2 Constraint-Based Component Repository Systems

Constraints of programs can also be used to index and retrieve reusable components. Rittri first proposed to use signatures in reusable component retrieval [Rittri, 1989]. His work is further extended by Zaremski and Wing who give a general framework for signature matching in functional programming languages [Zaremski and Wing, 1995]. Although research on signature matching has largely focused on functional programming languages that are often designed with a sound type theory, signature matching is also applicable to other strong-typed programming languages [DiCosmo, 1995]. An Ada version of signature matching has been implemented in [Stringer-Calvert, 1994]. CodeBroker applies this technique to the strong-typed object-oriented programming language. Signature matching in CodeBroker is not used as the sole method of retrieving components; rather, it is used as a filter to exclude those components that are significantly different from the current task in terms of constraint compatibility.

The formal specification-based approach is another form of using constraints to index and retrieval components. Zaremski and Wing have adopted pre- and post-predicates to find components that exactly match or approximately match a reuse query [Zaremski and Wing, 1997]. A. Mili et al. have tried to classify reusable components based on refinement order existing among their formal specifications [Mili et al., 1997a]. The formal specification-based approach could be integrated into CodeBroker to improve the precision of retrieval, if the programming environment supports formal methods. For the majority of programmers, however, the formal approach is too difficult to use.

9.2.3 Code-Based Reuse Repository Systems

Behavior sampling exploits the code aspect of programs to retrieve reusable components [Hall, 1993,Podgurski and Pierce, 1993]. In behavior sampling-based systems, a programmer randomly chooses a small set of sample inputs and computes the corresponding outputs after having specified the signature of the module (same as the constraint query created by CodeBroker). Reusable components whose signature is compatible are found and executed on the sample inputs. Components whose outputs match the outputs computed by the programmer are returned. Behavior sampling is difficult to apply to components with complicated data structures. Moreover, it is unable to find close but not identical components.

9.2.4 Uniqueness of CodeBroker

Compared with other component repository systems, CodeBroker has three unique features:
  1. It automatically extracts and formulates reuse queries.
  2. It is adaptable and adaptive to each programmer to reflect their growing knowledge about reusable components.
  3. It is seamlessly integrated with current programming environments.

All of the above-mentioned systems, except CodeFinder, assume that component retrieval is an one-time effort and do not support retrieval-by-reformulation. In addition to the query refinement supported in CodeFinder, CodeBroker allows programmers to manipulate retrieved components interactively to reduce the difficulties of component choosing.


Ph.D. Dissertation by Yunwen Ye, April 20, 2001, Department of Computer Science, University of Colorado