Subsections
9.2 Component Repository Systems
Research on reusable component repository systems is abundant. They differ from
each other mainly in the retrieval mechanisms adopted. These systems can be
divided into three groups according to the aspect of programs on which
the retrieval mechanisms are based.
The retrieval mechanism used in CodeBroker is similar to those of reusable component
repository systems that use free-text indexing.
GURU [Maarek et al., 1991] indexes
components based on their textual documentation. Etzkorn and Davis have
tried to use header comments (similar
to doc comments in CodeBroker) to index legacy object-oriented
programs [Etzkorn and Davis, 1997b]. Comments and identifier names are
also used for indexing in [Girardi and Ibrahim, 1995]
and [DiFelice and Fonzi, 1998]. Michail and Notkin have demonstrated the
possibility of using identifier names only to find similar reusable
components for comparison [Michail and Notkin, 1999].
Free-text indexing is easy both for setting up a component repository and
for programmers to formulate reuse queries. Despite its simplicity,
empirical studies have found that it performs no worse, in terms of
retrieval effectiveness, than other more delicate, effort-consuming
repository systems [Frakes and Pole, 1994,Mili et al., 1997b].
Nevertheless, free-text indexing-based reuse systems do not directly
support shortening the conceptual gap in query formulation.
One attempt to bridge the conceptual gap is to use structured
representations and knowledge bases. Both CodeFinder [Henninger, 1997]
and LaSSIE [Devanbu et al., 1991] use frames to represent
reusable components. Frames in CodeFinder are connected by an
associative network whose links have weights to reflect the semantic
relationships among components. Searching relevant components is
supported by spreading activation. Frames in LaSSIE are structured
into hierarchical, taxonomic categories by human experts. To
ease the difficulty of creating the frame representations of
components, ROSA [Girardi and Ibrahim, 1995] applies natural language
processing techniques to automate the
process. However, sentences that can be processed are very
limited.
The multiple faceted classification schema [Prieto-Diaz, 1991]
is another format of structured representations. Reusable components
are represented with multiple facets, each of which is described with
a term. A conceptual distance graph has to be constructed to reflect
the semantic relationships among terms. AIRS is a system that combines
multiple facets and the frame-based approach [Ostertag et al., 1992].
Structured representation-based systems are labor intensive in
creating representations of components and knowledge bases.
Constraints of programs can also be used to index and retrieve
reusable components. Rittri first proposed to use signatures in
reusable component retrieval [Rittri, 1989]. His work is further
extended by Zaremski and Wing who give
a general framework
for signature matching in functional programming languages [Zaremski and Wing, 1995].
Although research on signature matching
has largely focused on functional programming languages that are
often designed with a sound type theory, signature matching is also applicable to
other strong-typed programming languages [DiCosmo, 1995]. An Ada
version of signature matching has been implemented in [Stringer-Calvert, 1994].
CodeBroker applies this technique to the strong-typed object-oriented
programming language. Signature matching in CodeBroker is not used
as the sole method of retrieving components; rather, it is used as a
filter to exclude those
components that are significantly different from the current task in terms of
constraint compatibility.
The formal specification-based approach is another form of using
constraints to index and retrieval components. Zaremski and Wing have
adopted pre- and post-predicates to find components
that exactly match or
approximately match a reuse query [Zaremski and Wing, 1997]. A. Mili et
al. have tried to classify reusable components based on refinement
order existing among their formal specifications [Mili et al., 1997a].
The formal specification-based approach could be integrated
into CodeBroker to improve the precision of retrieval, if the
programming environment supports formal methods. For the
majority of programmers, however, the formal approach is too difficult to use.
Behavior sampling exploits the code aspect of programs to retrieve
reusable components [Hall, 1993,Podgurski and Pierce, 1993]. In behavior
sampling-based systems, a
programmer randomly chooses a
small set of sample inputs and computes the corresponding outputs after
having specified the signature of the module (same as the constraint
query created by CodeBroker).
Reusable components whose signature is compatible are found and
executed on the sample inputs. Components whose outputs match the
outputs computed by the programmer are returned. Behavior sampling is
difficult to apply to components with complicated data
structures. Moreover, it is unable to find close but not identical
components.
Compared with other component repository systems, CodeBroker has three
unique features:
- It automatically extracts and formulates reuse queries.
- It is adaptable and adaptive to each programmer to reflect
their growing knowledge about reusable components.
- It is seamlessly integrated with current programming environments.
All of the above-mentioned systems, except CodeFinder, assume that component
retrieval is an one-time effort and do not support retrieval-by-reformulation. In addition to the query refinement supported in
CodeFinder, CodeBroker allows programmers to manipulate retrieved
components interactively to reduce the difficulties of component
choosing.
Ph.D. Dissertation by Yunwen Ye, April 20, 2001, Department of Computer Science, University of Colorado