Subsections


6.1 Indexing and Retrieval Mechanisms

An effective retrieval mechanism is essential in deciding whether relevant reusable components can be located. An encoding schema that determines how to represent reusable components and reuse queries, and a relevance judgment criterion that determines whether a component is relevant to a reuse query, are two major considerations in the design of retrieval mechanisms. Encoding a component for the purpose of indexing can be based on its concepts, constraints, or code (see Section 9.2 for more details about other indexing methods used for component repositories). CodeBroker encodes components based on both concepts and constraints. The CodeBroker system extracts the concept of a component from its associated documentation embedded in Java source programs in the format of doc comments, and the constraint from the signature of a component. Reuse queries are represented in the same format. Relevance is determined by the combination of concept similarity and constraint compatibility. CodeBroker uses both probabilistic model-based indexing and retrieval techniques [Robertson and Walker, 1994] and the Latent Semantic Analysis (LSA) technique [Landauer and Dumais, 1997] to compute concept similarity. Constraint compatibility is computed by signature matching [Zaremski and Wing, 1995]. These methods are chosen because
  1. They require the least effort, among all suggested retrieval mechanisms for component repository systems, to encode components for indexing and retrieval.
  2. They are the easiest and the most straightforward way for programmers to formulate reuse queries for retrieval.
  3. The needed information to formulate a reuse query is readily available from the program editor.
  4. They are as effective as other complicated retrieval mechanisms in terms of retrieval performance [Frakes and Pole, 1994,Mili et al., 1997b].


6.1.1 Free-Text Indexing and Retrieval

Free-text indexing and retrieval is concerned with finding documents in free-text form (such as newspaper articles, research papers, books, web pages, etc.) relevant to the queries submitted by users [Salton and McGill, 1983].

Free-text documents are indexed by terms. Terms can be controlled or un-controlled. In the controlled-term approach, an indexer is responsible for choosing the appropriate terms to index the document. Those controlled terms are also known as keywords. Because this process is often manually conducted, it is very time-consuming when the collection of documents gets large. Another problem with controlled terms is that people often choose different terms to describe the same document [Furnas et al., 1987,Harman, 1995], and even the same person may not be consistent in choosing terms.

The other approach of choosing terms is to automatically extract them from documents. Terms can be precisely the words appearing in documents. However, most free-text indexing systems use one or both of the following two techniques: stemming and stop list. Stemming reduces words to their morphological root forms. For example, computer, computing, compute, computation, and computational are all reduced to the form comput and all five words are represented by the term comput. The advantage of stemming is to allow users to find documents that contain morphological variations of the word in their queries. A stop list is simply a list of high-frequency words that are not used as terms. Words that appear in almost every document are not very useful in distinguishing one document from other documents in terms of relevance to user queries.

The two primary models for the indexing and retrieving free-text documents are: the vector space model and the probabilistic model.

6.1.1.1 Vector Space Model

In the vector space model, documents and queries are represented as vectors of terms contained in the whole collection of documents, commonly known as a corpus. The value of each element in the vector reflects the importance of a particular term in representing the concept or meaning of that document. In a corpus containing N terms, a document is represented by a vector in the N-dimensional space as follows:

Docj = (w1, j, w2, j,..., wN, j) (6.1)

where
wi, j
is a value denoting the importance of the term i in representing the concept of the document Docj.

The variation of the vector space model comes from the method of determining the value for each wi, j. The simplest binary value model sets wi, j to 1, if the term i is present in the document j, and to 0, if it is not. The binary value model does not reflect the fact that some terms appear more in a document and thus contribute more to the concept of the document. Term frequency (tfi, j), which is the number of occurrences of term i in document j, can be used as the value of wi, j. Using term frequency favors longer documents because most longer documents tend to use the same word more often, as the verbosity hypothesis [Robertson and Walker, 1994] states. To level the ground, tfi, j can be normalized by being divided by the overall length of the document vector.

The term frequency model, normalized or not, treats all terms equally. However, terms that are limited to a few documents are more useful for discriminating documents from the rest of the corpus than terms that occur frequently across the entire corpus. To reflect this discrimination power of a term i, wi, j can be multiplied by its inverse document frequency (idfi). The idf for each term is defined as follows:

idfi = log($\displaystyle {\frac{{N}}{{df_i}}}$) (6.2)

where
N is the number of documents in the collection,
dfi is the number of documents that include term i.

Queries submitted by users are also in free-text form and are represented as a query vector in the same way as document vectors:

Q = (q1, q2,..., qN) (6.3)

where
qi
is $ {\frac{{qtf_i}}{{\sum_{p=1}^N qtf_p}}}$×idfi
qtfi
is the frequency of term i in the query.

The relevance of a document to a query is determined by comparing the document vector against the query vector. It is common to use the cosine of the angle between two vectors as the criteria to judge the similarity (sim) of a document and a query:

sim(Q, Dj) = $\displaystyle {\frac{{\sum_{i=1}^{N} w_{i,j} \times q_{i}}}{{\sqrt{\sum_{i=1}^{N}w_{i,j}^2} \times \sqrt{\sum_{i=1}^{N}q_{i}^2}}}}$ (6.4)

When the document and the query (or two documents) are identical, their vectors should be identical in the vector space and the cosine is one; and when they share no common terms, namely, they are orthogonal to each other, the cosine is zero. Upon receiving a query from a user, the retrieval system should thus compute the cosine for each document against the query vector, and return those documents with higher cosine values to the user as the relevant documents.


6.1.1.2 Probabilistic Model

The probabilistic model ranks documents in decreasing order of their evaluated probability of relevance to a user query. It makes use of formal theories of probability and statistics to evaluate, or estimate, those probabilities of relevance. The relevance probability is different from the similarity computed in the vector space model. The latter generally lacks the theoretical soundness of the relevance probability, which can be defined precisely. However, the computation of a theoretically sound probability is not practically tractable, and currently the probability can only be roughly approximated based on various simplification assumptions [Crestani et al., 1998].

The basis for all probabilistic models is the probability ranking principle, which asserts that optimal retrieval performance can be achieved when documents are ranked according to their probabilities of being judged relevant to a query [Robertson, 1977]. Given a query Q, the main task of retrieval systems based on the probabilistic model is to compute the relevance probability P(R| Q, Dj) for each query-document pair.

The relevance probability of a document can be estimated by assigning an appropriate weight to each term in the document corpus. Probabilistic models assume that terms are distributed differently in relevant and irrelevant documents, which is known as the cluster hypothesis [Van Rijsbergen, 1979]. If a term appears more frequently in relevant than in irrelevant documents, it has more power to discriminate relevant from irrelevant documents. The discriminating power of a term is called its term relevance weight (TRW) in probabilistic models, and its value is calculated by the following formula:

TRWi(0) = log$\displaystyle {\frac{{p_i \times (1-q_i) }}{{q_i \times (1-p_i)}}}$ (6.5)

where
pi, qi
represent the probability of the ith term appearing in a relevant or an irrelevant document, respectively.
The above formula can be computed only retrospectively on test collections where the relevance assessments of documents are known. At the time of regular document retrieval, we do not know yet which document is relevant or irrelevant; therefore, we do not know how to compute pi and qi, and TRWi can only be estimated.

A simplified formula, proposed by Croft and Harper [Croft and Harper, 1979], uses corpus information to make estimates and does not use the distribution probability (pi and qi). In their model, TRWi is computed as follows:

TRWi(1) = log$\displaystyle {\frac{{(r+0.5)/(R-r+0.5)}}{{(n-r+0.5)/(N-n-R+r+0.5)}}}$ (6.6)

where
N
is the number of documents in the collection
n
is the number of documents containing the term
R
is the number of relevant documents
r
is the number of relevant documents containing the ith term.
When R and r are not available, TRWi can be further reduced to

TRWi(2) = log$\displaystyle {\frac{{N-n+0.5}}{{n+0.5}}}$ (6.7)

which is similar to the inverse document frequency in vector space model (see Equation (6.3)), whose dfi corresponds to n).

In Equation (6.6), only the presence and absence of terms (the binary value model) are considered. To take the term frequency within-document (tf) and term frequency within-query (qtf) into consideration, a more refined formula is proposed by Robertson et al. to estimate the probability of relevance between document Di and query Q [Robertson et al., 1995]

P(R| Q, Dj) = TRWi(2)$\displaystyle {\frac{{(k_1+1) \times tf_{i,j}}}{{K+tf_{i,j}}}}$×$\displaystyle {\frac{{(k_3+1) \times qtf_{i}}}{{k_3+qtf_{i}}}}$ (6.8)

where
K
is k1×((1 - b) + b×dlj/avdl )
k1,k3,b
are parameters depending on the nature of the queries and the collection of the data. In CodeBroker, k1 is set to 1.2, k3 to 1.0, and b to 0.75, according to the data in [Walker et al., 1998].
dlj
is the length of document j
avdl
is the average length of all documents.

CodeBroker uses Equation (6.8) to calculate the relevance probability of a document to a given query mainly because it can reuse the source code that is available through the distribution of the Remembrance Agent system [Rhodes and Starner, 1996], and the system Okapi in which the equation has been implemented has achieved retrieval performance comparable to other leading research prototypes of information retrieval systems [Robertson et al., 1995,Walker et al., 1998]. For a more detailed explanation of why Equation (6.8) provides an estimation of TRWi, please see [Robertson and Walker, 1994]. For the sake of brevity, Okapi will be used to refer to the probabilistic model used in CodeBroker.


6.1.1.3 Latent Semantic Analysis

LSA is an extension of the vector space model. The vector space model assumes that terms are independent from each other and does not take their semantics into consideration; therefore, it suffers from the concept-based retrieval problem (also known as vocabulary mismatch, discussed in Section 3.4.3): If programmers use terms different from those used in the descriptions of components, they cannot find what they want. By constructing a large semantic space of terms to capture the overall pattern of their associative relationship, LSA is expected to facilitate concept-based retrieval and bridge the conceptual gap in formulating reuse queries.

The indexing process of LSA starts with creating a semantic space with a large corpus of training documents in a specific domain. It first creates a vector for each document in the corpus in the same way that the vector space model does. All vectors for documents of the corpus compose a large term-by-document matrix X.

X = $\displaystyle \begin{pmatrix}
w_{1,1} & w_{1,2} & \ldots & w_{1,M} \\
w_{2...
...} \\
\hdotsfor{4} \\
w_{N,1} & w_{N,2} & \ldots & w_{N,M}
\end{pmatrix}$

where the columns of the matrix represent the documents in the collection (M documents in the corpus), and the rows represent the terms (N terms in the corpus). The term-by-document matrix X is then decomposed, by means of singular value decomposition, into the product of three matrices: T0, S0, and D0'.

X = $\displaystyle \begin{pmatrix}
t_{1,1}^{(0)} & t_{1,2}^{(0)} & \ldots & t_{1,r...
...
t_{N,1}^{(0)} & t_{N,2}^{(0)} & \ldots & t_{N,r}^{(0)} \\
\end{pmatrix}$×$\displaystyle \begin{pmatrix}
s_{1,1} & 0 & \ldots & 0 \\
0 & s_{2,2} & \ldots & 0 \\
\hdotsfor{4} \\
0 & 0 & \ldots & s_{r,r} \\
\end{pmatrix}$×$\displaystyle \begin{pmatrix}
d_{1,1}^{(0)} & d_{1,2}^{(0)} & \ldots & d_{1,M...
...
d_{r,1}^{(0)} & d_{r,2}^{(0)} & \ldots & d_{r,M}^{(0)} \\
\end{pmatrix}$

T0 is an orthogonal matrix having the left singular vectors of X, D0' is also an orthogonal matrix having the right singular vectors of X, and the diagonal matrix S0 is the singular value matrix whose rank is r, the smaller number of N and M. sj, j(1 $ \leq$ j $ \leq$ r) are singular values, and they appear in decreasing order along the diagonal of the matrix S0, namely, s1, 1 > s2, 2 > ... > sr, r.

The hypothesis behind LSA holds that because of synonymy and polysemy in natural languages, there is much noise in the matrix X, and if the rank of the singular value matrix S0 is reduced--by getting rid of less significant singular values--to a much smaller number k to obtain another singular value matrix S, the noise is reduced too. The value of k often ranges from 40 to 400, but the best value of k still remains an open question and needs to be empirically determined. The T0 matrix with the size of N×r is reduced to T with the size of N×k, and D0' with the size of r×M is reduced to D' with the size of k×M.

A new matrix $ \hat{{X}}$, viewed as the semantic space of the domain represented by the corpus, is constructed through the production of the three reduced matrices: T, S, and D'.

$\displaystyle \hat{{X}}$ = $\displaystyle \begin{pmatrix}
t_{1,1} & t_{1,2} & \ldots & t_{1,k} \\
t_{2...
...
\hdotsfor{4} \\
t_{N,1} & t_{N,2} & \ldots & t_{N,k} \\
\end{pmatrix}$×$\displaystyle \begin{pmatrix}
s_{1,1} & 0 & \ldots & 0 \\
0 & s_{2,2} & \ldots & 0 \\
\hdotsfor{4} \\
0 & 0 & \ldots & s_{k,k} \\
\end{pmatrix}$×$\displaystyle \begin{pmatrix}
d_{1,1} & d_{1,2} & \ldots & d_{1,M} \\
d_{2...
...
\hdotsfor{4} \\
d_{k,1} & d_{k,2} & \ldots & d_{k,M} \\
\end{pmatrix}$

In this new matrix $ \hat{{X}}$, each row represents the position of each term in the semantic space. Terms are re-represented in the newly created semantic space. The reduction of singular values is important because it captures only the major, overall pattern of associative relationships among terms by ignoring the noises accompanying most automatic thesaurus construction simply based on co-occurrence statistics of terms.

After the semantic space is created, each document is represented as a vector in the semantic space based on terms contained, and so is a query. The similarity of a query and a document is thus determined by the cosine of the two vectors as in Equation (6.4). A document matches a query if their similarity value is above a certain threshold value.

The corpus used by CodeBroker to create the LSA semantic space for Java programming comes from four sources: Linux on-line manuals, programming textbooks, the Java language specification and virtual machine specification, and Java class libraries (component repositories). These four types of documents are chosen because they cover the domain knowledge a Java programmer needs: knowledge about the computer and operating systems, which is covered by Linux manuals; knowledge about programming in general, which is covered by programming text books; knowledge about programming in Java, which is covered by the Java specifications; and knowledge about reusable components, which is covered by the Java class libraries. The corpus contains 78,475 documents and 10,988 different terms after common and extremely rare words are cut off. A word is considered as extremely rare if it appears in one document once. This is useful to remove those esoteric abbreviations that are common in Linux on-line manuals but not used elsewhere.


6.1.2 Signature Matching

Signature matching is the process of determining the compatibility of two components in terms of their signatures [Zaremski and Wing, 1995]. It is an indexing and retrieval mechanism based on type constraints of a module or a component (see Section 5.2.2).

Two signatures

      Sig1 : OutTypeExp1 <- InTypeExp1 
      Sig2 : OutTypeExp2 <- InTypeExp2
match if and only if InTypeExp1 is in structural conformance with InTypeExp2, and OutTypeExp1 is in structural conformance with OutTypeExp2. Two type expressions are structurally conformant if they are formed by applying the same type constructor to structurally conformant types.

This definition of signature matching is very restrictive because it misses components whose signature does not exactly match, but that are practically similar enough to be reusable after slight modification.

Partial signature matching relaxes the definition of structural conformance of types: A type is considered as conforming to its more generalized form or more specialized form. For procedural types, if there is a path from type T1 to type T2 in the type lattice, T1 is a generalized form of T2, and T2 is a specialized form of T1. For example, in most programming languages, integer is a specialized form of float; and float is a generalized form of integer. For object-oriented types, if T1 is a subclass of T2, T1 is a specialized form of T2, and T2 is a generalized form of T1.

The constraint compatibility value between two signatures is the production of the conformance value between their types. The type conformance value is 1.0 if two types are in structural conformance according to the definition of the programming language. It drops a certain percentage if one type conversion is needed or if there is an immediate inheritance relationship between them. The signature compatibility value is 1.0 if two signatures exactly match.

A class signature is composed of its data definition part and method definition part. Signature matching of two classes C1 and C2 requires that both their data definition parts match and their method definition parts match, respectively. Data definition parts are treated as record-types, and are compared according to their structural conformance.

The method definition parts of two classes C1 and C2 match if for each method m1 in class C1, there is a corresponding method m2 in class C2 such that m1 matches m2 according to signature matching for methods. Correspondence from m1 to m2 is decided based on the best match principle--that is, among all methods from C2, if m2 has the highest signature compatibility with m1, it is considered as the matching method of m1. The compatibility value of the method definition part is thus the average value of compatibility values existing between pairs of matching methods.

Because classes inherit data and methods from their parent classes, comparing two classes only is not enough if they do not inherit immediately from the same class. Inherited data definitions and methods must be taken into consideration. A common ancestor class is located first, and then all data definitions and method definitions in between the common ancestor and the compared classes should be added.

CodeBroker does not implement signature matching for classes due to the following two considerations. First, the primary goal of CodeBroker is to deliver reusable components before programmers start to implement the module. Unlike other object-oriented programming languages, such as C++, in which the declaration of class interfaces is usually separated from the implementation and is stored in a separate file such as a header file, the Java programming language dose not provide a mechanism to separate the declaration of class interfaces from implementations. It is not very common that Java programmers start to fill in the implementation of methods after they have finished the declaration of the class signature--the class variables and all method signatures. Therefore, in most cases, the class signature would not be available to CodeBroker for it to deliver components before implementation.

Second, signature matching alone is not very powerful in locating reusable components because the limited number of primary data types in a programming language such as Java leads to few variations of signatures. However, signature matching can play an important differential role when many components with similar concepts are retrieved, and programmers need to choose one that fits their type constraints. The main goal of using signature matching in CodeBroker is to make choosing components easier when there is a set of components intended for the same purpose but implemented for different data types. For example, a reasonably good component repository that contains random number generators needs a set of components that create random numbers of different types such as integers, floats and long integers. These components usually have same functionality descriptions, namely, same concepts, and the signature-matching process would help programmers identify the desired component immediately without too much browsing effort.


Ph.D. Dissertation by Yunwen Ye, April 20, 2001, Department of Computer Science, University of Colorado