Free-text documents are indexed by terms. Terms can be controlled or un-controlled. In the controlled-term approach, an indexer is responsible for choosing the appropriate terms to index the document. Those controlled terms are also known as keywords. Because this process is often manually conducted, it is very time-consuming when the collection of documents gets large. Another problem with controlled terms is that people often choose different terms to describe the same document [Furnas et al., 1987,Harman, 1995], and even the same person may not be consistent in choosing terms.
The other approach of choosing terms is to automatically extract them from documents. Terms can be precisely the words appearing in documents. However, most free-text indexing systems use one or both of the following two techniques: stemming and stop list. Stemming reduces words to their morphological root forms. For example, computer, computing, compute, computation, and computational are all reduced to the form comput and all five words are represented by the term comput. The advantage of stemming is to allow users to find documents that contain morphological variations of the word in their queries. A stop list is simply a list of high-frequency words that are not used as terms. Words that appear in almost every document are not very useful in distinguishing one document from other documents in terms of relevance to user queries.
The two primary models for the indexing and retrieving free-text documents are: the vector space model and the probabilistic model.
The variation of the vector space model comes from the method of determining the value for each wi, j. The simplest binary value model sets wi, j to 1, if the term i is present in the document j, and to 0, if it is not. The binary value model does not reflect the fact that some terms appear more in a document and thus contribute more to the concept of the document. Term frequency (tfi, j), which is the number of occurrences of term i in document j, can be used as the value of wi, j. Using term frequency favors longer documents because most longer documents tend to use the same word more often, as the verbosity hypothesis [Robertson and Walker, 1994] states. To level the ground, tfi, j can be normalized by being divided by the overall length of the document vector.
The term frequency model, normalized or not, treats all terms equally. However, terms that are limited to a few documents are more useful for discriminating documents from the rest of the corpus than terms that occur frequently across the entire corpus. To reflect this discrimination power of a term i, wi, j can be multiplied by its inverse document frequency (idfi). The idf for each term is defined as follows:
Queries submitted by users are also in free-text form and are represented as a query vector in the same way as document vectors:
The relevance of a document to a query is determined by comparing the document vector against the query vector. It is common to use the cosine of the angle between two vectors as the criteria to judge the similarity (sim) of a document and a query:
The basis for all probabilistic models is the probability ranking principle, which asserts that optimal retrieval performance can be achieved when documents are ranked according to their probabilities of being judged relevant to a query [Robertson, 1977]. Given a query Q, the main task of retrieval systems based on the probabilistic model is to compute the relevance probability P(R| Q, Dj) for each query-document pair.
The relevance probability of a document can be estimated by assigning an appropriate weight to each term in the document corpus. Probabilistic models assume that terms are distributed differently in relevant and irrelevant documents, which is known as the cluster hypothesis [Van Rijsbergen, 1979]. If a term appears more frequently in relevant than in irrelevant documents, it has more power to discriminate relevant from irrelevant documents. The discriminating power of a term is called its term relevance weight (TRW) in probabilistic models, and its value is calculated by the following formula:
A simplified formula, proposed by Croft and Harper [Croft and Harper, 1979], uses corpus information to make estimates and does not use the distribution probability (pi and qi). In their model, TRWi is computed as follows:
In Equation (6.6), only the presence and absence of terms (the binary value model) are considered. To take the term frequency within-document (tf) and term frequency within-query (qtf) into consideration, a more refined formula is proposed by Robertson et al. to estimate the probability of relevance between document Di and query Q [Robertson et al., 1995]
CodeBroker uses Equation (6.8) to calculate the relevance probability of a document to a given query mainly because it can reuse the source code that is available through the distribution of the Remembrance Agent system [Rhodes and Starner, 1996], and the system Okapi in which the equation has been implemented has achieved retrieval performance comparable to other leading research prototypes of information retrieval systems [Robertson et al., 1995,Walker et al., 1998]. For a more detailed explanation of why Equation (6.8) provides an estimation of TRWi, please see [Robertson and Walker, 1994]. For the sake of brevity, Okapi will be used to refer to the probabilistic model used in CodeBroker.
The indexing process of LSA starts with creating a semantic space with a large corpus of training documents in a specific domain. It first creates a vector for each document in the corpus in the same way that the vector space model does. All vectors for documents of the corpus compose a large term-by-document matrix X.
×
×
The hypothesis behind LSA holds that because of synonymy and polysemy in natural languages, there is much noise in the matrix X, and if the rank of the singular value matrix S0 is reduced--by getting rid of less significant singular values--to a much smaller number k to obtain another singular value matrix S, the noise is reduced too. The value of k often ranges from 40 to 400, but the best value of k still remains an open question and needs to be empirically determined. The T0 matrix with the size of N×r is reduced to T with the size of N×k, and D0' with the size of r×M is reduced to D' with the size of k×M.
A new matrix
, viewed as the semantic space of the domain
represented by the corpus,
is constructed through the production of the three reduced matrices:
T, S, and D'.
×
×
After the semantic space is created, each document is represented as a vector in the semantic space based on terms contained, and so is a query. The similarity of a query and a document is thus determined by the cosine of the two vectors as in Equation (6.4). A document matches a query if their similarity value is above a certain threshold value.
The corpus used by CodeBroker to create the LSA semantic space for Java programming comes from four sources: Linux on-line manuals, programming textbooks, the Java language specification and virtual machine specification, and Java class libraries (component repositories). These four types of documents are chosen because they cover the domain knowledge a Java programmer needs: knowledge about the computer and operating systems, which is covered by Linux manuals; knowledge about programming in general, which is covered by programming text books; knowledge about programming in Java, which is covered by the Java specifications; and knowledge about reusable components, which is covered by the Java class libraries. The corpus contains 78,475 documents and 10,988 different terms after common and extremely rare words are cut off. A word is considered as extremely rare if it appears in one document once. This is useful to remove those esoteric abbreviations that are common in Linux on-line manuals but not used elsewhere.
Two signatures
Sig1 : OutTypeExp1 <- InTypeExp1
Sig2 : OutTypeExp2 <- InTypeExp2
match if and only if InTypeExp1 is in structural conformance
with InTypeExp2, and OutTypeExp1 is in
structural conformance
with OutTypeExp2. Two type expressions are structurally conformant
if they are formed by applying the same type constructor to structurally
conformant types.
This definition of signature matching is very restrictive because it misses components whose signature does not exactly match, but that are practically similar enough to be reusable after slight modification.
Partial signature matching relaxes the definition of structural conformance of types: A type is considered as conforming to its more generalized form or more specialized form. For procedural types, if there is a path from type T1 to type T2 in the type lattice, T1 is a generalized form of T2, and T2 is a specialized form of T1. For example, in most programming languages, integer is a specialized form of float; and float is a generalized form of integer. For object-oriented types, if T1 is a subclass of T2, T1 is a specialized form of T2, and T2 is a generalized form of T1.
The constraint compatibility value between two signatures is the production of the conformance value between their types. The type conformance value is 1.0 if two types are in structural conformance according to the definition of the programming language. It drops a certain percentage if one type conversion is needed or if there is an immediate inheritance relationship between them. The signature compatibility value is 1.0 if two signatures exactly match.
A class signature is composed of its data definition part and method definition part. Signature matching of two classes C1 and C2 requires that both their data definition parts match and their method definition parts match, respectively. Data definition parts are treated as record-types, and are compared according to their structural conformance.
The method definition parts of two classes C1 and C2 match if for each method m1 in class C1, there is a corresponding method m2 in class C2 such that m1 matches m2 according to signature matching for methods. Correspondence from m1 to m2 is decided based on the best match principle--that is, among all methods from C2, if m2 has the highest signature compatibility with m1, it is considered as the matching method of m1. The compatibility value of the method definition part is thus the average value of compatibility values existing between pairs of matching methods.
Because classes inherit data and methods from their parent classes, comparing two classes only is not enough if they do not inherit immediately from the same class. Inherited data definitions and methods must be taken into consideration. A common ancestor class is located first, and then all data definitions and method definitions in between the common ancestor and the compared classes should be added.
CodeBroker does not implement signature matching for classes due to the following two considerations. First, the primary goal of CodeBroker is to deliver reusable components before programmers start to implement the module. Unlike other object-oriented programming languages, such as C++, in which the declaration of class interfaces is usually separated from the implementation and is stored in a separate file such as a header file, the Java programming language dose not provide a mechanism to separate the declaration of class interfaces from implementations. It is not very common that Java programmers start to fill in the implementation of methods after they have finished the declaration of the class signature--the class variables and all method signatures. Therefore, in most cases, the class signature would not be available to CodeBroker for it to deliver components before implementation.
Second, signature matching alone is not very powerful in locating reusable components because the limited number of primary data types in a programming language such as Java leads to few variations of signatures. However, signature matching can play an important differential role when many components with similar concepts are retrieved, and programmers need to choose one that fits their type constraints. The main goal of using signature matching in CodeBroker is to make choosing components easier when there is a set of components intended for the same purpose but implemented for different data types. For example, a reasonably good component repository that contains random number generators needs a set of components that create random numbers of different types such as integers, floats and long integers. These components usually have same functionality descriptions, namely, same concepts, and the signature-matching process would help programmers identify the desired component immediately without too much browsing effort.