6.2 Creating the Component Repository

The component repository in CodeBroker is created by its indexing subsystem, CodeIndexer. CodeIndexer extracts and indexes functional descriptions (concepts) and signatures (constraints) from the HTML-based online documentation generated by running Javadoc over Java source programs (Figure 6.2).

Figure 6.2: The process of creating a component repository from Java programs

\includegraphics[width=.9\linewidth]{figs/CreateRepository.eps}

Javadoc generates documentation, in HTML format, for Java programs by parsing the source files. In the HTML documentation, each Java class has its own HTML-formatted document file, which is cross-linked to the document files of its super-classes and sub-classes. The contents of a class document describe the functionality of the class and all of its methods. Those descriptions are extracted from doc comments associated with each class and method. An example of a Javadoc document is shown in Figure 6.3. Documentation for Java components distributed with JDK (Java Development Kit) by Sun Microsystems, Inc. is also generated by Javadoc. Other component developers create documentation for their components in the same fashion.

Figure 6.3: An example of a document generated by Javadoc

\includegraphics[width=.9\linewidth]{figs/javadoc.eps}

CodeIndexer creates indexes for Java components in two steps. First, it extracts the information needed for indexing from Javadoc documents and converts it into the CodeBroker indexing format, which can be processed by the indexing program. Each method of a class is treated as a document to be indexed independently, although in Javadoc documentation all method descriptions of a class appear in one physical file. Five types of information are extracted for indexing a method component: the full class name (including the package name and class name); the HTML tag name that specifies the exact location of the method in the Javadoc document; the method name; the signature; and the description of the method included in its doc comment (Figure 6.4).

Java doc comments may use special tags, which begin with the @ character and allow Javadoc to provide additional formatting for the documentation. For example, some doc comments include @author to specify the author of the component, or @see to specify a link to related methods or classes. These tags could provide additional indexing information to narrow the range of components to be located. For instance, a programmer may be interested only in components written by a specific author. However, the current version of CodeBroker does not support this, and all special tags, along with their contents, are removed.
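The removal of special tags can be illustrated with a minimal sketch. It is not CodeIndexer's actual code: the function name `strip_special_tags` is hypothetical, the comment text is assumed to have already been separated from its `/** ... */` delimiters, and only tags that begin a line are handled.

```python
import re

# Assumption: a special tag starts a line with '@' followed by a tag name.
_SPECIAL_TAG = re.compile(r"^\s*@\w+")

def strip_special_tags(doc_comment: str) -> str:
    """Keep the leading description; drop every special tag and its contents."""
    kept = []
    for line in doc_comment.splitlines():
        if _SPECIAL_TAG.match(line):
            # The description ends at the first special tag; the remaining
            # lines belong to tags and are discarded.
            break
        kept.append(line.strip())
    return " ".join(part for part in kept if part)

comment = """Returns the character at the specified index.
An index ranges from 0 to length() - 1.
@param index the index of the character
@see #length()"""

description = strip_special_tags(comment)
print(description)
```

Only the description survives, which is the text that ends up in the DEF field of the indexing format.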

Figure 6.4: The indexing format of method documents in CodeBroker

NEW METHOD:: 
CLS: java.lang.String 
TAG: length 
MET: length 
SIG: int length() 
DEF: Returns the length of this string. The length is equal
     to the number of 16-bit Unicode characters 
     in the string. 

NEW METHOD:: 
CLS: java.lang.String 
TAG: charAt 
MET: charAt 
SIG: char charAt(int index) 
DEF: Returns the character at the specified index. An index 
     ranges from 0 to length() - 1.


In CodeBroker, each method is ... SIG for the signature, and DEF for the description of the method.
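A reader of the format in Figure 6.4 might look like the following sketch. The function name `parse_method_documents` and the rule that an unlabeled line continues the previous field are assumptions made for illustration; the source does not describe how the indexing program reads the file.

```python
FIELD_LABELS = ("CLS:", "TAG:", "MET:", "SIG:", "DEF:")

def parse_method_documents(text):
    """Split Figure 6.4 style text into one dict per method document."""
    records = []
    current = None
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("NEW METHOD::"):
            current = {}
            records.append(current)
            last_key = None
        elif current is None:
            continue  # ignore anything before the first method marker
        elif line[:4] in FIELD_LABELS:
            last_key = line[:3]
            current[last_key] = line[4:].strip()
        elif last_key is not None:
            # Assumed: an unlabeled line continues the previous field.
            current[last_key] += " " + line
    return records

sample = """NEW METHOD::
CLS: java.lang.String
TAG: length
MET: length
SIG: int length()
DEF: Returns the length of this string. The length is equal
     to the number of 16-bit Unicode characters
     in the string.

NEW METHOD::
CLS: java.lang.String
TAG: charAt
MET: charAt
SIG: char charAt(int index)
DEF: Returns the character at the specified index. An index
     ranges from 0 to length() - 1.
"""

records = parse_method_documents(sample)
for record in records:
    print(record["CLS"], record["MET"], "|", record["SIG"])
```

Each resulting dict corresponds to one method treated as an independent document, even though both methods come from the same class file.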

The second step of CodeIndexer creates, from the CodeBroker indexing documents in the format of Figure 6.4, three index files: the probabilistic model index file (or Okapi index file, for short), the LSA index file, and the signature index file. The Okapi index file and LSA index file contain the concept indexes of components, and the signature index file contains the signature indexes.

The Okapi index for a component consists of the terms that appear in its doc comment, together with their frequencies. A term is the stemmed form of an English word that is not in the stop list.
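The term extraction can be sketched as follows. The stop list and the suffix-stripping `toy_stem` function are illustrative stand-ins chosen for this example; the source does not say which stop list or stemming algorithm CodeBroker uses.

```python
import re
from collections import Counter

# Assumption: a tiny stop list for illustration only.
STOP_LIST = {"a", "an", "at", "in", "is", "of", "the", "this", "to"}

def toy_stem(word):
    """Crude stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least four characters so that "string" is left intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def term_frequencies(description):
    """Map each stemmed, non-stop-list term to its frequency."""
    words = re.findall(r"[a-z]+", description.lower())
    return Counter(toy_stem(w) for w in words if w not in STOP_LIST)

tf = term_frequencies(
    "Returns the length of this string. The length is equal to the "
    "number of 16-bit Unicode characters in the string."
)
for term, count in sorted(tf.items()):
    print(term, count)
```

A production index would replace `toy_stem` with an established stemming algorithm; the shape of the result, a mapping from term to frequency, stays the same.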

The LSA index for a component is a vector of $k$ floating-point values, calculated by the following equations:

$LSAVector = (v_1, v_2, \ldots, v_k)$ (6.9)

$v_i = \left(\displaystyle \sum_{j=1}^{N} tf_j \times t_{j,i}\right) s_{i,i}^{-1}$ (6.10)

where
$k$
is the number of singular values in the pre-computed semantic space
$N$
is the number of terms in the semantic space
$tf_j$
is the frequency of term $j$ in the component
$t_{j,i}$
is the $i$-th element of the term vector of term $j$ in the pre-computed semantic space; terms that appear in the component but not in the corpus are discarded
$s_{i,i}$
is the $i$-th singular value.
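Equations 6.9 and 6.10 can be worked through on a toy semantic space. In the sketch below, the function name `lsa_vector`, the two-dimensional term vectors, and the singular values are made-up illustration data, not values from CodeBroker's semantic space.

```python
def lsa_vector(term_freqs, term_vectors, singular_values):
    """Compute (v_1, ..., v_k) following Equations 6.9 and 6.10.

    term_freqs:      term -> frequency in the component (tf_j)
    term_vectors:    term -> list of k floats (t_{j,1}, ..., t_{j,k})
    singular_values: list of k floats (s_{1,1}, ..., s_{k,k})
    """
    vector = []
    for i in range(len(singular_values)):
        # Terms absent from the semantic space are discarded.
        weighted_sum = sum(
            tf * term_vectors[term][i]
            for term, tf in term_freqs.items()
            if term in term_vectors
        )
        vector.append(weighted_sum / singular_values[i])
    return vector

# Hypothetical semantic space with k = 2 and two known terms.
term_vectors = {"length": [0.5, 0.25], "string": [0.25, -0.5]}
singular_values = [2.0, 0.5]

# "unicode" is not in the semantic space, so it contributes nothing.
term_freqs = {"length": 2, "string": 1, "unicode": 1}

vector = lsa_vector(term_freqs, term_vectors, singular_values)
print(vector)
```

For the first dimension the sum is 2 × 0.5 + 1 × 0.25 = 1.25, which divided by the singular value 2.0 gives 0.625; for the second, 2 × 0.25 + 1 × (−0.5) = 0, so the component is 0.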

The signature index for a component is in the following format:

        5617 getInt : int <- int x int
where the leftmost number is the identifier assigned to the component. The string following the number is the component name (getInt), the string following the colon (:) is the return type (int), and the string following the left arrow (<-) specifies the input type(s) (int x int).
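A line in this format can be taken apart with a short sketch such as the one below. The function name `parse_signature_entry` and the returned dictionary keys are hypothetical, and treating an empty right-hand side as "no input types" is an assumption, since the source shows only one example line.

```python
import re

# id, name, ':', return type, '<-', input types separated by ' x '
_SIGNATURE_LINE = re.compile(r"^\s*(\d+)\s+(\S+)\s*:\s*(.*?)\s*<-\s*(.*?)\s*$")

def parse_signature_entry(line):
    """Parse one signature index line into its four parts."""
    match = _SIGNATURE_LINE.match(line)
    if match is None:
        raise ValueError("not a signature index line: %r" % line)
    ident, name, return_type, inputs = match.groups()
    # Assumption: nothing after '<-' means the method takes no inputs.
    input_types = [t.strip() for t in inputs.split(" x ")] if inputs else []
    return {
        "id": int(ident),
        "name": name,
        "returns": return_type,
        "inputs": input_types,
    }

entry = parse_signature_entry("        5617 getInt : int <- int x int")
print(entry)
```

Keeping the input types as a list makes it straightforward to compare the number and kinds of parameters when matching signatures.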

To speed up the locating process and to reduce the size of indexing files, all three index files are encoded and stored as a database file.

The indexing mechanism can easily create a component repository from any Java source program. However, components scavenged from ordinary programs pose challenges for reuse beyond the locating problem, such as low-quality documentation and code. Because the focus of this research is to help programmers discover reusable components, the current version of CodeBroker includes the Java 1.1.8 Core API library and JGL 3.0 (Java General Library from ObjectSpace Inc.), both of which are of high quality and well documented. There are 503 and 170 classes in Java 1.1.8 and JGL, respectively, and a total of 7,338 method components.


Ph.D. Dissertation by Yunwen Ye, April 20, 2001, Department of Computer Science, University of Colorado