Javadoc generates documentation, in HTML format, for Java programs by parsing the source files. In the HTML documentation, each Java class has its own HTML-formatted document file, which is cross-linked to the document files of its super-classes and sub-classes. The contents of a class document describe the functionality of the class and all of its methods. Those descriptions are extracted from doc comments associated with each class and method. An example of a Javadoc document is shown in Figure 6.3. Documentation for Java components distributed with JDK (Java Development Kit) by Sun Microsystems, Inc. is also generated by Javadoc. Other component developers create documentation for their components in the same fashion.
CodeIndexer creates indexes for Java components in two steps. First, it extracts needed information for indexing from Javadoc documents and converts it into the CodeBroker indexing format that can be processed by the indexing program. Each method of a class is treated as a document to be independently indexed, although in Javadoc documentation, all method descriptions of a class appear as one physical file. Five types of information are extracted for the purpose of indexing a method component: the full class name (including the package name and class name); the HTML tag name which specifies the exact location of the method in the Javadoc document; the method name; the signature; and the description of the method included in the doc comment for the method (Figure 6.4). Doc comments of Java may use special tags, which begin with the @ character and allow Javadoc to provide additional formatting for the documentation. For example, some doc comments may include @author to specify the author of the component, or @see to specify a link to related methods or classes. These tags could provide additional indexing information to narrow the range of components to be located. For instance, a programmer may be interested in components written by a specific author only. However, the current version of CodeBroker does not support this, and all special tags, along with their contents, are removed.
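The tag-removal part of this first step can be sketched as follows. This is only an illustration, not CodeIndexer's actual code: the class name `DocCommentCleaner` and the method `description` are hypothetical, and the sketch works on a raw doc comment from source code, whereas CodeIndexer reads the generated Javadoc HTML. It relies on the Javadoc convention that block tags such as @author and @see start a line and follow the main description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keeps the main description of a doc comment and
// drops the block-tag section (@author, @see, ...) together with its contents.
final class DocCommentCleaner {

    static String description(String rawComment) {
        String body = rawComment.trim();
        // Remove the comment delimiters if they are present.
        if (body.startsWith("/**")) {
            body = body.substring(3);
        }
        if (body.endsWith("*/")) {
            body = body.substring(0, body.length() - 2);
        }

        List<String> kept = new ArrayList<>();
        for (String line : body.split("\\R")) {
            // Drop leading whitespace and the decorative asterisk.
            String text = line.trim();
            if (text.startsWith("*")) {
                text = text.substring(1).trim();
            }
            // The first line that starts with '@' opens the tag section;
            // everything from there on is discarded.
            if (text.startsWith("@")) {
                break;
            }
            if (!text.isEmpty()) {
                kept.add(text);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String comment = String.join("\n",
            "/**",
            " * Returns the integer value stored under the given key.",
            " *",
            " * @author J. Doe",
            " * @see #getLong",
            " */");
        System.out.println(description(comment));
    }
}
```

Inline tags such as {@link ...} are left untouched by this sketch, because the text above only describes the removal of tags that begin with the @ character.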
The second step of CodeIndexer creates, from the CodeBroker indexing documents in the format of Figure 6.4, three index files: the probabilistic model index file (or Okapi index file, for short), the LSA index file, and the signature index file. The Okapi index file and LSA index file contain the concept indexes of components, and the signature index file contains the signature indexes.
The Okapi index for a component consists of the terms that appear in its doc comment, together with their frequencies. A term is the stemmed form of an English word that is not on the stop list.
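A minimal sketch of how such a term-frequency index could be built is given below. The stop list and the stemmer are placeholders chosen for illustration: the real stop list and stemming algorithm used by CodeBroker are not specified here, and the names `TermFrequencyIndex`, `stem`, and `index` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration of a term/frequency index for one doc comment.
final class TermFrequencyIndex {

    // Tiny illustrative stop list; a real one would be much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "in", "is", "of", "the", "to"));

    // Placeholder stemmer: it only strips a trailing plural "s".
    // A real indexer would use a proper stemming algorithm instead.
    static String stem(String word) {
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Maps each stemmed, non-stop-list term to its number of occurrences.
    static Map<String, Integer> index(String docComment) {
        Map<String, Integer> frequencies = new TreeMap<>();
        String lower = docComment.toLowerCase(Locale.ROOT);
        for (String token : lower.split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            frequencies.merge(stem(token), 1, Integer::sum);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        System.out.println(index("Returns the number of elements in the list of elements."));
    }
}
```

A sorted map is used only so that the printed index is easy to read; the order of terms carries no meaning.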
The LSA index for a component is a float vector of length k, calculated by the following equations:
\[ \mathit{LSAVector} = (v_1, v_2, \ldots, v_k) \tag{6.9} \]
\[ v_i = (\ \ldots \tag{6.10} \]
The signature index for a component is in the following format:
5617 getInt : int <- int x int
where the leftmost number is the identifier number assigned to each component. The string following the number is the component name (getInt), the string following the colon : is the returned type (int), and the string following the left arrow (<-) specifies the input type(s) (int x int).
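The format above can be read back with a few lines of code. The following is a sketch under assumptions, not part of CodeBroker: the class `SignatureEntry` and its fields are hypothetical names, and treating an empty right-hand side as "no input types" is a guess, since the text does not say how methods without parameters are written.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class for one line of the signature index.
final class SignatureEntry {

    // <id> <name> : <return type> <- <input type> x <input type> ...
    private static final Pattern LINE =
        Pattern.compile("^(\\d+)\\s+(\\S+)\\s*:\\s*(.*?)\\s*<-\\s*(.*)$");

    final int id;
    final String name;
    final String returnType;
    final List<String> inputTypes;

    private SignatureEntry(int id, String name, String returnType, List<String> inputTypes) {
        this.id = id;
        this.name = name;
        this.returnType = returnType;
        this.inputTypes = Collections.unmodifiableList(inputTypes);
    }

    static SignatureEntry parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a signature index line: " + line);
        }
        List<String> inputs = new ArrayList<>();
        String inputPart = m.group(4).trim();
        // Assumption: an empty right-hand side means the method takes no inputs.
        if (!inputPart.isEmpty()) {
            // Input types are separated by a lone "x" surrounded by whitespace.
            for (String type : inputPart.split("\\s+x\\s+")) {
                inputs.add(type);
            }
        }
        return new SignatureEntry(Integer.parseInt(m.group(1)), m.group(2), m.group(3), inputs);
    }

    public static void main(String[] args) {
        SignatureEntry e = parse("5617 getInt : int <- int x int");
        System.out.println(e.id + " " + e.name + " returns " + e.returnType
            + " from " + e.inputTypes);
    }
}
```

Splitting on an "x" that has whitespace on both sides keeps type names that merely contain the letter x, such as java.lang.Exception, intact.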
To speed up the locating process and to reduce the size of the index files, all three index files are encoded and stored as a database file.
The indexing mechanism can easily create a component repository from any Java source programs. However, components scavenged from ordinary programs pose challenges for reuse beyond the locating problem, such as low-quality documentation and code. Because the focus of this research is to help programmers discover reusable components, the current version of CodeBroker includes the Java 1.1.8 Core API library and JGL 3.0 (Java General Library from ObjectSpace Inc.), both of which are of high quality and well documented. Java 1.1.8 and JGL contain 503 and 170 classes, respectively, for a total of 7,338 method components.