Design Overview

Copyright (C) Tall Tree Software

DocJet is a modular system: its support for different programming languages and commenting conventions is built from a set of interchangeable parts. This modular design means that we can add support for new languages and commenting conventions easily and without risk to the overall product. It also means that you, the customer, can extend DocJet to meet needs particular to your organization.

This document describes how you can write your own modular components. Also, since many of the stock components are shipped in source code form, you can simply modify them to suit your needs. Either way, this document will help you on your way.

We will start with a discussion of ActiveX, the underlying medium to all this. Then we will launch into a general discussion of the roles and responsibilities of the various types of components. Subsequent discussions will go further into the details.

ActiveX and Performance

In DocJet version 4, all of the front-end modules for DocJet are ActiveX (aka COM) APIs. This is a departure from previous practice, where the interfaces were just simple C functions exported from DLLs. By choosing the ActiveX approach, we gained flexibility on two fronts: first, it is now possible to write DocJet plug-ins in Visual Basic and Java; second, it allows a much richer interface between the modules and DocJet.

All of the DocJet APIs are designed for maximum performance. DocJet has a rather peculiar relationship with performance issues. Some customers run DocJet over small source bases, and DocJet would run reasonably fast for them even if it were not designed all that well. Some customers, however, have really large source bases, and any performance defects would rapidly show themselves in that case.

As a result, all of the stock modules are written in C++, except for a few parts that only run in the GUI, which were written in Visual Basic. If you are writing a plug-in for your own purposes and your source base is small, then you might go ahead and write it in Visual Basic and never notice a problem.

But many aspects of performance are independent of language. It will always be faster to pass an integer than it is to pass a string. As a result, many of the interfaces have functions that pass offsets into strings, rather than the sub-string itself.
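
To illustrate the offsets idea, here is a minimal C++ sketch. The names Extent and FindWords are hypothetical and are not part of the DocJet API; the sketch only shows the general technique of reporting positions in a buffer the caller already owns instead of handing back sub-strings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical extent type (not a DocJet type): a half-open range
// [begin, end) of offsets into a buffer that the caller already owns.
struct Extent {
    std::size_t begin;
    std::size_t end;
};

// Reports each space-delimited word as a pair of offsets, so no per-word
// string is allocated or copied.
std::vector<Extent> FindWords(const std::string &text) {
    std::vector<Extent> words;
    std::size_t i = 0;
    while (i < text.size()) {
        while (i < text.size() && text[i] == ' ') ++i;   // skip spaces
        std::size_t start = i;
        while (i < text.size() && text[i] != ' ') ++i;   // consume one word
        if (i > start) words.push_back(Extent{start, i});
    }
    return words;
}
```

A caller that really does need the text can still build it on demand with text.substr(e.begin, e.end - e.begin); the point is that the cost is paid only when it is needed.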

Another impact that performance has had on the module design is that we still have one module interface that does not work via ActiveX interfaces. This is the system for creating extensions to the DocJet output format command language. The current system does not really deliver on the performance front: the existing interface definition requires at least one malloc/strcpy/free on each call. Each of these is fast, but in a large system a single function can be called billions of times. The stock functions currently use a different, internal interface and do much better than that.

The guiding principle behind the effort to revamp that particular interface was that the new interface had to be used by the internal functions as well as user-written ones. In the end, a system that met that requirement and ran at least as fast as the current system never materialized, so the whole idea was scrapped and the current scheme limps on.

So, this paper really only talks about the front side of DocJet's operation -- identifying source files and parsing them and their comments.

Separation of Responsibilities

The front side of DocJet's execution identifies all the objects and parses their comments. The steps are outlined below:

  1. Generate a list of the source files that we will be using

  2. Scan those source files to identify the set of objects we will generate documentation from

  3. Separate the meat of the comment from the characters used to delineate the comment

  4. Apply any changes needed to deal with local commenting customs

  5. Find any discovery-time directives

  6. Break each comment up into sections

  7. Break each comment section up into paragraphs

  8. Find any character-level markup

Each of these steps requires a small set of interfaces. Some of these interfaces are implemented by the pluggable module, and some by DocJet. The rest of this document goes over these steps in order.

Common Techniques

In this chapter we will talk about some of the common paradigms in use in the module interfaces.

Callbacks

Just about all of the interfaces follow a scheme whereby the generator calls a “Scanner” interface, and the scanner is required to send its results to a “ReportTo” interface defined by the generator. A common API would be:

interface IScanner {   // Implemented by the plug-ins
        HRESULT Scan( BSTR thingToScan, IReportTo *findings );
};

interface IReportTo {  // Implemented by the generator
        HRESULT Result( BSTR whatWasFound );
};
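
The same shape can be sketched in plain C++ without any COM machinery. This is only an illustration: std::string stands in for BSTR, void stands in for HRESULT, and TagScanner and CollectingReport are hypothetical classes invented for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Implemented by the generator: receives each finding.
class IReportTo {
public:
    virtual ~IReportTo() {}
    virtual void Result(const std::string &whatWasFound) = 0;
};

// Implemented by the plug-ins: scans the input and reports what it finds.
class IScanner {
public:
    virtual ~IScanner() {}
    virtual void Scan(const std::string &thingToScan, IReportTo &findings) = 0;
};

// Hypothetical plug-in: reports every space-delimited word starting with '@'.
class TagScanner : public IScanner {
public:
    void Scan(const std::string &thingToScan, IReportTo &findings) override {
        std::size_t i = 0;
        while (i < thingToScan.size()) {
            std::size_t end = thingToScan.find(' ', i);
            if (end == std::string::npos) end = thingToScan.size();
            if (end > i && thingToScan[i] == '@')
                findings.Result(thingToScan.substr(i, end - i));
            i = end + 1;
        }
    }
};

// Hypothetical generator-side sink: collects findings in the order reported.
class CollectingReport : public IReportTo {
public:
    std::vector<std::string> found;
    void Result(const std::string &whatWasFound) override {
        found.push_back(whatWasFound);
    }
};
```

The scanner never decides what happens to a finding; it only reports, which keeps the policy on the generator's side of the interface.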

Iterators and With

Another variation on the theme of callbacks that we use a lot is an iterator. An iterator is an object that is defined by the caller. You call a function defined by DocJet, and it calls your “iterator” object several times with different data and combines the results for you. For example, to iterate over all the objects collected so far, the module would call an interface like this:

interface IFooCollection {
        HRESULT WithEachObject( IWithFoo * );
};

The module would have to implement the IWithFoo interface, which would be fairly simple:

interface IWithFoo {
        HRESULT DoIt( IFoo *o );
};
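
As a rough C++ sketch of this pattern, with a plain struct standing in for IFoo and with FooCollection and CountLongNames as hypothetical names made up for the example:

```cpp
#include <string>
#include <utility>
#include <vector>

// Stand-in for IFoo; the real interface is a COM object.
struct Foo {
    std::string name;
};

// Implemented by the module: called once per object.
class IWithFoo {
public:
    virtual ~IWithFoo() {}
    virtual void DoIt(const Foo &o) = 0;
};

// Stand-in for the DocJet-side collection that drives the iteration.
class FooCollection {
public:
    explicit FooCollection(std::vector<Foo> items) : items_(std::move(items)) {}

    // Calls the module's iterator object once for each stored object.
    void WithEachObject(IWithFoo &visitor) const {
        for (const Foo &f : items_) visitor.DoIt(f);
    }

private:
    std::vector<Foo> items_;
};

// Hypothetical module-side iterator: counts objects with long names.
class CountLongNames : public IWithFoo {
public:
    int count = 0;
    void DoIt(const Foo &o) override {
        if (o.name.size() > 4) ++count;
    }
};
```

The module keeps whatever state it needs (here, a counter) inside its own iterator object, so the collection does not need to know what is being computed.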

We use the “iterator” model to solve a performance problem. Normally, if you wanted to access an attribute of an object you would do it just like this:

interface IFoo {
        [propget]
        HRESULT Attr( BSTR *o );
};

The performance problem that comes into play here results from the semantics of BSTR in calls. To implement the Attr method, the object must call SysAllocString, which does a malloc and a copy. Again, as an individual operation, the performance of these calls is just fine. When you call them a few hundred thousand times, it's not so good. The particularly vexing part about this is that the copy is totally unnecessary if you don't plan on modifying the string, and that covers almost all cases.

For that case, you can implement another interface:

interface IWithAttribute {
        HRESULT DoIt( BSTR attr );
};

Then call it with this:

interface IFoo {
        HRESULT WithAttribute( IWithAttribute *w );
};
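
The difference between the two access styles can be sketched in plain C++. Here std::string stands in for BSTR, and the Foo and MeasureLength classes are hypothetical; the sketch only contrasts a getter that returns a copy with a callback that borrows the stored string for the duration of the call.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Implemented by the module: receives the attribute without owning it.
class IWithAttribute {
public:
    virtual ~IWithAttribute() {}
    virtual void DoIt(const std::string &attr) = 0;
};

// Stand-in for an object exposing one string attribute.
class Foo {
public:
    explicit Foo(std::string attr) : attr_(std::move(attr)) {}

    // Getter style, analogous to the [propget] form: the caller receives
    // its own copy, which costs an allocation and a copy per call.
    std::string Attr() const { return attr_; }

    // Callback style: lends the stored string to the callback by reference,
    // so no copy is made.
    void WithAttribute(IWithAttribute &w) const { w.DoIt(attr_); }

private:
    std::string attr_;
};

// Hypothetical read-only consumer: only needs to look at the string.
class MeasureLength : public IWithAttribute {
public:
    std::size_t length = 0;
    void DoIt(const std::string &attr) override { length = attr.size(); }
};
```

The borrowed string is only valid while DoIt is running, so a callback that needs to keep the value must copy it explicitly.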

Markup Function Objects

When we said that all of the interfaces we are discussing in this document are “front-end”, we implied that they are not involved in the production of the actual output. This is not quite so. The parsing functions are all done “up front”, before any output is really contemplated, but at the end of the day, somebody has to transform “*bold*” to bold.

The callbacks for DocJet's scanning functions all take arguments consisting of a range of text that the scanner thinks indicates a markup sequence and a “Markup Function Object”. That object implements a function that is called during the output-generation phase to actually produce the text.

These Markup Function Objects implement an interface that supports only one function. That function usually does not do any of the actual output generation itself, but instead just organizes a call to a function in the Output Format. So in the end you can just think of a Markup Function Object as a coupling between the scanner and its friends within the output format.
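
A minimal C++ sketch of this coupling follows. Every name in it (IOutputFormat, IMarkupFunction, Generate, BoldMarkup, HtmlFormat) is hypothetical and chosen only for illustration; the actual DocJet interfaces may look quite different.

```cpp
#include <string>

// Hypothetical output format: knows how bold text is actually written.
class IOutputFormat {
public:
    virtual ~IOutputFormat() {}
    virtual std::string Bold(const std::string &text) = 0;
};

// Hypothetical single-function interface for a markup function object.
class IMarkupFunction {
public:
    virtual ~IMarkupFunction() {}
    virtual std::string Generate(const std::string &markedUpText,
                                 IOutputFormat &format) = 0;
};

// Paired with a scanner that recognizes *bold*. It does no output generation
// itself; it strips the delimiters and organizes a call into the format.
class BoldMarkup : public IMarkupFunction {
public:
    std::string Generate(const std::string &markedUpText,
                         IOutputFormat &format) override {
        // Assumes markedUpText is the full extent the scanner reported,
        // including both asterisks, e.g. "*bold*".
        std::string inner = markedUpText.size() >= 2
            ? markedUpText.substr(1, markedUpText.size() - 2)
            : std::string();
        return format.Bold(inner);
    }
};

// Hypothetical format used only to make the sketch concrete.
class HtmlFormat : public IOutputFormat {
public:
    std::string Bold(const std::string &text) override {
        return "<b>" + text + "</b>";
    }
};
```

Because the markup function object only forwards to the format, the same scanner can serve any output format that supplies its own Bold.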

Ambiguity Resolution

One other problem we frequently have to contend with is ambiguity resolution: what if two scanners look at the same body of text and each has its own idea of what to make of it?

The mechanism we use to deal with that is known as a “confidence”. Whenever a scanner makes a report about a finding, in addition to the extent and the markup function object, it also passes a “confidence” value. This value is used to break ties between scanners.

The value of the “confidence” increases with how many clues we have that we're right. For instance, with a directive, we have both the curly braces that enclose the tract and the directive name, which ought to match a known directive. That would be a pretty firm indication that we have things right. The Preformatted Paragraph scanner has far less to go on: it just looks for indented paragraphs. In a situation like this:

//   o  A bullet
//
//      o  An indented bullet
//
//         A second paragraph of
//         that indented bullet

The “Second paragraph” could be taken as preformatted, although that really would not be right, as it's only indented to suit the needs of the surrounding bullet. The bullet, which is more specifically identified (bullet character at the start of a line that is the first line of a paragraph), takes precedence.

Still, this is not a very good system, although it's hard to imagine a better one. The best plan is to make sure that your scanner is as selective as it possibly can be.

Values used for confidence are entirely arbitrary, but some #defines are given in the IDL:

#define CONFIDENCE_FITS_VERY_BROAD_PATTERN 10
#define CONFIDENCE_FITS_BROAD_PATTERN 15
#define CONFIDENCE_FITS_GENERAL_PATTERN 25
#define CONFIDENCE_FITS_SPECIFIC_PATTERN 50
#define CONFIDENCE_FITS_VERY_SPECIFIC_PATTERN 100

We consider a rule like the preformatted paragraph scanner to be one that recognizes a “general” pattern because it runs on one rather vague hint. The bullet scanner rates as “specific”, because it has more clues to work with. The only stock markup that qualifies as “broad” is the normal paragraph scanner.
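
One way to picture how a confidence value can break a tie is the following C++ sketch. It is an assumption made for illustration, not a description of DocJet's actual resolution logic: the Finding struct, Overlaps, and ResolveAmbiguity are hypothetical, and only the two constant values are taken from the list above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Values copied from the #defines above.
const int CONFIDENCE_FITS_GENERAL_PATTERN = 25;
const int CONFIDENCE_FITS_SPECIFIC_PATTERN = 50;

// Hypothetical record of one scanner report.
struct Finding {
    std::size_t begin;      // start of the extent (inclusive)
    std::size_t end;        // end of the extent (exclusive)
    int confidence;
    std::string scanner;    // which scanner reported it
};

// True when the two extents share at least one character.
bool Overlaps(const Finding &a, const Finding &b) {
    return a.begin < b.end && b.begin < a.end;
}

// Assumed policy: drop a finding when an overlapping finding has strictly
// higher confidence. Findings with equal confidence both survive.
std::vector<Finding> ResolveAmbiguity(const std::vector<Finding> &all) {
    std::vector<Finding> kept;
    for (std::size_t i = 0; i < all.size(); ++i) {
        bool beaten = false;
        for (std::size_t j = 0; j < all.size(); ++j) {
            if (i != j && Overlaps(all[i], all[j]) &&
                all[j].confidence > all[i].confidence) {
                beaten = true;
            }
        }
        if (!beaten) kept.push_back(all[i]);
    }
    return kept;
}
```

Under this policy the indented second paragraph from the bullet example would be claimed by the bullet scanner, since its “specific” report outranks the preformatted scanner's “general” one over the same text.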