C++ as a Data Abstraction Language

by Doug Lea.

Classically, an abstract data type (ADT) is an idealization of a set of values, along with a set of functions and relations on these values that together constitute a type. Simple examples include Integers, Complexes, Sets, Lists, and so on.

Programming styles and practices in pure ADT-based languages (like ML) differ in many ways from those in pure OO languages. C++, not being very pure about anything, supports various mixtures. This is nice in some ways but confusing in other ways.

(Interestingly enough, most academic study of ADTs and programming languages that support them arose after the introduction of the first OOPL (Simula). Similarly, C++ support for ADT style programming was established several years after most of its basic OO support was in place -- see Stroustrup's Design and Evolution book.)

Pure ADT programming is value-oriented rather than object-oriented. The basic rules of value-oriented programming are that you are never allowed to care about (or in most ADT languages, even to discover) which object is being used to represent something; you only care about the symbolic meaning of the value itself. Named variables are considered to be bound to different values; and are not considered to be handles to objects. Because programs deal with computers, not mathematical abstractions, there must indeed be some object that maintains the representation of any given value used in a program. But again, in ADT style programming you are not supposed to know anything about this.

Consider, for example, Strings as values. If you have two Strings a and b, you might want to test them for equality. You don't care whether a and b are represented by different objects, you just want to know whether their meanings are the same. (Note of course that if they were both represented by the same object, then it is a dead giveaway that they maintain the same value.) Similarly, you might want to use the value representing the concatenation of a + b. Again, you don't care which object represents this value; in fact it might possibly be the one that until the concatenation, was used for a. (This might happen for example in an optimizing compiler when it notices that a value is no longer needed, so the associated object can be reused for different purposes.) And again similarly, suppose you wanted the reverse of a. You might invoke a function reverse(a) that returns a new value (again, probably represented by a new object).

In constrast, consider Strings as objects. You might have a constructor that initializes representations, reference-equality testing to check if two variables refer to the same object, an append method that causes an object to change its representation by adding characters, and a reverse method that causes an object to to represent the reverse of its old one.

The conceptual differences in these styles boil down to:

  1. In value-based approaches, variables denote values, while in object-based approaches, they normally denote references (pointers) to objects that in turn may represent values.
  2. All operations in value-based approaches are side-effect-free, consisting only of relations that inspect values, and pure functions that generate new ones without altering any other reachable object in the process of doing so. There are no mutative operations that alter representations.
  3. Identities in value-based approaches are opaque, or at least unexploitable -- you never know which object is maintaining which value so long as the intended semantics are correct.

Most OO languages (not just C++) actually support a mixture of these two styles. In fact, some procedural languages do too. For example, C contains both a pure functional version of the plus operator (as in a + 3) and a mutative object-based one (as in a += 3). Conceptually, the first one returns a new int value; the second modifies an existing int object. More generally, C differs from most other procedural languages in that it does allow you to know and exploit the identity of built-in object types like ints via the address-of operator.

While object-based rather than value-based techniques are clear winners in most contexts, there are some advantages to sometimes adopting a value-based style, especially for those types that you normally think of in value-based terms anyway. One asset is that clients never have to worry about the underlying objects, so can program in a simpler expression-based style. For example, it is pretty awkward to write the following just to obtain a String concatenation:

void f(String* a, String* b) {
  String* c = new String;
  c->append(a);
  c->append(b);
  // ...
}

Instead, many people would rather write something like:

void f(String a, String b) {
  String c = a + b;
  ...
}

There is no single right way to write classes to achieve these effects. In fact, the right effects cannot always be had in C++, and even approximating them sometimes enters dark corners of the language.

Designing ADT Classes in C++

Types

The first consideration is what kinds of types you want to single out for such treatment. These are normally types for which there already exist a common set of functional/expression based semantics. These in turn often form the bottommost level of support in a class library. For example BitSet, Set, Bag, String, Complex, Map, MultiprecisionInteger, and so on.

Because of their simplicity and familiarity, as well as the fact that you can probably find full code for them elsewhere, we will mainly illustrate with a Complex number class. As a first take, the basic structure might look like this:

class Complex {
public:
  Complex(float re = 0, float im = 0);
  float real() const; 
  float imag() const; 
  //...
};

Complex operator +  (Complex a, Complex b);
bool    operator == (Complex a, Complex b);

// ...

Note that, as in the standard ADT style, the functions (like +) and relations (like ==) aren't necessarily methods of Complex, they are just things you can do with Complex values in order to obtain new values. However, ``destructuring'' accessors (like real()) are intrinsically methods, although still pure side-effect free ones.

It may be the case that these associated operators require access to raw representations to work (or just to work efficiently), in which case you either declare them as friends, or alternatively use the in-class versions that treat the left-hand-side of any binary operator as the recipient. Usually the first choice is better, since it interacts a bit more nicely with C++ conversion rules.

For an ``arithmetic'' type like Complex, it is a good idea to step though all the possible C++ operators that might apply, and implement all of them that do. (For example, mutliplication does, but less-than does not.)

Parameterization

Many ADT-style classes lend themselves well to parameterization over base types, and thus to templates. For example, we could remove the reliance of Complex on the particular choice of floats for their base representation type by parameterizing:
template <class T>
class ComplexOf {
public:
  ComplexOf(T re = 0, T im = 0);
  T real() const; 
  T imag() const; 
  //...
};

template <class T>
  ComplexOf<T> operator + (ComplexOf<T> a, ComplexOf<T> b);

template <class T>
  bool    operator == (ComplexOf<T> a, ComplexOf<T> b);

// ...

This class could be instantiated for any type T that:

Representations

The second consideration is how you are going to represent things in order to obtain side-effect-free semantics. The basic rule here is that the internals of the class must be communication closed. No operation can have any side effect on any object that is not under the exclusive control of the object representing the abstract value. Thus, the representations must be private and self-contained. So, among other rules, you should never reveal pointers or references to any internal data fields or subobjects used to represent state.

Direct

The simplest alternative is for each object to carry around a direct representation of its value. For example, here, we could just embed two floats. As a check, some of the function mechanics are done out too:

class ComplexV1 {
  float re_;
  float im_;
public:
  ComplexV1(float re = 0, float im = 0) :re_(re), im_(im) {}
  float real() const  { return re_; }
  float imag() const; { return im_; }
};

ComplexV1 operator + (ComplexV1 a, ComplexV1 b) {
  return ComplexV1(a.real() + b.real, a.imag() + b.imag()) ;
}

bool operator == (ComplexV1 a, ComplexV1 b) {
  return a.real() == b.real() && a.imag() == b.imag();
}

// ...

We picked the simplest possible direct representation scheme here. But several others exist; for example those based on polar rather than cartesian coordinates.

Indirect

Sometimes you can't or don't want to strictly embed the representational objects, and instead must hold a pointer to them. In this case, all is still well so long as the host object maintains exclusive control, never ``leaking'' the identities of helper components. For example:

struct ComplexRep {
  float re_;
  float im_;
};

class ComplexV2 {
  ComplexRep* rep_;
public:
  ComplexV2(float re = 0, float im = 0) :rep(new ComplexRep) {
     rep_->re_ = re; rep_->im_ = im; 
  }
 ~ComplexV2() { delete rep_; }

  float real() const  { return rep_->re_; }
  float imag() const; { return rep_->im_; }
};

Shared Indirect

When representations are offloaded in this way, it is sometimes possible to share representation objects across multiple hosts, so long as they do not interfere with each others use. For a simple example, suppose we'd like all default-constructed Complexes to share a common representation object:

class ComplexV3 {
  static ComplexRep theComplexZero = { 0, 0 };
  ComplexRep* rep_;
public:
  ComplexV3() :rep_(&theComplexZero); 
  ...
};

But this is pretty dangerous; we'll do better below.

Passing Conventions

Copy-based

By far the simplest and most well-behaved strategy for separating values from objects is to use pure copy-based parameter passing conventions. Here, whenever a value is sent as an argument or return value, a new object is used to represent it. Among other virtues, called functions cannot modify their caller's representation objects by mistake. (Note that in our running Complex example, they can't anyway since we don't have any mutative operations yet. (Well, almost; as we'll see, the compiler probably built an assignment operator for us...))

All you need to write in order to support this is a copy-constructor, that intializes representations of a new object to be the same as that of the old object sent as an argument. The syntactic form of copy constructors is dictated by the language. For example, for our first represention option, it would look like:

class ComplexV1 {
  float re_;
  float im_;
public:
  Complex(const Complex& c) :re_(c.real()), im_(c.imag()) {}
  ...
};

And for the second:

class ComplexV2 {
  ComplexRep* rep_;
public:
  Complex(const Complex& c) : rep(new ComplexRep) {
     rep_->re_ = c.real(); rep_->im_ = c.image(); 
  }
  // ...
};

Note that to maintain communication-closure, the constructor made a new representation rather than sharing the one used by its argument. This is an example of deep copying. Copy constructors for ADT classes should always (recursively) copy all parts so that they can have their own new exclusive copies to work off.

One of the disadvantages of copy-based passing is that it can lead to copying in cases where you don't intuitively expect it. Consider for example a slight variant of plus:

Complex operator + (Complex a, Complex b) {
  Complex c (a.real() + b.real, a.imag() + b.imag()) ;
  return c;
}

And a sample use:

void user() {
  Complex x, y;
  // ...
  Complex w = x + y;
}

Here, the sequence of copy-constructor calls is:

  1. Copy x as a
  2. Copy y as b
  3. Construct c
  4. Copy c as return value of plus
  5. Copy return value as w

C++ Compilers are allowed to optimize away some of these copies, but are not always able to do so. One way to kill off one of these copies yourself is to use the style of returning a value in plus illustrated in the first version:

Complex operator + (Complex a, Complex b) {
  return Complex(a.real() + b.real, a.imag() + b.imag()) ;
}

But even with this, things get worse in chains of calls in expressions. For example, in:

void user2() {
  Complex x, y, z;
  // ...
  Complex w = x + y + z;
}

The sequence of constructor calls is:

  1. Copy x as a
  2. Copy y as b
  3. Construct return value of x+y
  4. Copy return value of x+y as a
  5. Copy z as b
  6. Construct return value of x+y+z
  7. Copy return value as w

These issues are sometimes called the temporary object problem in C++. They are a byproduct of the expression-oriented style supported by overloaded operators. When they lead to efficiency problems, usually the best alternative is not to use this style, but instead to use OO-style mutative versions (see below).

Reference-based

Sometimes copy-based passing introduces far too much overhead. While it would work OK for Complex, it wouldn't ordinarily make sense for things like Sets that tend to use a lot of space. (Note however that if you are sending ADT values over a network, you may at some level be making copies and packets holding copies, etc., anyway.)

Of course, the most common solution to this is to adopt a special form of reference-based passing that (given the cooperation of all programmers) preserves the intent of value-passing, but actually uses cheaper object-based pointer-passing conventions. The C++ argument mode that does this is passing by const reference. Functions that accept const-ref arguments may only invoke methods labeled as const, and/or send them to other functions accepting const-refs. When there are no non-const methods, and there is full communication closure, the differences between copying versus const-reference are fully transparent to all programmers. When there are non-const methods, it takes some further policies to be fully effective. For example, programmers have to refrain from ``casting away const''.

(Style note: References and pointers in C++ are nearly the same thing. Just about the only times I use const-ref instead of pointers are when I'm using this idiom of trying to obtain value semantics but passing as an object. It's not a bad convention.)

So, for example, we might alter the signature of plus to:

Complex operator + (const Complex& a, const Complex& b) {
  return Complex(a.real() + b.real, a.imag() + b.imag()) ;
}

This works pretty well for arguments, but doesn't work at all for return values of functions. You cannot return a function result by const-ref. For example, consider:

const Complex& operator + (const Complex& a, const Complex& b) {
  Complex c(a.real() + b.real, a.imag() + b.imag()) ;
  return c;
}

Do not do this! You can't return a reference to a local in C++. The stack-based memory allocation rules say that all locals disapear at function exit, which leads to a dangling reference. The alternative is dynamic allocation. For example:

const Complex& operator + (const Complex& a, const Complex& b) {
  Complex* c = new Complex(a.real() + b.real, a.imag() + b.imag()) ;
  return *c;
}

But the only way that you could ever live with this would be to employ some sort of garbage collection storage management scheme. Otherwise, every client ever obtaining a function result would have to figure out when it were no longer needed and then kill it. This is normally incredibly error prone and slower than just relying on a good garbage collector, assuming you can find or build one for C++ in the first place, which you most often cannot.

So, almost always you have a choice between using copy-based conventions for return values, or writing some kind of class-specific memory management support. In simple classes like Complex, by far the best alternative is to use copying. In most other cases, you find yourself using some kind of reference-counting storage management.

Reference Counting

Reference Counting is among the most common semi-automatic storage management schemes used in C++. As applied to ADTs, the basic idea is that each representation object maintains a count of how many other objects are using it. When the number is zero, the representation is deleted. The trick in making this work well is to do refcount maintenance in C++ constructors and destructors. For example here:
struct ComplexRep {
  float re_;
  float im_;
  int refcount_;
  ComplexRep(float re = 0, float im = 0, int rc = 1)
    :re_(re), im_(im), refcount_(rc) {}
};

class ComplexV4 {
  ComplexRep* rep_;
public:
  ComplexV4(float re = 0, float im = 0) :rep(new ComplexRep(re,im) {}
  Complex(const Complex& c) :rep_(c.rep_) { rep_->refcount_++; }

 ~ComplexV4() { if (--rep_->refcount_ == 0) delete rep_; }

};

This suffices as we've developed the class so far, but isn't a very good model for most classes. Issues include:

Note that when you use reference-counting schemes like this, you should normally use by-value style passing as well, so that reference counts are maintained across all calls.

OO-style methods

You essentially always find yourself wanting to add mutative operations to ADT-style classes, in order to support a more object-based style. In C++, this is pretty natural to do since there is already a syntax for them using the X= operators. For example, we might add, using our first representation:

class ComplexV1 {
public:
  // ...
  void operator += (const Complex& b) { re_ += b.real(); im_ += b.imag(); }
};

Note that even though we are supporting an object style here, we use references, not pointers since that is how overloaded operators are defined.

As alluded to above, use of mutative operations is often more efficient than expression-based style. For example, instead of writing Complex w = x + y + z, we could write:

void user3() {
  ComplexV1 x, y, z;
  ComplexV1 w = x;
  w += y;
  w += z;
}

Here, only one constructor call is used for w, and no temporaries are produced.

There is additionally a mutative operation that you are obligated to either support or explicity not support -- assignment. In an ADT, you almost always have to support it, and implement it by the equivalent of re-initializing state to that held by another object. For example:

class ComplexV1 {
public:
  ...
  ComplexV1& operator = (const ComplexV1& b) { 
        re_ = b.real(); im_ = b.imag(); return *this; }
};

The return value type follows an idiomatic convention used by default in C++ -- note that here is among the few places where you can without danger return a reference, since it only a reference to something that must necessarily exist after the operation exits.

The mechanics are slightly trickier with indirect representations. Since the argument object (or perhaps just its representation) may actually be the same as the host, you need to be more careful, especially if you may need to reinitialize with a different representation object. For example, in our second representation:

class ComplexV2 {
public:
  // ...
  ComplexV2& operator = (const ComplexV2& b) { 
    if (b.rep_ != rep_) {
      ComplexRep* newrep = new ComplexRep;
      newrep->re_ = b.real();
      newrep->im_ = b.imag();
      delete rep_;
      rep_ = newrep;
     }
    return *this;
  }    
  // ...
};

In this example, we really didn't have to kill the old rep_ and make a new one, so this part could have been simplified. But in general, this is the right canonical sequence.

With reference-counting, it is a bit trickier still:

class ComplexV4 {
public:
  // ...
  ComplexV4& operator = (const ComplexV4& b) { 
    if (&b != this) {
      b.rep_->refcount_++; // Order is important
      if (--rep_->refcount_ == 0) delete rep_; 
      rep_ = b.rep_;
     }
    return *this;
  }    
  // ...
};

With reference counting, these kinds of mechanis extend to all other mutative operations as well, leading to a general style of copy-on-write semantics. Here, whenever you change a value in the representation, you have to make sure that you are no longer sharing with other objects that don't want the value changed. For example:

class ComplexV4 {
public:
  ...
  void operator += (const Complex& b) { 
    if (rep_->refcount_ != 1) {
      --rep_->refcount_;
      rep_ = new ComplexRep(real()+b.real(), imag()+b.imag());
     }
    else {
      rep_->re_ += b.rep_->re_;
      rep_->im_ += b.rep_->im_;
    }
  }    
  // ...
};

There are several further variations and improvements. Almost none of them are worth doing for simple classes like Complex, but become more tempting for ADTs with heavyweight representations like matrices.

Converters

Among the last things to check when building ADT-style classes is how they interconvert with built-in types, as well as other ADT classes.

In C++, any single-argument constructor also defines a converter from the source type to that class. These converters may be implicity invoked by the C++ overload resolution mechanism, so are a bit more dangerous than they look. (The upcoming ANSI standard introduces some syntax that detaches construction from coercion when desired.) Thus, for example, because our original Complex constructor includes a case where you can initialize a Complex from a single float, this conversion would be implicitly called when used in:

void user() {
  Complex x;
  // ...
  x += 4.0;
}

The sequence of events is:

Thus, these implicit coercions add to the temporary problem, and can add surprising behavior and overhead. One way to deal with this is to add special-case versions of functions and methods to deal with sources of convertable types. These tend to be more efficient. For example:
class ComplexV1 {
  // ...

  void operator +=(float f) { re_ += f; }
};

The main disadvantage of such maneuvers is that they tend to proliferate. It is tempting to add special case versions of just about every operator for just about every convertible type. This could lead to dozens of operations, most of which will rarely be used.

You can also add converters from a new class back into builtin types. For example, although not very sensible here, we could add a conversion that coerces a Complex into a simple float:

class Complex {
  // ...

  float operator float() const { return re_; }
};
This converter would be called implicitly whenever a Complex were sent to a routine expecting a float (which is why it is not very sensible here.)

Subclassing

The short version of the best rule of thumb in C++ is to never try to build subclasses of ADT-style classes; use any other design / programming technique instead. For some details see the paper on libg++. But to illustrate, suppose we wanted to build a NamedComplex class that adds a simple name attribute to values. Here are some problems you would run up against:

The normal reaction to all this is disbelief, followed by frustration trying to find a good way to do it, failure, and finally acceptance. Programmers using purely reference passing languages do not have to face these issues as much, which is why such problems don't often arise in Smalltalk or OREXX. In fact, if you build a class using only OO-style operations (for example, here, only methods like operator +=), you don't run into these problems as much either, and it is not hard to build useful subclasses.

One otherwise undesirable consequence of this is that when building ADT-style classes, you often end up padding their interfaces with lots of miscellaneous operations that you would otherwise define only within special subclasses. This is one reason why String classes in C++ tend to have too many basic methods and operations, and are not very representative of ``normal'' C++ classes.

On the other hand, a potential advantage of this is that since no methods need be declared as virtual, you can better control inlining. (Most compilers cannot inline virtuals.) For the kinds of operations seen in Complex, inlining could be a big performance win. In numerical applications especially, people find as much as a factor of 20 speed-up when they inline. But in general it is hard to predict whether it is worth the trade-off against the code bloat associated with inlining. You always need to test this empirically with client code that matches expected usage profiles (which is itself very hard to do for such basic classes).


Last update Wed May 3 07:10:46 1995 Doug Lea (dl at gee)