Going Beyond End of File:

More Filtering Streambufs

by James Kanze

Introduction

In a previous article, we examined several simple cases of using a streambuf as a filter, with another streambuf as the final source or sink. In this article, we will pursue the idea further, with a streambuf which redefines end of file. When it encounters end of file on the current source, it connects to another and continues, only returning end of file when there are no more sources. This technique goes beyond what is supported by the templates we presented previously, and so we will simply derive directly from streambuf. For the technical details concerning the necessary boilerplate code, I refer you to the previously cited article.

To add to the fun, I've added buffering, so that you don't have a virtual function call each time you read a character.

Because I happen to have a compiler which implements the new iostream handy, and feel like experimenting with it, the examples here will use the new forms, e.g. char_traits< char >::eof(), instead of EOF, and so forth. (There are fewer changes than I originally expected; namespaces and the new definition of EOF are the most obvious.) The code has been tested using Microsoft's Visual C++, version 5.0, service pack 3.

Requirements specification

An obvious use of this technique is to process filename arguments as is usual under UNIX: if there are no arguments, we read from standard in (cin), otherwise, the arguments are the names of files which are read one after the other; traditionally, a filename of "-" also causes standard in to be read.

In the tradition of STL and the standard C++ library, we will specify the list of files by using two iterators, one to the beginning, and one to the end. Given that the list of filename arguments is in argv, which has the type char*[] (actually, char**), we would expect the iterators to have the type char**. In fact, in my own code, for various reasons, the arguments will normally have been copied into a list< std::string > (with those not corresponding to filename arguments removed). Others might use a std::vector< string >, or who knows what else. So, continuing in the tradition of STL, we will use a template class, instantiated over the type of iterator (char** for argv). The only requirement is that the result of dereferencing the iterator can be converted to a std::string. A char** meets this requirement, since the result of dereferencing it is a char*, which converts implicitly to std::string.

To best give an idea of what we are trying to do, consider a simple implementation of the UNIX utility cat:

    int
    main( int argc , char* argv[] )
    {
        MultiFileReader< char** >
                            mfrbuf( argv + 1 , argv + argc ) ;
        cout << &mfrbuf ;
        return mfrbuf.returnStatus() ;
    }

You can't get much simpler. But of course, we've ignored all of the options and error conditions. Nevertheless, inputting from such a streambuf can simplify a lot of code. And if, as is the case in most of my code, option processing has left the filenames in a list< string >, the declaration of the buffer would be:

    MultiFileReader< list< string >::const_iterator >
                            mfrbuf( files.begin() , files.end() ) ;

As mentioned before, we'll do buffering. For the purposes of demonstration, however, we'll simplify error handling to the maximum: if we cannot open a file, we will display an error message on standard error; all the user program can do is ask whether there has or has not been an error, through the function returnStatus. This is just barely adequate for cat -- a real implementation would probably want to use some sort of callback, to let the user code decide what is appropriate.

Class definition

As usual, I'll start by presenting the class definition:

    template< class I >
    class MultiFileReader : public std::streambuf
    {
    public:
                            MultiFileReader( I begin , I end ) ;
        virtual             ~MultiFileReader() ;
        int                 returnStatus() const ;

    protected:
        virtual int         underflow() ;
        virtual std::streambuf*
                            setbuf( char* buffer , std::streamsize n ) ;

    private:
        I                   myNextFilename ;
        I                   myEndFilename ;
        std::streambuf*     myCurrentStream ;
        std::filebuf        myFilebuf ;
        char*               myBuffer ;
        std::streamsize     myBufferSize ;
        int                 myReturnStatus ;
        bool                myAmBufferOwner ;

        void                nextStream() ;
    } ;

As explained in the previous article, you may want to redefine sync as well; if portability to older versions of iostream is an issue, you should also provide a version of overflow which simply returns end of file.

Several comments are in order:

We've added a function, returnStatus(), for error reporting; see the preceding section.
The variables myNextFilename and myEndFilename are classical STL iterators. When the two are equal, there are no more filenames present. In this implementation, myNextFilename points to the name of the next file to be opened (and not the currently open file).
The most important variable is myCurrentStream. There is a class invariant that this variable is NULL if and only if we have reached our end of file; that is, we have reached end of file for all of the input streams. We will thus verify that all of the functions maintain this invariant (although it may temporarily be NULL within one of our functions).
We declare a filebuf in our data, and reuse it. This is more efficient than getting a new one off the heap each time, although I doubt the difference is measurable compared to the time necessary to read the files. Not using the heap may also reduce fragmentation.
The variable myBuffer points to the buffer, myBufferSize gives its size.
We also maintain a return status, which the user can use when returning from main. In fact, in a real implementation, I would provide some more flexible error handling, perhaps a callback, so that the user could decide the appropriate action. Here, the only way the user can detect that we were unable to open a file is by checking this value (using returnStatus(), which simply returns myReturnStatus).
Finally, since the user can set the buffer, we need a flag (myAmBufferOwner) to tell us whether we own the buffer or not, in order to know whether to delete it in the destructor.

A more complete implementation would also maintain the name of the current file, and perhaps the line number, which it would make available to the user, for e.g. error messages.

Implementation

So much for the generalities, let's look at the constructor:

    template< class I >
    MultiFileReader< I >::MultiFileReader( I begin , I end )
        //  Initializers are listed in declaration order, which is
        //  the order in which they are executed.
        :   myNextFilename( begin )
        ,   myEndFilename( end )
        ,   myCurrentStream( NULL )
        ,   myBuffer( NULL )
        ,   myBufferSize( 0 )
        ,   myReturnStatus( EXIT_SUCCESS )
        ,   myAmBufferOwner( false )
    {
        if ( myNextFilename == myEndFilename )
            myCurrentStream = std::cin.rdbuf() ;
        else
            nextStream() ;
    }

As you can see, there is not much to it: the obvious initializations, and either initializing myCurrentStream to the streambuf of cin if no filenames are given, or calling nextStream to initialize it according to the filename designated by myNextFilename.

As usual for input streambuf's, the workhorse is underflow:

    template< class I >
    int
    MultiFileReader< I >::underflow()
    {
        typedef std::char_traits< char >
                            Traits ;
        static std::streamsize const
                            ourDefaultBufferSize = 4096 ;
        static int const    eof( Traits::eof() ) ;

        int                 result( eof ) ;
        if ( gptr() < egptr() )
            //  to_int_type converts by way of unsigned char, so
            //  that a byte such as 0xFF is not mistaken for eof.
            result = Traits::to_int_type( *gptr() ) ;
        else
        {
            if ( myBuffer == NULL )
            {
                myBuffer = new char[ ourDefaultBufferSize ] ;
                myBufferSize = ourDefaultBufferSize ;
                myAmBufferOwner = true ;
            }
            //  Keep trying sources until one delivers data, or
            //  there are none left.
            while ( result == eof && myCurrentStream != NULL )
            {
                std::streamsize     lengthRead
                    = myCurrentStream->sgetn( myBuffer , myBufferSize ) ;
                if ( lengthRead <= 0 )
                    nextStream() ;
                else
                {
                    setg( myBuffer , myBuffer , myBuffer + lengthRead ) ;
                    result = Traits::to_int_type( *gptr() ) ;
                }
            }
        }
        return result ;
    }

We've already seen the first if in the preceding article. In the else branch, the first thing we do is to allocate a buffer if we don't already have one. (The client code may have called setbuf, or this might not be the first call to this function.) Note that we remember that we allocated it, so we can delete it correctly in the destructor.
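
One detail of that first branch deserves a closer look. Where plain char is signed, converting a byte such as 0xFF directly to int gives a negative value, typically -1, which is also the usual value of end of file. char_traits< char >::to_int_type converts by way of unsigned char, and so keeps the two apart. The following is only a small sketch of the point; the two helper functions are hypothetical names, invented for this illustration.

```cpp
#include <string>

typedef std::char_traits< char >
                    Traits ;

//  Hypothetical helper: converts a byte the way underflow should
//  return it, that is, by way of unsigned char.
int
asReturnedByUnderflow( char byte )
{
    return Traits::to_int_type( byte ) ;
}

//  Hypothetical helper: true if the converted byte cannot be
//  confused with end of file.
bool
distinctFromEof( char byte )
{
    return ! Traits::eq_int_type( asReturnedByUnderflow( byte ) ,
                                  Traits::eof() ) ;
}
```

With these, distinctFromEof( '\xFF' ) is true, whereas a plain conversion of the same char to int could not make that promise on a platform with signed char.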

We use a loop trying to read the character, so that a failure in a single file (perhaps because the file was empty) won't cause a premature end of file to appear. We try to read a complete buffer from the actual source; if this fails, we call nextStream() to go to the next file; this leaves result == eof, but will set myCurrentStream to NULL if there are no more files. If the read succeeds, we set up the buffer pointers to what we have just read, and return the first character.
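
The loop can be tried out without touching the file system. What follows is a reduced sketch of the same idea, not the class presented in this article: ChainedReader and readChainedExample are hypothetical names, the sources are streambufs which are already open rather than filenames, and the buffer is deliberately tiny so that several refills occur. The middle source is empty, to show that it does not cause a premature end of file.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <streambuf>
#include <string>
#include <vector>

//  Hypothetical illustration, not the article's MultiFileReader:
//  the sources are already open streambufs, not filenames.
class ChainedReader : public std::streambuf
{
public:
    explicit            ChainedReader(
                            std::vector< std::streambuf* > const& sources )
        :   mySources( sources )
        ,   myNext( 0 )
    {
    }

protected:
    virtual int_type    underflow()
    {
        if ( gptr() < egptr() )
            return traits_type::to_int_type( *gptr() ) ;
        //  Skip exhausted (or empty) sources until one delivers
        //  data; report end of file only when none are left.
        while ( myNext < mySources.size() )
        {
            assert( mySources[ myNext ] != NULL ) ;
            std::streamsize     lengthRead
                = mySources[ myNext ]->sgetn(
                      myBuffer ,
                      static_cast< std::streamsize >( sizeof( myBuffer ) ) ) ;
            if ( lengthRead > 0 )
            {
                setg( myBuffer , myBuffer , myBuffer + lengthRead ) ;
                return traits_type::to_int_type( *gptr() ) ;
            }
            ++ myNext ;
        }
        return traits_type::eof() ;
    }

private:
    std::vector< std::streambuf* >
                        mySources ;
    std::size_t         myNext ;
    char                myBuffer[ 8 ] ;     //  small, to force refills
} ;

//  Reads three in-memory sources, the middle one empty, as if they
//  were a single stream.
std::string
readChainedExample()
{
    std::stringbuf      first( "one " ) ;
    std::stringbuf      empty( "" ) ;
    std::stringbuf      last( "two three four" ) ;
    std::vector< std::streambuf* >
                        sources ;
    sources.push_back( &first ) ;
    sources.push_back( &empty ) ;
    sources.push_back( &last ) ;
    ChainedReader       reader( sources ) ;
    std::ostringstream  result ;
    result << &reader ;
    return result.str() ;
}
```

Calling readChainedExample() returns "one two three four": the reader moves past the end of the first source, skips the empty one, and only reports end of file after the last.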

The other major function is nextStream:

    template< class I >
    void
    MultiFileReader< I >::nextStream()
    {
        if ( myCurrentStream == &myFilebuf )
            myFilebuf.close() ;
        myCurrentStream = NULL ;
        for ( ; 
              myCurrentStream == NULL
                  && myNextFilename != myEndFilename ;
              ++ myNextFilename )
        {
            std::string         filename = *myNextFilename ;
            if ( filename == "-" )
                myCurrentStream = std::cin.rdbuf() ;
            else if ( myFilebuf.open( filename.c_str() , std::ios::in )
                      != NULL )
                myCurrentStream = &myFilebuf ;
            else
            {
                myReturnStatus = EXIT_FAILURE ;
                std::cerr << "Cannot open " << filename << std::endl ;
            }
        }
    }

Here, we first close the filebuf if it was being used, and set the current stream to NULL, since there isn't one until we find a new one. We then loop over the remaining filenames until we successfully find a file, or there aren't any more. If the filename is "-", we use the streambuf from cin, according to the UNIX tradition, otherwise, we try to open the file. In a more robust implementation, the final else would use some sort of call-back¹ with the filename in order to inform the user immediately; in such a case, we should also think about what happens if the user call-back throws an exception. As written, we just display an error on standard error and note the fact in the return status. (This is the usual behavior in such cases for UNIX filter programs.)
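
To give an idea of what such a call-back might look like, here is a rough sketch which passes the error handler as an additional functional template parameter. Everything in it is an assumption made for the illustration: openNext and CollectFailures are hypothetical names, and the function is a free-standing variant of the loop above rather than a member of MultiFileReader.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

//  Hypothetical handler: records the names of the files which
//  could not be opened, instead of writing to standard error.
struct CollectFailures
{
    std::vector< std::string >*
                        failed ;
    explicit            CollectFailures(
                            std::vector< std::string >* target )
        :   failed( target )
    {
    }
    void                operator()( std::string const& filename ) const
    {
        failed->push_back( filename ) ;
    }
} ;

//  Hypothetical free-standing variant of the loop in nextStream:
//  tries each name in [begin, end) until one can be opened, and
//  reports every failure through onError. Returns an iterator
//  just past the last name tried.
template< class I , class OnError >
I
openNext( std::filebuf& file , I begin , I end , OnError onError )
{
    bool                opened = false ;
    while ( ! opened && begin != end )
    {
        std::string         filename = *begin ;
        ++ begin ;
        if ( file.open( filename.c_str() , std::ios::in ) != NULL )
            opened = true ;
        else
            onError( filename ) ;
    }
    return begin ;
}
```

In this sketch the filebuf is never open at the moment the handler is called, so a handler which throws leaves nothing behind to clean up. A version integrated into the class would have to make sure that its own invariants hold at that point as well.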

It's worth noting the use of the variable filename, rather than simply dereferencing myNextFilename each time. This is not just a question of optimization! It is necessary so that char** can be used as an iterator -- there is no implicit type conversion on the left side of the . operator when calling c_str.
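
The same point can be shown in isolation. The function below is a hypothetical helper, invented for this illustration only; it accepts any iterator type whose dereferenced value converts to std::string, and so works equally well with char** and with list< string > iterators.

```cpp
#include <cstddef>
#include <list>
#include <string>

//  Hypothetical helper, invented for this illustration: returns
//  the length of the longest name in [begin, end).
template< class I >
std::size_t
longestName( I begin , I end )
{
    std::size_t         result = 0 ;
    for ( ; begin != end ; ++ begin )
    {
        //  Writing (*begin).size() would not compile for
        //  I = char**, since a char* has no members. Copying
        //  into a std::string first works for any iterator
        //  whose value converts to one.
        std::string         name = *begin ;
        if ( name.size() > result )
            result = name.size() ;
    }
    return result ;
}
```

Given an array of char* in the style of argv, longestName( array , array + count ) and the same call on the begin() and end() of a list< string > built from that array give the same result.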

Of the remaining functions, setbuf() will only set the buffer if it is not yet set (thus, before the first input), otherwise it returns an error:

    template< class I >
    std::streambuf*
    MultiFileReader< I >::setbuf( char* buffer , std::streamsize n )
    {
        std::streambuf*     result = NULL ;
        if ( myBuffer == NULL )
        {
            myBuffer = buffer ;
            myBufferSize = n ;
            myAmBufferOwner = false ;
            result = this ;
        }
        return result ;
    }

The destructor only has to check if we own the buffer, and delete it (using delete[]) if we do, and returnStatus() is a simple accessor function.
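
For readers who would like to see the ownership logic in one place, here is a small stand-alone sketch. BufferHolder is a hypothetical stand-in for the buffer-related members of MultiFileReader, with adopt playing the role of setbuf and ensure that of the lazy allocation in underflow; it is an illustration under those assumptions, not the actual code.

```cpp
#include <cassert>
#include <cstddef>

//  Hypothetical stand-in for the buffer-related members of
//  MultiFileReader; not the article's actual class.
class BufferHolder
{
public:
                        BufferHolder()
        :   myBuffer( NULL )
        ,   myBufferSize( 0 )
        ,   myAmBufferOwner( false )
    {
    }
                        ~BufferHolder()
    {
        //  Only delete what we allocated ourselves.
        if ( myAmBufferOwner )
            delete [] myBuffer ;
    }

    //  Like setbuf: accepted only while no buffer is in place.
    bool                adopt( char* buffer , std::size_t n )
    {
        if ( myBuffer != NULL )
            return false ;
        myBuffer = buffer ;
        myBufferSize = n ;
        myAmBufferOwner = false ;
        return true ;
    }

    //  Like the start of underflow: allocate a default buffer,
    //  and remember that we own it, if none has been supplied.
    void                ensure( std::size_t defaultSize )
    {
        assert( defaultSize > 0 ) ;
        if ( myBuffer == NULL )
        {
            myBuffer = new char[ defaultSize ] ;
            myBufferSize = defaultSize ;
            myAmBufferOwner = true ;
        }
    }

    bool                ownsBuffer() const { return myAmBufferOwner ; }
    std::size_t         size() const { return myBufferSize ; }

private:
    //  Copying would lead to a double delete, so it is disabled.
                        BufferHolder( BufferHolder const& ) ;
    BufferHolder&       operator=( BufferHolder const& ) ;

    char*               myBuffer ;
    std::size_t         myBufferSize ;
    bool                myAmBufferOwner ;
} ;
```

A holder which has had ensure called on it first owns its buffer and refuses a later adopt; one which has adopted a user buffer first keeps that buffer, and does not delete it on destruction.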

Conclusion

Again, this is really just a simple extension to what we showed in the previous article. It is only a simple example of what is possible, however. The fundamental idea is that the end of file that the streambuf uses/defines is not necessarily the same as that of the final source or sink. When you define a streambuf, you define what it represents. I've found such streambuf's exceedingly useful -- in fact, it's rare for me to write an application without using them.

As usual, the actual complete sources can be found in my code. (This is the class GB_MultipleFileInputStreambuf, in the component Extended/multiinp.)

¹ The term call-back should be understood in its largest sense here. In addition to the classical pointer to function, it could also be a static member of a traits class, an additional functional type template parameter, or anything else the user can think of.