In a previous article, we examined several simple cases of using a
streambuf as a filter, with another streambuf as the final source or
sink. In this article, we will pursue the idea further, with a
streambuf which redefines end of file. When it encounters end of
file on the current source, it connects to another and continues,
only returning end of file when there are no more sources. This
technique goes beyond what is supported by the templates we
presented previously, and so we will simply derive directly from
streambuf. For the technical details concerning the necessary
boilerplate code, I refer you to the previously cited article.
To add to the fun, I've added buffering, so that you don't have a virtual function call each time you read a character.
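To give a rough idea of what buffering buys, here is a small sketch that is not taken from the article's sources. The class CountingSource and the function callsNeeded are hypothetical names introduced only for this illustration: the class hands out the characters of a string in chunks of a chosen size and counts how often the virtual function underflow() is entered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <streambuf>
#include <string>

// Hypothetical demonstration class: serves the characters of a string in
// chunks of a given size and counts the calls to underflow().
class CountingSource : public std::streambuf
{
public:
    CountingSource( std::string const& data , std::size_t chunk )
        : myData( data ) , myPos( 0 ) , myChunk( chunk ) , myCalls( 0 )
    {
        assert( chunk > 0 ) ;       // a chunk size of 0 would never advance
    }

    int underflowCalls() const { return myCalls ; }

protected:
    virtual int_type underflow()
    {
        ++ myCalls ;
        if ( gptr() < egptr() )
            return traits_type::to_int_type( *gptr() ) ;
        if ( myPos >= myData.size() )
            return traits_type::eof() ;
        // Expose the next chunk of the string as the get area.
        std::size_t n = std::min( myChunk , myData.size() - myPos ) ;
        char* start = &myData[ myPos ] ;
        setg( start , start , start + n ) ;
        myPos += n ;
        return traits_type::to_int_type( *gptr() ) ;
    }

private:
    std::string myData ;
    std::size_t myPos ;
    std::size_t myChunk ;
    int         myCalls ;
} ;

// Drains the source one character at a time and reports the call count.
inline int callsNeeded( std::string const& data , std::size_t chunk )
{
    CountingSource source( data , chunk ) ;
    while ( source.sbumpc() != std::char_traits< char >::eof() ) {
    }
    return source.underflowCalls() ;
}
```

For a ten character string, a chunk size of 1 needs eleven calls (one per character, plus the one that reports end of file), whereas a chunk size of 5 needs only three.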
Because I happen to have a compiler which implements the new
iostream handy, and feel like experimenting with it, the examples
here will use the new forms, e.g. char_traits< char >::eof() instead
of EOF, and so forth. (There are fewer changes than I originally
expected; namespaces and the new definition of EOF are the most
obvious.) The code has been tested using Microsoft's Visual C++,
version 5.0, service pack 3.
An obvious use of this technique is to process filename arguments as
is usual under UNIX: if there are no arguments, we read from
standard in (cin); otherwise, the arguments are the names of files
which are read one after the other. Traditionally, a filename of "-"
also causes standard in to be read.
In the tradition of STL and the standard C++ library, we will
specify the list of files by using two iterators, one to the
beginning, and one to the end. Given that the list of filename
arguments is in argv, which has the type char*[] (actually, char**),
we would expect the iterators to have the type char**. In fact, in my
own code, for various reasons, the arguments will normally have been
copied into a list< std::string > (with those not corresponding to
filename arguments removed). Others might use a
std::vector< std::string >, or who knows what else. So, continuing in
the tradition of STL, we will use a template class, instantiated over
the type of iterator (char** for argv). The only requirement is that
the result of dereferencing the iterator can be assigned to a
std::string. A char** meets this requirement, since the result of
dereferencing it is a char*, which converts implicitly to
std::string.
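The iterator requirement can be illustrated with a small sketch. The helper collectNames and the function namesFromArgv are hypothetical and are not part of the class presented here; they only show that one template accepts both a char** range and a list iterator range, because in each case the dereferenced value can be assigned to a std::string.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <vector>

// Hypothetical helper (not part of MultiFileReader): copies every name in
// [begin, end) into a vector. The only demand made on the iterator type I
// is that *begin can be assigned to a std::string.
template< class I >
std::vector< std::string > collectNames( I begin , I end )
{
    std::vector< std::string > result ;
    for ( ; begin != end ; ++ begin ) {
        std::string name ;
        name = *begin ;             // char* converts; std::string is copied
        result.push_back( name ) ;
    }
    return result ;
}

// Exercises the helper with a char** range, shaped like argv + 1.
inline std::vector< std::string > namesFromArgv()
{
    char first[] = "a.txt" ;
    char second[] = "-" ;
    char* args[] = { first , second } ;
    return collectNames( args , args + 2 ) ;
}
```

The same call works unchanged with files.begin() and files.end() of a std::list< std::string >.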
To best give an idea of what we are trying to do, consider a simple
implementation of the UNIX utility cat:
    int main( int argc , char* argv[] )
    {
        MultiFileReader< char** > mfrbuf( argv + 1 , argv + argc ) ;
        cout << &mfrbuf ;
        return mfrbuf.returnStatus() ;
    }
You can't get much simpler. But of course, we've ignored all of the
options and error conditions. Nevertheless, inputting from such a
streambuf can simplify a lot of code. And if, as is the case in most
of my code, option processing has left the filenames in a
list< string >, the declaration of the buffer would be:
    MultiFileReader< list< string >::const_iterator >
        mfrbuf( files.begin() , files.end() ) ;
As mentioned before, we'll do buffering. For the purposes of
demonstration, however, we'll simplify error handling to the
maximum: if we cannot open a file, we will display an error message
on standard error; all the user program can do is ask whether there
has or has not been an error, through the function returnStatus.
This is just barely adequate for cat -- a real implementation would
probably want to use some sort of callback, to let the user code
decide what is appropriate.
As usual, I'll start by presenting the class definition:
    template< class I >
    class MultiFileReader : public std::streambuf
    {
    public:
        MultiFileReader( I begin , I end ) ;
        virtual ~MultiFileReader() ;
        int returnStatus() const ;

    protected:
        virtual int underflow() ;
        virtual std::streambuf* setbuf( char* buffer , std::streamsize n ) ;

    private:
        I               myNextFilename ;
        I               myEndFilename ;
        std::streambuf* myCurrentStream ;
        std::filebuf    myFilebuf ;
        char*           myBuffer ;
        std::streamsize myBufferSize ;
        int             myReturnStatus ;
        bool            myAmBufferOwner ;

        void nextStream() ;
    } ;
As explained in the previous article, you may want to redefine sync as well; if portability to older versions of iostream is an issue, you should also provide a version of overflow which simply returns end of file.
Several comments are in order:
- returnStatus(), for error reporting; see the preceding section.
- myNextFilename and myEndFilename are classical STL iterators. When
  the two are equal, there are no more filenames present. In this
  implementation, myNextFilename points to the name of the next file
  to be opened (and not the currently open file).
- The streambuf we are currently reading from is designated by
  myCurrentStream. There is a class invariant that this variable is
  NULL if and only if we have reached our end of file; that is, we
  have reached end of file for all of the input streams. We will thus
  verify that all of the functions maintain this invariant (although
  it may temporarily be NULL within one of our functions).
- We keep a single filebuf in our data, and reuse it. This is more
  efficient than getting a new one off the heap each time, although I
  doubt the difference is measurable compared to the time necessary
  to read the files. Not using the heap may also reduce
  fragmentation.
- myBuffer points to the buffer, myBufferSize gives its size.
- myReturnStatus holds the value to be returned from main. In fact,
  in a real implementation, I would provide some more flexible error
  handling, perhaps a callback, so that the user could decide the
  appropriate action. Here, the only way the user can detect that we
  were unable to open a file is by checking this value (using
  returnStatus(), which simply returns myReturnStatus).
- We use a flag (myAmBufferOwner) to tell us whether we own the
  buffer or not, in order to know whether to delete it in the
  destructor.
A more complete implementation would also maintain the name of the current file, and perhaps the line number, and make them available to the user, e.g. for error messages.
So much for the generalities; let's look at the constructor:
    template< class I >
    MultiFileReader< I >::MultiFileReader( I begin , I end )
        :   myNextFilename( begin )
        ,   myEndFilename( end )
        ,   myCurrentStream( NULL )
        ,   myBuffer( NULL )
        ,   myBufferSize( 0 )
        ,   myReturnStatus( EXIT_SUCCESS )
        ,   myAmBufferOwner( false )
    {
        if ( myNextFilename == myEndFilename )
            myCurrentStream = std::cin.rdbuf() ;
        else
            nextStream() ;
    }
As you can see, there is not much to it: the obvious
initializations, and either initializing myCurrentStream to the
streambuf of cin if no filenames are given, or calling nextStream to
initialize it according to the filename designated by
myNextFilename.
As usual for input streambuf's, the workhorse is underflow:
    template< class I >
    int MultiFileReader< I >::underflow()
    {
        typedef std::char_traits< char > Traits ;
        static std::streamsize const ourDefaultBufferSize = 4096 ;
        static int const eof( Traits::eof() ) ;

        int result( eof ) ;
        if ( gptr() < egptr() )
            //  Convert through to_int_type, so that a char with the
            //  high bit set cannot be mistaken for end of file.
            result = Traits::to_int_type( *gptr() ) ;
        else {
            //  Allocate a buffer on first use, unless setbuf() gave us one.
            if ( myBuffer == NULL ) {
                myBuffer = new char[ ourDefaultBufferSize ] ;
                myBufferSize = ourDefaultBufferSize ;
                myAmBufferOwner = true ;
            }
            //  Try successive sources until one yields data or none remain.
            while ( result == eof && myCurrentStream != NULL ) {
                std::streamsize lengthRead
                    = myCurrentStream->sgetn( myBuffer , myBufferSize ) ;
                if ( lengthRead <= 0 )
                    nextStream() ;
                else {
                    setg( myBuffer , myBuffer , myBuffer + lengthRead ) ;
                    result = Traits::to_int_type( *gptr() ) ;
                }
            }
        }
        return result ;
    }
We've already seen the first if in the preceding article. In the
else branch, the first thing we do is to allocate a buffer if we
don't already have one. (The client code may have called setbuf, or
this might not be the first call to this function.) Note that we
remember that we allocated it, so we can delete it correctly in the
destructor.
We use a loop trying to read the character, so that a failure in a
single file (perhaps because the file was empty) won't cause a
premature end of file to appear. We try to read a complete buffer
from the actual source; if this fails, we call nextStream() to go to
the next file; this leaves result == eof, but will set
myCurrentStream to NULL if there are no more files. If the read
succeeds, we set up the buffer pointers to what we have just read,
and return the first character.
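The effect of the loop can be tried without any files. The following is only a sketch under stated assumptions: ChainedReader and readChained are hypothetical names, and the sources are already open streambufs held in a vector rather than filenames, so there is no nextStream and no error handling. The loop in underflow has the same shape as the one described above.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <streambuf>
#include <string>
#include <vector>

// Hypothetical, simplified cousin of MultiFileReader: the sources are
// already-open streambufs rather than filenames.
class ChainedReader : public std::streambuf
{
public:
    explicit ChainedReader( std::vector< std::streambuf* > const& sources )
        : mySources( sources ) , myNext( 0 )
    {
    }

protected:
    virtual int_type underflow()
    {
        if ( gptr() < egptr() )
            return traits_type::to_int_type( *gptr() ) ;
        // Keep trying sources until one yields data or none are left, so
        // that an empty source in the middle does not look like end of file.
        while ( myNext < mySources.size() ) {
            std::streamsize n
                = mySources[ myNext ]->sgetn( myBuffer , sizeof( myBuffer ) ) ;
            if ( n <= 0 ) {
                ++ myNext ;
            } else {
                setg( myBuffer , myBuffer , myBuffer + n ) ;
                return traits_type::to_int_type( *gptr() ) ;
            }
        }
        return traits_type::eof() ;
    }

private:
    std::vector< std::streambuf* > mySources ;
    std::size_t                    myNext ;
    char                           myBuffer[ 16 ] ;
} ;

// Reads everything from three in-memory sources, the middle one empty.
inline std::string readChained()
{
    std::stringbuf first( "ab" ) ;
    std::stringbuf empty( "" ) ;
    std::stringbuf last( "cd" ) ;
    std::vector< std::streambuf* > sources ;
    sources.push_back( &first ) ;
    sources.push_back( &empty ) ;
    sources.push_back( &last ) ;
    ChainedReader reader( sources ) ;
    std::ostringstream out ;
    out << &reader ;
    return out.str() ;
}
```

With these three sources, readChained() returns "abcd": the empty source in the middle is skipped rather than being reported as end of file.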
The other major function is nextStream:
    template< class I >
    void MultiFileReader< I >::nextStream()
    {
        //  Release the file we have just finished with, if any.
        if ( myCurrentStream == &myFilebuf )
            myFilebuf.close() ;
        myCurrentStream = NULL ;
        //  Advance until a source opens or the names run out.
        for ( ; myCurrentStream == NULL && myNextFilename != myEndFilename ;
                ++ myNextFilename ) {
            std::string filename = *myNextFilename ;
            if ( filename == "-" )
                myCurrentStream = std::cin.rdbuf() ;
            else if ( myFilebuf.open( filename.c_str() , std::ios::in ) != NULL )
                myCurrentStream = &myFilebuf ;
            else {
                myReturnStatus = EXIT_FAILURE ;
                std::cerr << "Cannot open " << filename << std::endl ;
            }
        }
    }
Here, we first close the filebuf if it was being used, and set the
current stream to NULL, since there isn't one until we find a new
one. We then loop over the remaining filenames until we successfully
find a file, or there aren't any more. If the filename is "-", we use
the streambuf from cin, according to the UNIX tradition; otherwise,
we try to open the file. In a more robust implementation, the final
else would use some sort of callback with the filename in order to
inform the user immediately; in such a case, we should also think
about what happens if the user callback throws an exception. As
written, we just display an error on standard error and note the fact
in the return status. (This is the usual behavior in such cases for
UNIX filter programs.)
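One possible shape for such a callback is sketched below. Everything in it is an assumption made for illustration: the interface OpenErrorHandler, the function openNext and the class RecordingHandler do not appear in the actual sources. The one design point it tries to show is that the position is advanced before the handler is invoked, so that the state remains consistent even if the handler throws.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical callback interface; the article only suggests that one
// could exist.
class OpenErrorHandler
{
public:
    virtual ~OpenErrorHandler() {}
    virtual void cannotOpen( std::string const& filename ) = 0 ;
} ;

// Tries each name in [next, end) until one opens. `next` is moved past a
// name before the handler hears about it, so the position stays consistent
// even if the handler throws.
template< class I >
bool openNext( std::filebuf& file , I& next , I end ,
               OpenErrorHandler& handler )
{
    while ( next != end ) {
        std::string filename = *next ;
        ++ next ;
        if ( file.open( filename.c_str() , std::ios::in ) != 0 )
            return true ;
        handler.cannotOpen( filename ) ;
    }
    return false ;
}

// Example handler that simply records the names it is given.
class RecordingHandler : public OpenErrorHandler
{
public:
    virtual void cannotOpen( std::string const& filename )
    {
        failed.push_back( filename ) ;
    }
    std::vector< std::string > failed ;
} ;
```

A handler that wants the UNIX filter behavior would write to standard error and set a status flag; one that wants to abort could throw instead.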
It's worth noting the use of the variable filename, rather than
simply dereferencing myNextFilename each time. This is not just a
question of optimization! It is necessary so that char** can be used
as an iterator -- there is no implicit type conversion on the left
side of the . operator when calling c_str.
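A minimal sketch of the point, using a hypothetical function firstNameLength that is not part of the class: the copy into a std::string is what makes the template usable with both kinds of iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <list>
#include <string>

// Returns the length of the first name in a range, going through c_str().
// Writing (*first).c_str() directly would compile for list< string >
// iterators but not for char**: a char* has no members, and the compiler
// never applies an implicit conversion to the object on the left of the dot.
template< class I >
std::size_t firstNameLength( I first )
{
    std::string filename = *first ;     // the conversion happens here
    return std::strlen( filename.c_str() ) ;
}
```

The function can be called with an array of char* (which decays to char**) as well as with the begin() iterator of a std::list< std::string >.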
Of the remaining functions, setbuf() will only set the buffer if it
is not yet set (thus, before the first input); otherwise it returns
an error:
    template< class I >
    std::streambuf*
    MultiFileReader< I >::setbuf( char* buffer , std::streamsize n )
    {
        std::streambuf* result = NULL ;
        if ( myBuffer == NULL ) {
            myBuffer = buffer ;
            myBufferSize = n ;
            myAmBufferOwner = false ;
            result = this ;
        }
        return result ;
    }
The destructor only has to check if we own the buffer, and delete it
(using delete[]) if we do, and returnStatus() is a simple accessor
function.
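The buffer bookkeeping shared by setbuf() and the destructor might look roughly like the following. This is a sketch, not the code from the actual sources: BufferDemo is a hypothetical cut-down streambuf that never delivers a character, and allocateDefault() stands in for the lazy allocation that underflow() performs.

```cpp
#include <cassert>
#include <cstddef>
#include <streambuf>

// Hypothetical cut-down streambuf that keeps only the buffer bookkeeping
// described above.
class BufferDemo : public std::streambuf
{
public:
    BufferDemo()
        : myBuffer( NULL ) , myBufferSize( 0 ) , myAmBufferOwner( false )
    {
    }

    virtual ~BufferDemo()
    {
        if ( myAmBufferOwner )
            delete [] myBuffer ;    // array form, to match new char[ n ]
    }

    // Stands in for the lazy allocation done in underflow().
    void allocateDefault()
    {
        if ( myBuffer == NULL ) {
            myBufferSize = 64 ;
            myBuffer = new char[ 64 ] ;
            myAmBufferOwner = true ;
        }
    }

    bool ownsBuffer() const { return myAmBufferOwner ; }

protected:
    virtual std::streambuf* setbuf( char* buffer , std::streamsize n )
    {
        if ( myBuffer != NULL )
            return NULL ;           // too late: a buffer is already in use
        myBuffer = buffer ;
        myBufferSize = n ;
        myAmBufferOwner = false ;   // the caller keeps ownership
        return this ;
    }

private:
    char*           myBuffer ;
    std::streamsize myBufferSize ;
    bool            myAmBufferOwner ;

    BufferDemo( BufferDemo const& ) ;               // not copyable
    BufferDemo& operator=( BufferDemo const& ) ;
} ;
```

Client code reaches setbuf() through the public pubsetbuf(); a call made before the first read succeeds and returns the streambuf itself, while a later call returns a null pointer.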
Again, this is really just a simple extension to what we showed in the previous article. It is only a simple example of what is possible, however. The fundamental idea is that the end of file that the streambuf uses/defines is not necessarily the same as that of the final source or sink. When you define a streambuf, you define what it represents. I've found such streambuf's exceedingly useful -- in fact, it's rare for me to write an application without using them.
As usual, the actual complete sources can be found in my code. (This is the class GB_MultipleFileInputStreambuf, in the component Extended/multiinp.)