Data Flow (Text and Binary Files)

Context

  • The C language did not build input/output facilities into the language itself. In other words, there is no keyword like read or write. Instead, it left IO to external library functions supplied with the compiler (such as printf and scanf in the stdio library). The ANSI C standard formalized these IO functions into the Standard IO package (stdio.h).
  • C++ continues this approach and formalizes IO in libraries such as iostream and fstream.
  • C/C++ I/O is based on streams, which are sequences of bytes flowing into and out of programs.
  • In input operations, data bytes flow from an input source (such as keyboard, file, network, or another program) into the program.
  • In output operations, data bytes flow from the program to an output sink (such as console, file, network, or another program). Streams act as intermediaries between programs and the actual IO devices, freeing programmers from handling the devices directly so as to achieve device-independent IO operations.

C++ provides both formatted and unformatted IO functions. In formatted or high-level IO, bytes are grouped and converted to types such as int, double, string or user-defined types. In unformatted or low-level IO, bytes are treated as raw bytes and left unconverted. Formatted IO operations are supported by overloading the stream insertion (<<) and stream extraction (>>) operators, which present a consistent public IO interface.

To perform input and output, a C++ program must:

  1. Construct a stream object.
  2. Connect (Associate) the stream object to an actual IO device (e.g., keyboard, console, file, network, another program).
  3. Perform input/output operations on the stream, via the functions defined in the stream’s public interface, in a device-independent manner. Some functions convert the data between the external format and the internal format (formatted IO), while others do not (unformatted or binary IO).
  4. Disconnect (Dissociate) the stream from the actual IO device (e.g., close the file).
  5. Free the stream object.

C++ I/O Headers, Templates, and Classes

C++ IO is provided in the headers <iostream> (which includes <ios>, <istream>, <ostream> and <streambuf>), <fstream> (for file IO), and <sstream> (for string IO). Furthermore, the header <iomanip> provides manipulators such as setw(), setprecision(), setfill() and setbase() for formatting.

Files I/O (STREAMS)

  • A stream models a flow of data. In a stream, data flows between objects, and those objects can perform arbitrary processing on the data. When you’re working with streams, output is data going into the stream and input is data coming out of the stream. These terms describe the stream as viewed from the user’s perspective.
  • In C++, streams are the primary mechanism for performing input and output (I/O). Regardless of the source or destination, you can use streams as the common language to connect inputs to outputs.
  • We can convert our objects to streams of bytes. We can also convert streams of bytes back to objects. The I/O stream library provides such functionality.
  • Streams can be output streams and input streams.
  • There are different kinds of I/O streams, for instance: file streams.

Formatted Operations (text-based streams)

All formatted I/O passes through two functions: the standard stream operators, operator << and operator >>.

Read

We can read from a file, and we can write to a file. The standard library offers such functionality via file streams. These file streams are defined inside the <fstream> header and they are:

  1. std::ifstream – read from a file
  2. std::ofstream – write to a file
  3. std::fstream – read from and write to a file

The std::fstream can both read from and write to a file, so let us use that one. To create a std::fstream object we use:

C++
#include <fstream>
int main()
{
    std::fstream fs{ "myfile.txt" };
}

This example creates an fs file stream and associates it with a file named myfile.txt on our disk. To read from such a file, line-by-line, we use:

C++
#include <iostream>
#include <fstream>
#include <string>
int main()
{
    std::fstream fs{ "myfile.txt" };
    std::string s;
    while (std::getline(fs, s)) // read each line into a string; the loop ends when no line is left
    {
        std::cout << s << '\n';
    }
}

To read from a file one character at a time, we can use the file stream’s >> operator. Note that >> skips whitespace, so spaces and newlines in the file are not printed:

C++
#include <iostream>
#include <fstream>
int main()
{
    std::fstream fs{ "myfile.txt" };
    char c;
    while (fs >> c)
    {
        std::cout << c;
    }
}

Write

To write to a file, we use the file stream’s << operator:

C++
#include <fstream>
int main()
{
    std::fstream fs{ "myoutputfile.txt", std::ios::out };
    fs << "First line of text." << '\n';
    fs << "Second line of text" << '\n';
    fs << "Third line of text" << '\n';
}

We associate the fs object with an output file name and provide the additional std::ios::out flag, which opens the file for writing and overwrites any existing myoutputfile.txt file. Then we output our text to the file stream using the << operator.

To append text to an existing file, we include the std::ios::app flag inside the file stream constructor:

C++
#include <fstream>
int main()
{
    std::fstream fs{ "myoutputfile.txt", std::ios::app };
    fs << "This is appended text" << '\n';
    fs << "This is also an appended text." << '\n';
}

We can also output strings to our file using the file stream’s << operator:

C++
#include <iostream>
#include <fstream>
#include <string>
int main()
{
    std::fstream fs{ "myoutputfile.txt", std::ios::out };
    std::string s1 = "The first string.\n";
    std::string s2 = "The second string.\n";
    fs << s1 << s2;
}

Text Files

  • A text file (flat file) is a computer file that contains only text and has no special formatting such as bold text, italic text, or images. On Microsoft Windows computers, text files are commonly identified by the .txt file extension.
  • Because of their simplicity, text files are commonly used for the storage of information. They avoid some of the problems encountered with other file formats, such as endianness, padding bytes, or differences in the number of bytes in a machine word.
  • A simple text file may need no additional metadata (other than knowledge of its character set) to assist the reader in interpretation. A text file may contain no data at all, which is the case of a zero-byte file.

Encoding

  • The ASCII character set is the most common compatible subset of character sets for English-language text files, and is generally assumed to be the default encoding in many situations. It covers American English, but for the British Pound sign, the Euro sign, or characters used outside English, a richer character set must be used.
  • Unicode is an attempt to create a common standard for representing all known languages, and most known character sets are subsets of the very large Unicode character set. Although there are multiple character encodings available for Unicode, the most common is UTF-8, which has the advantage of being backwards-compatible with ASCII; that is, every ASCII text file is also a UTF-8 text file with identical meaning.

UTF-8

  • UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.

Types of Text Files

CSV (Comma-separated values)

  • A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
  • A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.
  • The CSV file format is not fully standardized.

CSV
id,firstname,lastname,email,email2,profession
0,Jobi,Gilmour,Jobi.Gilmour@yopmail.com,Jobi.Gilmour@gmail.com,doctor
1,Xylina,Killigrew,Xylina.Killigrew@yopmail.com,Xylina.Killigrew@gmail.com,police officer
2,Patricia,Zitvaa,Patricia.Zitvaa@yopmail.com,Patricia.Zitvaa@gmail.com,doctor
3,Gusty,Friede,Gusty.Friede@yopmail.com,Gusty.Friede@gmail.com,developer
4,Bee,Michella,Bee.Michella@yopmail.com,Bee.Michella@gmail.com,police officer
5,Evita,Keily,Evita.Keily@yopmail.com,Evita.Keily@gmail.com,firefighter
6,Deane,Jarib,Deane.Jarib@yopmail.com,Deane.Jarib@gmail.com,firefighter
7,Amii,Nance,Amii.Nance@yopmail.com,Amii.Nance@gmail.com,firefighter
8,Ardeen,Sparhawk,Ardeen.Sparhawk@yopmail.com,Ardeen.Sparhawk@gmail.com,police officer
9,Kenna,Skell,Kenna.Skell@yopmail.com,Kenna.Skell@gmail.com,developer
10,Kirbee,Shirberg,Kirbee.Shirberg@yopmail.com,Kirbee.Shirberg@gmail.com,doctor

Tab Delimited

  • A tab-delimited text file is a file containing tabs that separate information with one record per line.
  • A tab delimited file is often used to upload data to a system.

Tab-delimited text
id    firstname    lastname    email    email2    profession
0    Jobi    Gilmour    Jobi.Gilmour@yopmail.com    Jobi.Gilmour@gmail.com    doctor
1    Xylina    Killigrew    Xylina.Killigrew@yopmail.com    Xylina.Killigrew@gmail.com    police officer
2    Patricia    Zitvaa    Patricia.Zitvaa@yopmail.com    Patricia.Zitvaa@gmail.com    doctor
3    Gusty    Friede    Gusty.Friede@yopmail.com    Gusty.Friede@gmail.com    developer
4    Bee    Michella    Bee.Michella@yopmail.com    Bee.Michella@gmail.com    police officer
5    Evita    Keily    Evita.Keily@yopmail.com    Evita.Keily@gmail.com    firefighter
6    Deane    Jarib    Deane.Jarib@yopmail.com    Deane.Jarib@gmail.com    firefighter
7    Amii    Nance    Amii.Nance@yopmail.com    Amii.Nance@gmail.com    firefighter
8    Ardeen    Sparhawk    Ardeen.Sparhawk@yopmail.com    Ardeen.Sparhawk@gmail.com    police officer
9    Kenna    Skell    Kenna.Skell@yopmail.com    Kenna.Skell@gmail.com    developer
10    Kirbee    Shirberg    Kirbee.Shirberg@yopmail.com    Kirbee.Shirberg@gmail.com    doctor

Example I/O

Unformatted Operations (binary files)

  • When data is stored in a file in the binary format, reading and writing data is faster because no time is lost in converting the data from one format to another format. Such files are called binary files.
  • The class ios_base is a multipurpose class that serves as the base class for all I/O stream classes.

Among its member types and constants is the stream open mode type (openmode), with the following constants:

Constant    Explanation
app         seek to the end of stream before each write
binary      open in binary mode
in          open for reading
out         open for writing
trunc       discard the contents of the stream when opening
ate         seek to the end of stream immediately after open

File size and indexation

In C++, a file is treated as a stream, or array, of uninterpreted bytes. Each byte can be regarded as a char, so the file contents as a whole can be viewed conceptually as a char array: (char *)myFile.

The “array” of bytes stored in a file is indexed from zero to len-1, where len is the total number of bytes in the entire file.

Opening Files

There are two main ways of opening files in binary mode:

When declaring the object, set a file name and necessary flags in the constructor.

C++
ifstream myReadFile(filename, ios::in | ios::binary);

Declare a stream object and use the open method to set the file name and necessary flags.

C++
ifstream myFile;
myFile.open("data2.dat", ios::in | ios::binary); // an ifstream reads, so it is opened with ios::in

There are two main flags that need to be used when manipulating binary files:

  1. The i/o mode ios::in or ios::out
  2. The binary mode ios::binary

Read

The read method extracts a given number of bytes from the stream, and places them into the memory pointed to by the first parameter.

C++
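Person person; // must be trivially copyable (no std::string or pointer members)

std::string filename = "people.dat";
std::ifstream inFile;
inFile.open(filename, std::ios::in | std::ios::binary);

// Copy sizeof(person) raw bytes from the file into the object
inFile.read(reinterpret_cast<char*>(&person), sizeof(person));

std::cout << person.toString() << std::endl;

inFile.close();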

Write

The write member function writes a given number of bytes on the given stream, starting at the position of the “put” pointer.

C++
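Person person1 = Person("Julio");

std::string filename = "people.dat";
std::ofstream outFile;
outFile.open(filename, std::ios::out | std::ios::binary);

// Copy the object's raw bytes to the file (only valid for trivially copyable types)
outFile.write(reinterpret_cast<const char*>(&person1), sizeof(person1));
outFile.close();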

Accessing file positions

An open file stream keeps a “get” pointer (for reading) and a “put” pointer (for writing). These store positions in the file and are part of the stream object.

The GET pointer

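It is the current reading position: the index of the next byte that will be read from the file. The get pointer can be repositioned with the istream& seekg(streampos pos) method, or with the overload seekg(streamoff off, ios_base::seekdir dir), which takes an offset relative to ios::beg, ios::cur, or ios::end. To obtain the index of the get pointer on a given stream, use streampos tellg().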

C++
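Person person;

std::string filename = "people.dat";
std::ifstream inFile;
inFile.open(filename, std::ios::in | std::ios::binary);

// Skip the first record: move the get pointer sizeof(person) bytes from the beginning
inFile.seekg(sizeof(person), std::ios::beg);
// Reading the second person in the file
inFile.read(reinterpret_cast<char*>(&person), sizeof(person));

std::cout << person.toString() << std::endl;

inFile.close();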

The PUT pointer

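It is the current writing position: the index at which the next byte will be written to the file. The put pointer can be repositioned with the ostream& seekp(streampos pos) method. To obtain the index of the put pointer on a given stream, use streampos tellp().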

C++
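Person person;

std::string filename = "people.dat";
std::ofstream outFile;
outFile.open(filename, std::ios::binary | std::ios::app);

// In append mode some implementations may report position 0 until the first
// write, so move the put pointer to the end before asking for it.
outFile.seekp(0, std::ios::end);
std::cout << "Adding in position: " << outFile.tellp() << " byte." << std::endl;

// Adding to the end of the file.
outFile.write(reinterpret_cast<const char*>(&person), sizeof(person));

outFile.close();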

Example Binary