(10/27/20) Methods, iterables, and generators for reading files

We all need to open, close, and save data to files using Python. Loading a file always involves reading the lines in the file, and possibly doing some processing on each line. An example file \(\textbf{test_data.txt}\) contains 1,000,000 rows (lines) and 3 columns separated by an empty space.

import time
fn = "generators/test_data.txt"

The Python Method

def method_loader(fn):
    with open(fn) as fid:
        lines = fid.readlines() # Read in lines until reaching EOF
        lines = [line.strip("\n").split(" ")[-1] for line in lines] # Process each line
    return lines
for processed_line in method_loader(fn):
    break

The entire file was read into memory with the .readlines() call, before any line processing was performed.

Reading files like this shouldn’t be a problem for file sizes up to roughly 1 GB. But sometimes we have no choice and have to work with large files (sometimes hundreds of gigabytes). In that case the readlines call can take a very long time, and if the file is too large to fit into memory, Python will raise a MemoryError and the program will terminate.
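
As a rough illustration (the exact numbers will depend on your disk and machine), the time module imported above can be used to see how long the eager load takes:

start = time.time()
lines = method_loader(fn) # Reads all 1,000,000 lines into a list at once
print(f"Loaded {len(lines)} lines in {time.time() - start:.2f} seconds")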

We frequently encounter large datafiles at NCAR. What can we do about it?

Reading and processing one line at a time would solve this problem. We could even process an “infinitely” large file, which means any file that’s too large to load fully into memory.

This kind of file reading is called lazy reading.

The Python Generator

There is a special tool in the Python toolbox that makes lazy reading easy: the generator. Generators are built on top of Python’s iterator protocol, but I will cover the two in reverse order below.

def generator_loader(fn):
    with open(fn, "r") as fid:
        for line in fid:
            yield line.strip("\n").split(" ")[-1]

We’ve replaced the return with something new named \(\color{green}{\textbf{yield}}\). This chunk of code looks similar to the method_loader!

Test it.

for processed_line in generator_loader(fn):
    print(processed_line)
    break # Stop early, I don't need to print 1,000,000 lines!
0

In use it behaves much like the method variant presented above. The big difference is that the generator version is more memory efficient because, through \(\color{green}{\textbf{yield}}\), lines are read into memory one at a time, returned, released, …, until reaching the end of the file (EOF).
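
A quick, if rough, way to see the difference is to compare the size of the two objects themselves (sys.getsizeof only counts the container, not the strings inside it, but the contrast is still telling):

import sys

lines = method_loader(fn)   # a list holding 1,000,000 processed strings
gen = generator_loader(fn)  # a generator object; no lines have been read yet
print(sys.getsizeof(lines)) # grows with the number of elements (several MB of pointers here)
print(sys.getsizeof(gen))   # a couple hundred bytes, regardless of file size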

That is to say, \(\color{green}{\textbf{yield}}\) can hand back a value more than once, whereas \(\color{green}{\textbf{return}}\) in a method signals the end: the method exits and its memory is freed.
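
A tiny example (unrelated to file reading) makes this concrete: each call to next resumes the generator where the previous yield left off.

def count_to_three():
    yield 1 # Execution pauses here after the first next()
    yield 2 # ...and resumes here on the second next()
    yield 3

gen = count_to_three()
print(next(gen)) # 1
print(next(gen)) # 2
print(next(gen)) # 3; a fourth next() would raise StopIteration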

Generators also work nicely with serialized data. You may have dumped data using the pickle library before; pickle lets you serialize the entire dataset in one go, or record-by-record (line-by-line), as the example below illustrates:

import pickle
fn_pkl = "generators/test_data.pkl"
def write_to_pickle(data, fn):
    with open(fn, "wb") as fid:
        for line in data:
            pickle.dump(line, fid) # Iteration over .dump
write_to_pickle(
    method_loader(fn),
    fn_pkl
)

From here on we will assume that we do not know how many lines are in our serialized data dump:

def load_from_pickle(fn):
    with open(fn, "rb") as fid:
        while True: # Keep looping with while.
            yield pickle.load(fid) # Iteration over .load

where the \(\color{green}{\textbf{while True}}\) loop keeps calling pickle.load until we reach the end of the file.

Test it!

for row in load_from_pickle(fn_pkl):
    continue
---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
<ipython-input-23-2de8c1965edf> in <module>
----> 1 for row in load_from_pickle(fn_pkl):
      2     continue

<ipython-input-22-45015fa1bf7e> in load_from_pickle(fn)
      2     with open(fn, "rb") as fid:
      3         while True: # Keep looping with while.
----> 4             yield pickle.load(fid) # Iteration over .load

EOFError: Ran out of input

It failed!

There must be a signal that can be used to stop the generator from yielding the next line when it does not exist.

Note that Python raised an end-of-file error, \(\color{red}{\textbf{EOFError}}\). We can catch that and use it to exit the generator:

def load_from_pickle(fn):
    with open(fn, "rb") as fid:
        try:
            while True: # We do not necessarily know how many lines are in fn
                yield pickle.load(fid)               
        except EOFError:
            pass # Do nothing and leave load_from_pickle without error

Now it will run without error:

for row in load_from_pickle(fn_pkl):
    continue
    
# Finishes without error

This is rather clunky! Now we have to do exception handling (gasp). You might be wondering how useful Python generators really are for simplifying memory usage if this style of coding makes the workflow more complex. We could also have relied on Python’s iterator objects (covered next) to do lazy reading.

Fortunately, there is a generalized version of yield, \(\color{green}{\textbf{yield from}}\), for these situations:

def load_from_pickle(fn):
    with open(fn, "rb") as fid:
        yield from pickle.load(fid)
for row in load_from_pickle(fn_pkl):
    continue
# Finishes without error.

One caveat: \(\color{green}{\textbf{yield from}}\) delegates to a single iterable, so this version yields the contents of whatever the first pickle.load returns. It behaves like the try/except generator only when the whole dataset was dumped as one object (for example, a single list); if each line was pickled separately, as in write_to_pickle above, stick with the EOFError-handling version.
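
To make the delegation explicit, here is a minimal, standalone sketch of \(\color{green}{\textbf{yield from}}\) handing off iteration to another iterable:

def delegate():
    yield from [1, 2, 3] # Yields each element of the list in turn

for value in delegate():
    print(value) # prints 1, 2, 3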

In short, generators allow us to write simple code that keeps memory usage low when working with large data files.

The Python Iterable

Before generators were introduced, one relied on a Python iterator object to produce lazy readers. Iterator classes are not too difficult to write, but they come with requirements; in particular, they must implement the \(\color{blue}{\textbf{__iter__}}\) and \(\color{blue}{\textbf{__next__}}\) “dunder” methods.

A simple example with our serialized (pickled) data from above:

class read_from_pickle_iterable:
    
    def __init__(self, fn):
        self.fn = fn
        self.fid = open(self.fn, "rb")
        
    def __iter__(self):
        return self
    
    def __next__(self):
        try:
            return pickle.load(self.fid)
        except EOFError:
            raise StopIteration

The dunder method \(\color{blue}{\textbf{__iter__}}\) returns the object itself (through self!), while \(\color{blue}{\textbf{__next__}}\) returns the result of the .load call on the opened file.

rfpi = read_from_pickle_iterable(fn_pkl)

Using the Iterator’s \(\color{blue}{\textbf{__next__}}\) functionality, we then grab the lines from the file one-by-one without loading the entire file into memory:

while True:
    next(rfpi) # Use next like this
---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
<ipython-input-44-1e1bfab26979> in __next__(self)
     11         try:
---> 12             return pickle.load(self.fid)
     13         except EOFError:

EOFError: Ran out of input

During handling of the above exception, another exception occurred:

StopIteration                             Traceback (most recent call last)
<ipython-input-51-d7c5bd59b55e> in <module>
      1 while True:
----> 2     next(rfpi) # Use next like this

<ipython-input-44-1e1bfab26979> in __next__(self)
     12             return pickle.load(self.fid)
     13         except EOFError:
---> 14             raise StopIteration

StopIteration: 

which fails as intended: the \(\color{red}{\textbf{EOFError}}\) is caught and \(\color{red}{\textbf{StopIteration}}\) is thrown with the \(\color{green}{\textbf{raise}}\) statement, and a bare while True/next loop does not handle StopIteration for us. When the iterator is consumed in a for loop (or by list(), enumerate(), etc.), StopIteration is handled automatically and it exits without error:

for line in read_from_pickle_iterable(fn_pkl):
    continue

Why even have generators when there are already iterators?

The answer is that generators are more compact and easier to write: you do not have to implement \(\color{blue}{\textbf{__iter__}}\) and \(\color{blue}{\textbf{__next__}}\) yourself, as that is taken care of under the hood by the generator object. The converse is not true: a hand-written iterator class does not get \(\color{green}{\textbf{yield}}\) for free.
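
You can verify this yourself: a generator object already implements the iterator protocol, so next and iter work on it directly, just as they did on the hand-written class.

gen = generator_loader(fn)
print(next(gen))        # the first processed line, like read_from_pickle_iterable's __next__
print(iter(gen) is gen) # True: __iter__ returns the generator itself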

When should I use a generator rather than a method?

There are lots of scenarios, in addition to data loading! Note that in the first method example above, a list was returned. Do you need all elements of the list at the same time? If the answer is no, and the file is large, a generator is worth trying!
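
For example, if you only need the first few processed lines, itertools.islice can pull them from the generator without reading the rest of the file:

from itertools import islice

first_ten = list(islice(generator_loader(fn), 10)) # Reads and processes only the first 10 lines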

With generators, one balances the time spent operating on the data against memory utilization. Whether to use them is often determined by how your program is designed to run and to use resources. If you face a significant memory bottleneck, generators are very often the way to go. However, if your program does not have such a memory issue, using a generator may make it run noticeably slower.

As you use generators more and more in your workflow, you will learn when to apply them and when to avoid them because they offer no benefit over plain methods.

Feel free to email me (John Schreck, schreck@ucar.edu) with any questions / mistakes / whatever!