Processing Data with Secondary Storage (files)

Processing data is the main reason behind the design of modern digital computers and programming languages. We often face large datasets for which we already know how to produce the information we need, but carrying out those methods by hand is nearly impossible. Imagine being given the task of computing the average grade on the final mathematics exam for every 12th-grade student in an entire country. It would take a long time, you would tire quickly, and you would make many mistakes. Instead, we can write a few lines of code to load the data for all students from long-term memory (also called secondary storage) into short-term memory (such as RAM), followed by a few more instructions to compute the average of all the grades in a small fraction of the time a human would need. Here, we will learn about a basic file format called Comma-Separated Values (CSV), which is used to store data on secondary storage. We will write a program to load data from a CSV file into memory (represented by a list variable) and then compute an average.

The CSV file format is simple. Each line holds values that are separated by commas. Each value belongs to a column and each line represents a row. Thus, a CSV file can hold a simple matrix structure. Python has a nice way of reading a CSV file: the csv library. Whenever we need to load data from a file into memory, we first tell the computer to open the file for reading. Then, we tell the csv library to prepare for reading from the opened file by creating a reader object. Once we do that, we can loop through the reader object to read each row from the CSV file and store it in a list we call data. Each row is itself stored as a list whose values are strings. Thus, we can think of data as a list of lists. That is, each element of data is itself a list that captures one row of the file we loaded into memory.

Reading one line of a file

f = open('data.csv','r')
line = f.readline() 
f.close()

This simple program opens a file called data.csv for reading. The method readline() reads a single line from the file and returns it as a string, including the newline character at the end of the line. Therefore, the variable line holds a string, which is the first line of data.csv. We could print this line, split it into a list (if it contains multiple items), or convert it to a numeric value (if the line only has a number). In the remainder of this section, we will dive deeper into reading and writing files.
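The options mentioned above (printing, splitting, converting) can be sketched as follows. This is only an illustration: the sample file is created in a temporary folder so the code runs on its own, and its contents (Alice, Bob and their grades) are made up.

```python
import os
import tempfile

# Create a small sample file so the sketch is self-contained.
# The folder, file contents, and names are illustrative assumptions.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'data.csv')
out = open(path, 'w')
out.write('Alice,90\nBob,75\n')
out.close()

f = open(path, 'r')
line = f.readline()   # the first line, including its trailing newline
f.close()

fields = line.strip().split(',')   # remove the newline, then split on commas
name = fields[0]
grade = float(fields[1])           # convert the text '90' to a number
print(fields, name, grade)
```

Calling strip() before split(',') matters here, because otherwise the last item would still carry the newline character.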

Reading CSV files

Suppose that the file we want to read is called students.csv and is stored in the folder in which we will store our program.

import csv
f = open('students.csv', 'r')
reader = csv.reader(f)
data = []
for row in reader:
    data.append(row)   # each row is a list of strings
f.close()

This program opens the file, prepares a reader object to read the CSV file, defines the data variable, and loops through the file to store all the rows in data. Assume that each row has values for two columns, one is the student name and another is the student exam grade. Our job is now to compute the average of all grades and display it on the screen.
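To see what the list of lists looks like, here is a minimal sketch that feeds the reader from an in-memory stand-in instead of a real file. The names and grades are invented for illustration; io.StringIO is used only so the example runs without students.csv being present.

```python
import csv
import io

# Stand-in for students.csv; the names and grades are made up.
sample = io.StringIO('Alice,90\nBob,75\nCarol,82\n')

reader = csv.reader(sample)
data = []
for row in reader:
    data.append(row)

print(data)               # a list of rows, each row a list of strings
print(data[1][0])         # first column of the second row
print(type(data[0][1]))   # the grade is stored as text, not as a number
```

Note that the grades come back as strings, so they need to be converted before any arithmetic is done with them.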

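import csv
f = open('students.csv', 'r')
reader = csv.reader(f)
s = 0   # running sum of the grades
n = 0   # number of records read
for row in reader:
    n += 1
    s += float(row[1])   # row[1] is text, so convert it to a number first
f.close()
print(s/n)

We change the program a little to compute the average directly. Instead of storing all the data in a variable, we read one row at a time, add its grade to a running sum s, and count the records in n. Because the csv reader returns every value as a string, we convert row[1] with float before adding it; adding the string itself to a number would raise a TypeError. Finally, we compute and print s/n, which is the average of all grades.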
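One possible refinement, sketched under a few assumptions: the helper name average_grade is hypothetical, blank lines are skipped, and an empty file yields None rather than a division by zero. The sample file is written to a temporary folder with invented data.

```python
import csv
import os
import tempfile

def average_grade(path):
    """Return the mean of the second column, or None if there are no rows."""
    f = open(path, 'r', newline='')
    reader = csv.reader(f)
    s = 0.0
    n = 0
    for row in reader:
        if not row:           # csv.reader yields an empty list for a blank line
            continue
        s += float(row[1])    # grades are read as text, so convert them
        n += 1
    f.close()
    if n == 0:
        return None           # avoid dividing by zero on an empty file
    return s / n

# Illustrative sample data, including one blank line.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'students.csv')
out = open(path, 'w')
out.write('Alice,90\nBob,75\n\nCarol,81\n')
out.close()

result = average_grade(path)
print(result)   # 82.0
```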

Using eval and list comprehension

Another way to read a CSV file that only contains numbers is to use the eval function.

f = open('students.csv', 'r')
data = [eval(x) for x in f]
f.close()

Assuming each line of students.csv holds only numbers separated by commas, the eval function evaluates each line to a tuple (a line with a single number evaluates to that number). Because eval runs whatever Python expression the text contains, it should only be used on files we trust. In the code above, notice that we are also using a Python construct called a list comprehension. A list comprehension is a handy way of constructing a list without the use of append or extend. Inside the list brackets, we specify what each element of the list is and which sequence the elements are drawn from. For example, when we have L = [x for x in file], assuming file refers to an open file with multiple lines ready to be read, the for part goes through each line of the file and names it x. In each iteration, x is stored as an element of the list L. We can also transform x before it is stored, as in the code above: [eval(x) for x in f] applies the eval function to each line and stores the result.
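As a hedged alternative, the standard library function ast.literal_eval accepts only Python literals (numbers, tuples, lists, and so on) and rejects other expressions. The sketch below swaps it in for eval and compares the list comprehension with the equivalent append loop. The list named lines is a stand-in for the lines of a file, and its contents are made up.

```python
import ast

# Stand-in for the lines of a numbers-only file (illustrative data).
lines = ['1,2,3\n', '4,5,6\n']

# Same shape as [eval(x) for x in f], but only literals are accepted.
data = [ast.literal_eval(x.strip()) for x in lines]
print(data)        # [(1, 2, 3), (4, 5, 6)]

# The equivalent loop written with append.
data_loop = []
for x in lines:
    data_loop.append(ast.literal_eval(x.strip()))

# The element can be any expression built from the loop variable.
row_sums = [sum(row) for row in data]
print(row_sums)    # [6, 15]
```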

Using the with keyword

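We can also use the with keyword to help us in opening a file. The file stays open inside the indented block and is closed automatically when the block ends, so we do not need to call f.close() ourselves:

data = []
with open('students.csv', 'r') as f:
    for line in f:
        data.append(eval(line))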
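A small sketch can confirm when the file is open and when it is closed, using the closed attribute of the file object. The sample file and its numbers are invented, and the lines are parsed with split and float here instead of eval.

```python
import os
import tempfile

# Write a small numbers-only sample file (illustrative data).
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'students.csv')
with open(path, 'w') as out:
    out.write('90,85\n70,65\n')

data = []
with open(path, 'r') as f:
    for line in f:
        data.append([float(x) for x in line.strip().split(',')])
    inside = f.closed    # False: the file is still open inside the block

outside = f.closed       # True: the file was closed when the block ended
print(data, inside, outside)
```

The file is also closed if an error occurs inside the block, which is the main reason with is usually preferred over calling close() by hand.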

How can we write to a file?

It's simple. We can either use the csv library or write lines to the file directly. Suppose that we have a list of numbers, L, that we would like to write to a file.

f=open('numbers.csv','w')
for num in L: 
   print(num, file=f)
f.close()

Here, we first open the file for writing using 'w'. Then, we loop through the list and use the print function with its optional file parameter to write num to f in each iteration. Since print ends its output with a newline, each number is written on its own line.
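To check the result, we can write the numbers and then read them back. In this sketch the contents of L are made up and the file is placed in a temporary folder so that nothing existing is overwritten.

```python
import os
import tempfile

L = [3, 1.5, 42]   # illustrative list of numbers

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'numbers.csv')

f = open(path, 'w')
for num in L:
    print(num, file=f)   # one number per line
f.close()

# Read the file back to see what was written.
f = open(path, 'r')
lines = f.readlines()
f.close()

numbers = [float(x) for x in lines]   # float ignores the trailing newline
print(lines)
print(numbers)
```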

Another way is to use the csv library.

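import csv
f = open('numbers.csv', 'w', newline='')   # newline='' lets csv control the line endings
writer = csv.writer(f)
for record in L:   # each record must be a sequence, such as a list or tuple
    writer.writerow(record)
f.close()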

The csv library takes care of organizing the file in comma-separated format. If each record is a sequence, writerow writes the items of that sequence on one line, separated by commas.
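The following sketch writes a few records with csv.writer and reads them back with csv.reader. The records are invented for illustration; one of them contains a comma inside a value to show that the writer quotes such a field so it still reads back as a single column.

```python
import csv
import os
import tempfile

# Illustrative records: each one is a sequence of values.
records = [['Alice', 90], ['Bob', 75], ['Smith, Jo', 88]]

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'students.csv')

f = open(path, 'w', newline='')
writer = csv.writer(f)
for record in records:
    writer.writerow(record)
f.close()

# The raw text shows the quoting around the value that contains a comma.
f = open(path, 'r', newline='')
raw = f.read()
f.close()
print(raw)

# Reading with csv.reader recovers the original columns (as strings).
f = open(path, 'r', newline='')
rows = list(csv.reader(f))
f.close()
print(rows)
```

Writing the same records by joining values with commas ourselves would split 'Smith, Jo' into two columns when read back, which is one reason to let the csv library handle the formatting.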

Note that we can also open a file for writing with f=open('numbers.csv','a') instead of f=open('numbers.csv','w'), and the two behave differently. With w, the file is created if it does not exist; if it does exist, its contents are deleted and we start afresh. With a, the file is also created if it does not exist; if it does exist, new records are appended to the end of the file and its existing contents are kept. The difference is best observed with a small program that tests and compares the two modes.
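A minimal test program along those lines might look like this. The helper read_lines is a hypothetical name introduced for the sketch, and the file lives in a temporary folder.

```python
import os
import tempfile

def read_lines(path):
    """Return the lines of the file without their newline characters."""
    f = open(path, 'r')
    lines = f.read().splitlines()
    f.close()
    return lines

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'numbers.csv')

# 'w' creates the file because it does not exist yet.
f = open(path, 'w')
print('first', file=f)
f.close()

# 'a' keeps the existing contents and adds to the end.
f = open(path, 'a')
print('second', file=f)
f.close()
after_append = read_lines(path)
print(after_append)   # ['first', 'second']

# 'w' on an existing file deletes its contents before writing.
f = open(path, 'w')
print('third', file=f)
f.close()
after_write = read_lines(path)
print(after_write)    # ['third']
```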