2. Import and Export Data

Python Data Science

Python has functions for reading, creating, and deleting files. The high-level steps for many data science applications is to import data, analyze data, and export results.

File open for read or write

A basic function for working with files is open(filename,mode). The filename is a string that identifies the file to open and mode is how the file should be opened as 'r' for read, 'a' for append, 'w' for write, and 'x' for create (returns error if file exists). You can also specify if the file should be handled as a text 't' or binary 'b' file. The defaults is 'rt' to read a file in text mode.

In [2]:
# write a test file with a message
f = open('02-file.txt','w')
f.write('This is a test file')
f.close()

import os
print('File stored in: ' + os.getcwd())

# read and print the file contents
f = open('02-file.txt')
print(f.read())
f.close()
File stored in: C:\Users\johnh\Desktop\data_science
This is a test file

idea

Write Data Files

A common data file form is the Comma Separated Value (CSV) where entries are delimited (separated) by a comma. There is some data m that we would like to write to a CSV file with headers in clist. This example shows how to write the CSV file with several modules.

x,y,z
1,2,3
4,5,6
7,8,9

After running each cell, open the file in your current run directory with either Excel or a text editor.

open and csv module

The with command automatically closes the file when the commands inside the block are completed. The newline='' is only required for Windows. The writerow function writes one row of the CSV file.

In [ ]:
clist = ['x','y','z']
m = [[1,2,3],\
     [4,5,6],\
     [7,8,9]]

import csv
with open('02-data1.csv',mode='w',newline='') as f:
    cw = csv.writer(f)
    cw.writerow(clist)
    for i in range(len(m)):
        cw.writerow(m[i])

numpy writes CSV

The numerical Python numpy package is used throughout this course. The np.savetxt function requires the file name, data m, the type of delimiter ,, and header. If comments='' is omitted then the header has a # sign in front.

In [ ]:
import numpy as np
np.savetxt('02-data2.csv',m,delimiter=',',comments='',header='x,y,z')

pandas writes CSV

The module pandas requires that the data be in DataFrame form for writing.

In [ ]:
import pandas as pd
df = pd.DataFrame(m,columns=clist)
df.to_csv('02-data3.csv',index=False)

pandas writes XLSX and JSON

pandas can also write other files such as json or Excel files. You may need to install openpyxl to write the Excel file. You can do this in a cell with !pip install openpyxl and include --user if you do not have administrative privilege. You may need to restart the IPython notebook kernel after pip installs the openpyxl package.

In [ ]:
df.to_json('02-data3.json',orient='table',index=False)
df.to_excel('02-data3.xlsx',index=False)

expert

Write Activity

Use numpy to create 51 equally spaced values for x between 0 and 100. Calculate y=x**2 and z=x**3 that are derived from x. Store x, y, and z in a CSV file with headings in file 02-test.csv.

In [ ]:
 

idea

Read Data Files

Just like writing the CSV files, there are modules for reading data files.

Use numpy to read CSV

The np.loadtxt function reads the CSV data file with option skiprows=1 to skip the header row. Numpy does not label the rows or columns and only stores the CSV values.

In [ ]:
data = np.loadtxt('02-data1.csv',delimiter=',',skiprows=1)
print(data)

Use pandas to read CSV

The pd.read_csv function reads the CSV data file the the header row to label the colunns. The data.head() and data.tail() functions prints up to the first or last 5 values, respectively.

In [ ]:
data = pd.read_csv('02-data1.csv')
data.head()

expert

Read Activity

Use pandas to read the 02-test.csv file created above. Display the first 5 rows of the file.

In [ ]:
 

idea

Delete Data Files

It is also possible to delete files using the os (operating system) module.

import os
os.remove('02-data1.csv')

The glob module builds a list of files that start with 02-data and end with .csv. It uses the wildcard character * to select any files that match the first and last parts.

['02-data1.csv', '02-data2.csv', '02-data3.csv']

If the user's first letter of the answer is y then it deletes these files.

In [ ]:
import os
import glob
filelist = glob.glob('02-data*.csv')

if filelist==[]:
    print('No files to delete')
    ans='no'
else:
    ans = input('Delete files '+str(filelist)+'? ')

if ans[0].lower()=='y':
    for f in filelist:
        os.remove(f)

expert

Delete Activity

Delete the file 02-test.csv with Python.

In [ ]:
 

expert

TCLab Activity

Write data file 02-tclab.csv with 5 columns that include time in seconds (t), heater levels (Q1 and Q2), and temperatures (lab.T1 and lab.T2). Include a data row every second for 20 seconds. The starting script only prints those values to the screen but they also need to be saved to a file.

In [ ]:
import tclab
import time
n = 20
Q1 = 30; Q2 = 70
with tclab.TCLab() as lab:
    lab.Q1(Q1); lab.Q2(Q2)
    print('t Q1 Q2 T1    T2')
    for t in range(n):
        print(t,Q1,Q2,lab.T1,lab.T2)
        time.sleep(1)

Read the 02-tclab.csv file and print the first 5 rows. If you do not have a TCLab device, read the data file from the url with data=pd.read_csv(url)

# read this file if you do not have a TCLab
url = 'http://apmonitor.com/pdc/uploads/Main/tclab_data2.txt'
In [ ]: