Python has functions for reading, creating, and deleting files. The high-level steps for many data science applications is to import data, analyze data, and export results.
open
for read or write¶A basic function for working with files is open(filename,mode)
. The filename
is a string that identifies the file to open and mode
is how the file should be opened as 'r'
for read, 'a'
for append, 'w'
for write, and 'x'
for create (returns error if file exists). You can also specify if the file should be handled as a text 't'
or binary 'b'
file. The defaults is 'rt'
to read a file in text mode.
# write a test file with a message
f = open('02-file.txt','w')
f.write('This is a test file')
f.close()
import os
print('File stored in: ' + os.getcwd())
# read and print the file contents
f = open('02-file.txt')
print(f.read())
f.close()
A common data file form is the Comma Separated Value (CSV) where entries are delimited (separated) by a comma. There is some data m
that we would like to write to a CSV file with headers in clist
. This example shows how to write the CSV file with several modules.
x,y,z
1,2,3
4,5,6
7,8,9
After running each cell, open the file in your current run directory with either Excel or a text editor.
open
and csv
module¶The with
command automatically closes the file when the commands inside the block are completed. The newline=''
is only required for Windows. The writerow
function writes one row of the CSV file.
clist = ['x','y','z']
m = [[1,2,3],\
[4,5,6],\
[7,8,9]]
import csv
with open('02-data1.csv',mode='w',newline='') as f:
cw = csv.writer(f)
cw.writerow(clist)
for i in range(len(m)):
cw.writerow(m[i])
numpy
writes CSV¶The numerical Python numpy
package is used throughout this course. The np.savetxt
function requires the file name, data m
, the type of delimiter ,
, and header. If comments=''
is omitted then the header has a #
sign in front.
import numpy as np
np.savetxt('02-data2.csv',m,delimiter=',',comments='',header='x,y,z')
pandas
writes CSV¶The module pandas
requires that the data be in DataFrame
form for writing.
import pandas as pd
df = pd.DataFrame(m,columns=clist)
df.to_csv('02-data3.csv',index=False)
pandas
writes XLSX and JSON¶pandas
can also write other files such as json or Excel files. You may need to install openpyxl to write the Excel file. You can do this in a cell with !pip install openpyxl
and include --user
if you do not have administrative privilege. You may need to restart the IPython notebook kernel after pip installs the openpyxl
package.
df.to_json('02-data3.json',orient='table',index=False)
df.to_excel('02-data3.xlsx',index=False)
Use numpy
to create 51
equally spaced values for x
between 0
and 100
. Calculate y=x**2
and z=x**3
that are derived from x
. Store x
, y
, and z
in a CSV file with headings in file 02-test.csv
.
data = np.loadtxt('02-data1.csv',delimiter=',',skiprows=1)
print(data)
pandas
to read CSV¶The pd.read_csv
function reads the CSV data file the the header row to label the colunns. The data.head()
and data.tail()
functions prints up to the first or last 5 values, respectively.
data = pd.read_csv('02-data1.csv')
data.head()
Use pandas
to read the 02-test.csv
file created above. Display the first 5 rows of the file.
It is also possible to delete files using the os
(operating system) module.
import os
os.remove('02-data1.csv')
The glob
module builds a list of files that start with 02-data
and end with .csv
. It uses the wildcard character *
to select any files that match the first and last parts.
['02-data1.csv', '02-data2.csv', '02-data3.csv']
If the user's first letter of the answer is y
then it deletes these files.
import os
import glob
filelist = glob.glob('02-data*.csv')
if filelist==[]:
print('No files to delete')
ans='no'
else:
ans = input('Delete files '+str(filelist)+'? ')
if ans[0].lower()=='y':
for f in filelist:
os.remove(f)
Delete the file 02-test.csv
with Python.
Write data file 02-tclab.csv
with 5 columns that include time in seconds (t
), heater levels (Q1
and Q2
), and temperatures (lab.T1
and lab.T2
). Include a data row every second for 20 seconds. The starting script only prints those values to the screen but they also need to be saved to a file.
import tclab
import time
n = 20
Q1 = 30; Q2 = 70
with tclab.TCLab() as lab:
lab.Q1(Q1); lab.Q2(Q2)
print('t Q1 Q2 T1 T2')
for t in range(n):
print(t,Q1,Q2,lab.T1,lab.T2)
time.sleep(1)
Read the 02-tclab.csv
file and print the first 5 rows. If you do not have a TCLab device, read the data file from the url
with data=pd.read_csv(url)
# read this file if you do not have a TCLab
url = 'http://apmonitor.com/pdc/uploads/Main/tclab_data2.txt'