Lately, I've been thinking hard about the best way to organize my data before feeding it to a machine learning classifier or regressor, with a few guiding principles in mind.
I first started with NumPy's savez function, which lets you save several arrays under explicit names in a single archive. A typical use looks like this:
import numpy as np

# Assume you already have train/test arrays and their labels
# (dummy arrays with hypothetical shapes, just for illustration):
arr_train = np.random.rand(1000, 10)
label_train = np.random.randint(0, 2, 1000)
arr_test = np.random.rand(200, 10)
label_test = np.random.randint(0, 2, 200)

# Gather everything in a dict and save it as a single .npz archive
d = {"arr_train": arr_train,
     "arr_test": arr_test,
     "label_train": label_train,
     "label_test": label_test}
np.savez("data.npz", **d)
Which works just fine.
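For completeness, loading the archive back is just as simple: np.load on a .npz file returns a dict-like object whose keys are the names used at save time. A quick sketch, reusing the names from above:
import numpy as np

# NpzFile behaves like a dict; arrays are only read when accessed
with np.load("data.npz") as data:
    print(data.files)              # names passed to np.savez
    arr_train = data["arr_train"]
    label_train = data["label_train"]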
However, some limitations quickly became apparent, the most obvious one being loading speed. The benchmark below compares loading a plain .npy file with loading the same array from an .npz archive:
import numpy as np
import time

# Large random array to make the timing difference visible
arr = np.random.randint(0, 1000, (25000, 5000))
np.savez("arr.npz", **{"arr": arr})
np.save("arr.npy", arr)

# Time 20 loads of the .npy file
ltime = []
for i in range(20):
    start = time.time()
    arr = np.load("arr.npy")[:, :]
    ltime.append(time.time() - start)
print("npy time:", np.mean(ltime))

# Time 20 loads of the same array from the .npz archive
ltime = []
for i in range(20):
    start = time.time()
    arr = np.load("arr.npz")["arr"][:, :]
    ltime.append(time.time() - start)
print("npz time:", np.mean(ltime))
This gave me the following times:
npy time: 0.483348703384
npz time: 1.47687283754
That's quite the difference!
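As a side note that goes beyond the benchmark above, a plain .npy file can also be memory-mapped through np.load's mmap_mode argument, which avoids reading the whole array when you only need a slice (to my knowledge this does not apply to .npz archives, which are zip files):
import numpy as np

# Memory-map the .npy file instead of reading it fully into RAM;
# only the parts you actually index get read from disk
arr = np.load("arr.npy", mmap_mode="r")
block = arr[:100, :100]   # reads just this block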
Clearly, another approach is needed. So far, I have settled on the excellent and simple h5py module, which stores the data in HDF5 format while remaining very transparent to NumPy.
Here's how it goes:
import h5py
import numpy as np

arr = np.random.randint(0, 1000, (25000, 5000))

# Write the array to an HDF5 file as a dataset named "arr"
with h5py.File("arr.h5", "w") as hf:
    hf.create_dataset("arr", data=arr)
And that's it!
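A nice property of HDF5 datasets is that they can be sliced directly on disk, so you only load the rows you actually need. Here is a minimal sketch reusing the arr.h5 file written above:
import h5py

with h5py.File("arr.h5", "r") as hf:
    dset = hf["arr"]          # just a handle, no data is read yet
    print(dset.shape, dset.dtype)
    first_rows = dset[:100]   # reads only the first 100 rows from disk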
You can also easily add metadata to each of your datasets:
import h5py
import numpy as np

arr = np.random.randint(0, 1000, (25000, 5000))

# Attach metadata to the dataset through HDF5 attributes
with h5py.File("arr.h5", "w") as hf:
    dset = hf.create_dataset("arr", data=arr)
    dset.attrs["author"] = "pony"
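Reading that metadata back later is just as transparent:
import h5py

with h5py.File("arr.h5", "r") as hf:
    dset = hf["arr"]
    print(dset.attrs["author"])   # -> "pony"
    print(dict(dset.attrs))       # all attributes as a regular dict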
My only gripe with the module was an ill-fated attempt at writing a file in parallel from several sources: you need to rebuild h5py with parallel support (my Anaconda distribution did not ship with it), which takes you to a world of pain with conflicts between Anaconda's own HDF5 library and the new parallel one you build. The only workaround I found involved reinstalling h5py outside of Anaconda, but that messed with my MPI setup.
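For reference, assuming you do manage to get an MPI-enabled build of h5py and HDF5 (exactly the part I never got working cleanly under Anaconda), a parallel write with the documented mpio driver looks roughly like this; treat it as a sketch rather than something I can vouch for on that setup:
from mpi4py import MPI
import h5py

# Requires h5py/HDF5 compiled with parallel (MPI) support
comm = MPI.COMM_WORLD
with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as hf:
    dset = hf.create_dataset("ranks", (comm.size,), dtype="i")
    # each MPI process writes its own entry of the shared dataset
    dset[comm.rank] = comm.rank
You would launch it with something like mpiexec -n 4 python script.py.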
Anyway, let's test the speed of this new design:
import h5py
import numpy as np
import time

arr = np.random.randint(0, 1000, (25000, 5000))

# Write the array once, with some metadata attached
with h5py.File("arr.h5", "w") as hf:
    dset = hf.create_dataset("arr", data=arr)
    dset.attrs["author"] = "pony"

# Time 20 full reads of the dataset
ltime = []
for i in range(20):
    start = time.time()
    with h5py.File("arr.h5", "r") as hf:
        arr = hf["arr"][:, :]
    ltime.append(time.time() - start)
print("hdf5 time:", np.mean(ltime))
This gave me:
hdf5 time: 0.386118304729
Which is even faster than the .npy version!
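To tie this back to the original train/test setup, all four arrays (and any metadata) can live in a single HDF5 file, one dataset each. A sketch with the same hypothetical dummy arrays as in the first example:
import h5py
import numpy as np

# Dummy stand-ins for the real train/test split (hypothetical shapes)
arr_train = np.random.rand(1000, 10)
label_train = np.random.randint(0, 2, 1000)
arr_test = np.random.rand(200, 10)
label_test = np.random.randint(0, 2, 200)

# One dataset per array, plus a file-level attribute for context
with h5py.File("data.h5", "w") as hf:
    for name, data in [("arr_train", arr_train), ("label_train", label_train),
                       ("arr_test", arr_test), ("label_test", label_test)]:
        hf.create_dataset(name, data=data)
    hf.attrs["description"] = "train/test split"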
Later on, I'll try to give more details on my data pipeline.