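When training on a large dataset with Keras, building the training set ahead of time speeds up training.
With the way HDF5 is written here (one create_dataset call per array), all the data for a file has to be read into memory before it can be saved.
For that reason, I split the data into batches and store it in two or more HDF5 files.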
1. First, read the images under each label and assign the label.
def load_dataset(path_name, data_path):
    images = []
    labels = []
    train_images = []
    valid_images = []
    train_labels = []
    valid_labels = []
    counter = 0
    allpath = os.listdir(path_name)
    nb_classes = len(allpath)
    print("label_num: ", nb_classes)
    for child_dir in allpath:
        child_path = os.path.join(path_name, child_dir)
        for dir_image in os.listdir(child_path):
            if dir_image.endswith('.jpg'):
                img = cv2.imread(os.path.join(child_path, dir_image))
                # The original used scipy.misc.imresize(..., interp='bilinear'),
                # which was removed in SciPy 1.3; cv2.resize is bilinear by default.
                image = cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE))
                images.append(image)
                labels.append(counter)
2. Split the data for that label into training images, validation images, training labels, and validation labels.
def split_dataset(images, labels):
    # Hold out 20% of the images for validation
    train_images, valid_images, train_labels, valid_labels = train_test_split(
        images, labels, test_size=0.2, random_state=random.randint(0, 100))
    #print(train_images.shape[0], 'train samples')
    #print(valid_images.shape[0], 'valid samples')
    return train_images, valid_images, train_labels, valid_labels
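3. Append the split data to the overall training set, validation set, training labels, and validation labels.
Then clear the per-label image and label lists to save memory. If the data for many labels were read in at once and split in a single step, it would take more than twice the memory of doing it label by label as above.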
        # (continues inside the "for child_dir in allpath" loop)
        images = np.array(images)
        t_images, v_images, t_labels, v_labels = split_dataset(images, labels)
        for i in range(len(t_images)):
            train_images.append(t_images[i])
            train_labels.append(t_labels[i])
        for j in range(len(v_images)):
            valid_images.append(v_images[j])
            valid_labels.append(v_labels[j])
        if counter % 50 == 49:
            print(counter + 1, "is read to the memory!")
        # free the per-label lists before reading the next label
        images = []
        labels = []
        counter = counter + 1
        print("train_images num: ", len(train_images), " ",
              "valid_images num: ", len(valid_images))
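The per-label split-then-accumulate pattern above can be sketched in isolation. This is a minimal illustration using only the standard library: `split_per_label` is a hypothetical helper, the shuffle-and-slice split is a stand-in for `train_test_split`, and the integer "samples" stand in for images.

```python
import random

def split_per_label(samples_by_label, test_size=0.2, seed=0):
    """Split each label's samples on its own and accumulate the results."""
    rng = random.Random(seed)
    train, valid = [], []                  # accumulated (sample, label) pairs
    for label, samples in enumerate(samples_by_label):
        current = list(samples)            # only this label's data lives here
        rng.shuffle(current)
        n_valid = max(1, round(len(current) * test_size))
        valid.extend((s, label) for s in current[:n_valid])
        train.extend((s, label) for s in current[n_valid:])
        # 'current' is rebound on the next iteration, so the temporary
        # per-label copy never grows beyond one label's worth of data
    return train, valid

# Three labels with ten toy samples each
toy = [range(0, 10), range(10, 20), range(20, 30)]
train, valid = split_per_label(toy)
print(len(train), len(valid))
```

Because the split happens one label at a time, the temporary copies made while splitting are bounded by the largest single label rather than by the whole dataset.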
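4. Keep checking until the label you chose as the split point is reached.
Then start writing. Before writing, shuffle the images and their labels in the same order so that the model trains better.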
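        # (inside the "for child_dir in allpath" loop, before counter is incremented)
        if (counter % 4316 == 4315) or (counter == nb_classes - 1):
            print("start write images and labels data...")
            # Note: the flush interval (4316) and the divisor below (5000) differ,
            # so the first file lands in "h5_-1"; adjust both to your label count.
            num = counter // 5000
            dirs = data_path + "/" + "h5_" + str(num - 1)
            if not os.path.exists(dirs):
                os.makedirs(dirs)
            # write the accumulated lists, not just the last label's split
            data2h5(dirs, train_images, valid_images, train_labels, valid_labels)
            # start the next file with empty lists
            train_images = []
            valid_images = []
            train_labels = []
            valid_labels = []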
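To see which counters trigger a write and which directory each write goes to, the condition above can be replayed without any image data. This is a sketch: `flush_points` mirrors the arithmetic in the snippet, `shard_name` is a hypothetical alternative naming rule, and the label counts are assumed example values, not figures from the original dataset.

```python
def flush_points(nb_classes, every=4316, divisor=5000):
    """List (counter, directory name) pairs at which the loop writes a file."""
    points = []
    for counter in range(nb_classes):
        if counter % every == every - 1 or counter == nb_classes - 1:
            num = counter // divisor
            points.append((counter, "h5_" + str(num - 1)))
    return points

def shard_name(counter, every=4316):
    """Hypothetical alternative: derive the index from the flush interval."""
    return "h5_" + str(counter // every)

original = flush_points(8631)    # assumed label count
larger = flush_points(13000)     # assumed larger label count
renamed = [shard_name(counter) for counter, _ in larger]
print(original)
print(larger)
print(renamed)
```

With 8631 labels the two writes go to distinct directories, but with 13000 labels the last two writes share a directory name, so the later one would overwrite the earlier one. Deriving the index from the same interval as the flush condition avoids that.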
Shuffle images and labels in matching order and write them to HDF5:
def data2h5(dirs_path, train_images, valid_images, train_labels, valid_labels):
    TRAIN_HDF5 = dirs_path + '/' + "train.hdf5"
    VAL_HDF5 = dirs_path + '/' + "val.hdf5"
    # shuffle: restoring the RNG state before the second shuffle
    # applies the same permutation to images and labels
    state1 = np.random.get_state()
    np.random.shuffle(train_images)
    np.random.set_state(state1)
    np.random.shuffle(train_labels)
    state2 = np.random.get_state()
    np.random.shuffle(valid_images)
    np.random.set_state(state2)
    np.random.shuffle(valid_labels)
    datasets = [
        ("train", train_images, train_labels, TRAIN_HDF5),
        ("val", valid_images, valid_labels, VAL_HDF5)]
    for (dType, images, labels, outputPath) in datasets:
        # HDF5 initial
        f = h5py.File(outputPath, "w")
        f.create_dataset("x_" + dType, data=images)
        f.create_dataset("y_" + dType, data=labels)
        #f.create_dataset("x_"+dType, data=images, compression="gzip", compression_opts=9)
        #f.create_dataset("y_"+dType, data=labels, compression="gzip", compression_opts=9)
        f.close()
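The point of the paired shuffle is that image k and label k must still belong together afterwards. As a sketch of the same idea, the helper below uses a single shared permutation index instead of the save-and-restore-state trick. `shuffle_together` is a hypothetical name, and this is a swapped-in technique rather than the code used in this article.

```python
import numpy as np

def shuffle_together(images, labels, seed=None):
    """Reorder images and labels with one shared random permutation."""
    images = np.asarray(images)
    labels = np.asarray(labels)
    if len(images) != len(labels):
        raise ValueError("images and labels must have the same length")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))   # one index array for both
    return images[order], labels[order]

# Toy data: "image" i is filled with the value i and carries label i,
# so alignment is easy to check after shuffling.
images = np.stack([np.full((2, 2), i) for i in range(6)])
labels = np.arange(6)
shuffled_images, shuffled_labels = shuffle_together(images, labels, seed=0)
print(shuffled_labels)
```

Indexing both arrays with the same `order` makes the pairing explicit, and it fails loudly if the two inputs differ in length, a case in which the state-restoring approach would silently misalign them.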
5. Check that all the data was written by reading the files back.
def read_dataset(dirs):
    files = os.listdir(dirs)
    print(files)
    for file in files:
        path = dirs + '/' + file
        dataset = h5py.File(path, "r")
        # "train.hdf5" -> "train", which gives the dataset names x_train / y_train
        file = file.split('.')
        set_x_orig = dataset["x_" + file[0]].shape[0]
        set_y_orig = dataset["y_" + file[0]].shape[0]
        print(set_x_orig)
        print(set_y_orig)
6. During training, read the data in with a generator.
# requires: from keras.preprocessing.image import ImageDataGenerator
def generator(self, datagen, mode):
    # mode is "train" or "val"; both branches of the original shared the
    # same body, so the file name and dataset keys are derived from mode
    passes = np.inf
    aug = ImageDataGenerator(
        featurewise_center=False,
        samplewise_center=False,
        featurewise_std_normalization=False,
        samplewise_std_normalization=False,
        zca_whitening=False,
        rotation_range=20,
        width_shift_range=0.2,
        height_shift_range=0.2,
        horizontal_flip=True,
        vertical_flip=False)
    x_key = "x_" + mode
    y_key = "y_" + mode
    epochs = 0
    # loops forever by default
    while epochs < passes:
        # iterate over every HDF5 directory
        for file in os.listdir(self.data_path):
            file_path = os.path.join(self.data_path, file, mode + ".hdf5")
            # open read-only and close the file once its batches are consumed
            with h5py.File(file_path, "r") as db:
                numImages = db[y_key].shape[0]
                for i in np.arange(0, numImages, self.BS):
                    images = db[x_key][i: i + self.BS]
                    labels = db[y_key][i: i + self.BS]
                    if K.image_data_format() == 'channels_first':
                        # data is stored as (N, H, W, 3); move the channel axis
                        # (a plain reshape would scramble the pixels)
                        images = np.transpose(images, (0, 3, 1, 2))
                    else:
                        images = images.reshape(images.shape[0], IMAGE_SIZE, IMAGE_SIZE, 3)
                    images = images.astype('float32')
                    images = images / 255
                    if datagen:
                        (images, labels) = next(aug.flow(images, labels, batch_size=self.BS))
                    # one-hot encoding
                    if self.binarize:
                        labels = np_utils.to_categorical(labels, self.classes)
                    yield ({'input_1': images}, {'softmax': labels})
        epochs += 1
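Because the generator loops forever, Keras has to be told how many batches make up one epoch (the `steps_per_epoch` argument of `fit_generator`). The slicing `[i: i + BS]` also yields a shorter final batch per file. Both can be worked out ahead of time, as in this sketch. `batch_ranges` and `steps_per_epoch` are hypothetical helper names and the shard sizes are made-up values.

```python
import math

def batch_ranges(num_images, batch_size):
    """Start/stop indices produced by stepping through one file in batches."""
    return [(start, min(start + batch_size, num_images))
            for start in range(0, num_images, batch_size)]

def steps_per_epoch(shard_sizes, batch_size):
    """Batches needed to see every image in every HDF5 file exactly once."""
    return sum(math.ceil(n / batch_size) for n in shard_sizes)

ranges = batch_ranges(10, 4)          # a file with 10 images, batch size 4
steps = steps_per_epoch([10, 7], 4)   # two files with 10 and 7 images
print(ranges, steps)
```

The count is summed per file rather than computed from the total image count, because each file ends with its own partial batch.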
7. And with that, everything is in place.
Complete code:
# -*- coding: utf-8 -*-
"""
Created on Mon Feb 12 20:46:12 2018

@author: william_yue
"""
import os
import numpy as np
import cv2
import random
import h5py
from sklearn.model_selection import train_test_split
from keras import backend as K
K.clear_session()
from keras.utils import np_utils

IMAGE_SIZE = 128

# Split the dataset into training and validation parts
def split_dataset(images, labels):
    # train_test_split() from sklearn holds out 20% of the data for
    # validation and keeps 80% for training
    train_images, valid_images, train_labels, valid_labels = train_test_split(
        images, labels, test_size=0.2, random_state=random.randint(0, 100))
    return train_images, valid_images, train_labels, valid_labels

def data2h5(dirs_path, train_images, valid_images, train_labels, valid_labels):
#def data2h5(dirs_path, train_images, valid_images, test_images, train_labels ,valid_labels, test_labels):
    TRAIN_HDF5 = dirs_path + '/' + "train.hdf5"
    VAL_HDF5 = dirs_path + '/' + "val.hdf5"
    # shuffle the training and validation sets, keeping labels in the
    # same order as their images
    state1 = np.random.get_state()
    np.random.shuffle(train_images)
    np.random.set_state(state1)
    np.random.shuffle(train_labels)
    state2 = np.random.get_state()
    np.random.shuffle(valid_images)
    np.random.set_state(state2)
    np.random.shuffle(valid_labels)
    datasets = [
        ("train", train_images, train_labels, TRAIN_HDF5),
        ("val", valid_images, valid_labels, VAL_HDF5)]
    for (dType, images, labels, outputPath) in datasets:
        # initialize the HDF5 file for writing
        f = h5py.File(outputPath, "w")
        f.create_dataset("x_" + dType, data=images)
        f.create_dataset("y_" + dType, data=labels)
        #f.create_dataset("x_"+dType, data=images, compression="gzip", compression_opts=9)
        #f.create_dataset("y_"+dType, data=labels, compression="gzip", compression_opts=9)
        f.close()

def read_dataset(dirs):
    files = os.listdir(dirs)
    print(files)
    for file in files:
        path = dirs + '/' + file
        file_read = os.listdir(path)
        for i in file_read:
            path_read = os.path.join(path, i)
            dataset = h5py.File(path_read, "r")
            i = i.split('.')
            set_x_orig = dataset["x_" + i[0]].shape[0]
            set_y_orig = dataset["y_" + i[0]].shape[0]
            print(set_x_orig)
            print(set_y_orig)

# Loop over every label directory and read all of its images
def load_dataset(path_name, data_path):
    images = []
    labels = []
    train_images = []
    valid_images = []
    train_labels = []
    valid_labels = []
    counter = 0
    allpath = os.listdir(path_name)
    nb_classes = len(allpath)
    print("label_num: ", nb_classes)
    for child_dir in allpath:
        child_path = os.path.join(path_name, child_dir)
        for dir_image in os.listdir(child_path):
            if dir_image.endswith('.jpg'):
                img = cv2.imread(os.path.join(child_path, dir_image))
                # The original used scipy.misc.imresize(..., interp='bilinear'),
                # which was removed in SciPy 1.3; cv2.resize is bilinear by default.
                image = cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE))
                images.append(image)
                labels.append(counter)
        images = np.array(images)
        t_images, v_images, t_labels, v_labels = split_dataset(images, labels)
        for i in range(len(t_images)):
            train_images.append(t_images[i])
            train_labels.append(t_labels[i])
        for j in range(len(v_images)):
            valid_images.append(v_images[j])
            valid_labels.append(v_labels[j])
        if counter % 50 == 49:
            print(counter + 1, "is read to the memory!")
        images = []
        labels = []
        if (counter % 4316 == 4315) or (counter == nb_classes - 1):
            print("train_images num: ", len(train_images), " ",
                  "valid_images num: ", len(valid_images))
            print("start write images and labels data...")
            # Note: the flush interval (4316) and the divisor below (5000) differ,
            # so the first file lands in "h5_-1"; adjust both to your label count.
            num = counter // 5000
            dirs = data_path + "/" + "h5_" + str(num - 1)
            if not os.path.exists(dirs):
                os.makedirs(dirs)
            data2h5(dirs, train_images, valid_images, train_labels, valid_labels)
            #read_dataset(dirs)
            print("File HDF5_%d " % num, " is done!")
            train_images = []
            valid_images = []
            train_labels = []
            valid_labels = []
        counter = counter + 1
    print("All File HDF5 done!")
    read_dataset(data_path)

# Return the names of the training data folders as a list
def read_name_list(path_name):
    name_list = []
    for child_dir in os.listdir(path_name):
        name_list.append(child_dir)
    return name_list

if __name__ == '__main__':
    path = "data"
    data_path = "data_hdf5_half"
    if not os.path.exists(data_path):
        os.makedirs(data_path)
    load_dataset(path, data_path)