Question:

Modeling data using a DataFrame

曾航

2023-03-14

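I am trying to train a model on a dataset to predict whether an input text comes from a science fiction novel. I am fairly new to Python, so I don't know exactly what I am doing wrong.

Code: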

#class17.py
"""
Created on Fri Nov 17 14:07:36 2017

@author: twaters

Read three science fiction novels
Predict a sentence or paragraph
see whether sentence/phrase/book is from a science fiction novel or not
"""

import nltk
import pandas as pd
import csv
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

from sklearn import model_selection
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from nltk.corpus import stopwords

#nltk.download()


irobot = "C:/Users/twaters/Desktop/Assignments/SQL/Python/DA Project/irobot.txt"
enders_game = "C:/Users/twaters/Desktop/Assignments/SQL/Python/DA Project/endersgame.txt"
space_odyssey ="C:/Users/twaters/Desktop/Assignments/SQL/Python/DA Project/spaceodyssey.txt"
to_kill_a_mockingbird = "C:/Users/twaters/Desktop/Assignments/SQL/Python/DA Project/tokillamockingbird.txt"

sr = set(stopwords.words('english'))
freq = {}

def main():
    #read_novels()
    model_novels()


def read_novel(b, is_scifi):

    read_file = open(b)

    text = read_file.read()
    words = text.split()
    clean_tokens = words[:]
    filtered_list = []

    for word in clean_tokens:
        word = word.lower()
        if word not in sr:
            filtered_list.append(word)

    freq = nltk.FreqDist(clean_tokens)
    #print(filtered_list)
    for word in clean_tokens:
       count = freq.get(word,0)
       freq[word] = count + 1



    frequency_list = freq.keys()

    with open('C:/Users/twaters/Desktop/Assignments/SQL/Python/DA Project/novels_data.txt', 'w', encoding='utf-8') as csvfile:
        fieldnames = ['word','frequency','is_scifi']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator = '\n')
        writer.writeheader()

        for words in frequency_list:
            writer.writerow({'word': words,'frequency': freq[words],'is_scifi':is_scifi})

    print("List compiled.")

def read_novels(): 

    read_novel(enders_game, 0)
    read_novel(space_odyssey, 0)
    read_novel(irobot, 0)
    read_novel(to_kill_a_mockingbird, 1)

def model_novels():

    df = pd.read_csv('C:/Users/twaters/Desktop/Assignments/SQL/Python/DA Project/novels_data.txt', 'rb', delimiter='\t', encoding='utf-8')
    print(df)

    #for index in range(2, df.shape[0], 100):
    df_subset = df.loc[1:]
    #print(df_subset)
    X = df_subset.loc[:, 'frequency':'is_scifi']
    Y = df_subset.loc[:, 'frequency':'is_scifi']
    testing_size = 0.2
    seed = 7
    X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=testing_size, random_state=seed)

    selectedModel = LogisticRegression()
    selectedModel.fit(X_train, Y_train)  
    predictions = selectedModel.predict(X_validation)

#%%
#print("Accuracy Score:\n", accuracy_score(Y_validation, predictions))
#print("Confusion Matrix:\n",confusion_matrix(predictions, Y_validation))
#print("Class report:\n", classification_report(Y_validation, predictions))
#df_test = pd.read_csv('C:/Users/twaters/Desktop/Assignments/SQL/Python/DA Project/novels_data.txt', delimiter='\t')
#predictions_test = selectedModel.predict(df_test)
#test_frame = pd.DataFrame(predictions_test)
#test_frame.to_csv('C:/Users/twaters/Desktop/Assignments/SQL/Python/DA Project/novels_data_result.txt', sep='\t')

Error: Traceback (most recent call last):

File "", line 1, in main()

File "C:/Users/user/Desktop/Assignments/SQL/Python/DA Project/class17.py", line 36, in main model_novels()

File "C:/Users/user/Desktop/Assignments/SQL/Python/DA Project/class17.py", line 95, in model_novels selectedModel.fit(X_train, Y_train)

File "D:\Program Files (x86)\Anaconda\lib\site-packages\sklearn\linear_model\logistic.py", line 1216, in fit order="C")

File "D:\Program Files (x86)\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 573, in check_X_y ensure_min_features, warn_on_dtype, estimator)

File "D:\Program Files (x86)\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 453, in check_array _assert_all_finite(array)

File "D:\Program Files (x86)\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 44, in _assert_all_finite " or a value too large for %r." % X.dtype)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

If you need access to the files I am reading, I can link them.

Thanks for your help!

1 answer

司空玮

2023-03-14

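These are the lines of the stack trace you should pay attention to:

File "C:/Users/user/Desktop/Assignments/SQL/Python/DA Project/class17.py", line 95, in model_novels selectedModel.fit(X_train, Y_train)

File "D:\Program Files (x86)\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 44, in _assert_all_finite " or a value too large for %r." % X.dtype)

They indicate that something in the contents of X prevents LogisticRegression from accepting it: fit() validates its input and rejects NaN, infinite, or out-of-range values.

You should check X_train and X to see whether they contain invalid values.

This answer gives some pointers on how to do that:

Python: check whether any value in a DataFrame is NaN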
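
As a rough illustration of that check, here is a minimal sketch. The small DataFrame is a made-up stand-in for novels_data.txt (its values are invented, not taken from the question's files), and the variable names has_nan, has_inf, nan_per_column and X_clean are only illustrative. It uses pandas' isnull() and NumPy's isinf()/isfinite() to look for the kinds of values named in the ValueError.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data read from novels_data.txt;
# the rows below are invented purely for demonstration.
df = pd.DataFrame({
    "word": ["robot", "ship", None, "lawyer"],
    "frequency": [12.0, np.nan, 3.0, np.inf],
    "is_scifi": [1, 1, 1, 0],
})

# Feature matrix: numeric columns only
X = df[["frequency"]]

# True if any cell in X is NaN
has_nan = X.isnull().values.any()

# Number of missing values in each column of the full frame
nan_per_column = df.isnull().sum()

# True if any cell in X is +inf or -inf
has_inf = np.isinf(X.to_numpy()).any()

# Keep only the rows whose feature values are all finite
finite_mask = np.isfinite(X.to_numpy()).all(axis=1)
X_clean = X[finite_mask]

print("Contains NaN:", has_nan)
print("Contains inf:", has_inf)
print(nan_per_column)
print(X_clean)
```

If either flag is True for the real X or X_train, the offending rows need to be dropped or filled before calling fit(). Whether dropping or filling is appropriate depends on the data, so treat the filtering step above as one possible option rather than the fix.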
