我有一份CSV档案
1 , name , 1012B-Amazon , 2044C-Flipcart , Bosh27-Walmart
2 , name , Kelvi20-Flipcart, LG-Walmart
3, name , Kenstar-Walmart, Sony-Amazon , Kenstar-Flipcart
4, name , LG18-Walmart, Bravia-Amazon
我需要的行被重新排列的网站,即后的部分-
;
1, name , 1012B-Amazon , 2044C-Flipcart , Bosh27-Walmart
2, name , , Kelv20-Flipcart, LG-Walmart
3, name , Sony-Amazon, Kenstar-Flipcart ,Kenstar-Walmart
4, name , Bravia-Amazon, ,LG18-Walmart
有可能使用熊猫吗?找到sting的存在并重新排列它,遍历所有行并对下一个字符串重复此操作?我浏览了系列的文档。str.contains
和str.extract
但无法找到解决方案。
问题更新后编辑:
这是abc csv:
1,name,ABC,GHI,DEF,JKL
2,name,GHI,DEF,ABC,
3,name,JKL,GHI,ABC,DEF
这是公司csv(有必要仔细看逗号):
1,name,1012B-Amazon,2044C-Flipcart,Bosh27-Walmart
2,name,Kelvi20-Flipcart,LG-Walmart,
3,name,Kenstar-Walmart,Sony-Amazon,Kenstar-Flipcart
4,name,LG18-Walmart,Bravia-Amazon,
这是密码
import pandas as pd
import numpy as np
#These solution assume that each value that is not empty is not repeated
#within each row. If that is not the case for your data, it would be possible
#to do some transformations that the non empty values are unique for each row.
#"get_company" returns the company if the value is non-empty and an
#empty value if the value was empty to begin with:
def get_company(company_item):
if pd.isnull(company_item):
return np.nan
else:
company=company_item.split('-')[-1]
return company
#Using the "define_sort_order" function, one can retrieve a template to later
#sort all rows in the sort_abc_rows function. The template is derived from all
#values, aside from empty values, within the matrix when "by_largest_row" = False.
#One could also choose the single largest row to serve as the
#template for all other rows to follow. Both options work similarly when
#all rows are subsets of the largest row i.e. Every element in every
#other row (subset) can be found in the largest row (or set)
#The difference relates to, when the items contain unique elements,
#Whether one wants to create a table with all sorted elements serving
#as the columns, or whether one wants to simply exclude elements
#that are not in the largest row when at least one non-subset row does not exist
#Rather than only having the application of returning the original data rows,
#one can get back a novel template with different values from that of the
#original dataset if one uses a function to operate on the template
def define_sort_order(data,by_largest_row = False,value_filtering_function = None):
if not by_largest_row:
if value_filtering_function:
data = data.applymap(value_filtering_function)
#data.values returns a numpy array
#with rows and columns. .flatten()
#puts all elements in a 1 dim array
#set gets all unique values in the array
filtered_values = list(set((data.values.flatten())))
filtered_values = [data_value for data_value in filtered_values if not_empty(data_value)]
#sorted returns a list, even with np.arrays as inputs
model_row = sorted(filtered_values)
else:
if value_filtering_function:
data = data.applymap(value_filtering_function)
row_lengths = data.apply(lambda data_row: data_row.notnull().sum(),axis = 1)
#locates the numerical index for the row with the most non-empty elements:
model_row_idx = row_lengths.idxmax()
#sort and filter the row with the most values:
filtered_values = list(set(data.iloc[model_row_idx]))
model_row = [data_value for data_value in sorted(filtered_values) if not_empty(data_value)]
return model_row
#"not_empty" is used in the above function in order to filter list models that
#they no empty elements remain
def not_empty(value):
return pd.notnull(value) and value not in ['',' ',None]
#Sorts all element in each _row within their corresponding position within the model row.
#elements in the model row that are missing from the current data_row are replaced with np.nan
def reorder_data_rows(data_row,model_row,check_by_function=None):
#Here, we just apply the same function that we used to find the sorting order that
#we computed when we originally #when we were actually finding the ordering of the model_row.
#We actually transform the values of the data row temporarily to determine whether the
#transformed value is in the model row. If so, we determine where, and order #the function
#below in such a way.
if check_by_function:
sorted_data_row = [np.nan]*len(model_row) #creating an empty vector that is the
#same length as the template, or model_row
data_row = [value for value in data_row.values if not_empty(value)]
for value in data_row:
value_lookup = check_by_function(value)
if value_lookup in model_row:
idx = model_row.index(value_lookup)
#placing company items in their respective row positions as indicated by
#the model_row #
sorted_data_row[idx] = value
else:
sorted_data_row = [value if value in data_row.values else np.nan for value in model_row]
return pd.Series(sorted_data_row)
##################### ABC ######################
#Reading the data:
#the file will automatically include the header as the first row if this the
#header = None option is not included. Note: "name" and the 1,2,3 columns are not in the index.
abc = pd.read_csv("abc.csv",header = None,index_col = None)
# Returns a sorted, non-empty list. IF you hard code the order you want,
# then you can simply put the hard coded order in the second input in model_row and avoid
# all functions aside from sort_abc_rows.
model_row = define_sort_order(abc.iloc[:,2:],False)
#applying the "define_sort_order" function we created earlier to each row before saving back into
#the original dataframe
#lambda allows us to create our own function without giving it a name.
#it is useful in this circumstance in order to use two inputs for sort_abc_rows
abc.iloc[:,2:] = abc.iloc[:,2:].apply(lambda abc_row: reorder_data_rows(abc_row,model_row),axis = 1).values
#Saving to a new csv that won't include the pandas created indices (0,1,2)
#or columns names (0,1,2,3,4):
abc.to_csv("sorted_abc.csv",header = False,index = False)
################################################
################## COMPANY #####################
company = pd.read_csv("company.csv",header=None,index_col=None)
model_row = define_sort_order(company.iloc[:,2:],by_largest_row = False,value_filtering_function=get_company)
#the only thing that changes here is that we tell the sort function what specific
#criteria to use to reorder each row by. We're using the result from the
#get_company function to do so. The custom function get_company, takes an input
#such as Kenstar-Walmart, and outputs Walmart (what's after the "-").
#we would then sort by the resulting list of companies.
#Because we used the define_sort_order function to retrieve companies rather than company items in order,
#We need to use the same function to reorder each element in the DataFrame
company.iloc[:,2:] = company.iloc[:,2:].apply(lambda companies_row: reorder_data_rows(companies_row,model_row,check_by_function=get_company),axis=1).values
company.to_csv("sorted_company.csv",header = False,index = False)
#################################################
这是abc的第一个结果。csv:
1 name ABC DEF GHI JKL
2 name ABC DEF GHI NaN
3 name ABC DEF GHI JKL
将代码修改为后续查询的表单后,这里是已排序的公司。运行脚本产生的csv。
1 name 1012B-Amazon 2044C-Flipcart Bosh27-Walmart
2 name NaN Kelvi20-Flipcart LG-Walmart
3 name Sony-Amazon Kenstar-Flipcart Kenstar-Walmart
4 name Bravia-Amazon NaN LG18-Walmart
我希望有帮助!
假设空值为np.nan
:
# Fill in the empty values with some string to allow sorting
df.fillna('NaN', inplace=True)
# Flatten the dataframe, do the sorting and reshape back to a dataframe
pd.DataFrame(list(map(sorted, df.values)))
0 1 2 3
0 ABC DEF GHI JKL
1 ABC DEF GHI NaN
2 ABC DEF GHI JKL
使现代化
鉴于问题的最新情况和样本数据如下
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
'b': ['1012B-Amazon', 'Kelvi20-Flipcart', 'Kenstar-Walmart', 'LG18-Walmart'],
'c': ['2044C-Flipcart', 'LG-Walmart', 'Sony-Amazon', 'Bravia-Amazon'],
'd': ['Bosh27-Walmart', np.nan, 'Kenstar-Flipcart', np.nan]})
一个可能的解决办法是
def foo(df, retailer):
# Find cells that contain the name of the retailer
mask = df.where(df.apply(lambda x: x.str.contains(retailer)), '')
# Squash the resulting mask into a series
col = mask.max(skipna=True, axis=1)
# Optional: trim the name of the retailer
col = col.str.replace(f'-{retailer}', '')
return col
df_out = pd.DataFrame(df['name'])
for retailer in ['Amazon', 'Walmart', 'Flipcart']:
df_out[retailer] = foo(df, retailer)
导致
name Amazon Walmart Flipcart
0 name1 1012B Bosh27 2044C
1 name2 LG Kelvi20
2 name3 Sony Kenstar Kenstar
3 name4 Bravia LG18
使用带有键的
排序
df.iloc[:,1:].apply(lambda x : sorted(x,key=lambda y: (y=='',y)),1)
2 3 4 5
1 ABC DEF GHI JKL
2 ABC DEF GHI
3 ABC DEF GHI JKL
#df.iloc[:,1:]=df.iloc[:,1:].apply(lambda x : sorted(x,key=lambda y: (y=='',y)),1)
既然你提到了
reindex
我认为get_dummies
会起作用
s=pd.get_dummies(df.iloc[:,1:],prefix ='',prefix_sep='')
s=s.drop('',1)
df.iloc[:,1:]=s.mul(s.columns).values
df
1 2 3 4 5
1 name ABC DEF GHI JKL
2 name ABC DEF GHI
3 name ABC DEF GHI JKL
我正在尝试按“百分比”对数据帧的内容进行排序。那种似乎不起作用。 代码-在此处输入图像描述
问题内容: 我有一堆具有相同列但顺序不同的csv文件。我们正在尝试使用SQL * Plus上载它们,但是我们需要具有固定列排列的列。 例 所需订单:ABCDEF csv文件:ACDEB(有时列不在csv中,因为它不可用) 用python可以实现吗?我们正在使用Access + Macros来做…但这太浪费时间了 PS。对不起,如果有人对我的英语能力感到沮丧。 问题答案: 您可以使用csv模块读取,
我正在尝试对两个熊猫数据框列进行排序。我知道Python有自己的内置函数: 但我想知道熊猫是否也有这个功能,是否可以将两列作为一对一起完成。 例如,我有以下数据集: 我想获得以下信息: 基本上我在这里做的是,我正在对列“特征”进行排序,以从最小值到最大值,但是我希望“总和”中的相应值也发生变化。 有人能帮我解决这个问题吗?我在Stackoverflow上看到过其他帖子,但是我没有找到解释这个过程的
问题内容: 如何在不更改HTML源代码的情况下重新排序div? 例如,我希望div以#div2,#div1,#div3的顺序显示,但是在HTML中它们是: 谢谢! 问题答案: 没有使用css对元素进行重新排序的万能方法。 您可以通过将它们全部向右浮动来水平反转它们的顺序。或者,您可以将它们相对于正文或其他包含元素的绝对位置进行定位- 但这对元素的大小以及相对于页面上其他元素的位置存在严格的限制。
问题内容: 如何对pandas groupby操作应用排序?下面的命令返回一个错误,指出“布尔”对象不可调用 问题答案: 通常,排序是在groupby键上执行的,并且您发现您无法调用groupby对象,您可以做的是调用并传递函数并将列作为kwarg参数传递: 另外,您可以在分组之前对df进行排序: 更新资料 对于不建议使用的版本,请参见docs,现在应使用: 在这里在评论中添加@xgdgsc的答案
问题内容: 我在SO上看到了很多与此问题有关的问题,但都与我的情况无关。除了进行基本的CRUD操作外,我与SQL专家们无二。因此,我对此非常执着。 我有一张表。 在哪里; myTable始终总有10行。 因此,首先,假设myTable具有以下数据: 预期的功能应如下所示; 插入新 当插入新项目时,用户应该能够将其插入到他/她想要的任何位置。现在,我仅设法通过删除OrderPosition = 10