I want to do a text-mining analysis but have run into some trouble. Using dput(), I have included a small sample of the text.
text<-structure(list(ID_C_REGCODES_CASH_VOUCHER = c(3941L, 3941L, 3941L,
3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3953L, 3953L,
3953L, 3953L, 3953L, 3953L, 3960L, 3960L, 3960L, 3960L, 3960L,
3960L, 3967L, 3967L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), GOODS_NAME = structure(c(19L,
17L, 15L, 18L, 16L, 23L, 21L, 14L, 22L, 20L, 6L, 2L, 10L, 8L,
7L, 13L, 5L, 11L, 7L, 12L, 4L, 3L, 9L, 9L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("", "* 2108609 SLOB.Mayon.OLIVK.67% 400ml", "* 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg",
"* 3398012 DD Kolb.SERV.OKHOTN in / to v / y0.35", "* 3426789 WH.The corn rav guava / yagn.d / CAT seed 85g",
"197 Onion 1 kg", "2013077 MAKFA Makar.RAKERS 450g", "2030918 MARIA TRADITIONAL Biscuit 180g",
"2049750 MAKFA Makar.SHIGHTS 450g", "3420159 LEBED.Mol.past.3,4-4,5% 900g",
"3491144 LIP.NAP.ICE TEA green yellow 0.5 liter", "6788 MAKFA Makar.perya 450g",
"809 Bananas 1kg", "FetaXa Cheese product 60% 400g (", "Lemons 55+",
"MAKFA Macaroni feathers like. in / with", "Napkins paper color 100pcs PL",
"Package \"Magnet\" white (Plastiktre)", "Pasta Makfa snail flow-pack 450 g.",
"SHEBEKINSKIE Macaroni Butterfly №40", "SOFT Cotton sticks 100 PE (BELL",
"TENDER AGE Cottage cheese 10", "TOBUS steering-wheel 0.5kg flow"
), class = "factor")), .Names = c("ID_C_REGCODES_CASH_VOUCHER",
"GOODS_NAME"), class = "data.frame", row.names = c(NA, -61L))
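(The NA values are incidental.) The text consists of product names taken from receipts.
I want to group similar names together.
For example, here I manually picked out MAKFA Makar (a Ukrainian name). I found 7 rows with the root or keyword "MAKFA Makar":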
Pasta Makfa snail flow-pack 450 g.
MAKFA Macaroni feathers like. in / with
2013077 MAKFA Makar.RAKERS 450g
2013077 MAKFA Makar.RAKERS 450g
6788 MAKFA Makar.perya 450g
2049750 MAKFA Makar.SHIGHTS 450g
2049750 MAKFA Makar.SHIGHTS 450g
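All of these product entries share the same root. The root should stay readable as MAKFA Makar, not be reduced to something like MFAMKR.
This is the output I want: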
Initially class
1 Pasta Makfa snail flow-pack 450 g. MAKFA Makar.
2 MAKFA Macaroni feathers like. in / with MAKFA Makar.
3 2013077 MAKFA Makar.RAKERS 450g MAKFA Makar.
4 2013077 MAKFA Makar.RAKERS 450g MAKFA Makar.
5 6788 MAKFA Makar.perya 450g MAKFA Makar.
6 2049750 MAKFA Makar.SHIGHTS 450g MAKFA Makar.
7 2049750 MAKFA Makar.SHIGHTS 450g MAKFA Makar.
8 * 3398012 DD Kolb.SERV.OKHOTN in / to v / y0.35 kolb
9 * 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg Spikachki
10 809 Bananas 1kg Bananas
11 Lemons 55+ Lemons
12 Napkins paper color 100pcs PL Napkins paper
13 SOFT Cotton sticks 100 PE (BELL Cotton sticks
14 SHEBEKINSKIE Macaroni Butterfly №40 SHEBEKINSKIE Macaroni
15 * 3426789 WH.The corn rav guava / yagn.d / Cat SEED 85g CAT seed
16 FetaXa Cheese product 60% 400g ( Cheese
17 3491144 LIP.NAP.ICE TEA green yellow 0.5 liter TEA
18 2030918 MARIA TRADITIONAL Biscuit 180g Biscuit
19 197 Onion 1 kg Onion
20 TOBUS steering-wheel 0.5kg flow steering-wheel
21 Package "Magnet" white (Plastiktre) Package (Plastiktre)
22 * 2108609 SLOB.Mayon.OLIVK.67% 400ml Mayon
23 TENDER AGE Cottage cheese 10 Cottage cheese
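How can I classify the products by their root word? (In other words, the same pattern appears across words such as Makar, Makfa, and cheese.)
Here is one approach that searches for a vector of words: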
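# Keywords to search for in the product names.
# Note: grep() treats each entry as a case-sensitive regular expression, so
# "kolb" and "Package (Plastiktre)" match no rows in the output below.
patt <- c("MAKFA Makar.", "kolb", "Spikachki", "Bananas", "Lemons",
          "Napkins paper", "Cotton sticks", "SHEBEKINSKIE Macaroni", "CAT seed", "Cheese",
          "TEA", "Biscuit", "Onion", "steering-wheel", "Package (Plastiktre)",
          "Mayon", "Cottage", "cheese")
# For each keyword, keep the rows whose GOODS_NAME matches it
lst <- lapply(patt, function(x) text[grep(x, text$GOODS_NAME), ])
# Stack the per-keyword results into one data frame
do.call(rbind.data.frame, lst)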
ID_C_REGCODES_CASH_VOUCHER GOODS_NAME
15 3953 2013077 MAKFA Makar.RAKERS 450g
19 3960 2013077 MAKFA Makar.RAKERS 450g
20 3960 6788 MAKFA Makar.perya 450g
23 3967 2049750 MAKFA Makar.SHIGHTS 450g
24 3967 2049750 MAKFA Makar.SHIGHTS 450g
22 3960 * 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg
16 3953 809 Bananas 1kg
3 3941 Lemons 55+
2 3941 Napkins paper color 100pcs PL
7 3945 SOFT Cotton sticks 100 PE (BELL
10 3945 SHEBEKINSKIE Macaroni Butterfly №40
17 3960 * 3426789 WH.The corn rav guava / yagn.d / CAT seed 85g
8 3945 FetaXa Cheese product 60% 400g (
18 3960 3491144 LIP.NAP.ICE TEA green yellow 0.5 liter
14 3953 2030918 MARIA TRADITIONAL Biscuit 180g
11 3953 197 Onion 1 kg
6 3945 TOBUS steering-wheel 0.5kg flow
12 3953 * 2108609 SLOB.Mayon.OLIVK.67% 400ml
9 3945 TENDER AGE Cottage cheese 10
91 3945 TENDER AGE Cottage cheese 10
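The keyword search above can also be turned into a class column, which is closer to the requested output. The following is only a minimal sketch in base R: `classify_by_keyword` is a hypothetical helper (not part of any package), the keyword list is a shortened assumption, and matching is done as case-insensitive literal text so that "Pasta Makfa" and "MAKFA Makar." land in the same class.

```r
# Hypothetical helper: label each product name with the first keyword it contains.
# Matching is literal (fixed = TRUE) and case-insensitive (both sides lowercased).
classify_by_keyword <- function(names, keywords) {
  names_lc <- tolower(names)
  out <- rep(NA_character_, length(names))
  for (kw in keywords) {
    # Only fill names that have not been labelled by an earlier keyword
    hit <- is.na(out) & grepl(tolower(kw), names_lc, fixed = TRUE)
    out[hit] <- kw
  }
  out
}

goods <- c("Pasta Makfa snail flow-pack 450 g.",
           "2013077 MAKFA Makar.RAKERS 450g",
           "809 Bananas 1kg",
           "TENDER AGE Cottage cheese 10",
           "FetaXa Cheese product 60% 400g (")

# Order matters: more specific keywords come before more general ones
keywords <- c("MAKFA", "Bananas", "Cottage cheese", "Cheese")

result <- data.frame(Initially = goods,
                     class = classify_by_keyword(goods, keywords),
                     stringsAsFactors = FALSE)
print(result)
```

Names that contain none of the keywords get NA, which makes it easy to see which products still need a keyword.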
I think you can get where you want by cleaning and then clustering your text. Here is a start:
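text <- text[1:24, ]  # keep only the rows that have a product name
library(quanteda)
# With quanteda >= 3.0, also run library(quanteda.textstats) for textstat_simil()
library(tidyverse)
hc <- text %>%
  pull(GOODS_NAME) %>%
  as.character %>%
  # Tokenize, dropping numbers, punctuation, symbols and separators
  quanteda::tokens(
    remove_numbers = TRUE,
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_separators = TRUE
  ) %>%
  quanteda::tokens_tolower() %>%
  # Drop tokens that start with a digit, such as "450g"
  quanteda::tokens_remove(valuetype = "regex", pattern = c("^\\d.*")) %>%
  quanteda::dfm() %>%
  # Jaccard similarity between product names
  textstat_simil(method = "jaccard") %>%
  # Negate so that more similar names have smaller values
  magrittr::multiply_by(-1) %>%
  `attr<-`("Labels", text$GOODS_NAME) %>%
  hclust(method = "average")
# Draw the dendrogram into a temporary PDF
pdf(tf <- tempfile(fileext = ".pdf"), width = 20, height = 10)
plot(hc)
dev.off()
shell.exec(tf)  # Windows only; on other systems open the file manually
# Cut the tree at a similarity of 0.1 and list the products in each cluster
clusters <- cutree(hc, h = -0.1)
split(text, clusters)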
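To show what the clustering step is doing without depending on a particular quanteda version, here is a rough base-R sketch of the same idea: tokenize each name, compute Jaccard distances between the token sets, and cut an average-linkage tree. The `tokenize` helper, the five sample names, and the cut height of 0.75 are assumptions chosen for illustration, and the tokenization is simpler than the quanteda one.

```r
goods <- c("2013077 MAKFA Makar.RAKERS 450g",
           "6788 MAKFA Makar.perya 450g",
           "2049750 MAKFA Makar.SHIGHTS 450g",
           "809 Bananas 1kg",
           "Lemons 55+")

# Hypothetical helper: lowercase, split on anything that is not a letter,
# and keep the unique words longer than one character.
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^[:alpha:]]+")[[1]]
  unique(words[nchar(words) > 1])
}

tokens <- lapply(goods, tokenize)
n <- length(tokens)

# Jaccard distance = 1 - |intersection| / |union| of the two token sets.
# This assumes every name has at least one token, so the union is never empty.
d <- matrix(0, n, n, dimnames = list(goods, goods))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    inter <- length(intersect(tokens[[i]], tokens[[j]]))
    uni <- length(union(tokens[[i]], tokens[[j]]))
    d[i, j] <- 1 - inter / uni
  }
}

# Average-linkage clustering, then cut the tree at a distance of 0.75
hc <- hclust(as.dist(d), method = "average")
clusters <- cutree(hc, h = 0.75)
split(goods, clusters)
```

Here the three MAKFA names share two of four tokens pairwise (distance 0.5), so they fall below the cut and end up together, while the other two names stay in clusters of their own. A lower cut height gives more, smaller groups.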