Question:

Importing an Excel file with Greek characters into R with the correct encoding

唐宇定
2023-03-14

I'm having trouble importing the following file into R with the correct encoding: http://www.kuleuven.be/bio/ento/temp/test.xlsx. In particular,

library("xlsx")
read.xlsx("test.xlsx",1,header=F,colClasses=c("character"),encoding="UTF-8")

gives me

                                             X1
1                                     a-cadinol
2                                  a-calacorene
3                       a-caryophyllene alcohol
4                                   a-curcumene
5                                      a-elemol
6                                   a-muurolene
7                           a-terpineol acetate
8  ß-4-dimethyl-3-cyclohexane-1-ethanol acetate
9                                  ß-bisabolene
10                                  ß-bisabolol
11                                 ß-bourbonene
12                      ß-caryophyllene alcohol
13                                ß-cyclocitral
14                                   ß-farnesol
15                                   ß-selinene
16                         ß-sesquiphellandrene
17                            <U+03B3>-cadinene
18  <U+03B3>-Carboethoxy-<U+03B3>-butyrolactone
19        <U+03B3>-ethyl-<U+03B3>-butyrolactone
20                            <U+03B3>-eudesmol
21                           <U+03B3>-muurolene
22                         <U+03B3>-nonalactone
23                         <U+03B3>-octalactone
24                            <U+03B3>-selinene
25                       <U+03B3>-undecalactone
26                                   d-cadinene
27                                    d-cadinol
28                                  d-muurolene
29                              d-undecalactone

but the prefixes a-, ß-, <U+03B3>- and d- should be displayed as Greek letters (α-, β-, γ-, δ-).

Any ideas on how to import the file with the correct encoding?

I'm working on Windows, and iconvlist() gives me

  [1] "437"                     "850"                     "852"                     "855"                     "857"                    
  [6] "860"                     "861"                     "862"                     "863"                     "865"                    
 [11] "866"                     "869"                     "ANSI_X3.4-1968"          "ANSI_X3.4-1986"          "ASCII"                  
 [16] "ASMO-708"                "BIG-5"                   "BIG-FIVE"                "big5"                    "BIG5"                   
 [21] "big5-hkscs"              "BIG5-HKSCS"              "big5hkscs"               "BIG5HKSCS"               "CP-GR"                  
 [26] "CP-IS"                   "cp1025"                  "CP1125"                  "CP1133"                  "CP1200"                 
 [31] "CP12000"                 "CP12001"                 "CP1201"                  "CP1250"                  "CP1251"                 
 [36] "CP1252"                  "CP1253"                  "CP1254"                  "CP1255"                  "CP1256"                 
 [41] "CP1257"                  "CP1258"                  "CP1361"                  "CP154"                   "CP367"                  
 [46] "CP437"                   "CP50221"                 "CP51932"                 "CP65001"                 "CP737"                  
 [51] "CP775"                   "CP819"                   "CP850"                   "CP852"                   "CP853"                  
 [56] "CP855"                   "CP857"                   "CP858"                   "CP860"                   "CP861"                  
 [61] "CP862"                   "CP863"                   "CP864"                   "CP865"                   "cp866"                  
 [66] "CP866"                   "CP869"                   "CP874"                   "cp875"                   "CP932"                  
 [71] "CP936"                   "CP949"                   "CP950"                   "CSASCII"                 "CSIBM855"               
 [76] "CSIBM857"                "CSIBM860"                "CSIBM861"                "CSIBM863"                "CSIBM864"               
 [81] "CSIBM865"                "CSIBM866"                "CSIBM869"                "csISO2022JP"             "CSISOLATIN1"            
 [86] "CSPC775BALTIC"           "CSPC850MULTILINGUAL"     "CSPC862LATINHEBREW"      "CSPC8CODEPAGE437"        "CSPCP852"               
 [91] "CSPTCP154"               "CSWINDOWS31J"            "CYRILLIC-ASIAN"          "DOS-720"                 "DOS-862"                
 [96] "EUC-CN"                  "euc-jp"                  "euc-kr"                  "EUC-KR"                  "EUCCN"                  
[101] "eucjp"                   "euckr"                   "GB18030"                 "gb2312"                  "GBK"                    
[106] "hz-gb-2312"              "IBM-CP1133"              "IBM-Thai"                "IBM00858"                "IBM00924"               
[111] "IBM01047"                "IBM01140"                "IBM01141"                "IBM01142"                "IBM01143"               
[116] "IBM01144"                "IBM01145"                "IBM01146"                "IBM01147"                "IBM01148"               
[121] "IBM01149"                "IBM037"                  "IBM1026"                 "IBM273"                  "IBM277"                 
[126] "IBM278"                  "IBM280"                  "IBM284"                  "IBM285"                  "IBM290"                 
[131] "IBM297"                  "IBM367"                  "IBM420"                  "IBM423"                  "IBM424"                 
[136] "IBM437"                  "IBM437"                  "IBM500"                  "ibm737"                  "ibm775"                 
[141] "IBM775"                  "IBM819"                  "ibm850"                  "IBM850"                  "ibm852"                 
[146] "IBM852"                  "IBM855"                  "IBM855"                  "ibm857"                  "IBM857"                 
[151] "IBM860"                  "IBM860"                  "ibm861"                  "IBM861"                  "IBM862"                 
[156] "IBM863"                  "IBM863"                  "IBM864"                  "IBM864"                  "IBM865"                 
[161] "IBM865"                  "IBM866"                  "ibm869"                  "IBM869"                  "IBM870"                 
[166] "IBM871"                  "IBM880"                  "IBM905"                  "iso-2022-jp"             "iso-2022-jp"            
[171] "ISO-2022-JP"             "ISO-2022-JP-MS"          "iso-2022-kr"             "ISO-8859-1"              "iso-8859-13"            
[176] "iso-8859-15"             "iso-8859-2"              "iso-8859-3"              "iso-8859-4"              "iso-8859-5"             
[181] "iso-8859-6"              "iso-8859-7"              "iso-8859-8"              "iso-8859-8-i"            "iso-8859-9"             
[186] "ISO-IR-100"              "ISO-IR-6"                "ISO_646.IRV:1991"        "ISO_8859-1"              "ISO_8859-1:1987"        
[191] "ISO2022-JP"              "ISO2022-JP-MS"           "iso2022-kr"              "ISO646-US"               "iso8859-1"              
[196] "ISO8859-1"               "iso8859-13"              "iso8859-15"              "iso8859-2"               "iso8859-3"              
[201] "iso8859-4"               "iso8859-5"               "iso8859-6"               "iso8859-7"               "iso8859-8"              
[206] "iso8859-8-i"             "iso8859-9"               "Johab"                   "JOHAB"                   "koi8-r"                 
[211] "koi8-u"                  "ks_c_5601-1987"          "L1"                      "latin-9"                 "LATIN1"                 
[216] "latin2"                  "latin3"                  "latin4"                  "latin5"                  "latin7"                 
[221] "latin9"                  "mac"                     "mac-centraleurope"       "mac-is"                  "macarabic"              
[226] "maccentraleurope"        "maccroatian"             "maccyrillic"             "macgreek"                "machebrew"              
[231] "maciceland"              "macintosh"               "macis"                   "macroman"                "macromania"             
[236] "macthai"                 "macturkish"              "macukraine"              "macukrainian"            "MS-ANSI"                
[241] "MS-ARAB"                 "MS-CYRL"                 "MS-EE"                   "MS-GREEK"                "MS-HEBR"                
[246] "MS-TURK"                 "MS50221"                 "MS51932"                 "MS932"                   "MS936"                  
[251] "PT154"                   "PTCP154"                 "SHIFFT_JIS"              "SHIFFT_JIS-MS"           "shift-jis"              
[256] "shift_jis"               "SJIS"                    "SJIS-MS"                 "SJIS-OPEN"               "SJIS-WIN"               
[261] "UCS-2"                   "UCS-2BE"                 "UCS-2LE"                 "UCS-4"                   "UCS-4BE"                
[266] "UCS-4BE"                 "UCS-4LE"                 "UCS-4LE"                 "UCS2"                    "UCS2BE"                 
[271] "UCS2LE"                  "UCS4"                    "UCS4BE"                  "UCS4LE"                  "UHC"                    
[276] "unicodeFFFE"             "US"                      "US-ASCII"                "UTF-16"                  "UTF-16BE"               
[281] "UTF-16LE"                "UTF-32"                  "UTF-32BE"                "UTF-32LE"                "UTF-8"                  
[286] "UTF16"                   "UTF16BE"                 "UTF16LE"                 "UTF32"                   "UTF32BE"                
[291] "UTF32LE"                 "UTF8"                    "WINBALTRIM"              "windows-1250"            "windows-1251"           
[296] "windows-1252"            "windows-1253"            "windows-1254"            "windows-1255"            "windows-1256"           
[301] "windows-1257"            "windows-1258"            "WINDOWS-31J"             "WINDOWS-50221"           "WINDOWS-51932"          
[306] "windows-874"             "WINDOWS-932"             "WINDOWS-936"             "x-Chinese_CNS"           "x-cp20001"              
[311] "x-cp20003"               "x-cp20004"               "x-cp20005"               "x-cp20261"               "x-cp20269"              
[316] "x-cp20936"               "x-cp20949"               "x-cp50227"               "x-EBCDIC-KoreanExtended" "x-Europa"               
[321] "x-IA5"                   "x-IA5-German"            "x-IA5-Norwegian"         "x-IA5-Swedish"           "x-iscii-as"             
[326] "x-iscii-be"              "x-iscii-de"              "x-iscii-gu"              "x-iscii-ka"              "x-iscii-ma"             
[331] "x-iscii-or"              "x-iscii-pa"              "x-iscii-ta"              "x-iscii-te"              "x-mac-arabic"           
[336] "x-mac-ce"                "x-mac-chinesesimp"       "x-mac-chinesetrad"       "x-mac-croatian"          "x-mac-cyrillic"         
[341] "x-mac-greek"             "x-mac-hebrew"            "x-mac-icelandic"         "x-mac-japanese"          "x-mac-korean"           
[346] "x-mac-romanian"          "x-mac-thai"              "x-mac-turkish"           "x-mac-ukrainian"         "x_Chinese-Eten"   

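I tried many of these, but to no avail... Unfortunately, I also don't know which encoding Excel saved my file in...

Also, is there a simple function in R that would let me convert all the Greek letters alpha, beta, gamma and delta (in their original encoding) to "alpha", "beta", "gamma" and "delta" (i.e. written out in full)? Or to do the reverse, i.e. convert the written-out "alpha", "beta", "gamma", etc. to the single Greek characters?

Edit: regarding my last question, I tried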

togreek=function(compname) {
  n=as.character(compname,encoding="UTF-8")
  n=gsub("alpha","\u03B1",n)
  n=gsub("beta","\u03B2",n)
  n=gsub("gamma","\u03B3",n)
  n=gsub("delta","\u03B4",n)
  n=gsub("epsilon","\u03B5",n)
  n
}

tolatin=function(compname) {
  n=as.character(compname,encoding="UTF-8")
  n=gsub("\u03B1","alpha",n)
  n=gsub("\u03B2","beta",n)
  n=gsub("\u03B3","gamma",n)
  n=gsub("\u03B4","delta",n)
  n=gsub("\u03B5","epsilon",n)
  n
}
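As an aside, the two functions above can also be written table-driven. This is only a sketch: the names greek_map, replace_all, to_greek and to_latin are made up for illustration, and it uses enc2utf8() plus fixed = TRUE so that the patterns are matched literally as UTF-8 text. It assumes that only the five letters listed need handling.

```r
# Lookup table: names are the spelled-out forms, values are the Greek letters.
greek_map <- c(alpha = "\u03B1", beta = "\u03B2", gamma = "\u03B3",
               delta = "\u03B4", epsilon = "\u03B5")

# Replace each element of `from` with the matching element of `to`, literally.
replace_all <- function(x, from, to) {
  x <- enc2utf8(as.character(x))  # make sure the input is UTF-8
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

to_greek <- function(x) replace_all(x, names(greek_map), unname(greek_map))
to_latin <- function(x) replace_all(x, unname(greek_map), names(greek_map))
```

With this layout, adding another letter means adding one entry to greek_map rather than one gsub() line to each function.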

The tolatin function seems to work:

library("xlsx")
test=read.xlsx("test.xlsx",1,header=F,colClasses=c("character"),encoding="UTF-8")
tolatin(test$X1)
 [1] "alpha-cadinol"                                   "alpha-calacorene"                                "alpha-caryophyllene alcohol"                    
 [4] "alpha-curcumene"                                 "alpha-elemol"                                    "alpha-muurolene"                                
 [7] "alpha-terpineol acetate"                         "beta-4-dimethyl-3-cyclohexane-1-ethanol acetate" "beta-bisabolene"                                
[10] "beta-bisabolol"                                  "beta-bourbonene"                                 "beta-caryophyllene alcohol"                     
[13] "beta-cyclocitral"                                "beta-farnesol"                                   "beta-selinene"                                  
[16] "beta-sesquiphellandrene"                         "gamma-cadinene"                                  "gamma-Carboethoxy-gamma-butyrolactone"          
[19] "gamma-ethyl-gamma-butyrolactone"                 "gamma-eudesmol"                                  "gamma-muurolene"                                
[22] "gamma-nonalactone"                               "gamma-octalactone"                               "gamma-selinene"                                 
[25] "gamma-undecalactone"                             "delta-cadinene"                                  "delta-cadinol"                                  
[28] "delta-muurolene"                                 "delta-undecalactone"  

But if I then convert back to Greek characters, I run into the problem again:

togreek(tolatin(test$X1))

 [1] "α-cadinol"                                   "α-calacorene"                                "α-caryophyllene alcohol"                    
 [4] "α-curcumene"                                 "α-elemol"                                    "α-muurolene"                                
 [7] "α-terpineol acetate"                         "ß-4-dimethyl-3-cyclohexane-1-ethanol acetate" "ß-bisabolene"                                
[10] "ß-bisabolol"                                  "ß-bourbonene"                                 "ß-caryophyllene alcohol"                     
[13] "ß-cyclocitral"                                "ß-farnesol"                                   "ß-selinene"                                  
[16] "ß-sesquiphellandrene"                         "<U+03B3>-cadinene"                            "<U+03B3>-Carboethoxy-<U+03B3>-butyrolactone" 
[19] "<U+03B3>-ethyl-<U+03B3>-butyrolactone"        "<U+03B3>-eudesmol"                            "<U+03B3>-muurolene"                          
[22] "<U+03B3>-nonalactone"                         "<U+03B3>-octalactone"                         "<U+03B3>-selinene"                           
[25] "<U+03B3>-undecalactone"                       "d-cadinene"                                   "d-cadinol"                                   
[28] "d-muurolene"                                  "d-undecalactone"  

Do you know what I'm doing wrong?


2 answers

卓俊晖
2023-03-14
Anonymous user

Try this:

library(stringi)

?stringi

Use this to detect your file's encoding:

stri_enc_detect("path-to-your-file/your-file.csv", filter_angle_brackets = T)

stri_enc_detect2("path-to-your-file/your-file.csv", locale = NULL)

The first one worked for me.

Then pass the result to read.csv(), like this:

df <- read.csv("path-to-your-file/your-file.csv", header = TRUE, sep = ";",
               quote = "\"", na.strings = "", dec = ".", skip = 2, check.names = TRUE,
               fileEncoding = "YOUR RESULT OF stri_enc_detect", encoding = "UTF-8")
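Once the encoding is known, the raw bytes can also be converted explicitly with base R's iconv(), which does not depend on the native locale. A minimal sketch, assuming the file is in CP1253 (Windows Greek, one of the names in the iconvlist() output in the question); the temporary file, its contents and the variable names are made up for illustration:

```r
# Build a small CP1253-encoded file to stand in for the real one (illustrative data).
greek <- c("name", "\u03B1-cadinol", "\u03B3-cadinene")
path <- tempfile(fileext = ".csv")
bytes <- iconv(paste(greek, collapse = "\n"), from = "UTF-8", to = "CP1253",
               toRaw = TRUE)[[1]]
writeBin(bytes, path)

# Read the bytes untouched, then convert them to UTF-8 in one explicit step.
raw_bytes <- readBin(path, what = "raw", n = file.size(path))
txt <- iconv(list(raw_bytes), from = "CP1253", to = "UTF-8")
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]

# Inspect the first code point of each data line instead of relying on print().
first_cp <- vapply(lines[-1], function(s) utf8ToInt(s)[1L], integer(1),
                   USE.NAMES = FALSE)
```

If the assumed encoding is wrong, iconv() may return NA or the wrong letters, so checking the code points is a quick sanity test.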

侯善
2023-03-14

Try this:
Sys.setlocale(category = "LC_ALL", locale = "Greek")
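Before changing the locale, it may be worth checking whether the strings themselves are damaged or only their printed form is. The fact that tolatin() in the question matched "\u03B3" suggests the data may be intact, and that <U+03B3> is just how print() escapes a character the console's native encoding cannot show. A small sketch with stand-in data (the vector x is made up; in practice it would be the imported column):

```r
# Stand-in for the imported column (hypothetical data).
x <- c("\u03B1-cadinol", "\u03B3-cadinene")

# Declared encoding of each element.
enc <- Encoding(x)

# Unicode code point of the first character of each element,
# which does not depend on how the console renders the string.
first_cp <- vapply(x, function(s) utf8ToInt(s)[1L], integer(1),
                   USE.NAMES = FALSE)
labels <- sprintf("U+%04X", first_cp)
```

If the code points come out as U+03B1, U+03B3 and so on, the import worked and only the display depends on the locale.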
