Removing U+FEFF from the head of a file: a quick tale about FEFF, the invisible UTF-8 character that broke our CSV file

司空祯
2023-12-01

Today, we encountered an error while trying to create some database seeds from a CSV. I had originally generated this CSV with a Ruby script, piping its output to a file that was saved as a CSV.

The CSV was checked in to Git and had been used for a while until we had to update some parts of it by adding a new column and fixing some values.

While we don’t know the exact reason yet, my theory is that somehow, Excel for Mac (we are all using Macs) added some additional metadata to it even after saving the file as a CSV.

This in turn made anyone using the seed receive the following error:

CSV::MalformedCSVError: Illegal quoting in line 1.

I opened the CSV file and nothing looked suspicious. My first thought was some left/right quotation marks were somehow mixed into the file instead of just the ‘normal’ double quotes: ". But upon further investigation, there was nothing out of the ordinary. This led me to just wipe out the whole file, and actually type out the first row again.

I saved that file again and ran the migration:

CSV::MalformedCSVError: Illegal quoting in line 1.

What?!

Okay, this was driving me nuts. I opened up a new file, typed the exact single line again, and ran the migration. It worked. So what was in that file?!

Only one way to find out:

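pbcopy < companies.csv      # copy the file's contents to the clipboard
pbpaste > temp.csv          # write the clipboard back out as plain text
rm companies.csv            # remove the original file
mv temp.csv companies.csv   # put the pasted copy in its place
git diff                    # show what changed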

So macOS has two very useful commands: pbcopy and pbpaste. Basically, anything piped to pbcopy goes onto your clipboard, and pbpaste writes whatever is on your clipboard to standard output (stdout), as plain text with all formatting removed.

That is very useful when you want to copy some text from somewhere and paste it into a WYSIWYG editor without all the formatting, for example when writing an email in Gmail.

I then removed the original file and saved the new ‘unformatted’ file with the same file name so I could see the difference.

And we finally saw the invisible man:

A quick Google search told us that our friend U+FEFF was called a ZERO WIDTH NO-BREAK SPACE. Also, a quick trip to Wikipedia told us about the actual uses for U+FEFF, more commonly known as Byte order mark or BOM.

Our friend FEFF can mean different things, but it's basically a signal that tells a program how to read the text. It can mark the text as UTF-8 (more common), UTF-16, or even UTF-32.

The bytes FE FF are how U+FEFF appears in big-endian UTF-16 (FF FE in little-endian). In UTF-8 the same character is the three-byte sequence 0xEF, 0xBB, 0xBF.
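The byte sequences can be checked directly in Ruby. This is only a small sketch, and the helper name hex_bytes is made up for illustration: it prints the bytes that U+FEFF becomes in each encoding.

```ruby
# Hypothetical helper: format a string's raw bytes as uppercase hex.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

bom = "\uFEFF" # the single code point U+FEFF, as a UTF-8 string

puts "UTF-8:    #{hex_bytes(bom)}"                    # EF BB BF
puts "UTF-16BE: #{hex_bytes(bom.encode('UTF-16BE'))}" # FE FF
puts "UTF-16LE: #{hex_bytes(bom.encode('UTF-16LE'))}" # FF FE
```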

From my understanding, when the CSV file was opened in Excel and saved, Excel created a space for our invisible stowaway, U+FEFF. And in front of the file to boot!

Excel did some magic when it saved the file. It most likely wrote UTF-8 with a BOM in front rather than UTF-16, since a UTF-16 file would have looked garbled throughout. UTF-8 does not need a BOM, and most editors display U+FEFF as nothing at all, so visually the file was okay. But Ruby's CSV assumed the file it was reading was plain UTF-8 and couldn't ignore Mr. U+FEFF: he became the start of the first field, so a double quote right after him counted as illegal quoting.
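The failure is easy to reproduce without Excel. The sketch below assumes the first field of the file was quoted, which the error message suggests; the sample row and the helper name try_parse are made up, since the real companies.csv isn't shown here.

```ruby
require "csv"

# Returns the parsed rows, or the error's class name if parsing fails.
def try_parse(text)
  CSV.parse(text)
rescue CSV::MalformedCSVError => e
  e.class.name
end

clean_row = %("name","city"\n)   # hypothetical header row
bom_row   = "\uFEFF" + clean_row # same text with U+FEFF in front

p try_parse(clean_row) # [["name", "city"]]
p try_parse(bom_row)   # "CSV::MalformedCSVError"
```

The two strings look identical when printed, which is why retyping the first row into the same file didn't help.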

So lesson learned: don’t open (and save!) a CSV file in Excel if you want to feed it to Ruby’s CSV parser.
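If keeping Excel away from the file isn't always possible, one defensive option is to let Ruby drop a leading BOM while reading. This is a sketch rather than what we actually did: it relies on Ruby's "BOM|UTF-8" read mode, and the method name and sample data are invented for the demo.

```ruby
require "csv"
require "tempfile"

# Read a CSV file, letting Ruby discard a leading UTF-8 BOM if present.
def read_csv_ignoring_bom(path)
  text = File.read(path, mode: "r:BOM|UTF-8")
  CSV.parse(text)
end

# Demo: a throwaway file standing in for a CSV saved with a BOM.
Tempfile.create(["companies", ".csv"]) do |file|
  file.binmode
  file.write("\xEF\xBB\xBF\"name\",\"city\"\n")
  file.flush
  p read_csv_ignoring_bom(file.path) # [["name", "city"]]
end
```

Files without a BOM are read unchanged in this mode, so the same method can be used for both kinds of file.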

If you ever do encounter an error like that, be sure to look for hidden characters not shown by your editor. If you still can't see them and are using macOS, then pbcopy and pbpaste will help you out: in addition to copying and pasting text, they strip out any formatting or hidden characters from it.
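A hidden BOM can also be detected directly by looking at the first three bytes of the file. The following is a minimal sketch; the method name utf8_bom? and the file contents are made up for the demo.

```ruby
require "tempfile"

# True when the file at `path` starts with the UTF-8 BOM bytes EF BB BF.
def utf8_bom?(path)
  # binread skips encoding conversion; to_s covers an empty file.
  File.binread(path, 3).to_s.bytes == [0xEF, 0xBB, 0xBF]
end

# Demo on two throwaway files.
Tempfile.create("with_bom") do |file|
  file.binmode
  file.write("\xEF\xBB\xBFname,city\n")
  file.flush
  puts utf8_bom?(file.path) # true
end

Tempfile.create("without_bom") do |file|
  file.binmode
  file.write("name,city\n")
  file.flush
  puts utf8_bom?(file.path) # false
end
```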

Translated from: https://www.freecodecamp.org/news/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7/
