检测编码并制作一切UTF-8

卢勇

2023-03-14

问题内容：

我正在从各种RSS提要中读取很多文本，并将其插入到数据库中。

当然，提要中使用了几种不同的字符编码，例如UTF-8和ISO 8859-1。

不幸的是，文本的编码有时会出现问题。例：

“Fußball”中的“ß”在我的数据库中应如下所示：“ÂŸ”。如果它是“Â”，则正确显示。
有时，“Fußball”中的“ß”在我的数据库中看起来像这样：“ÃƒÂŸ”。然后，当然会显示错误。
在其他情况下，“ß”另存为“ß”-因此无需进行任何更改。然后它也会显示错误。

我该如何避免情况2和情况3？

如何使所有内容都使用相同的编码，最好是UTF-8？什么时候必须使用utf8_encode()，什么时候必须使用utf8_decode()（很清楚效果是什么，但是什么时候必须使用这些函数？），什么时候我什么都不要做？

如何使所有内容都具有相同的编码？也许具有功能mb_detect_encoding()？我可以为此编写函数吗？所以我的问题是：

如何找出文字使用的编码方式？
我如何将其转换为UTF-8-不管旧的编码是什么？

这样的功能会起作用吗？

function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}

我已经测试过了，但是没有用。它出什么问题了？

问题答案：

如果应用utf8_encode()到已经存在的UTF-8字符串，它将返回乱码的UTF-8输出。

我做了一个解决所有这些问题的功能。叫做Encoding::toUTF8()。

您不需要知道字符串的编码是什么。它可以是Latin1（ISO8859-1），Windows-1252或UTF-8，或者字符串可以混合使用。Encoding::toUTF8()会将所有内容转换为UTF-8。

我之所以这样做，是因为某项服务使我的数据馈送全乱了，在同一字符串中混合了UTF-8和Latin1。

用法：

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

下载：

https://github.com/neitanod/forceutf8

我提供了另一个函数，Encoding::fixUFT8()它将修复每个看起来乱码的UTF-8字符串。

用法：

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

例子：

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

将输出：

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

我已将函数（forceUTF8）转换为称为的类上的一系列静态函数Encoding。新功能是Encoding::toUTF8()。

检测编码并制作一切UTF-8

相关阅读

相关文章

相关问答

相关工具

相关文档