Decoding HTML from Text

宇文金鑫
2023-12-01

Often while web scraping, you will come across HTML entities in the text (for example, "&amp;" standing in for "&") that need to be decoded into their character forms. While jQuery and libraries in various other languages offer ways to decode these values, native JavaScript has no built-in function for it.

To solve this, I started by decoding the HTML codes with a lookup table, going word by word. This was a safe method that did not require access to the web page's DOM. The method worked, but the lookup table had to be packed into the file, which made the file larger. Another drawback was the time it took to parse hundreds of strings of varying complexity.
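
As a rough illustration of the lookup-table idea, here is a minimal sketch. It is not the original code: the table NAMED_ENTITIES and the function decodeWithTable are hypothetical names, the table is deliberately tiny, and it scans with a regular expression rather than word by word.

```javascript
// Hypothetical, deliberately small lookup table. A complete table of
// HTML named character references is far larger, which is where the
// file-size cost comes from.
const NAMED_ENTITIES = {
  amp: "&",
  lt: "<",
  gt: ">",
  quot: '"',
  apos: "'",
  nbsp: "\u00A0",
};

// Matches &name; as well as decimal (&#65;) and hex (&#x41;) references.
const ENTITY_PATTERN = /&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g;

function decodeWithTable(text) {
  return text.replace(ENTITY_PATTERN, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Unknown names are left as they were found.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

console.log(decodeWithTable("Tom &amp; Jerry &lt;3"));
```

Because the string is scanned in a single pass, already-decoded output is not decoded a second time. Any name missing from the table passes through unchanged, so coverage is only as good as the table that ships with the file.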

Looking for other implementations of this idea, I found a solution from Rob W on Stack Overflow.

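function decodeHtml(html) {
  // Let the browser's HTML parser decode the entities inside a <textarea>.
  var txt = document.createElement("textarea");
  txt.innerHTML = html;
  // The element's value is the decoded plain text.
  return txt.value;
}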

Unlike some other approaches, the code above does not strip HTML tags from the text, and it is performant. Its biggest issue is that it works directly against the page's live DOM. To mitigate this, I recommend the following code, which I adapted from Rob's solution.

function decodeHtml(html) {
  // Create a separate document so the element never belongs to the page's own DOM.
  var htmlDoc = document.implementation.createHTMLDocument("");
  var txt = htmlDoc.createElement("textarea");
  txt.innerHTML = html;
  return txt.value;
}

This prevents scripts from running and separates your application from its parsing. It keeps everything that is great about the first solution and is safe to use in production.

Translated from: https://medium.com/swlh/decoding-html-from-text-174b5edcac28
