HTML Superscript to Plain Text with PHP

段干俊茂

2023-12-01

I am converting an HTML document to plain text by stripping and replacing all the HTML tags, and that part works. But I have come across a situation where I need to handle superscript. I have this HTML code:

11,500m<sup>2</sup>

I need to convert it to plain text so that it becomes just 11,500m². How can I do so? Thank you in advance.

Source: https://stackoverflow.com/questions/45914751

Updated: 2019-11-05 05:09

Accepted answer

Only a few superscript digits are widely available as single characters (¹, ² and ³ exist in Latin-1; plain ASCII has none), so replace those directly and use ^ notation for everything else. Note that str_replace returns the new string, so assign the result back:

    // replace all ... things to a power of 1
    $html = str_replace("<sup>1</sup>", "¹", $html);

    // replace all squares
    $html = str_replace("<sup>2</sup>", "²", $html);

    // replace all cubes
    $html = str_replace("<sup>3</sup>", "³", $html);

    // for everything else use ^ notation
    $html = str_replace("<sup>", "^", $html);

    // remove leftover closing sup tags
    $html = str_replace("</sup>", "", $html);

As plain text has no superscript form for most characters, this solution will:

Find text like: Some Text<sup>Other</sup>

And output: Some Text^Other
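A minimal sketch of one way to generalise the replacements above, assuming the input is UTF-8 and that sup elements are not nested. The function name `supToPlainText` and the character map are illustrative and not part of the original answer. It maps digits and signs to Unicode superscript characters and falls back to ^ notation for anything else:

```php
<?php
// Hypothetical helper (not from the original answer).
// Assumes UTF-8 input and no nested <sup> elements.
function supToPlainText(string $html): string
{
    // Characters that have a dedicated Unicode superscript form.
    $map = [
        '0' => '⁰', '1' => '¹', '2' => '²', '3' => '³', '4' => '⁴',
        '5' => '⁵', '6' => '⁶', '7' => '⁷', '8' => '⁸', '9' => '⁹',
        '+' => '⁺', '-' => '⁻',
    ];

    $text = preg_replace_callback(
        '~<sup\b[^>]*>(.*?)</sup>~is',
        function (array $m) use ($map): string {
            $inner = trim(strip_tags($m[1]));
            if ($inner === '') {
                return '';
            }
            // Only digits and signs: every character can be mapped.
            if (strspn($inner, '0123456789+-') === strlen($inner)) {
                return strtr($inner, $map);
            }
            // Anything else: fall back to caret notation.
            return '^' . $inner;
        },
        $html
    ) ?? $html; // keep the input unchanged if the regex engine fails

    // Remove the remaining tags and decode entities such as &amp;.
    return html_entity_decode(strip_tags($text), ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo supToPlainText('11,500m<sup>2</sup>'), "\n";       // 11,500m²
echo supToPlainText('Some Text<sup>Other</sup>'), "\n"; // Some Text^Other
```

A regular expression is enough for simple, well-formed markup like this; for arbitrary HTML, a DOM parser would likely be more robust.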

2017-08-28

Related Q&A

I'd suggest using an HTML to Markdown converter. https://github.com/Pixel418/Markdownify https://code.google.com/p/pandoc/source/browse/trunk/html2markdown?r=1651

...

Here is another solution, which catches all http / https / www and converts them to clickable links. $url = '~(?:(https?)://([^\s

$string = preg_replace($url, '$0', $string);

echo $string;

Or, to catch only http / https, use the following

...

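Use html2text (example HTML to text), licensed under the Eclipse Public License. It uses PHP's DOM methods to load from HTML, and then iterates over the resulting DOM to extract plain text. Usage: $text = convert_html_to_text($html);

Although incomplete, it is open source and contributions are welcome. Issues with other conversion scripts: html2text (GPL) is not EPL-compatible, and lkessler's link (attribution) is incompatible with most open source licences.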

...

It works fine, but the regex linked in your post does not work. It did not return the correct character set, so try this:

function strip_html_tags( $text )

{

$text = preg_replace(

array(

// Remove invisible content

]*?>.*?@siu',

'@@siu',

...

You should check out nl2br, which converts newlines (which won't be visible in an HTML document, as you noticed) to the HTML tag <br />.
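A short illustration of the nl2br behaviour described above; the sample string and variable names are made up for the example. nl2br inserts a line-break tag before each newline and leaves the newline itself in place:

```php
<?php
// Sample input with a single newline (illustrative only).
$text = "First line\nSecond line";

// nl2br() inserts "<br />" before each newline; the newline is kept.
$withBreaks = nl2br($text);

echo $withBreaks, "\n";
```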

You are setting your HTML content as the plain-text content: ...

$message->setTXTBody($text);

...

You need setHTMLBody() (or both...): ...

$message->setHTMLBody($text);

...

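Nope, not from a 'website'... HTML will vary from site to site and is too complex to filter even from one source; add multiple sources and the task becomes impossible. That's the bad news. The good news is that there is a way: most "news" sites offer their content, or part of it, in RSS feeds. Do some research on the RSS2 and ATOM protocols and your answer lies there... Start here: http://www.whatisrss.com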

...

Try Python. Use BeautifulSoup to parse the HTML. The textwrap module will allow you to format the text. Two features are missing, though. To justify the text, you will need to add spaces to each line, but that shouldn't be a big problem (see this code sample). For hyphenation, try this project.

...

I recently ran into a reason to continue doing this: spam filters give some serious weight to emails that are HTML only.
