Question:

Searching for underlined and bold text in a PDF

隗俊誉

2023-03-14

Using iTextSharp, how can I determine whether a parsed chunk of text is both bold and underlined?

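Details:
I am trying to parse .pdf files in C#, specifically looking for text that is both bold and underlined. Using iTextSharp, I can derive from LocationTextExtractionStrategy and get the text, location, font, etc. from the iTextSharp.text.pdf.parser.TextRenderInfo object passed to the overridden RenderText method.
However, determining from the TextRenderInfo object whether the text is bold and/or underlined is not straightforward.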

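I tried getting the font properties using textRenderInfo.GetFont(), but without success.
I can currently determine whether text is bold by accessing the private GraphicsState field on the TextRenderInfo object and checking its .Font.PostscriptFontName property for the word "Bold" (ugly, but it appears to work).
The biggest problem: I have not found anything that tells me whether the text is underlined. How can I determine that?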

Here is my current attempt:

        private FieldInfo _gsField = typeof(TextRenderInfo).GetField("gs",
        BindingFlags.GetField | BindingFlags.NonPublic | BindingFlags.Instance);

        //Automatically called for each chunk of text in the PDF
        public override void RenderText(TextRenderInfo renderInfo)
        {
            base.RenderText(renderInfo);
            //UNDONE:Need to determine if text is underlined.  How?

            //NOTE: renderInfo.GetFont().FontWeight does not contain any actual information
            var gs = (GraphicsState)_gsField.GetValue(renderInfo);
            var textChunkInfo = new TextChunkInfo(renderInfo);
            _allLocations.Add(textChunkInfo);
            if (gs.Font.PostscriptFontName.Contains("Bold"))
                //Add this to our found collection
                FoundItems.Add(new TextChunkInfo(renderInfo));

            if (!_lineHeights.Contains(textChunkInfo.LineHeight))
                _lineHeights.Add(textChunkInfo.LineHeight);
        }

The full source code of my current attempt is in a GitHub repository (the two samples, example.pdf and example2.pdf, contain text similar to what I will be searching for).

端木澄邈

2023-03-14

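"I tried getting the font properties using textRenderInfo.GetFont(), but without success."

"I can currently determine whether text is bold by accessing the private GraphicsState field on the TextRenderInfo object and checking its .Font.PostscriptFontName property for the word 'Bold' (ugly, but it appears to work)."

I do not quite understand this distinction. textRenderInfo.GetFont() returns exactly the same thing as the Font property of textRenderInfo's private GraphicsState field.

That being said, checking the font name is indeed one of the main ways to determine boldness.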

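Bold writing in PDFs is achieved in one of the following ways:

using an explicitly bold font (the preferable way); in this case you can try to determine whether the font is bold, for example by looking for "Bold" in the font name;

not only filling the glyph outlines but also stroking a thicker line along them, to give the impression of boldness;

drawing each glyph twice, the second time slightly displaced, also to give the impression of boldness.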
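
The three cases above can be turned into rough heuristics. The following is only a sketch under assumptions, not iTextSharp code: `BoldHeuristics` and its method names are hypothetical, and the inputs are plain values you would have to pull out of the TextRenderInfo yourself (the font name from GetFont().PostscriptFontName, the text rendering mode from what I believe is GetTextRenderMode() in iTextSharp 5.x, and the chunk's baseline start point). The 1pt displacement tolerance is a guess.

```csharp
using System;

public static class BoldHeuristics
{
    // PDF text rendering modes: 0 = fill, 1 = stroke, 2 = fill, then stroke.
    private const int FillThenStroke = 2;

    // Case 1: an explicitly bold font usually says so in its PostScript name,
    // e.g. "ABCDEF+Arial-BoldMT" for an embedded subset.
    public static bool NameSuggestsBold(string postscriptFontName)
    {
        if (string.IsNullOrEmpty(postscriptFontName)) return false;
        return postscriptFontName.IndexOf("Bold", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Case 2: the glyph outlines are filled and additionally stroked.
    // This can also be a deliberate outline effect, so treat it as a hint only.
    public static bool RenderModeSuggestsBold(int textRenderMode)
    {
        return textRenderMode == FillThenStroke;
    }

    // Case 3: the same text is drawn twice, the second time slightly displaced.
    // Compares a chunk with the previously rendered chunk.
    public static bool IsDisplacedDuplicate(string prevText, float prevX, float prevY,
                                            string text, float x, float y,
                                            float tolerance = 1.0f)
    {
        if (prevText == null || prevText != text) return false;
        float dx = Math.Abs(x - prevX);
        float dy = Math.Abs(y - prevY);
        // Identical position would be a plain duplicate, not a "poor man's bold".
        return dx <= tolerance && dy <= tolerance && (dx > 0 || dy > 0);
    }
}
```

In a RenderText override you would call these for every chunk and treat any positive result as "probably bold"; none of them is conclusive on its own.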

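Underlined writing in PDFs is usually achieved by explicitly drawing a line or a very thin rectangle under the text. You can try to detect such lines by implementing IExtRenderListener, parsing the page in question with it to determine the line positions, and then matching those against the text positions during text extraction. Both can be done in a single pass, but beware: the underlines need not be drawn before the text, or even shortly after it; the PDF producer may first draw all the text and only then all the underlines. Furthermore, I have also come across a funny construct in which very short (e.g. 1pt) but very wide (e.g. 50pt line width) vertical lines effectively appear as horizontal ones...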
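
The matching step might look roughly like the sketch below. It is plain geometry and does not use any iTextSharp types: `Box`, `UnderlineMatcher` and all thresholds (2pt maximum thickness, 3pt maximum gap below the baseline, 80% horizontal coverage) are assumptions made up for illustration. It assumes axis-aligned lines with butt caps and coordinates already in page space. Because the drawing order is not guaranteed, it expects all drawn shapes of the page to have been collected before matching.

```csharp
using System;
using System.Collections.Generic;

// Axis-aligned box in PDF page space (origin bottom-left, units in points).
public struct Box
{
    public float Left, Bottom, Right, Top;
    public Box(float left, float bottom, float right, float top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }
    public float Width { get { return Right - Left; } }
    public float Height { get { return Top - Bottom; } }
}

public static class UnderlineMatcher
{
    // Visual footprint of a stroked straight line: the segment widened by half
    // the line width on each side. A 1pt-long vertical line with a 50pt line
    // width therefore comes out as a 50pt-wide, 1pt-high horizontal bar.
    public static Box StrokedLineBox(float x1, float y1, float x2, float y2, float lineWidth)
    {
        float half = lineWidth / 2f;
        bool vertical = Math.Abs(x2 - x1) < Math.Abs(y2 - y1);
        if (vertical)
            return new Box(Math.Min(x1, x2) - half, Math.Min(y1, y2),
                           Math.Max(x1, x2) + half, Math.Max(y1, y2));
        return new Box(Math.Min(x1, x2), Math.Min(y1, y2) - half,
                       Math.Max(x1, x2), Math.Max(y1, y2) + half);
    }

    // A thin, wide shape is a candidate for an underline.
    public static bool LooksLikeRule(Box b, float maxThickness = 2f, float minLength = 3f)
    {
        return b.Height <= maxThickness && b.Width >= minLength && b.Width > b.Height;
    }

    // text.Bottom is taken to be the baseline of the text chunk.
    public static bool IsUnderlined(Box text, IEnumerable<Box> drawnBoxes,
                                    float maxGap = 3f, float minCoverage = 0.8f)
    {
        foreach (Box b in drawnBoxes)
        {
            if (!LooksLikeRule(b)) continue;
            float gap = text.Bottom - b.Top;           // rule must sit just below the baseline
            if (gap < -1f || gap > maxGap) continue;   // skips strike-through and distant lines
            float overlap = Math.Min(text.Right, b.Right) - Math.Max(text.Left, b.Left);
            if (text.Width > 0 && overlap / text.Width >= minCoverage) return true;
        }
        return false;
    }
}
```

Thin filled rectangles can be passed in as boxes directly; stroked lines go through StrokedLineBox first so that the odd "wide vertical line" case is handled the same way as an ordinary horizontal line.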

/**
 * Called when the current path is being modified. E.g. new segment is being added,
 * new subpath is being started etc.
 *
 * @param renderInfo Contains information about the path segment being added to the current path.
 */
void ModifyPath(PathConstructionRenderInfo renderInfo);

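Whenever a path is drawn, a number of ModifyPath calls first define the lines and curves the path consists of, followed by at most one ClipPath call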

/**
 * Called when the current path should be set as a new clipping path.
 *
 * @param rule Either {@link PathPaintingRenderInfo#EVEN_ODD_RULE} or {@link PathPaintingRenderInfo#NONZERO_WINDING_RULE}
 */
void ClipPath(int rule);

(if and only if the path is to serve as the clipping path for the following drawing operations), and finally exactly one RenderPath call

/**
 * Called when the current path should be rendered.
 *
 * @param renderInfo Contains information about the current path which should be rendered.
 * @return The path which can be used as a new clipping path.
 */
Path RenderPath(PathPaintingRenderInfo renderInfo);

which defines how the path is to be drawn (any combination of filling its interior and stroking the path itself).
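
To illustrate this call sequence (any number of path modifications, at most one clip, exactly one render), here is a small stand-in collector. It deliberately does not implement IExtRenderListener or use iTextSharp types: `PathCollector` and its members are hypothetical, and a real listener would take the operator and coordinates from the PathConstructionRenderInfo, apply the current transformation matrix, and read fill/stroke from the PathPaintingRenderInfo. Curves are ignored here; only points and rectangles are tracked.

```csharp
using System;
using System.Collections.Generic;

public sealed class PathCollector
{
    // Points of the path currently under construction.
    private readonly List<float[]> _current = new List<float[]>();

    // Bounds of every painted path, each as { left, bottom, right, top }.
    public readonly List<float[]> PaintedBounds = new List<float[]>();

    // Stands in for a ModifyPath call carrying a move-to or line-to.
    public void AddPoint(float x, float y)
    {
        _current.Add(new[] { x, y });
    }

    // Stands in for a ModifyPath call carrying a rectangle.
    public void AddRectangle(float x, float y, float width, float height)
    {
        AddPoint(x, y);
        AddPoint(x + width, y + height);
    }

    // Stands in for ClipPath. Clipping does not make anything visible,
    // so nothing is recorded here.
    public void ClipPath()
    {
    }

    // Stands in for RenderPath. "painted" is false when the path is neither
    // filled nor stroked (e.g. it only served as a clipping path).
    public void RenderPath(bool painted)
    {
        if (painted && _current.Count > 0)
        {
            float left = float.MaxValue, bottom = float.MaxValue;
            float right = float.MinValue, top = float.MinValue;
            foreach (float[] p in _current)
            {
                left = Math.Min(left, p[0]);
                right = Math.Max(right, p[0]);
                bottom = Math.Min(bottom, p[1]);
                top = Math.Max(top, p[1]);
            }
            PaintedBounds.Add(new[] { left, bottom, right, top });
        }
        _current.Clear(); // the render call always ends the current path
    }
}
```

After the page has been processed, PaintedBounds holds one entry per visible path, which can then be filtered for thin horizontal shapes and compared with the text positions.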
