Using JRex to Get Browser-Rendered HTML

周鸿光

2023-12-01

http://www.benjysbrain.com/misc/Render/index.html

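I recently worked on a web data mining project that searches the web for particular types of images. I had been using HTML Parser to parse the HTML and look for IMG and OBJECT tags. Sometimes, however, working with the raw HTML is not adequate, in particular when the src attribute of an img is generated at run time by JavaScript: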
    <IMG SRC="" name="image" id="image">
    <SCRIPT language="JavaScript" type="text/JavaScript">
      
    </script>
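   I looked for an open-source toolkit that would give me access to the rendered HTML after JavaScript had run. I did not find a suitable one, so I considered embedding the Mozilla Gecko engine in my own application. I later gave up on that approach, because setting up a development environment for embedding a browser, and learning how to use it, would have taken too much time.
   Then I discovered JRex by C. N. Medappa, a Java wrapper for Gecko. Using JRex, I wrote a Java program that visits a page and waits for the JavaScript to execute, then walks the DOM of the HTML as rendered and restructured by the browser. In the rendered DOM for the example above, the value of the src attribute is abc.jpg.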

   Getting Started

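   I needed a simple example to learn how to use JRex. Luckily, I found Dietrich Kappe's "How to REALLY do Page Preview in Java with Embedded HTML Rendering"
(http://blogs.pathf.com/agileajax/2007/01/how_to_really_d.html)
   That example showed me how to load Gecko from Java. After that, using the JRex and org.w3c.dom APIs plus some experimentation, I developed a program that extracts IMG and OBJECT tags.
   I followed Kappe's advice to run JRex programs with the JRE rather than the JDK. When I used the JDK, some DLLs could not be found. I believe there is a way to solve this, but running the program with the JRE works without problems.
   The JRex Mailing List (http://jrex.mozdev.org/list.html) is a good place to find information about JRex.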

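An Example
I created a Java class, Render. Its parsePage() method uses JRex to open a page and waits for it to load. It then calls a recursive method, doTree(), to traverse the DOM. For each tag, doElement() is called, and then doTree() is called recursively on each of the tag's child nodes. When a tag has been fully processed, doTagEnd() is called.

In Render, doElement() prints the tag and ignores all attributes, so <IMG SRC="xyz.gif"> causes <IMG> to be printed. doTagEnd() simply prints </IMG>.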

   Consider this example:
<html>
<head>
<TITLE>Simple Page</TITLE>
</head>
   <body>
    <table>
      <tr><td></td><td></td></tr>
    </table>
   </body>
</html>

The output of the Render program is:
<HTML>
<HEAD>
<TITLE>
</TITLE>
</HEAD>
<BODY>
<TABLE>
<TBODY>
<TR>
<TD>
</TD>
<TD>
</TD>
</TR>
</TBODY>
</TABLE>
</BODY>
</HTML>

Note that the missing <TBODY> was filled in when the DOM was built.
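
The visit, recurse, close pattern described above can be tried without JRex. The following is only a sketch under stated assumptions: TagWalker is a hypothetical stand-alone class, not part of JRex or Render, and it uses the JDK's built-in XML parser in place of Gecko. That parser requires well-formed input, keeps tag names as written, and does not add a missing <TBODY>, so its output differs from what Gecko produces.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone class; it only mirrors the doTree() idea.
class TagWalker {

    // Append an opening line for each Element, recurse over its
    // children, then append a closing line. Non-Element nodes
    // (text, comments) are skipped.
    static void walk(Node node, StringBuilder out) {
        if (!(node instanceof Element)) return;
        Element element = (Element) node;
        out.append('<').append(element.getTagName()).append(">\n");
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), out);
        }
        out.append("</").append(element.getTagName()).append(">\n");
    }

    // Parse a well-formed XML string with the JDK parser (an assumption
    // standing in for Gecko) and return the tag listing.
    static String render(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            StringBuilder out = new StringBuilder();
            walk(root, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(render("<table><tr><td/><td/></tr></table>"));
    }
}
```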
*********************Note***********************
If you have not used JRex before, have a look at http://www.pathf.com/blogs/2007/01/how_to_really_d/. I have also translated part of that article; see my blog post.
************************************************
Source code for Render:
package com.benjysbrain.htmlgrab ;

/**
   Render - This object is a wrapper for JRex, the Java library that
   allows a Java application to embed the Mozilla Gecko browser. It
   uses JRex to load a page and then act on the DOM that Gecko
   constructs. The intent of Render is to access the DOM after a page
   is loaded and JavaScript has been applied for web data extraction
   projects.
   <p>
   Subclass this object and override the
   <i>doElement(org.w3c.dom.Element element)</i> and
   <i>doTagEnd(org.w3c.dom.Element element)</i> methods to do some real
   work. In the base class, doElement() prints the tag name and
   doTagEnd() prints a closing version of the tag.
   <p>
   Thanks to Dietrich Kappe for his JRex
   <A HREF="http://blogs.pathf.com/agileajax/2007/01/how_to_really_d.html">
   article.</a> See my <A HREF="http://www.benjysbrain.com/misc/Render">
   article</a> for more details. Thanks to Jason Baumgartner for the
   tip on how to disable JRex logging of debug information.
   <p>
   Copyright (c) 2007 by Ben E. Cline. This code is presented as a teaching
   aid. No warranty is expressed or implied.
   <p>
   http://www.benjysbrain.com/
   @author Benjy Cline
*/

import org.mozilla.jrex.* ;
import org.mozilla.jrex.ui.* ;
import org.mozilla.jrex.window.* ;
import org.mozilla.jrex.navigation.* ;
import org.mozilla.jrex.event.progress.* ;
import org.w3c.dom.* ;
import java.lang.Exception.* ;
import javax.swing.*;
import java.net.*;

public class Render
implements org.mozilla.jrex.event.progress.ProgressListener {

   String url ; // The page to be processed.

   // These variables can be used in subclasses and are created from
   // url. baseURL can be used to construct the absolute URL of the
   // relative URL's in the page. hostBase is just the http://host.com/
   // part of the URL and can be used to construct the full URL of
   // URLs in the page that are site relative, e.g., "/xyzzy.jpg".
   // Variable host is set to the host part of url, e.g., host.com.

   String baseURL ;
   String hostBase ;
   String host ;

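   // The JRexCanvas is the main browser component. The WebNavigation
   // object is used to access the DOM.

   JRexCanvas canvas = null ;
   WebNavigation navigation = null ;

   // An event handler sets "done" to true when the document is loaded.
   // It is volatile because the handler may run on a different thread
   // from the one polling it in parsePage().

   volatile boolean done = false ;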

   /**
      Create a Render object with a target URL.
   */

   public Render(String URL) {
      url = URL ;
   }

   /**
      Load the given URL in Gecko. When the page is loaded,
      recurse on the DOM and call doElement()/doTagEnd() for
      each Element node. Execution can hang if the page causes a
      window to be popped up. Return false on error.
   */

   public boolean parsePage() {

      // Parse the URL and build baseURL and hostBase for use by doElement()
      // and doTagEnd().

      URI uri = null ;
      try {
         uri = new URI(url) ;
      }
      catch(Exception e) {
         System.out.println(e) ;
         return false ;
      }

      String path = uri.getPath() ;
      baseURL = "http://" + uri.getHost() + path + "/" ;
      hostBase = "http://" + uri.getHost() ;
      host = uri.getHost() ;

      // Start up JRex/Gecko.

      try {
         JRexFactory.getInstance().startEngine();
      }
      catch (Exception e) {
         System.err.println("Unable to start up JRex Engine.");
         e.printStackTrace();
         return false ;
      }

      // Get a window manager and put the browser in a Swing frame.
      // Based on Dietrich Kappe's code.

      JRexWindowManager winManager=(JRexWindowManager)
         JRexFactory.getInstance().getImplInstance(JRexFactory.WINDOW_MANAGER);
      winManager.create(JRexWindowManager.SINGLE_WINDOW_MODE);
      JPanel panel = new JPanel();
      JFrame frame = new JFrame();
      frame.getContentPane().add(panel);
      winManager.init(panel);

      // Get the JRexCanvas, set Render to handle progress events so
      // we can determine when the page is loaded, and get the
      // WebNavigation object.

      canvas = (JRexCanvas) winManager.getBrowserForParent(panel);
      canvas.addProgressListener(this) ;
      navigation = canvas.getNavigator() ;

      // Load and process the page.

      try {
         navigation.loadURI(url, WebNavigationConstants.LOAD_FLAGS_NONE,
                            null, null, null);

         // Swing magic.

         frame.setSize(640, 480);
         frame.setVisible(false);

         // Check if the DOM has loaded every two seconds.

         while(!done) {
            Thread.sleep(2000) ;
         }

         // Get the DOM and recurse on its nodes.

         Document doc = navigation.getDocument() ;
         Element ex = doc.getDocumentElement() ;
         doTree((Node) ex) ;
      }
      catch(Exception e) {
         System.out.println("Trouble walking DOM: " + e) ;
         return false ;
      }

      return true ;
   }

   /**
      Recurse the DOM starting with Node node. For each Node of
      type Element, call doElement() with it and recurse over its
      children. The Elements refer to the HTML tags, and the children
      are tags contained inside the parent tag.
   */

   public void doTree(Node node) {
      if(node instanceof Element) {
         Element element = (Element) node ;

         // Visit tag.

         doElement(element) ;

         // Visit all the children, i.e., tags contained in this tag.

         NodeList nl = element.getChildNodes() ;
         if(nl == null) return ;
         int num = nl.getLength() ;
         for(int i=0; i<num; i++)
            doTree(nl.item(i)) ;

         // Process the end of this tag.

         doTagEnd(element) ;
      }
   }

   /**
      Simple doElement() to print the tag name of the Element. Override
      to do something real.
   */

   public void doElement(Element element) {
      System.out.println("<" + element.getTagName() + ">") ;
   }

   /**
      Simple doTagEnd() to print the closing tag of the Element.
      Override to do something real.
   */

   public void doTagEnd(Element element) {
      System.out.println("</" + element.getTagName() + ">") ;
   }

   // org.mozilla.jrex.event.progress.ProgressListener methods.
   // onStateChange() seems the best place to watch for the
   // completion of the loading of the DOM.

   /** Noop */
   public void onLinkStatusChange(ProgressEvent event) {   }
   /** Noop */
   public void onLocationChange(ProgressEvent event) {   }
   /** Noop */
   public void onProgressChange(ProgressEvent event) {   }
   /** Noop */
   public void onSecurityChange(ProgressEvent event) {   }

   /** onStateChange is invoked several times when DOM loading is
       complete. Set the done flag the first time.
   */

   public void onStateChange(ProgressEvent event) {
      if(!event.isLoadingDocument()) {
         if(done) return ;
         done = true ;
      }
   }

   /** Noop */
   public void onStatusChange(ProgressEvent event) { }

   /**
      Main: java com.benjysbrain.htmlgrab.Render [url]. Run
      JRex on the given page, wait for the page to load, and
      traverse the DOM, printing tag names only.
   */

    public static void main(String[] args) {
       String url ="http://www.cnn.com" ;
       if(args.length == 1) url = args[0] ;
       Render p = new Render(url) ;
       p.parsePage() ;
       System.exit(0) ;
    }
}
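
The baseURL and hostBase fields in the listing are meant for turning relative URLs found in a page into absolute ones. The string concatenation in parsePage() appears to assume that the URL path is a directory. As an alternative, and not as the author's method, the standard java.net.URI.resolve() call handles page-relative, site-relative, and already-absolute references in one step. UrlResolve and absolute() below are hypothetical names used only for this sketch.

```java
import java.net.URI;

// Hypothetical helper; an alternative to building baseURL by hand.
class UrlResolve {

    // Resolve a reference found in a page (for example an IMG src
    // value) against the URL of the page it came from.
    static String absolute(String pageUrl, String reference) {
        return URI.create(pageUrl).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String page = "http://host.com/a/b/index.html";
        System.out.println(absolute(page, "abc.jpg"));      // page-relative
        System.out.println(absolute(page, "/xyzzy.jpg"));   // site-relative
        System.out.println(absolute(page, "http://other.com/x.gif"));
    }
}
```

URI.create() throws IllegalArgumentException for strings that are not valid URIs (for example, ones containing unescaped spaces), so values scraped from real pages may need cleaning first.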

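To run main on Windows, put JRex.jar on the classpath and set two -D options:
-Djrex.dom.enable=true
-Djrex.gre.path=%JREX_GRE_PATH% (translator's note: for example, mine is -Djrex.gre.path=C:/jrex/jrex_gre)
Here the %JREX_GRE_PATH% variable points to the JRex GRE. To visit a page other than cnn, pass its URL as the first argument.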

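In a real application, you can override the doElement() and doTagEnd() methods to extract information. To get a tag's attributes, first call the Element object's boolean hasAttributes() method, which returns true if the tag has attributes. You can then call getAttributes() to obtain a NamedNodeMap and access the attributes through it. The nodes held by the NamedNodeMap are attribute name/value pairs: getNodeName() returns the name and getNodeValue() returns the value.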
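
The attribute access just described can be sketched as follows. This is an illustration, not the author's subclass: AttributeDump and its methods are hypothetical names, and the Element comes from the JDK's XML parser rather than from JRex so that the sketch runs on its own. The hasAttributes(), getAttributes(), getNodeName(), and getNodeValue() calls are the standard org.w3c.dom ones named in the text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-alone class showing the body one might put in an
// overridden doElement().
class AttributeDump {

    // Return "name=value" for every attribute of the element.
    // The DOM API does not guarantee any particular order.
    static List<String> attributes(Element element) {
        List<String> result = new ArrayList<>();
        if (!element.hasAttributes()) return result;
        NamedNodeMap map = element.getAttributes();
        for (int i = 0; i < map.getLength(); i++) {
            Node attr = map.item(i);
            result.add(attr.getNodeName() + "=" + attr.getNodeValue());
        }
        return result;
    }

    // Parse a single well-formed tag with the JDK parser (an assumption
    // standing in for the DOM that Gecko builds).
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Element img = parse("<IMG SRC=\"xyz.gif\" name=\"image\"/>");
        System.out.println(img.getTagName() + " " + attributes(img));
    }
}
```

In a Render subclass, the loop in attributes() would go inside doElement(), for example to collect the SRC value whenever the tag name is IMG.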

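Limitations and Problems
Render is a simple example of using JRex, not a complete treatment. I use a subclass of Render when mining web pages and it works well, but the pages I have tested are all well-behaved ones.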
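I use an event listener to determine whether the page has finished loading. Render's parsePage() method checks the done flag every two seconds. If the page cannot be loaded, it loops forever.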
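
One possible way around the endless loop is to give the polling a deadline. The sketch below is an assumption-laden illustration, not part of Render: LoadWait and waitUntil() are hypothetical names, and the timeout and poll interval are arbitrary. It polls a condition until it becomes true or the deadline passes, and reports which happened.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; not part of JRex or of the Render listing.
class LoadWait {

    // Poll isDone every pollMillis milliseconds. Return true as soon as
    // it reports true, or false once timeoutMillis has elapsed or the
    // waiting thread is interrupted.
    static boolean waitUntil(BooleanSupplier isDone,
                             long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!isDone.getAsBoolean()) {
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) return false;
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(pollMillis, Math.max(1, remainingMillis)));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] polls = {0};
        // Condition that becomes true on the third check.
        System.out.println(waitUntil(() -> ++polls[0] >= 3, 1000, 10));
        // Condition that never becomes true, so the wait times out.
        System.out.println(waitUntil(() -> false, 100, 10));
    }
}
```

In parsePage(), the while(!done) loop could then become something like a call to waitUntil(() -> done, 60000, 2000), returning false when the wait times out.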
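Also, when the embedded browser is loaded, the browser window is displayed until loading succeeds. I have not looked into this, since my mining tasks do not need a browser window.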
