Arachnid Web Spider Framework (2)

樊胜

2023-12-01

PageInfo.java is a page object. It abstracts the main elements of a web page and encapsulates the methods for retrieving them.

public URL getUrl() { return(url); } // Returns the page's URL

public URL getParentUrl() { return(parentUrl); } // Returns the URL of the parent page

public String getTitle() { return(title); }

public URL[] getLinks() { return(links); }
public URL[] getImages() { return(images); }
public String getContentType() { return(contentType); }
public boolean isValid() { return(valid); }
public int getResponseCode() { return responseCode; }

The core method of the class is extract:

public void extract(Reader reader) throws IOException
{
  // Note: contentLength of -1 means UNKNOWN
  if (reader == null || url == null ||
   responseCode != HttpURLConnection.HTTP_OK ||
   contentLength == 0 || contentType.equalsIgnoreCase(HTML) == false) {
   valid = false;
   return;
  }
  WebPageXtractor x = new WebPageXtractor();
  try { x.parse(reader); }
  catch(EOFException e) {
   valid = false;
   return;
  }
  catch(SocketTimeoutException e) {
   valid = false;
   throw(e);
  }
  catch(IOException e) {
   valid = false;
   return;
  }
  ArrayList rawlinks = x.getLinks();
  ArrayList rawimages = x.getImages();

  // Get web page title (1st title if more than one!)
  ArrayList rawtitle = x.getTitle();
  if (rawtitle.isEmpty()) title = null;
  else title = new String((String)rawtitle.get(0));

  // Get links
  int numelem = rawlinks.size();
  if (numelem == 0) links = null;
  else {
   ArrayList t = new ArrayList();
   for (int i=0; i<numelem; i++) {
    String slink = (String)rawlinks.get(i);
    try {
     URL link = new URL(url,slink);
     t.add(link);
    }
    catch(MalformedURLException e) { /* Ignore */ }
   }
   if (t.isEmpty()) links = null;
   else links = (URL[])t.toArray(dummy);
  }

  // Get images
  numelem = rawimages.size();
  if (numelem == 0) images = null;
  else {
   ArrayList t = new ArrayList();
   for (int i=0; i<numelem; i++) {
    String simage = (String)rawimages.get(i);
    try {
     URL image = new URL(url,simage);
     t.add(image);
    }
    catch(MalformedURLException e) { }
   }
   if (t.isEmpty()) images = null;
   else images = (URL[])t.toArray(dummy);
  }

  // Set valid flag
  valid = true;
}
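
The guard at the top of extract decides whether a page is parsed at all. The sketch below isolates that decision so it can be tried on its own. It is only an illustration: the class ExtractGuardDemo and the method shouldParse are hypothetical names, the value "text/html" for the HTML constant is an assumption (the constant's definition is not shown in this post), the reader and url null checks are left out, and the extra null check on contentType is an addition that the original code does not have.

```java
import java.net.HttpURLConnection;

class ExtractGuardDemo {
    // Assumption: the HTML constant in PageInfo holds this MIME type.
    static final String HTML = "text/html";

    // Hypothetical helper mirroring the response checks in extract():
    // parse only successful, non-empty HTML responses.
    static boolean shouldParse(int responseCode, int contentLength, String contentType) {
        return responseCode == HttpURLConnection.HTTP_OK
            && contentLength != 0        // -1 means UNKNOWN, so it still passes
            && contentType != null       // added here; the original would throw on null
            && contentType.equalsIgnoreCase(HTML);
    }

    public static void main(String[] args) {
        System.out.println(shouldParse(200, -1, "text/html"));
        System.out.println(shouldParse(200, 512, "text/html; charset=utf-8"));
    }
}
```

Because the comparison is a whole-string equalsIgnoreCase, a content type that carries a charset parameter does not pass this check unless the caller strips the parameter first.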

The extract method mainly delegates to WebPageXtractor to parse the web page.

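It then stores the extracted title, links, and images in the object's fields, which the getters listed earlier return.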
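
The link and image loops in extract both rely on the two-argument java.net.URL constructor, which resolves a raw href against the page's own URL and throws MalformedURLException for strings it cannot handle. A minimal sketch of that step in isolation follows; the class UrlResolveDemo, the helper resolveAll, and the example URLs are hypothetical and not part of Arachnid.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UrlResolveDemo {
    // Hypothetical helper: resolves each raw link against the page URL,
    // silently skips malformed ones, and returns null when nothing is left,
    // following the same convention as extract().
    static URL[] resolveAll(URL base, List<String> rawLinks) {
        List<URL> resolved = new ArrayList<>();
        for (String raw : rawLinks) {
            try {
                resolved.add(new URL(base, raw));
            } catch (MalformedURLException e) {
                // Ignore links that cannot be turned into a URL
            }
        }
        return resolved.isEmpty() ? null : resolved.toArray(new URL[0]);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://example.com/docs/index.html");
        List<String> raw = Arrays.asList("page2.html", "/about", "bogus://nope");
        URL[] links = resolveAll(base, raw);
        if (links != null) {                 // callers must expect null here
            for (URL u : links) System.out.println(u);
        }
    }
}
```

Since extract leaves links and images as null rather than as empty arrays, code that calls getLinks() or getImages() needs the same null check shown in main.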

WebPageXtractor extends SimpleHTMLParser, so the actual work of breaking the page down into its elements is implemented by SimpleHTMLParser.
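
The division of labor between the two classes can be pictured with a small stand-in. This is not Arachnid's code, and the real SimpleHTMLParser API is not shown in this post: TinyTagScanner, TinyPageExtractor, and handleAttribute are hypothetical names, the scanner uses regular expressions instead of a real HTML tokenizer, and it takes a String rather than a Reader to stay short. The point is only the shape: the base class finds tags and attributes, and the subclass decides which ones to keep.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the base parser: finds opening tags and
// reports each double-quoted attribute to a hook method.
abstract class TinyTagScanner {
    private static final Pattern TAG =
        Pattern.compile("<\\s*([a-zA-Z][a-zA-Z0-9]*)([^>]*)>");
    private static final Pattern ATTR =
        Pattern.compile("([a-zA-Z-]+)\\s*=\\s*\"([^\"]*)\"");

    void parse(String html) {
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            String tagName = tag.group(1).toLowerCase(Locale.ROOT);
            Matcher attr = ATTR.matcher(tag.group(2));
            while (attr.find()) {
                handleAttribute(tagName,
                    attr.group(1).toLowerCase(Locale.ROOT), attr.group(2));
            }
        }
    }

    // Subclasses decide what to do with each attribute.
    abstract void handleAttribute(String tag, String name, String value);
}

// Hypothetical stand-in for the extractor: keeps only hrefs and image sources.
class TinyPageExtractor extends TinyTagScanner {
    final List<String> links = new ArrayList<>();
    final List<String> images = new ArrayList<>();

    @Override
    void handleAttribute(String tag, String name, String value) {
        if (tag.equals("a") && name.equals("href")) links.add(value);
        else if (tag.equals("img") && name.equals("src")) images.add(value);
    }
}
```

A caller would create a TinyPageExtractor, call parse on the page text, and then read the links and images lists, much as extract reads x.getLinks() and x.getImages() after x.parse(reader).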

Next, we will take a closer look at some of SimpleHTMLParser's methods.
