PageInfo.java is a page object. It abstracts the main elements of a web page and encapsulates the methods for retrieving them.
public URL getUrl() { return(url); } // Get the URL of the page
public URL getParentUrl() { return(parentUrl); } // Get the URL of the parent page
public String getTitle() { return(title); }
public URL[] getLinks() { return(links); }
public URL[] getImages() { return(images); }
public String getContentType() { return(contentType); }
public boolean isValid() { return(valid); }
public int getResponseCode() { return responseCode; }
The core method of the class is extract():
public void extract(Reader reader) throws IOException
{
    // Note: contentLength of -1 means UNKNOWN
    if (reader == null || url == null ||
        responseCode != HttpURLConnection.HTTP_OK ||
        contentLength == 0 || contentType.equalsIgnoreCase(HTML) == false) {
        valid = false;
        return;
    }
    WebPageXtractor x = new WebPageXtractor();
    try { x.parse(reader); }
    catch(EOFException e) {
        valid = false;
        return;
    }
    catch(SocketTimeoutException e) {
        valid = false;
        throw(e);
    }
    catch(IOException e) {
        valid = false;
        return;
    }
    ArrayList rawlinks = x.getLinks();
    ArrayList rawimages = x.getImages();
    // Get web page title (1st title if more than one!)
    ArrayList rawtitle = x.getTitle();
    if (rawtitle.isEmpty()) title = null;
    else title = new String((String)rawtitle.get(0));
    // Get links
    int numelem = rawlinks.size();
    if (numelem == 0) links = null;
    else {
        ArrayList t = new ArrayList();
        for (int i=0; i<numelem; i++) {
            String slink = (String)rawlinks.get(i);
            try {
                URL link = new URL(url,slink);
                t.add(link);
            }
            catch(MalformedURLException e) { /* Ignore */ }
        }
        if (t.isEmpty()) links = null;
        else links = (URL[])t.toArray(dummy);
    }
    // Get images
    numelem = rawimages.size();
    if (numelem == 0) images = null;
    else {
        ArrayList t = new ArrayList();
        for (int i=0; i<numelem; i++) {
            String simage = (String)rawimages.get(i);
            try {
                URL image = new URL(url,simage);
                t.add(image);
            }
            catch(MalformedURLException e) { /* Ignore */ }
        }
        if (t.isEmpty()) images = null;
        else images = (URL[])t.toArray(dummy);
    }
    // Set valid flag
    valid = true;
}
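This method mainly delegates to WebPageXtractor to process the web page.
It then stores the extracted values (title, links, and images) in the PageInfo fields, where the getters above can return them. Note that extract() itself returns void.
WebPageXtractor extends SimpleHTMLParser, so the actual work of breaking the page down into its elements is done by SimpleHTMLParser.
The rest of this analysis focuses on some of the methods of SimpleHTMLParser.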
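The link and image loops in extract() both follow the same pattern: resolve each raw string against the page URL and silently drop anything that is not a valid URL. Below is a minimal standalone sketch of that pattern. The class name LinkResolver and the method resolveAll are hypothetical and do not appear in the original source. Unlike extract(), the sketch takes the page URL as a string and returns an empty array rather than null when nothing resolves.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original PageInfo source.
class LinkResolver {

    // Resolves each raw href against the page URL and skips any entry
    // that cannot be turned into a valid URL.
    static URL[] resolveAll(String pageUrl, List<String> rawLinks) {
        URL base;
        try {
            base = new URL(pageUrl);
        } catch (MalformedURLException e) {
            // Assumption: a bad page URL is a caller error in this sketch.
            throw new IllegalArgumentException("Bad page URL: " + pageUrl, e);
        }
        List<URL> resolved = new ArrayList<>();
        for (String raw : rawLinks) {
            try {
                // Same two-argument constructor that extract() uses:
                // relative specs are resolved against the base URL.
                resolved.add(new URL(base, raw));
            } catch (MalformedURLException e) {
                // Ignore, as extract() does.
            }
        }
        return resolved.toArray(new URL[0]);
    }
}
```

For a page at http://example.com/dir/page.html, the raw link "a.html" should resolve to http://example.com/dir/a.html and "/b.html" to http://example.com/b.html. An absolute link is kept unchanged, and a link with an unknown protocol such as "bogus://x" is dropped. Recent JDKs mark the URL constructors as deprecated. The sketch uses them anyway to stay close to the original code.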