Parsing HTML in Java for an Android app

经慈

2023-03-14

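Question:

I am writing an Android app that fetches relevant data from a website and presents it to the user (HTML scraping). The app downloads the page source and parses it to find the data to store in objects. I originally built the parser with JSoup, but it turned out to be really slow in my app. Libraries like that also tend to be large, and I want my app to stay lightweight.

The pages I want to parse all have a similar structure, and I know exactly which tags I am looking for. So I figured I might as well download the source, read it line by line, and pick out the relevant data with String.equals. For example, if the HTML looks like this: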

<textTag class="text">I want this text</textTag>

I would parse it with something like this:

private void interpretHtml(String s){
    // The opening tag is 22 characters long and the closing tag is 10.
    if(s.startsWith("<textTag class=\"text\"")){
        String text = s.substring(22, s.length() - 10);
    }
}

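However, I know very little about establishing the connection (I have seen people use HttpGet, but I am not sure how to get the data out of it). I have spent a long time searching for information on how to do this kind of parsing, but most people resort to libraries such as JSoup or SAX.

Does anyone happen to know how to do this kind of parsing, perhaps with an example? Or is parsing the source this way a bad idea? Please give me your opinion.

Thanks for your time.

Answer:

Here is how I would do it: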

    // Needs: java.io.BufferedReader, java.io.IOException, java.io.InputStreamReader,
    // java.net.HttpURLConnection, java.net.URL, android.util.Log.
    // The original snippet omitted the method signature; the name below is a placeholder.
    private String fetchText() {
        StringBuilder text = new StringBuilder();
        HttpURLConnection conn = null;
        InputStreamReader in = null;
        BufferedReader buff = null;
        try {
            URL page = new URL(
                    "http://example.com/");
            // When passing parameters that may contain symbols or spaces, encode them with
            // URLEncoder.encode(someParameter, "UTF-8") first; otherwise the request may
            // fail (for example with a 404).
            conn = (HttpURLConnection) page.openConnection();
            conn.connect();
            /* Use this if you need to check the response code:
            int responseCode = conn.getResponseCode();

            if (responseCode == 401 || responseCode == 403) {
                // Authorization Error
                Log.e(tag, "Authorization Error");
                throw new Exception("Authorization Error");
            }

            if (responseCode >= 500 && responseCode <= 504) {
                // Server Error
                Log.e(tag, "Internal Server Error");
                throw new Exception("Internal Server Error");
            }*/
            in = new InputStreamReader(conn.getInputStream());
            buff = new BufferedReader(in);
            String line;
            // readLine() returns null at end of stream, so test it before using the line.
            while ((line = buff.readLine()) != null) {
                String found = interpretHtml(line);
                if (found != null) {
                    return found; // remove this check and return if you need to load the whole HTML document
                }
                text.append(line).append('\n');
            }
        } catch (Exception e) {
            // Standards.tag is the log-tag constant from the answer author's own project.
            Log.e(Standards.tag,
                    "Exception while getting html from website, exception: "
                            + e.toString() + ", cause: " + e.getCause()
                            + ", message: " + e.getMessage());
        } finally {
            if (null != buff) {
                try {
                    buff.close();
                } catch (IOException e1) {
                    // ignored: nothing useful can be done if close fails
                }
                buff = null;
            }
            if (null != in) {
                try {
                    in.close();
                } catch (IOException e1) {
                    // ignored: nothing useful can be done if close fails
                }
                in = null;
            }
            if (null != conn) {
                conn.disconnect();
                conn = null;
            }
        }
        if (text.length() > 0) {
            // Reached when no single line matched, e.g. when the per-line check above
            // was removed and the whole page was loaded.
            return interpretHtml(text.toString());
        } else return null;
    }
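
The read loop above can be tried without a network connection by feeding it any Reader. The following is only a minimal sketch under that assumption: the class name LineScanner, the method firstMatch, and the constants OPEN and CLOSE are made up for illustration, with the marker values taken from the sample HTML in the question. It also trims each line first, since real pages are usually indented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class LineScanner {
    // Hypothetical markers, copied from the question's sample HTML.
    static final String OPEN = "<textTag class=\"text\">";
    static final String CLOSE = "</textTag>";

    /** Returns the text of the first line wrapped in OPEN...CLOSE, or null if no line matches. */
    static String firstMatch(Reader source) throws IOException {
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim(); // tolerate indentation around the tag
                if (trimmed.length() >= OPEN.length() + CLOSE.length()
                        && trimmed.startsWith(OPEN) && trimmed.endsWith(CLOSE)) {
                    return trimmed.substring(OPEN.length(), trimmed.length() - CLOSE.length());
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html>\n  <textTag class=\"text\">I want this text</textTag>\n</html>";
        System.out.println(firstMatch(new StringReader(html))); // prints "I want this text"
    }
}
```

In the app itself, the same method could presumably be handed an InputStreamReader wrapped around conn.getInputStream() instead of the StringReader used here.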

private String interpretHtml(String s){
    // 22 is the length of the opening tag <textTag class="text">, 10 the length of </textTag>.
    if(s.startsWith("<textTag class=\"text\"")){
        return s.substring(22, s.length() - 10);
    }
    return null;
}
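
The fixed offsets 22 and 10 only work when the tag starts at the first character of the line and nothing follows the closing tag. A possible variation, sketched here with a made-up class TagText and helper between (neither comes from the original answer), looks up both markers with indexOf so that leading whitespace or trailing content does not break the extraction:

```java
class TagText {
    /**
     * Returns the text between the first occurrence of open and the next
     * occurrence of close, or null if the input is null or a marker is missing.
     */
    static String between(String s, String open, String close) {
        if (s == null) {
            return null;
        }
        int start = s.indexOf(open);
        if (start < 0) {
            return null; // opening marker not present
        }
        start += open.length(); // skip past the opening marker itself
        int end = s.indexOf(close, start);
        if (end < 0) {
            return null; // closing marker not present after the opening one
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String line = "   <textTag class=\"text\">I want this text</textTag>  ";
        System.out.println(between(line, "<textTag class=\"text\">", "</textTag>"));
        // prints "I want this text"
    }
}
```

This still assumes the opening and closing tags sit on the same line and that the attribute text matches exactly; pages that vary on either point are where a real parser starts to pay off.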
