Question:

Parsing a dynamically loaded web page with Jsoup in Java

阎单鹗

2023-03-14

import java.io.IOException;
import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;


public class listGrabber {
    public static void main(String[]args) {
        try {
            Document doc = Jsoup.connect("https://play.google.com/store/apps/category/GAME_ACTION/collection/topselling_free").get();
            int count = 0;
            Elements elements;
            String url;
            ArrayList<String> list = new ArrayList<>();
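            Elements titles = doc.select("a[class^=title]");
            // Bound the loop by the number of matches; get(count) throws once count runs past the end.
            while (count < titles.size()) {
                elements = titles.get(count).select("a[class^=title]");

                url = elements.attr("abs:title").replaceAll("https://play.google.com/store/apps/category/GAME_ACTION/collection/","");
                url = url.replaceAll("®|™","");
                url = url.replaceAll("[(](.*)[)]","");
                // Compare string contents with isEmpty(); != only compares object references.
                if (url.isEmpty()) {
                    break;
                }
                list.add(url);
                System.out.println(url);
                count++;
            }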
            // String divContents =
            // doc.select(".id-app-orig-desc").first().text();
            // elements.remove("div");
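        } catch (IOException e) {
            // Report the failure instead of silently swallowing it.
            System.err.println("Failed to fetch the page: " + e.getMessage());
        }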
    }
}

As you can see above, I am trying to scrape a list of game titles from https://play.google.com/store/apps/category/GAME_ACTION/collection/topselling_free

The Google Play store page loads more elements each time you scroll to the bottom of the page.

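My program scrapes the first 40 elements that are displayed, but because Jsoup does not load the rest of the dynamically loaded page, I cannot scrape anything beyond those first 40 elements.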

In addition, if you scroll down the page to game #300, a Show More button appears, and I would also like to parse the elements beyond the Show More button.

Is there a way for Jsoup to parse all of the elements that are loaded dynamically onto the page?

阎乐池

2023-03-14

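Edit: after a few comments from the OP, I now fully understand what he is trying to achieve. I have made some changes to my original solution and tested it.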

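You can do this with Jsoup. After the first page, fetching each further page requires sending a POST request with some extra data. That data contains (among other things) the starting index and the number of records to fetch. If you send an illegal number (i.e. you ask for the page containing game number 700 when the results only contain 600 games), you get the first page again. You can loop through the pages until you receive results you already have.
Sometimes the server returns 600 results and sometimes only 540; I don't know why.
The code is: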

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class HelloWorld {

public static void main(String[] args) {

    Connection.Response res = null;
    Document doc = null;
    boolean OK = true;
    int start = 0;
    String query;
    ArrayList<String> tempList = new ArrayList<>();
    ArrayList<String> games = new ArrayList<>();
    Pattern r = Pattern.compile("title=\"(.*)\" a");

    try {   //first connection with GET request
        res = Jsoup.connect("https://play.google.com/store/apps/category/GAME_ACTION/collection/topselling_free")
                .method(Method.GET)
                .execute(); 
        doc = res.parse();
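    } catch (Exception ex) {
        // Without the first page there is nothing to parse, so report the error and stop.
        System.err.println("Initial request failed: " + ex.getMessage());
        return;
    }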
    for (int i=1; i <= 60; i++) {    //parse the result and add it to the list
        query = "div.card:nth-child(" + i + ") > div:nth-child(1) > div:nth-child(3) > h2:nth-child(2) > a:nth-child(1)";
        tempList.add(doc.select(query).toString());
    }

    while (OK) {    //loop until you get the same results again
        start += 60;    
        System.out.println("now at number " + start);
        try {      //send post request for each new page
            doc = Jsoup.connect("https://play.google.com/store/apps/category/GAME_ACTION/collection/topselling_free?authuser=0")
                    .cookies(res.cookies())
                    .data("start", String.valueOf(start))
                    .data("num", "60")
                    .data("numChildren", "0") 
                    .data("ipf", "1")
                    .data("xhr", "1")
                    .post();
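        } catch (Exception ex) {
            // Stop paging on a failed request; doc would otherwise still hold the previous page.
            System.err.println("Request for start=" + start + " failed: " + ex.getMessage());
            break;
        }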
        for (int i=1; i <= 60; i++) {    //parse the result and add it to the list
            query = "div.card:nth-child(" + i + ") > div:nth-child(1) > div:nth-child(3) > h2:nth-child(2) > a:nth-child(1)";
            if (!tempList.contains(doc.select(query).toString())) {
                tempList.add(doc.select(query).toString());
            } else {    //we've seen these games before, time to quit
                OK = false;
                break;
            }               
        }   
    }
    for (int i = 0; i < tempList.size(); i++) {    //remove all redundant info.
        Matcher m = r.matcher(tempList.get(i));
        if (m.find()) {
            games.add(m.group(1));
            // Print the match itself; games.get(i) goes out of bounds once any entry fails to match.
            System.out.println((i + 1) + " " + m.group(1));
        }
    }
}
}

The code can be improved further (for example, by handling each list in a separate method), so that is up to you.
I hope this works for you.
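
As a rough sketch of that kind of refactoring, the paging loop and the title extraction could each live in their own method. Everything here is an assumption for illustration: the class PagingSketch, the methods extractTitle and collectUntilRepeat, and the stub in main that stands in for the Jsoup POST request are hypothetical names, not part of Jsoup or of the code above. The sketch uses only the standard library so that it can be run without network access.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not part of Jsoup.
class PagingSketch {

    // Matches the value of a title="..." attribute; [^"]* stops at the first closing quote.
    private static final Pattern TITLE = Pattern.compile("title=\"([^\"]*)\"");

    // Returns the title attribute of a serialized <a> element, or null if there is none.
    static String extractTitle(String anchorHtml) {
        Matcher m = TITLE.matcher(anchorHtml);
        return m.find() ? m.group(1) : null;
    }

    // Requests pages at offsets 0, pageSize, 2 * pageSize, ... and stops as soon as
    // a page is empty or contains an item that was already collected.
    static List<String> collectUntilRepeat(IntFunction<List<String>> fetchPage, int pageSize) {
        Set<String> seen = new LinkedHashSet<>();
        for (int start = 0; ; start += pageSize) {
            List<String> page = fetchPage.apply(start);
            if (page.isEmpty()) {
                break;
            }
            for (String item : page) {
                if (!seen.add(item)) {
                    // The server wrapped around to a page we already have.
                    return new ArrayList<>(seen);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Stub standing in for the POST request: two pages of data, and any
        // out-of-range offset returns the first page again, as described above.
        List<String> all = List.of("A", "B", "C", "D");
        IntFunction<List<String>> stub = start -> start < all.size()
                ? all.subList(start, Math.min(start + 2, all.size()))
                : all.subList(0, 2);
        System.out.println(collectUntilRepeat(stub, 2));
        System.out.println(extractTitle("<a class=\"title\" title=\"Some Game\" href=\"#\">Some Game</a>"));
    }
}
```

In the real program, fetchPage would wrap the Jsoup POST call and return the titles selected from each response. Reading the attribute directly with Jsoup's attr("title") on the selected element would probably make the regex unnecessary.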
