Question:

Problem scraping a website with Java Jsoup: the site does not "scroll"

施彦

2023-03-14

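My question is about whether it is possible to scrape data from a particular website. At the moment, my algorithm converts the HTML to text, then checks it for the flag words listed in a file and sums the number of flags found.

My problem is that I cannot "scroll" down while scraping the site. As you can see, the code counts the flags on a Twitter account, but it is limited to the 50 or so most recent tweets. I hope I have made myself clear.

P.S. I used Twitter only as an example. I am not looking for something Twitter-specific, but for something more robust.

I would be very grateful for any hints.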

Health Concern + H1N1: 0 instances

Infrastructure Security: 2 instances

Southwest Border Violence: 1 instances

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.*;
import java.util.LinkedList;
import java.util.List;

public class Main {

static List<String> dhsAndOtherAgencies = new LinkedList<>();
static List<String> domesticSecurity = new LinkedList<>();
static List<String> hazmatNuclear = new LinkedList<>();
static List<String> healthConcern = new LinkedList<>();
static List<String> infrastructureSecurity = new LinkedList<>();
static List<String> southwestBorderViolence = new LinkedList<>();
static List<String> terrorism = new LinkedList<>();
static List<String> weatherDisasterEmergency = new LinkedList<>();
static List<String> cyberSecutiry = new LinkedList<>();
static String stream;



public static void main(String[] args) throws IOException {



    createLists();
    createStream();
    Raport raport = generateReport();
    System.out.println(raport);

}

// Counts how many flag words from the list occur at least once in the page text.
static int flagStream(List<String> list, String stream){

    int counter = 0;
    for (String flag: list){
        if(stream.contains(flag)) {
            System.out.println(flag);
            counter++;
        }
    }

    return counter;
}

static Raport generateReport(){
    return new Raport(
            flagStream(dhsAndOtherAgencies,stream),
            flagStream(domesticSecurity,stream),
            flagStream(hazmatNuclear,stream),
            flagStream(healthConcern,stream),
            flagStream(infrastructureSecurity,stream),
            flagStream(southwestBorderViolence,stream),
            flagStream(terrorism,stream),
            flagStream(weatherDisasterEmergency,stream),
            flagStream(cyberSecutiry,stream)

    );

}
static void createStream() throws IOException {
    Document doc = Jsoup.connect("https://twitter.com/realDonaldTrump").userAgent("mozilla/17.0").get();
    stream = doc.text();
}
static void createLists() throws IOException {
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader read = new BufferedReader(new FileReader("clearListAllCases.txt"))) {

        String input;

        // Number of '#' header lines seen so far; selects the target category
        int hashCounter = 0;

        while ((input = read.readLine()) != null) {
            // Skip blank lines: charAt(0) would throw on an empty string,
            // and an empty flag would match every page
            if (input.isEmpty()) {
                continue;
            }
            if (input.charAt(0) == '#') {
                hashCounter++;
                continue;
            }
            switch (hashCounter) {
                case 1:
                    dhsAndOtherAgencies.add(input);
                    break;
                case 2:
                    domesticSecurity.add(input);
                    break;
                case 3:
                    hazmatNuclear.add(input);
                    break;
                case 4:
                    healthConcern.add(input);
                    break;
                case 5:
                    infrastructureSecurity.add(input);
                    break;
                case 6:
                    southwestBorderViolence.add(input);
                    break;
                case 7:
                    terrorism.add(input);
                    break;
                case 8:
                    weatherDisasterEmergency.add(input);
                    break;
                case 9:
                    cyberSecutiry.add(input);
                    break;
            }
        }
    }
}
}

 class Raport {

int a,b,c,d,e,f,g,h,i;
int totalFlags;

Raport(int a, int b, int c, int d, int e, int f, int g, int h, int i){
    this.a = a;
    this.b = b;
    this.c = c;
    this.d = d;
    this.e = e;
    this.f = f;
    this.g = g;
    this.h = h;
    this.i = i;
    totalFlags = a+b+c+d+e+f+g+h+i;
}



@Override
public String toString(){

    return "DHS & Other Agencies:\t\t\t"+a+" instances\n"+
            "Domestic security:\t\t\t\t"+b+" instances\n"+
            "HAZMAT & Nuclear:\t\t\t\t"+c+" instances\n"+
            "Health Concern + H1N1:\t\t\t"+d+" instances\n"+
            "Infrastructure Security:\t\t"+e+" instances\n"+
            "Southwest Border Violence:\t\t"+f+" instances\n"+
            "Terrorism:\t\t\t\t\t\t"+g+" instances\n"+
            "Weather/Disaster/Emergency:\t\t"+h+" instances\n"+
            "Cyber Security:\t\t\t\t\t"+i+" instances\n"+
            "TOTAL FLAGS:\t\t\t\t\t"+totalFlags+" instances";
}

}

1 answer

令狐献

2023-03-14

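I suggest opening your browser's developer tools and trying to find out which URL/endpoint the site uses to fetch new items for its infinite scroll, because Jsoup itself does not execute JavaScript. You can then call that endpoint with Jsoup and parse the result.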
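As a rough illustration of the endpoint approach, the sketch below follows a paginated endpoint until it reports no further page and concatenates the text of every page. Everything in it is an assumption rather than any site's real API: the `EndpointPager` and `Page` names, the idea that each response carries a "next cursor", and the fake fetcher used in `main`. Only the standard library is used so the loop can run offline; a real fetcher could download the response body with Jsoup, as noted in the comment.

```java
import java.util.function.Function;

/** Sketch: follow a (hypothetical) paginated endpoint until no cursor is left. */
final class EndpointPager {

    /** One response: the page text plus the cursor for the next page. */
    static final class Page {
        final String text;
        final String nextCursor; // null when there are no more pages

        Page(String text, String nextCursor) {
            this.text = text;
            this.nextCursor = nextCursor;
        }
    }

    /**
     * Fetches pages starting at startCursor and concatenates their text,
     * one page per line. maxPages bounds the loop so a misbehaving
     * endpoint cannot keep it running forever.
     */
    static String collectText(Function<String, Page> fetcher, String startCursor, int maxPages) {
        StringBuilder all = new StringBuilder();
        String cursor = startCursor;
        for (int i = 0; i < maxPages && cursor != null; i++) {
            Page page = fetcher.apply(cursor);
            all.append(page.text).append('\n');
            cursor = page.nextCursor;
        }
        return all.toString();
    }

    public static void main(String[] args) {
        // Stand-in fetcher with three fixed pages. A real one could fetch the
        // endpoint body with Jsoup, for example
        //   Jsoup.connect(url).ignoreContentType(true).execute().body()
        // and extract the text and the next cursor from that body.
        Function<String, Page> fake = cursor -> {
            switch (cursor) {
                case "0": return new Page("first batch", "1");
                case "1": return new Page("second batch", "2");
                default:  return new Page("last batch", null);
            }
        };
        System.out.print(collectText(fake, "0", 10));
    }
}
```

The string returned by `collectText` could take the place of `stream` in the question's code, so `flagStream` would see every fetched page instead of only the first one.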

If that does not work, you may be better off with HtmlUnit or Selenium, since both give you a full-featured web browser API that you can control from Java.
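With a browser-driving library, one common pattern is to scroll to the bottom repeatedly until the page height stops growing. The sketch below shows only that loop, with the browser hidden behind a `Runnable` and a `LongSupplier` so it runs without any dependency. The `ScrollUntilStable` name and the fake page in `main` are illustrative assumptions; the comments indicate which Selenium calls could be plugged in.

```java
import java.util.function.LongSupplier;

/** Sketch: keep scrolling until the page stops getting taller. */
final class ScrollUntilStable {

    /**
     * Calls scrollToBottom until the reported page height stops growing
     * or maxScrolls is reached. Returns the number of scrolls performed.
     *
     * With Selenium, scrollToBottom could run
     *   ((JavascriptExecutor) driver).executeScript(
     *       "window.scrollTo(0, document.body.scrollHeight);")
     * followed by a short wait for new items to load, and pageHeight could read
     *   (Long) ((JavascriptExecutor) driver).executeScript(
     *       "return document.body.scrollHeight")
     */
    static int scrollToEnd(Runnable scrollToBottom, LongSupplier pageHeight, int maxScrolls) {
        long lastHeight = pageHeight.getAsLong();
        int scrolls = 0;
        while (scrolls < maxScrolls) {
            scrollToBottom.run();
            scrolls++;
            long newHeight = pageHeight.getAsLong();
            if (newHeight <= lastHeight) {
                break; // nothing new was loaded, so stop
            }
            lastHeight = newHeight;
        }
        return scrolls;
    }

    public static void main(String[] args) {
        // Fake page that grows by 1000 px per scroll until it reaches 3000 px.
        long[] height = {1000};
        Runnable scroll = () -> { if (height[0] < 3000) height[0] += 1000; };
        int n = scrollToEnd(scroll, () -> height[0], 50);
        System.out.println(n + " scrolls, final height " + height[0]);
    }
}
```

Once the loop finishes, the fully loaded HTML (for example from Selenium's `driver.getPageSource()`) can be passed to `Jsoup.parse(html).text()` and checked for flags as before.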
