Question:

Jsoup Reddit scraper 429 error

周浩博

2023-03-14

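So I'm trying to use jsoup to scrape Reddit for images, but when I scrape certain subreddits (such as /r/wallpaper) I get a 429 error, and I'd like to know how to fix it. I fully understand that this code is bad and that this is a pretty basic question, but I'm completely new to this. Anyway: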

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class javascraper{

public static void main (String[]args) throws MalformedURLException
{
    Scanner scan = new Scanner (System.in);
    System.out.println("Where do you want to store the files?");
    String folderpath = scan.next();
    System.out.println("What subreddit do you want to scrape?");
    String subreddit = scan.next();
    subreddit = ("http://reddit.com/r/" + subreddit);
    new File(folderpath + "/" + subreddit).mkdir();

    //test

    try{
        //gets http protocol
        Document doc = Jsoup.connect(subreddit).timeout(0).get();

        //get page title
        String title = doc.title();
        System.out.println("title : " + title);

        //get all links
        Elements links = doc.select("a[href]");

        for(Element link : links){

            //get value from href attribute
            String checkLink = link.attr("href");
            Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
            if (imgCheck(checkLink)){ // checks to see if the link is an image link
                System.out.println("link : " + link.attr("href"));
                downloadImages(checkLink, folderpath);
            }
        }
    }
    catch (IOException e){
        e.printStackTrace();
    }
}

public static boolean imgCheck(String http){
    String png = ".png";
    String jpg = ".jpg";
    String jpeg = "jpeg"; // no period so checker will only check last four characters
    String gif = ".gif";
    int length = http.length();

    if (http.contains(png)|| http.contains("gfycat") || http.contains(jpg)|| http.contains(jpeg) || http.contains(gif)){
        return true;
    }
    else{
        return false;
    }
}

private static void downloadImages(String src, String folderpath) throws IOException{
    String folder = null;

    //Extract the name of the image from the src attribute

    int indexname = src.lastIndexOf("/");

    if (indexname == src.length()) {
        src = src.substring(1, indexname);
    }
    indexname = src.lastIndexOf("/");

    String name = src.substring(indexname, src.length());

    System.out.println(name);

    //Open a URL Stream

    URL url = new URL(src);

    InputStream in = url.openStream();

    OutputStream out = new BufferedOutputStream(new FileOutputStream( folderpath+ name));

    for (int b; (b = in.read()) != -1;) {

        out.write(b);

    }

    out.close();

    in.close();
}

}

2 answers

梁浩

2023-03-14

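As you can look up on Wikipedia, the 429 status code tells you that you have sent too many requests:

The user has sent too many requests in a given amount of time. Intended for use with rate-limiting schemes.

The solution is to slow down your scraper. There are several ways to do this; one of them is to sleep between requests.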
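As a rough illustration of the sleep approach, the sketch below enforces a minimum gap between consecutive requests. The class name RequestThrottle and its await method are hypothetical helpers made up for this example, and the 2000 ms interval is only an assumed value; use whatever limit the site actually asks for.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: enforces a minimum gap between consecutive requests. */
class RequestThrottle {
    private final long minIntervalNanos;
    private long lastRequestNanos;
    private boolean hasRun = false;

    RequestThrottle(long minIntervalMillis) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
    }

    /** Blocks until at least the minimum interval has passed since the previous call. */
    synchronized void await() {
        if (hasRun) {
            long waitNanos = minIntervalNanos - (System.nanoTime() - lastRequestNanos);
            if (waitNanos > 0) {
                try {
                    TimeUnit.NANOSECONDS.sleep(waitNanos);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // restore the interrupt flag
                }
            }
        }
        lastRequestNanos = System.nanoTime();
        hasRun = true;
    }

    public static void main(String[] args) {
        RequestThrottle throttle = new RequestThrottle(2000); // assumed 2-second gap
        for (int i = 1; i <= 3; i++) {
            throttle.await(); // the real page or image request would follow this call
            System.out.println("request " + i + " allowed");
        }
    }
}
```

In the scraper from the question, a single shared instance could be consulted right before each Jsoup.connect(...) or image download, so that every request goes through the same delay.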

通建安

2023-03-14

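Your problem is caused by your scraper violating reddit's API rules. Error 429 means "Too many requests": you are requesting too many pages too quickly.

You can make one request every 2 seconds, and you also need to set a proper user agent (the format they recommend is

To fix it, first add this to the start of your class, before the main method: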

public static final String USER_AGENT = "<PUT YOUR USER AGENT HERE>";

(Make sure to specify an actual user agent.)

Then change this (in downloadImages)

URL url = new URL(src);
InputStream in = url.openStream();

to this:

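URLConnection connection = (new URL(src)).openConnection();
connection.setRequestProperty("User-Agent", USER_AGENT);

try {
    Thread.sleep(2000); //Delay to comply with rate limiting
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); //Thread.sleep throws a checked exception, so handle it here
}

InputStream in = connection.getInputStream();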

You'll also need to change this (in main)

Document doc = Jsoup.connect(subreddit).timeout(0).get();

to this:

Document doc = Jsoup.connect(subreddit).userAgent(USER_AGENT).timeout(0).get();

After that, your code should stop running into that error.
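Putting the download-side changes together, a self-contained version of the download step might look like the sketch below. This is one possible arrangement and not the asker's exact method: the class name PoliteDownloader, the download signature, and the delayMillis parameter are assumptions made for this example, and the USER_AGENT value is still a placeholder. It also takes the file name from the text after the last "/" and uses try-with-resources so the stream is closed even if the copy fails.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class PoliteDownloader {
    // Placeholder only; replace with a real, descriptive user agent.
    static final String USER_AGENT = "<PUT YOUR USER AGENT HERE>";

    /** Waits delayMillis, then saves the resource at src into folder and returns the saved path. */
    static Path download(String src, Path folder, long delayMillis) throws IOException {
        // Use the text after the last '/' as the file name.
        String name = src.substring(src.lastIndexOf('/') + 1);
        if (name.isEmpty()) {
            throw new IOException("No file name in URL: " + src);
        }
        try {
            Thread.sleep(delayMillis); // delay to comply with rate limiting
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting between requests", e);
        }
        URLConnection connection = URI.create(src).toURL().openConnection();
        connection.setRequestProperty("User-Agent", USER_AGENT);
        Path target = folder.resolve(name);
        // try-with-resources closes the stream even if the copy fails
        try (InputStream in = connection.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: PoliteDownloader <url> <folder>");
            return;
        }
        Path saved = download(args[0], Paths.get(args[1]), 2000);
        System.out.println("saved " + saved);
    }
}
```

The delay is passed in as a parameter only so that it is easy to adjust; with the 2-second figure mentioned above, the call would be download(link, folder, 2000).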

Note that using reddit's API (i.e., /r/subreddit.json instead of /r/subreddit) would probably make this project easier, but it isn't required and your current code will work.
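For reference, a request for the JSON listing could be built roughly as follows. This sketch uses the JDK's java.net.http client (Java 11+) rather than jsoup, which is a substitution made for this example. The class name RedditListing, both method names, the https://www.reddit.com host, and the 30-second timeout are all assumptions. Parsing the returned JSON is left out, since it would need a JSON library.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class RedditListing {
    // Placeholder only; replace with a real, descriptive user agent.
    static final String USER_AGENT = "<PUT YOUR USER AGENT HERE>";

    /** Builds a GET request for /r/<subreddit>.json with the user agent set. */
    static HttpRequest listingRequest(String subreddit) {
        // Host and scheme are assumptions; the path follows the /r/subreddit.json form.
        URI uri = URI.create("https://www.reddit.com/r/" + subreddit + ".json");
        return HttpRequest.newBuilder(uri)
                .header("User-Agent", USER_AGENT)
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
    }

    /** Sends the request and returns the raw JSON text, failing loudly on a 429. */
    static String fetchListing(String subreddit) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(listingRequest(subreddit), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 429) {
            throw new IOException("Rate limited (HTTP 429); slow down and retry later");
        }
        return response.body();
    }

    public static void main(String[] args) {
        String subreddit = args.length > 0 ? args[0] : "wallpaper";
        HttpRequest request = listingRequest(subreddit);
        // Only prints the request; call fetchListing(subreddit) to actually send it.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

The same delay between requests still applies when using the JSON endpoint.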
