Scraping a webpage generated by JavaScript with C#

邵赞

2023-03-14

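Question:

I have a WebBrowser control and a Label in Visual Studio, and what I am trying to do is grab one section of another webpage.

I tried WebClient.DownloadString and WebClient.DownloadFile, and both give me the source of the page as it is before the JavaScript loads the content. My next idea was to use the WebBrowser control and read webBrowser.DocumentText once the page had loaded, but that did not work either: it still gives me the original source of the page.

Is there a way to get the page as it looks after the JavaScript has run?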

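Answer:

The problem is that a browser normally executes the JavaScript, and the result is an updated DOM. Unless you can analyze the JavaScript or intercept the data it uses, you will need to execute the code the way a browser would. I ran into the same issue in the past and used selenium and PhantomJS to render the page. Once the page is rendered, I use the WebDriver client to navigate the DOM and retrieve the content I need, after the AJAX calls have run.

At a high level, these are the steps:

Installed selenium
Started the selenium hub as a service
Downloaded phantomjs (a headless browser that can execute the javascript)
Started phantomjs in webdriver mode, pointing to the selenium hub
Installed the webdriver client nuget package in my scraping application: Install-Package Selenium.WebDriver

Here is an example usage of the phantomjs webdriver: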

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;
using OpenQA.Selenium.Remote;

var options = new PhantomJSOptions();
options.AddAdditionalCapability("IsJavaScriptEnabled", true);

// Configuration.SeleniumServerHub holds the URL of the selenium hub
var driver = new RemoteWebDriver(new Uri(Configuration.SeleniumServerHub),
                    options.ToCapabilities(),
                    TimeSpan.FromSeconds(3) // timeout for each command sent to the hub
                  );
// load the page; the driver executes its scripts like a browser would
driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");
// the driver can now provide you with what you need
// get the source of the page
var source = driver.PageSource;
// fully navigate the dom
var pathElement = driver.FindElement(By.Id("some-id"));
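
Since the content arrives through AJAX, the element may not exist yet at the moment the lookup runs. Below is a minimal sketch of an explicit wait, not part of the original answer. It assumes the wait classes in OpenQA.Selenium.Support.UI are available (depending on the Selenium version they may come from the separate Selenium.Support nuget package). The names ScrapeHelpers and WaitForElement are hypothetical.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class ScrapeHelpers
{
    // Hypothetical helper: polls until the element located by `by` is present,
    // or throws WebDriverTimeoutException once `timeout` has elapsed.
    public static IWebElement WaitForElement(IWebDriver driver, By by, TimeSpan timeout)
    {
        var wait = new WebDriverWait(driver, timeout);
        // FindElement throws NoSuchElementException while the element is missing;
        // tell the wait to ignore it and keep polling.
        wait.IgnoreExceptionTypes(typeof(NoSuchElementException));
        return wait.Until(d => d.FindElement(by));
    }
}
```

With this in place the lookup could be written as ScrapeHelpers.WaitForElement(driver, By.Id("some-id"), TimeSpan.FromSeconds(10)), where the 10 second timeout is an arbitrary value to tune for the target page.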

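EDIT: Easier method

It appears there is a nuget package for phantomjs, so you don't need the hub (I used a cluster to do massive scraping in this manner):

Install the web driver:

Install-Package Selenium.WebDriver

Install the embedded exe:

Install-Package phantomjs.exe

Updated code: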

using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

var driver = new PhantomJSDriver();
// load the page; the driver executes its scripts like a browser would
driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");
// the driver can now provide you with what you need
// get the source of the page
var source = driver.PageSource;
// fully navigate the dom
var pathElement = driver.FindElement(By.Id("some-id"));
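
Each driver instance starts its own browser process, so when scraping many pages it helps to shut the driver down once the content has been read. The following is a rough sketch of one way to do that, not taken from the original answer. The names Scraper and ScrapeText are hypothetical, and the driver is passed in as a factory so the same helper works with either the remote driver or the embedded one.

```csharp
using System;
using OpenQA.Selenium;

public static class Scraper
{
    // Hypothetical helper: loads `url`, returns the visible text of the element
    // with the given id, and always quits the driver afterwards.
    public static string ScrapeText(Func<IWebDriver> createDriver, string url, string elementId)
    {
        IWebDriver driver = createDriver();
        try
        {
            driver.Navigate().GoToUrl(url);
            return driver.FindElement(By.Id(elementId)).Text;
        }
        finally
        {
            // Quit closes the session and ends the underlying browser process,
            // even if the lookup above threw.
            driver.Quit();
        }
    }
}
```

A call might look like Scraper.ScrapeText(() => new PhantomJSDriver(), url, "some-id"). For AJAX-heavy pages the FindElement call would likely need to be combined with a wait.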
