How to scrape with Ruby and Nokogiri and map the data

尚嘉庆
2023-12-01

by Andrew Bales

How to scrape with Ruby and Nokogiri and map the data

Sometimes you want to grab data from a website for your own project. So what do you use? Ruby, Nokogiri, and JSON to the rescue!

Recently, I was working on a project to map data about bridges. Using Nokogiri, I was able to capture a city’s bridge data from a table. I then used links within that same table to scrape associated pages. Finally, I converted the scraped data to JSON and used it to populate a Google Map.

This article walks you through the tools I used and how the code works!

See the full code on my GitHub repo.

Live map demo here.

The Project

My goal was to take a table from a bridge data website and turn it into a Google map with geolocated pins that would produce informational popups for each bridge.

To make this happen, I’d need to:

  1. Scrape data from the original website.
  2. Convert that data into a JSON object.
  3. Apply that data to make a new, interactive map.

Your project will vary, surely — how many people are trying to map antique bridges? — but I hope this process will prove useful for your context.

Nokogiri

Ruby has an amazing web scraping gem called Nokogiri. Among other features, it allows you to search HTML documents by CSS selectors. That means if we know the ids, classes, or even types of elements where the data is stored in the DOM, we’re able to pluck it out.

The scraper

If you’re following along with the GitHub repo, you can find my scraper in bridges_scraper.rb.

    require 'open-uri'
    require 'nokogiri'
    require 'json'

Open-uri lets us open the HTML like a file and pass it to Nokogiri for the heavy lifting.

In the code below, I’m passing the HTML fetched from the bridge data URL over to Nokogiri. I then find the table element holding the data, search for its rows, and iterate through them.

    url = 'https://bridgereports.com/city/wichita-kansas/'
    html = URI.open(url) # Ruby 3.0+ needs URI.open; a bare open(url) no longer fetches URLs
    doc = Nokogiri::HTML(html)
    bridges = []
    table = doc.at('table')

    table.search('tr').each do |tr|
      cells = tr.search('th, td') # every header and data cell in this row
      bridges.push(
        carries: cells[1].text,
        crosses: cells[2].text,
        location: cells[3].text,
        design: cells[4].text,
        status: cells[5].text,
        year_build: cells[6].text.to_i,
        year_recon: cells[7].text,
        span_length: cells[8].text.to_f,
        total_length: cells[9].text.to_f,
        condition: cells[10].text,
        suff_rating: cells[11].text.to_f,
        id: cells[12].text.to_i
      )
    end

    # Serialize the collected rows and write them to disk.
    json = JSON.pretty_generate(bridges)
    File.open("data.json", 'w') { |file| file.write(json) }

Nokogiri has lots of methods (here’s a cheat sheet and a starter guide!). We’re using just a few.

The table is found with .at('table'), which returns the first occurrence of a table element in the DOM. This works just fine for this relatively simple page.

With the table in hand, .search('tr') returns a collection of the row elements (a Nokogiri NodeSet) that we iterate over with .each. In each row, the data is cleaned up and pushed into a single entry for the bridges array.

After all the rows are collected, the data is converted into JSON and saved in a new file called “data.json”.
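
If you want to convince yourself the file holds what you expect, a quick round trip helps. This sketch writes a couple of made-up records to a temporary directory and reads them back; only the JSON and File calls mirror the scraper.

```ruby
require 'json'
require 'tmpdir'

# Illustrative records; the field names mirror the scraper, the values are made up.
bridges = [
  { carries: 'Main St', year_build: 1932, span_length: 41.5 },
  { carries: 'River Rd', year_build: 1967, span_length: 88.0 }
]

round_trip = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, 'data.json')

  # pretty_generate adds indentation and newlines, which keeps the file readable.
  File.write(path, JSON.pretty_generate(bridges))

  # Reading it back: symbolize_names restores symbol keys like :carries.
  round_trip = JSON.parse(File.read(path), symbolize_names: true)
end

puts round_trip == bridges
```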

Combining data from multiple pages

In this case, I needed information from other associated pages. Specifically, I needed the latitude and longitude of each bridge, which did not appear in the table. However, I found that the link in the first cell of each row led to a page that did provide those details.

I needed to write code that did a few things:

  • Gathered links from the first cell of each row.
  • Created a new Nokogiri object from the HTML on the linked page.
  • Plucked out the latitude and longitude.
  • Paused the program until that process completed.

    table.search('tr').each do |tr|
      cells = tr.search('th, td')

      # Collect every link in the first cell, keyed by its link text.
      links = {}
      cells[0].css('a').each do |a|
        links[a.text] = a['href']
      end

      got_coords = false

      if links['NBI report']
        nbi = links['NBI report']
        report = "https://bridgereports.com" + nbi
        report_html = URI.open(report) # blocks until the response arrives
        sleep 1 until report_html
        r = Nokogiri::HTML(report_html)

        lat = r.css('span.latitude').text.strip.to_f
        long = r.css('span.longitude').text.strip.to_f

        got_coords = true
      else
        # No report link for this row; lat and long stay nil.
        got_coords = true
      end

      sleep 1 until got_coords == true

      bridges.push(
        links: links,
        latitude: lat,
        longitude: long,
        carries: cells[1].text,
        ..., # all other previous key/value pairs
      )
    end
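
The scraper builds the report URL by concatenating the site root with the href. A slightly more forgiving alternative, sketched here with hypothetical href values, is URI.join from Ruby's standard library, which also copes with links that are already absolute.

```ruby
require 'uri'

base_url = 'https://bridgereports.com'

# Hypothetical hrefs, as they might appear in a table cell; real paths will differ.
links = {
  'NBI report' => '/example/report-path',
  'External' => 'https://example.com/elsewhere'
}

# URI.join resolves relative paths against the base and leaves absolute URLs alone.
absolute = links.transform_values { |href| URI.join(base_url, href).to_s }

puts absolute['NBI report']
puts absolute['External']
```

Plain string concatenation is fine when every href is a root-relative path, which appears to be the case here; URI.join only matters once the link formats start to vary.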

A few additional things are worth pointing out here:

  • I’m using got_coords as a simple boolean flag. It starts out false and is set to true once the coordinates are captured OR found to be unavailable.
  • The latitude and longitude are located in spans with corresponding classes. That makes retrieving the data simple: .css('span.latitude'). This is followed by .text, .strip, and .to_f, which 1) get the text from the span, 2) strip any excess whitespace, and 3) convert the string to a float.

JSON → Google Map

The newly formed JSON object has to be modified a touch to fit the Google Maps API. I did this with JavaScript inside map.js.

The JSON data is accessible within map.js because it has been moved to the JS folder, assigned to a variable called “bridge_data”, and included in a <script> tag in index.html.
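
That step can be done by hand, but it can also be scripted. The following is only a sketch of one way to do it from Ruby: the bridge_data.js file name and the temporary directory are placeholders, and the record is made up.

```ruby
require 'json'
require 'tmpdir'

bridges = [{ id: 1, latitude: 37.6872, longitude: -97.3301 }] # made-up record

js_source = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, 'bridge_data.js') # hypothetical file name

  # Prefix the JSON with an assignment so a <script> tag exposes it as a global.
  File.write(path, "var bridge_data = #{JSON.pretty_generate(bridges)};\n")
  js_source = File.read(path)
end

puts js_source
```

Because JSON is valid JavaScript object syntax, the generated file needs no further escaping before a script tag loads it.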

All right! We’ll now convert the JSON file (assigned to the variable bridge_data) to a new array that’s usable by Google Maps.

    const locations = bridge_data.map(function(b) {
      var mapEntry = [];
      var info = "<b>Built In: </b>" + b.year_build + "<br>" +
                 "<b>Span Length: </b>" + b.span_length + " ft<br>" +
                 "<b>Total Length: </b>" + b.total_length + " ft<br>" +
                 "<b>Condition: </b>" + b.condition + "<br>" +
                 "<b>Design: </b>" + b.design + "<br>";
      mapEntry.push(
        info,
        b.latitude,
        b.longitude,
        b.id
      );
      return mapEntry;
    });

I’m using .map to create a new two-dimensional array called “locations”. Each entry holds info, the text that appears in the Google Maps popup when the user clicks that pin on the map, along with the latitude, longitude, and unique bridge ID.

The result is a Google Map that plots the array of locations with info-rich popups for each bridge!

Translated from: https://www.freecodecamp.org/news/how-to-scrape-with-ruby-and-nokogiri-and-map-the-data-bd9febb5e18a/
