Web scraping is a tricky but necessary part of some applications. In this article, we’re going to explore some principles to keep in mind when writing a web scraper. We’ll also look at what tools Rust has to make writing a web scraper easier.
What we’ll cover:

- What is web scraping?
- Web scraping principles
- Building a web scraper with Rust
- Conclusion
What is web scraping?
Web scraping refers to gathering data from a webpage in an automated way. If you can load a page in a web browser, you can load it into a script and parse the parts you need out of it!
However, web scraping can be pretty tricky. HTML isn’t designed to be a structured data format, so you usually have to dig around a bit to find the relevant parts of the page.
If the data you want is available in another way — either through some sort of API call, or in a structured format like JSON, XML, or CSV — it will almost certainly be easier to get it that way instead. Web scraping can be a bit of a last resort because it can be cumbersome and brittle.
The details of web scraping highly depend on the page you’re getting the data from. We’ll look at an example below.
Web scraping principles
Let’s go over some general principles of web scraping that are good to follow.
Be a good citizen when writing a web scraper
When writing a web scraper, it’s easy to accidentally make a bunch of web requests quickly. This is considered rude, as it might swamp smaller web servers and make it hard for them to respond to requests from other clients.
Also, it might be considered a denial-of-service (DoS) attack, and it’s possible your IP address could be blocked, either manually or automatically!
The best way to avoid this is to put a small delay in between requests. The example we’ll look at later on in this article has a 500ms delay between requests, which should be plenty of time to not overwhelm the web server.
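As a quick illustration of the idea (this loop is not code from the example project, and the URLs are just placeholders), a caller-side delay between requests can be as simple as this:

use std::thread;
use std::time::Duration;

fn main() {
    // Placeholder list of pages to fetch
    let urls = ["https://example.com/page1", "https://example.com/page2"];

    for url in urls {
        println!("Fetching {}...", url);
        // ... do the actual request here ...

        // Pause before the next request so we don't overwhelm the server
        thread::sleep(Duration::from_millis(500));
    }
}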
Aim for robust web scraper solutions
As we’ll see in the example, a lot of the HTML out there is not designed to be read by humans, so it can be a bit tricky to figure out how to locate the data to extract.
One option is to do something like finding the seventh p element in the document. But this is very fragile; if the HTML page changes even a tiny bit, the seventh p element could easily be something different.
It’s better to try to find something more robust that seems like it won’t change.
In the example we’ll look at below, to find the main data table, we find the table element that has the most rows, which should be stable even if the page changes significantly.
Validate, validate, validate!
Another way to guard against unexpected page changes is to validate as much as you can. Exactly what you validate will be pretty specific to the page you are scraping and the application you are using to do so.
In the example below, some of the things we validate include:
- If a row has any of the headers that we’re looking for, then it has all three of the ones we expect
- The values are all between 0 and 100,000
- The values are decreasing (we know to expect this because of the specifics of the data we’re looking at)
- After parsing the page, we’ve gotten at least 50 rows of data
It’s also helpful to include reasonable error messages to make it easier to track down what invariant has been violated when a problem occurs.
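As a rough illustration (this helper isn’t part of the example project), a handful of assert! calls with descriptive messages can cover checks like the ones above:

// Hypothetical validation helper for a column of survivor counts
fn validate_values(values: &[u32]) {
    // We expect a full table of ages, so fail loudly if we parsed too few rows
    assert!(
        values.len() >= 50,
        "Expected at least 50 rows of data, got {}",
        values.len()
    );

    for &value in values {
        // Counts start at 100,000 births, so nothing should exceed that
        assert!(value <= 100_000, "Value {} is out of the expected range", value);
    }

    for pair in values.windows(2) {
        // The number of survivors should never increase with age
        assert!(
            pair[0] >= pair[1],
            "Values are not decreasing: {} is followed by {}",
            pair[0],
            pair[1]
        );
    }
}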
Now, let’s look at an example of web scraping with Rust!
Building a web scraper with Rust
In this example, we are going to gather life expectancy data from the Social Security Administration (SSA). This data is available in “life tables” found on various pages of the SSA website.
The page we are using lists, for people born in 1900, their chances of surviving to various ages. The SSA provides a much more comprehensive explanation of these life tables, but we don’t need to read through the entire study for this article.
The table is split into two parts, male and female. Each row of the table represents a different age (that’s the “x” column). The various other columns show different statistics about survival rates at that age.
For our purposes, we care about the “lx” column, which starts with 100,000 babies born (at age 0) and shows how many are still alive at a given age. This is the data we want to capture and save into a JSON file.
The SSA provides this data for babies born every 10 years from 1900-2100 (I assume the data in the year 2100 is just a projection, unless they have time machines over there!). We’d like to capture all of it.
One thing to notice: in 1900, 14 percent of babies didn’t survive to age one! In 2020, that number was more like 0.5 percent. Hooray for modern medicine!
The HTML table itself is kind of weird; because it’s split up into male and female, there are essentially two tables in one table element, a bunch of header rows, and blank rows inserted every five years to make it easier for humans to read. We’ll have to deal with all of this while building our Rust web scraper.
The example code is in this GitHub repo. Feel free to follow along as we look at different parts of the scraper!
Fetching the page with the Rust reqwest crate
First, we need to fetch the webpage. We will use the reqwest crate for this step. This crate has powerful ways to fetch pages in an async way in case you’re doing a bunch of work at once, but for our purposes, using the blocking API is simpler.
Note that to use the blocking API, you need to add the “blocking” feature to the reqwest dependency in your Cargo.toml file; see an example at line nine of the file in the GitHub repo.
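That dependency line looks roughly like this (the version number here is just illustrative; check the repo for the one actually used):

[dependencies]
# The "blocking" feature enables reqwest's blocking client alongside the async one
reqwest = { version = "0.11", features = ["blocking"] }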
Fetching the page is done in the do_throttled_request() method in scraper_utils.rs. Here’s a simplified version of that code:
// Do a request for the given URL, with a minimum time between requests
// to avoid overloading the server.
pub fn do_throttled_request(url: &str) -> Result<String, Error> {
    // See the real code for the throttling - it's omitted here for clarity
    let response = reqwest::blocking::get(url)?;
    response.text()
}
At its core, this method is pretty simple: do the request and return the body as a String. We’re using the ? operator to do an early return on any error we encounter — for example, if our network connection is down.
Interestingly, the text() method can also fail, and we just return that as well. Remember that since the last line doesn’t have a semicolon at the end, it’s the same as doing the following, but a bit more idiomatic for Rust:
return response.text();
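The throttling that was omitted above boils down to remembering when the last request happened and sleeping for whatever is left of the 500ms window. Here is one way it could be written; this is a sketch of the idea, not the exact code from the repo:

use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

const DELAY: Duration = Duration::from_millis(500);

// Track when the last request was made (None before the first request)
static LAST_REQUEST: Mutex<Option<Instant>> = Mutex::new(None);

pub fn do_throttled_request(url: &str) -> Result<String, reqwest::Error> {
    {
        let mut last_request = LAST_REQUEST.lock().unwrap();
        if let Some(last) = *last_request {
            // If the previous request was less than DELAY ago, wait out the rest
            let elapsed = last.elapsed();
            if elapsed < DELAY {
                thread::sleep(DELAY - elapsed);
            }
        }
        *last_request = Some(Instant::now());
    }

    let response = reqwest::blocking::get(url)?;
    response.text()
}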
Parsing the HTML with the Rust scraper crate
Now to the hard part! We will be using the appropriately named scraper crate, which is based on the Servo project, which in turn shares code with Firefox. In other words, it’s an industrial-strength parser!
The parsing is done using the parse_page() method in the main.rs file. Let’s break it down into steps.
First, we parse the document. Notice that the parse_document() call below doesn’t return an error and thus can’t fail, which makes sense since this is code coming from a real web browser. No matter how badly formed the HTML is, the browser has to render something!
let document = Html::parse_document(&body);

// Find the table with the most rows
let main_table = document.select(&TABLE).max_by_key(|table| {
    table.select(&TR).count()
}).expect("No tables found in document?");
Next, we want to find all the tables in the document. The select() call allows us to pass in a CSS selector and returns all the nodes that match that selector.
CSS selectors are a very powerful way to specify which nodes you want. For our purposes, we just want to select all table nodes, which is easy to do with a simple type selector:
static ref TABLE: Selector = make_selector("table");
Once we have all of the table nodes, we want to find the one with the most rows. We will use the max_by_key() method, and for the key we get the number of rows in the table.
Nodes also have a select() method, so we can use another simple selector to get all the descendants that are rows and count them:
static ref TR: Selector = make_selector("tr");
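The static ref syntax in these snippets comes from the lazy_static crate, and make_selector() is presumably just a thin wrapper around Selector::parse(). Here is a sketch of how those pieces might be declared (the repo’s helper may differ slightly; the TD selector shows up in the next snippet):

use lazy_static::lazy_static;
use scraper::Selector;

// Selector::parse returns a Result, so this wrapper just panics on a bad
// selector string - acceptable here, since the selectors are hard-coded.
fn make_selector(selector: &str) -> Selector {
    Selector::parse(selector).expect("Invalid CSS selector")
}

lazy_static! {
    static ref TABLE: Selector = make_selector("table");
    static ref TR: Selector = make_selector("tr");
    static ref TD: Selector = make_selector("td");
}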
Now it’s time to find out which columns have the “100,000” text. Here’s that code, with some parts omitted for clarity:
let mut column_indices: Option<ColumnIndices> = None;
for row in main_table.select(&TR) {
    // Need to collect this into a Vec<> because we're going to be iterating over it
    // multiple times.
    let entries = row.select(&TD).collect::<Vec<_>>();
    if column_indices.is_none() {
        let mut row_number_index: Option<usize> = None;
        let mut male_index: Option<usize> = None;
        let mut female_index: Option<usize> = None;
        // look for values of "0" (for the row number) and "100000"
        for (column_index, cell) in entries.iter().enumerate() {
            let text: String = get_numeric_text(cell);
            if text == "0" {
                // Only want the first column that has a value of "0"
                row_number_index = row_number_index.or(Some(column_index));
            } else if text == "100000" {
                // male columns are first
                if male_index.is_none() {
                    male_index = Some(column_index);
                } else if female_index.is_none() {
                    female_index = Some(column_index);
                } else {
                    panic!("Found too many columns with text \"100000\"!");
                }
            }
        }
        assert_eq!(male_index.is_some(), female_index.is_some(),
            "Found male column but not female?");
        if let Some(male_index) = male_index {
            assert!(row_number_index.is_some(), "Found male column but not row number?");
            column_indices = Some(ColumnIndices {
                row_number: row_number_index.unwrap(),
                male: male_index,
                female: female_index.unwrap()
            });
        }
    }
For each row, if we haven’t found the column indices we need, we’re looking for a value of 0 for the age and 100000 for the male and female columns.
Note that the get_numeric_text() function takes care of removing any commas from the text. Also notice the number of asserts and panics here to guard against the format of the page changing too much — we’d much rather have the script error out than get incorrect data!
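Here is a plausible version of that helper (the real one in the repo may differ): it gathers all the text inside a cell and strips out commas and surrounding whitespace, so that “100,000” becomes “100000”:

use scraper::ElementRef;

// Collect the text content of a cell and remove commas and surrounding
// whitespace so the result can be parsed as a number
fn get_numeric_text(cell: &ElementRef) -> String {
    cell.text().collect::<String>().replace(',', "").trim().to_string()
}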
Finally, here’s the code that gathers all the data:
if let Some(column_indices) = column_indices {
    if entries.len() < column_indices.max_index() {
        // Too few columns, this isn't a real row
        continue
    }
    let row_number_text = get_numeric_text(&entries[column_indices.row_number]);
    if row_number_text.parse::<u32>().map(|x| x == next_row_number) == Ok(true) {
        next_row_number += 1;
        let male_value = get_numeric_text(&entries[column_indices.male]).parse::<u32>();
        let male_value = male_value.expect("Couldn't parse value in male cell");
        // The page normalizes all values by assuming 100,000 babies were born in the
        // given year, so scale this down to a range of 0-1.
        let male_value = male_value as f32 / 100000_f32;
        assert!(male_value <= 1.0, "male value is out of range");
        if let Some(last_value) = male_still_alive_values.last() {
            assert!(*last_value >= male_value, "male values are not decreasing");
        }
        male_still_alive_values.push(male_value);
        // Similar code for female values omitted
    }
}
This code just makes sure that the row number (i.e. the age) is the next expected value, and then gets the values from the columns, parses the number, and scales it down. Again, we do some assertions to make sure the values look reasonable.
Writing the data out to JSON
For this application, we wanted the data written out to a file in JSON format. We will use the json crate for this step. Now that we have all the data, this part is pretty straightforward:
fn write_data(data: HashMap<u32, SurvivorsAtAgeTable>) -> std::io::Result<()> {
    let mut json_data = json::object! {};
    let mut keys = data.keys().collect::<Vec<_>>();
    keys.sort();
    for &key in keys {
        let value = data.get(&key).unwrap();
        let json_value = json::object! {
            "female": value.female.clone(),
            "male": value.male.clone()
        };
        json_data[key.to_string()] = json_value;
    }

    let mut file = File::create("fileTables.json")?;
    write!(&mut file, "{}", json::stringify_pretty(json_data, 4))?;
    Ok(())
}
Sorting the keys isn’t strictly necessary, but it does make the output easier to read. We use the handy json::object! macro to easily create the JSON data and write it out to a file with write!. And we’re done!
Conclusion
Hopefully this article gives you a good starting point for doing web scraping in Rust.
With these tools, a lot of the work can be reduced to crafting CSS selectors to get the nodes you’re interested in, and figuring out what invariants you can use to assert that you’re getting the right ones in case the page changes!