Many online services do not offer APIs to give access to their public data. At the same time, they might have all this data available on their website. In such a circumstance, why not scrape it?
Web scraping is a complicated subject and — to perform it consistently — you may need an equally complex solution. In most cases, you do not need such a sophisticated system.
In fact, an API that is capable of scraping data on the fly from a template-consistent website should be enough.
Let’s see how to build such an API to scrape data from a particular website in Spring Boot.
Please, note that the code will be in Kotlin, but you can achieve the same result in Java as well.
1. Adding the Required Dependencies
First, you need a library to perform web scraping in Spring Boot. Since Kotlin is interoperable with Java, you can use any Java web scraping library. Out of the many options that are available, I highly recommend jsoup.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. — jsoup: Java HTML Parser
So, you need to add jsoup to your project’s dependencies.
If you are a Gradle user, add this dependency to your project’s build file:
implementation "org.jsoup:jsoup:1.13.1"
Otherwise, if you are a Maven user, add the following dependency to your project’s build POM:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
Now, you have all you need to start scraping data in Spring Boot. Note that jsoup is not the only option: other Java web scraping libraries work in Spring Boot as well.
2. Defining Your Scraping Logic
Since your scraping logic is based on how your target web page is structured, you must define it according to your goals. Be aware that every time the template of this page changes, you should update the logic accordingly.
The main advantage of defining an API to perform such an operation is that it scrapes data on the fly. This means that every time the API is called, up-to-date data is always returned.
In this tutorial, I am going to show how to build an API whose goal is to scrape the COVID-19 pandemic by country and territory Wikipedia page. Its purpose is to retrieve statistics on COVID-19 and return them in a human-readable format.
Firstly, you need to create a new connection to your target web page through the connect method. Note that you might need to set a valid user agent, specific headers, or cookies to prevent the connection from being refused.

Secondly, you can call the get() method to fetch and parse the desired HTML file. The result is a Document object, which offers everything you need to navigate the DOM and find, extract, and manipulate data. You can retrieve HTML elements either with DOM traversal methods or with CSS selectors.
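As a hedged illustration of this connection setup, the snippet below shows how these options can be combined with jsoup’s fluent API. The user-agent string, header, and cookie values are placeholders of my own choosing, not values that Wikipedia actually requires:

```kotlin
import org.jsoup.Jsoup

// A sketch of a more defensive connection setup; the user-agent string,
// header, and cookie values below are illustrative placeholders.
val document = Jsoup
    .connect("https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory")
    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") // present yourself as a browser
    .header("Accept-Language", "en-US")                     // optional request header
    .cookie("cookie-name", "cookie-value")                  // only if the site requires it
    .timeout(10_000)                                        // give up after 10 seconds
    .get()                                                  // fetch and parse into a Document
```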
This is what your scraping logic will look like:
fun retrieveCovidData() : List<CovidDataDto> {
    val covidDataList = ArrayList<CovidDataDto>()

    try {
        // retrieving the desired web page
        val webPage = Jsoup
            .connect("https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory")
            .get()

        val tbody = webPage
            .getElementById("thetable")
            .getElementsByTag("tbody")[0]

        val rows = tbody
            .children()
            .drop(2) // dropping the headers

        for (row in rows) {
            val country = row
                .getElementsByTag("a")[0]
                .text()

            val tds = row.getElementsByTag("td")

            // skipping the footer
            if (tds.size < 3)
                continue

            val cases = tds[0].text().replace(",", "").toIntOrNull()
            val deaths = tds[1].text().replace(",", "").toIntOrNull()
            val recoveries = tds[2].text().replace(",", "").toIntOrNull()

            covidDataList.add(
                CovidDataDto(
                    country,
                    cases,
                    deaths,
                    recoveries
                )
            )
        }
    } catch (e : HttpStatusException) {
        // an error occurred while connecting to the page
        // logging errors
        // ...

        throw e
    }

    return covidDataList
}
And here is the CovidDataDto.kt file:
class CovidDataDto {
    var country : String? = null
    var cases : Int? = null
    var deaths : Int? = null
    var recoveries : Int? = null

    constructor(
        country : String?,
        cases : Int?,
        deaths : Int?,
        recoveries : Int?
    ) {
        this.country = country
        this.cases = cases
        this.deaths = deaths
        this.recoveries = recoveries
    }

    constructor()
}
As you can see, CovidDataDto is just a DTO class used to carry data. Keep in mind that when dealing with such APIs, it may be useful to return CSV content directly. Spring Boot allows you to do so, as described here.
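As a minimal sketch of that idea, the controller below serializes the scraped data to CSV by hand. The "/covid/csv" path, the CovidCsvController name, and the manual string building are illustrative assumptions of mine, not part of the original article (a real project might prefer a CSV library):

```kotlin
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.RequestMapping
import org.springframework.web.bind.annotation.RestController

// A hypothetical endpoint returning the scraped data as CSV text.
@RestController
@RequestMapping("/covid")
class CovidCsvController {

    @GetMapping("csv", produces = ["text/csv"])
    fun getCovidCsv() : String {
        val header = "country,cases,deaths,recoveries"
        // one CSV row per scraped entry
        val rows = retrieveCovidData().joinToString("\n") {
            "${it.country},${it.cases},${it.deaths},${it.recoveries}"
        }
        return "$header\n$rows"
    }
}
```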
What really matters is the retrieveCovidData method, where the scraping logic lies. Thanks to jsoup, retrieving the desired data by navigating the DOM of the downloaded web page is straightforward, and no further explanation is required.
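If you prefer CSS selectors over DOM traversal, the same rows can be selected more compactly. The sketch below assumes the same "thetable" id targeted by retrieveCovidData; the helper name extractRows is my own:

```kotlin
import org.jsoup.nodes.Document

// An alternative sketch using jsoup's CSS selector support instead of
// DOM traversal; it targets the same table as retrieveCovidData().
fun extractRows(webPage: Document) =
    webPage
        .select("#thetable tbody tr") // all rows of the statistics table
        .drop(2)                      // dropping the headers, as before
```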
Based on my experience, many errors can occur while connecting to your target web page and downloading it. To make your code more robust, I strongly recommend adding retry logic, as described here.
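As a hedged illustration of that idea, here is a minimal generic retry helper in plain Kotlin. The function name retry and its parameters are my own illustrative choices, not from the article or any specific library:

```kotlin
// A minimal retry sketch: run block up to `times` times, sleeping
// `delayMillis` between failed attempts, and rethrow the last error
// if every attempt fails.
fun <T> retry(
    times: Int = 3,
    delayMillis: Long = 1000,
    block: () -> T
): T {
    var lastError: Exception? = null
    repeat(times) { attempt ->
        try {
            return block() // success: return immediately
        } catch (e: Exception) {
            lastError = e
            if (attempt < times - 1) Thread.sleep(delayMillis)
        }
    }
    // all attempts failed: rethrow the last error
    throw lastError ?: IllegalStateException("retry failed")
}
```

You could then wrap the page download as, for example, `retry { Jsoup.connect(url).get() }`.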
3. Putting It All Together
Let’s create a controller and define an API to test out the scraping logic defined above.
@RestController
@RequestMapping("/covid")
class CovidDataController {

    @GetMapping("data")
    fun getCovidData() : ResponseEntity<List<CovidDataDto>> {
        return ResponseEntity(
            retrieveCovidData(),
            HttpStatus.OK
        )
    }
}
Now, by reaching http://localhost/covid/data, you will get the following response:
[ { "country": "United States", "cases": 28897871, "deaths": 518720 }, { "country": "India", "cases": 11096731, "deaths": 157051, "recoveries": 10775169 }, { "country": "Brazil", "cases": 10551259, "deaths": 255018, "recoveries": 9411033 }, ... { "country": "Vanuatu", "cases": 1, "deaths": 0, "recoveries": 1 } ]
Et voilà! Your API that scrapes COVID-19 data on the fly is ready!
Conclusion
In this article, we looked at how to build an API to scrape data on the fly from a specific web page in Spring Boot and Kotlin. This is especially useful when dealing with online services that do not offer APIs to retrieve their public data, and you want an up-to-date version of it.
Thanks for reading! I hope that you found this article helpful.