Many online services do not offer APIs to give access to their public data. At the same time, they might have all this data available on their website. In such a circumstance, why not scrape it?
Web scraping is a complicated subject and — to perform it consistently — you may need an equally complex solution. In most cases, you do not need such a sophisticated system.
In fact, an API that is capable of scraping data on the fly from a template-consistent website should be enough.
Let’s see how to build such an API to scrape data from a particular website in Spring Boot.
Please, note that the code will be in Kotlin, but you can achieve the same result in Java as well.
1. Adding the Required Dependencies
First, you need a library to perform web scraping in Spring Boot. Since Kotlin is interoperable with Java, you can use any Java web scraping library. Out of the many options that are available, I highly recommend jsoup.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. — jsoup: Java HTML Parser
So, you need to add jsoup to your project’s dependencies.
If you are a Gradle user, add this dependency to your project’s build file:
implementation "org.jsoup:jsoup:1.13.1"
Otherwise, if you are a Maven user, add the following dependency to your project’s build POM:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
Now, you have all you need to start scraping data in Spring Boot. Note that jsoup is not the only option: other Java web scraping libraries work in Spring Boot as well.
2. Defining Your Scraping Logic
Since your scraping logic is based on how your target web page is structured, you must define it according to your goals. Be aware that every time the template of this page changes, you should update the logic accordingly.
The main advantage of defining an API to perform such an operation is that it scrapes data on the fly. This means that every time the API is called, up-to-date data is always returned.
In this tutorial, I am going to show how to build an API whose goal is to scrape the COVID-19 pandemic by country and territory Wikipedia page. Its purpose is to retrieve statistics on COVID-19 and return them in a human-readable format.
Firstly, you need to create a new connection to your target web page through the connect method. Note that you might need to set a valid user agent, specific headers, or cookies to prevent the connection from being refused.

Secondly, you can call the get() method to fetch and parse the desired HTML file. The result is a Document object, which offers everything you need to navigate the DOM and find, extract, and manipulate data. You can retrieve HTML elements either with DOM traversal methods or with CSS selectors.
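As a hedged illustration of this connection setup, the snippet below shows how these options can be combined with jsoup’s fluent API. The user-agent string, header, and cookie values are placeholders of my own choosing, not values that Wikipedia actually requires:

```kotlin
import org.jsoup.Jsoup

// A sketch of a more defensive connection setup; the user-agent string,
// header, and cookie values below are illustrative placeholders.
val document = Jsoup
    .connect("https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory")
    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") // present yourself as a browser
    .header("Accept-Language", "en-US")                     // optional request header
    .cookie("cookie-name", "cookie-value")                  // only if the site requires it
    .timeout(10_000)                                        // give up after 10 seconds
    .get()                                                  // fetch and parse into a Document
```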
This is what your scraping logic will look like:
fun retrieveCovidData() : List<CovidDataDto> {
    val covidDataList = ArrayList<CovidDataDto>()

    try {
        // retrieving the desired web page
        val webPage = Jsoup
            .connect("https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory")
            .get()

        val tbody = webPage
            .getElementById("thetable")
            .getElementsByTag("tbody")[0]

        val rows = tbody
            .children()
            .drop(2) // dropping the headers

        for (row in rows) {
            val country = row
                .getElementsByTag("a")[0]
                .text()

            val tds = row.getElementsByTag("td")

            // skipping the footer
            if (tds.size < 3)
                continue

            val cases = tds[0].text().replace(",", "").toIntOrNull()
            val deaths = tds[1].text().replace(",", "").toIntOrNull()
            val recoveries = tds[2].text().replace(",", "").toIntOrNull()

            covidDataList.add(
                CovidDataDto(
                    country,
                    cases,
                    deaths,
                    recoveries
                )
            )
        }
    } catch (e : HttpStatusException) {
        // an error occurred while connecting to the page
        // logging errors
        // ...

        throw e
    }

    return covidDataList
}
And here is the CovidDataDto.kt file:
class CovidDataDto {
    var country : String? = null
    var cases : Int? = null
    var deaths : Int? = null
    var recoveries : Int? = null

    constructor(
        country : String?,
        cases : Int?,
        deaths : Int?,
        recoveries : Int?
    ) {
        this.country = country
        this.cases = cases
        this.deaths = deaths
        this.recoveries = recoveries
    }

    constructor()
}
As you can see, CovidDataDto is just a DTO class used to carry data. Keep in mind that when dealing with such APIs, it may be useful to return CSV content directly. Spring Boot allows you to do so, as described here.
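As a minimal sketch of that idea, the controller below serializes the scraped data to CSV by hand. The "/covid/csv" path, the CovidCsvController name, and the manual string building are illustrative assumptions of mine, not part of the original article (a real project might prefer a CSV library):

```kotlin
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.RequestMapping
import org.springframework.web.bind.annotation.RestController

// A hypothetical endpoint returning the scraped data as CSV text.
@RestController
@RequestMapping("/covid")
class CovidCsvController {

    @GetMapping("csv", produces = ["text/csv"])
    fun getCovidCsv() : String {
        val header = "country,cases,deaths,recoveries"
        // one CSV row per scraped entry
        val rows = retrieveCovidData().joinToString("\n") {
            "${it.country},${it.cases},${it.deaths},${it.recoveries}"
        }
        return "$header\n$rows"
    }
}
```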
What really matters is the retrieveCovidData method, where the scraping logic lies. Thanks to jsoup, retrieving the desired data by navigating the DOM of the downloaded web page is straightforward, and no further explanation is required.
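If you prefer CSS selectors over DOM traversal, the same rows can be selected more compactly. The sketch below assumes the same "thetable" id targeted by retrieveCovidData; the helper name extractRows is my own:

```kotlin
import org.jsoup.nodes.Document

// An alternative sketch using jsoup's CSS selector support instead of
// DOM traversal; it targets the same table as retrieveCovidData().
fun extractRows(webPage: Document) =
    webPage
        .select("#thetable tbody tr") // all rows of the statistics table
        .drop(2)                      // dropping the headers, as before
```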
Based on my experience, many errors can occur while connecting to your target web page and downloading it. To make your code more robust, I strongly recommend adding retry logic, as described here.
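As a hedged illustration of that idea, here is a minimal generic retry helper in plain Kotlin. The function name retry and its parameters are my own illustrative choices, not from the article or any specific library:

```kotlin
// A minimal retry sketch: run block up to `times` times, sleeping
// `delayMillis` between failed attempts, and rethrow the last error
// if every attempt fails.
fun <T> retry(
    times: Int = 3,
    delayMillis: Long = 1000,
    block: () -> T
): T {
    var lastError: Exception? = null
    repeat(times) { attempt ->
        try {
            return block() // success: return immediately
        } catch (e: Exception) {
            lastError = e
            if (attempt < times - 1) Thread.sleep(delayMillis)
        }
    }
    // all attempts failed: rethrow the last error
    throw lastError ?: IllegalStateException("retry failed")
}
```

You could then wrap the page download as, for example, `retry { Jsoup.connect(url).get() }`.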
3. Putting It All Together
Let’s create a controller and define an API to test out the scraping logic defined above.
@RestController
@RequestMapping("/covid")
class CovidDataController {

    @GetMapping("data")
    fun getCovidData() : ResponseEntity<List<CovidDataDto>> {
        return ResponseEntity(
            retrieveCovidData(),
            HttpStatus.OK
        )
    }
}
Now, by reaching http://localhost/covid/data, you will get the following response:
[ { "country": "United States", "cases": 28897871, "deaths": 518720 }, { "country": "India", "cases": 11096731, "deaths": 157051, "recoveries": 10775169 }, { "country": "Brazil", "cases": 10551259, "deaths": 255018, "recoveries": 9411033 }, ... { "country": "Vanuatu", "cases": 1, "deaths": 0, "recoveries": 1 } ]
Et voilà! Your API that scrapes COVID-19 data on the fly is ready!
Conclusion
In this article, we looked at how to build an API to scrape data on the fly from a specific web page in Spring Boot and Kotlin. This is especially useful when dealing with online services that do not offer APIs to retrieve their public data, and you want an up-to-date version of it.
Thanks for reading! I hope that you found this article helpful.