Advanced API Usage: Pagination, Filtering, and Handling Large Datasets #
Welcome back to our programming tutorial series! Now that you’ve learned how to make basic API requests, we’ll explore more advanced topics: pagination, filtering, and handling large datasets. These skills are essential for working with APIs that return large amounts of data.
What Is Pagination? #
When APIs return large datasets, they often use pagination to break the results into smaller chunks. This prevents a single API call from returning too much data at once, making it easier for clients to handle and for servers to manage.
Example of Pagination #
Let’s say you’re querying an API that returns a list of users, but it limits the response to 10 users per page. You can request different pages of results using a query parameter such as page.
Example: #
import requests
url = "https://api.example.com/users?page=1"
response = requests.get(url)
data = response.json()
print(data["users"]) # Outputs the first page of users
To retrieve the next page, you would change the page parameter:
url = "https://api.example.com/users?page=2"
response = requests.get(url)
data = response.json()
print(data["users"]) # Outputs the second page of users
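In practice it’s often cleaner to let requests build the query string for you and to confirm the request succeeded before parsing the body. Here’s a minimal sketch of the same call, still against the placeholder endpoint:
import requests
url = "https://api.example.com/users"
# Let requests encode the query string instead of hard-coding "?page=2"
response = requests.get(url, params={"page": 2})
response.raise_for_status()  # Raises an exception for 4xx/5xx responses
data = response.json()
print(data["users"])  # Outputs the second page of users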
Automating Pagination #
If you need to retrieve all pages of data, you can automate the process by looping through the pages until no more results are returned.
Example: #
import requests
url = "https://api.example.com/users"
page = 1
while True:
    response = requests.get(f"{url}?page={page}")
    data = response.json()
    if not data["users"]:
        break  # Stop if no more users are returned
    for user in data["users"]:
        print(user)
    page += 1
This loop will continue to request pages until the API stops returning users.
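If you find yourself repeating this loop, you can wrap it in a generator so callers simply iterate over users without worrying about page numbers. This sketch assumes the same hypothetical endpoint and that the API returns an empty "users" list once you run past the last page:
import requests

def fetch_all_users(url):
    """Yield users one at a time, requesting pages until an empty page comes back."""
    page = 1
    while True:
        response = requests.get(url, params={"page": page})
        response.raise_for_status()
        users = response.json()["users"]
        if not users:
            break
        yield from users
        page += 1

for user in fetch_all_users("https://api.example.com/users"):
    print(user)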
Filtering API Data #
Many APIs allow you to filter results based on specific criteria, which helps reduce the amount of data returned and makes the response more relevant to your query.
Example of Filtering #
Let’s say you’re querying an API for a list of posts, and you only want to retrieve posts created after a certain date:
import requests
url = "https://api.example.com/posts"
params = {
    "date_created_after": "2024-01-01"
}
response = requests.get(url, params=params)
data = response.json()
print(data["posts"]) # Outputs posts created after 2024-01-01
You can add more filters as query parameters to narrow down the results even further.
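For example, if the API also accepted author and status filters (hypothetical parameter names, shown only to illustrate the pattern), you could pass them all in the same params dictionary:
import requests
url = "https://api.example.com/posts"
# Each key/value pair becomes a query parameter. These parameter names are
# hypothetical; check your API's documentation for the ones it actually supports.
params = {
    "date_created_after": "2024-01-01",
    "author": "alice",
    "status": "published"
}
response = requests.get(url, params=params)
data = response.json()
print(data["posts"])  # Only posts matching all three filters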
Combining Pagination and Filtering #
It’s common to combine pagination and filtering when working with large datasets. You can filter the results and retrieve multiple pages until all matching data has been collected.
Example: #
import requests
url = "https://api.example.com/posts"
params = {
    "date_created_after": "2024-01-01"
}
page = 1
while True:
    response = requests.get(f"{url}?page={page}", params=params)
    data = response.json()
    if not data["posts"]:
        break  # Stop if no more posts are returned
    for post in data["posts"]:
        print(post)
    page += 1
Handling Large Datasets #
When working with APIs that return large datasets, it’s important to manage memory usage and performance. Here are some strategies for efficiently handling large amounts of data.
Storing Data in a Database #
If you need to work with large datasets over time, it’s often more efficient to store the data in a local database instead of keeping it all in memory.
Example Using SQLite: #
import sqlite3
import requests
# Connect to SQLite database (or create it)
conn = sqlite3.connect("data.db")
c = conn.cursor()
# Create table
c.execute('''CREATE TABLE IF NOT EXISTS users (id INTEGER, name TEXT, email TEXT)''')
# Insert data into the table
url = "https://api.example.com/users"
page = 1
while True:
    response = requests.get(f"{url}?page={page}")
    data = response.json()
    if not data["users"]:
        break
    for user in data["users"]:
        c.execute("INSERT INTO users VALUES (?, ?, ?)", (user["id"], user["name"], user["email"]))
    page += 1
# Commit changes and close the connection
conn.commit()
conn.close()
By storing the data in an SQLite database, you can retrieve and analyze it later without needing to keep everything in memory.
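Once the data is on disk, you can query it with ordinary SQL instead of re-fetching it from the API. Here’s a small sketch against the users table created above:
import sqlite3
conn = sqlite3.connect("data.db")
c = conn.cursor()
# Look up users by name with a parameterized query (never build SQL with string formatting)
c.execute("SELECT id, name, email FROM users WHERE name LIKE ?", ("%smith%",))
for row in c.fetchall():
    print(row)
conn.close()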
Processing Data in Batches #
If the API returns a large dataset in one request, you can process the data in batches instead of loading it all at once. This helps reduce memory usage.
Example: #
import requests
url = "https://api.example.com/large-dataset"
response = requests.get(url, stream=True) # Stream the response to avoid loading it all at once
for chunk in response.iter_content(chunk_size=1024):
    # Process each chunk of data
    print(chunk)
Passing stream=True tells requests not to download the whole response body up front; iter_content() then reads it in fixed-size chunks, so the full payload never has to sit in memory at once.
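If the endpoint happens to return newline-delimited JSON (one record per line), iter_lines() is often more convenient than raw byte chunks. This sketch assumes such a hypothetical endpoint:
import json
import requests
url = "https://api.example.com/large-dataset"
# Assumes the (hypothetical) endpoint returns one JSON object per line
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if line:  # Skip keep-alive blank lines
            record = json.loads(line)
            print(record)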
Practical Exercise: Fetch and Process Large User Data #
Now that you understand pagination and filtering, try this practical exercise:
- Query a public API (e.g., the GitHub API) to fetch user data.
- Implement pagination to retrieve multiple pages of users.
- Store the user data in a local SQLite database.
- Allow the user to query the database for specific users by name or ID.
Here’s a starter example:
import sqlite3
import requests
# Connect to SQLite database
conn = sqlite3.connect("github_users.db")
c = conn.cursor()
# Create table
c.execute('''CREATE TABLE IF NOT EXISTS users (id INTEGER, login TEXT, url TEXT)''')
# Fetch user data from GitHub API and store it in the database
url = "https://api.github.com/users"
page = 1
while True:
response = requests.get(f"{url}?since={page}")
users = response.json()
if not users:
break
for user in users:
c.execute("INSERT INTO users VALUES (?, ?, ?)", (user["id"], user["login"], user["html_url"]))
page += 30 # Adjust based on GitHub's pagination
conn.commit()
conn.close()
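To complete the last step of the exercise, you can add a small lookup function that queries the database by login name or numeric ID. One possible sketch:
import sqlite3

def find_user(search_term):
    """Look up locally stored GitHub users by numeric ID or by login name."""
    conn = sqlite3.connect("github_users.db")
    c = conn.cursor()
    if str(search_term).isdigit():
        c.execute("SELECT id, login, url FROM users WHERE id = ?", (int(search_term),))
    else:
        c.execute("SELECT id, login, url FROM users WHERE login LIKE ?", (f"%{search_term}%",))
    rows = c.fetchall()
    conn.close()
    return rows

print(find_user("octocat"))
print(find_user(1))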
What’s Next? #
You’ve just learned how to work with large datasets using pagination, filtering, and batch processing techniques. In the next post, we’ll dive into more advanced topics like OAuth and working with APIs that require complex authentication.
Related Articles #
- Working with APIs: Fetching Data from External Sources
- Error Handling and Exceptions in Python
- Dictionaries and Sets: Efficient Data Retrieval
Happy coding, and we’ll see you in the next lesson!