Advanced API Usage: Pagination, Filtering, and Handling Large Datasets #
Welcome back to our programming tutorial series! Now that you’ve learned how to make basic API requests, we’ll explore more advanced topics: pagination, filtering, and handling large datasets. These skills are essential for working with APIs that return large amounts of data.
What Is Pagination? #
When APIs return large datasets, they often use pagination to break the results into smaller chunks. This prevents a single API call from returning too much data at once, making it easier for clients to handle and for servers to manage.
Example of Pagination #
Let’s say you’re querying an API that returns a list of users, but it limits the response to 10 users per page. You can request different pages of results using a query parameter such as page.
Example: #
import requests
url = "https://api.example.com/users?page=1"
response = requests.get(url)
data = response.json()
print(data["users"]) # Outputs the first page of users
To retrieve the next page, you would change the page parameter:
url = "https://api.example.com/users?page=2"
response = requests.get(url)
data = response.json()
print(data["users"]) # Outputs the second page of users
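In practice it’s often cleaner to let requests build the query string for you and to confirm the request succeeded before parsing the body. Here’s a minimal sketch of the same call, still against the placeholder endpoint:
import requests
url = "https://api.example.com/users"
# Let requests encode the query string instead of hard-coding "?page=2"
response = requests.get(url, params={"page": 2})
response.raise_for_status()  # Raises an exception for 4xx/5xx responses
data = response.json()
print(data["users"])  # Outputs the second page of users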
Automating Pagination #
If you need to retrieve all pages of data, you can automate the process by looping through the pages until no more results are returned.
Example: #
import requests
url = "https://api.example.com/users"
page = 1
while True:
    response = requests.get(f"{url}?page={page}")
    data = response.json()
    if not data["users"]:
        break  # Stop if no more users are returned
    for user in data["users"]:
        print(user)
    page += 1
This loop will continue to request pages until the API stops returning users.
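If you find yourself repeating this loop, you can wrap it in a generator so callers simply iterate over users without worrying about page numbers. This sketch assumes the same hypothetical endpoint and that the API returns an empty "users" list once you run past the last page:
import requests

def fetch_all_users(url):
    """Yield users one at a time, requesting pages until an empty page comes back."""
    page = 1
    while True:
        response = requests.get(url, params={"page": page})
        response.raise_for_status()
        users = response.json()["users"]
        if not users:
            break
        yield from users
        page += 1

for user in fetch_all_users("https://api.example.com/users"):
    print(user)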
Filtering API Data #
Many APIs allow you to filter results based on specific criteria, which helps reduce the amount of data returned and makes the response more relevant to your query.
Example of Filtering #
Let’s say you’re querying an API for a list of posts, and you only want to retrieve posts created after a certain date:
import requests
url = "https://api.example.com/posts"
params = {
    "date_created_after": "2024-01-01"
}
response = requests.get(url, params=params)
data = response.json()
print(data["posts"]) # Outputs posts created after 2024-01-01
You can add more filters as query parameters to narrow down the results even further.
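For example, if the API also accepted author and status filters (hypothetical parameter names, shown only to illustrate the pattern), you could pass them all in the same params dictionary:
import requests
url = "https://api.example.com/posts"
# Each key/value pair becomes a query parameter. These parameter names are
# hypothetical; check your API's documentation for the ones it actually supports.
params = {
    "date_created_after": "2024-01-01",
    "author": "alice",
    "status": "published"
}
response = requests.get(url, params=params)
data = response.json()
print(data["posts"])  # Only posts matching all three filters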
Combining Pagination and Filtering #
It’s common to combine pagination and filtering when working with large datasets. You can filter the results and retrieve multiple pages until all matching data has been collected.
Example: #
import requests
url = "https://api.example.com/posts"
params = {
    "date_created_after": "2024-01-01"
}
page = 1
while True:
    response = requests.get(f"{url}?page={page}", params=params)
    data = response.json()
    if not data["posts"]:
        break  # Stop if no more posts are returned
    for post in data["posts"]:
        print(post)
    page += 1
Handling Large Datasets #
When working with APIs that return large datasets, it’s important to manage memory usage and performance. Here are some strategies for efficiently handling large amounts of data.
Storing Data in a Database #
If you need to work with large datasets over time, it’s often more efficient to store the data in a local database instead of keeping it all in memory.
Example Using SQLite: #
import sqlite3
import requests
# Connect to SQLite database (or create it)
conn = sqlite3.connect("data.db")
c = conn.cursor()
# Create table
c.execute('''CREATE TABLE IF NOT EXISTS users (id INTEGER, name TEXT, email TEXT)''')
# Insert data into the table
url = "https://api.example.com/users"
page = 1
while True:
    response = requests.get(f"{url}?page={page}")
    data = response.json()
    if not data["users"]:
        break
    for user in data["users"]:
        c.execute("INSERT INTO users VALUES (?, ?, ?)", (user["id"], user["name"], user["email"]))
    page += 1
# Commit changes and close the connection
conn.commit()
conn.close()
By storing the data in an SQLite database, you can retrieve and analyze it later without needing to keep everything in memory.
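Once the data is on disk, you can query it with ordinary SQL instead of re-fetching it from the API. Here’s a small sketch against the users table created above:
import sqlite3
conn = sqlite3.connect("data.db")
c = conn.cursor()
# Look up users by name with a parameterized query (never build SQL with string formatting)
c.execute("SELECT id, name, email FROM users WHERE name LIKE ?", ("%smith%",))
for row in c.fetchall():
    print(row)
conn.close()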
Processing Data in Batches #
If the API returns a large dataset in one request, you can process the data in batches instead of loading it all at once. This helps reduce memory usage.
Example: #
import requests
url = "https://api.example.com/large-dataset"
response = requests.get(url, stream=True) # Stream the response to avoid loading it all at once
for chunk in response.iter_content(chunk_size=1024):
    # Process each chunk of data
    print(chunk)
Passing stream=True tells requests not to download the whole response body up front; iter_content() then reads it in fixed-size chunks, so the full payload never has to sit in memory at once.
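If the endpoint happens to return newline-delimited JSON (one record per line), iter_lines() is often more convenient than raw byte chunks. This sketch assumes such a hypothetical endpoint:
import json
import requests
url = "https://api.example.com/large-dataset"
# Assumes the (hypothetical) endpoint returns one JSON object per line
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if line:  # Skip keep-alive blank lines
            record = json.loads(line)
            print(record)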
Practical Exercise: Fetch and Process Large User Data #
Now that you understand pagination and filtering, try this practical exercise:
- Query a public API (e.g., the GitHub API) to fetch user data.
- Implement pagination to retrieve multiple pages of users.
- Store the user data in a local SQLite database.
- Allow the user to query the database for specific users by name or ID.
Here’s a starter example:
import sqlite3
import requests
# Connect to SQLite database
conn = sqlite3.connect("github_users.db")
c = conn.cursor()
# Create table
c.execute('''CREATE TABLE IF NOT EXISTS users (id INTEGER, login TEXT, url TEXT)''')
# Fetch user data from GitHub API and store it in the database
url = "https://api.github.com/users"
page = 1
while True:
response = requests.get(f"{url}?since={page}")
users = response.json()
if not users:
break
for user in users:
c.execute("INSERT INTO users VALUES (?, ?, ?)", (user["id"], user["login"], user["html_url"]))
page += 30 # Adjust based on GitHub's pagination
conn.commit()
conn.close()
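To complete the last step of the exercise, you can add a small lookup function that queries the database by login name or numeric ID. One possible sketch:
import sqlite3

def find_user(search_term):
    """Look up locally stored GitHub users by numeric ID or by login name."""
    conn = sqlite3.connect("github_users.db")
    c = conn.cursor()
    if str(search_term).isdigit():
        c.execute("SELECT id, login, url FROM users WHERE id = ?", (int(search_term),))
    else:
        c.execute("SELECT id, login, url FROM users WHERE login LIKE ?", (f"%{search_term}%",))
    rows = c.fetchall()
    conn.close()
    return rows

print(find_user("octocat"))
print(find_user(1))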
What’s Next? #
You’ve just learned how to work with large datasets using pagination, filtering, and batch processing techniques. In the next post, we’ll dive into more advanced topics like OAuth and working with APIs that require complex authentication.
Related Articles #
- Working with APIs: Fetching Data from External Sources
- Error Handling and Exceptions in Python
- Dictionaries and Sets: Efficient Data Retrieval
Happy coding, and we’ll see you in the next lesson!