Harvesting

This page explains how to harvest data from the ARTESP Open Data Portal. Harvesting lets you automatically collect datasets from our portal and keep them synchronized in your own systems. We offer several harvesting methods, so you can integrate our open datasets into your applications, analysis tools, or other data platforms.

What is Harvesting?

Harvesting is the process of automatically copying metadata and data from one data portal into another system. It allows organizations and individuals to keep a local copy of datasets that stays synchronized with the original source. This is particularly useful for:

Creating federated or aggregated data catalogs
Building applications that need regular data updates
Integrating open data into your own systems
Performing analysis across multiple datasets

Available Harvesting Methods

DCAT RDF Endpoints

Our portal supports the Data Catalog Vocabulary (DCAT) standard, which provides a framework for describing datasets in a catalog. We offer the following DCAT endpoints:

Catalog Endpoint

Access all datasets in our catalog through:

https://dadosabertos.artesp.sp.gov.br/catalog.{format} where {format} can be xml, ttl, n3, or jsonld

Parameters:

page={number} - For pagination (default: 1)
modified_since={ISO-date} - Filter datasets modified since a specific date
q={query} - Search query to filter datasets

Example: https://dadosabertos.artesp.sp.gov.br/catalog.xml?page=2&modified_since=2023-01-01
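
The catalog URL and its query parameters can also be assembled programmatically. The following is a minimal Python sketch using only the standard library; the function name `catalog_url` and the constant names are illustrative and not part of the portal.

```python
from urllib.parse import urlencode

BASE_URL = "https://dadosabertos.artesp.sp.gov.br"
RDF_FORMATS = ("xml", "ttl", "n3", "jsonld")


def catalog_url(fmt="xml", page=None, modified_since=None, q=None):
    """Build a catalog endpoint URL from the documented parameters."""
    if fmt not in RDF_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    # Only include parameters the caller actually supplied.
    params = {}
    if page is not None:
        params["page"] = page
    if modified_since is not None:
        params["modified_since"] = modified_since
    if q is not None:
        params["q"] = q
    url = f"{BASE_URL}/catalog.{fmt}"
    return f"{url}?{urlencode(params)}" if params else url
```

Calling `catalog_url("xml", page=2, modified_since="2023-01-01")` produces the same URL as the example above.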

Individual Dataset Endpoints

Access metadata for a specific dataset:

https://dadosabertos.artesp.sp.gov.br/dataset/{dataset-id}.{format} where {format} can be xml, ttl, n3, or jsonld

Example: https://dadosabertos.artesp.sp.gov.br/dataset/acidentes.xml

Content Negotiation

Our portal also supports content negotiation, allowing clients to request specific formats using HTTP Accept headers:

application/rdf+xml for RDF/XML format
text/turtle for Turtle format
text/n3 for N3 format
application/ld+json for JSON-LD format

Example using curl: curl -H "Accept: text/turtle" https://dadosabertos.artesp.sp.gov.br/dataset/rodovias-concedidas
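
The same content negotiation can be done from Python. This is a rough sketch with the standard library only; the helper names `negotiated_request` and `fetch_rdf` are made up for illustration, and the mapping simply restates the media types listed above.

```python
from urllib.request import Request, urlopen

# Media types listed above, keyed by the short format name.
ACCEPT_BY_FORMAT = {
    "xml": "application/rdf+xml",
    "ttl": "text/turtle",
    "n3": "text/n3",
    "jsonld": "application/ld+json",
}


def negotiated_request(url, fmt):
    """Return a Request that asks for the given RDF serialization."""
    return Request(url, headers={"Accept": ACCEPT_BY_FORMAT[fmt]})


def fetch_rdf(url, fmt="ttl", timeout=30):
    """Send the request and return the response body as text."""
    with urlopen(negotiated_request(url, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `fetch_rdf("https://dadosabertos.artesp.sp.gov.br/dataset/rodovias-concedidas", "ttl")` would mirror the curl command above.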

DCAT Configuration

Our DCAT implementation is configured with the following settings:

RDF profile: DCAT-AP 3.0
RDF endpoints enabled
Content negotiation enabled
Page size: 100 datasets per page
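
With 100 datasets per page, the number of catalog pages a harvester has to request follows from the total dataset count. A small sketch of that arithmetic, assuming the total is already known (for instance from the `count` field of a `package_search` result); the function name is illustrative.

```python
import math

PAGE_SIZE = 100  # datasets per catalog page, as configured on the portal


def catalog_page_count(total_datasets, page_size=PAGE_SIZE):
    """Number of catalog pages needed to cover all datasets (at least 1)."""
    return max(1, math.ceil(total_datasets / page_size))
```

A catalog of 250 datasets would therefore span pages 1 to 3.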

Setting Up a Harvester in CKAN

If you are using CKAN to harvest data from our portal, you can use the CKAN Harvester extension. Here are the basic steps:

Install the ckanext-harvest extension in your CKAN instance (the DCAT RDF harvester additionally requires the ckanext-dcat extension)
Configure the harvester to use either the CKAN harvester (for CKAN-to-CKAN harvesting) or the DCAT RDF harvester (for harvesting via our DCAT endpoints)
Create a new harvest source pointing to our portal URL
Configure the harvester with appropriate options (frequency, filters, etc.)
Start the harvesting process

Example Configuration for DCAT RDF Harvester

When setting up a DCAT RDF harvester, you can use this configuration:

{
  "rdf_format": "xml",
  "profiles": ["euro_dcat_ap_3"],
  "default_extras": {
    "harvest_source_title": "ARTESP Open Data Portal",
    "harvest_source_url": "https://dadosabertos.artesp.sp.gov.br/"
  }
}
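
If you prefer to register the harvest source through the API of your own CKAN instance instead of the web form, the configuration above has to be passed as a JSON string. The sketch below only builds the payload. It assumes the `harvest_source_create` action and the field names (`source_type`, `frequency`, `config`) as provided by ckanext-harvest; please verify them against the version you have installed. The source name `artesp-dcat` in the usage note is a placeholder.

```python
import json

# Harvester configuration from the example above.
DCAT_CONFIG = {
    "rdf_format": "xml",
    "profiles": ["euro_dcat_ap_3"],
    "default_extras": {
        "harvest_source_title": "ARTESP Open Data Portal",
        "harvest_source_url": "https://dadosabertos.artesp.sp.gov.br/",
    },
}


def harvest_source_payload(name, url, config, frequency="WEEKLY"):
    """Build a payload for harvest_source_create (assumed field names)."""
    return {
        "name": name,
        "title": config["default_extras"]["harvest_source_title"],
        "url": url,
        "source_type": "dcat_rdf",      # DCAT RDF harvester (assumption)
        "frequency": frequency,         # e.g. MANUAL, DAILY, WEEKLY
        "config": json.dumps(config),   # the config is sent as a JSON string
    }
```

A call such as `harvest_source_payload("artesp-dcat", "https://dadosabertos.artesp.sp.gov.br/catalog.xml", DCAT_CONFIG)` returns a dictionary that could then be posted to your own instance with an API key.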

Using the CKAN API for Data Access

Beyond DCAT harvesting, the CKAN Action API offers an RPC-style interface for interacting with the portal programmatically. You can retrieve dataset metadata, search the catalog, and list organizations, groups, and tags using HTTP requests with JSON payloads, which gives you fine-grained control over data access.

Introduction to the CKAN API

The API allows you to perform most actions available through the web interface.

API Base URL: https://dadosabertos.artesp.sp.gov.br/api/3/action/

For example, to list datasets, the action name is `package_list`, and the full URL would be: https://dadosabertos.artesp.sp.gov.br/api/3/action/package_list

JSON Response Structure

A typical API response is a JSON object with the following structure (the `//` and `/* */` annotations are explanatory only and do not appear in real responses):


{
  "help": "Help text for the called action...",
  "success": true, // or false in case of error
  "result": [ /* ...data returned by the action... */ ],
  "error": { // Present only if "success" is false
    "message": "Error message",
    "__type": "Error Type"
  }
}

help: A documentation string for the API function you called.
success: A boolean indicating if the call was successful (true) or not (false). Always check this field.
result: The data returned by the function. Its structure depends on the specific action.
error: If `success` is false, this object contains details about the error.
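
The fields described above suggest a simple pattern for client code: return `result` on success and raise an exception built from `error` otherwise. A minimal sketch, where `CKANActionError` and `unwrap` are hypothetical names and the input is an already-decoded response dictionary:

```python
class CKANActionError(Exception):
    """Raised when an API response reports success = false."""


def unwrap(response):
    """Return the result of a decoded API response or raise on failure."""
    if response.get("success") is True:
        return response["result"]
    # The error object carries a message and a type, as described above.
    error = response.get("error") or {}
    error_type = error.get("__type", "Unknown error")
    message = error.get("message", "no message provided")
    raise CKANActionError(f"{error_type}: {message}")
```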

API Version

The current and recommended API version is v3. It is good practice to include `/api/3/` in your request URLs to ensure compatibility.

Authentication

Most read actions on public data do not require authentication. However, actions that modify data (create, update, delete) or access private datasets require an API Key. This key should be included in the `Authorization` HTTP header.

Example header: Authorization: YOUR_API_KEY_HERE

You can usually find your API key on your user profile page on the CKAN site.
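
As a sketch of how the header can be attached from Python, the helper below reads the key from an environment variable so it does not end up in source code. The variable name `CKAN_API_KEY` and the function name are assumptions made for this example.

```python
import os
from urllib.request import Request


def authorized_request(url, api_key=None):
    """Build a Request that carries the API key, if one is available."""
    # Fall back to an environment variable (hypothetical name).
    api_key = api_key or os.environ.get("CKAN_API_KEY")
    headers = {"Authorization": api_key} if api_key else {}
    return Request(url, headers=headers)
```

Requests for public data can omit the key entirely; in that case no `Authorization` header is sent.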

Common API Actions for Data Retrieval

Here are some common read-only actions useful for accessing data and metadata:

List Datasets (package_list)

Returns a list of the names (IDs) of all public datasets.

Example using cURL:

curl -X GET "https://dadosabertos.artesp.sp.gov.br/api/3/action/package_list"
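
A Python equivalent of the call above might look like the following sketch. It uses only the standard library; `call_action` and `http_get_json` are illustrative names, and the `fetch` argument exists so that the HTTP layer can be replaced (for example in tests).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://dadosabertos.artesp.sp.gov.br/api/3/action/"


def http_get_json(url):
    """Perform a GET request and decode the JSON body."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def call_action(action, params=None, fetch=http_get_json):
    """Call a read-only API action and return its result."""
    url = API_BASE + action
    if params:
        url += "?" + urlencode(params)
    body = fetch(url)
    # Check the success flag, not only the HTTP status.
    if not body.get("success"):
        raise RuntimeError(f"{action} failed: {body.get('error')}")
    return body["result"]
```

With this helper, `call_action("package_list")` would return the list of dataset names.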

Show Dataset Details (package_show)

Returns complete information about a specific dataset, including its resources.

Parameters:

id (string): The name (ID) or UUID of the dataset.

Example using cURL (replace `your-dataset-id` with an actual ID):

curl -X GET "https://dadosabertos.artesp.sp.gov.br/api/3/action/package_show?id=your-dataset-id"
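
Since the result includes the dataset's resources, a common next step is to pull out the download links. The sketch below works on the decoded `result` object of a `package_show` response; the function name is illustrative, and the sample data in the usage note is invented.

```python
def resource_links(dataset):
    """Return (name, format, url) tuples for each resource of a dataset."""
    links = []
    for resource in dataset.get("resources", []):
        # Fall back to the resource ID when no name was given.
        name = resource.get("name") or resource["id"]
        links.append((name, resource.get("format", ""), resource["url"]))
    return links
```

For a dataset with one CSV resource, this returns a single tuple such as `("Sample file", "CSV", "https://example.org/sample.csv")`.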

Search Datasets (package_search)

Allows searching for datasets based on various criteria.

Common Parameters:

q (string): Search term (e.g., `q=transport`).
fq (string): Filter query using Solr syntax (e.g., `fq=tags:economy organization:artesp`).
rows (int): Number of results per page (default 10).
start (int): Offset for pagination.
sort (string): Sorting criteria (e.g., `sort=score desc, metadata_modified desc`).

Example using cURL (searching for "rodovias" and limiting to 5 results):

curl -X GET "https://dadosabertos.artesp.sp.gov.br/api/3/action/package_search?q=rodovias&rows=5"
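
When search parameters contain spaces or colons (as `fq` filters usually do), they need to be URL-encoded. A short sketch of building the search URL in Python; `search_url` is an illustrative name and only the parameters listed above are handled.

```python
from urllib.parse import urlencode

API_BASE = "https://dadosabertos.artesp.sp.gov.br/api/3/action/"


def search_url(q=None, fq=None, rows=None, start=None, sort=None):
    """Build a package_search URL, encoding each supplied parameter."""
    candidates = {"q": q, "fq": fq, "rows": rows, "start": start, "sort": sort}
    # Drop parameters that were not supplied so server defaults apply.
    params = {key: value for key, value in candidates.items() if value is not None}
    url = API_BASE + "package_search"
    return f"{url}?{urlencode(params)}" if params else url
```

`search_url(q="rodovias", rows=5)` yields the same URL as the cURL example, while an `fq` value such as `tags:economy organization:artesp` is percent-encoded automatically.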

Other Listing and Show Actions

Similar actions are available for other CKAN entities:

organization_list / organization_show: For organizations.
group_list / group_show: For groups.
tag_list / tag_show: For tags.
resource_show: To get details of a specific resource (file/link within a dataset). Requires resource ID.

API Actions Cheat Sheet

The following table provides a quick summary of common API actions:

Action Name	Description	HTTP Method	Key Parameters (in JSON body or URL query)
`package_list`	Returns a list of the names (IDs) of all public datasets.	GET or POST	`limit` (int, optional), `offset` (int, optional)
`package_show`	Returns detailed metadata for a specific dataset.	GET or POST	`id` (string, required: dataset ID or name)
`package_search`	Searches datasets based on various criteria.	GET or POST	`q` (string, optional: search term), `fq` (string, optional: filter query), `rows` (int, optional), `start` (int, optional)
`resource_show`	Returns detailed metadata for a specific resource.	GET or POST	`id` (string, required: resource ID)
`organization_list`	Returns a list of names (IDs) of all public organizations.	GET or POST	`limit` (int, optional), `offset` (int, optional), `all_fields` (boolean, optional)
`organization_show`	Returns detailed metadata for a specific organization.	GET or POST	`id` (string, required: organization ID or name), `include_datasets` (boolean, optional)
`group_list`	Returns a list of names (IDs) of all public groups.	GET or POST	`limit` (int, optional), `offset` (int, optional), `all_fields` (boolean, optional)
`tag_list`	Returns a list of all tag names.	GET or POST	`query` (string, optional), `vocabulary_id` (string, optional)
`package_create`	Creates a new dataset. (Requires Auth)	POST	`name` (string, required), `owner_org` (string, required: organization ID), `title` (string, optional), `resources` (list, optional)
`resource_create`	Adds a new resource to a dataset. (Requires Auth)	POST	`package_id` (string, required), `url` (string, if not uploading) or `upload` (file, for direct upload), `name` (string, optional)
`package_update` / `package_patch`	Updates an existing dataset (fully or partially). (Requires Auth)	POST	`id` or `name` (string, required), other dataset fields to modify.
`resource_update` / `resource_patch`	Updates an existing resource (fully or partially). (Requires Auth)	POST	`id` (string, required), other resource fields to modify.
`package_delete`	Marks a dataset as deleted. (Requires Auth)	POST	`id` (string, required: dataset ID or name)
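
As the table notes, parameters can be sent in a JSON body with POST instead of in the URL query. The following sketch builds such a request with the standard library; `post_action_request` is a hypothetical helper, and the optional API key is only needed for the actions marked "Requires Auth".

```python
import json
from urllib.request import Request

API_BASE = "https://dadosabertos.artesp.sp.gov.br/api/3/action/"


def post_action_request(action, payload, api_key=None):
    """Build a POST request that carries the parameters as a JSON body."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = api_key
    return Request(
        API_BASE + action,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request, for example `post_action_request("package_search", {"q": "rodovias", "rows": 5})`.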

Using the `ckanapi` Python Client and CLI

For Python users and system administrators, the `ckanapi` library offers a convenient way to interact with the CKAN API, both as a Python module and a command-line interface (CLI) tool.

Installation:

pip install ckanapi

CLI Examples:

List datasets:
ckanapi action package_list -r https://dadosabertos.artesp.sp.gov.br
Show dataset details (replace `your-dataset-id`):
ckanapi action package_show id=your-dataset-id -r https://dadosabertos.artesp.sp.gov.br

The `ckanapi` library is highly recommended for scripting interactions with the API.
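
Used as a Python module, the same calls might look like the sketch below. It relies on `RemoteCKAN` and the `NotFound` exception from `ckanapi`; the wrapper function names and the user agent string are made up for this example, and the imports are placed inside the functions so the module loads even where `ckanapi` is not installed.

```python
PORTAL_URL = "https://dadosabertos.artesp.sp.gov.br"


def list_dataset_names(portal_url=PORTAL_URL):
    """Return the names of all public datasets via ckanapi."""
    from ckanapi import RemoteCKAN  # requires: pip install ckanapi

    ckan = RemoteCKAN(portal_url, user_agent="harvest-example/0.1")
    return ckan.action.package_list()


def show_dataset(dataset_id, portal_url=PORTAL_URL):
    """Return the dataset metadata, or None if the dataset does not exist."""
    from ckanapi import NotFound, RemoteCKAN

    ckan = RemoteCKAN(portal_url, user_agent="harvest-example/0.1")
    try:
        return ckan.action.package_show(id=dataset_id)
    except NotFound:
        return None
```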

API Usage Tips and Best Practices

Check `success` Field: Always verify the `success` field in the API response, not just the HTTP status code, to confirm the action was successful.
Error Handling: Implement robust error handling by parsing the `error` object when `success` is `false`.
Pagination: For actions that return lists (like `package_search` or `package_list`), use parameters like `rows` (or `limit`) and `start` (or `offset`) to paginate through results.
Rate Limiting: Be aware that the API might have rate limits. Design your applications to handle potential throttling gracefully.
Full Documentation: For a comprehensive list of API actions, their parameters, and more detailed examples, refer to the official CKAN API documentation (published at docs.ckan.org) or specific guides provided by this portal.
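
The pagination and rate-limiting tips can be combined in a small loop. The sketch below assumes a caller-supplied function `fetch_page(rows, start)` that returns the `result` object of a `package_search` call (with its `count` and `results` fields); the function names and the pause length are illustrative.

```python
import time


def iter_search_results(fetch_page, rows=100, pause=0.0):
    """Yield every dataset from a paginated package_search, page by page."""
    start = 0
    while True:
        result = fetch_page(rows=rows, start=start)
        datasets = result["results"]
        if not datasets:
            break
        yield from datasets
        start += len(datasets)
        # Stop once the reported total has been reached.
        if start >= result["count"]:
            break
        if pause:
            time.sleep(pause)  # be gentle with the server between pages
```

Setting `pause` to a second or two between pages is one simple way to stay clear of possible rate limits.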

Last updated: June 2025