Harvesting

This page explains how to harvest data from the ARTESP Open Data Portal. Harvesting lets you automatically collect datasets from the portal and keep them synchronized in your own systems. The portal offers several harvesting methods, so you can integrate its open datasets into applications, analysis tools, or other data platforms.

What is Harvesting?

Harvesting is the process of automatically collecting metadata and data from one data portal to another. It allows organizations and individuals to keep a local copy of datasets that are synchronized with the original source. This is particularly useful for:

  • Creating federated or aggregated data catalogs
  • Building applications that need regular data updates
  • Integrating open data into your own systems
  • Performing analysis across multiple datasets

Available Harvesting Methods

DCAT RDF Endpoints

Our portal supports the Data Catalog Vocabulary (DCAT) standard, which provides a framework for describing datasets in a catalog. We offer the following DCAT endpoints:

Catalog Endpoint

Access all datasets in our catalog through:

  • https://dadosabertos.artesp.sp.gov.br/catalog.{format} where {format} can be xml, ttl, n3, or jsonld

Parameters:

  • page={number} - For pagination (default: 1)
  • modified_since={ISO-date} - Filter datasets modified since a specific date
  • q={query} - Search query to filter datasets

Example: https://dadosabertos.artesp.sp.gov.br/catalog.xml?page=2&modified_since=2023-01-01
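
The catalog URL and its parameters can also be assembled programmatically. The following is a minimal Python sketch, assuming only the formats and parameters listed above; the function name `catalog_url` is illustrative and not part of the portal.

```python
from urllib.parse import urlencode

BASE_URL = "https://dadosabertos.artesp.sp.gov.br"
FORMATS = {"xml", "ttl", "n3", "jsonld"}  # formats listed for the catalog endpoint


def catalog_url(fmt="xml", page=1, modified_since=None, q=None):
    """Build a catalog endpoint URL from the documented parameters."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    params = {"page": page}
    if modified_since is not None:
        params["modified_since"] = modified_since  # ISO date, e.g. "2023-01-01"
    if q is not None:
        params["q"] = q
    # urlencode escapes spaces and other reserved characters in the values
    return f"{BASE_URL}/catalog.{fmt}?{urlencode(params)}"
```

To walk the whole catalog, a harvester would typically request page 1, 2, 3, and so on until a page contains no datasets.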

Individual Dataset Endpoints

Access metadata for a specific dataset:

  • https://dadosabertos.artesp.sp.gov.br/dataset/{dataset-id}.{format} where {format} can be xml, ttl, n3, or jsonld

Example: https://dadosabertos.artesp.sp.gov.br/dataset/acidentes.xml

Content Negotiation

Our portal also supports content negotiation, allowing clients to request specific formats using HTTP Accept headers:

  • application/rdf+xml for RDF/XML format
  • text/turtle for Turtle format
  • text/n3 for N3 format
  • application/ld+json for JSON-LD format

Example using curl: curl -H "Accept: text/turtle" https://dadosabertos.artesp.sp.gov.br/dataset/rodovias-concedidas
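
The same request can be made from Python's standard library. This is a hedged sketch: the helper names (`negotiated_request`, `fetch_rdf`) are illustrative, and the 30-second timeout is an arbitrary choice rather than a portal requirement.

```python
from urllib.request import Request, urlopen

# Media types listed above, keyed by the short format name
ACCEPT_TYPES = {
    "xml": "application/rdf+xml",
    "ttl": "text/turtle",
    "n3": "text/n3",
    "jsonld": "application/ld+json",
}


def negotiated_request(dataset_url, fmt="ttl"):
    """Build a request that asks for one RDF serialization via the Accept header."""
    return Request(dataset_url, headers={"Accept": ACCEPT_TYPES[fmt]})


def fetch_rdf(dataset_url, fmt="ttl", timeout=30):
    """Fetch the dataset description and return it as text (performs a network call)."""
    with urlopen(negotiated_request(dataset_url, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `fetch_rdf("https://dadosabertos.artesp.sp.gov.br/dataset/rodovias-concedidas")` would request the Turtle serialization of that dataset.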

DCAT Configuration

Our DCAT implementation is configured with the following settings:

  • RDF profile: DCAT-AP 3.0
  • RDF endpoints enabled
  • Content negotiation enabled
  • 100 datasets per page

Setting Up a Harvester in CKAN

If you are using CKAN to harvest data from our portal, you can use the CKAN Harvester extension. Here are the basic steps:

  1. Install the ckanext-harvest extension in your CKAN instance
  2. Configure the harvester to use either the CKAN harvester (for CKAN-to-CKAN harvesting) or the DCAT RDF harvester (for harvesting via our DCAT endpoints)
  3. Create a new harvest source pointing to our portal URL
  4. Configure the harvester with appropriate options (frequency, filters, etc.)
  5. Start the harvesting process

Example Configuration for DCAT RDF Harvester

When setting up a DCAT RDF harvester, you can use this configuration:

{
  "rdf_format": "xml",
  "profiles": ["euro_dcat_ap_3"],
  "default_extras": {
    "harvest_source_title": "ARTESP Open Data Portal",
    "harvest_source_url": "https://dadosabertos.artesp.sp.gov.br/"
  }
}
  

Using the CKAN API for Data Access

Beyond DCAT harvesting, the CKAN Action API offers an RPC-style interface for interacting with the portal programmatically. You can retrieve dataset metadata, search the catalog, and list organizations, groups, and tags using HTTP requests with JSON payloads. This method gives you finer-grained control over data access than harvesting the full catalog.

Introduction to the CKAN API

The API allows you to perform most actions available through the web interface.

API Base URL: https://dadosabertos.artesp.sp.gov.br/api/3/action/

For example, to list datasets, the action name is `package_list`, and the full URL would be: https://dadosabertos.artesp.sp.gov.br/api/3/action/package_list

JSON Response Structure

A typical API response is a JSON object. A successful call returns the following structure:

{
  "help": "Help text for the called action...",
  "success": true,
  "result": ["...data returned by the action..."]
}

A failed call sets "success" to false and carries an "error" object instead of "result":

{
  "help": "Help text for the called action...",
  "success": false,
  "error": {
    "message": "Error message",
    "__type": "Error Type"
  }
}

  • help: A documentation string for the API function you called.
  • success: A boolean indicating if the call was successful (true) or not (false). Always check this field.
  • result: The data returned by the function. Its structure depends on the specific action.
  • error: If `success` is false, this object contains details about the error.
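
As a minimal sketch of how a client might apply these fields, the helper below parses a response body and raises when `success` is false. `CkanActionError` and `unwrap` are illustrative names, not part of CKAN or this portal.

```python
import json


class CkanActionError(Exception):
    """Raised when an API response reports success == false."""


def unwrap(response_text):
    """Return the `result` of an Action API response, or raise on failure."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        error = payload.get("error", {})
        raise CkanActionError(
            f"{error.get('__type', 'Unknown error')}: {error.get('message', '')}"
        )
    return payload["result"]
```

Raising an exception keeps failed calls from being mistaken for empty results further down a harvesting pipeline.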

API Version

The current and recommended API version is v3. It is good practice to include `/api/3/` in your request URLs to ensure compatibility.

Authentication

Most read actions on public data do not require authentication. Actions that modify data (create, update, delete) or access private datasets require an API key. On CKAN 2.9 and later this credential is an API token generated from your user profile; legacy API keys were removed in CKAN 2.10. Send the key or token in the `Authorization` HTTP header.

Example header: Authorization: YOUR_API_KEY_HERE

You can usually find your API key on your user profile page on the CKAN site.
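
The sketch below shows one way to attach the header to a POST request with Python's standard library. The function name `authorized_request` is illustrative; in practice the key would be read from an environment variable or a secrets store rather than written into source code.

```python
import json
from urllib.request import Request

API_BASE = "https://dadosabertos.artesp.sp.gov.br/api/3/action/"


def authorized_request(action, payload, api_key):
    """Build a POST request for an action, sending the key in the Authorization header."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        API_BASE + action,
        data=body,
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

The request object can then be passed to `urllib.request.urlopen`; building it separately makes it easy to inspect the headers before anything is sent.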

Common API Actions for Data Retrieval

Here are some common read-only actions useful for accessing data and metadata:

List Datasets (package_list)

Returns a list of the names (IDs) of all public datasets.

Example using cURL:

curl -X GET "https://dadosabertos.artesp.sp.gov.br/api/3/action/package_list"

Show Dataset Details (package_show)

Returns complete information about a specific dataset, including its resources.

Parameters:

  • id (string): The name (ID) or UUID of the dataset.

Example using cURL (replace `your-dataset-id` with an actual ID):

curl -X GET "https://dadosabertos.artesp.sp.gov.br/api/3/action/package_show?id=your-dataset-id"
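
The `result` of `package_show` is a dataset object whose `resources` list holds one entry per file or link. As a small sketch, the helper below pulls out the name, format, and download URL of each resource; `resource_links` is an illustrative name, and the sample dataset in the usage lines is made up for demonstration.

```python
def resource_links(package):
    """Return (name, format, url) for each resource in a package_show result."""
    return [
        (res.get("name"), res.get("format"), res.get("url"))
        for res in package.get("resources", [])
    ]


# Hypothetical, trimmed-down package_show result used only for illustration
sample = {
    "name": "your-dataset-id",
    "resources": [
        {"name": "Data file", "format": "CSV", "url": "https://example.org/data.csv"},
    ],
}
links = resource_links(sample)
```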

Search Datasets (package_search)

Allows searching for datasets based on various criteria.

Common Parameters:

  • q (string): Search term (e.g., `q=transport`).
  • fq (string): Filter query using Solr syntax (e.g., `fq=tags:economy organization:artesp`).
  • rows (int): Number of results per page (default 10).
  • start (int): Offset for pagination.
  • sort (string): Sorting criteria (e.g., `sort=score desc, metadata_modified desc`).

Example using cURL (searching for "rodovias" and limiting to 5 results):

curl -X GET "https://dadosabertos.artesp.sp.gov.br/api/3/action/package_search?q=rodovias&rows=5"
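
To collect more results than fit in one page, a client can advance `start` by the number of rows received until the reported `count` is reached. The following sketch assumes the standard `package_search` result fields `count` and `results`; the function names and the page size of 100 are illustrative choices.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://dadosabertos.artesp.sp.gov.br/api/3/action/package_search"


def fetch_page(q, rows, start):
    """Request one page of search results (performs a network call)."""
    query = urlencode({"q": q, "rows": rows, "start": start})
    with urlopen(f"{SEARCH_URL}?{query}", timeout=30) as resp:
        payload = json.load(resp)
    if not payload.get("success"):
        raise RuntimeError(payload.get("error"))
    return payload["result"]


def search_all(q, rows=100, fetch=fetch_page):
    """Yield every dataset matching q, one page at a time."""
    start = 0
    while True:
        result = fetch(q, rows, start)
        datasets = result["results"]
        yield from datasets
        start += len(datasets)
        # Stop on an empty page or once all reported matches have been seen
        if not datasets or start >= result["count"]:
            break
```

The `fetch` argument exists so the paging logic can be exercised with a stub instead of live requests, for example in unit tests.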

Other Listing and Show Actions

Similar actions are available for other CKAN entities:

  • organization_list / organization_show: For organizations.
  • group_list / group_show: For groups.
  • tag_list / tag_show: For tags.
  • resource_show: To get details of a specific resource (file/link within a dataset). Requires resource ID.

API Actions Cheat Sheet

The following table provides a quick summary of common API actions:

Action Name | Description | HTTP Method | Key Parameters (in JSON body or URL query)
----------- | ----------- | ----------- | ------------------------------------------
package_list | Returns a list of the names (IDs) of all public datasets. | GET or POST | limit (int, optional), offset (int, optional)
package_show | Returns detailed metadata for a specific dataset. | GET or POST | id (string, required: dataset ID or name)
package_search | Searches datasets based on various criteria. | GET or POST | q (string, optional: search term), fq (string, optional: filter query), rows (int, optional), start (int, optional)
resource_show | Returns detailed metadata for a specific resource. | GET or POST | id (string, required: resource ID)
organization_list | Returns a list of names (IDs) of all public organizations. | GET or POST | limit (int, optional), offset (int, optional), all_fields (boolean, optional)
organization_show | Returns detailed metadata for a specific organization. | GET or POST | id (string, required: organization ID or name), include_datasets (boolean, optional)
group_list | Returns a list of names (IDs) of all public groups. | GET or POST | limit (int, optional), offset (int, optional), all_fields (boolean, optional)
tag_list | Returns a list of all tag names. | GET or POST | query (string, optional), vocabulary_id (string, optional)
package_create | Creates a new dataset. (Requires Auth) | POST | name (string, required), owner_org (string, required: organization ID), title (string, optional), resources (list, optional)
resource_create | Adds a new resource to a dataset. (Requires Auth) | POST | package_id (string, required), url (string, if not uploading) or upload (file, for direct upload), name (string, optional)
package_update / package_patch | Updates an existing dataset (fully or partially). (Requires Auth) | POST | id or name (string, required), other dataset fields to modify
resource_update / resource_patch | Updates an existing resource (fully or partially). (Requires Auth) | POST | id (string, required), other resource fields to modify
package_delete | Marks a dataset as deleted. (Requires Auth) | POST | id (string, required: dataset ID or name)

Using the `ckanapi` Python Client and CLI

For Python users and system administrators, the `ckanapi` library offers a convenient way to interact with the CKAN API, both as a Python module and a command-line interface (CLI) tool.

Installation:

pip install ckanapi

CLI Examples:

  • List datasets:
    ckanapi action package_list -r https://dadosabertos.artesp.sp.gov.br
  • Show dataset details (replace `your-dataset-id`):
    ckanapi action package_show id=your-dataset-id -r https://dadosabertos.artesp.sp.gov.br

The `ckanapi` library is highly recommended for scripting interactions with the API.
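
Used as a Python module, `ckanapi` exposes each action as a method on `RemoteCKAN(...).action`. The sketch below is one possible way to use it: `connect` and `dataset_titles` are illustrative helper names, and the `user_agent` string is a made-up value.

```python
def connect(portal_url="https://dadosabertos.artesp.sp.gov.br"):
    """Create a ckanapi client for the portal (requires `pip install ckanapi`)."""
    # Imported here so dataset_titles can be used with any compatible client object
    from ckanapi import RemoteCKAN

    return RemoteCKAN(portal_url, user_agent="artesp-harvest-example/0.1")


def dataset_titles(ckan, limit=5):
    """Map the first `limit` dataset names to their titles."""
    names = ckan.action.package_list(limit=limit)
    return {name: ckan.action.package_show(id=name)["title"] for name in names}
```

Calling `dataset_titles(connect())` would issue one `package_list` request followed by one `package_show` request per dataset. `ckanapi` raises its own exceptions (such as `ckanapi.errors.NotFound`) when an action fails, so the `success` field does not need to be checked by hand.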

API Usage Tips and Best Practices

  • Check `success` Field: Always verify the `success` field in the API response, not just the HTTP status code, to confirm the action was successful.
  • Error Handling: Implement robust error handling by parsing the `error` object when `success` is `false`.
  • Pagination: Actions that return lists accept paging parameters. `package_search` uses `rows` and `start`, while `package_list`, `organization_list`, and `group_list` use `limit` and `offset`.
  • Rate Limiting: Be aware that the API might have rate limits. Design your applications to handle potential throttling gracefully.
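  • Full Documentation: For a comprehensive list of API actions, their parameters, and more detailed examples, refer to the official CKAN API guide at docs.ckan.org or to specific guides provided by this portal. The `help_show` action (for example, `/api/3/action/help_show?name=package_search`) returns the documentation string for any action available on the instance.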
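
One common way to handle throttling gracefully is to retry with exponential backoff. The sketch below is an assumption-laden example: this page does not document which status codes the portal returns when it throttles, so the list of retryable codes (429 and common 5xx codes), the retry count, and the helper name `with_backoff` are all illustrative.

```python
import time
from urllib.error import HTTPError, URLError

RETRYABLE = (429, 500, 502, 503, 504)  # assumed, not confirmed by the portal


def with_backoff(call, retries=4, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying transient failures with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return call()
        except HTTPError as exc:
            # Client errors such as 404 will not succeed on retry, so re-raise them
            if exc.code not in RETRYABLE or attempt == retries:
                raise
        except URLError:
            # Network-level failure (DNS, refused connection, timeout)
            if attempt == retries:
                raise
        sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... with the defaults
```

Any function that performs a request can be wrapped, for example `with_backoff(lambda: urlopen(url, timeout=30).read())`.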

Last updated: June 2025