Openactive Real-Time Paged Data Exchange Standard v0.2.1 (draft)

Document Purpose

This document represents an early draft of a standard for real-time paged data exchange for physical activity data. It is likely to change between now and the first published version, so it should only be used to guide implementations in conversation with the Openactive community, where the quick wins of data sharing and shared learning are high priorities.

Scope

This standard is tightly scoped to cover data exchange and synchronisation itself: the real-time exchange of generic entities between two systems.

Goals:

Non-Goals:

Objectives

Contents

  1. Overall approach

Explanation of the construction of this standard.

  2. Robust Paged JSON Data Exchange

A paged approach to JSON data exchange requiring minimal traffic for real-time or near-real-time synchronisation

  3. Real-time Transport Mechanism

A number of options covering different levels of complexity and data volume:

  1. Polling
  2. Webhooks
  3. Server-Sent Events
  4. AMQP

  4. Implementation

Three common patterns of implementation are presented, along with specific advantages and disadvantages of each, together with miscellaneous notes:

  1. Single JSON entity cache table
  2. Multiple table on-demand JSON generation
  3. Hybrid approach: paging table
  4. Additional notes

1. Overall approach

In order to create a simple standard that is robust and scalable, the transport mechanism is separated from the paged exchange specifics. By applying paging to all transport alternatives, the approach is inherently scalable.

2. Robust Paged JSON Data Exchange

The paged JSON data exchange standard is incredibly simple to implement, but conceptually requires some explanation.

Consider the data exchanged as pages of a continuous stream of records which are sorted first by modified timestamp, and second by ID. This dual sort allows for pages of arbitrary size to be sent without concern for race conditions; if a record is updated during the transfer it will simply reappear further down the stream. If the consumer reaches the end of the stream they are up-to-date, and can frequently revisit the end of the stream in order to retrieve further updates.

Specifically, a JSON page is exchanged, the content of which is defined by two parameters: a “from” modified timestamp and an “after” ID, both taken from the last record of the previously returned page.

Note that the “modified timestamp” (“from” parameter) does not need to be an actual time that can be associated with a clock to conform to this standard; a database-wide counter is sufficient (such as SQL Server’s timestamp or rowversion, the former deprecated as of SQL Server 2012 in favour of the latter). The consumer (“System 2”) maintains a timestamp cursor for each endpoint independently, so at a minimum the “modified” field must provide a deterministic chronological ordering within the scope of the endpoint only.

In the example diagram above, from=1453931925&after=12 would return the second page indicated; hence the last record of a returned page of results can be used as the parameters to retrieve the next page. This paging allows for an ongoing synchronisation of all data, which can be replayed arbitrarily by the client.

For a single entity type (e.g. sessions), the data returned can be defined as follows:

var query;

if (queryParams.from) {
    query = Session.query().filter(
        (Session.modified == queryParams.from && Session.id > queryParams.after)
        || (Session.modified > queryParams.from)
    );
} else {
    query = Session.query();
}

return query.sort([Session.modified, Session.id]);

The above can be run for each relevant entity type (e.g. club, courses, sessions), and they can be reassembled by the client.
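As an illustrative sketch only (Python, with an in-memory list standing in for the database; next_page and page_size are hypothetical names, not part of the standard), the query above behaves as follows. Because an updated record receives a higher modified value, it simply re-enters the stream further along.

```python
# Illustrative in-memory version of the query above; the (modified, id)
# pair is the cursor into the stream.

def next_page(records, from_=None, after=None, page_size=500):
    """Return the next page of the stream strictly after the cursor."""
    if from_ is not None:
        records = [r for r in records
                   if (r["modified"] == from_ and r["id"] > after)
                   or r["modified"] > from_]
    # Dual sort: modified timestamp first, then ID, so the page order is
    # deterministic even when several records share a timestamp.
    ordered = sorted(records, key=lambda r: (r["modified"], r["id"]))
    return ordered[:page_size]
```

The dual (modified, id) sort is what makes the cursor deterministic: two records sharing a timestamp are still returned in a stable order.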

Response grammar / example with notes:

  1. A suggested generic response standard is included here, the idea being that we standardise the transport encapsulation for the records (paging and polling logic) across entities and systems, so that we can genericise this logic on both sides.
  2. The “kind” attribute allows for the representation of different entity types. The standard does not advocate embedding of child entities if they change more frequently than the parent. Each entity type (“kind”) can be synchronised separately; this allows us to decouple the sync logic from the data structure, and to reassemble the data structure on the client side. It also makes the implementation very simple.
  3. The <data> part of the response is then open for whatever data structure is most appropriate for the particular entity type (“kind”).
  4. Deleted items are included in the response with a “deleted” state, but no <data> associated.
  5. For polling, the “next” URL in the response is a precomputed next URL that would be called by the client to get the next page (which would be polled after a delay if the previous page had returned no data). Note “polling” and “paging” are differentiated only by the duration between requests.
  6. Although the IDs shown here are GUIDs and those above are numeric, and although the example above shows Unix timestamps, the standard does not prescribe a specific format for either.
  7. Although IDs are provided for other related entities, this structure is not part of the standard.
  8. Although an example endpoint name is provided, this is outside the scope of this standard.

<response> => {                                            (1)
    items: [<item>,<item>,<item>,...],
    next: '/getSessions?from=<date>&after=<id>'            (5)
}

<item> => {
    state: 'Updated' | 'Deleted',                          (4)
    kind: "session",                                       (2)
    id: "{21EC2020-3AEA-4069-A2DD-08002B303123}",          (6)
    modified: Date(a),
    data: <data>                                           (4)
}

<data> => {                                                (3)
    lat: 51.5072,
    lng: -0.1275,
    name: 'Acrobatics with Dave',
    groupId: "{0657FD6D-A4AB-43C4-84E5-0933C84B4F4F}",     (7)
    clubId: "{38A52BE4-9352-453E-AF97-5C3B448652F0}"
}

A full example REST response from polling:

/getSessions?from=Date(a)

-> { items: [{
        state: 'Updated',
        kind: "session",
        id: "{c15814e5-8931-470c-8a16-ef45afedaece}",
        modified: Date(a),
        data: {
            lat: 51.5072,
            lng: -0.1275,
            name: 'Acrobatics with Dave',
            clubId: "{fc1f0f87-0538-4b05-96a0-cee88b9c3377}"
        }
    },{
        state: 'Deleted',
        kind: "session",
        id: "{d97f73fb-4718-48ee-a6a9-9c7d717ebd85}",
        modified: Date(b)
    }],
    next: '/getSessions?from=Date(b)&after={d97f73fb-4718-48ee-a6a9-9c7d717ebd85}'
}

3. Transport Mechanism

These transport options cover different levels of complexity and data volume. Note that in all cases polling must be implemented to support a full cache refresh and data download. The real-time transport mechanisms work alongside infrequent polling to keep the data current.

In the case of real time transport failure, a production client implementation can fall back to polling.

Transport Options

Polling (Simple download)

    Advantages: Simple to implement.
    Disadvantages: Does not provide a real-time feed, and heuristic polling will result in patchy sync.
    Primary Use Case: Full cache refresh (can also be used in isolation for a prototype implementation).

Webhooks (Real-time)

    Advantages: Less traffic than polling, more server-side control, allows for real-time, uses a standard REST interface.
    Disadvantages: Uses many high-latency connections.
    Primary Use Case: Basic production implementation.

Server-Sent Events (Real-time)

    Advantages: An optimisation over webhooks as it uses one connection, so can handle higher volume.
    Disadvantages: Requires additional libraries.
    Primary Use Case: High volume production implementation.

AMQP (Real-time)

    Advantages: Pages can be handed off to the queue to facilitate even higher volume than Server-Sent Events.
    Disadvantages: Requires additional infrastructure.
    Primary Use Case: Very high volume production implementation.

a. Polling

A basic REST endpoint which accepts the from and after parameters is required to allow a full cache refresh / data download on demand (e.g. /getSessions?from=Date(b)&after={d97f73fb}).


For cases where only a polling endpoint is available, the client will poll the endpoint using heuristic backoff.

Implementation of at least one other type of endpoint is recommended in order to enable real-time updates.
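A heuristic backoff poller along these lines is one possible sketch (Python for illustration; fetch_page, process, the delay values and max_empty_polls are assumptions for this example, not part of the standard — a production poller would run indefinitely):

```python
import time

def poll(fetch_page, process, sleep=time.sleep,
         base_delay=1.0, max_delay=300.0, max_empty_polls=None):
    """Poll the endpoint, backing off while pages come back empty and
    resetting the delay as soon as new items appear.
    max_empty_polls=None polls indefinitely."""
    from_, after, delay, empty = None, None, base_delay, 0
    while max_empty_polls is None or empty < max_empty_polls:
        page = fetch_page(from_, after)
        if page:
            for item in page:
                process(item)
            last = page[-1]                      # advance the cursor
            from_, after = last["modified"], last["id"]
            delay, empty = base_delay, 0         # data flowing: poll promptly
        else:
            empty += 1
            sleep(delay)                         # end of stream: back off
            delay = min(delay * 2, max_delay)
    return from_, after
```

The delay resets whenever data arrives, so a quiet endpoint is polled rarely while a busy one is drained promptly.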

b. Webhooks

Webhooks use the same mechanism as polling, except that pages are pushed from server to client, rather than requested explicitly by the client from the server.

The client registers an endpoint with the server, and the server repeatedly sends subsequent pages to the client. Using the same paging features as with polling allows the server to batch items to increase throughput.

When sending a particular page to the client, the server is expected to wait for a successful acknowledgement of the page before sending the next. If delivery of a page fails, that page should be retried continuously with an exponential backoff; the server should only proceed to the following page after successfully sending the previous one.
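A sketch of the per-page retry, assuming a send function that returns True on a successful acknowledgement from the client (the names and delay values are illustrative only, not part of the standard):

```python
import time

def deliver_page(send, page, base_delay=1.0, max_delay=3600.0,
                 sleep=time.sleep):
    """Retry a single webhook page with exponential backoff until the
    client acknowledges it (e.g. an HTTP 2xx response). The next page is
    only attempted after this function returns."""
    delay = base_delay
    while not send(page):
        sleep(delay)                       # wait before retrying this page
        delay = min(delay * 2, max_delay)  # double the wait, up to a cap
```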

Note that during a full cache refresh the client will page the server for all data, and may simultaneously be receiving webhook requests from the server. By using the timestamp of each record to ensure records are only updated with newer data, the client is able to perform the full cache refresh and receive webhook requests simultaneously. Alternatively the client can choose to drop webhook requests until its full cache refresh is complete, which will trigger the exponential backoff behaviour on the server, ensuring a good crossover of items between the end of the cache refresh and the webhook updates resuming.
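The timestamp guard can be sketched as a conditional upsert keyed by (“kind”, ID); the dict used here is a hypothetical stand-in for the client’s data store, not part of the standard:

```python
def upsert(index, item):
    """Apply an incoming item only if it is newer than what we already
    hold, so a full cache refresh and live webhook pushes can safely
    proceed in parallel."""
    key = (item["kind"], item["id"])
    current = index.get(key)
    if current is None or item["modified"] > current["modified"]:
        index[key] = item  # stale items (older modified) are discarded
```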

Example payload for the webhook:

  1. Note that the “next” from the response is not required here. Instead, from and after should be stored for each client of the server, in order that the server is able to send the relevant next page to each client.

/putSessions

-> { items: [{
        state: 'Updated',
        kind: "session",
        id: "{c15814e5-8931-470c-8a16-ef45afedaece}",
        modified: Date(a),
        data: {
            lat: 51.5072,
            lng: -0.1275,
            name: 'Acrobatics with Dave',
            clubId: "{fc1f0f87-0538-4b05-96a0-cee88b9c3377}"
        }
    },{
        state: 'Deleted',
        kind: "session",
        id: "{d97f73fb-4718-48ee-a6a9-9c7d717ebd85}",
        modified: Date(b)
    }] (1)
}

c. Server-Sent Events

Server-Sent Events (over HTTPS) provides a simple and efficient channel through which a high volume of updates can pass. Although an additional library may be required depending on the server’s platform, it is a very light implementation.

The responsibility is on the client to reestablish a connection to the server and inform it of the last retrieved record in order to continue the stream, which is closer to polling than to webhooks.

Response grammar / example with notes:

  1. Only the <item> from the <response> is required here, and is passed as “data” (the Server-Sent Event specification of “data”, which is different from the “data” part of the item). The explicit paging used with polling and webhooks is made redundant as this is a continuous stream over one connection.
  2. The event type is always set to “itemupdate”, as the state of each item is set with the “state” field consistent with polling.

/stream?from=Date(b)&after={d97f73fb-4718-48ee-a6a9-9c7d717ebd85}

->

event: itemupdate (2)

data: <item> (1)

An example stream is below:

/stream?from=Date(a)&after={a97f73fb-4718-48ee-a6a9-9c7d717ebd85}

->

event: itemupdate
data: {
    state: 'Updated',
    kind: "session",
    id: "{c15814e5-8931-470c-8a16-ef45afedaece}",
    modified: Date(a),
    data: {
        lat: 51.5072,
        lng: -0.1275,
        name: 'Acrobatics with Dave',
        clubId: "{fc1f0f87-0538-4b05-96a0-cee88b9c3377}"
    }
}

event: itemupdate
data: {
    state: 'Deleted',
    kind: "session",
    id: "{d97f73fb-4718-48ee-a6a9-9c7d717ebd85}",
    modified: Date(b)
}
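On the client side, a stream like the one above can be consumed with a minimal parser along these lines; this is a sketch only, not a complete implementation of the Server-Sent Events wire format (a production client would normally use an EventSource library):

```python
def parse_sse(text):
    """Minimal parse of an SSE stream into (event, data) pairs.
    Events are separated by blank lines; multiple data: lines per
    event are joined with newlines."""
    events = []
    event_type, data_lines = None, []
    for line in text.splitlines() + [""]:  # trailing "" flushes the last event
        if line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            events.append((event_type or "message", "\n".join(data_lines)))
            event_type, data_lines = [], None
            event_type, data_lines = None, []
    return events
```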

d. AMQP

AMQP (RabbitMQ et al.) provides a two-way channel for events which includes buffering and multiple connections to increase throughput.

As with Server-sent events, only the <item> from the <response> is required to be sent in the message, as AMQP makes the explicit paging redundant.

For very high volumes, this allows the server to send multiple pages in parallel, as it can calculate the “from” and “after” parameters and control the send. The client can also process these in parallel (particularly useful in the case of shared or NoSQL data stores with scaling queue processors), using the timestamp of each record to ensure records are only updated with newer data.

However this requires additional infrastructure and the use of certificates (more complex to configure than HTTPS).

4. Implementation

a. Single JSON entity cache table

Create a cache table which is written to on each entity change (or related entity change, if a calculated field is created), either via an application or database trigger (which can also be used to trigger the webhook). The table contains the rendered JSON <item>, along with the modified timestamp and the ID.

New entries overwrite old items in the table, carrying a newer modified timestamp.

This table can be easily parsed into output for the client. This has the advantage of allowing one endpoint and one process to manage the real-time sync by watching this single table, as the table can maintain a sort and page across all entity “kinds”.
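A minimal sketch of such a cache table, using SQLite for illustration (the table and column names here are assumptions, not part of the standard):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE item_cache (
        kind     TEXT NOT NULL,
        id       TEXT NOT NULL,
        modified INTEGER NOT NULL,
        item     TEXT NOT NULL,          -- the rendered JSON <item>
        PRIMARY KEY (kind, id)
    )
""")

def cache_item(kind, id_, modified, item):
    """Write-through on each entity change; a newer write for the same
    (kind, id) replaces the older row."""
    conn.execute(
        "INSERT OR REPLACE INTO item_cache (kind, id, modified, item) "
        "VALUES (?, ?, ?, ?)",
        (kind, id_, modified, json.dumps(item)),
    )

def page(from_, after, size=500):
    """One endpoint serves all kinds, sorted by (modified, id)."""
    return conn.execute(
        """SELECT kind, id, modified, item FROM item_cache
           WHERE (modified = ? AND id > ?) OR modified > ?
           ORDER BY modified, id LIMIT ?""",
        (from_, after, from_, size),
    ).fetchall()
```

Because the primary key is (kind, id) and the sort spans the whole table, a single poller or webhook process can serve every entity “kind” from this one table.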

b. Multiple table on-demand JSON generation

The JSON is generated from each table individually at the point that it is requested by either the webhook or poll.

An endpoint will be required for each entity “kind”, as the sort cannot efficiently happen across tables. These endpoints would require separate webhook / polling processes to keep them in sync (though the webhooks can all share the same endpoint on the client).

c. Hybrid approach: paging table

A paging table could be created that contains only the Kind, Modified Timestamp and ID. This table is then updated with each entity update; however, the JSON is only generated on demand, by getting the next page from the paging table and rendering JSON for each of the IDs of the “kinds” returned.

d. Additional notes

Calculated fields (e.g. available spaces or available tickets for a session)

Some entities will not need to be synchronised, but fields calculated from them will need to be known to the client (e.g. the “tickets” table may not need to be synchronised, but the “available tickets” calculated field on the “sessions” table will be required).

The suggested approach is to calculate the field “available tickets” and store it in the “sessions” table on each ticket sale. This has three advantages:

An alternative could be to calculate it on each synchronisation, however this will slow down the sync. Assuming that reads will occur on this calculated data more frequently than writes, caching the calculated field is recommended.

Strict ordering of items

For high-throughput AMQP, when updating data in the client’s index, the timestamp of each record will be used to ensure records are only updated with newer data. Although this technique applies specifically to AMQP, all other methods of transport must adhere to a strict ordering of items by modified timestamp and ID (per kind) in order to ensure data consistency. Ordering between kinds is not important.

Database Triggers (MS SQL Server example)

If using a timestamp/rowversion field on the parent table as the “modified timestamp”, using the trigger below for each child table will update the rowversion field in the relevant rows on the parent table when the child table is updated.

The SET SomeColumn = SomeColumn part of this trigger could easily be replaced with setting materialised calculated fields (e.g. “total number of tickets sold”) which contain the summary data required by “System 2”. This reduces the need to join the child table during the API endpoint response, which helps to optimise the endpoint.

The example below has been adapted from here, see here for an explanation of the mechanics.

CREATE TRIGGER tgUpdateParentRowVersion ON ChildTable FOR INSERT, DELETE, UPDATE
AS
BEGIN
    /*
     * The updates below force the update of the parent table rowversion
     */

    /* Materialised field calculation goes here */
    UPDATE ParentTable
       SET SomeColumn = SomeColumn
      FROM ParentTable a
      JOIN inserted    i ON a.pkParentTable = i.fkParentTable

    UPDATE ParentTable
       SET SomeColumn = SomeColumn
      FROM ParentTable a
      JOIN deleted     d ON a.pkParentTable = d.fkParentTable
END

Version 0.2

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.

To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/