Realtime Paged Data Exchange 0.3.0

Abstract

This specification tackles the generic use-case of unidirectional real-time data synchronisation between two systems, where the receiving system requires only summary data from the origin system.

1. Introduction

This section is non-normative.

1.1 Openactive

The W3C Openactive Community Group was established with the objective of facilitating the sharing and use of physical activity data. The motivation for that is physical activity data is currently largely closed, and sharing and opening up of this data has enormous potential.

Openactive specifications are modular, each focussing on a specific use cases regarding physical activity data.

Other aspects of the Community Group's work includes the collation and specification of test suites to assess conformance to referenced standards, and the specification of a framework within which such conformance testing might be assessed are referenced.

1.2 Use case and requirements

Booking systems contains session data (a specific event at a time in a location). Examples include bookable squash courts, Yoga classes, and running groups.

1.2.1 Realtime

This session data is frequently updated, as sessions are booked, descriptions changed. The data must be available as close to real-time as possible.

1.2.2 Simplicity of implementation

Many booking systems are maintained by an agency or supplier, and changes are funded by organisations with a small budgets. Hence simplicity and speed of implementation is paramount.

1.2.3 Flexibility of interface

Due to the cost of work on the booking system side, this interface must not be a constraint to innovation. Innovators should be able to easily implement novel solutions on top of the data without additional effort on the booking system side. This means ideally pushing all query complexity out of the booking system in order to maximise flexibility, and providing a simple sync.

1.3 Scope

This specification is tightly defined to cover data exchange and synchronisation itself; to cover the real-time exchange of generic entities between two systems.

1.3.1 Goals

Sharing of session data, including related metadata (clubs, courses, membership requirements, skill level, cost), and events (e.g. open days).
Paging and synchronisation to ensure robust incremental transport of data.
Allows a client to "refresh its cache", by providing the facility to download all data.
(Stretch goal) Extendable to support real-time, high volume data transfer, to satisfy peak load requirements.

1.3.2 Non-Goals

User Authentication.
Booking and payment.
Availability of courts and facilities.
Membership packages.
Standardisation of content (e.g. "Football" = "Soccer").

1.4 Objectives

Easy to understand
Simple to implement (does not require complex libraries)
Based on existing standards where possible
Minimalistic (focus on removing complexity)
Minimise traffic between services
Robust to errors and failed requests
Capable of scaling to handle high volumes

4. Paging

A paged approach to [JSON] data exchange is used, requiring minimum traffic for real-time or near-real-time synchronisation.

4.1 Simplicity

The paged [JSON] data exchange standard is incredibly simple to implement, but conceptually requires some explanation.

4.2 Core concept

Consider an ordered list as shown in Fig. 2 Illustration of the deterministic list, with the following invariants:

A continuous list of records that MUST be sorted deterministically and chronological (in the order they were updated). Either (i) ordered first by modified timestamp, and second by ID or (ii) ordered by an incrementing counter where records are assigned a new unique value on each update.
Every record MUST only be represented once in this list at a given moment, with its position in the list depending on when it was last updated. Records can freely move position in the list as they are updated.
This deterministic ordering based on timestamp allows for pages of arbitrary size to be sent without concern for race conditions; if a record is updated during the transfer of a page it MUST appear on a subsiquent page (i.e. simply reappear further down the list).
Pages are defined using a "next page URL", which MUST contain enough information to identify a position in the list (e.g. by a "timestamp" and "ID" combination). It MUST NOT reference a specific record, as that record can change position in the list.
If a "next page URL" is not used to access the list, the first page MUST be returned.
If the consumer reaches the end of the list they consider themselves up-to-date at that moment, and can frequently revisit the end of the list in order to retrieve further updates.

Fig. 2 Illustration of the deterministic list

4.3 Example Implementations

Two example implementations are described below, both adhere to the specified invariants. One of these two examples SHOULD be used as a basis for any implementation, unless the invariants are well understood and the implementation can be thoroughly tested.

4.3.1 Modified Timestamp and ID

The content of a [JSON] page which could be defined by two parameters:

afterTimestamp: the modified timestamp, after which results will be returned (if not specified will return from beginning of time)
afterId: the ID after which results will be returned (if not specified will return all IDs from the first)

In this example, afterTimestamp=1453931925&afterId=12 would return Page 2 labelled in Fig. 2 Illustration of the deterministic list. ence the last record of a returned page of results can be used as parameters to retrieve the next page of results. This paging allows for an ongoing data synchronisation that synchronises all data, which can be replayed arbitrarily by the client.

Note that the timestamp does not need to reflect actual time, as long as it providers a chronological ordering. An integer, string or other representation is sufficient, provided it is sufficiently comparable to support the invariants.

For a single entity type (e.g. sessions), the data returned can be defined as follows:

   var query;
   if (queryParams.from) {
       query = Session.query().filter((Session.modified == queryParams.afterTimestamp && Session.id > queryParams.afterId) || (Session.modified > queryParams.afterTimestamp));
   } else {
       query = Session.query()
   }
   return query.sort([Session.modified, Session.id])

4.3.2 Incrementing Unique ID

The content of a [JSON] page could be defined by just one parameter:

afterId: the ID after which results will be returned (if not specified will return all IDs from the first)

The ID MUST provide a deterministic chronological ordering within the scope of the endpoint. A database-wide counter is sufficient (such as SQL Server's timestamp or rowversion). The consumer ("System 2") simply maintains the "next page URL", so the detail of the ID is not constrained.

This specification can be implemented for each relevant entity type within System 1 (e.g. club, courses, sessions).

Data from related entities can either be:

Embedded inside the parent entity. This requires the parent "updated" field to reflect the maximum value of all children - achievable via database triggers or similar.
Provided by separate endpoints implementing this specification, the combination of which can be reassembled by System 2.

4.5 Response grammar

It should be noted that although the <data> element is open to conformance with [JSON-LD], the paging has been deemed too trivial to require the application of any specific serialization.

4.5.1 `<response>`

   <response> => {
         items: [<item>,<item>,<item>,...],
         next: "/getSessions?afterTimestamp=<date>&afterId=<id>"
      }

A generic response specification is included here, the idea being that we standardise the transport encapsulation for the records (paging and polling logic) across entities and systems so that, we can genericise this logic on both sides.

Key	Description
items	An array of `<item>`, which should simply by empty `[]` if no results are returned.
next	For polling, the "next" URL in the response is a precomputed next URL that would be called by the client to get the next page (which would be polled after a delay if the previous page had returned no data). Note "polling" and "paging" are differentiated only by the duration between requests. Although an example endpoint name is provided, this is outside the scope of this standard.

4.5.2 `<item>`

   <item> => {
         state: 'updated' | 'deleted', (4)
         kind: "session", (2)
         id: "{21EC2020-3AEA-4069-A2DD-08002B303123}", (6)
         modified: Date(a),
         data: <data> (4)
      }

Key	Description
state	Deleted items are included in the response with a "deleted" state, but no `<data>` associated.
kind	The "kind" attribute allows for the representation of different entity types. The standard does not advocate embedding of child entities if they change more frequently than the parent. Each entity type ("kind") can be synchronised separately, this allows us to decouple the sync logic from the data structure, and allows us to reassemble the data structure on the client side. It also makes the implementation very simple.
id	Although IDs shown here are GUIDs, and above are numeric, and although the example above shows Unix timestamps, the standard does not prescribe any specific format of either.
modified	Modified timestamp of the item. Must be comparable to itself.
data	Note this key is not included if state is `'deleted'`

4.5.3 `<data>`

The <data> part of the response is open for applications for populate with whatever data structure is most appropriate for the entity type ("kind") being exchanged.

Applications SHOULD use a standard data model to increase interoperability of the exchanged data. For example an application may choose to use entities and properties defined in [SCHEMA-ORG] to help structure data in a useful way for reusers.

Applications that are exchanging data about entities described in [Modelling-Opportunity-Data] SHOULD ensure that their data conforms to that specification. This includes data on Events, Organisations, Places, etc. [Publishing-Opportunity-Data] provides additional examples and guidance on using that data model in ways that are compatible with the other standards publishing by OpenActive.

Applications that publish data using bespoke data models SHOULD provide users with relevant documentation.

Note

The remainder of this specification uses simple examples when showing <data> elements in responses. However these examples are for illustrative purposes only. For a comprehensive selection of examples of opportunity data, read the [Publishing-Opportunity-Data] primer.

4.6 Example

A full example REST response from polling:

   /getSessions?afterTimestamp=1453931101&afterId={c15814e5-8931-470c-8a16-ef45afedaece}
   -> { items: [
           {
               state: "updated",
               kind: "session",
               id: "{c15814e5-8931-470c-8a16-ef45afedaece}",
               modified: 1453931101,
               data: {
                  lat: 51.5072,
                  lng: -0.1275,
                  name: 'Acrobatics with Dave',
                  clubId: "{fc1f0f87-0538-4b05-96a0-cee88b9c3377}"
               }
           },
           {
               state: "deleted",
               kind: "session",
               id: "{d97f73fb-4718-48ee-a6a9-9c7d717ebd85}",
               modified: 1453931925
           }
         ],
         next: '/getSessions?afterTimestamp=1453931925&afterId={d97f73fb-4718-48ee-a6a9-9c7d717ebd85}'
   }

5. Transport

These transport options cover different levels of complexity and data volume. Note that in all cases polling must be implemented to support a full cache refresh and data download. The real-time transport mechanisms work alongside infrequent polling to keep the data current.

In the case of real time transport failure, a production client implementation can fall back to polling.

Transport Options	Advantages	Disadvantages	Primary Use Case
Polling (Simple download)	Simple to implement	Does not provide a real-time feed, and heuristic polling will result in patchy sync	Full cache refresh (also can be used in isolation for prototype implementation).
Webhooks (Real-time)	Less traffic than polling, more server-side control, allows for real-time, uses standard REST interface	Uses many high-latency connections	Basic production implementation
Server-Sent Events (Real-time)	Optimisation over webhooks as uses one connection, so can handle higher volume	Requires additional libraries	High volume production implementation
AMQP (Real-time)	Pages can be handed off to the queue to facilitate even higher volume than Server-Sent Events	Requires additional infrastructure	Very high volume production implementation

5.1 Polling

A basic REST endpoint which accepts the from and after parameters is required to allow a full cache refresh / data download on demand (e.g. /getSessions?afterTimestamp=Date(b)&afterId={d97f73fb}).

For cases where only a polling endpoint is available, the client will poll the endpoint using heuristic backoff.

Implementation of at least one other type of endpoint is recommended in order to enable real-time updates.

5.2 Webhooks

Webhooks use the same mechanism as polling, except that pages are pushed from server to client, rather than requested explicitly by the client from the server.

The client registers an endpoint with the server, and the server repeatedly sends subsequent pages to the client. Using the same paging features as with polling allows the server to batch items to increase throughput.

When sending a particular page to the client the server is expected to wait for a successful acknowledgement of the page before sending the next page. If sending of a page fails that page should be continuously retried with an exponential backoff. The server should only proceed to the following page after successfully sending the previous one.

Note that during a full cache refresh the client will page the server for all data, and may simultaneously be receiving webhook requests from the client. Using the timestamp of each record to ensure records are only updated with newer data, the client is able to perform both the full cache refresh and receive webhook requests simultaneously. Alternatively the client can choose to drop webhook requests until its full cache refresh is complete, which should trigger the exponential backoff behaviour from the server, ensuring a good crossover in items between the end of the cache refresh and the webhook updates resuming.

5.2.1 Webhooks Example

Example payload for the webhook:

   /putSessions
   -> { items: [{
          state: 'updated',
          kind: "session",
          id: "{c15814e5-8931-470c-8a16-ef45afedaece}",
          modified: Date(a),
          data: {
              lat: 51.5072,
              lng: -0.1275,
              name: 'Acrobatics with Dave',
              clubId: "{fc1f0f87-0538-4b05-96a0-cee88b9c3377}"
          }
       },{
          state: 'deleted',
          kind: "session",
          id: "{d97f73fb-4718-48ee-a6a9-9c7d717ebd85}",
          modified: Date(b)
       }]
   }

Note that the "next" from the response is not required here. Instead, from and after should be stored for each client of the server, in order that the server is able to send the relevant next page to each client.

5.3 Server-Sent Events ([eventsource])

Server-Sent Events ([eventsource]) (over HTTPS) provides a simple and efficient channel through which a high volume of updates can pass. Although an additional library may be required depending on the servers' platform, it is a very light implementation.

Server-Sent Events require System 1 to implement a system to keep track of events and trigger the sending of data through this channel. This non-trivial implementation, together with the complexities of library support in various languages and frameworks compared with a straightforward [JSON] API implementation used in Polling and Webhooks is the reason that the other transport mechanisms are provided.

The responsibility is on the client to reestablish a connection to the server and inform it of the last retrieved record in order to continue the stream, which is closer to polling than to webhooks.

Response grammar / example:

   /stream?afterTimestamp=Date(b)&afterId={d97f73fb-4718-48ee-a6a9-9c7d717ebd85}
   ->
   event: itemupdate
   data: <item>

Only the <item> from the <response> is required here, and is passed as "data" (the Server-Sent Event specification of "data", which is different from the "data" part of the item). The explicit paging used with polling and webhooks is made redundant as this is a continuous stream over one connection.

The event type is always set to "itemupdate", as the state of each item is set with the "state" field consistent with polling.

5.3.1 Server-Sent Events Example

An example stream is below:

   /stream?afterTimestamp=Date(a)&afterId={a97f73fb-4718-48ee-a6a9-9c7d717ebd85}
   ->
   event: itemupdate
   data: {
      state: 'updated',
      kind: "session",
      id: "{c15814e5-8931-470c-8a16-ef45afedaece}",
      modified: Date(a),
      data: {
          lat: 51.5072,
          lng: -0.1275,
          name: 'Acrobatics with Dave',
          clubId: "{fc1f0f87-0538-4b05-96a0-cee88b9c3377}"
      }

   event: itemupdate
   data: {
      state: 'deleted',
      kind: "session",
      id: "{d97f73fb-4718-48ee-a6a9-9c7d717ebd85}",
      modified: Date(b)
   }

5.4 AMQP

[OASIS AMQP] (RabbitMQ et. al.) provides a two way channel for events which includes buffering and multiple connections to increase throughput.

As with Server-sent events, only the <item> from the <response> is required to be sent in the message, as AMQP makes the explicit paging redundant.

For very high volumes, this allows the server to send multiple pages in parallel, as it can calculate the "afterTimestamp" and "afterId" parameter and control the send. The client can also process these in parallel (particularly useful in the case of shared or No-SQL data stores, with scaling queue processors), using the timestamp of each record to ensure records are only updated with newer data. However this requires additional infrastructure and the use of certificates (more complex to configure than HTTPS).

6. Implementation

This section is non-normative.

Three common patterns of implementation are presented, along with specific advantages and disadvantages of each, together with miscellaneous notes.

6.1 Single JSON entity cache table

Create a cache table which is written to on each entity change (or related entity change, if a calculated field is created), either via an application or database trigger (which can also be used to trigger the webhook). The table contains the rendered [JSON] <item>, along with the modified timestamp and the ID.

Entries in the table overwrite old items with a newer modified timestamp.

This table can be easily parsed into output for the client. This has the advantage of allowing one endpoint and one process to manage the real-time sync by watching this single table, as the table can maintain a sort and page across all entity "kinds".

6.2 Multiple table on-demand JSON generation

The [JSON] is generated from each table individually at the point that is requested by either the webhook or poll.

An endpoint will be required for each entity "kind", as the sort cannot efficiently happen across tables. These endpoints would require separate webhook / polling processes to keep them in sync (though the webhooks can all share the same endpoint on the client).

6.3 Hybrid approach: paging table

A paging table could be created that contains only the Kind, Modified Timestamp and ID. This table is then updated with each entity update, however the [JSON] is only generated on-demand by getting the next page from the paging table and rendering [JSON] for each of IDs of the "kinds" returned.

6.4 Additional notes

6.4.1 Calculated fields (e.g. available spaces or available tickets for a session)

As some entities will not need to be synchronised, but fields calculated from them will need to be known to the client (e.g. the "tickets" table may not need to be synchronised, but the "available tickets" calculated field on the "sessions" table will be required).

The suggested approach is to calculate the field "available tickets" and store it in the "sessions" table on each ticket sale. This has three advantages:

It keeps the application logic for such calculation together with the original action.
It creates a cached value that can be used in other parts of the application to increase performance.
It prevents the creation of complex calculations which are used only on-demand for the purposes of data synchronisation.

An alternative could be to calculate it on each synchronisation, however this will slow down the sync. Assuming that reads will occur on this calculated data more frequently than writes, caching the calculated field is recommended.

6.4.2 Strict ordering of items

For high-throughput AMQP, when updating data in the client's index, the timestamp of each record will be used to ensure records are only updated with newest data. Although this technique can be used specifically for AMQP, all other methods of transport must adhere to a strict ordering of items by modified timestamp and ID (per kind) in order to ensure data consistency. Ordering between kinds is not important.

6.4.3 Database Triggers (MS SQL Server example)

If using a timestamp/rowversion field on the parent table as the "modified timestamp", using the trigger below for each child table will update the rowversion field in the relevant rows on the parent table when the child table is updated.

The SET SomeColumn = SomeColumn part of this trigger could easily be replaced with setting materialised calculated fields (e.g. "total number of tickets sold") which contain the summary data required by "System 2". This reduces the need to join the child table during the API endpoint response, which helps to optimise the endpoint.

The example below has been adapted from here, see here for an explanation of the mechanics.

CREATE TRIGGER tgUpdateParentRowVersion ON ChildTable FOR INSERT, DELETE, UPDATE
AS
BEGIN
  /
    The updates below force the update of the parent table rowversion
   /

  / Materialised field calculation goes here*/

  UPDATE ParentTable
     SET SomeColumn = SomeColumn
    FROM ParentTable a
    JOIN inserted    i on a.pkParentTable = i.fkParentTable

  UPDATE ParentTable
     SET SomeColumn = SomeColumn
    FROM ParentTable a
    JOIN deleted     d on a.pkParentTable = d.fkParentTable
END