Problem Statement

The Data Transfer Initiative (DTI) works to streamline how users can execute server-to-server data portability. We won’t get into why we think data portability is important because we’ve written about that elsewhere. DTI’s work aims to make things better in several areas: libraries and open source projects, interoperable data schemas, and the trust work that resulted in this site. This document is about how establishing trusted connections between service pairs today is procedurally and technically slow, with substantial time and overhead, and how it could be much better. Today, service connections must be approved and configured through extensive processes that must be completed (then maintained year after year!) before the user is presented with any data transfer options.

The diagram below shows the connectivity among the user, source, and destination in executing server-to-server transfer, but with “Trust & Identity” as step 2 rather than step 1. This raises significant questions for interoperability and architecture, namely: Is it possible for the user to request data transfer and THEN for the services to identify each other and trust each other? Or can the user only ever choose from a limited set of pre-approved transfer destinations?

Connectivity among the user, source, and destination during a server-to-server data transfer
Figure 1: Connectivity among the user, source service, and destination service during a server-to-server data transfer

Today, the services pre-arrange their secure connections and the user can only choose among available connections, but we see a future where we can dynamically establish trust and identity.

We believe that making this architecture more flexible will result in increased user choice and innovation, because the costs for services to trust and identify each other today are so high as to deter startups and experiments. We know of startups that have spent months working through applications to multiple large platforms, delaying the launch of innovative services. For example, a social service built around music can hardly recruit its network of music lovers if it can’t get users’ playlists from Apple Music, Amazon Music, and Spotify. It sounds like it should be doable, but it’s extraordinarily time-consuming for the startup to achieve all three integrations, and it might never happen if even one platform denies the application.

The costs don’t end there. Typically the service developers must implement a different API for every data source they’d like to get user data from. Assuming they are approved for access to each platform, the startup has to maintain that access, dealing with each platform’s distinct renewal or notification systems. The developers must also securely store and manage the access identifiers (API key, OAuth client_id/secret pair, secure token, service account login, etc.) issued by each platform: without one, they cannot connect to that platform’s API. Every identifier, whether specific to a user account or generic, must be managed separately, along with its refreshing or expiration, in addition to any test accounts and test identifiers needed.

DTI’s Data Trust Registry will streamline the trust verification process and reduce costs and uncertainty in that step. But with the technology we presently have, even successfully improving trust verification will still leave the startup manually applying to every service for its access identifier and implementing a different authorization protocol and data transfer protocol for each one. Can we do better than this? Yes, but it’s going to take time and technology.

Our vision is to standardize the trust verification process and drastically improve the ability of services to connect to other services when authorized by the user. The early Web saw a similar transition from experimental encryption and manually-issued certificates to SSL then TLS and automated certificate management (ACME) – and now it takes minutes to get a secure Web site online.

We will begin to achieve our goals when a new entrant to the DTR’s ecosystem can

  • register once instead of with several data sources;
  • implement one protocol or suite of protocols that works with everybody;
  • and, having done those two things, find that everything else just works.

Trust, identity, and authorization evolution
Figure 2: Evolution of trust, identity, and authorization

We break down connecting and data transfer into three major phases:

  • Service introduction. How do the services recognize each other and exchange secrets (to avoid man-in-the-middle or impersonation attacks)? How do they agree to trust each other with the user’s data? Can this be done without manual registration processes between every two services?
  • User authorization. How does the user authorize one service to access an account at the other service to start fetching or sending data?
  • Data transfer. What protocol is used to transfer data? Does each service have its own unique API?

We prefer to thoroughly evaluate what is in use today, conduct a gap assessment, and determine if we can close the gaps, rather than introduce wholly new protocols.

Service introduction

Today, most service introductions require manual processes before any user can transfer their data. Two services must already know about each other before a user can even start a transfer. Normally, this involves humans reading documentation and ensuring the right platform-specific information is exchanged. If a user wants to initiate a data transfer, their only options are destinations for which the service provider has already put in the trust and identity work. These arrangements are often not only manually configured but manually approved: new services wanting to connect to existing ones often must apply through systems that impose long delays, human approvals, or compliance reviews.

This means that a tiny mismatch in expectations — what a service expects from a partner during an authorization or API request — can break interoperability, and most interoperability is established through manual back-and-forth, human communication, and internal configuration.

In the future, we envision a world where services can dynamically introduce themselves to each other, securely and automatically. The user initiates a transfer, and only THEN do the services perform trust + identity checks. If service B has never seen service A before, it can still identify it, validate its credentials, check its trust status in the Data Trust Registry, and proceed — without pre-registration, without manual approvals, and without long delays.
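To make this concrete, here is a minimal sketch of what a destination service’s introduction logic could look like in that future. Everything here is hypothetical: the registry lookup URL, the field names, and the trust_status value are illustrative assumptions, not a published DTR API. The sketch uses Python with the third-party requests library.

```python
import requests  # third-party HTTP client, assumed available

# Hypothetical DTR lookup endpoint; the real registry API may differ.
DTR_LOOKUP = "https://registry.example/api/v1/services/"

def introduce(candidate_service_id: str) -> dict:
    """Dynamic introduction: look up a previously unknown service in the
    Data Trust Registry at transfer time, instead of pre-registering."""
    resp = requests.get(DTR_LOOKUP + candidate_service_id, timeout=10)
    resp.raise_for_status()
    record = resp.json()

    # Trust status comes from the registry, not a manual review queue.
    if record.get("trust_status") != "verified":  # hypothetical field
        raise PermissionError(f"{candidate_service_id} is not trusted")

    # Identity material (e.g. public keys) lets the services authenticate
    # each other without exchanging a shared secret out of band.
    return {
        "service_id": candidate_service_id,
        "public_keys": record["public_keys"],    # hypothetical field
        "api_endpoint": record["api_endpoint"],  # hypothetical field
    }
```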

Authorization Landscape Today

Today, although participants in the Data Trust Registry have a wide diversity of APIs and data transfer protocols, there is some standardization around OAuth and HTTP APIs. This section outlines some of the most common architectural and technology choices so that we can bridge to solutions that streamline trust and service identification.

Export/Import

Although common, export/import architectures are outside our scope. The data source and data destination do not connect to each other, so no negotiation of trust, identity, or protocol is involved. We still believe common export formats would be beneficial.

Common Pattern: Web API and OAuth

Many services host Web APIs for third-party access to personal data. The most common way to authorize this access without knowing the user’s password is OAuth. In this flow, the site that wants access to data redirects the user’s browser to the site that has the data, with a request signaling what data and what scope of access is being requested. Because this takes place in the browser, the source can present any Web UX to let the user approve, deny, or modify the data access authorization. The source then redirects the user back to the requesting service, after which data transfer can proceed directly between the services.

Destination initiated transfer
Figure 3: Destination initiated transfer
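To illustrate the destination-initiated flow in Figure 3, the sketch below shows a destination service constructing a standard OAuth 2.0 authorization-code request URL. The endpoint, client_id, redirect URI, and scope are placeholders; real values come from each platform’s app verification process.

```python
import secrets
from urllib.parse import urlencode

# Placeholder values; in practice these come from the app verification
# process a platform requires before granting API access.
AUTHORIZE_ENDPOINT = "https://source.example/oauth/authorize"
CLIENT_ID = "destination-service-client-id"
REDIRECT_URI = "https://destination.example/oauth/callback"

def build_authorization_url(scope: str = "playlists.read") -> str:
    """Build the URL the user's browser is redirected to. The data source
    shows its approval UX, then redirects back to REDIRECT_URI with a code."""
    params = {
        "response_type": "code",            # authorization-code flow
        "client_id": CLIENT_ID,
        "redirect_uri": REDIRECT_URI,
        "scope": scope,                     # what data and access is requested
        "state": secrets.token_urlsafe(16), # CSRF protection, checked on return
    }
    return f"{AUTHORIZE_ENDPOINT}?{urlencode(params)}"
```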

Alternate pattern - Source-initiated

Other approaches are possible. Some services offer APIs that work in the opposite direction, starting with the user interacting with the data source. The user can request that the data source push data to the data destination.

After the user request, the source can proceed with a couple of different architectural choices:

  • The data source can contact the data destination and run OAuth in the reverse of the destination-initiated flow. Once data access is authorized, the source might simply use HTTP PUT to send data to the destination (sketched after this list).
  • The data source could initiate the data transfer but then hand control over to the data destination. For example, it could send a capability URL [TODO: ref] or a manifest of access URLs, allowing the destination to fetch those resources using HTTP GET requests.
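The sketch below illustrates the first option under assumed details: the destination URL and token handling are placeholders, and the Python requests library stands in for whatever HTTP client a source actually uses.

```python
import requests  # third-party HTTP client, assumed available

def push_item(destination_url: str, access_token: str, payload: bytes) -> None:
    """Source-initiated push: after the reversed OAuth flow authorizes it,
    the source sends each item to the destination with HTTP PUT."""
    resp = requests.put(
        destination_url,
        data=payload,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/octet-stream",  # placeholder type
        },
        timeout=30,
    )
    resp.raise_for_status()

# The second option reverses the roles: the source sends a manifest of
# access URLs (or a capability URL) and the destination fetches with GET.
```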

Note that in both destination-initiated and source-initiated architectures, we’ve assumed that the OAuth client (initiating the OAuth flow) and the HTTP client (making HTTP GET, PUT, or other requests) are the same party. This is not always the case. It is technically possible for the OAuth client to initiate authorization and then transfer the HTTP client role to the OAuth relying party, but such mixed architectures are rare.

Alternate Patterns – No OAuth

Some data transfer approaches work without OAuth. For example, the source service with the data might allow the user to create “API keys”. Then it’s up to the user to copy the correct API key to the destination service. With a persistent API key, the service requesting data access can now use a Web API to access the user data until the API key expires or is revoked.
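A minimal sketch of this pattern follows, assuming a conventional header name; real APIs vary in how the key is presented (header, query parameter, or basic auth), so X-API-Key here is just one common convention.

```python
import requests  # third-party HTTP client, assumed available

# The user creates this key at the source service and pastes it into the
# destination service's configuration.
API_KEY = "key-the-user-copied-over"  # placeholder

def fetch_user_data(api_url: str) -> dict:
    """Use the persistent API key until it expires or is revoked."""
    resp = requests.get(api_url, headers={"X-API-Key": API_KEY}, timeout=30)
    resp.raise_for_status()
    return resp.json()
```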

Related Variations or Terminology

  • API keys
  • Service accounts
  • Web hooks
  • Installed apps

Application-Specific Protocols

Another common pattern uses application-specific protocols such as IMAP, WebDAV, CalDAV, and ActivityPub for data portability. In the worst cases, users provide their IMAP password from one service to another so the destination service can log in as the user and fetch data.

More secure solutions are possible that avoid password sharing, such as delegated accounts and capability URLs. Work is ongoing to integrate OAuth into IMAP and other application protocols, which can significantly improve user privacy by eliminating password sharing and allowing users to limit the scope of access when granting authorization.
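One concrete example of such integration is the XOAUTH2 SASL mechanism that several mail providers already support. The sketch below shows an IMAP login with an OAuth access token instead of a password, using Python’s standard imaplib; the host, user, and token are placeholders, and provider support varies.

```python
import imaplib

def imap_oauth_login(host: str, user: str, access_token: str) -> imaplib.IMAP4_SSL:
    """Authenticate to an IMAP server with an OAuth access token via the
    XOAUTH2 SASL mechanism, avoiding password sharing entirely."""
    # XOAUTH2 initial client response: "user=...\x01auth=Bearer ...\x01\x01"
    auth_string = f"user={user}\x01auth=Bearer {access_token}\x01\x01"
    imap = imaplib.IMAP4_SSL(host)  # e.g. "imap.mail.example", port 993
    imap.authenticate("XOAUTH2", lambda _challenge: auth_string.encode())
    return imap
```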

App Verification Processes Today

Both parties need a substantial amount of information to securely connect to each other and transfer user data. In addition to the key trust question (“can I trust the destination?”), data services collect a great deal of information in app verification processes, and use that information in allowing or setting up data transfers. Will we be able to automate accessing or exchanging this information?

The information serves many purposes, despite usually being combined into one application so that the applicant doesn’t have to fill out multiple applications or repeat information.

Establishing Trust: Some of the information obtained in app verification allows one service to decide if trust should be extended to the other service. Our Trust Model work investigated which kinds of questions and documentation were most reasonable for that purpose. The Data Trust Registry directly addresses these parts of app verification and aims to avoid the trust work being done over and over by different parties. We also make trust bilateral, so that both parties can trust each other.

Establishing Secure Connections: If the API uses OAuth, the relying party needs to know the OAuth client’s redirect URL. Sometimes the service requesting data also needs to provide URIs for its own API, e.g. an HTTP URL where data can be sent.

Establishing Identity: The client service usually needs to provide or be given an identity token, key, ID or certificate. When OAuth is used, this is the OAuth client_id. This is closely related to establishing secure connections. App verification processes also link services to real-world identity, often asking for an organizational identifier and business address.

Establishing Human Communication Channels: Once the services are interoperating to exchange data, an interpersonal communication channel may also be needed. The platform requiring verification may request email or other contact addresses for notifications of API changes, breaches, or notifications of trust or access expiring or being withdrawn.

Establishing Legal Relationship: Many service providers have Terms and Conditions (T&C) or license agreements associated with their data APIs, and make agreeing to those legal terms part of app verification.

Collecting Information for Users: The service provider may collect information that is intended for display to users, such as the requesting service’s name, logo, Data Protection Officer information, or a link to a privacy policy. For example, when a startup requests access via OAuth to a user’s data, the service holding the data may present an approval UX that includes the destination’s name and logo, so the user can confirm the destination is the one they expect.

Some of these purposes overlap – a corporate address can help establish an organization’s real-world identity as well as provide a contact mechanism for notifications.

Today, the Data Trust Registry can address the first of these purposes and parts of the others. The registry can hold public information such as the following (a hypothetical example record appears after the list):

  • Privacy policy URL
  • Service name and logo
  • API URL to connect to for the service
  • OAuth redirect URLs
  • Address information for a Data Protection Officer for the service
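For illustration only, a registry record along these lines might look like the sketch below; the field names and structure are hypothetical, not the DTR’s actual schema.

```python
# Hypothetical shape of a public Data Trust Registry record. Field names
# are illustrative assumptions, not the registry's published schema.
registry_record = {
    "service_name": "ExamplePlay",
    "logo_url": "https://exampleplay.example/logo.png",
    "privacy_policy_url": "https://exampleplay.example/privacy",
    "api_url": "https://api.exampleplay.example/v1/",
    "oauth_redirect_urls": [
        "https://exampleplay.example/oauth/callback",
    ],
    "dpo_contact": {
        "role": "Data Protection Officer",
        "address": "1 Example Way, Example City",
    },
}
```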

However, there’s also information we cannot include in the Data Trust Registry, either because it’s unique per service-to-service relationship, or because making the information public would have bad privacy or security consequences. For example:

  • API keys issued by the data-holding server to the accessing service are unique per pair, and must also be kept secret to avoid misuse.
  • OAuth client_ids and client_secrets have the same characteristics.
  • Contact information for individual people should not be put in a public API, in part to avoid providing valuable contact information and context to spammers and phishers.
  • A centralized DTR cannot host T&C documents specific to each API, although the DTR T&C a participant signs may make API-specific T&Cs unnecessary in many cases.

These are the gaps blocking dynamic service-to-service connections. Even if we convince platforms to do without API keys or API-specific T&Cs, it would be unwise to share, transmit, or store OAuth client_secret values or other shared secrets. We need to close these gaps to achieve our vision.

Future Technical Solutions

With the Data Trust Registry to establish trust and provide most connection information, we need only figure out how to fill the remaining gaps: providing keys, certificates, and non-phishable communications channels. The community will need to consider which solutions (possibly more than one) to implement and encourage, balancing better automation, lower costs, streamlined use, and continued high security and trust.

Solutions to Service Identification and Authorization

Because OAuth is the dominant interoperable standard for users to authorize one service to another, we can focus on adding service identification to OAuth 2.0. Today, OAuth 2.0 requires a pre-existing relationship between the OAuth client and the OAuth relying party so that both know what client_id and client_secret to use. These are critical components protecting OAuth 2.0 interactions from various attacks.

Since it’s now widely recognized that requiring pre-arranged shared secrets is an unwanted friction in many OAuth applications, there are already a number of standards-track proposals to address this problem. They’re not always described as “service identification” or “service registration” extensions to OAuth because they may also solve other problems. In order of when they were issued as RFCs:

  • Dynamic Client Registration (RFC 7591) seems at first glance to be the most directly applicable, but problems have been identified with it. Instead, the community is now working on a client metadata draft that has the client use a URL as its client_id; at that URL the client can publish public keys instead of using a client_secret.
  • Mutual TLS (mTLS, RFC 8705) offers terrific connection security characteristics, but requires services to take on more operations and migration work to build mTLS in as a lower-layer service in both resource servers and authorization servers.
  • Demonstrating Proof of Possession (DPoP, RFC 9449) solves the client identity problem and also makes token exchange more secure, a feature we definitely want as more OAuth access tokens are created and live in databases across thousands of online services. (A sketch of a DPoP proof follows this list.)
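To give a flavor of the last of these, a DPoP proof is a short-lived JWT bound to a specific HTTP request and to the client’s key pair. The sketch below is a minimal, illustrative construction per RFC 9449, assuming the third-party PyJWT and cryptography libraries are available.

```python
import json
import time
import uuid

import jwt  # PyJWT, third-party, assumed available
from cryptography.hazmat.primitives.asymmetric import ec

def make_dpop_proof(method: str, url: str, private_key) -> str:
    """Build a DPoP proof JWT (RFC 9449): a signed, single-use token
    binding one HTTP request to the client's public/private key pair."""
    # Embed the public key as a JWK in the JWT header.
    public_jwk = json.loads(
        jwt.algorithms.ECAlgorithm.to_jwk(private_key.public_key())
    )
    return jwt.encode(
        {
            "jti": str(uuid.uuid4()),  # unique ID so the proof can't be replayed
            "htm": method,             # HTTP method this proof covers
            "htu": url,                # HTTP URL this proof covers
            "iat": int(time.time()),   # issued-at time
        },
        private_key,
        algorithm="ES256",
        headers={"typ": "dpop+jwt", "jwk": public_jwk},
    )

# Usage sketch: the proof travels in a DPoP header alongside the token, e.g.
# {"Authorization": f"DPoP {access_token}", "DPoP": proof}.
key = ec.generate_private_key(ec.SECP256R1())
proof = make_dpop_proof("GET", "https://source.example/v1/photos", key)
```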

DTI is actively soliciting Data Trust Registry participants to review and discuss these solutions with us.

Solutions to negotiating data transfer protocols

Once authorization is granted, we need to segue to data transfer protocols. There are no widely used universal data access protocols, although historically many attempts have been made. It turns out that many different applications have different protocol needs.

  • Email services need ways to select among millions of emails with a particular focus on time frames (emails received in the last month, for example).
  • Photo albums and video sharing need ways to negotiate supported content types (which services support Apple Live photos, which don’t, as well as video formats), sizes and compression – some data services don’t need the full RAW photo file but can be more efficient with a smaller version.
  • Music playlist sharing doesn’t usually involve large collections and filters, or negotiation over the music file format. Instead, playlist interoperability requires agreeing on how to identify not only songs and performers, but also versions, as well as additional complex metadata. The Grateful Dead performing Dark Star Live at Oakland-Alameda County Coliseum on 1991-10-31 is different from other versions of the same song by the same performer!

Some pre-Web-2.0 applications have standard access protocols. IMAP and CalDAV (mail and calendaring) were designed for the user’s own client access but can also be used for third-party access. Since mail and calendar servers already support these protocols, we only need to agree on ways to authorize access that don’t involve the user sharing their password.

Even when a service supports its own unilateral API for data access, it helps to know what tools exist. The Data Transfer Project provides operators with a shared set of adaptors to help each of them use each other’s APIs, but it’s not obvious to a new startup which platforms participate. While the project does not have code to directly connect to the Data Trust Registry, it can be (and is) used alongside an approach where the Data Trust Registry is used to make trust decisions or even allow dynamic new service connections using mTLS or DPoP.

For any given service offering data, the rest of the ecosystem would like to know what protocols or APIs are available. The Data Trust Registry can hold non-sensitive information about which protocols and formats a service supports, to aid in establishing the appropriate data connection, and can even hold information about deprecated interfaces and planned EOL shutdowns. The Data Transfer Initiative is also promoting interoperable data schemas and protocols for data access in venues like IETF, where these can be turned into standards.

We welcome feedback on which data schemas and transfer protocols should be promoted for interoperability, and how to expose information in the Data Trust Registry.

The Path Forward

Significant friction remains in the widespread implementation of data portability through APIs and direct service-to-service transfer. The Data Trust Registry will help with certain dimensions of this, and we will work in parallel to pursue the service integration and other trust improvements identified in this vision. Broader participation in and diverse contributions to our efforts will both accelerate progress and improve outcomes. But the biggest beneficiaries will be the participants in the portability ecosystem, and their users.

Join us in our efforts to make the ecosystem richer and larger. Help us continue to build critical mass for the trust registry itself, and work alongside us and other registry participants to implement the technical solutions that streamline trustworthy access to data, and efficient, effective data transfer.