DASL — Data-Addressed Structures & Links

What is this?

DASL ("dazzle") is a small set of simple, standard primitives for working with content-addressed, linked data. It builds on content addressing, a proven approach used in Git and IPFS to create reliable content identifiers (known as CIDs) through cryptographic hashing. Content addressing enables robust data integrity checks and efficient networking: systems can verify they received exactly what they asked for and avoid downloading the same content twice. The linked data part lets you link to stuff by its hash. You can build very big graphs with these primitives, such as the graph behind Bluesky.

We call DASL "data-addressed" because it supports a data serialization component that makes content-addressing sweet and easy when working with data. The design is taken from subcomponents of the IPFS universe, but simplified to improve interoperability, decrease costs, and work well with the web. More specifically, our priorities are:

This is intended to work for the community, to grow support for what we need. If you have thoughts, don't be shy and submit an issue! There are no stupid questions, and you shouldn't assume everyone else has context that you don't. If this page isn't enough to understand DASL, then we're the ones who screwed up.

Implementations

DASL is a strict subset of IPFS CIDs and IPLD, so existing IPFS and IPLD implementations will just read DASL CIDs and dCBOR42 without so much as a hiccup. Some implementations also specifically target a DASL subset.

Here are some implementations that partially or fully support DASL:

Specification

There are two specifications in DASL: CIDs and dCBOR42. CIDs (Content IDs) are identifiers used for addressing resources by their contents, as in IPFS; dCBOR42 (deterministically-serialized CBOR with optional CBOR tag 42 supported) is a serialization format that is deterministic and aware of CID-linked graphs, i.e. "IPLD".

Content IDs (CIDs)

DASL CIDs are a strict subset of IPFS CIDs (but you don't need to understand anything about IPFS to either use or implement them) with the following properties:

  • Only modern CIDv1 CIDs are used, not legacy CIDv0.
  • Only the lowercase base32 multibase encoding (the b prefix) is used for human-readable (and subdomain-usable) string encoding.
  • Only the raw binary multicodec (0x55) and the dag-cbor multicodec (0x71) are used, with the latter used only for dCBOR42-conformant DAGs.
  • Only the SHA-256 (0x12) and BLAKE3 (0x1e) hash functions are used, and the latter only in certain circumstances.
  • Regardless of size, resources should not be "chunked" into a DAG or Merkle tree (as historically done with UnixFS canonicalization in IPFS systems) but rather hashed in their entirety and content-addressed directly.
  • This set of options has the added advantage that all the aforementioned single-byte prefixes require no additional varint processing or byte-fiddling.
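
To illustrate how these properties combine, here is a minimal Python sketch that builds a DASL-style CID string for a raw resource hashed with SHA-256. The function name make_raw_sha256_cid is hypothetical, and the sketch assumes the b multibase prefix implies unpadded base32; treat it as an illustration rather than a reference implementation.

```python
import base64
import hashlib


def make_raw_sha256_cid(data: bytes) -> str:
    """Build a CIDv1 string (raw codec, SHA-256) for the whole resource.

    Hypothetical helper for illustration; not part of any DASL library.
    """
    # Hash the resource in its entirety (no chunking into a DAG).
    digest = hashlib.sha256(data).digest()

    # Four single-byte fields: version 1, raw codec, SHA-256, digest length.
    header = bytes([0x01, 0x55, 0x12, len(digest)])
    cid_bytes = header + digest

    # RFC 4648 base32, lowercased; padding is stripped on the assumption
    # that the "b" multibase prefix denotes unpadded base32.
    body = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + body


if __name__ == "__main__":
    print(make_raw_sha256_cid(b"hello world"))
```

Under these assumptions every such CID is 59 characters long and begins with bafkrei, because the four header bytes are always the same.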

Supporting two hashes isn't ideal, but having one hash type (BLAKE3) that can stream large resources (and do incremental verification mid-stream) is a plus. Because BLAKE3 is still far from being supported by web browsers, it is strongly recommended that CID producers limit themselves to SHA-256 if possible. Implementations intending to run in web contexts are likely to forgo BLAKE3 verification in-browser, outsource verification to a trusted component, or dynamically load a BLAKE3 library in the browser, which may add latency.

Use the following steps to parse a CID string:

  1. Accept a string CID.
  2. Remove the first character from CID and store it in prefix.
  3. If prefix is not equal to b, throw an error.
  4. Decode the rest of CID using the base32 algorithm from RFC 4648 with a lowercase alphabet and store the result in CID bytes.
  5. Return the result of applying the steps to decode a CID to CID bytes.
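
The string-parsing steps above could be sketched in Python roughly as follows. The function name cid_string_to_bytes is hypothetical. The sketch covers steps 1 to 4 and returns CID bytes, leaving step 5 (decoding) to a separate routine. It assumes the encoded text carries no padding, so padding is restored before calling the standard-library decoder.

```python
import base64


def cid_string_to_bytes(cid: str) -> bytes:
    """Strip the multibase prefix and base32-decode the rest.

    Hypothetical helper for illustration; raises ValueError on bad input.
    """
    if not cid:
        raise ValueError("empty CID string")

    # Steps 2-3: the first character must be the lowercase base32 prefix.
    prefix, rest = cid[0], cid[1:]
    if prefix != "b":
        raise ValueError(f"unsupported multibase prefix: {prefix!r}")

    # Only the lowercase alphabet is allowed, so reject any uppercase input.
    if any(ch.isupper() for ch in rest):
        raise ValueError("CID must use the lowercase base32 alphabet")

    # Step 4: base64.b32decode expects uppercase, padded input, so
    # (assuming the CID text itself is unpadded) convert before decoding.
    padded = rest.upper() + "=" * (-len(rest) % 8)
    return base64.b32decode(padded)  # binascii.Error is a ValueError subclass
```

The returned bytes are what the decoding steps take as their input.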

Use the following steps to parse a binary CID:

  1. Accept an array of bytes binary CID.
  2. Remove the first byte in binary CID and store it in prefix.
  3. If prefix is not equal to 0 (a null byte, the binary base256 prefix), throw an error.
  4. Store the rest of binary CID in CID bytes.
  5. Return the result of applying the steps to decode a CID to CID bytes.
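
For the binary form, a minimal Python sketch of steps 1 to 4 might look like this. The name binary_cid_to_bytes is hypothetical, and the empty-input check is an addition not spelled out in the steps. As with the string form, the result would then be handed to the decoding steps.

```python
def binary_cid_to_bytes(binary_cid: bytes) -> bytes:
    """Strip the leading null byte from a binary CID.

    Hypothetical helper for illustration; raises ValueError on bad input.
    """
    # Added guard (not in the steps): there must be a prefix byte to remove.
    if len(binary_cid) == 0:
        raise ValueError("empty binary CID")

    # Steps 2-3: the first byte must be 0x00, the binary base256 prefix.
    prefix = binary_cid[0]
    if prefix != 0x00:
        raise ValueError(f"unexpected binary CID prefix: {prefix:#04x}")

    # Step 4: everything after the prefix is the CID bytes.
    return binary_cid[1:]
```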

Use the following steps to decode a CID:

  1. Accept an array of bytes CID bytes.
  2. Remove the first byte in CID bytes and store it in version.
  3. If version is not equal to 1, throw an error.
  4. Remove the next byte in CID bytes and store it in codec.
  5. If codec is not equal to 0x55 (raw) or 0x71 (dCBOR42), throw an error.
  6. Remove the next two bytes in CID bytes and store them in hash type and hash size, respectively.
  7. If hash type is not equal to 0x12 (SHA-256) or 0x1e (BLAKE3), throw an error.
  8. If there are fewer than hash size bytes left in CID bytes, throw an error.
  9. Remove the first hash size bytes from CID bytes and store them in digest. Store the rest in remaining bytes.
  10. Return version, codec, hash type, hash size, digest, and remaining bytes.
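
The decoding steps translate fairly directly into code. Below is a minimal Python sketch; the names DecodedCid and decode_cid are hypothetical, and the check that at least four header bytes are present is an added safeguard that the steps leave implicit.

```python
from typing import NamedTuple

ALLOWED_CODECS = {0x55, 0x71}   # raw, dCBOR42 (dag-cbor)
ALLOWED_HASHES = {0x12, 0x1E}   # SHA-256, BLAKE3


class DecodedCid(NamedTuple):
    version: int
    codec: int
    hash_type: int
    hash_size: int
    digest: bytes
    remaining: bytes


def decode_cid(cid_bytes: bytes) -> DecodedCid:
    """Decode CID bytes following the steps above (illustrative sketch)."""
    # Added guard: version, codec, hash type, and hash size are one byte each.
    if len(cid_bytes) < 4:
        raise ValueError("CID is too short to contain a header")

    version, codec, hash_type, hash_size = cid_bytes[:4]
    if version != 1:
        raise ValueError(f"unsupported CID version: {version}")
    if codec not in ALLOWED_CODECS:
        raise ValueError(f"unsupported codec: {codec:#04x}")
    if hash_type not in ALLOWED_HASHES:
        raise ValueError(f"unsupported hash type: {hash_type:#04x}")

    body = cid_bytes[4:]
    if len(body) < hash_size:
        raise ValueError("digest is shorter than the declared hash size")

    # Split the body into the digest and whatever follows it.
    digest, remaining = body[:hash_size], body[hash_size:]
    return DecodedCid(version, codec, hash_type, hash_size, digest, remaining)
```

A caller that expects exactly one CID would typically also check that remaining is empty, though the steps leave that decision to the caller.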

Deterministic CBOR with tag 42 (dCBOR42)

dCBOR42 is a form of IPLD that serializes only to deterministic CBOR, by normalizing and reducing some type flexibility. Notably, we support no ADLs (Advanced Data Layouts). (See the current draft specification for dCBOR, and Carsten Bormann's BCP document on the underspecified determinism of Section 4.2 of the CBOR specification.) For debugging purposes, one-way conversion to either DAG-JSON or CBOR Extended Diagnostic Notation can be used; in both cases, note that the CIDs in such debugging output should be the CIDs of the dCBOR42 content, not of other debugging resources.

Further details forthcoming.