DASL — Data-Addressed Structures & Links
What is this?
DASL ("dazzle") is a small set of simple, standard primitives for working with content-addressed, linked data. It builds on content addressing, a proven approach used in Git and IPFS to create reliable content identifiers (known as CIDs) through cryptographic hashing. Content addressing enables robust data integrity checks and efficient networking: systems can verify they received exactly what they asked for and avoid downloading the same content twice. The linked data part lets you link to stuff by its hash. You can build very big graphs with these primitives, such as the graph behind Bluesky.
We call DASL "data-addressed" because it supports a data serialization component that makes content-addressing sweet and easy when working with data. The design is taken from subcomponents of the IPFS universe, but simplified to improve interoperability, decrease costs, and work well with the web. More specifically, our priorities are:
- pave the cowpaths: we focus on supporting what people trying to solve real-world problems actually use. This takes precedence over any consideration of engineering ideals or theoretical purity. We're retconning the spec to what people actually implement, as it should be.
- extensibility vs optionality: extensibility is important for long-lived distributed systems, because the world will happen and you will need to change. But introducing optionality reduces interoperability and increases cost of both implementation and adoption. So rather than require support for many options now, we have extension points now but deliberately don't use their full range.
- don't make me think: you don't want to be thinking about content addressing. You want to grab this off the shelf and have something that works out of the box. Nothing weird, no impedance mismatch with the systems you know and love (or maybe know and hate, but whatever, it just works).
- lightweight loading: some people like JavaScript, others don't. We don't care, we just want things that work. What's certain is that you can't ignore it and be relevant. The ability to ship small code to the browser is critical.
This is intended to work for the community, to grow support for what we need. If you have thoughts, don't be shy and submit an issue! No stupid questions, don't assume everyone else has context that you don't. If this page isn't enough to understand DASL, then we're the ones who screwed up.
Implementations
DASL is a strict subset of IPFS CIDs and IPLD, so existing IPFS and IPLD implementations will just read DASL CIDs and dCBOR42 without so much as a hiccup. Some implementations also specifically target a DASL subset.
Here are some implementations that partially or fully support DASL:
- atcute (JS/TS): a collection of lightweight packages to make working with Bluesky and the ATmosphere easy.
- dag-cbrrr (Python): fast DAG-CBOR implementation.
- python-libipld (Python): a Python wrapper around Rust, focused on the ATmosphere.
- libipld (Rust): fast Rust implementation, extracted from the original ipfs-rust project.
- rust_cid_npm (Rust): Fast and tiny rust library, CLI tool, and npm package to generate CIDs without a full IPFS client.
- Kubo/Boxo (Go): the Swiss-Army chainsaw of all things IPFS.
- Helia (JS/TS): a browser- and CDN-friendly, modular, "import only what you need" JS implementation of IPFS.
Specification
There are two specifications in DASL: CIDs and dCBOR42. CIDs (Content IDs) are identifiers used for addressing resources by their contents, as in IPFS; dCBOR42 (deterministically-serialized CBOR with optional CBOR tag 42 supported) is a serialization format that is deterministic and aware of CID-linked graphs, i.e. "IPLD".
Content IDs (CIDs)
DASL CIDs are a strict subset of IPFS CIDs (but you don't need to understand anything about IPFS to either use or implement them) with the following properties:
- Only modern CIDv1 CIDs are used, not legacy CIDv0.
- Only the lowercase base32 multibase encoding (the b prefix) is used for human-readable (and subdomain-usable) string encoding.
- Only the raw binary multicodec (0x55) and dag-cbor multicodec (0x71), with the latter used only for dCBOR42-conformant DAGs.
- Only SHA-256 (0x12) and BLAKE3 hash functions (0x1e), and the latter only in certain circumstances.
- Regardless of size, resources should not be "chunked" into a DAG or Merkle tree (as historically done with UnixFS canonicalization in IPFS systems) but rather hashed in their entirety and content-addressed directly.
- This set of options has the added advantage that all the aforementioned single-byte prefixes require no additional varint processing or byte-fiddling.
Supporting two hashes isn't ideal, but having one hash type that can stream large resources (and do incremental verification mid-stream) is a plus. Because BLAKE3 is still far from being supported by web browsers, it is strongly recommended that CID producers limit themselves to SHA-256 if possible. Implementations intending to run in web contexts are likely to forgo BLAKE3 verification in-browser, outsource verification to a trusted component, or dynamically load a BLAKE3 library in the browser, which may cause latency.
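To make the properties above concrete, here is a minimal sketch of producing a CID string for a raw resource with SHA-256, using only the Python standard library. The function name make_raw_cid and the constant names are ours, not part of DASL, and the sketch assumes the unpadded form of lowercase base32 that the multibase b prefix denotes.

```python
import base64
import hashlib

CID_VERSION = 0x01   # CIDv1
CODEC_RAW = 0x55     # raw binary multicodec
HASH_SHA256 = 0x12   # SHA-256 hash function code


def make_raw_cid(data: bytes) -> str:
    """Return a CID string for `data` (hypothetical helper, SHA-256 only)."""
    # Hash the resource in its entirety; no chunking into a DAG.
    digest = hashlib.sha256(data).digest()
    # Four single-byte fields, then the digest itself.
    cid_bytes = bytes([CID_VERSION, CODEC_RAW, HASH_SHA256, len(digest)]) + digest
    # Lowercase base32 without padding, behind the "b" multibase prefix.
    body = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + body


if __name__ == "__main__":
    print(make_raw_cid(b"hello world"))
```

Because the first four bytes are fixed, every CID built this way starts with bafkrei and is 59 characters long.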
Use the following steps to parse a CID string:
- Accept a string CID.
- Remove the first character from CID and store it in prefix.
- If prefix is not equal to b, throw an error.
- Decode the rest of CID using the base32 algorithm from RFC4648 with a lowercase alphabet and store the result in CID bytes.
- Return the result of applying the steps to decode a CID to CID bytes.
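The string-parsing steps above can be sketched as follows. This covers only the prefix check and base32 decoding, and returns the CID bytes that the decode steps then consume. The name cid_string_to_bytes is hypothetical, and the sketch assumes unpadded input, so it re-adds padding because Python's base64.b32decode expects uppercase, padded text.

```python
import base64
import binascii

# Lowercase RFC 4648 base32 alphabet.
BASE32_LOWER = set("abcdefghijklmnopqrstuvwxyz234567")


def cid_string_to_bytes(cid: str) -> bytes:
    """Strip the multibase prefix and decode the rest (hypothetical helper)."""
    if not cid:
        raise ValueError("empty CID string")
    prefix, body = cid[0], cid[1:]
    if prefix != "b":
        raise ValueError(f"unsupported multibase prefix: {prefix!r}")
    # Reject uppercase, padding, and anything else outside the alphabet.
    if not set(body) <= BASE32_LOWER:
        raise ValueError("CID contains characters outside lowercase base32")
    # b32decode wants uppercase input padded to a multiple of 8 characters.
    padded = body.upper() + "=" * (-len(body) % 8)
    try:
        return base64.b32decode(padded)
    except binascii.Error as exc:
        raise ValueError("invalid base32 data in CID") from exc
```

Raising ValueError for every failure is a choice made for this sketch; the steps only say to throw an error.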
Use the following steps to parse a binary CID:
- Accept an array of bytes binary CID.
- Remove the first byte in binary CID and store it in prefix.
- If prefix is not equal to 0 (a null byte, the binary base256 prefix), throw an error.
- Store the rest of binary CID in CID bytes.
- Return the result of applying the steps to decode a CID to CID bytes.
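The binary case is simpler. Here is a sketch of the prefix handling only, with the hypothetical name binary_cid_to_bytes; the bytes it returns are what the decode steps take as input.

```python
def binary_cid_to_bytes(binary_cid: bytes) -> bytes:
    """Strip the leading null byte from a binary CID (hypothetical helper)."""
    if len(binary_cid) == 0:
        raise ValueError("empty binary CID")
    prefix, cid_bytes = binary_cid[0], binary_cid[1:]
    # 0x00 is the binary (base256) multibase prefix.
    if prefix != 0x00:
        raise ValueError(f"unexpected binary CID prefix: {prefix:#04x}")
    return bytes(cid_bytes)
```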
Use the following steps to decode a CID:
- Accept an array of bytes CID bytes.
- Remove the first byte in CID bytes and store it in version.
- If version is not equal to 1, throw an error.
- Remove the next byte in CID bytes and store it in codec.
- If codec is not equal to 0x55 (raw) or 0x71 (dCBOR42), throw an error.
- Remove the next two bytes in CID bytes and store them in hash type and hash size, respectively.
- If hash type is not equal to 0x12 (SHA-256) or 0x1e (BLAKE3), throw an error.
- If there are fewer than hash size bytes left in CID bytes, throw an error.
- Remove the first hash size bytes from CID bytes and store them in digest. Store the rest in remaining bytes.
- Return version, codec, hash type, hash size, digest, and remaining bytes.
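The decode steps translate almost line for line into code. The sketch below is one possible reading, not a reference implementation: DecodedCid and decode_cid are names we made up, and the up-front length check is a simplification that rejects a truncated header before looking at any field.

```python
from typing import NamedTuple

ALLOWED_CODECS = {0x55, 0x71}   # raw, dCBOR42
ALLOWED_HASHES = {0x12, 0x1E}   # SHA-256, BLAKE3


class DecodedCid(NamedTuple):
    version: int
    codec: int
    hash_type: int
    hash_size: int
    digest: bytes
    remaining: bytes


def decode_cid(cid_bytes: bytes) -> DecodedCid:
    """Decode CID bytes following the steps above (hypothetical helper)."""
    # Simplification: require all four single-byte header fields up front.
    if len(cid_bytes) < 4:
        raise ValueError("CID bytes too short for version, codec, and hash header")
    version, codec, hash_type, hash_size = cid_bytes[:4]
    if version != 1:
        raise ValueError(f"unsupported CID version: {version}")
    if codec not in ALLOWED_CODECS:
        raise ValueError(f"unsupported codec: {codec:#04x}")
    if hash_type not in ALLOWED_HASHES:
        raise ValueError(f"unsupported hash type: {hash_type:#04x}")
    rest = bytes(cid_bytes[4:])
    if len(rest) < hash_size:
        raise ValueError("fewer bytes left than the declared hash size")
    # Split the digest from whatever follows it.
    digest, remaining = rest[:hash_size], rest[hash_size:]
    return DecodedCid(version, codec, hash_type, hash_size, digest, remaining)
```

As in the steps, any trailing bytes are returned to the caller rather than rejected, and the hash size is not compared against the expected digest length for the hash type; a stricter caller may want to check both.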
Deterministic CBOR with tag 42 (dCBOR42)
dCBOR42 is a form of IPLD that serializes only to deterministic CBOR, by normalizing and reducing some type flexibility. Notably, we support no ADLs. (See the current draft specification for dCBOR, and Carsten Bormann's BCP document on the underspecified determinism of Section 4.2 of the CBOR specification). For debugging purposes, either one-way conversion to DAG-JSON or CBOR Extended Diagnostic Notation can be used, but either way, note that the CIDs in such debugging outputs should be the CIDs of the dCBOR42 content, not of other debugging resources.
Further details forthcoming.