Perceptual Fingerprint (PFP)

date: 2026-02-19
editors: Cole Anthony Capilongo <cole@hypha.coop>
Abstract

DASL PFPs are a simple structured identifier for referring to similar media using perceptual hashing. They are extensible and algorithm-agnostic, supporting image, video, or other kinds of content fingerprinting.

Introduction

Unlike a [cid], a PFP does not refer to any single file or sequence of bits. It refers to the set of content files whose PFPs are the same or similar. To use a PFP, you must be able to parse it and understand the hash algorithm it names; simply comparing strings is not enough.

A DASL PFP can be represented as a string or as an array of bytes. It has the following structure:

  1. A p prefix (only in string form) to differentiate it from CIDs.
  2. An algorithm type, indicating which perceptual hashing algorithm was used.
  3. A length for the rest of the identifier.
  4. The data, which is either the algorithm output hash or a CID of that hash, depending on the algorithm type.

The data can be a CID for cases where the algorithm's output is too long to include directly. For example, a video perceptual hash may be 256 KiB in size. The CID allows the data to be retrieved from content-addressed storage, wherever your application stores its data. You could use [rasl] to retrieve it, for example.
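The structure above can be sketched as an encoder. This is a minimal illustration rather than a reference implementation: the function names encode_varint and encode_pfp are hypothetical, the algorithm type and length are written as unsigned varints (as the decoding steps in this document read them), and the sketch assumes the base32 body uses the lowercase alphabet with "=" padding stripped, which this document does not state explicitly.

```python
import base64


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint (7 bits per byte,
    least significant group first, high bit set on every byte but the last)."""
    if n < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_pfp(algorithm_type: int, data: bytes) -> str:
    """Build the string form: 'p' + base32(type varint + length varint + data)."""
    payload = encode_varint(algorithm_type) + encode_varint(len(data)) + data
    # Assumption: lowercase alphabet, padding removed.
    body = base64.b32encode(payload).decode("ascii").lower().rstrip("=")
    return "p" + body


# Placeholder 32-byte hash (all zeros), standing in for real PDQ output.
example = encode_pfp(0x01, bytes(32))
```

Under these assumptions, any PFP with algorithm type 0x01 and a 32-byte hash is 56 characters long and begins with "paeq", because the first two payload bytes are always 0x01 and 0x20.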

Parsing PFPs

Use the following steps to parse a string-encoded PFP, i.e. translate it to a bytestring:

  1. Accept a string PFP.
  2. Remove the first character from PFP and store it in prefix.
  3. If prefix is not equal to p, throw an error.
  4. Decode the rest of PFP using the base32 algorithm from RFC 4648 ([rfc4648]) with a lowercase alphabet and store the result in PFP bytes.
  5. Return PFP bytes, which can then be decoded as a PFP.
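The string-parsing steps can be sketched in Python as follows. The function name parse_pfp_string is hypothetical. The sketch assumes the string carries no "=" padding (the example PFP in the AT Protocol section has 55 base32 characters and no padding, which is consistent with this), so it restores padding before handing the text to the standard library decoder, which expects the uppercase alphabet.

```python
import base64
import binascii


def parse_pfp_string(pfp: str) -> bytes:
    """Translate a string-encoded PFP into its bytes."""
    if not pfp:
        raise ValueError("empty PFP string")
    prefix, body = pfp[0], pfp[1:]
    if prefix != "p":
        raise ValueError(f"expected prefix 'p', got {prefix!r}")
    # The lowercase alphabet is required, so reject any uppercase input.
    if body != body.lower():
        raise ValueError("PFP must use the lowercase base32 alphabet")
    # Assumption: padding was stripped; restore it to a multiple of 8 chars.
    padded = body.upper() + "=" * (-len(body) % 8)
    try:
        return base64.b32decode(padded)
    except binascii.Error as exc:
        raise ValueError(f"invalid base32 in PFP: {exc}") from exc


pfp_bytes = parse_pfp_string(
    "paeqo5rgntyjx44a5dse6zygcmprz2ym7rxrbym6ogpzt44mgetbam3a"
)
```

For that example string, pfp_bytes is 34 bytes long and starts with 0x01 0x20: algorithm type 0x01 followed by a length of 32.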

Use the following steps to decode a PFP:

  1. Accept an array of bytes PFP bytes.
  2. Read a [varint] from PFP bytes and store it in algorithm type. For most use cases, you can assume it's a single byte and reject unknown integers (unknown algorithm types).
  3. Check algorithm type against the algorithm registry. If it is not a supported algorithm, throw an error.
  4. Read a varint from PFP bytes and store it in length.
  5. Verify that length matches the expected length for the algorithm type as specified in the registry. If it does not match, throw an error.
  6. Read length bytes from PFP bytes and store them in data. If there are fewer than length bytes remaining, throw an error.
  7. If the algorithm type specifies that data contains a [cid], parse it according to the CID specification. Otherwise, data contains the perceptual hash directly.
  8. Return algorithm type, length, and data.
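The decoding steps can be sketched like this. The names REGISTRY, DecodedPFP, read_varint, and decode_pfp are hypothetical, and the registry entries are copied from the Registry section of this document. The sketch stops at reporting whether data holds a CID; parsing that CID is left to a CID implementation. It also does not reject trailing bytes after data or non-minimal varints, since the steps above do not say how to treat them.

```python
from typing import NamedTuple, Tuple

# algorithm number -> (name, expected length in bytes, data is a CID)
REGISTRY = {
    0x01: ("PDQ", 32, False),
    0x02: ("TMK+PDQF", 36, True),
}


class DecodedPFP(NamedTuple):
    algorithm_type: int
    length: int
    data: bytes
    data_is_cid: bool


def read_varint(buf: bytes, offset: int) -> Tuple[int, int]:
    """Read an unsigned varint at offset; return (value, next offset)."""
    value = 0
    shift = 0
    # The unsigned-varint format caps encodings at 9 bytes.
    for i in range(offset, min(offset + 9, len(buf))):
        byte = buf[i]
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return value, i + 1
        shift += 7
    raise ValueError("truncated or overlong varint")


def decode_pfp(pfp_bytes: bytes) -> DecodedPFP:
    algorithm_type, offset = read_varint(pfp_bytes, 0)
    # 0x00 is reserved and absent from REGISTRY, so it is rejected here too.
    if algorithm_type not in REGISTRY:
        raise ValueError(f"unsupported algorithm type: {algorithm_type:#04x}")
    _name, expected_length, data_is_cid = REGISTRY[algorithm_type]
    length, offset = read_varint(pfp_bytes, offset)
    if length != expected_length:
        raise ValueError(f"length {length} does not match {expected_length}")
    data = pfp_bytes[offset:offset + length]
    if len(data) < length:
        raise ValueError("fewer than length bytes remaining")
    return DecodedPFP(algorithm_type, length, data, data_is_cid)
```

Passing bytes([0x01, 0x20]) followed by a 32-byte hash returns a DecodedPFP with algorithm_type 1, length 32, and data_is_cid set to False.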

AT Protocol

When storing PFPs in an AT Protocol record, wrapping them in a pseudo-type is recommended so that consumers who are not familiar with your lexicon schema can identify them. The type for PFPs is {"__pfp": "p..."}. This disambiguates PFPs from any other string in your record that starts with p.

An example:

{
  "$type": "baz.bar.myrecord",
  "foo": "bar",
  "my cid": {"$link": "bafkreiapgas3dluwwzthuv2fnc475ytvve3xd5acoproje3lr2446yno3q"},
  "my pfp": {"__pfp": "paeqo5rgntyjx44a5dse6zygcmprz2ym7rxrbym6ogpzt44mgetbam3a"}
}
      

Here is the lexicon for using that custom type:

{
  "lexicon": 1,
  "id": "ing.dasl.pfpRef",
  "defs": {
    "main": {
      "type": "object",
      "description": "Reference to a perceptual fingerprint (PFP).",
      "required": ["__pfp"],
      "properties": {
        "__pfp": {
          "type": "string",
          "description": "Perceptual fingerprint per DASL spec. Format: p<base32-payload>."
        }
      }
    }
  }
}
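To illustrate why the pseudo-type helps, here is a sketch of how a consumer with no knowledge of the record's schema might collect every PFP in a record. The function name find_pfps is hypothetical, and the sketch assumes the wrapper object contains exactly one key, "__pfp"; the lexicon above only requires that the key be present.

```python
import json
from typing import Any, Iterator


def find_pfps(value: Any) -> Iterator[str]:
    """Yield every string wrapped as {"__pfp": "p..."} anywhere in a record."""
    if isinstance(value, dict):
        pfp = value.get("__pfp")
        # Assumption: a PFP reference is an object whose only key is "__pfp".
        if len(value) == 1 and isinstance(pfp, str) and pfp.startswith("p"):
            yield pfp
            return
        for child in value.values():
            yield from find_pfps(child)
    elif isinstance(value, list):
        for child in value:
            yield from find_pfps(child)


record = json.loads("""
{
  "$type": "baz.bar.myrecord",
  "foo": "bar",
  "my cid": {"$link": "bafkreiapgas3dluwwzthuv2fnc475ytvve3xd5acoproje3lr2446yno3q"},
  "my pfp": {"__pfp": "paeqo5rgntyjx44a5dse6zygcmprz2ym7rxrbym6ogpzt44mgetbam3a"}
}
""")
pfps = list(find_pfps(record))
```

Here pfps contains only the value under "my pfp"; the "$link" object and the plain string "bar" are skipped.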
      

Registry

The following table lists the officially registered perceptual hashing algorithms. Note that algorithm number 0x00 is reserved and must be treated as invalid if it is encountered while parsing.

The length column refers to the length of the raw hash in the PFP, or the length of the CID (fixed at 36 bytes). If a CID is used, the length of the data the CID points to is not defined.

Algorithm   Number   Content type   Hash or CID   Length
PDQ         0x01     Image          Hash          32 bytes
TMK+PDQF    0x02     Video          CID           36 bytes

References

[cid]
Robin Berjon & Juan Caballero. Content IDs (CIDs). 2026-02-19. URL: https://dasl.ing/cid.html
[rasl]
Robin Berjon & Juan Caballero. RASL — Retrieval of Arbitrary Structures & Links. 2026-02-19. URL: https://dasl.ing/rasl.html
[rfc4648]
S. Josefsson. The Base16, Base32, and Base64 Data Encodings. October 2006. URL: https://www.rfc-editor.org/rfc/rfc4648
[varint]
unsigned varint. URL: https://github.com/multiformats/unsigned-varint