Introduction
When data scientists discuss normalization, they do so in the context of
achieving a particular set of goals. Since the same information may
sometimes be expressed in a variety of different ways, it often becomes
necessary to be able to transform each of these different ways into a
single, standard format. With a standard format, the differences between
two different sets of data can be easily determined, a
cryptographically-strong hash identifier can be generated for a particular
set of data, and a particular set of data may be digitally-signed for later
verification.
In particular, this specification is about normalizing
RDF datasets, which are collections of graphs. Since
a directed graph can express the same information in more than one
way, it requires normalization to achieve the aforementioned goals
and any others that may arise via surrendipity.
Most RDF dataset can be normalized fairly quickly, in terms
of algorithmic time complexity. However, those that contain nodes that do
not have globally unique identifiers pose a greater challenge. Normalizing
these datasets presents the graph isomorphism problem, a
problem that is believed to be difficult to solve quickly. More formally,
it is believed to be an NP-Intermediate problem, that is, neither known to
be solvable in polynomial time nor NP-complete. Fortunately, existing real
world data is rarely modeled in a way that manifests this problem and new
data can be modeled to avoid it. In fact, software systems can detect a
problematic dataset and may choose to assume it's an attempted denial of
service attack, rather than a real input, and abort.
This document outlines an algorithm for generating a normalized
RDF dataset given an RDF dataset as input. The
algorithm is called the
Universal RDF Dataset Normalization Algorithm 2015 or
URDNA2015.
How to Read this Document
This document is a detailed specification for an RDF dataset
normalization algorithm. The document is primarily intended for the
following audiences:
- Software developers that want to implement an RDF dataset
normalization algorithm.
- Masochists.
To understand the basics in this specification you must be familiar with
basic RDF concepts [[!RDF-CONCEPTS]]. A working knowledge of
graph theory and
graph isomorphism
is also recommended.
Contributing
There are a number of ways that one may participate in the development
of this specification:
Normalization
Normalization is the process of transforming an
input RDF dataset to a normalized RDF dataset. That
is, any two input RDF datasets that contain the same
information, regardless of their arrangement, will be transformed into
identical normalized RDF datasets. The problem requires directed
graphs to be deterministically ordered into sets of nodes and edges. This
is easy to do when all of the nodes have globally-unique identifiers, but
can be difficult to do when some of the nodes do not. Any nodes without
globally-unique identifiers must be issued deteriministic identifiers.
In time, there may be more than one normalization algorithm and,
therefore, for identification purposes, this algorithm is named the
"Universal RDF Dataset Normalization Algorithm 2015"
(URDNA2015).
Normalization Algorithm Terms
- input RDF dataset
- The abstract RDF dataset that is provided as input to
the algorithm.
- normalized RDF dataset
- The abstract RDF dataset and normalized
RDF blank node identifiers that are produced as output by
the algorithm.
- identifier issuer
- An identifier issuer is used to issue new RDF blank node identifiers. It
maintains a
blank node identifier issuer state.
- hash
- The lowercase, hexadecimal representation of a message digest.
- hash algorithm
- The hash algorithm used by URDNA2015, namely, SHA-256.
Normalization State
When performing the steps required by the normalization algorithm,
it is helpful to track state in a data structure called the
normalization state. The information contained in the
normalization state is described below.
- blank node to quads map
- A data structure that maps RDF blank node identifiers to
the RDF quads in which they appear in the
input RDF dataset.
- hash to blank nodes map
- A data structure that maps a hash to a list of
RDF blank node identifiers.
- canonical issuer
- An identifier issuer, initialized with the
prefix
_:c14n
, for issuing canonical
RDF blank node identifiers.
Blank Node Identifier Issuer State
During the normalization algorithm, it is sometimes necessary to
issue new identifiers to RDF blank nodes. The
Issue Identifier algorithm uses an
identifier issuer to accomplish this task. The information
an identifier issuer needs to keep track of is described
below.
- identifier prefix
- The identifier prefix is a string that is used at the beginning of an
RDF blank node identifier. It should be initialized to a
string that is specified by the normalization algorithm. When
generating a new RDF blank node identifier, the prefix
is concatenated with a identifier counter. For example,
_:c14n
is a proper initial value for the
identifier prefix that would produce
RDF blank node identifiers like _:c14n1
.
- identifier counter
- A counter that is appended to the identifier prefix to
create an RDF blank node identifier. It is initialized to
0
.
- issued identifiers list
- A list that tracks previously issued identifiers in the order in
which they were issued. It also tracks the existing identifier that
triggered the issuance, to prevent issuing more than one new identifier
per existing identifier, and to allow RDF blank nodes to
be reassigned identifiers some time after issuance.
Normalization Algorithm
The normalization algorithm converts an input RDF dataset
into a normalized RDF dataset. This algorithm will assign
deterministic identifiers to any RDF blank nodes in the
input RDF dataset.
Documenting the algorithm is a WIP, various steps will
become more detailed in time.
Algorithm
- Create the normalization state.
- For every RDF quad in input RDF dataset:
- For each RDF blank node that occurs in the
RDF quad, add a reference to the
RDF quad using the
RDF blank node identifier in the
blank node to quads map, creating a new entry if
necessary.
- Create a list of non-normalized blank node identifiers
non-normalized identifiers and populate it using the keys
from the blank node to quads map.
- Initialize simple, a boolean flag, to
true
.
- While simple is
true
,
issue canonical identifiers for RDF blank nodes:
- Set simple to
false
.
- Clear hash to blank nodes map.
- For each RDF blank node identifier
identifier in non-normalized identifiers:
- Create a hash, hash, according to the
Hash First Degree Quads algorithm.
- Add hash and identifier to
hash to blank nodes map, creating a new entry if
necessary.
- For each hash to identifier list mapping in
hash to blank nodes map, lexicographically-sorted by
hash:
- If the length of identifier list is greater than
1
, continue to the next mapping.
- Use the
Issue Identifier algorithm,
passing canonical issuer and the
single RDF blank node identifier in
identifier list, identifier, to issue a
canonical replacment identifier for identifier.
- Remove identifier from
non-normalized identifiers.
- Remove hash from the
hash to blank nodes map.
- Set simple to
true
.
- For each hash to identifier list mapping in
hash to blank nodes map, lexicographically-sorted by
hash:
- Create hash path list where each item will be a result
of running the
Hash N-Degree Quads algorithm.
- For each RDF blank node identifier
identifier in identifier list:
- If a canonical identifier has already been issued for
identifier, continue to the next
identifier.
- Create temporary issuer, an
identifier issuer initialized with the prefix
_:b
.
- Use the
Issue Identifier algorithm,
passing temporary issuer and identifier, to
issue a new temporary RDF blank node identifier
for identifier.
- Run the
Hash N-Degree Quads algorithm,
passing temporary issuer, and append the
result to the hash path list.
- For each result in the hash path list,
lexicographically-sorted by the hash in result:
- For each RDF blank node identifier,
existing identifier, that was issued a temporary
identifier by identifier issuer in result, issue
a canonical identifier, in the same order, using the
Issue Identifier algorithm,
passing canonical issuer and
existing identifier.
For each RDF quad, quad, in
input RDF dataset:
- Create a copy, quad copy, of quad and replace any
existing RDF blank node identifiers using the
canonical identifiers previously issued
by canonical issuer.
- Add quad copy to the
normalized RDF dataset.
Return the normalized RDF dataset.
Issue Identifier Algorithm
This algorithm issues a new RDF blank node identifier for
a given existing RDF blank node identifier. It also updates
state information that tracks the order in which new
RDF blank node identifiers were issued.
This algorithm takes an identifier issuer and an
existing identifier as inputs. The output is a new
issued identifier. The steps of the algorithm are:
- If there is already an issued identifier for
existing identifier in issued identifiers list,
return it.
- Generate issued identifier by concatenating
identifier prefix with the string value of
identifier counter.
- Append an item to issued identifiers list that maps
existing identifier to issued identifier.
- Increment identifier counter.
- Return issued identifier.
Hash First Degree Quads
Add note that it is safe to cache the result of
hashing the quads for later reuse (to reduce unnecessary computation).
Algorithm
This algorithm takes the normalization state and a
reference blank node identifier as inputs.
- Initialize nquads to an empty list. It will be
used to store quads in N-Quads format.
- Get the list of quads quads associated with the
reference blank node identifier in the
blank node to quads map.
- For each RDF quad quad in quads:
- Serialize the RDF quad in N-Quads format with the
following special rule:
- If any component in quad is an
RDF blank node, then serialize it using a
special identifier as follows:
- If the blank node's existing
RDF blank node identifier matches the
reference blank node identifier then use the
RDF blank node identifier
_:a
,
otherwise, use the RDF blank node identifier
_:z
.
- Sort nquads in lexicographical order.
- Return the hash that results from passing the sorted,
joined nquads through the
hash algorithm.
Hash Related Blank Node
Algorithm
This algorithm creates a hash to identify how one
RDF blank node is related to another. It takes the
normalization state, a related
RDF blank node identifier, a quad, an
identifier issuer, issuer, and a
string position as inputs.
- Set the identifier to use for related, preferring
first the canonical identifier for related if issued, second
the identifier issued by issuer if issued, and last, if
necessary, the result of the
Hash First Degree Quads algorithm,
passing related.
- Initialize a string input to the value of
position.
- If position is not
g
, append
<
, the value of the RDF predicate in
quad, and >
to input.
- Append identifier to input.
- Return the hash that results from passing input
through the hash algorithm.
Hash N-Degree Quads
The relationship and difference between this algorithm
and the hash first degree quads algorithm should be better explained.
There may also be better names for the two algorithms.
The 'path' terminology could also be changed to better
indicate what a path is (a particular deterministic serialization for
a subgraph/subdataset of nodes without globally-unique identifiers).
Algorithm
The inputs to this algorithm are the normalization state,
the identifier for the RDF blank node to
recursively hash quads for, and path identifier issuer which is
an identifier issuer that issues temporary
RDF blank node identifiers. The output from this algorithm
will be a hash and the identifier issuer used
to help generate it.
In URGNA2012, the position parameter passed to
the Hash Related Blank Node algorithm
was instead modeled as a direction parameter, where it could have
the value p
, for property, when the related blank node was a
`subject` and the value r
, for reverse or reference, when
the related blank node was an `object`. Since URGNA2012 only normalized
graphs, not datasets, there was no use of the `graph` position.
- Create an hash to related blank nodes map for storing
hashes that identify related RDF blank nodes.
- Get a reference, quads, to the list of RDF quads
in the blank node to quads map for the key
identifier.
- For each quad in quads:
- For each component in quad, if component
is the RDF subject, RDF object, and
RDF graph name and it is an
RDF blank node that is not identified by
identifier:
- Set hash to the result of the
Hash Related Blank Node algorithm,
passing the RDF blank node identifier for
component as related, quad,
path identifier issuer as issuer, and
position as either
s
, o
, or
g
based on whether component is an
RDF subject, RDF object,
RDF graph name, respectively.
- Add a mapping of hash to the
RDF blank node identifier for component
hash to related blank nodes map, adding an entry
as necessary.
- Create an empty string, data to hash.
- For each related hash to blank node list mapping in
hash to related blank nodes map, sorted lexicographically
by related hash:
- Append the related hash to the data to hash.
- Create a string chosen path.
- Create an unset chosen issuer variable.
- For each permutation of blank node list:
- Create a copy of issuer, issuer copy.
- Create a string path.
- Create a recursion list, to store
RDF blank node identifiers that must be
recursively processed by this algorithm.
- For each related in permutation:
- If a canonical identifier has been issued for
related, append it to path.
- Otherwise, if issuer copy has not issued
an identifier for related, append related
to recursion list. Use the
Issue Identifier algorithm,
passing issuer copy and related and
append the result to path.
- If chosen path is not empty and the length
of path is greater than or equal to the length
of chosen path and path is lexicographically
greater than chosen path, then skip to the next
permutation.
- For each related in recursion list:
- Set result to the result of recursively executing
the
Hash N-Degree Quads algorithm,
passing related for identifier and
issuer copy for path identifier issuer.
- Use the
Issue Identifier algorithm,
passing issuer copy and related and
append the result to path.
- Append
<
, the hash in
result, and >
to path.
- Set issuer copy to the
identifier issuer in result.
- If chosen path is not empty and the length
of path is greater than or equal to the length
of chosen path and path is lexicographically
greater than chosen path, then skip to the next
permutation.
- If chosen path is empty or path is
lexicographically less than chosen path, set
chosen path to path and chosen issuer
to issuer copy.
- Append chosen path to data to hash.
- Replace issuer, by reference, with
chosen issuer.
- Return issuer and the hash that results from
passing data to hash through the
hash algorithm.