| Title: | Public Suffix List Engine |
|---|---|
| Description: | A focused implementation of the Public Suffix List (PSL). Bundles a reproducible, pinned PSL snapshot and implements the official prevailing-rule algorithm to answer public-suffix (eTLD) and registrable-domain (eTLD+1) queries. Distinguishes ICANN and PRIVATE rule sections, accepts Unicode and ASCII hostnames via 'punycoder' canonicalization, and supports an explicit, validated offline refresh path. The matcher is compiled with 'cpp11' and requires no external system library. Used as the PSL engine by the 'rurl' package. |
| Authors: | Bart Turczynski [aut, cre] (ORCID: <https://orcid.org/0000-0002-8788-7980>) |
| Maintainer: | Bart Turczynski <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.2 |
| Built: | 2026-06-23 10:41:43 UTC |
| Source: | https://github.com/bart-turczynski/pslr |
TRUE exactly when the valid canonical host equals its own public suffix
under the selected policy. Returns NA whenever public_suffix() would
return NA (missing or invalid input, or an unresolved host under
unknown = "na"). Under the default unknown = "default", an unlisted
single label such as "madeuptld" is TRUE via the implicit * rule; pass
unknown = "na" to test explicit membership instead.
is_public_suffix(domain, section = c("all", "icann", "private"), unknown = c("default", "na"), invalid = c("na", "error"))
| domain | Character vector of DNS hostnames (not URLs). Each element may be a mixed-case ASCII, Unicode, or A-label hostname, a single label, or a hostname with exactly one terminal root dot. See Input contract. |
|---|---|
| section | Which rule sections are eligible: |
| unknown | |
| invalid | |
A logical vector with length(domain), preserving the names of
domain.
NA is treated as missing (returns NA), not invalid. Invalid elements
include empty or whitespace-only strings, leading or consecutive dots, URL
syntax, IPv6 addresses, canonical dotted-decimal IPv4 literals, and labels
that fail hostname/IDNA validation. Wrong argument types and non-scalar or
unknown option values always abort regardless of invalid.
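
The distinction between missing, invalid, and acceptable elements can be illustrated with a rough base-R sketch. The helper `classify_input()` is hypothetical and not part of the package; its checks only approximate the rules above (for example, the IPv4 pattern does not verify octet ranges, and label-level hostname/IDNA validation is omitted entirely).

```r
# Hypothetical helper (not the package's API): label each element as
# "missing", "invalid", or "ok" using simplified versions of the rules above.
classify_input <- function(domain) {
  stopifnot(is.character(domain))            # wrong argument types always abort
  status <- rep("ok", length(domain))
  is_missing <- is.na(domain)
  x <- domain
  x[is_missing] <- ""                        # placeholder so the checks below run

  blank    <- !nzchar(trimws(x))                           # empty or whitespace-only
  bad_dots <- startsWith(x, ".") | grepl("..", x, fixed = TRUE)
  url_like <- grepl("[/:?#@]", x)                          # URL syntax; ':' also catches IPv6
  ipv4     <- grepl("^([0-9]{1,3}\\.){3}[0-9]{1,3}\\.?$", x)  # rough dotted-decimal shape

  status[blank | bad_dots | url_like | ipv4] <- "invalid"
  status[is_missing] <- "missing"            # NA is missing, never invalid
  names(status) <- names(domain)             # preserve input names
  status
}

classify_input(c("Example.COM", NA, "a..b", "http://x.org/", "example.com."))
```

A real implementation would also need the IDNA and per-label checks that this sketch leaves out.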
is_public_suffix("com")
is_public_suffix("example.com")
is_public_suffix("madeuptld")
is_public_suffix("madeuptld", unknown = "na")
Downloads, validates, and publishes a fresh Public Suffix List into the user cache. This is the only function in the package that accesses the network, and only when you call it explicitly.
psl_refresh(url = "https://publicsuffix.org/list/public_suffix_list.dat", force = FALSE, activate = FALSE)
| url | Absolute |
|---|---|
| force | When |
| activate | When |
Cache age is measured from the timestamp of the last successful network retrieval; reusing a fresh cache does not advance that timestamp. The download is written to a temporary file in binary mode and must not exceed the documented maximum of 16 MiB. The source is then fully validated (UTF-8 encoding, section markers, rule grammar, conflicting rules, and successful canonicalization of every rule); exact same-section duplicates warn once and are deduplicated. Source and metadata are published only after validation succeeds, using an atomic commit that never exposes a partial or mismatched snapshot. A failed refresh never replaces a valid cache or the active matcher.
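
The validate-then-publish pattern described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: `publish_validated()` is a hypothetical helper, it works on a local file rather than a download, it checks only size, UTF-8 validity, and the presence of the four PSL section markers, and it relies on `file.rename()` being an atomic replace, which holds on POSIX filesystems when source and destination share a directory.

```r
# Hypothetical helper: copy `src` to `dest` only after it passes basic checks.
# A failure at any step leaves an existing `dest` untouched.
publish_validated <- function(src, dest, max_bytes = 16 * 1024^2) {
  size <- file.size(src)
  if (is.na(size) || size > max_bytes) stop("source missing or too large")

  bytes <- readBin(src, what = "raw", n = size)   # binary-mode read
  text <- rawToChar(bytes)                        # errors on embedded NUL bytes
  if (!validUTF8(text)) stop("source is not valid UTF-8")

  lines <- trimws(strsplit(text, "\r?\n")[[1]])
  markers <- c("// ===BEGIN ICANN DOMAINS===", "// ===END ICANN DOMAINS===",
               "// ===BEGIN PRIVATE DOMAINS===", "// ===END PRIVATE DOMAINS===")
  if (!all(markers %in% lines)) stop("section markers missing")

  # Stage next to the destination so the final rename stays on one filesystem.
  staged <- tempfile(pattern = "psl-", tmpdir = dirname(dest))
  writeBin(bytes, staged)
  if (!file.rename(staged, dest)) {
    unlink(staged)
    stop("could not publish snapshot")
  }
  invisible(dest)
}
```

Because the staged file is renamed only after every check has passed, a reader of `dest` sees either the old snapshot or the complete new one.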
Invisibly, a one-row data.frame shaped like psl_version()
describing the selected cache snapshot, whether or not it was activated.
## Not run:
psl_refresh()
psl_refresh(force = TRUE, activate = TRUE)
## End(Not run)
Returns the explicit rules of the active list as a base data.frame, one row
per rule. The implicit default * rule is not included.
psl_rules(section = c("all", "icann", "private"))
| section | Which rule sections to return: |
|---|---|
A base data.frame with columns, in order: rule (original source
rule text), canonical_rule (the canonicalized rule, including the *. or
! marker), kind ("normal", "wildcard", or "exception"), section
("icann" or "private"), and labels (integer rule depth, counting a
wildcard label). Rows are ordered first by section (ICANN before PRIVATE)
and then by source-file order.
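
As a rough illustration of how the kind and labels columns relate to the rule text, the following base-R sketch derives both from a canonical rule string. `describe_rules()` is a hypothetical helper, and counting an exception rule's labels after removing its ! marker is an assumption of this sketch rather than documented behaviour.

```r
# Hypothetical helper mirroring the `kind` and `labels` columns described above.
describe_rules <- function(rule) {
  kind <- ifelse(startsWith(rule, "!"), "exception",
          ifelse(startsWith(rule, "*."), "wildcard", "normal"))
  bare <- sub("^!", "", rule)                            # assumption: "!" is not a label
  labels <- lengths(strsplit(bare, ".", fixed = TRUE))   # the wildcard label is counted
  data.frame(rule = rule, kind = kind, labels = labels, stringsAsFactors = FALSE)
}

describe_rules(c("com", "co.uk", "*.ck", "!www.ck"))
```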
psl_version(), public_suffix_rule()
head(psl_rules("icann"))
nrow(psl_rules("private"))
Switches the list backing every query in the current R session. The change is session-only and is validated before any session state changes; a failure leaves the previously active list usable. A successful switch invalidates the match-result cache.
psl_use(source = c("bundled", "cache", "path"), path = NULL)
| source | Where to load the list from: |
|---|---|
| path | For |
A custom path is held to the same runtime duplicate policy as
psl_refresh(): exact same-section duplicates warn once and are
deduplicated, while conflicting rule kinds for the same labels are fatal.
Cache and custom-path sources are read in source form and indexed under the
runtime normalizer; they never reuse the bundled generated index.
Invisibly, the psl_version() row for the newly active list.
psl_refresh(), psl_version(), psl_rules()
psl_use("bundled")
## Not run:
psl_use("cache")
psl_use("path", path = "my_list.dat")
## End(Not run)
Returns a one-row data.frame describing the list currently active in this R session: its source-snapshot provenance and the normalization identifiers actually used to index the active matcher. Reproducing a query result requires both the active-list identity and these normalization identifiers (PRD s10), so a reproducibility-sensitive workflow should record this row.
psl_version()
The columns, in order, are:
source: "bundled", "cache", or "path".
path: File path of a "cache" or "path" source; NA otherwise.
retrieved_at: Network retrieval timestamp, or NA.
list_date: Upstream list date, or NA when unknown.
commit: Upstream commit SHA, or NA when unknown.
size: Source byte size (integer).
checksum: Source checksum, including its algorithm prefix (e.g. "sha256:...").
normalizer: The dependency providing canonicalization, currently "punycoder".
normalizer_version: Its installed package version.
normalization_profile: Its stable case-mapping / IDNA / validation profile identifier.
unicode_version: The Unicode data version used by that profile.
Unavailable metadata is a typed NA, never omitted. The normalization
identifiers describe the implementation used by the current session, whether
the active list came from the bundled snapshot, the user cache, or a custom
path; an in-memory compatibility rebuild (PRD s8.3) updates them without
altering the shipped source identity or checksum.
A one-row base data.frame with the columns described in Details.
psl_use(), psl_refresh(), psl_rules()
psl_version()
Returns the public suffix (effective top-level domain, eTLD) of each host under the selected Public Suffix List policy, following the official prevailing-rule algorithm.
public_suffix(domain, section = c("all", "icann", "private"), output = c("ascii", "unicode"), unknown = c("default", "na"), invalid = c("na", "error"))
| domain | Character vector of DNS hostnames (not URLs). Each element may be a mixed-case ASCII, Unicode, or A-label hostname, a single label, or a hostname with exactly one terminal root dot. See Input contract. |
|---|---|
| section | Which rule sections are eligible: |
| output | |
| unknown | |
| invalid | |
A character vector with length(domain), preserving the names of
domain. Other attributes are dropped.
NA is treated as missing (returns NA), not invalid. Invalid elements
include empty or whitespace-only strings, leading or consecutive dots, URL
syntax, IPv6 addresses, canonical dotted-decimal IPv4 literals, and labels
that fail hostname/IDNA validation. Wrong argument types and non-scalar or
unknown option values always abort regardless of invalid.
registrable_domain(), is_public_suffix(), suffix_extract(),
public_suffix_rule()
public_suffix("www.example.com")
public_suffix("example.co.uk")
public_suffix("example.com.")
public_suffix("madeuptld", unknown = "na")
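
For readers unfamiliar with the prevailing-rule algorithm, here is a minimal base-R sketch of it over a toy rule set. It is not the package's compiled matcher: `toy_rules` and `prevailing_suffix()` are hypothetical names, the rule set is illustrative rather than the bundled list, and canonicalization, IDNA handling, root dots, and ICANN/PRIVATE sections are all ignored.

```r
# Illustrative rules only; the real list is far larger.
toy_rules <- c("com", "uk", "co.uk", "*.ck", "!www.ck")

# Hypothetical helper: public suffix of one lower-case ASCII host.
prevailing_suffix <- function(host, rules = toy_rules, unknown = c("default", "na")) {
  unknown <- match.arg(unknown)
  labels <- rev(strsplit(tolower(host), ".", fixed = TRUE)[[1]])   # TLD first

  # A rule matches when each of its labels equals the host label or is "*".
  matches <- function(rule) {
    r <- rev(strsplit(sub("^!", "", rule), ".", fixed = TRUE)[[1]])
    length(r) <= length(labels) && all(r == "*" | r == labels[seq_along(r)])
  }
  hit <- rules[vapply(rules, matches, logical(1))]

  if (length(hit) == 0L) {
    if (unknown == "na") return(NA_character_)
    depth <- 1L                                    # implicit "*" rule
  } else if (any(startsWith(hit, "!"))) {
    ex <- hit[startsWith(hit, "!")][1]             # an exception rule prevails
    depth <- length(strsplit(ex, ".", fixed = TRUE)[[1]]) - 1L   # drop its leftmost label
  } else {
    depth <- max(lengths(strsplit(hit, ".", fixed = TRUE)))      # most labels wins
  }
  paste(rev(labels[seq_len(depth)]), collapse = ".")
}

prevailing_suffix("example.co.uk")
prevailing_suffix("www.ck")
prevailing_suffix("madeuptld", unknown = "na")
```

The sketch scans every rule for each host, which is fine for illustration but would be too slow for a full list.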
Inspect the prevailing PSL rule for each host
public_suffix_rule(domain, section = c("all", "icann", "private"), unknown = c("default", "na"), invalid = c("na", "error"))
| domain | Character vector of DNS hostnames (not URLs). Each element may be a mixed-case ASCII, Unicode, or A-label hostname, a single label, or a hostname with exactly one terminal root dot. See Input contract. |
|---|---|
| section | Which rule sections are eligible: |
| unknown | |
| invalid | |
A base data.frame with one row per input and columns, in order:
input (original), host_ascii (canonical A-label host), rule (the
canonical rule including *. or !, "*" for the implicit default),
kind ("normal", "wildcard", "exception", or "default"),
rule_section ("icann", "private", or NA for the default/no result),
and public_suffix_ascii (the derived A-label public suffix). Invalid rows
are NA in every derived column. A valid host left unresolved by
unknown = "na" keeps host_ascii while the rule and suffix columns are
NA. An exception rule retains its ! for auditability. Zero-length
input returns a zero-row frame; all-invalid input keeps one row per input.
NA is treated as missing (returns NA), not invalid. Invalid elements
include empty or whitespace-only strings, leading or consecutive dots, URL
syntax, IPv6 addresses, canonical dotted-decimal IPv4 literals, and labels
that fail hostname/IDNA validation. Wrong argument types and non-scalar or
unknown option values always abort regardless of invalid.
public_suffix(), suffix_extract()
public_suffix_rule("www.example.co.uk")
public_suffix_rule("madeuptld")
Returns the registrable domain (eTLD+1) of each host: its public suffix plus
one host label to the left. It is NA when no such label exists (the host is
itself a public suffix) or when the public suffix is NA.
registrable_domain(domain, section = c("all", "icann", "private"), output = c("ascii", "unicode"), unknown = c("default", "na"), invalid = c("na", "error"))
| domain | Character vector of DNS hostnames (not URLs). Each element may be a mixed-case ASCII, Unicode, or A-label hostname, a single label, or a hostname with exactly one terminal root dot. See Input contract. |
|---|---|
| section | Which rule sections are eligible: |
| output | |
| unknown | |
| invalid | |
A character vector with length(domain), preserving the names of
domain. Other attributes are dropped.
NA is treated as missing (returns NA), not invalid. Invalid elements
include empty or whitespace-only strings, leading or consecutive dots, URL
syntax, IPv6 addresses, canonical dotted-decimal IPv4 literals, and labels
that fail hostname/IDNA validation. Wrong argument types and non-scalar or
unknown option values always abort regardless of invalid.
public_suffix(), is_public_suffix(), suffix_extract()
registrable_domain("www.example.co.uk")
registrable_domain("com")
registrable_domain("foo.madeuptld", unknown = "na")
Split hosts into subdomain, registrant label, and public suffix
suffix_extract(domain, section = c("all", "icann", "private"), output = c("ascii", "unicode"), unknown = c("default", "na"), invalid = c("na", "error"))
| domain | Character vector of DNS hostnames (not URLs). Each element may be a mixed-case ASCII, Unicode, or A-label hostname, a single label, or a hostname with exactly one terminal root dot. See Input contract. |
|---|---|
| section | Which rule sections are eligible: |
| output | |
| unknown | |
| invalid | |
A base data.frame with one row per input and columns, in order:
input (original, unchanged), host (canonical host in output form),
subdomain (labels left of the registrable domain; "" when none),
domain (the single registrant label left of the suffix), suffix (the
public suffix), and registrable_domain (eTLD+1). domain, subdomain,
and registrable_domain are NA when the host is itself a public suffix.
If public-suffix resolution is NA, every column is NA except input and,
when normalization succeeded, host. Zero-length input returns a
zero-row frame; all-invalid input keeps one row per input. Root dots are
preserved on host, suffix, and registrable_domain only.
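
The column semantics above can be sketched for a single host in base R. `split_host()` is a hypothetical helper that assumes the public suffix has already been resolved and that both arguments are lower-case ASCII without a root dot; it does not reproduce the package's handling of invalid input, output forms, or root dots.

```r
# Hypothetical helper: split one host given its already-resolved public suffix.
split_host <- function(host, suffix) {
  h <- strsplit(host, ".", fixed = TRUE)[[1]]
  s <- strsplit(suffix, ".", fixed = TRUE)[[1]]
  left <- h[seq_len(length(h) - length(s))]        # labels left of the suffix

  if (length(left) == 0L) {                        # host is itself a public suffix
    return(data.frame(host = host, subdomain = NA_character_,
                      domain = NA_character_, suffix = suffix,
                      registrable_domain = NA_character_,
                      stringsAsFactors = FALSE))
  }
  domain <- left[length(left)]                     # single registrant label
  data.frame(
    host = host,
    subdomain = paste(left[-length(left)], collapse = "."),   # "" when none
    domain = domain,
    suffix = suffix,
    registrable_domain = paste(domain, suffix, sep = "."),
    stringsAsFactors = FALSE
  )
}

split_host("www.example.co.uk", "co.uk")
split_host("com", "com")
```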
NA is treated as missing (returns NA), not invalid. Invalid elements
include empty or whitespace-only strings, leading or consecutive dots, URL
syntax, IPv6 addresses, canonical dotted-decimal IPv4 literals, and labels
that fail hostname/IDNA validation. Wrong argument types and non-scalar or
unknown option values always abort regardless of invalid.
public_suffix(), public_suffix_rule()
suffix_extract("www.example.co.uk")
suffix_extract(c("example.com", "com", NA))