v1.RELEASE_NOTES_v1.md.1.0.0 (see DESCRIPTION).

- get_path() gains path_normalization, index_page_handling,
  trailing_slash_handling, and path_encoding arguments, matching
  the corresponding options of safe_parse_url().
- get_scheme() gains scheme_relative_handling.
- get_parse_status() gains source (mapped to tld_source) so
  warning statuses can be queried under a specific PSL section.
- get_clean_url() and get_host() gain source (mapped to tld_source).
- get_host() gains host_encoding.
- get_domain(), get_tld(), and get_subdomain() gain host_encoding,
  mirroring get_host().

All new arguments default to the same values as safe_parse_url(), so
existing calls are unaffected.
- The domain accessors (get_domain(), get_tld(), get_subdomain()) now follow
  host_encoding (default "keep") instead of always returning Unicode. Under
  "keep" the emitted domain/TLD/subdomain mirrors the input host's own
  spelling: an A-label (xn--…) host yields A-label parts, a Unicode host
  yields Unicode parts. Pass host_encoding = "unicode" for the previous
  always-decoded output, or "idna" to force A-labels. This makes the domain
  accessors consistent with get_host(), whose host_encoding already
  defaulted to "keep".
- Status constants (R/status-constants.R) and predicates (.is_ok_status(),
  .is_warning_status(), .is_joinable_status()).
- R/zzz.R is now driven from a single .CACHE_REGISTRY instead of repeating
  cache names by hand.
- lintr/goodpractice findings across R/ and the tests (e.g. fixed = TRUE dot
  splits, condition-message construction, dropped unnecessary lambdas) with
  no behavior change.
- .lintr now mirrors goodpractice's linter set, so a local
  lintr::lint_package() matches the goodpractice report; intentional
  test-idiom deviations are documented in the config header.
- Added tests for the .punycode_to_unicode(""), .host_is_ace(), and
  .cache_enabled() guard branches and the derive_parse_status() NA-host-dot
  fallback (and fixed an over-escaped regex literal that left the
  scheme-slash NA guard untested). The two genuinely unreachable www-prefix
  regex-capture fallbacks are now marked # nocov with justification.
- Reduced the cyclomatic complexity of canonical_join() (47→7),
  get_subdomain() (26→6), rurl_cache_config() (23→5), and
  safe_parse_urls() (19→3) by extracting named sub-helpers (e.g.
  .cj_validate_inputs()/.cj_resolve_sides()/.cj_build_join_df(),
  .subdomain_labels(), .validate_max_full_parse(),
  .spu_coerce_original()). No behavior change; no function in the package
  now exceeds the goodpractice cyclocomp threshold of 15.
- Public Suffix List lookups are now delegated to the pslr package
  (Imports: pslr (>= 1.0.1)). rurl no longer ships its own processed copy of
  the list (R/sysdata.rda) or its embedded matcher, and
  data-raw/update_psl.R has been removed. punycoder is now required at
  >= 1.1.0.

The embedded matcher used through 1.2.0 was not fully spec-correct. Delegating
to pslr fixes the following; outputs change accordingly:
- Wildcard rules (*.) are now honored by TLD extraction. For example
  get_tld("a.b.kobe.jp") is now "b.kobe.jp" (was "kobe.jp").
- Exception rules (!) are now honored by TLD extraction. For example
  get_tld("www.ck") is now "ck" (was "www.ck"), and get_tld("foo.ck")
  is now "foo.ck" (was "ck").
- get_domain("example.рф") is now "example.рф" (was NA).
- safe_parse_url() / safe_parse_urls() now derive the domain field using
  the requested tld_source rather than always using the combined list, so
  domain and tld are consistent within a parse. Under
  tld_source = "private" (or "icann"), a host with no suffix in that section
  now has domain = NA; consequently subdomain_levels_to_keep is a no-op for
  such hosts (there is no registered domain to trim toward). The default
  tld_source = "all" is unaffected.
- Hosts with no known public suffix now yield NA for both domain and TLD
  (rurl queries pslr with unknown = "na"), rather than treating an unknown
  single label as a public suffix.
- The domain and tld memoization caches have been removed; pslr
  caches its own query results. rurl_cache_config() and rurl_cache_info()
  now cover only full_parse, puny_encode, and puny_decode, and the
  domain / tld arguments to rurl_cache_config() no longer exist.
- punycoder (used for IDNA/Punycode encoding and decoding) is now on CRAN.
  DESCRIPTION requires punycoder (>= 1.0.0).
- The default case_handling is now "lower_host" (was "keep" for
  safe_parse_url(), safe_parse_urls(), get_clean_url(), and the get_*()
  accessors, and "lower" for get_path()). This is the RFC 3986 §6.2.2.1
  normalization: the case-insensitive scheme and host fold to lowercase
  while the case-sensitive path is preserved. With the previous defaults,
  hosts such as WWW.Example.COM and www.example.com did not fold to one
  identity, and get_path() silently lowercased paths (two pages that
  differ only by path casing collapsed to one). Pass case_handling = "keep"
  to restore the previous reconstruction, or "lower" to lowercase the whole
  URL including the path. (RURL-lzepdnmm)
- canonical_join() gains name_A / name_B arguments to set the output
  original-URL column names explicitly. They default to NULL, preserving the
  previous deparse(substitute()) behavior; supply them for stable names when
  piping or passing anonymous inputs (e.g.
  canonical_join(df[df$x > 1, ], get_b())), which otherwise produced
  unstable column names. (RURL-fsygrelr)
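
The naming instability can be reproduced with a minimal base-R sketch of the deparse(substitute()) idiom. The helper label_input() below is hypothetical and is not part of rurl; it only mimics how a default name might be derived from the expression the caller wrote, and how an explicit name overrides it.

```r
# Hypothetical helper (not part of rurl): derive a label for an input,
# either from an explicit name or from the expression the caller wrote.
label_input <- function(x, name = NULL) {
  if (!is.null(name)) {
    return(name)
  }
  # Collapse in case deparse() splits a long expression over several strings.
  paste(deparse(substitute(x)), collapse = "")
}

df <- data.frame(x = 1:3)

label_input(df)                          # label is the variable name
label_input(df[df$x > 1, ])              # label is the whole subsetting call
label_input(df[df$x > 1, ], name = "A")  # label is fixed by the caller
```

With an anonymous or piped input, the derived label changes whenever the calling expression changes, which is why a caller-supplied name is the more stable choice.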
- canonical_join() gains a join_parse_status argument controlling which
  parse statuses yield joinable keys. The default "ok" preserves the previous
  behavior (only ok* statuses join); "ok_or_warning" additionally treats
  the parseable-but-suspicious warning-* statuses (warning-no-tld,
  warning-invalid-tld, warning-public-suffix) as joinable, at the cost of
  more potential false-positive matches. (RURL-edqdrvfu)
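
As a rough illustration of the two modes, the following base-R sketch classifies status strings by prefix. The function is_joinable() is hypothetical (it is not rurl's internal predicate), and the "error" status is an assumed placeholder for any non-ok, non-warning status.

```r
# Hypothetical predicate (not rurl's implementation): decide which parse
# statuses may contribute a join key under each join_parse_status mode.
is_joinable <- function(status, join_parse_status = c("ok", "ok_or_warning")) {
  join_parse_status <- match.arg(join_parse_status)
  # NA statuses are never joinable.
  ok <- !is.na(status) & startsWith(status, "ok")
  if (join_parse_status == "ok") {
    return(ok)
  }
  ok | (!is.na(status) & startsWith(status, "warning-"))
}

# "error" is an assumed placeholder status used only for this example.
statuses <- c("ok", "warning-no-tld", "warning-public-suffix", "error", NA)

is_joinable(statuses)                   # only the ok* status qualifies
is_joinable(statuses, "ok_or_warning")  # warning-* statuses qualify as well
```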
- Cache introspection and configuration. rurl_cache_info() reports the entry
  count, enabled state, and any bound for each memoization cache
  (full_parse, domain, tld). rurl_cache_config() enables or disables
  individual caches and sets an optional max_full_parse bound on the
  full-parse cache (default Inf, preserving the previous unbounded
  behavior); when the bound is reached the cache is reset so peak memory stays
  bounded. The domain and tld caches remain unbounded by design (they
  grow with the number of unique hosts, not with URL/option combinations) and
  can be disabled for workloads with very many unique hosts. (RURL-iuotpaqs)
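
The reset-on-bound policy can be sketched with a plain environment. This is an illustrative assumption about how such a cache might behave, not rurl's actual implementation; make_bounded_cache() and its max_entries argument are hypothetical names.

```r
# Hypothetical bounded cache (not rurl's implementation): once the number of
# entries has reached max_entries, the next insert clears everything first.
make_bounded_cache <- function(max_entries = Inf) {
  store <- new.env(parent = emptyenv())
  size <- function() length(ls(store, all.names = TRUE))
  list(
    get = function(key) store[[key]],  # NULL when the key is absent
    set = function(key, value) {
      if (size() >= max_entries) {
        rm(list = ls(store, all.names = TRUE), envir = store)
      }
      assign(key, value, envir = store)
      invisible(value)
    },
    size = size
  )
}

cache <- make_bounded_cache(max_entries = 2)
cache$set("http://a.example/", "parsed a")
cache$set("http://b.example/", "parsed b")
cache$set("http://c.example/", "parsed c")  # bound reached: reset, then insert
cache$size()
```

A full reset is cruder than LRU eviction, but it keeps bookkeeping trivial and still caps peak memory, which is the property the note above describes.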
- safe_parse_url() now returns port as an integer (or NA_integer_), and
  safe_parse_urls() no longer errors on URLs that contain an explicit port
  (e.g. http://example.com:8080/path). Previously the scalar parser returned
  the port as a character string and the vectorized parser aborted.
  (RURL-fxyzanfg)
- IPv6 hosts (e.g. http://[2001:db8::1]/) are now correctly detected
  as IP hosts: is_ip_host is TRUE, parse_status is "ok", and no
  TLD/domain derivation is attempted, matching how IPv4 hosts were already
  handled. An over-escaped detection pattern previously prevented this.
  (RURL-jpqjndld)
- subdomain_levels_to_keep = N (for N > 0) now keeps the N rightmost
  subdomain labels as documented, instead of silently retaining all
  subdomains. For example,
  safe_parse_url("http://deep.sub.domain.example.com", subdomain_levels_to_keep = 1)
  now returns host domain.example.com (was deep.sub.domain.example.com).
  N = 0 (strip all) is unchanged. Code that relied on the previous no-op
  behavior for N > 0 will see different output. (RURL-szumhumv)
- Clarified clean_url composition: it is a normalized canonical key built
  from scheme, host, and path only. Port, query, fragment, and userinfo are
  intentionally excluded, and with path_encoding = "decode" the path is shown
  decoded (human-readable, not guaranteed URL-safe). This matches the existing
  behavior and the key used by canonical_join(); there is no behavior change.
  Corrected a lower_host description that implied userinfo could be retained
  in clean_url, and fixed a README example whose input contained a literal
  space (now percent-encoded) so it parses as documented. (RURL-jnboujtd)

This release adds URL normalization options and canonical dataset joining, and
improves robustness when handling malformed or inconsistent URLs.
- case_handling and trailing_slash_handling parameters in safe_parse_url()
  and get_clean_url() provide greater control over URL formatting.
- canonical_join() for joining datasets on normalized URL keys.
- htp://.example.com:8080/path).
- curl::curl_parse_url() fails internally.

This release adds robust support for internationalized domain names (IDNs),
improves punycode handling, and ensures accurate extraction of TLDs and
registered domains.
- urltools is unavailable
- stringi
- psl package.
- update_psl.R script to fetch and process the PSL during development.
- @param, @return, etc.) for CRAN compliance.
- NAMESPACE and removed unnecessary functions like hello().
- get_*() functions are now vectorized and work on character vectors.
- curl and psl.
- mutate() and other tidy workflows.