Changes in version 2026-02-16

- Published first stable GitHub release tag: v1.
- Release notes added in RELEASE_NOTES_v1.md.
- GitHub release page: https://github.com/bart-turczynski/rurl/releases/tag/v1
- Package version for this release is 1.0.0 (see DESCRIPTION).

Changes in version 1.4.0

Accessor improvements

- get_path() gains path_normalization, index_page_handling, trailing_slash_handling, and path_encoding arguments, matching the corresponding options of safe_parse_url().
- get_scheme() gains scheme_relative_handling.
- get_parse_status() gains source (mapped to tld_source) so warning statuses can be queried under a specific PSL section.
- get_clean_url() and get_host() gain source (mapped to tld_source).
- get_host() gains host_encoding.
- get_domain(), get_tld(), and get_subdomain() gain host_encoding, mirroring get_host().

All new arguments default to the same values as safe_parse_url(), so existing calls are unaffected.

Behavior change

- The domain-family accessors (get_domain(), get_tld(), get_subdomain()) now follow host_encoding (default "keep") instead of always returning Unicode. Under "keep" the emitted domain/TLD/subdomain mirrors the input host's own spelling: an A-label (xn--…) host yields A-label parts, a Unicode host yields Unicode parts. Pass host_encoding = "unicode" for the previous always-decoded output, or "idna" to force A-labels. This makes the domain accessors consistent with get_host(), whose host_encoding already defaulted to "keep".

Internal

- Parse-status string literals replaced by named constants (R/status-constants.R) and predicates (.is_ok_status(), .is_warning_status(), .is_joinable_status()).
- Cache touchpoints in R/zzz.R now driven from a single .CACHE_REGISTRY instead of repeating cache names by hand.
- Cleared the lintr/goodpractice findings across R/ and the tests (e.g. fixed = TRUE dot splits, condition-message construction, dropped unnecessary lambdas) with no behavior change.
- .lintr now mirrors goodpractice's linter set, so a local lintr::lint_package() matches the goodpractice report; intentional test-idiom deviations are documented in the config header.
- Restored 100% line coverage: added targeted tests for the .punycode_to_unicode(""), .host_is_ace(), and .cache_enabled() guard branches and the derive_parse_status() NA-host-dot fallback (and fixed an over-escaped regex literal that left the scheme-slash NA guard untested). The two genuinely unreachable www-prefix regex-capture fallbacks are now marked # nocov with justification.
- Reduced the cyclomatic complexity of canonical_join() (47→7), get_subdomain() (26→6), rurl_cache_config() (23→5), and safe_parse_urls() (19→3) by extracting named sub-helpers (e.g. .cj_validate_inputs()/.cj_resolve_sides()/.cj_build_join_df(), .subdomain_labels(), .validate_max_full_parse(), .spu_coerce_original()). No behavior change; no function in the package now exceeds the goodpractice cyclocomp threshold of 15.

Changes in version 1.3.0

Dependencies

- Public Suffix List matching is now delegated to the pslr package (Imports: pslr (>= 1.0.1)). rurl no longer ships its own processed copy of the list (R/sysdata.rda) or its embedded matcher, and data-raw/update_psl.R has been removed. punycoder is now required at >= 1.1.0.

Behavior changes (PSL correctness)

The embedded matcher used through 1.2.0 was not fully spec-correct. Delegating to pslr fixes the following; outputs change accordingly:

- Wildcard rules (*.) are now honored by TLD extraction. For example get_tld("a.b.kobe.jp") is now "b.kobe.jp" (was "kobe.jp").
- Exception rules (!) are now honored by TLD extraction. For example get_tld("www.ck") is now "ck" (was "www.ck"), and get_tld("foo.ck") is now "foo.ck" (was "ck").
- IDN hosts now resolve a registered domain in every section. For example get_domain("example.рф") is now "example.рф" (was NA).
- safe_parse_url() / safe_parse_urls() now derive the domain field using the requested tld_source rather than always using the combined list, so domain and tld are consistent within a parse. Under tld_source = "private" (or "icann"), a host with no suffix in that section now has domain = NA; consequently subdomain_levels_to_keep is a no-op for such hosts (there is no registered domain to trim toward). The default tld_source = "all" is unaffected.
- Hosts under an unknown TLD continue to return NA for both domain and TLD (rurl queries pslr with unknown = "na"), rather than treating an unknown single label as a public suffix.

Cache changes

- The per-host domain and tld memoization caches have been removed; pslr caches its own query results. rurl_cache_config() and rurl_cache_info() now cover only full_parse, puny_encode, and puny_decode, and the domain / tld arguments to rurl_cache_config() no longer exist.

Changes in version 1.2.0 (2026-06-19)

Dependencies

- punycoder (used for IDNA/Punycode encoding and decoding) is now on CRAN. DESCRIPTION requires punycoder (>= 1.0.0).

Behavior changes

- The package-wide default for case_handling is now "lower_host" (was "keep" for safe_parse_url(), safe_parse_urls(), get_clean_url(), and the get_*() accessors, and "lower" for get_path()). This is the RFC 3986 §6.2.2.1 normalization: the case-insensitive scheme and host fold to lowercase while the case-sensitive path is preserved. With the previous defaults, hosts such as WWW.Example.COM and www.example.com did not fold to one identity, and get_path() silently lowercased paths (two pages that differ only by path casing collapsed to one). Pass case_handling = "keep" to restore the previous reconstruction, or "lower" to lowercase the whole URL including the path. (RURL-lzepdnmm)

Changes in version 1.1.0

New features

- canonical_join() gains name_A / name_B arguments to set the output original-URL column names explicitly.
  They default to NULL, preserving the previous deparse(substitute()) behavior; supply them for stable names when piping or passing anonymous inputs (e.g. canonical_join(df[df$x > 1, ], get_b())), which otherwise produced unstable column names. (RURL-fsygrelr)
- canonical_join() gains a join_parse_status argument controlling which parse statuses yield joinable keys. The default "ok" preserves the previous behavior (only ok* statuses join); "ok_or_warning" additionally treats the parseable-but-suspicious warning-* statuses (warning-no-tld, warning-invalid-tld, warning-public-suffix) as joinable, at the cost of more potential false-positive matches. (RURL-edqdrvfu)
- Cache introspection and configuration. rurl_cache_info() reports the entry count, enabled state, and any bound for each memoization cache (full_parse, domain, tld). rurl_cache_config() enables or disables individual caches and sets an optional max_full_parse bound on the full-parse cache (default Inf, preserving the previous unbounded behavior); when the bound is reached the cache is reset so peak memory stays bounded. The domain and tld caches remain unbounded by design (they grow with the number of unique hosts, not with URL/option combinations) and can be disabled for workloads with very many unique hosts. (RURL-iuotpaqs)

Bug fixes

- safe_parse_url() now returns port as an integer (or NA_integer_), and safe_parse_urls() no longer errors on URLs that contain an explicit port (e.g. http://example.com:8080/path). Previously the scalar parser returned the port as a character string and the vectorized parser aborted. (RURL-fxyzanfg)
- Bracketed IPv6 hosts (e.g. http://[2001:db8::1]/) are now correctly detected as IP hosts: is_ip_host is TRUE, parse_status is "ok", and no TLD/domain derivation is attempted, matching how IPv4 hosts were already handled. An over-escaped detection pattern previously prevented this.
  (RURL-jpqjndld)

Behavior changes (potentially breaking)

- subdomain_levels_to_keep = N (for N > 0) now keeps the N rightmost subdomain labels as documented, instead of silently retaining all subdomains. For example, safe_parse_url("http://deep.sub.domain.example.com", subdomain_levels_to_keep = 1) now returns host domain.example.com (was deep.sub.domain.example.com). N = 0 (strip all) is unchanged. Code that relied on the previous no-op behavior for N > 0 will see different output. (RURL-szumhumv)

Documentation

- Documented clean_url composition: it is a normalized canonical key built from scheme, host, and path only. Port, query, fragment, and userinfo are intentionally excluded, and with path_encoding = "decode" the path is shown decoded (human-readable, not guaranteed URL-safe). This matches the existing behavior and the key used by canonical_join(); no behavior change. Corrected a lower_host description that implied userinfo could be retained in clean_url, and fixed a README example whose input contained a literal space (now percent-encoded) so it parses as documented. (RURL-jnboujtd)

Changes in version 0.3.0

This release adds powerful capabilities for URL normalization and canonical dataset joining. It significantly improves robustness in handling malformed or inconsistent URLs.

Highlights

- New case_handling and trailing_slash_handling parameters in safe_parse_url() and get_clean_url() provide greater control over URL formatting.
- Introduced canonical_join() for joining datasets on normalized URL keys.
- Improved handling of non-standard or malformed schemes like htp://.
- Fixed parsing for schemeless URLs with ports (e.g., example.com:8080/path).
- More reliable fallback when curl::curl_parse_url() fails internally.
- Corrected regular expressions for IPv6 parsing.

Changes in version 0.2.0

- First version for a potential CRAN submission.
- Fully tested across macOS, Windows, and Linux.
- Achieved 100% unit test coverage.
- Improved README and documentation.

This release adds robust support for internationalized domain names (IDNs), improves punycode handling, and ensures accurate extraction of TLDs and registered domains.

Highlights

- Accurate TLD extraction for both ASCII and Unicode domains
- Graceful fallback when urltools is unavailable
- NFC normalization with stringi
- 100% test coverage with edge cases and punycode validation
- Improved internal helpers and clearer test diagnostics

Changes in version 0.1.3

Improvements

- Removed the dependency on the psl package.
- Implemented an internal registered domain extraction using the Public Suffix List.
- Added internal update_psl.R script to fetch and process the PSL during development.
- Improved test coverage to 100%.
- Cleaned up exports and internal helpers.
- Updated ignores.
- Tested on macOS, Windows, and Linux via rhub and win-builder.
- CRAN checks pass with 0 errors/warnings and only standard notes.

Documentation

- README updated to reflect the use of the PSL and internal domain logic.
- LICENSE and attribution clarified for MIT + Mozilla Public Suffix List.

Changes in version 0.1.2

Stabilization & Coverage

- Achieved 100% test coverage.
- Added examples to all exported functions.
- Improved documentation (@param, @return, etc.) for CRAN compliance.
- Cleaned up NAMESPACE and removed unnecessary functions like hello().
- Refined URL parsing logic and improved output consistency.

Changes in version 0.1.0

- All get_*() functions are now vectorized and work on character vectors.
- Deprecated scalar-only behavior.
- Internal parsing made more robust using curl and psl.
- Ready for use in mutate() and other tidy workflows.