                 Changes in version 1.0.0 (2026-02-16)

  - Published first stable GitHub release tag: v1.
  - Release notes added in RELEASE_NOTES_v1.md.
  - GitHub release page:
    https://github.com/bart-turczynski/rurl/releases/tag/v1
  - Package version for this release is 1.0.0 (see DESCRIPTION).

                        Changes in version 1.4.0                        

Accessor improvements

  - get_path() gains path_normalization, index_page_handling,
    trailing_slash_handling, and path_encoding arguments, matching the
    corresponding options of safe_parse_url().
  - get_scheme() gains scheme_relative_handling.
  - get_parse_status() gains source (mapped to tld_source) so warning
    statuses can be queried under a specific PSL section.
  - get_clean_url() and get_host() gain source (mapped to tld_source).
  - get_host() gains host_encoding.
  - get_domain(), get_tld(), and get_subdomain() gain host_encoding,
    mirroring get_host().

All new arguments default to the same values as safe_parse_url(), so
existing calls are unaffected.

Behavior change

  - The domain-family accessors (get_domain(), get_tld(),
    get_subdomain()) now follow host_encoding (default "keep") instead
    of always returning Unicode. Under "keep", the emitted domain, TLD,
    and subdomain mirror the input host's own spelling: an A-label
    (xn--…) host yields A-label parts, and a Unicode host yields Unicode
    parts. Pass host_encoding = "unicode" for the previous
    always-decoded output, or "idna" to force A-labels. This makes the
    domain accessors consistent with get_host(), whose host_encoding
    already defaulted to "keep".

Internal

  - Parse-status string literals are now replaced by named constants
    (R/status-constants.R) and predicates (.is_ok_status(),
    .is_warning_status(), .is_joinable_status()).
  - Cache touchpoints in R/zzz.R are now driven from a single
    .CACHE_REGISTRY instead of repeating cache names by hand.
  - Cleared the lintr/goodpractice findings across R/ and the tests
    (e.g. fixed = TRUE dot splits, condition-message construction,
    dropped unnecessary lambdas) with no behavior change.
  - .lintr now mirrors goodpractice's linter set, so a local
    lintr::lint_package() matches the goodpractice report; intentional
    test-idiom deviations are documented in the config header.
  - Restored 100% line coverage: added targeted tests for the
    .punycode_to_unicode(""), .host_is_ace(), and .cache_enabled() guard
    branches and the derive_parse_status() NA-host-dot fallback, and
    fixed an over-escaped regex literal that had left the scheme-slash
    NA guard untested. The two genuinely unreachable www-prefix
    regex-capture fallbacks are now marked # nocov with justification.
  - Reduced the cyclomatic complexity of canonical_join() (47→7),
    get_subdomain() (26→6), rurl_cache_config() (23→5), and
    safe_parse_urls() (19→3) by extracting named sub-helpers (e.g.
    .cj_validate_inputs()/.cj_resolve_sides()/.cj_build_join_df(),
    .subdomain_labels(), .validate_max_full_parse(),
    .spu_coerce_original()). No behavior change; no function in the
    package now exceeds the goodpractice cyclocomp threshold of 15.

                        Changes in version 1.3.0                        

Dependencies

  - Public Suffix List matching is now delegated to the pslr package
    (Imports: pslr (>= 1.0.1)). rurl no longer ships its own processed
    copy of the list (R/sysdata.rda) or its embedded matcher, and
    data-raw/update_psl.R has been removed. punycoder is now required at
    >= 1.1.0.

Behavior changes (PSL correctness)

The embedded matcher used through 1.2.0 was not fully spec-correct.
Delegating to pslr fixes the following; outputs change accordingly:

  - Wildcard rules (*.) are now honored by TLD extraction. For example
    get_tld("a.b.kobe.jp") is now "b.kobe.jp" (was "kobe.jp").
  - Exception rules (!) are now honored by TLD extraction. For example
    get_tld("www.ck") is now "ck" (was "www.ck"), and get_tld("foo.ck")
    is now "foo.ck" (was "ck").
  - IDN hosts now resolve a registered domain in every section. For
    example get_domain("example.рф") is now "example.рф" (was NA).
  - safe_parse_url() / safe_parse_urls() now derive the domain field
    using the requested tld_source rather than always using the combined
    list, so domain and tld are consistent within a parse. Under
    tld_source = "private" (or "icann"), a host with no suffix in that
    section now has domain = NA; consequently subdomain_levels_to_keep
    is a no-op for such hosts (there is no registered domain to trim
    toward). The default tld_source = "all" is unaffected.
  - Hosts under an unknown TLD continue to return NA for both domain and
    TLD (rurl queries pslr with unknown = "na"), rather than treating an
    unknown single label as a public suffix.
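
The wildcard and exception handling described above can be illustrated
with a small base-R sketch. This is not pslr's implementation: the rule
set is a tiny hand-picked subset of the Public Suffix List, and
toy_public_suffix() is a hypothetical helper introduced only for this
example. Returning NA when no rule matches is an assumption that mirrors
the unknown = "na" behavior noted above.

```r
# Toy rule set (NOT the full Public Suffix List): just enough rules to
# reproduce the kobe.jp and ck examples from the bullets above.
toy_rules <- c("jp", "kobe.jp", "*.kobe.jp", "!city.kobe.jp",
               "*.ck", "!www.ck")

# Hypothetical helper: public suffix of `host` under `rules`, or NA when
# no rule matches at all.
toy_public_suffix <- function(host, rules = toy_rules) {
  labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  n <- length(labels)
  best <- 0L  # label count of the longest matching non-exception rule
  for (rule in rules) {
    is_exception <- startsWith(rule, "!")
    rule_labels <- strsplit(sub("^!", "", rule), ".", fixed = TRUE)[[1]]
    k <- length(rule_labels)
    if (k > n) next
    host_tail <- labels[(n - k + 1):n]
    # "*" matches any single label; other labels must match exactly.
    if (!all(rule_labels == "*" | rule_labels == host_tail)) next
    if (is_exception) {
      # An exception rule wins over every other match; the suffix is the
      # rule with its leftmost label removed. Assumes exception rules
      # have at least two labels, as in this toy set.
      return(paste(labels[(n - k + 2):n], collapse = "."))
    }
    best <- max(best, k)
  }
  if (best == 0L) return(NA_character_)
  paste(labels[(n - best + 1):n], collapse = ".")
}

toy_public_suffix("a.b.kobe.jp")  # "b.kobe.jp" (wildcard *.kobe.jp)
toy_public_suffix("www.ck")       # "ck"        (exception !www.ck)
toy_public_suffix("foo.ck")       # "foo.ck"    (wildcard *.ck)
```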

Cache changes

  - The per-host domain and tld memoization caches have been removed;
    pslr caches its own query results. rurl_cache_config() and
    rurl_cache_info() now cover only full_parse, puny_encode, and
    puny_decode, and the domain / tld arguments to rurl_cache_config()
    no longer exist.

                 Changes in version 1.2.0 (2026-06-19)                  

Dependencies

  - punycoder (used for IDNA/Punycode encoding and decoding) is now on
    CRAN. DESCRIPTION requires punycoder (>= 1.0.0).

Behavior changes

  - The package-wide default for case_handling is now "lower_host" (was
    "keep" for safe_parse_url(), safe_parse_urls(), get_clean_url(), and
    the get_*() accessors, and "lower" for get_path()). This is the
    RFC 3986 §6.2.2.1 normalization: the case-insensitive scheme and
    host fold to lowercase while the case-sensitive path is preserved.
    With the previous defaults, hosts such as WWW.Example.COM and
    www.example.com did not fold to one identity, and get_path()
    silently lowercased paths (two pages that differ only by path casing
    collapsed to one). Pass case_handling = "keep" to restore the
    previous reconstruction, or "lower" to lowercase the whole URL
    including the path. (RURL-lzepdnmm)
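
As a rough illustration of what "lower_host" means, the base-R sketch
below lowercases only the scheme and authority and leaves the path,
query, and fragment untouched. fold_scheme_and_host() is a hypothetical
helper, not part of rurl, and it assumes absolute URLs of the form
scheme://host[:port]/... without userinfo (which is case-sensitive and
would need separate handling).

```r
# Hypothetical helper: RFC 3986 section 6.2.2.1 style case folding of the
# scheme and authority only. Inputs that do not start with "scheme://"
# are returned unchanged.
fold_scheme_and_host <- function(url) {
  # With perl = TRUE, "\\L" lowercases the rest of the replacement,
  # which here is the captured "scheme://authority" prefix.
  sub("^([A-Za-z][A-Za-z0-9+.-]*://[^/?#]*)", "\\L\\1", url, perl = TRUE)
}

fold_scheme_and_host("HTTP://WWW.Example.COM/Docs/Page")
# "http://www.example.com/Docs/Page"
```

Because the path is preserved, two URLs that differ only in path casing
stay distinct, while hosts that differ only in casing fold to one value.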

                        Changes in version 1.1.0                        

New features

  - canonical_join() gains name_A / name_B arguments to set the output
    original-URL column names explicitly. They default to NULL,
    preserving the previous deparse(substitute()) behavior; supply them
    for stable names when piping or passing anonymous inputs (e.g.
    canonical_join(df[df$x > 1, ], get_b())), which otherwise produced
    unstable column names. (RURL-fsygrelr)

  - canonical_join() gains a join_parse_status argument controlling
    which parse statuses yield joinable keys. The default "ok" preserves
    the previous behavior (only ok* statuses join); "ok_or_warning"
    additionally treats the parseable-but-suspicious warning-* statuses
    (warning-no-tld, warning-invalid-tld, warning-public-suffix) as
    joinable, at the cost of more potential false-positive matches.
    (RURL-edqdrvfu)

  - Cache introspection and configuration. rurl_cache_info() reports the
    entry count, enabled state, and any bound for each memoization cache
    (full_parse, domain, tld). rurl_cache_config() enables or disables
    individual caches and sets an optional max_full_parse bound on the
    full-parse cache (default Inf, preserving the previous unbounded
    behavior); when the bound is reached the cache is reset so peak
    memory stays bounded. The domain and tld caches remain unbounded by
    design — they grow with the number of unique hosts, not with
    URL/option combinations — and can be disabled for workloads with
    very many unique hosts. (RURL-iuotpaqs)
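
The reset-on-bound policy described for the full-parse cache can be
sketched in base R as follows. This is an illustrative stand-in, not
rurl's cache code: make_bounded_cache() and its fields are hypothetical
names, and resetting just before the entry that would exceed the bound
is an assumption about the exact timing.

```r
# Hypothetical bounded memoization cache: when the bound is reached the
# whole cache is cleared (no per-entry eviction), so the entry count
# never exceeds max_entries.
make_bounded_cache <- function(max_entries = Inf) {
  store <- new.env(hash = TRUE, parent = emptyenv())
  size <- function() length(ls(store, all.names = TRUE))
  list(
    get_or_compute = function(key, compute) {
      if (exists(key, envir = store, inherits = FALSE)) {
        return(get(key, envir = store, inherits = FALSE))
      }
      value <- compute(key)
      if (size() >= max_entries) {
        rm(list = ls(store, all.names = TRUE), envir = store)
      }
      assign(key, value, envir = store)
      value
    },
    size = size
  )
}

cache <- make_bounded_cache(max_entries = 2)
cache$get_or_compute("a", toupper)
cache$get_or_compute("b", toupper)
cache$get_or_compute("c", toupper)  # bound reached: reset, then store "c"
cache$size()                        # 1
```

A full reset is cheaper to implement than least-recently-used eviction
and still keeps peak memory bounded, at the cost of recomputing entries
that were cleared.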

Bug fixes

  - safe_parse_url() now returns port as an integer (or NA_integer_),
    and safe_parse_urls() no longer errors on URLs that contain an
    explicit port (e.g. http://example.com:8080/path). Previously the
    scalar parser returned the port as a character string and the
    vectorized parser aborted. (RURL-fxyzanfg)
  - Bracketed IPv6 hosts (e.g. http://[2001:db8::1]/) are now correctly
    detected as IP hosts: is_ip_host is TRUE, parse_status is "ok", and
    no TLD/domain derivation is attempted — matching how IPv4 hosts were
    already handled. An over-escaped detection pattern previously
    prevented this. (RURL-jpqjndld)
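
For illustration only, a bracketed IPv6 host can be recognized with a
pattern along these lines. is_bracketed_ipv6() is a hypothetical helper
and a deliberately loose check (it does not validate the address); it
is not the pattern rurl uses. Note that in an R string a literal bracket
is written "\\[", so any extra level of escaping changes what the regex
matches.

```r
# Hypothetical helper: TRUE for hosts of the form "[...]" whose content
# consists of hex digits, colons, and dots and contains at least one
# colon.
is_bracketed_ipv6 <- function(host) {
  grepl("^\\[[0-9A-Fa-f:.]+\\]$", host) & grepl(":", host, fixed = TRUE)
}

is_bracketed_ipv6(c("[2001:db8::1]", "example.com", "192.0.2.1"))
# TRUE FALSE FALSE
```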

Behavior changes (potentially breaking)

  - subdomain_levels_to_keep = N (for N > 0) now keeps the N rightmost
    subdomain labels as documented, instead of silently retaining all
    subdomains. For example,
    safe_parse_url("http://deep.sub.domain.example.com",
    subdomain_levels_to_keep = 1) now returns host domain.example.com
    (was deep.sub.domain.example.com). N = 0 (strip all) is unchanged.
    Code that relied on the previous no-op behavior for N > 0 will see
    different output. (RURL-szumhumv)
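
The corrected trimming rule can be sketched in base R. The helper below
is hypothetical and takes the registered domain as an input (in rurl it
would come from the Public Suffix List lookup); it assumes the host ends
with that registered domain.

```r
# Hypothetical helper: keep the n rightmost subdomain labels in front of
# the registered domain.
keep_subdomain_levels <- function(host, registered_domain, n) {
  host_labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  domain_labels <- strsplit(registered_domain, ".", fixed = TRUE)[[1]]
  n_sub <- max(length(host_labels) - length(domain_labels), 0)
  sub_labels <- host_labels[seq_len(n_sub)]
  kept <- utils::tail(sub_labels, n)  # n = 0 keeps none; large n keeps all
  paste(c(kept, domain_labels), collapse = ".")
}

keep_subdomain_levels("deep.sub.domain.example.com", "example.com", 1)
# "domain.example.com"
keep_subdomain_levels("deep.sub.domain.example.com", "example.com", 0)
# "example.com"
```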

Documentation

  - Documented clean_url composition: it is a normalized canonical key
    built from scheme, host, and path only. Port, query, fragment, and
    userinfo are intentionally excluded, and with path_encoding =
    "decode" the path is shown decoded (human-readable, not guaranteed
    URL-safe). This matches the existing behavior and the key used by
    canonical_join() — no behavior change. Corrected a lower_host
    description that implied userinfo could be retained in clean_url,
    and fixed a README example whose input contained a literal space
    (now percent-encoded) so it parses as documented. (RURL-jnboujtd)

                        Changes in version 0.3.0                        

This release adds URL normalization controls and canonical dataset
joining, and makes handling of malformed or inconsistent URLs more
robust.

Highlights

  - New case_handling and trailing_slash_handling parameters in
    safe_parse_url() and get_clean_url() provide greater control over
    URL formatting.
  - Introduced canonical_join() for joining datasets on normalized URL
    keys.
  - Improved handling of non-standard or malformed schemes like htp://.
  - Fixed parsing for schemeless URLs with ports (e.g.,
    example.com:8080/path).
  - More reliable fallback when curl::curl_parse_url() fails internally.
  - Corrected regular expressions for IPv6 parsing.

                        Changes in version 0.2.0                        

  - First version for a potential CRAN submission.
  - Fully tested across macOS, Windows, and Linux.
  - Achieved 100% unit test coverage.
  - Improved README and documentation.

This release adds robust support for internationalized domain names
(IDNs), improves punycode handling, and ensures accurate extraction of
TLDs and registered domains.

Highlights

  - Accurate TLD extraction for both ASCII and Unicode domains
  - Graceful fallback when urltools is unavailable
  - NFC normalization with stringi
  - 100% test coverage with edge cases and punycode validation
  - Improved internal helpers and clearer test diagnostics

                        Changes in version 0.1.3                        

Improvements

  - Removed the dependency on the psl package.
  - Implemented an internal registered domain extraction using the
    Public Suffix List.
  - Added internal update_psl.R script to fetch and process the PSL
    during development.
  - Improved test coverage to 100%.
  - Cleaned up exports and internal helpers.
  - Updated ignores.
  - Tested on macOS, Windows, and Linux via rhub and win-builder.
  - CRAN checks pass with 0 errors/warnings and only standard notes.

Documentation

  - README updated to reflect the use of the PSL and internal domain
    logic.
  - LICENSE and attribution clarified for MIT + Mozilla Public Suffix
    List.

                        Changes in version 0.1.2                        

Stabilization & Coverage

  - Achieved 100% test coverage.
  - Added examples to all exported functions.
  - Improved documentation (@param, @return, etc.) for CRAN compliance.
  - Cleaned up NAMESPACE and removed unnecessary functions like hello().
  - Refined URL parsing logic and improved output consistency.

                        Changes in version 0.1.0                        

  - All get_*() functions are now vectorized and work on character
    vectors.
  - Deprecated scalar-only behavior.
  - Internal parsing made more robust using curl and psl.
  - Ready for use in mutate() and other tidy workflows.