| Title: | Parse, Clean, and Normalize URLs |
|---|---|
| Description: | A lightweight toolkit for extracting structured information from URLs. Includes functions for parsing, normalizing protocols, extracting domains, and constructing clean URLs. Domain and public-suffix extraction is delegated to the 'pslr' package, which implements the Public Suffix List from <https://publicsuffix.org>. Punycode and IDNA encoding is handled by the 'punycoder' package. |
| Authors: | Bart Turczynski [aut, cre] (ORCID: <https://orcid.org/0000-0002-8788-7980>) |
| Maintainer: | Bart Turczynski <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.4.0 |
| Built: | 2026-06-21 21:12:40 UTC |
| Source: | https://github.com/bart-turczynski/rurl |
Performs a join between two data frames by canonicalizing URLs to a shared
"clean" format using safe_parse_urls and then matching on
that key.
This is suitable for large crawl exports.
canonical_join(
  data_A,
  data_B,
  col_A = "URL",
  col_B = "URL",
  suffix_A = "_A",
  suffix_B = "_B",
  name_A = NULL,
  name_B = NULL,
  join = c("inner", "left", "right", "full"),
  collision = c("first", "all", "error"),
  on_parse_error = c("keep", "drop", "error"),
  join_parse_status = c("ok", "ok_or_warning"),
  ...
)
| Argument | Description |
|---|---|
| data_A | A data frame containing URLs for the left side of the join. |
| data_B | A data frame containing URLs for the right side of the join. |
| col_A | Character string, the name of the column in data_A that contains the URLs. |
| col_B | Character string, the name of the column in data_B that contains the URLs. |
| suffix_A | Character string, suffix to append to column names from data_A. |
| suffix_B | Character string, suffix to append to column names from data_B. |
| name_A | Character string, the name of the output column holding the original data_A URLs. |
| name_B | Character string, the name of the output column holding the original data_B URLs. |
| join | Join type: "inner", "left", "right", or "full". |
| collision | How to handle duplicate canonical keys within inputs: "first", "all", or "error". |
| on_parse_error | How to handle URLs that fail canonicalization: "keep", "drop", or "error". |
| join_parse_status | Which parse statuses yield joinable canonical keys: "ok" or "ok_or_warning". |
| ... | Additional arguments forwarded to the canonicalization step (safe_parse_urls). |
A data frame representing the join. The output includes:
- The original URL columns (named via name_A / name_B, or after the input expressions when those are NULL).
- JoinKey: the canonicalized URL used for matching.
- All other columns from data_A and data_B, with suffixes applied.
Returns an empty data frame with the expected structure if no matches are found or if inputs are invalid.
A <- data.frame(
  URL = c("http://Example.com/Page", "http://example.com/Other"),
  ValA = 1:2,
  stringsAsFactors = FALSE
)
B <- data.frame(
  URL = c("https://www.example.com/Page/", "http://example.com/Miss"),
  ValB = c("x", "y"),
  stringsAsFactors = FALSE
)
canonical_join(
  A, B,
  protocol_handling = "strip",
  www_handling = "strip",
  case_handling = "lower_host",
  trailing_slash_handling = "strip"
)
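The canonicalize-then-match idea can be sketched in base R. This is only an illustration under stated assumptions: `toy_key()` is a hypothetical stand-in for the canonicalization step and is not part of the package. It strips the scheme, a leading "www.", and trailing slashes, and lowercases the host, which roughly mirrors the options used in the example above. The real function builds its key via safe_parse_urls.

```r
# Hypothetical stand-in for the canonicalization step (not the package's code).
toy_key <- function(url) {
  key <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)   # strip scheme
  key <- sub("^www\\.", "", key, ignore.case = TRUE)    # strip leading www.
  key <- sub("/+$", "", key)                            # strip trailing slash(es)
  slash <- as.integer(regexpr("/", key, fixed = TRUE))  # first "/" starts the path
  has_path <- slash > 0
  host <- ifelse(has_path, substr(key, 1, slash - 1), key)
  path <- ifelse(has_path, substr(key, slash, nchar(key)), "")
  paste0(tolower(host), path)                           # fold host, keep path case
}

A <- data.frame(
  URL = c("http://Example.com/Page", "http://example.com/Other"),
  ValA = 1:2,
  stringsAsFactors = FALSE
)
B <- data.frame(
  URL = c("https://www.example.com/Page/", "http://example.com/Miss"),
  ValB = c("x", "y"),
  stringsAsFactors = FALSE
)

A$JoinKey <- toy_key(A$URL)
B$JoinKey <- toy_key(B$URL)

# Inner join on the shared key; the clashing URL columns receive the suffixes.
joined <- merge(A, B, by = "JoinKey", suffixes = c("_A", "_B"))
joined
```

Only the "/Page" rows share a key here, so the sketch yields a single joined row with URL_A and URL_B holding the two original spellings.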
This function returns the cleaned version of the URLs after applying
protocol, www, case, and trailing slash handling rules. The result is a
normalized canonical key composed of scheme, host, and path only; port,
query, fragment, and userinfo are intentionally excluded (use
get_port, get_query, get_fragment,
or get_userinfo for those).
get_clean_url(
  url,
  protocol_handling = "keep",
  www_handling = "none",
  source = c("all", "private", "icann"),
  case_handling = "lower_host",
  trailing_slash_handling = "none",
  index_page_handling = "keep",
  path_normalization = "none",
  scheme_relative_handling = "keep",
  subdomain_levels_to_keep = NULL,
  host_encoding = "keep",
  path_encoding = "keep"
)
| Argument | Description |
|---|---|
| url | A character vector containing URLs to be parsed. |
| protocol_handling | A character string specifying how to handle protocols. Defaults to "keep". |
| www_handling | A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none". |
| source | Which PSL source to use: "all", "private", or "icann". Subdomain trimming depends on which section is consulted, so pass |
| case_handling | A character string specifying how to handle the case of the cleaned URL. Defaults to "lower_host", the RFC 3986 §6.2.2.1 normalization (scheme and host are case-insensitive and folded to lowercase; the path is case-sensitive and preserved). |
| trailing_slash_handling | A character string specifying how to handle trailing slashes in the path component of the cleaned URL. Defaults to "none". |
| index_page_handling | A character string specifying how to handle index/default pages. Defaults to "keep". |
| path_normalization | How to normalize path structure. Defaults to "none". |
| scheme_relative_handling | How to handle URLs starting with "//". Defaults to "keep". |
| subdomain_levels_to_keep | An integer or NULL. Determines how many levels of subdomains are kept, in addition to any 'www.' prefix handled by 'www_handling'. |
| host_encoding | How to present the host in 'clean_url'. Defaults to "keep". |
| path_encoding | How to handle percent-encoding in the path for 'clean_url'. Defaults to "keep". |
A character vector of cleaned URLs.
get_clean_url("Example.COM/Path") # Default lower_host: host folds, path kept
get_clean_url(
  "Example.COM/Path",
  case_handling = "keep",
  trailing_slash_handling = "keep"
)
get_clean_url(
  "Example.COM/Path/",
  case_handling = "upper",
  trailing_slash_handling = "strip"
)
get_clean_url("http://example.com", www_handling = "strip")
get_clean_url(
  "http://deep.sub.domain.example.com/path",
  subdomain_levels_to_keep = 0
) # -> "http://example.com/path"
get_clean_url(
  "http://www.deep.sub.domain.example.com/path",
  subdomain_levels_to_keep = 1,
  www_handling = "strip"
) # -> "http://domain.example.com/path"
get_clean_url(
  "http://www.deep.sub.domain.example.com/path",
  subdomain_levels_to_keep = 1,
  www_handling = "keep"
) # -> "http://www.domain.example.com/path"
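The "lower_host" rule described for case_handling (scheme and host folded to lowercase, path preserved) can be illustrated with a few lines of base R. `fold_scheme_host()` is a hypothetical helper written for this sketch, not a package function. It assumes an explicit "scheme://" prefix and lowercases the whole authority, so it is not suitable for URLs that carry case-sensitive userinfo.

```r
# Hypothetical helper (not part of the package): lowercases the scheme and
# authority of a URL and leaves the path, query, and fragment untouched.
# Inputs without a "scheme://" prefix yield NA in this simplified sketch.
fold_scheme_host <- function(url) {
  m <- regexec("^([A-Za-z][A-Za-z0-9+.-]*://[^/?#]*)(.*)$", url)
  parts <- regmatches(url, m)
  vapply(parts, function(p) {
    if (length(p) == 0L) return(NA_character_)  # pattern did not match
    paste0(tolower(p[2]), p[3])                 # fold prefix, keep the rest
  }, character(1))
}

folded <- fold_scheme_host(c("HTTP://Example.COM/Path/Index.HTML", "Example.COM/Path"))
folded
```

The first element comes back as "http://example.com/Path/Index.HTML": only the part before the path changes case.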
Extracts the registered domain name from a URL (e.g., "example.com"). Relies on the Public Suffix List.
get_domain(
  url,
  protocol_handling = "keep",
  www_handling = "none",
  subdomain_levels_to_keep = NULL,
  source = c("all", "private", "icann"),
  host_encoding = c("keep", "idna", "unicode")
)
| Argument | Description |
|---|---|
| url | A character vector of URLs. |
| protocol_handling | A character string specifying how to handle protocols. Defaults to "keep". |
| www_handling | A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none". |
| subdomain_levels_to_keep | An integer or NULL. Determines how many levels of subdomains are kept, in addition to any 'www.' prefix handled by 'www_handling'. |
| source | Which PSL source to use: "all", "private", or "icann". |
| host_encoding | How to present the host in 'clean_url'. Defaults to "keep". |
A character vector of domain names.
get_domain("http://www.example.co.uk/path")
Extracts the fragment component of a URL.
get_fragment(url, protocol_handling = "keep")
| Argument | Description |
|---|---|
| url | A character vector of URLs. |
| protocol_handling | A character string specifying how to handle protocols. Defaults to "keep". |
A character vector of fragments.
get_fragment("http://example.com/path#section")
Extracts the host component of a URL.
get_host(
  url,
  protocol_handling = "keep",
  www_handling = "none",
  source = c("all", "private", "icann"),
  subdomain_levels_to_keep = NULL,
  case_handling = c("lower", "keep", "upper", "lower_host"),
  host_encoding = c("keep", "idna", "unicode")
)
| Argument | Description |
|---|---|
| url | A character vector of URLs. |
| protocol_handling | A character string specifying how to handle protocols. Defaults to "keep". |
| www_handling | A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none". |
| source | Which PSL source to use: "all", "private", or "icann". Subdomain trimming depends on which section is consulted, so pass |
| subdomain_levels_to_keep | An integer or NULL. Determines how many levels of subdomains are kept, in addition to any 'www.' prefix handled by 'www_handling'. |
| case_handling | How to handle casing of the returned host. Defaults to "lower". |
| host_encoding | How to present the host in 'clean_url'. Defaults to "keep". |
A character vector of URL hosts.
get_host("http://sub.example.com:8080")
get_host(
  "http://www.two.one.example.com",
  subdomain_levels_to_keep = 1
) # Result: "www.one.example.com"
get_host(
  "http://www.two.one.example.com",
  www_handling = "strip",
  subdomain_levels_to_keep = 1
) # Result: "one.example.com"
get_host(
  "http://www.two.one.example.com",
  www_handling = "keep",
  subdomain_levels_to_keep = 1
) # Result: "www.one.example.com"
get_host(
  "http://three.two.one.example.com",
  subdomain_levels_to_keep = 0
) # Result: "example.com"
get_host(
  "http://www.three.two.one.example.com",
  subdomain_levels_to_keep = 0
) # Result: "www.example.com"
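The trimming that subdomain_levels_to_keep performs in the examples above (keep the N labels closest to the registered domain) can be sketched in base R. `keep_subdomain_levels()` is a hypothetical helper, not the package's implementation: it takes the registered domain as an argument rather than deriving it from the Public Suffix List, and it ignores the interplay with www_handling.

```r
# Hypothetical helper (not the package's implementation). `registered_domain`
# is supplied by the caller here; the package derives it from the PSL.
keep_subdomain_levels <- function(host, registered_domain, n) {
  suffix <- paste0(".", registered_domain)
  if (!endsWith(host, suffix)) return(host)        # no subdomain to trim
  sub_part <- substr(host, 1, nchar(host) - nchar(suffix))
  labels <- strsplit(sub_part, ".", fixed = TRUE)[[1]]
  kept <- utils::tail(labels, n)                   # labels nearest the domain
  paste(c(kept, registered_domain), collapse = ".")
}

keep_subdomain_levels("three.two.one.example.com", "example.com", 0)
# "example.com"
keep_subdomain_levels("two.one.example.com", "example.com", 1)
# "one.example.com"
```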
Get the parse status of URLs
get_parse_status(
  url,
  protocol_handling = "keep",
  www_handling = "none",
  subdomain_levels_to_keep = NULL,
  source = c("all", "private", "icann")
)
| Argument | Description |
|---|---|
| url | A character vector of URLs to be parsed. |
| protocol_handling | A character string specifying how to handle protocols. Defaults to "keep". |
| www_handling | A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none". |
| subdomain_levels_to_keep | An integer or NULL. Determines how many levels of subdomains are kept, in addition to any 'www.' prefix handled by 'www_handling'. |
| source | Which PSL source to use: "all", "private", or "icann". Warning statuses such as |
A character vector with the parse status of each URL.
get_parse_status(
  c("http://example.com", "ftp://example.com", "mailto:[email protected]")
)
get_parse_status(c("http://example.com", "not-a-url"))
get_parse_status("http://example.com", source = "icann")
Extracts the password component of a URL.
get_password(url, protocol_handling = "keep")
| Argument | Description |
|---|---|
| url | A character vector of URLs. |
| protocol_handling | A character string specifying how to handle protocols. Defaults to "keep". |
A character vector of passwords.
get_password("ftp://alice:[email protected]/file.txt")
Extracts the path component of a URL.
get_path(
  url,
  protocol_handling = "keep",
  case_handling = c("lower_host", "keep", "lower", "upper"),
  trailing_slash_handling = c("none", "keep", "strip"),
  index_page_handling = c("keep", "strip"),
  path_normalization = c("none", "collapse_slashes", "dot_segments", "both"),
  path_encoding = c("keep", "encode", "decode")
)
| Argument | Description |
|---|---|
| url | A character vector of URLs. |
| protocol_handling | A character string specifying how to handle protocols. Defaults to "keep". |
| case_handling | How to handle casing of the returned path. Defaults to "lower_host", which preserves the path's original casing (paths are case-sensitive per RFC 3986 §6.2.2.1). Use "lower"/"upper" to force a case. |
| trailing_slash_handling | A character string specifying how to handle trailing slashes in the path component of the cleaned URL. Defaults to "none". |
| index_page_handling | A character string specifying how to handle index/default pages. Defaults to "keep". |
| path_normalization | How to normalize path structure. Defaults to "none". |
| path_encoding | How to handle percent-encoding in the path for 'clean_url'. Defaults to "keep". |
A character vector of URL paths.
get_path("http://example.com/some/path?query=1")
Extracts the port component of a URL.
get_port(url, protocol_handling = "keep")
| Argument | Description |
|---|---|
| url | A character vector of URLs. |
| protocol_handling | A character string specifying how to handle protocols. Defaults to "keep". |
An integer vector of ports.
get_port("http://example.com:8080/path")
Extracts the query component of a URL, optionally parsing it into a list.
get_query(
  url,
  protocol_handling = "keep",
  format = c("string", "list"),
  decode = TRUE
)
| Argument | Description |
|---|---|
| url | A character vector of URLs. |
| protocol_handling | A character string specifying how to handle protocols. Defaults to "keep". |
| format | Return format: "string" (default) or "list" for parsed elements. |
| decode | Logical; if TRUE and format="list", percent-decodes keys/values. |
A character vector (format="string") or list (format="list").
get_query("http://example.com/path?a=1&b=2")
get_query("http://example.com/path?a=1&b=2", format = "list")
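To make the format = "list" and decode options more concrete, here is a base R sketch of splitting a query string into key/value pairs. `parse_query_sketch()` is a hypothetical helper and not the package's parser. It handles a single query string, treats a key without "=" as having an empty value, and uses utils::URLdecode, which does not turn "+" into a space.

```r
# Hypothetical helper (not the package's implementation): parses one query
# string into a named list, optionally percent-decoding keys and values.
parse_query_sketch <- function(query, decode = TRUE) {
  pairs <- strsplit(query, "&", fixed = TRUE)[[1]]
  pairs <- pairs[nzchar(pairs)]                      # drop empty segments
  keys <- sub("=.*$", "", pairs)                     # text before the first "="
  has_value <- grepl("=", pairs, fixed = TRUE)
  values <- ifelse(has_value, sub("^[^=]*=", "", pairs), "")
  if (decode) {
    keys <- vapply(keys, utils::URLdecode, character(1), USE.NAMES = FALSE)
    values <- vapply(values, utils::URLdecode, character(1), USE.NAMES = FALSE)
  }
  stats::setNames(as.list(values), keys)
}

parsed <- parse_query_sketch("a=1&b=hello%20world")
parsed
```

With decoding on, the value of `b` is "hello world"; with `decode = FALSE` it stays "hello%20world".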
Extracts the scheme (protocol) of a URL.
get_scheme(url, protocol_handling = "keep", scheme_relative_handling = "keep")
| Argument | Description |
|---|---|
| url | A character vector of URLs. |
| protocol_handling | A character string specifying how to handle protocols. Defaults to "keep". |
| scheme_relative_handling | How to handle URLs starting with "//". Defaults to "keep". |
A character vector of URL schemes.
get_scheme("https://example.com")
Extracts the subdomain component of a URL.
get_subdomain(
  url,
  protocol_handling = "keep",
  www_handling = "none",
  source = c("all", "private", "icann"),
  include_www = FALSE,
  format = c("string", "labels"),
  host_encoding = c("keep", "idna", "unicode")
)
| Argument | Description |
|---|---|
| url | A character vector of URLs. |
| protocol_handling | A character string specifying how to handle protocols. Defaults to "keep". |
| www_handling | A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none". |
| source | Which PSL source to use: "all", "private", or "icann". |
| include_www | Logical; if FALSE (default), removes a leading www/www[0-9]* label only when it is the sole subdomain label. |
| format | Return format: "string" (default) or "labels" for a character vector of labels. |
| host_encoding | How to present the host in 'clean_url'. Defaults to "keep". |
A character vector (format="string") or list of label vectors (format="labels").
get_subdomain("http://www.blog.example.co.uk")
get_subdomain("http://www.blog.example.co.uk", format = "labels")
Uses safe_parse_url internally to extract the TLD, benefiting from all memoization layers for improved performance.
get_tld(
  url,
  source = c("all", "private", "icann"),
  host_encoding = c("keep", "idna", "unicode")
)
| Argument | Description |
|---|---|
| url | A character vector of URLs. |
| source | Which TLD source to use: "all", "icann", or "private". |
| host_encoding | How to present the host in 'clean_url'. Defaults to "keep". |
A character vector of TLDs.
get_tld("example.com")
Extracts the user component of a URL.
get_user(url, protocol_handling = "keep")
| Argument | Description |
|---|---|
| url | A character vector of URLs. |
| protocol_handling | A character string specifying how to handle protocols. Defaults to "keep". |
A character vector of user names.
get_user("ftp://alice:[email protected]/file.txt")
Extracts the userinfo component of a URL (user or user:password).
get_userinfo(url, protocol_handling = "keep")
| Argument | Description |
|---|---|
| url | A character vector of URLs. |
| protocol_handling | A character string specifying how to handle protocols. Defaults to "keep". |
A character vector of userinfo values.
get_userinfo("ftp://alice:[email protected]/file.txt")
get_userinfo("ftp://[email protected]/file.txt")
Enables or disables individual caches and sets an optional bound on the
full_parse cache. Called with no arguments, it leaves the
configuration unchanged and returns the current state.
rurl_cache_config(
  full_parse = NULL,
  puny_encode = NULL,
  puny_decode = NULL,
  max_full_parse = NULL
)
| Argument | Description |
|---|---|
| full_parse | Logical; enable/disable the full URL parse cache. |
| puny_encode | Logical; enable/disable the IDNA/Punycode encode cache. |
| puny_decode | Logical; enable/disable the Punycode decode cache. |
| max_full_parse | A single number ( |
Disabling a cache stops new writes to it (existing entries are left in
place until rurl_clear_caches is called). When
full_parse reaches max_full_parse entries, the entire
cache is cleared before the next new entry is stored, so its peak size never
exceeds the bound. This is a hard reset-watermark, not an LRU or FIFO
eviction policy: max_full_parse caps peak memory, but is not a
working-set size — once the bound is hit the cache empties completely and
rebuilds from scratch. The default of Inf preserves the historical
unbounded behavior. The
puny_encode and puny_decode caches are unbounded by design
(each stays small — bounded by the number of unique hosts/labels seen, not
URL+option combinations).
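The reset-watermark behavior described above can be illustrated with a small environment-backed cache. This is a minimal sketch with hypothetical names (`make_watermark_cache()` and its `get`/`set`/`size` closures); it is not the package's implementation and only shows how a hard reset differs from LRU or FIFO eviction.

```r
# Minimal sketch of a reset-watermark cache (illustrative only; all names
# here are hypothetical and this is not the package's implementation).
make_watermark_cache <- function(max_entries = Inf) {
  store <- new.env(parent = emptyenv())
  keys <- function() ls(store, all.names = TRUE)
  list(
    get = function(key) {
      if (exists(key, envir = store, inherits = FALSE)) store[[key]] else NULL
    },
    set = function(key, value) {
      is_new <- !exists(key, envir = store, inherits = FALSE)
      # Bound reached: empty the whole cache before storing the new entry,
      # so the peak size never exceeds max_entries.
      if (is_new && length(keys()) >= max_entries) {
        rm(list = keys(), envir = store)
      }
      assign(key, value, envir = store)
      invisible(NULL)
    },
    size = function() length(keys())
  )
}

cache <- make_watermark_cache(max_entries = 2)
cache$set("a", 1)
cache$set("b", 2)   # size is now 2, the bound
cache$set("c", 3)   # bound hit: "a" and "b" are dropped, then "c" is stored
cache$size()        # 1
```

After the third write only "c" remains, which is why the bound caps peak memory without describing a working-set size.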
Invisibly, the updated rurl_cache_info data.frame.
rurl_cache_info, rurl_clear_caches
rurl_cache_config(max_full_parse = 10000)
rurl_cache_config(puny_encode = FALSE)
rurl_cache_config() # inspect current configuration
Reports the number of entries currently held in each memoization cache, along with whether the cache is enabled and any configured entry bound.
rurl_cache_info()
A data.frame with one row per cache (full_parse,
puny_encode, puny_decode) and columns entries,
enabled, and max_entries.
rurl_cache_config, rurl_clear_caches
get_domain("https://www.example.com")
rurl_cache_info()
Clears the memoization caches used by rurl functions. This is useful if you need to free memory.
rurl_clear_caches()
Invisibly returns NULL.
rurl_clear_caches()
Vectorized wrapper around safe_parse_url that returns a
data.frame with one row per input URL.
safe_parse_urls(
  url,
  protocol_handling = c("keep", "none", "strip", "http", "https"),
  www_handling = c("none", "strip", "keep", "if_no_subdomain"),
  tld_source = c("all", "private", "icann"),
  case_handling = c("lower_host", "keep", "lower", "upper"),
  trailing_slash_handling = c("none", "keep", "strip"),
  index_page_handling = c("keep", "strip"),
  path_normalization = c("none", "collapse_slashes", "dot_segments", "both"),
  scheme_relative_handling = c("keep", "http", "https", "error"),
  subdomain_levels_to_keep = NULL,
  host_encoding = c("keep", "idna", "unicode"),
  path_encoding = c("keep", "encode", "decode")
)
| Argument | Description |
|---|---|
| url | A character vector of URLs to be parsed. |
| protocol_handling | A character string specifying how to handle protocols. Defaults to "keep". |
| www_handling | A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none". |
| tld_source | Which TLD source to use for TLD extraction: "all", "icann", or "private". Defaults to "all". |
| case_handling | A character string specifying how to handle the case of the cleaned URL. Defaults to "lower_host", the RFC 3986 §6.2.2.1 normalization (scheme and host are case-insensitive and folded to lowercase; the path is case-sensitive and preserved). |
| trailing_slash_handling | A character string specifying how to handle trailing slashes in the path component of the cleaned URL. Defaults to "none". |
| index_page_handling | A character string specifying how to handle index/default pages. Defaults to "keep". |
| path_normalization | How to normalize path structure. Defaults to "none". |
| scheme_relative_handling | How to handle URLs starting with "//". Defaults to "keep". |
| subdomain_levels_to_keep | An integer or NULL. Determines how many levels of subdomains are kept, in addition to any 'www.' prefix handled by 'www_handling'. |
| host_encoding | How to present the host in 'clean_url'. Defaults to "keep". |
| path_encoding | How to handle percent-encoding in the path for 'clean_url'. Defaults to "keep". |
A data.frame with one row per URL and the same fields returned by
safe_parse_url. Invalid inputs return NA fields with
parse_status = "error".
safe_parse_urls(c("example.com", "https://www.example.com/path"))