---
title: "Getting Started with rurl"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with rurl}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(rurl)
```

# Introduction

The `rurl` package provides tools to parse, normalize, and extract information from URLs using a consistent and safe API. It is fully vectorized and delegates domain handling to the `pslr` package, which implements the [Public Suffix List](https://publicsuffix.org) for accurate domain and TLD extraction.

# Safe URL Parsing

Use `safe_parse_url()` to parse URLs robustly:

```{r}
safe_parse_url("https://sub.example.co.uk/path?q=1")
```

The `protocol_handling` argument controls how schemes are handled:

- `"keep"` (default; keeps the current protocol or prepends `http://` if missing)
- `"none"` (doesn't add, remove, or change protocols)
- `"strip"` (removes protocols)
- `"http"` (changes protocols to `http://` or adds it if missing)
- `"https"` (changes protocols to `https://` or adds it if missing)

# Extracting URL Components

```{r}
get_scheme("https://sub.example.com")
get_host("https://sub.example.com")
get_path("https://sub.example.com/path/to/page")
```

Each function works on vectors of URLs and gracefully handles `NA`.

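
The vectorized, `NA`-tolerant behavior described above, together with the `protocol_handling` modes listed under Safe URL Parsing, can be illustrated with a small base R sketch. The helper `apply_protocol_handling()` is hypothetical and is not part of `rurl`; it only mirrors the documented mode descriptions with a regular expression and makes no claim about how the package implements them.

```r
# Hypothetical helper, NOT part of rurl: mimics the documented
# protocol_handling modes using base R only.
apply_protocol_handling <- function(url,
                                    mode = c("keep", "none", "strip", "http", "https")) {
  mode <- match.arg(mode)
  # A scheme is a letter followed by letters, digits, "+", "." or "-", then "://".
  scheme_re <- "^[A-Za-z][A-Za-z0-9+.-]*://"
  has_scheme <- grepl(scheme_re, url)
  bare <- sub(scheme_re, "", url)  # URL with any leading scheme removed
  out <- switch(
    mode,
    keep  = ifelse(has_scheme, url, paste0("http://", url)),
    none  = url,
    strip = bare,
    http  = paste0("http://", bare),
    https = paste0("https://", bare)
  )
  # paste0() would turn NA into "http://NA", so restore missing values.
  out[is.na(url)] <- NA_character_
  out
}

urls <- c("example.com", "https://example.com", NA)
apply_protocol_handling(urls, "keep")   # "http://example.com" "https://example.com" NA
apply_protocol_handling(urls, "strip")  # "example.com" "example.com" NA
apply_protocol_handling(urls, "https")  # "https://example.com" "https://example.com" NA
```

The sketch works on the whole vector at once and passes `NA` through unchanged, which is the contract the vignette describes for the real functions.
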
# Domain and TLD Parsing

These functions rely on the Public Suffix List:

```{r}
get_domain("https://a.b.example.co.uk")
```

Extracting TLDs from different sources:

```{r}
get_tld("https://foo.blogspot.com")
```

Sources include:

- `"all"` (default; matches the longest available TLD)
- `"private"` (only extracts private TLDs)
- `"icann"` (only extracts ICANN TLDs)

# Vectorization and Edge Cases

All core functions support vectors and handle malformed inputs safely:

```{r}
urls <- c("example.com", "http://example.com", NA)
get_clean_url(urls)
```

# Advanced Host Manipulation with `subdomain_levels_to_keep`

Several functions, including `safe_parse_url()`, `get_host()`, and `get_clean_url()`, support the `subdomain_levels_to_keep` argument. It gives fine-grained control over how many subdomain levels are preserved in the host component of a URL, _after_ the initial `www_handling` has been applied.

- `NULL` (default): No subdomain stripping is performed beyond `www_handling`.
- `0`: All subdomains are stripped. If `www_handling` preserved or added 'www.', it remains (e.g., 'www.sub.example.com' becomes 'www.example.com'; 'sub.example.com' becomes 'example.com').
- `N > 0`: Keeps up to N levels of subdomains, counted from right to left (closest to the registered domain), in addition to any 'www.' prefix.

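
The three cases above can be sketched in base R. The helper `keep_subdomain_levels()` is hypothetical and not part of `rurl`; it handles a single host and takes the registered domain as an argument, whereas the package derives the domain from the Public Suffix List. Treat it as an illustration of the rule under those assumptions, not as the package's implementation.

```r
# Hypothetical helper, NOT part of rurl. `domain` is the registered domain,
# passed in by hand here instead of being looked up in the Public Suffix List.
keep_subdomain_levels <- function(host, domain, levels = NULL) {
  stopifnot(length(host) == 1, length(domain) == 1)
  if (is.null(levels)) return(host)  # NULL: leave the host untouched

  # Set a leading "www." aside so it can be restored afterwards.
  has_www <- startsWith(host, "www.")
  rest <- if (has_www) substring(host, 5) else host
  stopifnot(endsWith(rest, domain))

  # Labels to the left of the registered domain, in left-to-right order.
  if (identical(rest, domain)) {
    sub_labels <- character(0)
  } else {
    prefix <- substring(rest, 1, nchar(rest) - nchar(domain) - 1)
    sub_labels <- strsplit(prefix, ".", fixed = TRUE)[[1]]
  }

  # Keep the `levels` labels closest to the registered domain.
  kept <- utils::tail(sub_labels, levels)
  paste(c(if (has_www) "www", kept, domain), collapse = ".")
}

keep_subdomain_levels("www.three.two.one.example.com", "example.com", 0)
# "www.example.com"
keep_subdomain_levels("three.two.one.example.com", "example.com", 1)
# "one.example.com"
keep_subdomain_levels("www.three.two.one.example.com", "example.com", 2)
# "www.two.one.example.com"
```

Counting from the right means that `levels = 1` keeps `one`, the label adjacent to `example.com`, rather than `three`.
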
Here are some examples demonstrating its effect on `get_host()`:

```{r}
get_host(
  "http://www.three.two.one.example.com",
  subdomain_levels_to_keep = 0
) # www_handling default is "none"
# Expected: "www.example.com"

get_host(
  "http://three.two.one.example.com",
  www_handling = "strip",
  subdomain_levels_to_keep = 0
)
# Expected: "example.com"

get_host("http://www.three.two.one.example.com", subdomain_levels_to_keep = 1)
# Expected: "www.one.example.com"

get_host(
  "http://three.two.one.example.com",
  www_handling = "strip",
  subdomain_levels_to_keep = 1
)
# Expected: "one.example.com"

get_host(
  "http://www.three.two.one.example.com",
  www_handling = "keep",
  subdomain_levels_to_keep = 2
)
# Expected: "www.two.one.example.com"
```

And its effect on `get_clean_url()`:

```{r}
get_clean_url(
  "http://www.deep.sub.example.com/some/path",
  subdomain_levels_to_keep = 0,
  www_handling = "keep"
)
# yields http://www.example.com/some/path

get_clean_url(
  "http://deep.sub.example.com/some/path",
  subdomain_levels_to_keep = 1
)
# yields http://sub.example.com/some/path
```

Note that `get_domain()` also accepts `subdomain_levels_to_keep`, but the argument does not change the *returned domain value*: the domain is derived from the host *before* subdomain trimming is applied. The argument affects only the host component used in other parts of the `safe_parse_url()` output, such as the `clean_url`.

# Summary

- Vectorized functions for parsing and cleaning URLs
- Uses the Public Suffix List for domain logic
- Unicode/punycode support

# See also

`rurl` is built on two sibling packages that are also available standalone:

- **[pslr](https://bart-turczynski.github.io/pslr/)**: Public Suffix List engine. Use it directly for eTLD and registrable-domain queries when you do not need full URL parsing.
- **[punycoder](https://github.com/bart-turczynski/punycoder)**: Punycode and IDNA codec for internationalized domain names. Useful for host normalization and Unicode ↔ ACE encoding outside the URL context.