---
title: "Getting Started with rurl"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with rurl}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(rurl)
```

# Introduction

The `rurl` package provides tools to parse, normalize, and extract information
from URLs using a consistent and safe API.  
It is fully vectorized and delegates domain handling to the `pslr` package,
which implements the [Public Suffix List](https://publicsuffix.org) for
accurate domain and TLD extraction.

# Safe URL Parsing

Use `safe_parse_url()` to parse URLs robustly:

```{r}
safe_parse_url("https://sub.example.co.uk/path?q=1")
```

The `protocol_handling` argument controls how schemes are handled:

- `"keep"` (default; keeps the current protocol or prepends `http://` if missing)
- `"none"` (doesn't add, remove, or change protocols)
- `"strip"` (removes protocols)
- `"http"` (changes protocols to `http://` or adds it if missing)
- `"https"` (changes protocols to `https://` or adds it if missing)

# Extracting URL Components

```{r}
get_scheme("https://sub.example.com")
get_host("https://sub.example.com")
get_path("https://sub.example.com/path/to/page")
```

Each function works on vectors of URLs and gracefully handles `NA`.

# Domain and TLD Parsing

These functions rely on the Public Suffix List:

```{r}
get_domain("https://a.b.example.co.uk")
```

Extracting TLDs from different sources:

```{r}
get_tld("https://foo.blogspot.com")
```

Sources include:
- `"all"` (default; will match to the longest available TLD)
- `"private"` (only extract private TLDs)
- `"icann"` (only extract ICANN TLDs)

# Vectorization and Edge Cases

All core functions support vectors and handle malformed inputs safely:

```{r}
urls <- c("example.com", "http://example.com", NA)
get_clean_url(urls)
```

# Advanced Host Manipulation with `subdomain_levels_to_keep`

Several functions, including `safe_parse_url()`, `get_host()`,
and `get_clean_url()`,
support the `subdomain_levels_to_keep` argument. This allows for fine-grained
control over how many subdomain levels are preserved
in the host component of a URL,
_after_ initial `www_handling` has been applied.

- `NULL` (Default):
  No specific subdomain stripping is performed beyond `www_handling`.
- `0`: All subdomains are stripped. If `www_handling` preserved or added 'www.',
  it remains
  (e.g., 'www.sub.example.com' becomes 'www.example.com';
  'sub.example.com' becomes 'example.com').
- `N > 0`: Keeps up to N levels of subdomains,
  counted from right-to-left (closest to the registered domain),
  in addition to any 'www.' prefix.

Here are some examples demonstrating its effect on `get_host()`:

```{r}
get_host(
  "http://www.three.two.one.example.com",
  subdomain_levels_to_keep = 0
) # www_handling default is "none"
# Expected: "www.example.com"

get_host(
  "http://three.two.one.example.com",
  www_handling = "strip",
  subdomain_levels_to_keep = 0
)
# Expected: "example.com"

get_host("http://www.three.two.one.example.com", subdomain_levels_to_keep = 1)
# Expected: "www.one.example.com"

get_host(
  "http://three.two.one.example.com",
  www_handling = "strip",
  subdomain_levels_to_keep = 1
)
# Expected: "one.example.com"

get_host(
  "http://www.three.two.one.example.com",
  www_handling = "keep",
  subdomain_levels_to_keep = 2
)
# Expected: "www.two.one.example.com"
```

And its effect on `get_clean_url()`:

```{r}
get_clean_url(
  "http://www.deep.sub.example.com/some/path",
  subdomain_levels_to_keep = 0,
  www_handling = "keep"
)
# yields http://www.example.com/some/path

get_clean_url(
  "http://deep.sub.example.com/some/path",
  subdomain_levels_to_keep = 1
)
# yields http://sub.example.com/some/path
```

Note that `get_domain()` also accepts `subdomain_levels_to_keep`,
but it does not
change the *returned domain value*. The domain is derived from the host *before*
this specific host modification occurs.
The parameter influences the host component
that might be used in other parts of the `safe_parse_url` output,
such as the `clean_url`.

# Summary

- Vectorized functions for parsing and cleaning URLs
- Uses the Public Suffix List for domain logic
- Unicode/punycode support

# See also

`rurl` is built on two sibling packages that are also available standalone:

- **[pslr](https://bart-turczynski.github.io/pslr/)** — Public Suffix List engine. Use it directly for eTLD and registrable-domain queries when you do not need full URL parsing.
- **[punycoder](https://github.com/bart-turczynski/punycoder)** — Punycode and IDNA codec for internationalized domain names. Useful for host normalization and Unicode ↔ ACE encoding outside the URL context.