| Title: | Unicode and Punycode Domain Name Processing |
|---|---|
| Description: | High-performance Unicode and Punycode encoding/decoding for internationalized domain names. Provides RFC 3492 compliant conversion functions with a focus on URL processing and data analysis workflows. Addresses limitations in existing R packages for handling international domain names in web scraping and URL parsing applications. |
| Authors: | Bart Turczynski [aut, cre] |
| Maintainer: | Bart Turczynski <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.0 |
| Built: | 2026-06-12 09:10:06 UTC |
| Source: | https://github.com/bart-turczynski/punycoder |
Provides high-performance functions for encoding and decoding internationalized domain names according to RFC 3492 (Punycode) and IDNA standards.
The punycoder package fills a critical gap in R's ecosystem for handling international domain names. It provides reliable, fast conversion between Unicode and ASCII representations of domain names.
Useful links:
Report bugs at https://github.com/bart-turczynski/punycoder/issues
Determines whether a domain name contains Unicode characters that would require punycode encoding for ASCII compatibility.

    is_idn(x)

| Argument | Description |
|---|---|
| x | Character vector of domain names to test |

A logical vector the same length as x, where TRUE
indicates the element contains non-ASCII Unicode characters.
is_punycode for detecting punycode domains,
puny_encode for encoding Unicode domains.

    is_idn("caf\u00E9.com") # TRUE
    is_idn("example.com") # FALSE
    is_idn(c(
      "caf\u00E9.com",
      "\u043C\u043E\u0441\u043A\u0432\u0430.\u0440\u0444",
      "test.com"
    )) # c(TRUE, TRUE, FALSE)

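The check described above can be approximated in base R. The following is a minimal sketch, not the package's implementation: the helper name `has_non_ascii` is hypothetical, and treating `NA` inputs as `FALSE` is an assumption, since the documentation does not say how `is_idn` handles missing values.

```r
# Hypothetical helper: TRUE when a string contains any code point above 127.
has_non_ascii <- function(x) {
  vapply(
    x,
    function(s) !is.na(s) && any(utf8ToInt(s) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

has_non_ascii(c("caf\u00E9.com", "example.com", NA))
```

Converting each string to code points with `utf8ToInt()` avoids locale-dependent regular expression classes.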
Determines whether a given string or domain name is already encoded in punycode format (starts with the xn-- prefix).

    is_punycode(x)

| Argument | Description |
|---|---|
| x | Character vector to test |

A logical vector the same length as x, where TRUE
indicates the element contains a punycode-encoded label (xn-- prefix).
is_idn for detecting Unicode domains,
puny_decode for decoding punycode domains.

    is_punycode("xn--example") # TRUE
    is_punycode("example.com") # FALSE
    is_punycode(c("xn--caf-dma.com", "regular.com")) # c(TRUE, FALSE)

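A comparable test can be sketched with a single regular expression. The helper name `has_ace_label` is hypothetical. Matching the prefix case-insensitively and at the start of any label (not only the first) are assumptions about what a reasonable check would do, not confirmed behaviour of `is_punycode`.

```r
# Hypothetical helper: TRUE when any dot-separated label starts with "xn--".
has_ace_label <- function(x) {
  !is.na(x) & grepl("(^|\\.)xn--", x, ignore.case = TRUE)
}

has_ace_label(c("xn--caf-dma.com", "regular.com", "www.xn--caf-dma.com"))
```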
Parses URLs and returns a structured list with proper handling of internationalized domain names. This function provides both Unicode and ASCII representations of domain components.

    parse_url(url, encode_domains = FALSE)

| Argument | Description |
|---|---|
| url | Character vector of URLs to parse |
| encode_domains | Logical flag; encode parsed host names to ASCII. |

An object of class "punycoder_parsed_url" (a named list)
with components:
Character vector of URL schemes (e.g., "https").
Character vector of domain names.
Integer vector of port numbers.
Character vector of URL paths.
Character vector of query strings.
Character vector of fragment identifiers.
Each component has one element per input URL. Invalid URLs yield
NA components. For valid URLs without an explicit path,
path is returned as "".
url_encode, url_decode for URL
transformation with IDN handling.

    # Parse URL with Unicode domain
    parse_url(
      "https://caf\u00E9.example.com:8080/path?query=value#fragment"
    )

    # Parse multiple URLs
    urls <- c(
      "https://caf\u00E9.com/menu",
      "https://\u043C\u043E\u0441\u043A\u0432\u0430.\u0440\u0444/info"
    )
    parse_url(urls)

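To illustrate how a URL breaks down into the components listed above, here is a minimal base R sketch built on the generic URI pattern from RFC 3986, Appendix B. It is not how `parse_url` is implemented. The function name `split_url` and the element name `host` are hypothetical, it handles one URL at a time, and it ignores user info and IPv6 literals.

```r
# Hypothetical helper: split a single URL into its components.
split_url <- function(url) {
  # Generic URI pattern from RFC 3986, Appendix B.
  pattern <- "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?"
  m <- regmatches(url, regexec(pattern, url))[[1]]
  authority <- m[5]
  has_port <- grepl(":[0-9]+$", authority)
  list(
    scheme = m[3],
    host = sub(":[0-9]+$", "", authority),
    port = if (has_port) as.integer(sub("^.*:", "", authority)) else NA_integer_,
    path = m[6],
    query = m[8],
    fragment = m[10]
  )
}

parts <- split_url("https://caf\u00E9.example.com:8080/path?query=value#fragment")
str(parts)
```

A URL without an explicit path, such as "https://example.com", yields an empty string for `path` in this sketch, which matches the behaviour documented above for valid URLs.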
Print method for punycoder parsed URL results

    ## S3 method for class 'punycoder_parsed_url'
    print(x, ...)

| Argument | Description |
|---|---|
| x | A punycoder_parsed_url object |
| ... | Additional arguments (ignored) |

Invisibly returns x.
Print method for punycoder validation results

    ## S3 method for class 'punycoder_validation'
    print(x, ...)

| Argument | Description |
|---|---|
| x | A punycoder_validation object |
| ... | Additional arguments (ignored) |

Invisibly returns x.
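Both print methods follow the usual S3 convention of printing a summary and returning the object invisibly. The sketch below shows that convention on a made-up class: the class name `demo_validation`, its fields, and the output format are all hypothetical and say nothing about how the package formats its own output.

```r
# Hypothetical class used only to demonstrate the print-method convention.
obj <- structure(
  list(domain = c("valid.com", "invalid..com"), valid = c(TRUE, FALSE)),
  class = "demo_validation"
)

print.demo_validation <- function(x, ...) {
  status <- ifelse(x$valid, "valid", "invalid")
  cat(sprintf("%-15s %s\n", x$domain, status), sep = "")
  # Returning the object invisibly keeps print() usable inside pipelines.
  invisible(x)
}

returned <- print(obj)
```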
Converts ASCII punycode domain names back to their Unicode representation. This is the reverse operation of puny_encode and is useful for displaying human-readable domain names.

    puny_decode(x, strict = getOption("punycoder.strict", TRUE))

| Argument | Description |
|---|---|
| x | Character vector of ASCII punycode domains to decode |
| strict | Logical; whether to apply strict validation. Defaults to 'getOption("punycoder.strict", TRUE)'. |

A character vector the same length as x, with each element
containing the Unicode-decoded domain name. Elements corresponding to
NA inputs are NA_character_. In non-strict mode, domains
that fail decoding are also returned as NA_character_.
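The default for strict is looked up with `getOption()`, so it can be changed once per session instead of on every call. The snippet below only demonstrates how base R resolves such a default; it does not call the package.

```r
# Remember the current setting so it can be restored afterwards.
old <- options(punycoder.strict = NULL)

# With the option unset, getOption() falls back to the supplied default.
default_strict <- getOption("punycoder.strict", TRUE)

# After setting the option, the same lookup returns the session-wide value.
options(punycoder.strict = FALSE)
session_strict <- getOption("punycoder.strict", TRUE)

# Restore the previous state.
options(old)
```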
puny_encode for the reverse operation,
url_decode for full URL decoding.

    # Basic decoding
    puny_decode("xn--caf-dma.com")
    puny_decode("xn--80adxhks.xn--p1ai")

    # Vectorized decoding
    ascii_domains <- c("xn--caf-dma.com", "xn--80adxhks.xn--p1ai")
    puny_decode(ascii_domains)

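For readers curious about what decoding involves, the following is a minimal base R sketch of the RFC 3492 decoding procedure applied label by label. It is an illustration under stated assumptions, not the package's code: the names `decode_label`, `decode_domain`, `adapt_bias`, and `threshold` are hypothetical, there is no overflow checking or IDNA validation, and it processes one domain at a time.

```r
# Punycode parameters from RFC 3492, section 5.
base <- 36; tmin <- 1; tmax <- 26; skew <- 38; damp <- 700

# Bias adaptation from RFC 3492, section 6.1.
adapt_bias <- function(delta, numpoints, firsttime) {
  delta <- if (firsttime) delta %/% damp else delta %/% 2
  delta <- delta + delta %/% numpoints
  k <- 0
  while (delta > ((base - tmin) * tmax) %/% 2) {
    delta <- delta %/% (base - tmin)
    k <- k + base
  }
  k + ((base - tmin + 1) * delta) %/% (delta + skew)
}

# Threshold for digit position k, clamped to [tmin, tmax].
threshold <- function(k, bias) min(max(k - bias, tmin), tmax)

decode_label <- function(label) {
  if (!startsWith(tolower(label), "xn--")) return(label)
  chars <- strsplit(substring(label, 5), "")[[1]]
  last_dash <- max(c(0, which(chars == "-")))
  # Basic (ASCII) code points are copied verbatim from before the last dash.
  out <- if (last_dash > 1) {
    utf8ToInt(paste(chars[seq_len(last_dash - 1)], collapse = ""))
  } else {
    integer(0)
  }
  vals <- match(tolower(chars[seq_along(chars) > last_dash]), c(letters, 0:9)) - 1
  if (anyNA(vals)) stop("invalid punycode digit")
  n <- 128; i <- 0; bias <- 72; pos <- 1
  while (pos <= length(vals)) {
    oldi <- i; w <- 1; k <- base
    repeat {
      if (pos > length(vals)) stop("truncated punycode input")
      d <- vals[pos]; pos <- pos + 1
      i <- i + d * w
      t <- threshold(k, bias)
      if (d < t) break
      w <- w * (base - t)
      k <- k + base
    }
    len <- length(out) + 1
    bias <- adapt_bias(i - oldi, len, oldi == 0)
    n <- n + i %/% len
    i <- i %% len
    out <- append(out, n, after = i)  # insert code point n at position i
    i <- i + 1
  }
  intToUtf8(out)
}

decode_domain <- function(domain) {
  labels <- strsplit(domain, ".", fixed = TRUE)[[1]]
  paste(vapply(labels, decode_label, character(1)), collapse = ".")
}

decode_domain("xn--caf-dma.com")
```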
Converts Unicode domain names to their ASCII punycode representation following RFC 3492 standards. This function is essential for processing internationalized domain names (IDNs) in web scraping and URL analysis.

    puny_encode(x, strict = getOption("punycoder.strict", TRUE))

| Argument | Description |
|---|---|
| x | Character vector of Unicode domain names to encode |
| strict | Logical; whether to apply strict validation. Defaults to 'getOption("punycoder.strict", TRUE)'. |

A character vector the same length as x, with each element
containing the ASCII punycode-encoded domain name. Elements corresponding
to NA inputs are NA_character_. In non-strict mode, domains
that fail encoding are also returned as NA_character_.
puny_decode for the reverse operation,
url_encode for full URL encoding.

    # Basic encoding
    puny_encode("caf\u00E9.com")
    puny_encode("\u043C\u043E\u0441\u043A\u0432\u0430.\u0440\u0444")

    # Vectorized encoding
    domains <- c(
      "caf\u00E9.com",
      "\u043C\u043E\u0441\u043A\u0432\u0430.\u0440\u0444",
      "\u5317\u4EAC.\u4E2D\u56FD"
    )
    puny_encode(domains)

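The encoding direction can be sketched in base R as well. This follows the RFC 3492 encoding procedure label by label and prepends the xn-- prefix. It is a teaching sketch rather than the package's implementation: the names `encode_label`, `encode_domain`, `adapt_bias`, and `threshold` are hypothetical, and it performs no IDNA mapping (such as case folding or normalization), no length checks, and no overflow detection.

```r
# Punycode parameters from RFC 3492, section 5.
base <- 36; tmin <- 1; tmax <- 26; skew <- 38; damp <- 700

# Bias adaptation from RFC 3492, section 6.1.
adapt_bias <- function(delta, numpoints, firsttime) {
  delta <- if (firsttime) delta %/% damp else delta %/% 2
  delta <- delta + delta %/% numpoints
  k <- 0
  while (delta > ((base - tmin) * tmax) %/% 2) {
    delta <- delta %/% (base - tmin)
    k <- k + base
  }
  k + ((base - tmin + 1) * delta) %/% (delta + skew)
}

# Threshold for digit position k, clamped to [tmin, tmax].
threshold <- function(k, bias) min(max(k - bias, tmin), tmax)

encode_label <- function(label) {
  cp <- utf8ToInt(label)
  if (all(cp < 128)) return(label)  # pure ASCII labels pass through unchanged
  digits <- c(letters, 0:9)         # digit values 0..35 map to a-z, 0-9
  basic <- cp[cp < 128]
  out <- if (length(basic) > 0) paste0(intToUtf8(basic), "-") else ""
  n <- 128; delta <- 0; bias <- 72
  h <- b <- length(basic)
  while (h < length(cp)) {
    m <- min(cp[cp >= n])           # next code point to insert
    delta <- delta + (m - n) * (h + 1)
    n <- m
    for (ch in cp) {
      if (ch < n) delta <- delta + 1
      if (ch == n) {
        # Emit delta as a variable-length integer.
        q <- delta
        k <- base
        repeat {
          t <- threshold(k, bias)
          if (q < t) break
          out <- paste0(out, digits[t + (q - t) %% (base - t) + 1])
          q <- (q - t) %/% (base - t)
          k <- k + base
        }
        out <- paste0(out, digits[q + 1])
        bias <- adapt_bias(delta, h + 1, h == b)
        delta <- 0
        h <- h + 1
      }
    }
    delta <- delta + 1
    n <- n + 1
  }
  paste0("xn--", out)
}

encode_domain <- function(domain) {
  labels <- strsplit(domain, ".", fixed = TRUE)[[1]]
  paste(vapply(labels, encode_label, character(1)), collapse = ".")
}

encode_domain("caf\u00E9.com")
```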
Converts URLs containing ASCII punycode domain names back to their Unicode representation for display purposes. This function makes internationalized URLs human-readable.

    url_decode(url, strict = getOption("punycoder.strict", TRUE))

| Argument | Description |
|---|---|
| url | Character vector of URLs with ASCII punycode domains |
| strict | Logical; whether to apply strict validation. Defaults to 'getOption("punycoder.strict", TRUE)'. |

A character vector the same length as url, with each element
containing the URL with its host portion decoded to Unicode. Only the
domain component is transformed; scheme, path, query, and fragment are
preserved. Elements corresponding to NA inputs are
NA_character_.
url_encode for the reverse operation,
puny_decode for domain-only decoding,
parse_url for URL component extraction.

    # Basic URL decoding
    url_decode("https://xn--caf-dma.example.com/path")
    url_decode("https://xn--80adxhks.xn--p1ai/page")

    # Vectorized URL decoding
    ascii_urls <- c(
      "https://xn--caf-dma.com/menu",
      "https://xn--1qqw23a.xn--55qx5d/info"
    )
    url_decode(ascii_urls)

Converts URLs containing Unicode domain names to their ASCII representation while preserving the rest of the URL structure. This function is essential for preparing URLs for systems that require ASCII-only domain names.

    url_encode(url, strict = getOption("punycoder.strict", TRUE))

| Argument | Description |
|---|---|
| url | Character vector of URLs with potential Unicode domains |
| strict | Logical; whether to apply strict validation. Defaults to 'getOption("punycoder.strict", TRUE)'. |

A character vector the same length as url, with each element
containing the URL with its host portion ASCII-encoded. Only the domain
component is transformed; scheme, path, query, and fragment are preserved.
Elements corresponding to NA inputs are NA_character_.
url_decode for the reverse operation,
puny_encode for domain-only encoding,
parse_url for URL component extraction.

    # Basic URL encoding
    url_encode("https://caf\u00E9.example.com/path?query=value")
    url_encode(
      "https://\u043C\u043E\u0441\u043A\u0432\u0430.\u0440\u0444/page"
    )

    # Vectorized URL encoding
    urls <- c(
      "https://caf\u00E9.com/menu",
      "https://\u5317\u4EAC.\u4E2D\u56FD/info"
    )
    url_encode(urls)

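Both url_encode and url_decode are documented as touching only the host and leaving the rest of the URL alone. That pattern can be sketched with a small base R helper that applies an arbitrary function to the host. The name `map_host` is hypothetical, the pattern assumes a `scheme://host` URL without user info or IPv6 literals, and `toupper` stands in for a real encoder or decoder purely to make the effect visible.

```r
# Hypothetical helper: apply f to the host part of each URL, keep the rest.
map_host <- function(url, f) {
  pattern <- "^([A-Za-z][A-Za-z0-9+.-]*://)([^/?#:]*)(.*)$"
  out <- url
  ok <- !is.na(url) & grepl(pattern, url)
  for (i in which(ok)) {
    parts <- regmatches(url[i], regexec(pattern, url[i]))[[1]]
    # parts[2] is "scheme://", parts[3] the host, parts[4] everything after it.
    out[i] <- paste0(parts[2], f(parts[3]), parts[4])
  }
  out
}

map_host(c("https://example.com:8080/path?q=1#frag", NA), toupper)
```

Inputs that are `NA` or do not match the pattern are returned unchanged in this sketch. How the package treats malformed URLs is not covered here.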
Validates domain names according to RFC standards, checking for proper format, length restrictions, and character requirements. Supports both Unicode and ASCII domain names.

    validate_domain(x, strict = getOption("punycoder.strict", TRUE))

| Argument | Description |
|---|---|
| x | Character vector of domain names to validate |
| strict | Logical; whether to apply strict validation. Defaults to 'getOption("punycoder.strict", TRUE)'. |

An object of class "punycoder_validation" (a named list)
with components:
Character vector of the input domain names.
Logical vector indicating whether each domain is valid.
List of character vectors, each containing error messages for the corresponding domain (empty for valid domains).
puny_encode for encoding validated domains.

    validate_domain("example.com")
    validate_domain("caf\u00E9.example.com")
    long_label <- paste(rep("x", 250), collapse = "")
    validate_domain(c("valid.com", "invalid..com", long_label))

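The kind of structural checks involved can be sketched in base R using the DNS limits of 63 characters per label and 253 characters per name. This is only an approximation of what a validator might do: the function name `check_domain`, the element names of the returned list, and the error messages are hypothetical, and lengths are counted on the input as given, without first converting Unicode labels to their ASCII form.

```r
# Hypothetical helper: basic structural checks on domain names.
check_domain <- function(x) {
  errors <- lapply(x, function(d) {
    if (is.na(d) || !nzchar(d)) return("domain is missing or empty")
    msgs <- character(0)
    if (nchar(d) > 253) msgs <- c(msgs, "domain exceeds 253 characters")
    labels <- strsplit(d, ".", fixed = TRUE)[[1]]
    if (any(!nzchar(labels))) msgs <- c(msgs, "domain contains an empty label")
    if (any(nchar(labels) > 63)) msgs <- c(msgs, "label exceeds 63 characters")
    msgs
  })
  list(domain = x, valid = lengths(errors) == 0, errors = errors)
}

long_label <- paste(rep("x", 250), collapse = "")
result <- check_domain(c("valid.com", "invalid..com", long_label))
result$valid
```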