uchardet library is the encoding detector, which takes a sequence of bytes in an unknown character encoding without any additional information, and attempts to determine the encoding of the text. Returned encoding names are iconv-compatible.

uchardet package solves 3 types of tasks:

  • strings encoding detection (R character vector);
  • bytes encoding detection (R raw vector);
  • detection of the file encoding without importing it into working environment;

The uchardet package includes demo files. You could get their paths with this command:

dir(system.file("examples", package = "uchardet"), recursive = TRUE, full.names = TRUE)

Character encoding detection

To detect encoding of the strings you should use detect_str_enc() function. It is vectorized and accepts the character vector. Missing values will be skipped.

Simple example. Detection of the ASII symbols:

detect_str_enc("Hello, useR!")
#> [1] "ASCII"

All strings in R could be only in three encodings - UTF-8, Latin1 and native. It means that we could not read file with WINDOWS-1252 encoding and got/print it with the same encoding, it will be converted into one of the basic encodings (usually ASCII or UTF-8).

Due to this limitations if we want to test detect_str_enc() we should convert string with UTF-8 encoding into another encoding and then use the detect_str_enc().

Let’s define function for reading demo file to UTF-8 string:

read_char <- function(path, enc) {
    # get file path
  file <- system.file("examples", path, package = "uchardet")
  # create the file connection with the encoding
  con <- file(file, encoding = enc)
  # close connection on exit
  on.exit(close(con))
  # read file content
  paste(readLines(con, warn = FALSE), collapse = "\n")
}

Detction of the UTF-8 character string.

# read file into the working env
zh_utf8 <- read_char("zh/big5.txt", "BIG-5")
# print content
print(zh_utf8)
#> [1] "繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文"
# check the encoding of the created object
Encoding(zh_utf8)
#> [1] "UTF-8"
# detection result
detect_str_enc(zh_utf8)
#> [1] "UTF-8"

Detection of the unusual encodings:

# convert zh_utf8 from UTF-8 into unusual encodings
zh_big5 <- iconv(zh_utf8, "UTF-8", "BIG-5")
print(zh_big5)
#> [1] "\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5"

zh_gb <- iconv(zh_utf8, "UTF-8", "GB18030")
print(zh_gb)
#> [1] "\xb7\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xce\xc4"

# detect encoding
detect_str_enc(c(zh_utf8, zh_big5, zh_gb))
#> [1] "UTF-8"   "BIG5"    "GB18030"

Basic Encoding() function returns unknown encoding:

Encoding(c(zh_utf8, zh_big5, zh_gb))
#> [1] "UTF-8"   "unknown" "unknown"

Raw bytes encoding detection

Sometimes file can’t be read as a string, for example, when it includes embedded nul (\000). In such cases it would be right to read the file as raw byte vector and detect encoding with detect_raw_enc() function.

Detection of the ASII symbols from the byte vector:

detect_raw_enc(charToRaw("Hello, useR!"))
#> [1] "ASCII"

Let’s define the function for reading demo files as raw bytes vector:

read_raw <- function(path) {
  # get file path
  file <- system.file("examples", path, package = "uchardet")
  # read file to raw vector
  readBin(file, raw(), file.size(file))
}

# print first 5 bytes
read_raw("de/iso-8859-1.txt")[1:5]
#> [1] 49 53 4f 20 38

Unusual encodings (each file has it’s own encoding) detection:

detect_raw_enc(read_raw("de/iso-8859-1.txt"))
#> [1] "ISO-8859-1"
detect_raw_enc(read_raw("de/windows-1252.txt"))
#> [1] "WINDOWS-1252"
detect_raw_enc(read_raw("fr/utf-16.be"))
#> [1] "UTF-16"
detect_raw_enc(read_raw("zh/big5.txt"))
#> [1] "BIG5"

Files encoding detection

Function detect_file_enc() will be helpful for detection files encoding without importing these files into the working environment. detect_file_enc() uses the sliding window with the 65536 bytes width, in result there is no need to import the entire file.

Function is vectorized and accepts the character vector of file paths. Non existing files will be skipped.

# paths to examples files
ex_path <- system.file("examples", package = "uchardet")
ex_files <- Sys.glob(file.path(ex_path, "*", "*"))
# detect encoding
res <- detect_file_enc(ex_files)

Let’s compare results with the original files encodings:

# regex pattern
pattern <- ".*/examples/((.*)/(.*)\\.(?:.*))$"
proto <- list(file = character(1L), lang = character(1L), original = character(1L))
cmp <- strcapture(pattern, ex_files, proto)
cmp$lang <- toupper(cmp$lang)
cmp$original <- toupper(cmp$original)
cmp$uchardet <- res
head(cmp, n = 15)
#>                        file lang          original          uchardet
#> 1         ar/iso-8859-6.txt   AR        ISO-8859-6        ISO-8859-6
#> 2              ar/utf-8.txt   AR             UTF-8             UTF-8
#> 3       ar/windows-1256.txt   AR      WINDOWS-1256      WINDOWS-1256
#> 4       bg/windows-1251.txt   BG      WINDOWS-1251      WINDOWS-1251
#> 5             cs/ibm852.txt   CS            IBM852            IBM852
#> 6         cs/iso-8859-2.txt   CS        ISO-8859-2        ISO-8859-2
#> 7  cs/mac-centraleurope.txt   CS MAC-CENTRALEUROPE MAC-CENTRALEUROPE
#> 8              cs/utf-8.txt   CS             UTF-8             UTF-8
#> 9       cs/windows-1250.txt   CS      WINDOWS-1250      WINDOWS-1250
#> 10       da/iso-8859-15.txt   DA       ISO-8859-15       ISO-8859-15
#> 11        da/iso-8859-1.txt   DA        ISO-8859-1       ISO-8859-15
#> 12             da/utf-8.txt   DA             UTF-8             UTF-8
#> 13      da/windows-1252.txt   DA      WINDOWS-1252      WINDOWS-1252
#> 14        de/iso-8859-1.txt   DE        ISO-8859-1        ISO-8859-1
#> 15      de/windows-1252.txt   DE      WINDOWS-1252      WINDOWS-1252