text

This module provides support for loading, manipulating, and comparing Unicode text data. It works by storing characters with their Unicode codepoint value. In practice, this means that every character is a 64-bit integer, so a text value will use substantially more memory than the equivalent encoded string value.

The advantages of text over string representations for Unicode are:

  • comparisons, equality checks, etc. actually work for Unicode text and are not encoding-dependent.

  • direct access to codepoint values.
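To illustrate why codepoint-level comparison is encoding-independent, here is a minimal sketch in plain Lua 5.3+ that does not use cp.text itself. The helper names (codepointsFromUtf8, codepointsFromUtf16le, sameCodepoints) are hypothetical and exist only for this example; it relies on the built-in utf8 library and string.unpack.

```lua
-- Collect the codepoints of a UTF-8 encoded string into an array.
local function codepointsFromUtf8(s)
    local cps = {}
    for _, cp in utf8.codes(s) do
        cps[#cps + 1] = cp
    end
    return cps
end

-- Collect the codepoints of a UTF-16LE encoded string, joining surrogate pairs.
local function codepointsFromUtf16le(s)
    local cps, pos = {}, 1
    while pos <= #s do
        local unit
        unit, pos = string.unpack("<I2", s, pos)
        if unit >= 0xD800 and unit <= 0xDBFF then
            local low
            low, pos = string.unpack("<I2", s, pos)
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        end
        cps[#cps + 1] = unit
    end
    return cps
end

-- Compare two codepoint arrays element by element.
local function sameCodepoints(a, b)
    if #a ~= #b then return false end
    for i = 1, #a do
        if a[i] ~= b[i] then return false end
    end
    return true
end

-- The same text, "a丽𐐷", in two encodings.
local asUtf8 = "a\u{4E3D}\u{10437}"
local asUtf16le = "\x61\x00\x3D\x4E\x01\xD8\x37\xDC"

local bytesEqual = (asUtf8 == asUtf16le)   -- false: the encoded bytes differ
local textEqual = sameCodepoints(codepointsFromUtf8(asUtf8),
                                 codepointsFromUtf16le(asUtf16le))   -- true
```

The two strings differ byte for byte, yet they decode to the same three codepoints, which is the level at which a text value is described as operating.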

The advantages of string representations for Unicode are:

  • compactness.

  • reading/writing to files via the standard io library.

Strings and Unicode

Lua has limited built-in support for Unicode text. A string value is "8-bit clean", which means it is an array of 8-bit characters. This is also how binary data from files is usually loaded, as 8-bit 'bytes'. Unicode codepoints range up to U+10FFFF, far more than fits in 8 bits, so there are several standard ways to represent Unicode characters using 8-bit characters. Without going into detail, the most common encodings are called 'UTF-8' and 'UTF-16'. There are two variations of 'UTF-16', which differ in byte order and are known as 'big-endian' and 'little-endian'.
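As a rough illustration of how one codepoint maps to different byte sequences, the sketch below encodes U+10437 (𐐷) as UTF-8, UTF-16LE and UTF-16BE using only the Lua 5.3+ standard library. The helpers encodeUtf16 and hex are hypothetical names made up for this example and are not part of cp.text.

```lua
-- Encode a single codepoint as UTF-16 bytes in the requested byte order.
local function encodeUtf16(cp, littleEndian)
    local fmt = littleEndian and "<I2" or ">I2"
    if cp < 0x10000 then
        return string.pack(fmt, cp)
    end
    -- Codepoints above U+FFFF are split into a surrogate pair.
    local v = cp - 0x10000
    local high = 0xD800 + (v >> 10)
    local low = 0xDC00 + (v & 0x3FF)
    return string.pack(fmt, high) .. string.pack(fmt, low)
end

-- Render a byte string as space-separated hexadecimal.
local function hex(s)
    local parts = {}
    for i = 1, #s do
        parts[i] = string.format("%02X", s:byte(i))
    end
    return table.concat(parts, " ")
end

local cp = 0x10437
local utf8Hex = hex(utf8.char(cp))              -- F0 90 90 B7
local utf16leHex = hex(encodeUtf16(cp, true))   -- 01 D8 37 DC
local utf16beHex = hex(encodeUtf16(cp, false))  -- D8 01 DC 37
```

The two UTF-16 results contain the same 16-bit units (D801 DC37) and differ only in the order of the bytes within each unit.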

The built-in functions for string, such as match, gsub and even len will not work as expected when a string contains Unicode text. As such, this library fills some of the gaps for common operations when working with Unicode text.
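A small sketch of the problem, using only standard Lua 5.3+ (the utf8 library is built in; nothing here comes from cp.text). The variable names are chosen for this example only.

```lua
-- "a丽𐐷" written with escapes so the bytes do not depend on the file's encoding.
local s = "a\u{4E3D}\u{10437}"

local byteCount = #s                 -- 8: counts bytes (1 + 3 + 4)
local codepointCount = utf8.len(s)   -- 3: counts codepoints

-- string.sub slices bytes, so this cuts the 3-byte sequence for U+4E3D apart.
local fragment = s:sub(2, 2)

-- utf8.len returns nil for input that is not valid UTF-8.
local fragmentIsValid = utf8.len(fragment) ~= nil   -- false
```

The length operator and sub both work on bytes, so any multi-byte character throws off counts and indices.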

Examples

You can convert to and from string and text values like so:

local text = require("cp.text")

local simpleString = "foobar"
local simpleText = text(simpleString)
local utf8String = "a丽𐐷" -- contains non-ascii characters, defaults to UTF-8.
local unicodeText = text "a丽𐐷" -- contains non-ascii characters, converts from a UTF-8 string.
local utf8Encoded = tostring(unicodeText) -- `tostring` will default to UTF-8 encoding
local utf16leString = unicodeText:encode(text.encoding.utf16le) -- or you can be more specific

Note that text values are not in any specific encoding, since they are stored as 64-bit integer codepoints rather than 8-bit characters.

Submodules

API Overview

  • Constants - Useful values which cannot be changed

    • encoding

  • Functions - API calls offered directly by the extension

    • is

  • Constructors - API calls which return an object, typically one that offers API methods

    • char

    • fromCodepoints

    • fromFile

    • fromString

  • Methods - API calls which can only be made on an object returned by a constructor

    • encode

    • find

    • len

    • match

    • sub

API Documentation

Constants

encoding

Signature

cp.text.encoding

Type

Constant

Description

The list of supported encoding formats:

Functions

is

Signature

cp.text.is(value) -> boolean

Type

Function

Description

Checks if the provided value is a text instance.

Parameters

​

Returns

​

Constructors

char

Signature

cp.text.char(...) -> text

Type

Constructor

Description

Converts a list of one or more codepoint values into a text value, concatenating the results.
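For comparison, Lua's built-in utf8.char does the analogous job for encoded strings. The sketch below uses only the standard library; a call such as text.char(97, 0x4E3D, 0x10437) is assumed to build the equivalent text value, based on the signature above rather than on tested behaviour.

```lua
-- utf8.char takes any number of codepoints and returns them as one UTF-8 string.
local s = utf8.char(97, 0x4E3D, 0x10437)   -- "a丽𐐷"

-- utf8.codepoint goes the other way, returning the codepoints as multiple values.
local cps = { utf8.codepoint(s, 1, -1) }   -- { 97, 20029, 66615 }
```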

Parameters

​

Returns

​

fromCodepoints

Signature

cp.text.fromCodepoints(codepoints[, i[, j]]) -> text

Type

Constructor

Description

Returns a new text instance representing the specified array of codepoints. Since i and j default to the first

Parameters

​

Returns

​

Notes

​

fromFile

Signature

cp.text.fromFile(path[, encoding]) -> text

Type

Constructor

Description

Returns a new text instance representing the text loaded from the specified path. If no encoding is specified,

Parameters

​

Returns

​

fromString

Signature

cp.text.fromString(value[, encoding]) -> text

Type

Constructor

Description

Returns a new text instance representing the string value of the specified value. If no encoding is specified,

Parameters

​

Returns

​

Notes

​

Methods

encode

Signature

cp.text:encode([encoding]) -> string

Type

Method

Description

Returns the text as an encoded string value.

Parameters

​

find

Signature

cp.text:find(pattern [, init [, plain]])

Type

Method

Description

Looks for the first match of pattern in the text value. If it finds a match, then find returns the indices in the text where this occurrence starts and ends; otherwise, it returns nil. The optional numerical argument init specifies where to start the search; its default value is 1 and it can be negative. Passing true as the optional argument plain turns off the pattern matching facilities, so the function does a plain "find substring" operation, with no characters in pattern being considered "magic". Note that if plain is given, then init must be given as well.
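The description follows Lua's own string.find, so the init and plain arguments can be tried out on ordinary strings. The sketch below uses only standard Lua 5.3+ and also shows why a codepoint-aware find is useful: string.find reports byte positions, which have to be converted before they line up with codepoint indices. That text:find returns codepoint indices is an assumption drawn from the description above.

```lua
-- plain = true: "." is a literal dot, not the "any character" pattern item.
local plainStart = ("a.b"):find(".", 1, true)     -- 2
local patternStart = ("a.b"):find(".")            -- 1

-- A negative init counts back from the end of the string.
local fromEnd = ("hello"):find("l", -2)           -- 4

-- string.find reports byte positions; utf8.len converts one to a codepoint index.
local s = "a\u{4E3D}\u{10437}b"                   -- "a丽𐐷b"
local byteIndex = s:find("b", 1, true)            -- 9
local codepointIndex = utf8.len(s, 1, byteIndex)  -- 4
```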

Returns

​

len

Signature

cp.text:len() -> number

Type

Method

Description

Returns the number of codepoints in the text.

Parameters

​

Returns

​

match

Signature

cp.text:match(pattern[, start]) -> ...

Type

Method

Description

Looks for the first match of the pattern in the text value. If it finds one, then match returns the captures from the pattern; otherwise it returns nil. If pattern specifies no captures, then the whole match is returned. The optional numerical argument start specifies where to start the search; its default value is 1 and it can be negative.

Parameters

​

Returns

​

sub

Signature

cp.text:sub(i [, j]) -> cp.text

Type

Method

Description

Returns the substring of this text that starts at i and continues until j; i and j can be negative.
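To show what codepoint-based indexing means in practice, here is a minimal sketch of the same operation on a UTF-8 string, written in plain Lua 5.3+. The function codepointSub is a hypothetical helper for this example and is not how cp.text implements sub; it assumes i and j follow the same conventions as string.sub, with j defaulting to -1.

```lua
-- Return the codepoints i..j of a UTF-8 string; negative indices count from the end.
local function codepointSub(s, i, j)
    local n = utf8.len(s)
    j = j or -1
    if i < 0 then i = math.max(n + i + 1, 1) elseif i == 0 then i = 1 end
    if j < 0 then j = n + j + 1 elseif j > n then j = n end
    if i > j then return "" end
    -- utf8.offset maps a codepoint index to the byte position where it starts.
    local first = utf8.offset(s, i)
    local last = utf8.offset(s, j + 1) - 1
    return s:sub(first, last)
end

local s = "a\u{4E3D}\u{10437}b"          -- "a丽𐐷b"
local middle = codepointSub(s, 2, 3)     -- "丽𐐷"
local tail = codepointSub(s, -2)         -- "𐐷b"
```

With a text value the byte-offset lookup is unnecessary, since each element is already a whole codepoint.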