value. In practice, this means that every character is a 64-bit integer, so a `text` value will use substantially more memory than the equivalent encoded `string` value.

Both the `text` and `string` representations have advantages for Unicode. In particular, `string` values work with the standard `io` library.

`string`
values are "8-bit clean", which means each one is an array of 8-bit characters. This is also how binary data from files is usually loaded, as 8-bit 'bytes'. Unicode code points need up to 21 bits and are often stored as 32-bit values, so there are several standard ways to represent them using 8-bit characters. Without going into detail, the most common encodings are called 'UTF-8' and 'UTF-16'. There are two variations of 'UTF-16', depending on byte order, known as 'big-endian' and 'little-endian'.

The standard `string` functions, such as `match`, `gsub`, and even `len`, will not work as expected when a string contains Unicode text. As such, this library fills some of the gaps for common operations when working with Unicode text.

`string`
and `text` values can be converted to each other with `cp.text.fromString` and `cp.text:encode`.

`text` values are not in any specific encoding, since they are stored as 64-bit integer code points rather than 8-bit characters.

cp.text.encoding
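To make the three encoding names concrete, the following is a minimal sketch of how a single codepoint is laid out as bytes in UTF-8, UTF-16 little-endian, and UTF-16 big-endian. It uses only the standard Lua 5.3+ `utf8` and `string` libraries, not `cp.text` itself; `utf16Bytes` and `hex` are hypothetical helpers written for this illustration.

```lua
-- Sketch only: plain Lua 5.3+, no CommandPost modules required.
-- `utf16Bytes` and `hex` are hypothetical helpers, not part of cp.text.

-- Encodes one codepoint as UTF-16 bytes, using a surrogate pair above U+FFFF.
local function utf16Bytes(codepoint, bigEndian)
    local fmt = bigEndian and ">I2" or "<I2"
    if codepoint < 0x10000 then
        return string.pack(fmt, codepoint)
    end
    local v = codepoint - 0x10000
    local high = 0xD800 + (v >> 10)
    local low = 0xDC00 + (v & 0x3FF)
    return string.pack(fmt, high) .. string.pack(fmt, low)
end

-- Formats a byte string as space-separated hexadecimal pairs.
local function hex(bytes)
    local parts = {}
    for i = 1, #bytes do
        parts[i] = string.format("%02X", bytes:byte(i))
    end
    return table.concat(parts, " ")
end

local euro = 0x20AC -- U+20AC EURO SIGN

print(hex(utf8.char(euro)))         -- UTF-8:    E2 82 AC
print(hex(utf16Bytes(euro, false))) -- UTF-16LE: AC 20
print(hex(utf16Bytes(euro, true)))  -- UTF-16BE: 20 AC
```

The same codepoint produces three different byte sequences, which is why the encoding has to be known whenever a `string` is converted to or from a `text` value.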
cp.text.is(value) -> boolean
Checks whether the provided value is a `text` instance.

Parameters:
 * `value` - The value to check.

Returns:
 * `true` if the value is a `text` instance.

cp.text.char(...) -> text
Parameters:
 * `...` - The list of codepoint integers.

Returns:
 * The `cp.text` value for the list of codepoint values.

cp.text.fromCodepoints(codepoints[, i[, j]]) -> text
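Returns a new `text` instance representing the specified array of codepoints. Since `i` and `j` default to the first and last items of the array, passing only the array converts all of its codepoints.

Parameters:
 * `codepoints` - The array of codepoint integers.
 * `i` - The starting index to read from `codepoints`. Defaults to `1`.
 * `j` - The ending index to read from `codepoints`. Defaults to `-1`.

Returns:
 * The `text` instance.

Notes:
 * Negative values may be given for `i` and `j`. If so, they count back from the end of the `codepoints` array.

cp.text.fromFile(path[, encoding]) -> text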
Returns a new `text` instance representing the text loaded from the specified path. If no encoding is specified, it defaults to UTF-8.

Parameters:
 * `path` - The path of the file to load as a unicode `text` instance.
 * `encoding` - One of the values from `text.encoding`: `utf8`, `utf16le`, or `utf16be`. Defaults to `utf8`.

Returns:
 * The `text` instance.

cp.text.fromString(value[, encoding]) -> text
Returns a new `text` instance representing the string value of the specified value. If no encoding is specified, it defaults to UTF-8.

Parameters:
 * `value` - The value to turn into a unicode `text` instance.
 * `encoding` - One of the values from `text.encoding`: `utf8`, `utf16le`, or `utf16be`. Defaults to `utf8`.

Returns:
 * The `text` instance.

Notes:
 * Calling `text(value)` is the same as calling `text.fromString(value, text.encoding.utf8)`, so simple text can be initialized via `local x = text "foo"` when the `.lua` file's encoding is UTF-8.

cp.text:encode([encoding]) -> string
Returns the text as an encoded `string` value.

Parameters:
 * `encoding` - The encoding to use when converting. Defaults to `cp.text.encoding.utf8`.

cp.text:find(pattern [, init [, plain]])
Looks for the first match of `pattern` in the text value. If it finds a match, then `find` returns the indices of the value where this occurrence starts and ends; otherwise, it returns `nil`. The optional numeric argument `init` specifies where to start the search; its default value is `1`, and it can be negative. Passing `true` for the optional argument `plain` turns off the pattern matching facilities, so the function does a plain "find substring" operation, with no characters in `pattern` being considered "magic". Note that if `plain` is given, then `init` must be given as well.

cp.text:len() -> number
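The reason a codepoint-based length is useful can be sketched in plain Lua. This example uses the standard Lua 5.3+ `utf8` library rather than `cp.text`, so it only illustrates the difference between counting bytes and counting codepoints; that `cp.text:len()` reports the codepoint count is an assumption based on how `text` values are described as being stored.

```lua
-- Sketch only: plain Lua 5.3+, using the standard `utf8` library, not cp.text.
local s = "na\u{EF}ve" -- "naïve"; U+00EF takes two bytes in UTF-8

print(#s)          -- number of bytes: 6
print(utf8.len(s)) -- number of codepoints: 5

-- Collect one integer per character, similar in spirit to how a `text`
-- value is described as storing its codepoints.
local codepoints = {}
for _, code in utf8.codes(s) do
    codepoints[#codepoints + 1] = code
end

print(#codepoints)   -- 5
print(codepoints[3]) -- 239 (0xEF)
```

Once the characters are held as an array of integers, the length and any index refer to whole characters rather than to individual bytes.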
cp.text:match(pattern[, start]) -> ...
Looks for the first match of `pattern` in the text value. If it finds one, then `match` returns the captures from the pattern; otherwise it returns `nil`. If `pattern` specifies no captures, then the whole match is returned. The optional numeric argument `start` specifies where to start the search; its default value is `1`, and it can be negative.

Parameters:
 * `pattern` - The text pattern to process.
 * `start` - If specified, indicates the starting position to process from. Defaults to `1`.

Returns:
 * The captures from the pattern (or the whole match if it has no captures), or `nil` if there is no match.

cp.text:sub(i [, j]) -> cp.text
Returns the substring of this text that starts at `i` and continues until `j`; `i` and `j` can be negative.