🔡

Elixir Unicode to String

💡

I would like this blog post and my other Elixir-related blog posts to be treated as a living organism. They will continue to grow as I gain new insight into the topic.

💡

When introducing a new idea or definition, I will try to explain it from as many angles as possible. I will be using the [ADEPT](https://betterexplained.com/articles/adept-method/) method as guidance.

What is unicode?

"The Unicode Standard is a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world."

 Source

In Plain English & Example:

It's an internationally agreed-upon way of associating integers with characters. For example, in the USA the code point 65 represents the character 'A', and in Japan the same code point 65 also represents the character 'A'.
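As a minimal sketch of this mapping in Elixir: the `?` operator returns a character's code point, and the `utf8` modifier inside `<<>>` turns a code point back into a string. The variable name `code_point` is just for illustration.

```elixir
# ?A returns the code point of the character A.
code_point = ?A
IO.inspect(code_point)          # 65

# The utf8 modifier builds a string from a code point.
IO.puts(<<code_point::utf8>>)   # A
```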

What does standard mean within the context of programming?

"Software standards consist of certain terms, concepts, data formats, document styles and techniques agreed upon by software creators so that their software can understand the files and data created by a different computer program. To be considered a standard, a certain protocol needs to be accepted and incorporated by a group of developers who contribute to the definition and maintenance of the standard."

 Source

Does unicode mean translation?

No, it's simply an internationally agreed-upon way to map an integer (aka code point) to a character. Under the Unicode standard the code point 65 maps to the character 'A'; therefore, the USA, Japan and other countries all adhere to that mapping.

What is code point?

The notion of a code point is used for abstraction, to distinguish both: the number from an encoding as a sequence of bits, and the abstract character from a particular graphical representation (glyph).

Plain English & Example:

  • It is an integer value that maps to a character within the Unicode standard.
  • E.g. 2947 is the code point for the Tamil character 'ஃ' in the Unicode standard. Unicode code points are usually written in hex, so this character is commonly referred to as U+0B83. In Elixir it can be written as "\u0B83" or "\u{0B83}".
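The Tamil example can be checked with a short sketch. The variable names are illustrative; `String.to_charlist/1` and `Integer.to_string/2` are standard library functions.

```elixir
# The Tamil character from the example, written with a Unicode escape.
tamil = "\u0B83"

# String.to_charlist/1 returns the code points of a string as integers.
[tamil_code_point] = String.to_charlist(tamil)
IO.inspect(tamil_code_point)                          # 2947

# The same code point written as a hex integer literal.
IO.inspect(tamil_code_point == 0x0B83)                # true

# Integer.to_string/2 renders the code point in base 16.
IO.inspect(Integer.to_string(tamil_code_point, 16))   # "B83"
```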

What is hexadecimal digit?

Hexadecimal (also base 16, or hex) is a positional numeral system that represents numbers using a base of 16. Its digits are 0-9 followed by A-F, where A stands for 10 and F for 15.
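A small sketch of moving between decimal and hex in Elixir, using 65 (the code point for 'A') as the sample value. Because hex is positional, 41 in base 16 means 4 * 16 + 1.

```elixir
# Decimal to hex string, and back again.
hex_string = Integer.to_string(65, 16)
IO.inspect(hex_string)                        # "41"
IO.inspect(String.to_integer(hex_string, 16)) # 65

# Integer.digits/2 exposes the individual base-16 digits.
hex_digits = Integer.digits(65, 16)
IO.inspect(hex_digits)                        # [4, 1]

# Positional meaning: 4 * 16 + 1 = 65.
IO.inspect(4 * 16 + 1 == 0x41)                # true
```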

What is encoding & UTF-8?

An encoding is the scheme used to convert a code point (an integer) into bytes. UTF-8 is one such encoding.

 Source

Elixir uses UTF-8 to encode its strings, which means that code points are encoded as a series of 8-bit bytes.
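To make the byte-level view concrete, here is a minimal sketch comparing an ASCII character with the Tamil character U+0B83. It uses `byte_size/1`, `String.length/1`, and Erlang's `:binary.bin_to_list/1`; the variable names are illustrative.

```elixir
ascii = "A"
tamil_char = "\u0B83"

# byte_size/1 counts bytes; String.length/1 counts characters (graphemes).
IO.inspect(byte_size(ascii))            # 1
IO.inspect(byte_size(tamil_char))       # 3
IO.inspect(String.length(tamil_char))   # 1

# The raw UTF-8 bytes behind the single Tamil character.
tamil_bytes = :binary.bin_to_list(tamil_char)
IO.inspect(tamil_bytes)                 # [224, 174, 131]
```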

What is a bit?

A single unit of data that can be either 1 or 0

 Source

What is 8-bit? What is byte?

A byte is 8 bits

 Source

Example

11110000

What is a bitstring

A bitstring is a fundamental data type in Elixir, denoted with the <<>> syntax.

What is a binary?

A binary is a bitstring where the number of bits is divisible by 8. In other words, a binary is a sequence of bytes.

How does <<>> constructor work?

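The <<>> constructor builds a bitstring, which is a contiguous sequence of bits in memory, from one or more segments. Each segment takes 8 bits by default, and the ::size syntax (for example <<3::4>>) sets a different number of bits.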
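A short sketch of the constructor in use. The variable names are illustrative only.

```elixir
# Two default 8-bit segments: the bytes 65 and 66, which spell "AB".
two_bytes = <<65, 66>>
IO.inspect(two_bytes == "AB")       # true

# One 4-bit segment.
nibble = <<3::4>>
IO.inspect(bit_size(nibble))        # 4

# Two 4-bit segments are laid out back to back: 0001 0010 is the byte 18.
packed = <<1::4, 2::4>>
IO.inspect(packed == <<18>>)        # true
```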

What happens when an integer needs more bits than the segment size provides?

Any value that exceeds what can be stored in the number of bits provisioned is truncated: the left-most (most significant) bits are dropped and only the right-most bits that fit are kept.

What will the result be for the following <<10::3>>?

  • A: <<0::1, 1::1, 0::1>>
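The truncation rule can be checked directly. In this sketch, 10 is 1010 in binary, so a 3-bit segment keeps only the low bits 010, which is the integer 2.

```elixir
# 10 is 1010 in binary; a 3-bit segment keeps only 010.
truncated = <<10::3>>
IO.inspect(truncated == <<0::1, 1::1, 0::1>>)   # true
IO.inspect(truncated == <<2::3>>)               # true

# The same rule applies to a default 8-bit segment: 256 needs 9 bits,
# so the left-most bit is dropped and the byte becomes 0.
overflowed = <<256>>
IO.inspect(overflowed == <<0>>)                 # true
```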

Is every bitstring a binary? Is binary a bitstring?

  • No, because a bitstring can have any number of bits (each segment defaults to 8 bits, i.e. 1 byte), so its size need not be divisible by 8.
  • Yes, because every binary has a number of bits divisible by 8, which is a valid size for a bitstring.
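Both answers can be confirmed with the `is_bitstring/1` and `is_binary/1` guards. The variable names are illustrative.

```elixir
three_bits = <<1::3>>
two_full_bytes = <<1, 2>>

# A 3-bit value is a bitstring but not a binary.
IO.inspect(is_bitstring(three_bits))      # true
IO.inspect(is_binary(three_bits))         # false

# A 16-bit value is both a binary and a bitstring.
IO.inspect(is_binary(two_full_bytes))     # true
IO.inspect(is_bitstring(two_full_bytes))  # true
IO.inspect(bit_size(two_full_bytes))      # 16
```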

What is a string?

"A string is a UTF-8 encoded binary"

Plain English:

  • A string is a sequence of code points that is stored using the UTF-8 encoding.
  • Each unit of storage is a byte (8 bits), which is what makes a string a binary.
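A brief sketch showing that a string really is a binary under the hood. For ASCII text each character is one byte, so "CAT" equals the three bytes 67, 65, 84.

```elixir
word = "CAT"

# A string passes the binary check.
IO.inspect(is_binary(word))            # true

# The same string written byte by byte.
IO.inspect(word == <<67, 65, 84>>)     # true

# Three ASCII characters take three bytes.
IO.inspect(byte_size(word))            # 3
```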

What is the rule used to encode a string to UTF-8?

See below section for answer.

How to encode 'A' to UTF-8?

  1. Determine the character's code point: 'A' = 65.
  2. Convert the code point from decimal to binary: 65 = 1000001.
  3. Determine whether it requires 1, 2, 3 or 4 bytes to represent the binary. Code points up to 127 fit in 1 byte, so 'A' needs 1 byte.
  4. Determine the encoding format for that byte. The 1-byte format is 0xxxxxxx: a leading 0 followed by 7 bits of data. So our final binary representation is 01000001. We added a leading 0.
  5. Convert the new binary to hex: 01000001 = 41.
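The steps above can be traced in Elixir as a rough sketch. It only covers the 1-byte case from the example, and the variable names are illustrative.

```elixir
# Step 1: the code point of A.
a_code_point = ?A                                       # 65

# Step 2: decimal to binary digits.
a_bits = Integer.to_string(a_code_point, 2)             # "1000001"

# Steps 3 and 4: 7 bits fit the 1-byte format 0xxxxxxx,
# so pad on the left with a single 0 to fill the byte.
a_byte = String.pad_leading(a_bits, 8, "0")             # "01000001"

# Step 5: binary to hex.
a_hex = a_byte |> String.to_integer(2) |> Integer.to_string(16)
IO.inspect(a_hex)                                       # "41"

# Elixir's own UTF-8 encoding of the code point gives the same byte.
IO.inspect(<<a_code_point::utf8>> == <<0x41>>)          # true
```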

What defines a valid string? Why aren't all binaries a valid string?

  • UTF-8 encoding is what defines a valid string. Any binary that doesn't adhere to the UTF-8 byte format is an invalid string.
  • Because the UTF-8 encoding rules only allow certain byte sequences, not every binary is a valid string.
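`String.valid?/1` reports whether a binary is valid UTF-8. A minimal sketch with a few hand-picked byte sequences follows; the variable names are illustrative.

```elixir
# Plain ASCII text is valid UTF-8.
IO.inspect(String.valid?("CAT"))            # true

# The byte 0xFF never appears in valid UTF-8.
never_valid = <<0xFF>>
IO.inspect(String.valid?(never_valid))      # false

# 10000001 is a continuation byte with no leading byte before it.
lone_continuation = <<0b10000001>>
IO.inspect(String.valid?(lone_continuation)) # false

# The three bytes that encode U+0B83 form a valid string.
tamil_utf8 = <<0xE0, 0xAE, 0x83>>
IO.inspect(String.valid?(tamil_utf8))       # true
```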

Problem Set:

Q: What is the hex for the following character: 陖? The code point is 38486.

Q: In elixir what will the following expression print.

  1. iex(1)> 'A'
    • Prints the charlist 'A' (Elixir 1.15 and later display it as ~c"A").
  2. iex(1)> ?A
    • Prints 65, the code point for the character 'A'.
  3. iex(1)> 65
    • Prints the integer 65.
  4. iex(1)> [65]
    • Prints the charlist 'A', since a list of integers that are all printable code points is displayed as characters.
  5. iex(1)> [[65]]
    • Prints a nested list. In this case, a list containing one charlist: ['A'].
  6. iex(1)> 'CAT' == [?C, ?A, ?T]
    • true will be printed. The left-hand side is a list of code points, which is the same as the right-hand side. This follows from our finding in question #4.
  7. 'CAT' == [67, 65, 84]
    • true will be printed for the same reason as above.
  8. 'CAT' == [[67],[65],[84]]
    • false will be printed. The right-hand side is a nested list, similar to question #5.
  9. "CAT" == [67, 65, 84]
    • false will be printed. The left is a string (a binary) and the right is a list of code points.
  10. String.to_charlist("CAT") == [67, 65, 84]
    • true will be printed. String.to_charlist returns the code points as integers, source
  11. String.codepoints("CAT") == [67, 65, 84]
    • false will be printed. String.codepoints returns each code point as a string, ["C", "A", "T"], source
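The comparison answers above can be checked with a short script. This sketch uses the `~c"CAT"` sigil, which builds the same charlist as 'CAT', and the variable names are illustrative.

```elixir
cat_charlist = ~c"CAT"
cat_string = "CAT"

# Questions 6 and 7: a charlist is a list of code points.
IO.inspect(cat_charlist == [?C, ?A, ?T])                      # true
IO.inspect(cat_charlist == [67, 65, 84])                      # true

# Question 8: a nested list is a different structure.
IO.inspect(cat_charlist == [[67], [65], [84]])                # false

# Question 9: a string is a binary, not a list.
IO.inspect(cat_string == [67, 65, 84])                        # false

# Questions 10 and 11: integers versus single-character strings.
IO.inspect(String.to_charlist(cat_string) == [67, 65, 84])    # true
IO.inspect(String.codepoints(cat_string))                     # ["C", "A", "T"]
```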

Fun facts:

Q: What is the difference between Mb and MB?

"Mb stands for Megabits which is equal to 1,000,000 bits"

"MB stands for Megabytes which is equal to 8,000,000 bits"

Resources: