How Many Bytes Are in an Emoji?

2024-09-30

What Is a Byte?

A byte is made up of 8 bits. In the old days, one byte represented one character. If you use a single-byte character set such as WE8MSWIN1252 or WE8ISO8859P15, it still does.

What Is a Character?

We can find definitions, for example, on Wikipedia and in the documentation for the Oracle Database. The current Unicode standard (version 16) defines 154,998 characters. The CodeCharts.pdf is part of the standard and describes all those characters on 3113 pages.

Why Does It Matter?

A character is a part of a string, and we store strings in the database. The SQL standard defines a <character string type> with an optional <character maximum length> in octets (bytes) or characters. In other words, the size of a character string can be defined either by the number of bytes or by the number of characters it may contain. In the Oracle Database, we also have a hard limit of 4000 or 32767 bytes for data types like varchar or varchar2, depending on the max_string_size parameter (char is limited to 2000 bytes). So we should know exactly what the size of a string data type means.
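To make the difference between the two units concrete, here is a minimal sketch in Python (used instead of SQL so that it runs without a database). The helper name fits and the limit of 10 are made up for illustration. It checks the same string once against a limit counted in characters and once against a limit counted in UTF-8 bytes.

```python
def fits(value: str, limit: int, unit: str) -> bool:
    """Return True if value fits into limit, counted in 'char' or 'byte' units."""
    if unit == "char":
        size = len(value)                  # number of Unicode code points
    elif unit == "byte":
        size = len(value.encode("utf-8"))  # number of bytes in UTF-8
    else:
        raise ValueError(f"unknown unit: {unit}")
    return size <= limit


price = "\u20AC" * 4  # four euro signs: 4 characters, 3 bytes each in UTF-8

print(fits(price, 10, "char"))  # True: 4 characters
print(fits(price, 10, "byte"))  # False: 12 bytes
```

In Oracle terms, this roughly mirrors the difference between varchar2(10 char) and varchar2(10 byte).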

Querying UTF-8 Characters

The following query provides details about a handful of strings encoded in UTF-8. I ran the query in my Oracle Database 23.5 with the AL32UTF8 character set.

with
   data (name, value) as (values
      ('Latin capital letter A',                       'A'),
      ('Dollar sign',                                  '$'),
      ('Copyright sign',                               '©'),
      ('Pound sign',                                   '£'),
      ('Euro sign',                                    '€'),
      ('Double exclamation mark',                      '‼'),
      ('Trademark sign',                               '™'),
      ('Grinning face',                                '😀'),
      ('Double exclamation mark (emoji)',              '‼️'),
      ('Trademark sign (emoji)',                       '™️'),
      ('Information',                                  'ℹ️'),
      ('No entry',                                     '⛔'),
      ('Woman',                                        '👩'),
      ('Woman with white hair, medium-dark skin tone', '👩🏾‍🦳'),
      ('Man',                                          '👨'),
      ('Girl',                                         '👧'),
      ('Boy',                                          '👦'),
      ('Family',                                       '👨‍👩‍👧‍👦'),
      ('Kiss: woman, man, medium-light skin tone',     '👩🏼‍❤️‍💋‍👨🏼')
   )
select name,
       value,
       length(value)  as len_in_chars,
       lengthb(value) as len_in_bytes,
       substr(dump(value, 16), instr(dump(value, 16), ':') + 2) as bytes_as_hex_list
  from data;

NAME                                         VALUE LEN_IN_CHARS LEN_IN_BYTES BYTES_AS_HEX_LIST
-------------------------------------------- ----- ------------ ------------ --------------------------------------------------------------------------------------------------------
Latin capital letter A                       A                1            1 41
Dollar sign                                  $                1            1 24
Copyright sign                               ©                1            2 c2,a9
Pound sign                                   £                1            2 c2,a3
Euro sign                                    €                1            3 e2,82,ac
Double exclamation mark                      ‼                1            3 e2,80,bc
Trademark sign                               ™                1            3 e2,84,a2                                                                  
Grinning face                                😀               1            4 f0,9f,98,80                                                               
Double exclamation mark (emoji)              ‼️               2            6 e2,80,bc,ef,b8,8f                                                         
Trademark sign (emoji)                       ™️               2            6 e2,84,a2,ef,b8,8f                                                         
Information                                  ℹ️               2            6 e2,84,b9,ef,b8,8f                                                         
No entry                                     ⛔               1            3 e2,9b,94                                                                  
Woman                                        👩               1            4 f0,9f,91,a9                                                               
Woman with white hair, medium-dark skin tone 👩🏾‍🦳               4           15 f0,9f,91,a9,f0,9f,8f,be,e2,80,8d,f0,9f,a6,b3
Man                                          👨               1            4 f0,9f,91,a8                                                               
Girl                                         👧               1            4 f0,9f,91,a7                                                               
Boy                                          👦               1            4 f0,9f,91,a6                                                               
Family                                       👨‍👩‍👧‍👦               7           25 f0,9f,91,a8,e2,80,8d,f0,9f,91,a9,e2,80,8d,f0,9f,91,a7,e2,80,8d,f0,9f,91,a6
Kiss: woman, man, medium-light skin tone     👩🏼‍❤️‍💋‍👨🏼              10           35 f0,9f,91,a9,f0,9f,8f,bc,e2,80,8d,e2,9d,a4,ef,b8,8f,e2,80,8d,f0,9f,92,8b,e2,80,8d,f0,9f,91,a8,f0,9f,8f,bc

19 rows selected.

Value vs. Character

The len_in_chars column should make it clear that the thing shown in the value column is not necessarily a single character. It’s a grapheme (Unicode calls it a grapheme cluster) – the smallest functional unit of a writing system. A grapheme can be built from more than one character.
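The two length columns can be cross-checked outside the database. The following Python sketch is an illustration, not part of the original query. It relies on the fact that Python's len counts code points, which matches what length returned for these values. The strings are written with escape sequences so that invisible characters such as the zero width joiner are explicit, and the names samples and sizes are my own.

```python
samples = {
    "Latin capital letter A": "A",
    "Euro sign": "\u20AC",
    "Grinning face": "\U0001F600",
    "Trademark sign (emoji)": "\u2122\uFE0F",
    "Family": "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466",
}


def sizes(value: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for value."""
    return len(value), len(value.encode("utf-8"))


for name, value in samples.items():
    chars, nbytes = sizes(value)
    print(f"{name:<24} {chars:>2} {nbytes:>2}")
```

The family emoji reports 7 code points and 25 bytes, the same numbers as in the query result above.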

Monospaced Font

I’m using a monospaced font for code in this blog. However, some graphemes are wider than others, so the fixed-width font no longer works as expected. This is why the result is not nicely aligned: padding a result grid with spaces does not work anymore. Most emojis are wider than two spaces but narrower than three.

Same Looking Graphemes

The ‼ (Double exclamation mark) can be represented with 3 bytes or 6 bytes. The additional three bytes are ef,b8,8f, the UTF-8 encoding of the variation selector U+FE0F. It requests the emoji presentation of the preceding “normal” character. As a result, we expect the emoji to look different. However, the rendering depends on the font and, in this case, on the browser. Just because two graphemes look the same does not mean they are identical.
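A small Python sketch illustrates the point. The variable names are mine, and stripping U+FE0F before comparing is only one possible approach for this illustration, not a general recommendation.

```python
VS16 = "\uFE0F"  # variation selector U+FE0F

text_style = "\u203C"          # double exclamation mark
emoji_style = "\u203C" + VS16  # same base character followed by U+FE0F

print(text_style == emoji_style)                    # False
print(len(text_style.encode("utf-8")))              # 3
print(len(emoji_style.encode("utf-8")))             # 6
print(emoji_style.replace(VS16, "") == text_style)  # True
```

The two strings may render identically, yet a plain equality check treats them as different values.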

Codepoint to Bytes

A Unicode character is identified by a code point, and UTF-8 encodes each code point as a sequence of one to four bytes. Wikipedia describes quite well how a code point is converted into a byte sequence. I’ve created a PL/SQL package to convert a code point to bytes and vice versa. It’s available as a Gist on GitHub.

Here’s an example of how to use it:

select utf8.codepoint_to_bytes('U+FE0F') as to_bytes,
       utf8.bytes_to_codepoint('EFB88F') as to_codepoint;

TO_BYTES TO_CODEPOINT
-------- ------------
EFB88F   U+FE0F
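For readers who want to see the conversion itself, here is a minimal sketch of the UTF-8 encoding rules in Python. It is not the PL/SQL package from the Gist. The function names merely echo it, and they work on integers and bytes rather than on 'U+...' strings.

```python
def codepoint_to_bytes(codepoint: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand."""
    if codepoint < 0 or codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if codepoint <= 0x7F:    # 1 byte:  0xxxxxxx
        return bytes([codepoint])
    if codepoint <= 0x7FF:   # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint <= 0xFFFF:  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])


def bytes_to_codepoint(data: bytes) -> int:
    """Decode a UTF-8 byte sequence that holds exactly one code point."""
    text = data.decode("utf-8")  # use the built-in decoder for the way back
    if len(text) != 1:
        raise ValueError("expected exactly one code point")
    return ord(text)


print(codepoint_to_bytes(0xFE0F).hex().upper())                # EFB88F
print(f"U+{bytes_to_codepoint(bytes.fromhex('EFB88F')):04X}")  # U+FE0F
```

The hand-written encoder is only there to show the bit layout. In real code, chr(codepoint).encode("utf-8") does the same job.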

Skin Tones and Joined Emojis

For various emojis, you can define a skin tone. This increases the size of a grapheme. The 👩🏾‍🦳 (Woman with white hair, medium-dark skin tone) consists of the following 4 characters:

👩 (Woman): U+1F469, 4 bytes
🏾 (Medium-Dark Skin Tone Modifier): U+1F3FE, 4 bytes
Zero Width Joiner: U+200D, 3 bytes
🦳 (White Hair): U+1F9B3, 4 bytes
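The same breakdown can be produced programmatically. The following Python sketch (the helper name breakdown is my own) lists the code point, the character name, and the UTF-8 size of every character in the emoji. The names come from the Unicode database bundled with the Python version in use, so they can differ from the labels above or be missing in older versions.

```python
import unicodedata


def breakdown(grapheme: str) -> list[tuple[str, str, int]]:
    """Return (code point, character name, UTF-8 byte count) per code point."""
    return [
        (f"U+{ord(ch):04X}",
         unicodedata.name(ch, "<no name in this Unicode database>"),
         len(ch.encode("utf-8")))
        for ch in grapheme
    ]


# Woman, medium-dark skin tone, zero width joiner, white hair
woman_white_hair = "\U0001F469\U0001F3FE\u200D\U0001F9B3"

for codepoint, name, size in breakdown(woman_white_hair):
    print(f"{codepoint:<8} {size} {name}")

print(sum(size for _, _, size in breakdown(woman_white_hair)))  # 15
```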

Emoji Breakdown via SQL

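The following query shows the characters of the three largest emojis in my example set. Note that e_char_len is a running total of the bytes, so the last row of each emoji shows its total size.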

with
   data (seq, emoji, bytes) as (values
      (1, '👩🏾‍🦳', json[4, 4, 3, 4]),
      (2, '👨‍👩‍👧‍👦', json[4, 3, 4, 3, 4, 3, 4]),
      (3, '👩🏼‍❤️‍💋‍👨🏼', json[4, 4, 3, 3, 3, 3, 4, 3, 4, 4])
   )
select d.emoji,
       j.seq as part,
       substr(d.emoji, j.seq, 1) as e_char,
       sum(j.bytes) over (partition by d.seq order by j.seq) as e_char_len,
       substr(dump(substr(d.emoji, j.seq, 1), 16), 14) as e_char_bytes
  from data d,
       json_table(
          d.bytes, '$[*]'
          columns (
             seq for ordinality,
             bytes number path '$'
          )
       ) j
 order by d.seq, j.seq;

EMOJI       PART E_CHAR E_CHAR_LEN E_CHAR_BYTES
----- ---------- ------ ---------- ------------
👩🏾‍🦳             1 👩              4 f0,9f,91,a9 
👩🏾‍🦳             2 🏾               8 f0,9f,8f,be 
👩🏾‍🦳             3 ‍               11 e2,80,8d    
👩🏾‍🦳             4 🦳             15 f0,9f,a6,b3 
👨‍👩‍👧‍👦             1 👨              4 f0,9f,91,a8 
👨‍👩‍👧‍👦             2 ‍                7 e2,80,8d    
👨‍👩‍👧‍👦             3 👩             11 f0,9f,91,a9 
👨‍👩‍👧‍👦             4 ‍               14 e2,80,8d    
👨‍👩‍👧‍👦             5 👧             18 f0,9f,91,a7 
👨‍👩‍👧‍👦             6 ‍               21 e2,80,8d    
👨‍👩‍👧‍👦             7 👦             25 f0,9f,91,a6 
👩🏼‍❤️‍💋‍👨🏼             1 👩              4 f0,9f,91,a9 
👩🏼‍❤️‍💋‍👨🏼             2 🏼               8 f0,9f,8f,bc 
👩🏼‍❤️‍💋‍👨🏼             3 ‍               11 e2,80,8d    
👩🏼‍❤️‍💋‍👨🏼             4 ❤              14 e2,9d,a4    
👩🏼‍❤️‍💋‍👨🏼             5 ️               17 ef,b8,8f    
👩🏼‍❤️‍💋‍👨🏼             6 ‍               20 e2,80,8d    
👩🏼‍❤️‍💋‍👨🏼             7 💋             24 f0,9f,92,8b 
👩🏼‍❤️‍💋‍👨🏼             8 ‍               27 e2,80,8d    
👩🏼‍❤️‍💋‍👨🏼             9 👨             31 f0,9f,91,a8 
👩🏼‍❤️‍💋‍👨🏼            10 🏼              35 f0,9f,8f,bc 

21 rows selected.

Depending on your browser, the result may or may not show the skin tone modifier emoji.

Conclusion

Theoretically, a grapheme can be built from an unlimited number of characters. As far as I know, the kiss emoji 👩🏼‍❤️‍💋‍👨🏼 with two people and two skin tone modifiers is currently the maximum, with 10 characters and 35 bytes. This makes sizing a column with emojis a bit challenging.

Going for the maximum byte size is still a bad idea IMO. We would lose important information about our data. Even if the number of characters is not a perfect fit for a string containing emojis, it still gives the consumers an idea of how large a string can be. This helps when writing reports and similar.
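As a rough aid for sizing, the arithmetic can be sketched in a few lines of Python. The sketch assumes UTF-8 (AL32UTF8 in Oracle terms), where a single code point never needs more than 4 bytes. The function names are made up for illustration.

```python
MAX_BYTES_PER_CODE_POINT = 4  # upper bound for a single code point in UTF-8


def worst_case_bytes(char_limit: int) -> int:
    """Bytes needed if every character of a char-sized column takes 4 bytes."""
    return char_limit * MAX_BYTES_PER_CODE_POINT


def guaranteed_chars(byte_limit: int) -> int:
    """Characters that always fit into byte_limit, whatever they are."""
    return byte_limit // MAX_BYTES_PER_CODE_POINT


print(worst_case_bytes(100))    # 400
print(guaranteed_chars(4000))   # 1000
print(guaranteed_chars(32767))  # 8191
```

So a length defined in characters is only guaranteed to be usable in full while its worst case stays within the byte limit.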

I reckon columns with emojis are still the exception in an Oracle Database. However, it is good to know that a single emoji can take up to 35 times as many bytes as a basic Latin character such as A.