Extract Graphemes

Extract Graphemes – iloveunicode.com

Extract Unicode Grapheme Clusters Instantly

Naive string splitting in most programming languages breaks complex characters apart. A “Family Emoji” (👨‍👩‍👧‍👦) is not one character in memory: it is seven code points (four person emojis joined by three Zero Width Joiners). This tool uses the UAX #29 Unicode Text Segmentation algorithm to identify and extract Grapheme Clusters (user-perceived characters), keeping accents, ZWJ sequences, and emojis intact.
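To see those seven pieces, a minimal Python sketch can list the code points one by one. The variable name `family` is illustrative only, and the emoji is written with escape sequences so each code point is explicit.

```python
import unicodedata

# Family emoji spelled out as escapes: MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

# Python's len() counts code points, not graphemes
print(len(family))  # 7

# Print each code point with its official Unicode name
for ch in family:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

Three of the seven lines printed by the loop are ZERO WIDTH JOINER, which is what binds the four people into one visible symbol.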

Input Source: Unicode String
Output Target: Grapheme Array
Standard: Unicode 15.0+
Privacy: Client-Side

How to Segment Text Correctly

  1. Input Text: Paste text containing complex scripts (Hindi, Arabic) or multi-part emojis (e.g., 🤦🏼‍♂️) into the input box.
  2. Analyze: Our algorithm parses Combining Marks and Zero Width Joiners to find the true grapheme boundaries.
  3. Extract: Get a clean list of individual graphemes, ready for iteration in Python, JavaScript, or Swift.
🔧 Troubleshooting Tip: If your programming language reports a string length of 2 for a single character (like ‘𝕏’), it is counting UTF-16 code units (the character is stored as a surrogate pair), not graphemes. This tool reveals the actual grapheme count.
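The mismatch in the tip can be reproduced in Python, which counts code points natively. The helper `utf16_length` below is a hypothetical name introduced for this sketch; it emulates the UTF-16 based length that languages such as JavaScript report.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units (2 bytes each) needed to store the text."""
    return len(text.encode("utf-16-le")) // 2

x = "\U0001D54F"  # MATHEMATICAL DOUBLE-STRUCK CAPITAL X

print(len(x))                   # 1 code point
print(utf16_length(x))          # 2 UTF-16 code units (one surrogate pair)
print(len(x.encode("utf-8")))   # 4 bytes in UTF-8
```

The same character therefore has a "length" of 1, 2, or 4 depending on the unit being counted, while a reader sees a single grapheme.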

Why Standard Splitting Fails

In many legacy systems, a string is just a sequence of bytes. If you try to split the flag of Scotland (🏴󠁧󠁢󠁳󠁣󠁴󠁿) using a standard split method, you might get a black flag and several invisible tag characters.

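This happens because the flag is an emoji tag sequence: a base emoji (the black flag, U+1F3F4) followed by invisible tag characters that spell the region code “gbsct” and a closing Cancel Tag. No ZWJ is involved. A Grapheme Extractor respects “glue” characters such as tag characters, ZWJs, and combining marks, ensuring the visual symbol remains a single indivisible unit during processing.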
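The idea of “glue” characters can be sketched in Python. This is a deliberately simplified approximation, not the full UAX #29 rule set and not this tool's actual implementation: it ignores Hangul syllables, CR LF, and prepended marks, and it joins after any ZWJ rather than only between pictographs. The function names `is_extender`, `should_join`, and `split_graphemes` are assumptions made for the example.

```python
import unicodedata

ZWJ = "\u200D"

def is_extender(ch: str) -> bool:
    """True for code points that attach to the preceding character."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks, variation selectors
        or ch == ZWJ                                     # zero width joiner
        or 0x1F3FB <= cp <= 0x1F3FF                      # emoji skin tone modifiers
        or 0xE0020 <= cp <= 0xE007F                      # tag characters, incl. CANCEL TAG
    )

def is_regional_indicator(ch: str) -> bool:
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def should_join(prev: str, ch: str) -> bool:
    """Decide whether ch belongs to the cluster collected so far."""
    if is_extender(ch):
        return True
    if prev.endswith(ZWJ):  # simplification: join anything after a ZWJ
        return True
    # Regional indicators pair up to form country flags
    return is_regional_indicator(ch) and len(prev) == 1 and is_regional_indicator(prev)

def split_graphemes(text: str) -> list:
    clusters = []
    for ch in text:
        if clusters and should_join(clusters[-1], ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Flag of Scotland: black flag + tags g, b, s, c, t + CANCEL TAG
scotland = "\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F"
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
sample = "a" + scotland + family + "n\u0303"

print(len(sample))                                # 17 code points
print([len(g) for g in split_graphemes(sample)])  # [1, 7, 7, 2]
```

Seventeen code points collapse into four clusters, which matches what a reader sees: a letter, a flag, a family, and an accented letter. For production use, a library that implements the complete UAX #29 rules is the safer choice.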

Naive Split vs. Grapheme Split

Input: “ñ” (n + ˜)  | Standard .split()           | Our Grapheme Tool
Result             | [‘n’, ‘˜’] (Broken)         | [‘ñ’] (Correct)
Visual             | Accent detached from letter | Accent remains attached
Length Count       | 2 Items                     | 1 Item
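The comparison above can be reproduced with a short Python sketch. It assumes the input is the decomposed form (n followed by U+0303 COMBINING TILDE), and `group_combining` is a hypothetical helper that handles combining marks only, not full grapheme segmentation.

```python
import unicodedata

decomposed = "n\u0303"  # LATIN SMALL LETTER N + COMBINING TILDE

# Naive split: one list item per code point
naive = list(decomposed)

def group_combining(text: str) -> list:
    """Attach each combining mark to the character before it."""
    out = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

grouped = group_combining(decomposed)

print(len(naive))    # 2: the accent is detached from the letter
print(len(grouped))  # 1: the accent stays attached
```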

Frequently Asked Questions

Q. What is a Grapheme Cluster?

A Grapheme Cluster is what a human perceives as a single character. For example, “A” is a grapheme. “A” + “ring accent” (Å) is also a single grapheme, even when it is stored in memory as two separate code points (U+0041 followed by the combining ring U+030A).
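As a small illustration in Python, the same visible Å can be stored as one code point or as two. The sketch uses the standard library's `unicodedata.normalize` to convert between the two forms; the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030A"   # A + COMBINING RING ABOVE

print(len(precomposed), len(decomposed))  # 1 2
print(precomposed == decomposed)          # False: different code points

# Normalization converts between the two storage forms
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Both strings are one grapheme to a reader, which is why counting graphemes gives a steadier answer than counting code points.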

Q. Does this handle Skin Tones?

Yes. Emojis with skin tone modifiers (e.g., 👍🏽) consist of a base emoji + a modifier code point. This tool keeps them bonded together as one unit.
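The base-plus-modifier structure can be inspected with a few lines of Python. `split_skin_tone` is a hypothetical helper written for this sketch, and it only looks at the second code point, so it does not cover longer ZWJ sequences.

```python
# The five skin tone modifiers occupy U+1F3FB through U+1F3FF
SKIN_TONE_MODIFIERS = range(0x1F3FB, 0x1F400)

thumbs_up = "\U0001F44D\U0001F3FD"  # THUMBS UP SIGN + EMOJI MODIFIER FITZPATRICK TYPE-4

def split_skin_tone(emoji: str) -> tuple:
    """Return (base, modifier); modifier is '' when there is none."""
    if len(emoji) >= 2 and ord(emoji[1]) in SKIN_TONE_MODIFIERS:
        return emoji[0], emoji[1]
    return emoji, ""

base, modifier = split_skin_tone(thumbs_up)

print(len(thumbs_up))             # 2 code points, 1 grapheme
print(f"U+{ord(base):04X}")       # U+1F44D
print(f"U+{ord(modifier):04X}")   # U+1F3FD
```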
