Extract Graphemes
Extract Unicode Grapheme Clusters Instantly
Splitting strings in programming languages often destroys complex characters. A “Family Emoji” (👨‍👩‍👧‍👦) is not one character in memory; it is seven code points: four person emoji joined by three Zero Width Joiners (ZWJ). This tool uses the UAX #29 Unicode Text Segmentation algorithm to correctly identify and extract Grapheme Clusters (user-perceived characters), keeping accents, ZWJ sequences, and Emojis intact.
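As a quick check of the seven-code-point figure, here is a minimal Python sketch. The variable names are illustrative, and the emoji is written with escape sequences so that the invisible joiners are explicit in the source.

```python
# Family emoji: four person emoji joined by three ZERO WIDTH JOINERs (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

# Python's len() counts code points, not user-perceived characters.
code_points = [f"U+{ord(ch):04X}" for ch in family]
print(len(family))    # 7
print(code_points)

# The storage size depends on the encoding.
utf8_bytes = len(family.encode("utf-8"))
utf16_units = len(family.encode("utf-16-le")) // 2
print(utf8_bytes)     # 25
print(utf16_units)    # 11
```

The UTF-16 figure is why languages that index strings by UTF-16 code unit report a length of 11 for the same emoji.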
How to Segment Text Correctly
- Input Text: Paste text containing complex scripts (Hindi, Arabic) or multi-part Emojis (e.g., 🤦🏼‍♂️) into the input box.
- Analyze: Our algorithm parses Combining Marks and Zero Width Joiners to find true boundaries.
- Extract: Get a clean list of individual graphemes, ready for iteration in Python, JavaScript, or Swift.
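The steps above can be sketched in Python using only the standard library. This is a deliberately simplified illustration, not this tool's implementation and not a full UAX #29 segmenter: the names `is_extender` and `split_graphemes_sketch` are hypothetical, and the sketch ignores regional-indicator flag pairs, Hangul syllables, CRLF, and Indic conjunct rules. It only attaches combining marks, ZWJ-joined characters, skin tone modifiers, and tag characters to the preceding character.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER


def is_extender(ch: str) -> bool:
    """True if ch should attach to the preceding character (simplified rule)."""
    cp = ord(ch)
    return (
        unicodedata.category(ch).startswith("M")  # combining marks, variation selectors
        or ch == ZWJ
        or 0x1F3FB <= cp <= 0x1F3FF               # emoji skin tone modifiers
        or 0xE0020 <= cp <= 0xE007F               # emoji tag characters
    )


def split_graphemes_sketch(text: str) -> list[str]:
    """Group code points into approximate grapheme clusters."""
    clusters: list[str] = []
    for ch in text:
        # Attach extenders, and anything that directly follows a ZWJ.
        if clusters and (is_extender(ch) or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # man facepalming, medium-light skin
print(len(facepalm))                          # 5 code points
print(len(split_graphemes_sketch(facepalm)))  # 1 cluster
print(split_graphemes_sketch("an\u0303o"))    # three clusters: a, n + tilde, o
```

For production use, a library that implements the full UAX #29 rules is the safer choice; for example, the third-party `regex` package matches extended grapheme clusters with `\X`.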
Why Standard Splitting Fails
In many legacy systems, a string is just a sequence of bytes. If you try to split the flag of Scotland (🏴󠁧󠁢󠁳󠁣󠁴󠁿) using a standard split method, you get a black flag followed by several invisible tag characters.
This happens because the flag is an emoji tag sequence: a base emoji (U+1F3F4 WAVING BLACK FLAG), followed by five invisible tag characters spelling the region code “gbsct”, and closed by a CANCEL TAG (U+E007F). No ZWJ is involved in this particular flag. A Grapheme Extractor respects these “glue” characters, ensuring the visual symbol remains a single indivisible unit during processing.
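To see exactly what a naive split returns for this flag, the following Python sketch lists its code points. The variable names are illustrative, and the flag is written with escapes so the invisible characters are explicit.

```python
import unicodedata

# Flag of Scotland, spelled out code point by code point.
scotland = "\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F"

# A naive per-code-point split: one visible flag plus six invisible characters.
parts = list(scotland)
for ch in parts:
    print(f"U+{ord(ch):05X}  {unicodedata.name(ch, '<no name>')}")

# Tag letters mirror ASCII at an offset of 0xE0000, so the region code can be decoded.
region = "".join(chr(ord(ch) - 0xE0000) for ch in parts[1:-1])
print(region)  # gbsct
```

Rendering only `parts[0]` shows a plain black flag, which is the breakage described above.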
Naive Split vs. Grapheme Split
| Input: “ñ” (n + combining tilde, U+0303) | Standard .split('') | Our Grapheme Tool |
|---|---|---|
| Result | [‘n’, ‘˜’] (Broken) | [‘ñ’] (Correct) |
| Visual | Accent detached from letter | Accent remains attached |
| Length Count | 2 Items | 1 Item |
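The naive column of the table can be reproduced with a short Python sketch. It assumes the input is in decomposed form (a plain "n" followed by U+0303 COMBINING TILDE); the same letter also exists as a single precomposed code point, so the naive count depends on how the text was normalized.

```python
import unicodedata

decomposed = "n\u0303"          # 'n' followed by COMBINING TILDE
naive = list(decomposed)        # naive split, one item per code point
print(len(naive))               # 2

# NFC normalization composes the pair into the single code point U+00F1.
precomposed = unicodedata.normalize("NFC", decomposed)
print(len(precomposed))         # 1
print(precomposed == "\u00F1")  # True
```

Normalization only helps where a precomposed form exists. Many mark combinations, and all emoji sequences, have none, which is why grapheme segmentation is still needed.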
Frequently Asked Questions
Q. What is a Grapheme Cluster?
A Grapheme Cluster is what a human perceives as a single character. For example, “A” is a grapheme. “A” followed by a combining ring above (Å) is also a single grapheme, even when it is stored as two separate code points (U+0041 and U+030A) in memory.
Q. Does this handle Skin Tones?
Yes. Emojis with skin tone modifiers (e.g., 👍🏽) consist of a base emoji + a modifier code point (one of the five modifiers in the range U+1F3FB to U+1F3FF). This tool keeps them bonded together as one unit.
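The two-part structure can be inspected with a minimal Python sketch. The helper name `is_skin_tone_modifier` is hypothetical and simply checks the modifier range U+1F3FB to U+1F3FF.

```python
# Thumbs up with a medium skin tone: base emoji + modifier code point.
thumbs = "\U0001F44D\U0001F3FD"
base, modifier = thumbs  # unpacks into exactly two code points


def is_skin_tone_modifier(ch: str) -> bool:
    """True for the five emoji skin tone modifiers (U+1F3FB to U+1F3FF)."""
    return 0x1F3FB <= ord(ch) <= 0x1F3FF


print(len(thumbs))                      # 2
print(is_skin_tone_modifier(base))      # False
print(is_skin_tone_modifier(modifier))  # True
```

A splitter that iterates code point by code point would return the yellow thumbs up and a detached color swatch as two separate items.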