Welcome to the world of glycan text parsing! If you’ve ever worked with glycan data from different sources, you know the frustration: every database, software tool, and research group seems to have its own way of representing glycan structures in text format.
That’s where glyparse comes to the rescue! 🚀
Think of glyparse as your universal glycan
translator — it can read glycan structures written in many
different “languages” and convert them all into a unified format that
your computer can understand and work with.
Note: All parsing functions in glyparse return
glyrepr::glycan_structure objects. If you are unfamiliar
with glyrepr, see its documentation first.
Before we dive in, let’s see what we’re dealing with. Here’s the same N-glycan core structure written in different formats:
| Format | Example | Where You’ll See It |
|---|---|---|
| IUPAC-condensed | Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc | Literature, UniCarbKB |
| IUPAC-short | Mana3(Mana6)Manb4GlcNAcb4GlcNAc | Literature, UniCarbKB |
| IUPAC-extended | alpha-D-Man-(1->3)-[alpha-D-Man-(1->6)]-beta-D-Man-(1->4)-beta-D-GlcNAc-(1->4)-D-GlcNAc | Literature, UniCarbKB |
| GlycoCT | Complex multi-line format | Literature, GlycomeDB |
| WURCS | WURCS=2.0/3,5,4/[...]/1-1-2-3-3/a4-b1_b4-c1... | Literature, GlyTouCan |
| Linear Code | Ma3(Ma6)Mb4GNb4GNb | Literature |
| pGlyco | (N(N(H(H(H))))) | pGlyco software results |
| StrucGP | A2B2C1D1E2fedcba | StrucGP software results |
Confusing, right? 😵‍💫 glyparse understands them all!
glyparse provides eight specialized parsers, each optimized for a specific format:

- parse_iupac_condensed(): The most common format
- parse_iupac_short(): Compact literature format
- parse_iupac_extended(): Verbose formal format
- parse_glycoct(): Database standard format
- parse_wurcs(): Modern standardized format
- parse_linear_code(): Linear Code format
- parse_pglyco_struc(): pGlyco software format
- parse_strucgp_struc(): StrucGP software format

All parsers follow the same pattern: pass in a character vector of structure strings and get back a glyrepr::glycan_structure object that you can analyze.

auto_parse()

Don’t know what you’re dealing with? Give it to auto_parse()! This function tries to identify the format automatically and use the appropriate parser. A single input vector may even mix formats.
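Automatic detection is plausible because the formats differ in easily recognized surface features. The base R sketch below is a hypothetical illustration only: `guess_format()` and its three rules are assumptions made for this example, they cover only a few of the formats, and they are not how auto_parse() is implemented.

```r
# Hypothetical format sniffing by surface features.
# This is NOT glyparse's detection logic, only an illustration.
guess_format <- function(x) {
  vapply(x, function(s) {
    if (startsWith(s, "WURCS=")) {
      # WURCS strings carry an explicit prefix
      "wurcs"
    } else if (grepl("^\\([A-Za-z()]+\\)$", s)) {
      # pGlyco strings consist solely of letters and nested parentheses
      "pglyco"
    } else if (grepl("\\([ab?][0-9?]-[0-9?]", s)) {
      # IUPAC-condensed linkages look like "(b1-3" or "(a2-3"
      "iupac_condensed"
    } else {
      "unknown"
    }
  }, character(1), USE.NAMES = FALSE)
}

guess_format(c(
  "Gal(b1-3)GalNAc(b1-",
  "(N(F)(N(H(H(N))(H(N(H))))))",
  "WURCS=2.0/3,3,2/[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/1-2-3/a4-b1_b3-c1"
))
#> [1] "iupac_condensed" "pglyco"          "wurcs"
```

In practice there is no need to write rules like these; call auto_parse() directly on the mixed vector: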
x <- c(
"Gal(b1-3)GalNAc(b1-",
"(N(F)(N(H(H(N))(H(N(H))))))",
"WURCS=2.0/3,3,2/[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/1-2-3/a4-b1_b3-c1"
)
auto_parse(x)
#> <glycan_structure[3]>
#> [1] Gal(b1-3)GalNAc(b1-
#> [2] Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
#> [3] Man(a1-3)Man(b1-4)Glc(b1-
#> # Unique structures: 3
Let’s start with the IUPAC formats.
This format is widely used in scientific literature and databases like UniCarbKB.
Want to know more about IUPAC-condensed format? Check this out!
# Single structure
iupac_condensed <- "Neu5Ac(a2-3)Gal(b1-4)[Fuc(a1-3)]GlcNAc(b1-4)Gal(b1-4)Glc(a1-"
parse_iupac_condensed(iupac_condensed)
#> <glycan_structure[1]>
#> [1] Neu5Ac(a2-3)Gal(b1-4)[Fuc(a1-3)]GlcNAc(b1-4)Gal(b1-4)Glc(a1-
#> # Unique structures: 1
# Multiple structures at once
glycans <- c(
"Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-", # N-glycan core
"Gal(b1-3)GalNAc(b1-", # O-glycan core 1
"Neu5Ac(a2-3)Gal(b1-3)[GlcNAc(b1-6)]GalNAc(b1-" # O-glycan core 2
)
parse_iupac_condensed(glycans)
#> <glycan_structure[3]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [2] Gal(b1-3)GalNAc(b1-
#> [3] Neu5Ac(a2-3)Gal(b1-3)[GlcNAc(b1-6)]GalNAc(b1-
#> # Unique structures: 3
This compact format is popular in research papers because it saves space:
# The same structures in short format
iupac_short <- c(
"Mana3(Mana6)Manb4GlcNAcb4GlcNAcb-",
"Galb3GalNAcb-",
"Neu5Aca3Galb3(GlcNAcb6)GalNAcb-"
)
parse_iupac_short(iupac_short)
#> <glycan_structure[3]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [2] Gal(b1-3)GalNAc(b1-
#> [3] Neu5Ac(a2-3)Gal(b1-3)[GlcNAc(b1-6)]GalNAc(b1-
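#> # Unique structures: 3
Notice how much more compact this is! The short form drops the anomeric carbon number, so the parser fills it in: C2 for Neu5Ac and C1 for the other residues in this example. Neu5Aca3 becomes Neu5Ac(a2-3), while Mana3 becomes Man(a1-3).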
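That inference can be pictured with a small base R sketch. `expand_short_token()` is a hypothetical helper written for this example, not part of glyparse, and the list of residues assumed to link through C2 is an illustrative assumption rather than the package's actual rule set.

```r
# Hypothetical helper: expand one IUPAC-short residue token such as "Mana3"
# into its IUPAC-condensed form. Not a glyparse function.
expand_short_token <- function(token) {
  # Capture residue name, anomer (a or b), and position on the next residue
  m <- regmatches(token, regexec("^(.+)([ab])([0-9])$", token))[[1]]
  if (length(m) == 0) {
    stop("unrecognised token: ", token)
  }
  residue <- m[2]
  anomer <- m[3]
  position <- m[4]
  # Assumption: these sialic acids link through C2, everything else through C1
  c2_residues <- c("Neu5Ac", "Neu5Gc", "Kdn")
  carbon <- if (residue %in% c2_residues) 2L else 1L
  sprintf("%s(%s%d-%s)", residue, anomer, carbon, position)
}

expand_short_token("Neu5Aca3")
#> [1] "Neu5Ac(a2-3)"
expand_short_token("Mana3")
#> [1] "Man(a1-3)"
```

The sketch handles single linear tokens only; branching parentheses and the reducing-end suffix such as "b-" are left out to keep it short.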
This verbose format includes full chemical names and stereochemistry. Parse it with parse_iupac_extended().
GlycoCT is used in literature for precise representation and in databases like GlycomeDB. It’s more complex but extremely precise. Parse it with parse_glycoct().
WURCS (Web3 Unique Representation of Carbohydrate Structures) is used in literature for complex structures and in databases like GlyTouCan. Parse it with parse_wurcs().
If you work with glycoproteomics, you might encounter pGlyco’s parenthetical notation:
pglyco <- "(N(F)(N(H(H(N))(H(N(H))))))"
parse_pglyco_struc(pglyco)
#> <glycan_structure[1]>
#> [1] Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
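#> # Unique structures: 1
This cryptic notation actually represents a complex N-glycan. As the output shows, each letter stands for a monosaccharide class (N for HexNAc, H for Hex, F for dHex) and the nested parentheses encode branching. pGlyco does not report linkages, so they appear as ??-? in the parsed structure.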
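To make the letter codes concrete, the base R sketch below counts the composition of a pGlyco string. `pglyco_composition()` is a hypothetical helper, not a glyparse function, and it only knows the three codes visible in the output above (N, H, F); any other letter is rejected rather than guessed.

```r
# Hypothetical helper: count monosaccharide classes in a pGlyco string.
# Only the codes seen in the parsed output above are mapped (an assumption).
pglyco_composition <- function(x) {
  codes <- c(N = "HexNAc", H = "Hex", F = "dHex")
  # Drop the parentheses and split the remaining letters
  symbols <- strsplit(gsub("[()]", "", x), "")[[1]]
  unknown <- setdiff(symbols, names(codes))
  if (length(unknown) > 0) {
    stop("unhandled pGlyco code(s): ", paste(unknown, collapse = ", "))
  }
  # Count each code, keeping zero counts for absent classes
  counts <- table(factor(symbols, levels = names(codes)))
  stats::setNames(as.integer(counts), unname(codes))
}

pglyco_composition("(N(F)(N(H(H(N))(H(N(H))))))")
#> HexNAc    Hex   dHex 
#>      4      4      1
```

The counts agree with the parsed structure above, which contains four HexNAc, four Hex, and one dHex residue.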
StrucGP uses a letter-based encoding system, which parse_strucgp_struc() reads.
glyparse transforms the chaos of glycan text formats
into order. No matter where your glycan data comes from (databases,
literature, or software tools), you can now parse it into a
glyrepr::glycan_structure object for further analysis. In fact, the
glyread package uses these parsing functions internally
when reading output from common glycopeptide identification
software.
Next steps:

- glyrepr package for structure manipulation
- glymotif for motif analysis of your parsed structures
- glyexp for experimental data analysis
- glycoverse ecosystem!

Happy parsing! 🧬✨