BigBang-Proton is an LLM pretrained on cross-scale, cross-structure, cross-discipline real-world scientific tasks to construct a scientific multi-task learner. It achieves language-guided scientific computing through Theory-Experiment Learning, Binary Patch Encoding, and Monte Carlo Attention.

超对称

📚 Resources

BigBang-Proton: The First Unified Architecture for Scientific Multi-Task Learning

Next-Word Prediction Is a Scientific Multi-Task Learner

(Figure: overall performance summary)

BigBang-Proton is the first generalist architecture designed from the ground up to unify language, equations, DNA, sensor signals, time series, images, and experimental numerical data into a single auto-regressive, next-word-prediction framework, enabling true scientific multi-task learning across physics, chemistry, biology, materials science, and Earth systems.

Built upon the foundation of BigBang-Neutron, Proton introduces three radical innovations that break the mold of traditional LLMs:


Core Innovations

1. Binary Patch Encoding (No Tokenizer, No BPE)

Get rid of tokenization. BigBang-Proton inherits BigBang-Neutron's Binary Patch Encoding, which encodes everything (text, numbers, formulas, DNA, sensor streams) as raw binary sequences grouped into patches. This eliminates the catastrophic failure of BPE on numerical data and enables perfect 50-digit arithmetic, precise genome modeling, and lossless ingestion of scientific data. Binary Patch Encoding has already proven highly effective for encoding large-scale experimental numerical data, especially in Big Science data analysis. In BigBang-Proton it further demonstrates its ability to encode mixtures of text, large-scale numerical datasets, and other modalities, laying the foundation for an ultimate unified architecture for a material-world foundational model.
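A minimal sketch of the idea in Python, assuming a fixed patch size and simple zero-padding (the actual patch size, padding scheme, and patch hierarchy of the released model are not specified here); it only illustrates how text, long numbers, and DNA all pass through the same byte-level path with no tokenizer or vocabulary.

```python
from typing import List

PATCH_SIZE = 16  # assumed bytes per patch (illustrative, not the released configuration)

def to_binary_patches(sample: str, patch_size: int = PATCH_SIZE) -> List[List[int]]:
    """Encode any string as raw UTF-8 bytes grouped into fixed-size patches (no BPE, no vocabulary)."""
    raw = sample.encode("utf-8")
    padded = raw + b"\x00" * (-len(raw) % patch_size)  # pad the tail so the last patch is full
    return [list(padded[i:i + patch_size]) for i in range(0, len(padded), patch_size)]

# Text, a 50-digit number, a formula, and DNA all take the same byte-level path:
patches = to_binary_patches(
    "E = mc^2 | 12345678901234567890123456789012345678901234567890 | ACGTACGT"
)
print(len(patches), patches[0])
```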

2. Theory-Experiment Learning Paradigm

Science is not just theory; it is theory plus experiment. Proton treats the two as aligned modalities: theoretical text (papers, equations, hypotheses) and experimental data (numerical data, tables, time series, measurements). Just as multimodal models align images with text, Proton learns to ground abstract theory in concrete experimental reality, all within a single context window.
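As an illustration only, here is how a theory-experiment training sample might be assembled under this paradigm; the delimiter tags, field names, and measurement values below are invented for the example, not taken from the released training data.

```python
# Illustrative assembly of a theory-experiment sample (all content invented for this sketch).
theory = "Hypothesis: dissolved oxygen decreases as water temperature rises."
experiment_rows = [
    ("2024-01-01", 4.0, 12.8),   # (date, temperature_C, dissolved_oxygen_mg_per_L)
    ("2024-07-01", 26.5, 7.9),
]
experiment = "\n".join(f"{d}\t{t}\t{o}" for d, t, o in experiment_rows)

# Both modalities share one context window and one objective (next-patch prediction),
# so the model can ground the hypothesis text in the adjacent numerical evidence.
sample = f"<THEORY>\n{theory}\n<EXPERIMENT>\n{experiment}\n"
raw_bytes = sample.encode("utf-8")  # fed to the same byte-level patch encoder as any other data
print(len(raw_bytes), raw_bytes[:32])
```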

3. Monte Carlo Attention: Exponential Context, Linear Cost

Traditional Transformers hit a wall around 1M tokens; Monte Carlo Attention breaks it. Through its Inter-Patch Delegation Mechanism, which passes "representative" tokens between patches layer by layer, the effective context length grows exponentially with depth, reaching roughly 10³⁰ bytes at 20 layers and 10⁸⁰ bytes (the baryon count of the observable universe) at 60 layers, while compute remains linear in patch size. Structure learning, not chain-of-thought, is the path to AGI.
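A back-of-the-envelope sketch of the scaling claim, assuming each delegation layer multiplies the reachable context by a constant branching factor while per-layer attention cost stays linear in patch size; the branching factor and patch size below are illustrative, not the published configuration.

```python
# Assumed: every round of inter-patch delegation multiplies the reachable context
# by a constant branching factor, so capacity grows exponentially with depth
# while each layer's compute stays linear in patch size.
def effective_context_bytes(patch_size: int, branching: int, layers: int) -> float:
    """Context reachable after `layers` rounds of inter-patch delegation."""
    return float(patch_size) * float(branching) ** layers

for depth in (20, 40, 60):
    print(f"layers={depth:2d}  reachable ~{effective_context_bytes(16, 32, depth):.1e} bytes")
```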


Performance Summary

  • ✅ 100% accuracy on 50-digit arithmetic (no external calculator); see the evaluation sketch after this list
  • ✅ Matches specialized SOTA models in:
    • Particle physics jet tagging
    • Inter-atomic potential simulation (MAE on par with top GNN models on Matbench)
    • Genome & protein structure prediction
    • Spatiotemporal water quality forecasting
  • ✅ Generates pseudo-structures of jets, crystals, and DNA — learning the “shape” of science
  • ✅ Achieves language-guided scientific computing: solves tasks via next-patch prediction, unifying classification, regression, and generation
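
For concreteness, a hedged sketch of how the 50-digit arithmetic result could be checked; `model_generate` is a hypothetical stand-in for the model's inference call, and the prompt format is an assumption.

```python
import random

def make_problem(digits: int = 50) -> tuple[str, str]:
    """Return a prompt for adding two `digits`-digit integers and the exact answer."""
    a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
    b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
    return f"{a} + {b} =", str(a + b)

def exact_match_accuracy(model_generate, n_trials: int = 100) -> float:
    """Fraction of problems where the generated string equals the exact sum."""
    hits = 0
    for _ in range(n_trials):
        prompt, answer = make_problem()
        hits += model_generate(prompt).strip() == answer  # model_generate is hypothetical
    return hits / n_trials
```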

Why It Matters

At a high level, today's scientific AI is domain-specific or task-specific: one model for materials, another for proteins, another for weather. BigBang-Proton proves that a single, task-agnostic architecture can integrate them all by learning the universal language of the material world.

(Figure: hypothesis 1)
