Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing


Source: https://www.nature.com/articles/s43588-024-00753-x


  • Jiang, A. Q. et al. Mixtral of experts. Preprint at (2024).

  • Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at (2023).

  • Gemini Team Google et al. Gemini: a family of highly capable multimodal models. Preprint at (2023).

  • Brown, T. B. et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems Vol. 33 (eds Larochelle, H. et al.) 1877–1901 (Curran Associates, 2020).

  • Kaplan, J. et al. Scaling laws for neural language models. Preprint at (2020).

  • Hoffmann, J. et al. Training compute-optimal large language models. In Proc. Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) Vol. 35 (Curran Associates, 2022).

  • Chowdhery, A. et al. PaLM: scaling language modeling with pathways. J. Mach. Learn. Res. 24, 11324–11436 (2023).

  • Jordan, M. & Jacobs, R. Hierarchical mixtures of experts and the EM algorithm. In Proc. 1993 International Conference on Neural Networks (IJCNN-93-Nagoya, Japan) Vol. 2, 1339–1344 (IEEE, 1993); https://doi.org/10.1109/IJCNN.1993.716791

  • Jacobs, R. A., Jordan, M. I., Nowlan, S. J. & Hinton, G. E. Adaptive mixtures of local experts. Neural Comput. 3, 79–87 (1991).

  • Shazeer, N. et al. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In Proc. International Conference on Learning Representations (ICLR, 2017); https://openreview.net/forum?id=B1ckMDqlg

  • Fedus, W., Zoph, B. & Shazeer, N. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 23, 5232–5270 (2022).

  • Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 5485–5551 (2020).

  • Du, N. et al. GLaM: efficient scaling of language models with mixture-of-experts. In Proc. 39th International Conference on Machine Learning, Proceedings of Machine Learning Research Vol. 162, 5547–5569 (PMLR, 2022).

  • Clark, A. et al. Unified scaling laws for routed language models. In Proc. 39th International Conference on Machine Learning Vol. 162 (eds Chaudhuri, K. et al.) 4057–4086 (PMLR, 2022).

  • Ludziejewski, J. et al. Scaling laws for fine-grained mixtures of experts. In Proc. ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models (PMLR, 2024); https://openreview.net/forum?id=Iizr8qwH7J

  • Csordás, R., Irie, K. & Schmidhuber, J. Approximating two-layer feedforward networks for efficient transformers. In Proc. Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H. et al.) 674–692 (ACL, 2023); https://doi.org/10.18653/v1/2023.findings-emnlp.49

  • Reuther, A. et al. AI and ML accelerator survey and trends. In Proc. 2022 IEEE High Performance Extreme Computing Conference (HPEC) 1–10 (IEEE, 2022); https://doi.org/10.1109/HPEC55821.2022.9926331

  • Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R. & Eleftheriou, E. Memory devices and applications for in-memory computing. Nat. Nanotechnol. 15, 529–544 (2020).

  • Lanza, M. et al. Memristive technologies for data storage, computation, encryption, and radio-frequency communication. Science 376, eabj9979 (2022).

  • Mannocci, P. et al. In-memory computing with emerging memory devices: status and outlook. APL Mach. Learn. 1, 010902 (2023).

  • Huang, Y. et al. Memristor-based hardware accelerators for artificial intelligence. Nat. Rev. Electr. Eng. 1, 286–299 (2024).

  • Le Gallo, M. et al. A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference. Nat. Electron. 6, 680–693 (2023).

  • Ambrogio, S. et al. An analog-AI chip for energy-efficient speech recognition and transcription. Nature 620, 768–775 (2023).

  • Wan, W. et al. A compute-in-memory chip based on resistive random-access memory. Nature 608, 504–512 (2022).

  • Zhang, W. et al. Edge learning using a fully integrated neuro-inspired memristor chip. Science 381, 1205–1211 (2023).

  • Wen, T.-H. et al. Fusion of memristor and digital compute-in-memory processing for energy-efficient edge computing. Science 384, 325–332 (2024).

  • Fick, L., Skrzyniarz, S., Parikh, M., Henry, M. B. & Fick, D. Analog matrix processor for edge AI real-time video analytics. In Proc. 2022 IEEE International Solid-State Circuits Conference (ISSCC) Vol. 65, 260–262 (IEEE, 2022); https://doi.org/10.1109/ISSCC42614.2022.9731773

  • Arnaud, F. et al. High-density embedded PCM cell in 28 nm FDSOI technology tailored for automotive micro-controller applications. In Proc. 2020 IEEE International Electron Devices Meeting (IEDM) 24.2.1–24.2.4 (IEEE, 2020); https://doi.org/10.1109/IEDM13553.2020.9371934

  • Lee, S. et al. A 1 Tb 4b/cell 64-stacked-WL 3D NAND flash memory with 12 MB/s program throughput. In Proc. 2018 IEEE International Solid-State Circuits Conference (ISSCC) 340–342 (IEEE, 2018); https://doi.org/10.1109/ISSCC.2018.8310323

  • Park, J.-W. et al. A 176-stacked 512 Gb 3b/cell 3D-NAND flash with 10.8 Gb/mm2 density and a peripheral circuit under the cell array architecture. In Proc. 2021 IEEE International Solid-State Circuits Conference (ISSCC) Vol. 64, 422–423 (IEEE, 2021); https://doi.org/10.1109/ISSCC42613.2021.9365809

  • Lee, S.-T. & Lee, J.-H. Neuromorphic computing using NAND flash memory architecture with pulse width modulation scheme. Front. Neurosci. 14, 571292 (2020).

  • Bavandpour, M., Sahay, S., Mahmoodi, M. R. & Strukov, D. B. 3D-aCortex: an ultra-compact energy-efficient neurocomputing platform based on commercial 3D-NAND flash memories. Neuromorphic Comput. Eng. 1, 014001 (2021).

  • Shim, W. & Yu, S. Technological design of 3D NAND-based compute-in-memory architecture for GB-scale deep neural network. IEEE Electron Device Lett. 42, 160–163 (2020).

  • Hsieh, C.-C. et al. Chip demonstration of a high-density (43 Gb) and high-bandwidth (300 Gb/s) 3D NAND-based in-memory search accelerator for ternary content addressable memory (TCAM) and Hamming distance proximity search. In Proc. 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits) 1–2 (IEEE, 2023); https://doi.org/10.23919/VLSITechnologyandCir57934.2023.10185361

  • Huo, Q. et al. A computing-in-memory macro based on three-dimensional resistive random-access memory. Nat. Electron. 5, 469–477 (2022).

  • Jain, S. et al. A heterogeneous and programmable compute-in-memory accelerator architecture for analog-AI using dense 2-D mesh. IEEE Trans. Very Large Scale Integr. VLSI Syst. 31, 114–127 (2023).

  • Cui, C. et al. A survey on multimodal large language models for autonomous driving. In Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops 958–979 (IEEE, 2024).

  • Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1, 4171–4186 (ACL, 2019); https://doi.org/10.18653/v1/N19-1423

  • Kim, W., Son, B. & Kim, I. ViLT: vision-and-language transformer without convolution or region supervision. In Proc. 38th International Conference on Machine Learning Vol. 139 (eds Meila, M. & Zhang, T.) 5583–5594 (PMLR, 2021); https://proceedings.mlr.press/v139/kim21k.html

  • Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. In Proc. Advances in Neural Information Processing Systems Vol. 35 (eds Koyejo, S. et al.) 23716–23736 (Curran Associates, 2022); https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf

  • Pope, R. et al. Efficiently scaling transformer inference. In Proc. Machine Learning and Systems Vol. 5 (eds Song, D. et al.) 606–624 (Curran Associates, 2023); https://proceedings.mlsys.org/paper_files/paper/2023/file/c4be71ab8d24cdfb45e3d06dbfca2780-Paper-mlsys2023.pdf

  • Choquette, J., Gandhi, W., Giroux, O., Stam, N. & Krashinsky, R. NVIDIA A100 Tensor Core GPU: performance and innovation. IEEE Micro 41, 29–35 (2021).

  • Radford, A. et al. Language models are unsupervised multitask learners. Semantic Scholar (2019).

  • Merity, S., Xiong, C., Bradbury, J. & Socher, R. Pointer sentinel mixture models. In Proc. International Conference on Learning Representations (ICLR, 2017); https://openreview.net/forum?id=Byj72udxe

  • Vasilopoulos, A. et al. Leveraging the state dependency of conductance …

  • Bernstein, D. & Rodeh, M. Global instruction scheduling for superscalar machines. In Proc. ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation PLDI ’91 241–255 (ACM, 1991); https://doi.org/10.1145/113445.113466

  • Joshi, V. et al. Accurate deep neural network inference using computational phase-change memory. Nat. Commun. 11, 2473 (2020).

  • Kudo, T. & Richardson, J. SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds Blanco, E. & Lu, W.) 66–71 (Association for Computational Linguistics, 2018).

  • Tillet, P., Kung, H. T. & Cox, D. Triton: an intermediate language and compiler for tiled neural network computations. In Proc. 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages 10–19 (ACM, 2019); https://doi.org/10.1145/3315508.3329973

  • Le Gallo, M. et al. Using the IBM analog in-memory hardware acceleration kit for neural network training and inference. APL Mach. Learn. 1, 041102 (2023).

  • Büchel, J. et al. AIHWKIT-Lightning: a scalable HW-aware training toolkit for analog in-memory computing. In Proc. Advances in Neural Information Processing Systems 2024 Workshop, Machine Learning with New Compute Paradigms (Curran Associates, 2024); https://openreview.net/forum?id=QNdxOgGmhR

  • Büchel, J. et al. Gradient descent-based programming of analog in-memory computing cores. In Proc. 2022 International Electron Devices Meeting (IEDM) 33.1.1–33.1.4 (IEEE, 2022); https://doi.org/10.1109/IEDM45625.2022.10019486

  • Büchel, J. Original data for figures in ‘Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing’. Zenodo (2024).

  • Büchel, J. IBM/analog-moe: code publication. Zenodo (2024).

  • Büchel, J. & Vasilopolous, A. IBM/3D-CiM-LLM-Inference-Simulator: code publication. Zenodo (2024).

  • Goda, A. 3-D NAND technology achievements and future scaling perspectives. IEEE Trans. Electron Devices 67, 1373–1381 (2020).

  • Lacaita, A. L., Spinelli, A. S. & Compagnoni, C. M. Compact solid-state storage: a lengthy journey to accomplishment. In Proc. 2021 IEEE Latin America Electron Devices Conference (LAEDC) 1–4 (IEEE, 2021); https://doi.org/10.1109/LAEDC51812.2021.9437865

  • Shoeybi, M. et al. Megatron-LM: training multi-billion parameter language models using model parallelism. Preprint at (2020).

