Sir Doge

PlanetDOGE · DogeCodez

AI & ML interests

Artificial intelligence research

Recent Activity

reacted to ajibawa-2023's post with 👍 about 2 hours ago

PHP-Code-Large Dataset: https://huggingface.co/datasets/ajibawa-2023/PHP-Code-Large

PHP-Code-Large is a large-scale corpus of PHP source code comprising more than 12 million lines of PHP code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the PHP ecosystem.

By providing a high-volume, language-specific corpus, PHP-Code-Large enables systematic experimentation in PHP-focused model training, domain adaptation, and downstream code understanding tasks. PHP-Code-Large addresses the need for a dedicated PHP-only dataset at substantial scale, enabling focused research across backend systems, CMS platforms, APIs, and full-stack PHP environments.

reacted to ajibawa-2023's post with 🔥 about 2 hours ago

Python-Code-Large Dataset: https://huggingface.co/datasets/ajibawa-2023/Python-Code-Large

Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows of Python code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.

By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks. Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.

reacted to ajibawa-2023's post with 🔥 about 2 hours ago

Cpp-Code-Large Dataset: https://huggingface.co/datasets/ajibawa-2023/Cpp-Code-Large

Cpp-Code-Large is a large-scale corpus of C++ source code comprising more than 5 million lines of C++ code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the C++ ecosystem.

By providing a high-volume, language-specific corpus, Cpp-Code-Large enables systematic experimentation in C++-focused model training, domain adaptation, and downstream code understanding tasks. Cpp-Code-Large addresses the need for a dedicated C++-only dataset at substantial scale, enabling focused research across systems programming, performance-critical applications, embedded systems, game engines, and large-scale native software projects.

Organizations

spaces 2

KAI 7B Instruct Chat

SD TinyIguana Generation

models 0

None public yet

datasets 0

None public yet