github code dataset

codeparrot/github-code · Datasets at Hugging Face huggingface.co › datasets › github-code

The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of data. The dataset was created ...

github/CodeSearchNet: Datasets, tools, and benchmarks for ... github.com › github › CodeSearchNet

Apr 11, 2023 · CodeSearchNet is a collection of datasets and benchmarks that explore the problem of code retrieval using natural language.

A collection of datasets for machine learning for big code - GitHub github.com › CUHK-ARISE › ml4code-dataset

A collection of datasets (and other resources) for big code analysis. If you want to contribute to this list, please send a pull request.

codeparrot/github-code-clean · Datasets at Hugging Face huggingface.co › datasets › github-code-clean

This is a cleaner version of the Github-code dataset. We add the following filters: Average line length < 100; Alphanumeric characters fraction > ...

src-d/datasets - GitHub github.com › src-d › datasets

This repository contains all the needed tools and scripts to reproduce the datasets, as well as the academic papers they may relate to.

awesomedata/awesome-public-datasets - GitHub github.com › awesomedata › awesome-public-datasets

This is a list of high-quality, topic-centric public data sources. They are collected and tidied from blogs, answers, and user responses.

BigCode Dataset - GitHub github.com › bigcode-project › bigcode-dataset

This repository gathers all the code used to build the BigCode datasets, such as The Stack, as well as the preprocessing used for model training.

csebuetnlp/CoDesc: A large dataset of 4.2m Java source code ... github.com › csebuetnlp › CoDesc

CoDesc is a large, noise-removed parallel dataset of source code and corresponding natural language descriptions. This dataset is procured from several similar ...

GitHub Dataset - Kaggle www.kaggle.com › datasets › nikhil25803 › git...

This dataset is a collection of 1052 GitHub repositories, along with other columns such as the primary language used in it, fork count, open pull requests, and ...

DiverseVul: A New Vulnerable Source Code Dataset ... - GitHub github.com › wagner-group › diversevul

A new vulnerable source code dataset for deep learning based vulnerability detection (RAID 2023) https://surrealyz.github.io/files/pubs/raid23-diversevul.pdf

Related searches

python code dataset

code snippet dataset

codeparrot github code clean

codesearchnet

github huggingface/datasets

the stack dataset

source code classification dataset

code generation dataset huggingface