This is a cleaner version of the Github-code dataset. We add the following filters: average line length < 100; alphanumeric character fraction > ... |
We're on a journey to advance and democratize artificial intelligence through open source and open science. |
In this step-by-step guide, we'll learn how to train a large GPT-2 model called CodeParrot, entirely from scratch. |
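ATYUN(AiTechYun): This is a cleaner version of the Github-code dataset. We added the following filters: average line length less than 100; alphanumeric character fraction greater than 0.25; auto-generated files removed (keyword search); removed ... |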
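The filters named in the snippets above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the dataset's actual preprocessing code: the function names, the keyword list, and the number of header lines scanned are all hypothetical; only the two thresholds (average line length below 100, alphanumeric fraction above 0.25) and the idea of a keyword search come from the snippets.

```python
def mean_line_length(text: str) -> float:
    """Average number of characters per line (0.0 for empty text)."""
    lines = text.splitlines()
    if not lines:
        return 0.0
    return sum(len(line) for line in lines) / len(lines)


def alnum_fraction(text: str) -> float:
    """Fraction of characters in the file that are alphanumeric."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


# Hypothetical keyword list; the real keywords are not given in the source.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def looks_autogenerated(text: str, scan_lines: int = 5) -> bool:
    """Keyword search over the first few lines (scan_lines is an assumption)."""
    head = "\n".join(text.splitlines()[:scan_lines]).lower()
    return any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def keep_file(text: str) -> bool:
    """Apply the three filters; True means the file passes all of them."""
    return (
        mean_line_length(text) < 100
        and alnum_fraction(text) > 0.25
        and not looks_autogenerated(text)
    )
```

A file made of one very long line, a file of pure punctuation, or a file whose header mentions that it was auto-generated would each be rejected by `keep_file` under these assumptions.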
Supercharged Frontend Development: Create pixel perfect UI 10x faster - CodeParrot. |
PythonCoder is a code generation model trained only on a Python dataset (codeparrot/codeparrot-clean). It is a custom model with a context window of 1024 ... |
Iterable dataset that returns constant length chunks of tokens from stream of text files. Args: tokenizer (Tokenizer): The processor used ... |
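The idea of returning constant-length chunks of tokens from a stream of texts can be sketched in plain Python. This is a simplified illustration, not the implementation the snippet describes: `constant_length_chunks`, `toy_tokenize`, and the separator id `0` are hypothetical names and values, and a real setup would use an actual tokenizer and probably a framework dataset class.

```python
from typing import Callable, Iterable, Iterator, List


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one token id per character."""
    return [ord(ch) for ch in text]


def constant_length_chunks(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    seq_length: int,
    sep_token_id: int = 0,
) -> Iterator[List[int]]:
    """Concatenate tokenized texts and yield chunks of exactly seq_length ids."""
    buffer: List[int] = []
    for text in texts:
        # Append the file's tokens, then a separator marking the file boundary.
        buffer.extend(tokenize(text))
        buffer.append(sep_token_id)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Any leftover shorter than seq_length is dropped in this sketch.


if __name__ == "__main__":
    for chunk in constant_length_chunks(["abc", "de"], toy_tokenize, seq_length=4):
        print(chunk)
```

Packing files back to back like this avoids padding, at the cost of chunks that may span file boundaries; dropping the final partial chunk is a design choice of this sketch.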
CodeParrot has 4 repositories available. Follow their code on GitHub. |
1 million tokenized Python files from the CodeParrot dataset in Lance format. |