Github-Stars-Predictor

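License: MIT License
Language: Python
Category: Neural Networks/Artificial Intelligence, Machine Learning/Deep Learning
Software type: Open source
Region: Unknown
Submitted by: 漆雕奇
Operating system: Cross-platform
Open source organization:
Target audience: Unknown
Software overview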

Github Repo Stars Predictor

Overview

This is a GitHub repository star predictor that tries to predict the star count of any GitHub repository with more than 100 stars. It bases its predictions on the owner's or organization's status and on activity in the repository (commits, forks, comments, branches, update rate, etc.). Several types of models (gradient boosting, deep neural network, etc.) have been tested successfully on the dataset we fetched from the GitHub APIs.

Dataset

We used the GitHub REST API and GraphQL API to collect data on repositories with more than 100 stars. The data is available in the dataset directory. We were able to collect the data faster by using multiple DigitalOcean servers, so we thank DigitalOcean for providing free credits that let students use its servers. For details on the dataset features, refer to the summary section below.
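As a rough illustration of the REST step, the sketch below builds a GitHub search URL for repositories with more than 100 stars. It assumes the public `/search/repositories` endpoint with a `stars:>N` qualifier; the README does not say which endpoint or paging scheme the project's scripts use, and `build_search_url` is a hypothetical helper written for Python 3, not code from the repo.

```python
from urllib.parse import urlencode

# Public GitHub REST search endpoint (an assumption about what the project used).
SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(min_stars=100, page=1, per_page=100):
    """Return a search URL for repositories with more than `min_stars` stars."""
    params = {
        "q": "stars:>{}".format(min_stars),  # star-count qualifier
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,  # GitHub caps this at 100 per page
        "page": page,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


first_page_url = build_search_url()
```

The search API returns at most 1,000 results per query, so a real crawler would need to split the star range into several queries; how the project handled that is not described here.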

Tools used

  • Python 2.7
  • Jupyter Notebook
  • NumPy
  • scikit-learn (sklearn)
  • Pandas
  • Keras
  • CatBoost
  • Matplotlib
  • seaborn

We also used Google Colab's GPU notebooks, so we thank Google for starting the Colab project and providing GPUs.

Code details

Below is a brief description of the code files and folders in the repo.

Bash Script

  • settingUpDOServer.sh
    filepath: scripts/bash/settingUpDOServer.sh
    This script configures the DigitalOcean server.

NodeJs scripts

  • getting_repos_v2.js
    filepath: scripts/nodejs/getting_repos_v2.js
    This script fetches basic info on repos with more than 100 stars using the GitHub REST API.

  • githubGraphQLApiCallsDO_V2.js
    filepath: scripts/nodejs/githubGraphQLApiCallsDO_V2.js
    This script uses the GitHub GraphQL API to fetch complete info on the repositories collected by the script above. It sends requests at a fixed rate defined in the env file (e.g. one request every 730 ms).

  • githubGraphQLApiCallsDO_V3.js
    filepath: scripts/nodejs/githubGraphQLApiCallsDO_V3.js
    This script also uses the GitHub GraphQL API to fetch complete info on the repositories collected by getting_repos_v2.js. Instead of using a fixed rate, it requests data for the next repo only after receiving the response to the previous request.
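
The two request-pacing strategies described for the V2 and V3 scripts can be sketched as follows. This is an illustrative Python 3 `asyncio` version, not a port of the Node.js code: `fetch_fixed_rate`, `fetch_sequential` and the stand-in `fake_fetch` are hypothetical names, and the 0.73 s default simply mirrors the 730 ms example above.

```python
import asyncio


async def fetch_fixed_rate(repos, fetch, interval_s=0.73):
    """V2-style pacing: start one request every `interval_s` seconds,
    without waiting for earlier responses to arrive."""
    tasks = []
    for index, repo in enumerate(repos):
        if index:
            await asyncio.sleep(interval_s)  # fixed gap between request starts
        tasks.append(asyncio.create_task(fetch(repo)))
    return await asyncio.gather(*tasks)  # results keep the input order


async def fetch_sequential(repos, fetch):
    """V3-style pacing: send the next request only after the previous
    response has been received."""
    results = []
    for repo in repos:
        results.append(await fetch(repo))
    return results


async def fake_fetch(repo):
    """Stand-in for a real GraphQL call (hypothetical); returns a dummy record."""
    await asyncio.sleep(0.001)
    return {"reponame": repo}
```

For example, `asyncio.run(fetch_sequential(["a/b", "c/d"], fake_fetch))` returns one record per repository. With fixed-rate pacing the throughput is predictable but slow responses can pile up; the sequential variant never has more than one request in flight.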

Python scripts

  • json_to_csv.py
    filepath: scripts/python/json_to_csv.py
    This script converts the JSON data fetched from GitHub's GraphQL API by the scripts above into an equivalent CSV file.

  • merge.py
    filepath: scripts/python/merge.py
    This script merges the data from multiple CSV files into a single CSV file.
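
A minimal sketch of the two conversion steps, using only the Python 3 standard library. The shape of the fetched JSON is not documented here, so the nested `owner`/`login` structure in the usage note and the helpers `flatten`, `json_to_csv` and `merge_csv` are assumptions for illustration rather than the logic of the actual scripts.

```python
import csv
import io
import json


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts, e.g. {"owner": {"login": "x"}} -> {"owner_login": "x"}."""
    flat = {}
    for key, value in record.items():
        name = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def json_to_csv(json_text, out):
    """Write a JSON array of objects to the file-like `out` as CSV.
    Returns the column names in the order they were written."""
    rows = [flatten(record) for record in json.loads(json_text)]
    columns = sorted({column for row in rows for column in row})
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return columns


def merge_csv(inputs, out):
    """Concatenate CSV inputs that share the same columns, writing the header once."""
    writer = None
    for handle in inputs:
        reader = csv.DictReader(handle)
        if writer is None:
            writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
            writer.writeheader()
        for row in reader:
            writer.writerow(row)
```

Calling `json_to_csv('[{"reponame": "a", "owner": {"login": "x"}, "stars": 120}]', io.StringIO())` writes a header row followed by one data row. `merge_csv` raises a `ValueError` if a later file contains a column the first one lacks, which is a deliberate choice in this sketch so that mismatched files are noticed.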

Jupyter Notebooks

  • VisualizePreprocess.ipynb
    filepath: notebooks/VisualizePreprocess.ipynb
    This notebook handles the feature engineering. It visualises the data and, based on what the plots show, creates new features, modifies existing features and removes redundant ones. For details on the features created, check the summary below.

  • training_models.ipynb
    filepath: notebooks/training_models.ipynb
    In this notebook, we trained different models with hyperparameter tuning on our dataset and compared their results at the end. For details on the models trained, their prediction scores, etc., check the summary below.

Summary

In this project we tried to predict the number of stars of GitHub repositories that have more than 100 stars. We took the repository data from the GitHub REST API and GraphQL API. After generating the dataset, we visualized it and did some feature engineering. Finally, we applied various models and computed each model's scores on the training and test data.

Feature Engineering

There are a total of 49 features before pre-processing. After pre-processing (adding new features, removing redundant features and modifying existing features) the count changes to 54. All the features are listed below. Some features after pre-processing may not be clear. Please refer to the VisualizePreprocess.ipynb notebook for details.

Original Features

column 1             column 2             column 3
branches             commits              createdAt
description          diskUsage            followers
following            forkCount            gistComments
gistStar             gists                hasWikiEnabled
iClosedComments      iClosedParticipants  iOpenComments
iOpenParticipants    isArchived           issuesClosed
issuesOpen           license              location
login                members              organizations
prClosed             prClosedComments     prClosedCommits
prMerged             prMergedComments     prMergedCommits
prOpen               prOpenComments       prOpenCommits
primaryLanguage      pushedAt             readmeCharCount
readmeLinkCount      readmeSize           readmeWordCount
releases             reponame             repositories
siteAdmin            stars                subscribersCount
tags                 type                 updatedAt
websiteUrl

Features after pre-processing

column 1             column 2             column 3
branches             commits              createdAt
diskUsage            followers            following
forkCount            gistComments         gistStar
gists                hasWikiEnabled       iClosedComments
iClosedParticipants  iOpenComments        iOpenParticipants
issuesClosed         issuesOpen           members
organizations        prClosed             prClosedComments
prClosedCommits      prMerged             prMergedComments
prMergedCommits      prOpen               prOpenComments
prOpenCommits        pushedAt             readmeCharCount
readmeLinkCount      readmeSize           readmeWordCount
releases             repositories         subscribersCount
tags                 type                 updatedAt
websiteUrl           desWordCount         desCharCount
mit_license          nan_license          apache_license
other_license        remain_license       JavaScript
Python               Java                 Objective
Ruby                 PHP                  other_language
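
To make the derived columns more concrete, here is a hedged Python 3 sketch of how `desWordCount`, `desCharCount` and the per-language indicator columns could be computed from `description` and `primaryLanguage`. The exact rules live in VisualizePreprocess.ipynb and are not reproduced here: whitespace splitting, sending missing languages to `other_language`, and the helper names are all assumptions. The `Objective` and license columns are left out because their source values and grouping are not stated in this summary.

```python
# Language columns taken from the feature list above ("Objective" omitted, see lead-in).
KNOWN_LANGUAGES = ("JavaScript", "Python", "Java", "Ruby", "PHP")


def description_features(description):
    """Replace the free-text description with word and character counts."""
    text = description or ""  # treat a missing description as empty
    return {"desWordCount": len(text.split()), "desCharCount": len(text)}


def one_hot_language(primary_language, known=KNOWN_LANGUAGES):
    """One-hot encode the primary language; anything else goes to other_language."""
    encoded = {name: 0 for name in known}
    encoded["other_language"] = 0
    if primary_language in known:
        encoded[primary_language] = 1
    else:
        encoded["other_language"] = 1  # assumption: unknown or missing language
    return encoded


def engineer(row):
    """Return a copy of `row` with the two raw columns replaced by derived ones."""
    features = {
        key: value
        for key, value in row.items()
        if key not in ("description", "primaryLanguage")
    }
    features.update(description_features(row.get("description")))
    features.update(one_hot_language(row.get("primaryLanguage")))
    return features
```

Applied to a row such as `{"stars": 150, "description": "A star predictor", "primaryLanguage": "Python"}`, `engineer` drops the two raw columns and adds the counts plus one indicator per language.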

Models Trained

  • Gradient Boost Regressor
  • Cat Boost Regressor
  • Random Forest Regressor
  • Deep Neural Network

Evaluation Metrics

  • R^2 score
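
For reference, the R^2 score (coefficient of determination) is 1 minus the ratio of the residual sum of squares to the total sum of squares. The plain Python 3 function below is a small sketch of that definition; the name `r_squared` is hypothetical, and the notebooks presumably rely on a library implementation instead.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of the true values.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1 - ss_res / ss_tot
```

A perfect prediction scores 1, always predicting the mean of the true star counts scores 0, and a model that does worse than the mean gets a negative score.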

Results
