stock-market-scraper

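License: MIT License

Language: Python

Category: Neural Networks/Artificial Intelligence, Machine Learning/Deep Learning

Software type: Open source

Region: Unknown

Submitted by: 嵇光临

Operating system: Cross-platform

Open source organization: None

Target audience: Unknown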

Always wanted to get live updated historical data of your favourite stocks?

Say no more!

stock-market-scraper is a command line tool that downloads all historical stock data from Yahoo Finance in both csv and json formats. It is for educational and research purposes only.

Don't overuse this script. It puts load on the Yahoo Finance servers.

This README is for mass downloading with Python Script only. For only selected stock downloads, see the Jupyter Notebook.

Photo by Igor Kozak for 10Clouds on dribbble

Getting Started

Supported Sites

Currently supports Yahoo Finance only.

Configuring download list

The script currently downloads the data of every stock worldwide that is listed on Yahoo Finance.

I am working on a command line argument version that will let you download only selected stocks with this script too.

Dependencies Installation

This script can run on multiple Operating Systems. Follow the instructions mentioned below, according to your OS.

Linux/Debian :

Since most (if not all) Debian-based Linux distributions come with Python pre-installed, you don't have to install Python manually. Make sure you're using Python >= 3.5 though.

We need pip to install the external dependencies. Open a terminal and type `pip list`. If it prints a list of packages, pip is installed. If it shows an error such as "pip not found", you need to install pip. Just type this command in the terminal :

sudo apt-get install python3-pip

If you're on Fedora, CentOS/RHEL, openSUSE, Arch Linux, then you simply need to follow THIS TUTORIAL to install pip.

If this still doesn't work, then you'll manually need to install pip. Doing so is an easy one time job and you can follow THIS TUTORIAL to do so.

Download this requirements.txt file and put it in some directory/folder.
Open the terminal again, browse to the directory where you downloaded the requirements.txt file and run this command :

pip install -r requirements.txt

It should install the required external libraries.

Windows :

If you're on Windows, then follow these steps :

Install Python >= 3.5. Download the desired installer from download Python.
Add it to the system path (if not already added).
Download this requirements.txt file and put it in some directory/folder.
Open Command Prompt, browse to the directory where you downloaded the requirements.txt file and run this command :

pip install -r requirements.txt

It should install the required external libraries.

Now, install Node.js as well and make sure it's in your path.

If everything installed without any errors, you're good to go!

Mac OS X :

Mac OS X users will have to fetch their version of Python and Pip.

Python installation guide : Install python on mac os X
Pip installation guide : Installing pip on mac os X

After downloading and installing these, you need to add PIP & Python in your path. Follow THIS LITTLE GUIDE to install both, Python & pip successfully.

Python Support

Supports python >= 3.5

Usage

Follow the instructions according to your OS :

Windows

After you've saved this script in a directory/folder, you need to open command prompt and browse to that directory and then execute the script. Let's do it step by step :

Open the folder where you've downloaded the files of this repository.
Hold down the SHIFT key and while holding down the SHIFT key, RIGHT CLICK and select Open Command Prompt Here from the options that show up.
Now, in the command prompt, type this :

python stock-market-scraper.py

Linux/Debian

After you've saved this script in a directory/folder, you need to open a terminal, browse to that directory and then execute the script. Let's do it step by step :

Open a terminal, Ctrl + Alt + T is the shortcut to do so (if you didn't know).
Now, change the current working directory of the terminal to the one where you've downloaded this repository.
Now, in the Terminal, type this :

python stock-market-scraper.py

Save Location

The historical data will be saved in the directory where you cloned this repository, next to the repository folder. Here is how:

      --SomeDirectory (Where you cloned the repository)
        |--stock-market-scraper
        |  |--requirements.txt
        |  |--.gitignore
        |  |--_config.yml
        |  |--stock-market-scraper.py
        |  |--stock-market-scraper.ipynb
        |  |--readme.md
        |--historic_data
        |  |--json
        |  |  |--(>63000) files.json
        |  |--csv
        |  |  |--(>61000) files.csv

Let's see the scraping idea

Yahoo has gone to a Reactjs front end which means if you analyze the request headers from the client to the backend you can get the actual JSON they use to populate the client side stores.

Hosts:

query1.finance.yahoo.com (HTTP/1.0)
query2.finance.yahoo.com (HTTP/1.1; see the difference between HTTP/1.0 & HTTP/1.1)

If you plan to use a proxy or persistent connections use query2.finance.yahoo.com. But for the purposes of this post the host used for the example URLs is not meant to imply anything about the path it's being used with.

We will use HTTP/1.1

Fundamental Data

/v10/finance/quoteSummary/AAPL?modules= (Full list of modules below)

(substitute your symbol for: AAPL)

Inputs for the `?modules=` query:

[ 'assetProfile',
 'incomeStatementHistory',
 'incomeStatementHistoryQuarterly',
 'balanceSheetHistory',
 'balanceSheetHistoryQuarterly',
 'cashflowStatementHistory',
 'cashflowStatementHistoryQuarterly',
 'defaultKeyStatistics',
 'financialData',
 'calendarEvents',
 'secFilings',
 'recommendationTrend',
 'upgradeDowngradeHistory',
 'institutionOwnership',
 'fundOwnership',
 'majorDirectHolders',
 'majorHoldersBreakdown',
 'insiderTransactions',
 'insiderHolders',
 'netSharePurchaseActivity',
 'earnings',
 'earningsHistory',
 'earningsTrend',
 'industryTrend',
 'indexTrend',
 'sectorTrend' ]

Example URL:

https://query1.finance.yahoo.com/v10/finance/quoteSummary/AAPL?modules=assetProfile%2CearningsHistory

Querying for: assetProfile and earningsHistory

The %2C is the percent-encoded (hexadecimal) representation of a comma (,) and needs to be inserted between each module you request.
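As a minimal sketch of how such a URL can be assembled, the helper below joins the module names with commas and percent-encodes the result using the standard library. The function name `quote_summary_url` and its `host` parameter are illustrative assumptions, not part of the original script.

```python
from urllib.parse import quote


def quote_summary_url(symbol, modules, host="query1.finance.yahoo.com"):
    """Build a quoteSummary URL (hypothetical helper, not from the script)."""
    # safe="" forces the comma separator to be encoded as %2C.
    joined = quote(",".join(modules), safe="")
    return "https://" + host + "/v10/finance/quoteSummary/" + symbol + "?modules=" + joined


url = quote_summary_url("AAPL", ["assetProfile", "earningsHistory"])
print(url)
```

This reproduces the example URL shown above for AAPL with the assetProfile and earningsHistory modules.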

Options contracts

/v7/finance/options/AAPL (current expiration)
/v7/finance/options/AAPL?date=1579219200 (January 17, 2020 expiration)

Example Full URL:

https://query2.finance.yahoo.com/v7/finance/options/AAPL (current expiration)
https://query2.finance.yahoo.com/v7/finance/options/AAPL?date=1579219200 (January 17, 2020 expiration)

Any valid future expiration represented as a UNIX timestamp can be used in the ?date= query. If you query for the current expiration, the JSON response will contain a list of all the valid expirations that can be used in the ?date= query.
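For converting a human-readable date into the UNIX timestamp used by ?date=, a small sketch with the standard library could look like this. It assumes the expiration is expressed as midnight UTC, which matches the January 17, 2020 value used above; the helper name `to_unix` is made up for illustration.

```python
from datetime import datetime, timezone


def to_unix(year, month, day):
    """Return the UNIX timestamp of midnight UTC on the given date."""
    # Attaching timezone.utc avoids depending on the local time zone.
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


expiration = to_unix(2020, 1, 17)
print(expiration)  # 1579219200
```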

Price

/v8/finance/chart/AAPL?symbol=AAPL&period1=0&period2=9999999999&interval=3mo

Intervals:

&interval=3mo 3 months, going back until initial trading date.
&interval=1d 1 day, going back until initial trading date.
&interval=5m 5 minutes, going back 80(ish) days.
&interval=1m 1 minute, going back 4-5 days.

How far back you can go with each interval is a little confusing and seems inconsistent. My assumption is that internally yahoo is counting in trading days and my naive approach was not accounting for holidays. Although that's a guess and YMMV.

period1=: unix timestamp representation of the date you wish to start at. Values below the initial trading date will be rounded up to the initial trading date.

period2=: unix timestamp representation of the date you wish to end at. Values greater than the last trading date will be rounded down to the most recent timestamp available.

Note: If you query with a period1= (start date) that is too far in the past for the interval you've chosen, yahoo will return prices in the 3mo interval regardless of what interval you requested.

Add pre & post market data

&includePrePost=true

Add dividends & splits

&events=div%2Csplit

Example full query:

https://query1.finance.yahoo.com/v8/finance/chart/AAPL?symbol=AAPL&period1=0&period2=9999999999&interval=1d&includePrePost=true&events=div%2Csplit

The above request will return all price data for ticker AAPL on a 1 day interval including pre and post market data as well as dividends and splits.

Note: the values used in the price example url for period1= & period2= are to demonstrate the respective rounding behavior of each input.
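The query string above can also be built programmatically instead of by hand. The following is a sketch using `urllib.parse.urlencode`, which percent-encodes the comma in `div,split` on its own. The helper name `chart_url` and its default arguments are assumptions for illustration.

```python
from urllib.parse import urlencode


def chart_url(symbol, interval="1d", period1=0, period2=9999999999,
              host="query1.finance.yahoo.com"):
    """Build a price (chart) URL; hypothetical helper, not from the script."""
    params = {
        "symbol": symbol,
        "period1": period1,          # start, unix timestamp
        "period2": period2,          # end, unix timestamp
        "interval": interval,
        "includePrePost": "true",    # add pre & post market data
        "events": "div,split",       # encoded by urlencode as div%2Csplit
    }
    return "https://" + host + "/v8/finance/chart/" + symbol + "?" + urlencode(params)


print(chart_url("AAPL"))
```

With the defaults, the result is identical to the example full query for AAPL.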

The above article is taken from here.

Dividends and Splits

Yahoo adjusts all historical prices to reflect a stock split. For example, ISRG was trading around $1000 prior to 2017/10/06. Then on 2017/10/06, it underwent a 3-for-1 stock split. In Yahoo's historical data, the prices from before 2017/10/06 are divided by 3, so they line up with the prices after the split.

For dividends, let's say stock ABC closed at 200 on December 18. Then on December 19, the stock increases in price by $2 but it pays out a $1 dividend. In Yahoo's historical prices for ABC, you will see that it closed at 200 on Dec 18 and 201 on Dec 19. Yahoo factors in the dividend in the "Adj Close" column for all the previous days. So the Close for Dec 18 would be 200, but the Adj Close would be 199.

For example, on 2017/09/15, SPY paid out a $1.235 dividend. Yahoo's historical prices say that SPY's closing price on 2017/09/14 was 250.09, but the Adj Close is 248.85, which is $1.24 lower. The Adjusted Close for the previous days was reduced by the dividend amount.
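The arithmetic behind the two dividend examples can be written out as a small sketch. It follows the simplified description in the text, where the dividend is subtracted from the close. This is an assumption made for illustration, and Yahoo's actual adjustment method may differ in detail.

```python
def adjusted_close(close, dividend):
    """Simplified adjustment as described in the text: close minus dividend."""
    return close - dividend


abc_adj = adjusted_close(200, 1)         # the ABC example: 199
spy_adj = adjusted_close(250.09, 1.235)  # the SPY example: about 248.855

print(abc_adj, round(spy_adj, 3))
```

The SPY result of roughly 248.855 agrees with the reported Adj Close of 248.85 to within a cent.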

The above article is taken from here.

Now let's get back to some Code to get historic prices of stocks

Import some modules:

urllib: To fetch data from a URL
json: To handle JSON data
time: To put the program to sleep for some time
os: To work with paths and directories
difflib: To get close matches of strings. Helps to find the correct stock from the input the user gives.
itertools: To repeat the same variable when passing arguments to the multithreading function.
pandas: To handle tabular data and csv files
datetime: To convert unix timestamps to a normal date and time. The Yahoo query uses unix timestamps

import urllib.request, json, time, os, difflib, itertools
import pandas as pd
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API
from datetime import datetime
try:
    import httplib  # Python 2 name
except ImportError:
    import http.client as httplib  # Python 3 name
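The mass-download path below does not call `difflib` directly, so here is a brief sketch of the kind of close matching the module list describes. The helper name `suggest_ticker` and the short ticker list are assumptions for illustration, and `difflib.get_close_matches` is used with its default cutoff.

```python
import difflib


def suggest_ticker(user_input, known_tickers, n=1):
    """Return up to n known tickers that closely match the user's input."""
    # Tickers are upper case, so normalise the input before matching.
    return difflib.get_close_matches(user_input.upper(), known_tickers, n=n)


tickers = ["AAPL", "AMZN", "GOOG", "BAC"]
match = suggest_ticker("appl", tickers)
print(match)  # ['AAPL']
```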

Let's make a code snippet which can tell if we have working internet connection or not

def check_internet():
    # A lightweight HEAD request; any failure is treated as "offline".
    conn = httplib.HTTPConnection("www.google.com", timeout=5)
    try:
        conn.request("HEAD", "/")
        return True
    except Exception:
        return False
    finally:
        # Close the connection whether or not the request succeeded.
        conn.close()

Take an arbitrary stock, Igarashi Motors, opened on Yahoo Finance. Its ticker appears in the page URL: it is IGARASHI.BO

I will show how to get the tickers later.

First, let us make a function that can pull the JSON data about that stock from Yahoo. (I will discuss the function parameters later.)

We will be using query2

Now write the function `get_historic_price`, which downloads the historical prices for a given `query_url`.

It saves the stock data as `json` and `csv` inside a folder named "historic_data"

def get_historic_price(query_url, json_path, csv_path):
    # The ticker sits between "symbol=" and "&period" in the query URL.
    stock_id = query_url.split("&period")[0].split("symbol=")[1]
    json_file = json_path + stock_id + '.json'
    csv_file = csv_path + stock_id + '.csv'

    # Skip tickers whose csv file already exists and is not empty.
    if os.path.exists(csv_file) and os.stat(csv_file).st_size != 0:
        print("<<<  Historical data of " + stock_id + " already exists")
        return

    # Wait until the internet connection is available.
    while not check_internet():
        print("Could not connect, trying again in 5 seconds...")
        time.sleep(5)

    try:
        with urllib.request.urlopen(query_url) as url:
            parsed = json.loads(url.read().decode())
    except Exception:
        # Any download or parsing error is reported as missing data.
        print("|||  Historical data of " + stock_id + " doesn't exist")
        return

    # Save the raw response; mode 'w' overwrites any earlier file.
    with open(json_file, 'w') as outfile:
        json.dump(parsed, outfile, indent=4)

    try:
        result = parsed['chart']['result'][0]
        quote = result['indicators']['quote'][0]

        # Yahoo returns unix timestamps; convert them to dd-mm-YYYY (UTC).
        Date = [datetime.utcfromtimestamp(int(i)).strftime('%d-%m-%Y')
                for i in result['timestamp']]
        Adjusted_Close = result['indicators']['adjclose'][0]['adjclose']

        rows = list(zip(Date, quote['low'], quote['open'], quote['volume'],
                        quote['high'], quote['close'], Adjusted_Close))
        df = pd.DataFrame(rows, columns=['Date', 'Low', 'Open', 'Volume',
                                         'High', 'Close', 'Adjusted Close'])

        # to_csv overwrites an existing (empty) file.
        df.to_csv(csv_file, sep=',', index=None)
        print(">>>  Historical data of " + stock_id + " saved")
    except Exception:
        print(">>>  Historical data of " + stock_id + " could not be saved")

    return
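To make the JSON layout that `get_historic_price` reads easier to follow, here is an offline sketch that applies the same key lookups to a tiny hand-made response. The `sample` values are invented for illustration and only mirror the keys the function accesses; a real Yahoo response contains more fields. The helper name `chart_to_rows` is also an assumption.

```python
from datetime import datetime, timezone

# Made-up sample with the same nesting the function above reads.
sample = {
    "chart": {"result": [{
        "timestamp": [0, 86400],
        "indicators": {
            "quote": [{
                "low": [1.0, 2.0], "open": [1.5, 2.5], "volume": [100, 200],
                "high": [2.0, 3.0], "close": [1.8, 2.8],
            }],
            "adjclose": [{"adjclose": [1.7, 2.7]}],
        },
    }]}
}


def chart_to_rows(parsed):
    """Turn a chart response into (Date, Low, Open, Volume, High, Close, Adj Close) rows."""
    result = parsed["chart"]["result"][0]
    quote = result["indicators"]["quote"][0]
    adjclose = result["indicators"]["adjclose"][0]["adjclose"]
    # Convert each unix timestamp to a dd-mm-YYYY date string in UTC.
    dates = [datetime.fromtimestamp(int(ts), tz=timezone.utc).strftime("%d-%m-%Y")
             for ts in result["timestamp"]]
    return list(zip(dates, quote["low"], quote["open"], quote["volume"],
                    quote["high"], quote["close"], adjclose))


rows = chart_to_rows(sample)
for row in rows:
    print(row)
```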

First we have to set where the `json` and `csv` files will be saved. These paths are passed to the function `get_historic_price()`

json_path = os.getcwd()+os.sep+".."+os.sep+"historic_data"+os.sep+"json"+os.sep
csv_path = os.getcwd()+os.sep+".."+os.sep+"historic_data"+os.sep+"csv"+os.sep

Then we have to check whether these directories exist. If they do not, we create them with `os.makedirs`

if not os.path.isdir(json_path):
    os.makedirs(json_path)
if not os.path.isdir(csv_path):
    os.makedirs(csv_path)
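As an alternative sketch, the path construction and the existence check can be folded into one helper with `os.path.join` and `os.makedirs(..., exist_ok=True)`. The helper name `ensure_output_dirs` is hypothetical, and a temporary directory stands in for the real parent directory so the example leaves nothing behind.

```python
import os
import tempfile


def ensure_output_dirs(base_dir):
    """Create historic_data/json and historic_data/csv under base_dir."""
    # A trailing separator is kept because the script concatenates file names.
    json_path = os.path.join(base_dir, "historic_data", "json") + os.sep
    csv_path = os.path.join(base_dir, "historic_data", "csv") + os.sep
    for path in (json_path, csv_path):
        # exist_ok=True makes the call safe when the directory already exists.
        os.makedirs(path, exist_ok=True)
    return json_path, csv_path


with tempfile.TemporaryDirectory() as tmp:
    json_path, csv_path = ensure_output_dirs(tmp)
    created = os.path.isdir(json_path) and os.path.isdir(csv_path)
    ensure_output_dirs(tmp)  # a second call does not raise

print(created)
```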

Getting tickers

As promised, here is how to find the historical data. On the historical data page of Igarashi Motors, you can see the maximum time period for which data can be pulled for the stock. The query stores the period as unix timestamps.

Now let's make the query. First set

period1 = 0
period2 = 9999999999
interval = 1d

In the query that the page itself sends, period1 is greater than 0 and period2 is less than 9999999999. Because Yahoo rounds out-of-range values to the first and last available dates, setting them to 0 and 9999999999 requests the maximum span for which data can be pulled.

Then we need to open the Excel file (.xlsx) where the `yahoo finance tickers` are saved. It is in the `Assets` folder

How did I get this? Here is the direct link to download the Yahoo ticker list (last updated September 2017). Please consider visiting the author's website: his income comes from advertisements, and compiling a ticker list like this takes many hours.

All right, moving on.

Let's now load the ticker list so that we can trim it.

ticker_file_path = "Assets"+os.sep+"Yahoo Ticker Symbols - September 2017.xlsx"
temp_df = pd.read_excel(ticker_file_path)
print("Total stocks:",len(temp_df))
temp_df.head(10)

Total stocks: 106331

	Yahoo Stock Tickers	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7
0	http://investexcel.net	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Ticker	Name	Exchange	Category Name	Country	NaN	NaN	NaN
3	OEDV	Osage Exploration and Development, Inc.	PNK	NaN	USA	NaN	NaN	Samir Khan
4	AAPL	Apple Inc.	NMS	Electronic Equipment	USA	NaN	NaN	simulationconsultant@gmail.com
5	BAC	Bank of America Corporation	NYQ	Money Center Banks	USA	NaN	NaN	NaN
6	AMZN	Amazon.com, Inc.	NMS	Catalog & Mail Order Houses	USA	NaN	NaN	This ticker symbol list was downloaded from
7	T	AT&T Inc.	NYQ	Telecom Services - Domestic	USA	NaN	NaN	http://investexcel.net/all-yahoo-finance-stock...
8	GOOG	Alphabet Inc.	NMS	Internet Information Providers	USA	NaN	NaN	and was updated on 2nd September 2017
9	MO	Altria Group, Inc.	NYQ	Cigarettes	USA	NaN	NaN	NaN

The list above is messy and contains unrelated information. After refining it, we get

temp_df = temp_df.drop(temp_df.columns[[5, 6, 7]], axis=1)
headers = temp_df.iloc[2]
df  = pd.DataFrame(temp_df.values[3:], columns=headers)
print("Total stocks:",len(df))
df.head(10)

Total stocks: 106328

2	Ticker	Name	Exchange	Category Name	Country
0	OEDV	Osage Exploration and Development, Inc.	PNK	NaN	USA
1	AAPL	Apple Inc.	NMS	Electronic Equipment	USA
2	BAC	Bank of America Corporation	NYQ	Money Center Banks	USA
3	AMZN	Amazon.com, Inc.	NMS	Catalog & Mail Order Houses	USA
4	T	AT&T Inc.	NYQ	Telecom Services - Domestic	USA
5	GOOG	Alphabet Inc.	NMS	Internet Information Providers	USA
6	MO	Altria Group, Inc.	NYQ	Cigarettes	USA
7	DAL	Delta Air Lines, Inc.	NYQ	Major Airlines	USA
8	AA	Alcoa Corporation	NYQ	Aluminum	USA
9	AXP	American Express Company	NYQ	Credit Services	USA

Now create the query URLs for the stock `ticker`s. These point to the query pages where Yahoo Finance holds its historical stock data.

Example query is like this: https://query1.finance.yahoo.com/v8/finance/chart/ticker?symbol=ticker&period1=0&period2=9999999999&interval=1d&includePrePost=true&events=div%2Csplit

query_urls=[]
for ticker in df['Ticker']:
    query_urls.append("https://query1.finance.yahoo.com/v8/finance/chart/"+ticker+"?symbol="+ticker+"&period1=0&period2=9999999999&interval=1d&includePrePost=true&events=div%2Csplit")

Now download the stock data with multithreading.

with Pool(processes=10) as pool:
    pool.starmap(get_historic_price, zip(query_urls, itertools.repeat(json_path), itertools.repeat(csv_path)))
print("<|>  Historical data of all stocks saved")

<<<  Historical data of SBIN.NS already exists, Updating data...
<<<  Historical data of IGARASHI.NS already exists, Updating data...
<<<  Historical data of TATAMOTORS.NS already exists, Updating data...
<<<  Historical data of TCS.NS already exists, Updating data...
>>>  Historical data of TCS.NS saved
>>>  Historical data of IGARASHI.NS saved
>>>  Historical data of TATAMOTORS.NS saved
>>>  Historical data of SBIN.NS saved
All downloads completed !

This way you can update the data every day by yourself.
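The `starmap` plus `itertools.repeat` pattern used for the download can be tried without any network access. In this sketch, the worker `describe` is a made-up stand-in for `get_historic_price`, and the relative paths are placeholders.

```python
import itertools
from multiprocessing.dummy import Pool  # thread pool with the Pool API


def describe(ticker, json_path, csv_path):
    """Stand-in worker: report where the files for a ticker would go."""
    return ticker + " -> " + json_path + ticker + ".json, " + csv_path + ticker + ".csv"


tickers = ["AAPL", "BAC", "AMZN"]

with Pool(processes=2) as pool:
    # zip pairs every ticker with the same two paths; starmap unpacks each tuple
    # into the worker's arguments and returns results in input order.
    results = pool.starmap(
        describe,
        zip(tickers, itertools.repeat("json/"), itertools.repeat("csv/")),
    )

for line in results:
    print(line)
```

`itertools.repeat` yields the same path forever, and `zip` stops at the end of the ticker list, so each worker call receives one ticker and both paths.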

Future plans

Short term

Add command line arguments for ease of use.

Long term

Add more websites to download from.

Bugs

None

Changelog

[Added] New website Yahoo Finance

Opening An Issue/Requesting A Site

If you're planning to open an issue for the script, ask for a new feature, or request anything else that requires opening an Issue, please keep these things in mind.

Reporting Issues

If you're going to report an issue, please follow this syntax :
Command You Gave : What was the command that triggered the issue?
Expected Behaviour : After giving the above command, what did you expect to happen?
Actual Behaviour : What actually happened?
Error Log : Error Log is mandatory.

Suggesting A Feature

If you're here to make suggestions, please follow this basic syntax to post a request :
Subject : Something that briefly tells us about the feature.
Long Explanation : Describe in detail what you want and how you want it.

Source

License

MIT
