DevOps, Cloud, Big Data, NoSQL, Python & Linux tools. All programs have --help.

See Also: the repos listed at the bottom of this page, which contain hundreds more scripts and programs for Cloud, Big Data, SQL, NoSQL, Web and Linux.
Hari Sekhon
Cloud & Big Data Contractor, United Kingdom
Run make update if updating and not just git pull, as you will often need the latest library submodule and possibly new upstream libraries.

All programs and their pre-compiled dependencies can be found ready to run on DockerHub.
List all programs:
docker run harisekhon/pytools
Run any given program:
docker run harisekhon/pytools <program> <args>
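For example, to show the usage and options for one of the programs (validate_yaml.py is one of the validate_*.py programs listed further down):
docker run harisekhon/pytools validate_yaml.py --help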
The following one-liner installs git and make, pulls the repo and builds the dependencies:
curl -L https://git.io/python-bootstrap | sh
or manually:
git clone https://github.com/harisekhon/devops-python-tools pytools
cd pytools
make
To install the pip dependencies for just a single script, you can type make followed by the filename with a .pyc extension instead of .py:
make anonymize.pyc
Make sure to read the Detailed Build Instructions further down for more information.
Some Hadoop tools require Jython - see the Jython for Hadoop Utils section for details.
All programs come with a --help switch which includes a program description and the list of command line options.
Environment variables are supported for convenience and also to hide credentials from being exposed in the process list, eg. $PASSWORD, $TRAVIS_TOKEN. These are indicated in the --help descriptions in brackets next to each option and often have more specific overrides with higher precedence, eg. $AMBARI_HOST and $HBASE_HOST take priority over $HOST.
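A minimal sketch of this convention, assuming a Bourne-compatible shell (each program's --help shows exactly which environment variables it honours):
export PASSWORD=mysecretpass    # generic credential, kept out of the process list
export HOST=hadoop01            # generic default host
export AMBARI_HOST=ambari01     # more specific override, takes precedence over $HOST for the Ambari tools
./ambari_trigger_service_checks.py --help    # one of the Ambari tools listed below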
anonymize.py - anonymizes your configs / logs from files or stdin (for pasting to Apache Jira tickets or mailing lists)
anonymize_custom.conf - put regex of your Name/Company/Project/Database/Tables to anonymize to <custom>
anonymized fields are replaced with placeholder tokens (eg. <fqdn>, <password>, <custom>)
--ip-prefix - leaves the last IP octet to aid in cluster debugging, so you can still see differentiated nodes communicating with each other to compare configs and log communications
--hash-hostnames - hashes hostnames to look like Docker temporary container ID hostnames so that vendors' support teams can differentiate hosts in clusters
anonymize_parallel.sh - splits files into multiple parts and runs anonymize.py on each part in parallel before re-joining back into a file of the same name with a .anonymized suffix. Preserves the order of evaluation, which is important for anonymization rules, as well as maintaining file content order. On servers this parallelization can result in a 30x speed up for large log files
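A hypothetical invocation combining the options described above (the exact flags needed to enable each anonymization are not shown here - check --help):
./anonymize.py --ip-prefix --hash-hostnames < hive-server2.log > hive-server2.log.sanitized
./anonymize_parallel.sh hive-server2.log    # produces hive-server2.log.anonymized, much faster on large logs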
find_duplicate_files.py - finds duplicate files in one or more directory trees via multiple methods including file basename, size, MD5 comparison of same sized files, or bespoke regex capture of partial file basename
find_active_server.py - finds the fastest responding healthy server or active master in high availability deployments, useful for scripting against clustered technologies (eg. Elasticsearch, Hadoop, HBase, Cassandra etc). Multi-threaded for speed and highly configurable - socket, http, https, ping, url and/or regex content match. See further down for more details and sub-programs that simplify usage for many of the most common cluster technologies
welcome.py - cool spinning welcome message greeting your username and showing last login time and user, to put in your shell's .profile (there is also a perl version in my DevOps Perl Tools repo)
aws_users_access_key_age.py - lists all users' access keys, status, date of creation and age in days. Optionally filters for active keys and those older than N days (for key rotation governance)
aws_users_unused_access_keys.py - lists users' access keys that haven't been used in the last N days or that have never been used (these should generally be removed/disabled). Optionally filters for only active keys
aws_users_last_used.py - lists all users and their days since last use across both passwords and access keys. Optionally filters for users not used in the last N days to find old accounts to remove
aws_users_pw_last_used.py - lists all users and the dates since their passwords were last used. Optionally filters for users with passwords not used in the last N days
gcp_service_account_credential_keys.py - lists all GCP service account credential keys for a given project with their age and expiry details, optionally filtering by non-expiring, already expired, or will expire within N days
docker_registry_show_tags.py / dockerhub_show_tags.py / quay_show_tags.py - shows tags for docker repos in a docker registry or on DockerHub or Quay.io - the Docker CLI doesn't support this yet but it's a very useful thing to be able to see live on the command line or use in shell scripts (use -q/--quiet to return only the tags for easy shell scripting). You can use this to pre-download all tags of a docker image before running tests across versions in a simple bash for loop, eg. docker_pull_all_tags.sh
dockerhub_search.py - search DockerHub with a configurable number of returned results (the older official docker search was limited to only 25 results); using --verbose will also show you how many results were returned to the terminal and how many DockerHub has in total (use -q/--quiet to return only the image names for easy shell scripting). This can be used to download all of my DockerHub images in a simple bash for loop, eg. docker_pull_all_images.sh, and can be chained with dockerhub_show_tags.py to download all tagged versions for all docker images, eg. docker_pull_all_images_all_tags.sh
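A sketch of the bash for-loop pattern described above, assuming -q/--quiet outputs one tag per line:
for tag in $(./dockerhub_show_tags.py -q centos); do
    docker pull "centos:$tag"
done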
dockerfiles_check_git*.py - check Git tags & branches align with the containing Dockerfile's ARG *_VERSION
spark_avro_to_parquet.py - PySpark Avro => Parquet converter
spark_parquet_to_avro.py - PySpark Parquet => Avro converter
spark_csv_to_avro.py - PySpark CSV => Avro converter, supports both inferred and explicit schemas
spark_csv_to_parquet.py - PySpark CSV => Parquet converter, supports both inferred and explicit schemas
spark_json_to_avro.py - PySpark JSON => Avro converter
spark_json_to_parquet.py - PySpark JSON => Parquet converter
xml_to_json.py - XML to JSON converter
json_to_xml.py - JSON to XML converter
json_to_yaml.py - JSON to YAML converter
json_docs_to_bulk_multiline.py - converts json files to bulk multi-record one-line-per-json-document format for pre-processing and loading to big data systems like Hadoop and MongoDB. Can recurse directory trees and mix json-doc-per-file / bulk-multiline-json / directories / standard input, combining all json documents and outputting bulk one-json-document-per-line to standard output for convenient command line chaining and redirection. Optionally continues on error, collecting broken records to standard error for logging and later reprocessing for bulk batch jobs. Even supports single-quoted json (which, while not technically valid json, is used by MongoDB) and handles embedded double quotes in 'single quoted json'
yaml_to_json.py - YAML to JSON converter (because some APIs like the GitLab CI Validation API require JSON)
See validate_*.py further down for all these formats and more.
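A hedged sketch of chaining these converters on the command line (file and directory argument handling is assumed - check each program's --help):
./yaml_to_json.py .gitlab-ci.yml > gitlab-ci.json            # eg. for the GitLab CI Validation API
./json_docs_to_bulk_multiline.py json_docs_dir/ > bulk.json  # one json document per line for bulk loading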
ambari_blueprints.py - Blueprint cluster templating and deployment tool using the Ambari API
See the ambari_blueprints/ directory for a variety of Ambari blueprint templates generated by and deployable using this tool.
ambari_ams_*.sh - query the Ambari Metrics Collector API for a given metric, or list all metrics or hosts
ambari_cancel_all_requests.sh - cancel all ongoing operations using the Ambari API
ambari_trigger_service_checks.py - trigger service checks using the Ambari API
Hadoop HDFS:
hdfs_find_replication_factor_1.py - finds HDFS files with replication factor 1, optionally resetting them to replication factor 3 to avoid missing block alerts during datanode maintenance windows
hdfs_time_block_reads.jy - HDFS per-block read timing debugger with datanode and rack locations for a given file or directory tree. Reports the slowest Hadoop datanodes in descending order at the end. Helps find cluster data layer bottlenecks such as slow datanodes, faulty hardware or misconfigured top-of-rack switch ports.
hdfs_files_native_checksums.jy - fetches native HDFS checksums for quicker file comparisons (about 100x faster than doing hdfs dfs -cat | md5sum)
hdfs_files_stats.jy - fetches HDFS file stats. Useful to generate a list of all files in a directory tree showing block size, replication factor, underfilled blocks and small files
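The .jy programs above are Jython scripts and need the Hadoop classpath, as described in the Jython for Hadoop Utils section further down - for example (the HDFS path is illustrative):
jython -J-cp $(hadoop classpath) hdfs_files_native_checksums.jy /data/mydir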
hive_schemas_csv.py / impala_schemas_csv.py - dumps all databases, tables, columns and types out in CSV format to standard output
The following programs can all optionally filter by database / table name regex:
hive_foreach_table.py / impala_foreach_table.py - execute any query or statement against every Hive / Impala table
hive_tables_row_counts.py / impala_tables_row_counts.py - outputs tables' row counts. Useful for reconciliation between cluster migrations
hive_tables_column_counts.py / impala_tables_column_counts.py - outputs tables' column counts. Useful for finding unusually wide tables
hive_tables_row_column_counts.py / impala_tables_row_column_counts.py - outputs tables' row and column counts. Useful for finding unusually big tables
hive_tables_row_counts_any_nulls.py / impala_tables_row_counts_any_nulls.py - outputs tables' row counts where any field is NULL. Useful for reconciliation between cluster migrations or catching data quality problems or subtle ETL bugs
hive_tables_null_columns.py / impala_tables_null_columns.py - outputs tables' columns containing only NULLs. Useful for catching data quality problems or subtle ETL bugs
hive_tables_null_rows.py / impala_tables_null_rows.py - outputs tables' row counts where all fields contain NULLs. Useful for catching data quality problems or subtle ETL bugs
hive_tables_metadata.py / impala_tables_metadata.py - outputs for each table the matching regex metadata DDL property from describe table
hive_tables_locations.py / impala_tables_locations.py - outputs for each table its data location
hbase_generate_data.py - inserts randomly generated data into a given HBase table, with optional skew support with configurable skew percentage. Useful for testing region splitting, balancing, CI tests etc. Outputs stats for number of rows written, time taken, rows per sec and volume per sec written.
hbase_show_table_region_ranges.py - dumps HBase table region ranges information, useful when pre-splitting tables
hbase_table_region_row_distribution.py - calculates the distribution of rows across regions in an HBase table, giving per-region row counts and % of total rows for the table as well as median and quartile row counts per region
hbase_table_row_key_distribution.py - calculates the distribution of row keys by configurable prefix length in an HBase table, giving per-prefix row counts and % of total rows for the table as well as median and quartile row counts per prefix
hbase_compact_tables.py - compacts HBase tables (for off-peak compactions). Defaults to finding and iterating on all tables or takes an optional regex and compacts only matching tables.
hbase_flush_tables.py - flushes HBase tables. Defaults to finding and iterating on all tables or takes an optional regex and flushes only matching tables.
hbase_regions_by_*size.py - queries given RegionServers' JMX to list the topN regions by storeFileSize or memStoreSize, ascending or descending
hbase_region_requests.py - calculates requests per second per region across all given RegionServers or average since RegionServer startup, with configurable interval and count, and can filter to any combination of reads / writes / total requests per second. Useful for watching more granular region stats to detect region hotspotting
hbase_regionserver_requests.py - calculates requests per second per RegionServer across all given RegionServers or average since RegionServer(s) startup(s), with configurable interval and count, and can filter to any combination of read, write, total, rpcScan, rpcMutate, rpcMulti, rpcGet, blocked per second. Useful for watching more granular RegionServer stats to detect RegionServer hotspotting
hbase_regions_least_used.py - finds the topN biggest/smallest regions across given RegionServers that have received the least requests (requests below a given threshold)
opentsdb_import_metric_distribution.py - calculates metric distribution in bulk import file(s) to find data skew and help avoid HBase region hotspotting
opentsdb_list_metrics*.sh - lists OpenTSDB metric names, tagk or tagv via the OpenTSDB API or directly from HBase tables, optionally with their created date, sorted ascending
pig-text-to-elasticsearch.pig - bulk index unstructured files in Hadoop to Elasticsearch
pig-text-to-solr.pig - bulk index unstructured files in Hadoop to Solr / SolrCloud clusters
pig_udfs.jy - Pig Jython UDFs for Hadoop
find_active_server.py - returns the first available healthy server or active master in high availability deployments, useful for chaining with single-argument tools. Configurable tests include socket, http, https, ping, url and/or regex content match, multi-threaded for speed. Designed to extend tools that only accept a single --host option but for which the technology has later added multi-master support or active-standby masters (eg. Hadoop, HBase), or where you want to query cluster-wide information available from any online peer (eg. Elasticsearch). See the example sketch after this list.
find_active_hadoop_namenode.py - returns the active Hadoop NameNode in HDFS HA
find_active_hadoop_resource_manager.py - returns the active Hadoop Resource Manager in Yarn HA
find_active_hbase_master.py - returns the active HBase Master in HBase HA
find_active_hbase_thrift.py - returns the first available HBase Thrift Server (run multiple of these for load balancing)
find_active_hbase_stargate.py - returns the first available HBase Stargate rest server (run multiple of these for load balancing)
find_active_apache_drill.py - returns the first available Apache Drill node
find_active_cassandra.py - returns the first available Apache Cassandra node
find_active_impala*.py - returns the first available Impala node of either Impalad, Catalog or Statestore
find_active_presto_coordinator.py - returns the first available Presto Coordinator
find_active_kubernetes_api.py - returns the first available Kubernetes API server
find_active_oozie.py - returns the first active Oozie server
find_active_solrcloud.py - returns the first available Solr / SolrCloud node
find_active_elasticsearch.py - returns the first available Elasticsearch node
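A hedged sketch of the chaining pattern these are designed for (hostnames are illustrative and the exact arguments are assumed - check --help):
HBASE_MASTER=$(./find_active_hbase_master.py master1 master2 master3)
./hbase_show_table_region_ranges.py --host "$HBASE_MASTER" ...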
travis_last_log.py - fetches the latest running / completed / failed Travis CI build log for a given repo - useful for quickly getting the log of the last failed build when CCMenu or BuildNotify applets turn red
travis_debug_session.py - launches a Travis CI interactive debug build session via the Travis API, tracks session creation and drops the user straight into the SSH shell on the remote Travis build - a very convenient one-shot debug launcher for Travis CI
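A hypothetical example using the $TRAVIS_TOKEN environment variable mentioned earlier (the repo argument format is assumed - check --help):
export TRAVIS_TOKEN=...    # kept out of the process list
./travis_last_log.py HariSekhon/DevOps-Python-tools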
selenium_hub_browser_test.py - checks Selenium Grid Hub / Selenoid is working by calling browsers such as Chrome and Firefox to fetch a given URL and content/regex match the result
validate_*.py - validate files, directory trees and/or standard input streams (.avro, .csv, json, parquet, .ini/.properties, .ldif, .xml, .yml/.yaml)
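For example, assuming validate_yaml.py, validate_json.py and validate_csv.py from the validate_*.py family (each accepts files, directory trees or standard input as described above):
./validate_yaml.py .travis.yml      # validate a single file
./validate_json.py configs/         # recurse a directory tree
./validate_csv.py < data.csv        # validate standard input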
The automated build will use 'sudo' to install the required Python PyPI libraries to the system unless running as root or it detects being inside a VirtualEnv. If you want to install some of the common Python libraries using your OS packages instead of installing from PyPI then follow the Manual Build section below.
Enter the pytools directory and run git submodule init and git submodule update to fetch my library repo:
git clone https://github.com/harisekhon/devops-python-tools pytools
cd pytools
git submodule init
git submodule update
sudo pip install -r requirements.txt
Download the DevOps Python Tools and Pylib git repos as zip files:
https://github.com/HariSekhon/devops-python-tools/archive/master.zip
https://github.com/HariSekhon/pylib/archive/master.zip
Unzip both and move Pylib to the pylib
folder under DevOps Python Tools.
unzip devops-python-tools-master.zip
unzip pylib-master.zip
mv -v devops-python-tools-master pytools
mv -v pylib-master pylib
mv -vf pylib pytools/
Proceed to install PyPI modules for whichever programs you want to use using your usual procedure - usually an internal mirror or proxy server to PyPI, or rpms / debs (some libraries are packaged by Linux distributions).
All PyPI modules are listed in the requirements.txt and pylib/requirements.txt files.
Internal Mirror example (JFrog Artifactory or similar):
sudo pip install --index-url https://host.domain.com/api/pypi/repo/simple --trusted-host host.domain.com -r requirements.txt
Proxy example:
sudo pip install --proxy hari:mypassword@proxy-host:8080 -r requirements.txt
The automated build also works on Mac OS X but you'll need to install Apple Xcode (on recent Macs just typing git is enough to trigger the Xcode install).
I also recommend you get HomeBrew to install other useful tools and libraries you may need like OpenSSL for development headers and tools such as wget (these are installed automatically if Homebrew is detected on Mac OS X):
bash-tools/setup/install_homebrew.sh
brew install openssl wget
If failing to build an OpenSSL lib dependency, just prefix the build command like so:
sudo OPENSSL_INCLUDE=/usr/local/opt/openssl/include OPENSSL_LIB=/usr/local/opt/openssl/lib ...
You may get errors trying to install to Python library paths even as root on newer versions of Mac; sometimes this is caused by pip 10 vs pip 9, and downgrading will work around it:
sudo pip install --upgrade pip==9.0.1
make
sudo pip install --upgrade pip
make
The 3 Hadoop utility programs listed below require Jython (as well as Hadoop being installed and correctly configured):
hdfs_time_block_reads.jy
hdfs_files_native_checksums.jy
hdfs_files_stats.jy
Run like so:
jython -J-cp $(hadoop classpath) hdfs_time_block_reads.jy --help
The -J-cp $(hadoop classpath)
part dynamically inserts the current Hadoop java classpath required to use the Hadoop APIs.
See below for procedure to install Jython if you don't already have it.
This will download and install jython to /opt/jython-2.7.0:
make jython
Jython is a simple download and unpack and can be fetched from http://www.jython.org/downloads.html
Then add the Jython install bin directory to the $PATH or specify the full path to the jython
binary, eg:
/opt/jython-2.7.0/bin/jython hdfs_time_block_reads.jy ...
Strict validations include host / domain / FQDN checks using TLDs populated from the official IANA list; this is done via my PyLib library submodule - see there for details on configuring this to permit custom TLDs like .local, .intranet, .vm, .cloud etc. (all already included in there because they're common across companies' internal environments).
If you end up with an error like:
./dockerhub_show_tags.py centos ubuntu
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:765)
It can be caused by an issue with the underlying Python + libraries due to changes in OpenSSL and certificates. One quick fix is to do the following:
sudo pip uninstall -y certifi &&
sudo pip install certifi==2015.04.28
Run make update. This will git pull and then git submodule update, which is necessary to pick up corresponding library updates.
If you update often and want to just quickly git pull + submodule update but skip rebuilding all those dependencies each time, then run make update-no-recompile (this will miss new library dependencies - do a full make update if you encounter issues).
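To summarize the two update paths:
make update                  # git pull + git submodule update + rebuild dependencies
make update-no-recompile     # quicker, but may miss new library dependencies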
Continuous Integration is run on this repo with tests for success and failure scenarios:
To trigger all tests run:
make test
which will start with the underlying libraries, then move on to top level integration tests and functional tests using docker containers if docker is available.
Patches, improvements and even general feedback are welcome in the form of GitHub pull requests and issue tickets.
DevOps Bash Tools - 550+ DevOps Bash Scripts, Advanced .bashrc, .vimrc, .screenrc, .tmux.conf, .gitconfig, CI configs & Utility Code Library - AWS, GCP, Kubernetes, Docker, Kafka, Hadoop, SQL, BigQuery, Hive, Impala, PostgreSQL, MySQL, LDAP, DockerHub, Jenkins, Spotify API & MP3 tools, Git tricks, GitHub API, GitLab API, BitBucket API, Code & build linting, package management for Linux / Mac / Python / Perl / Ruby / NodeJS / Golang, and lots more random goodies
SQL Scripts - 100+ SQL Scripts - PostgreSQL, MySQL, AWS Athena, Google BigQuery
Templates - dozens of Code & Config templates - AWS, GCP, Docker, Jenkins, Terraform, Vagrant, Puppet, Python, Bash, Go, Perl, Java, Scala, Groovy, Maven, SBT, Gradle, Make, GitHub Actions Workflows, CircleCI, Jenkinsfile, Makefile, Dockerfile, docker-compose.yml, M4 etc.
Kubernetes configs - Kubernetes YAML configs - Best Practices, Tips & Tricks are baked right into the templates for future deployments
The Advanced Nagios Plugins Collection - 450+ programs for Nagios monitoring your Hadoop & NoSQL clusters. Covers every Hadoop vendor's management API and every major NoSQL technology (HBase, Cassandra, MongoDB, Elasticsearch, Solr, Riak, Redis etc.) as well as message queues (Kafka, RabbitMQ), continuous integration (Jenkins, Travis CI) and traditional infrastructure (SSL, Whois, DNS, Linux)
DevOps Perl Tools - 25+ DevOps CLI tools for Hadoop, HDFS, Hive, Solr/SolrCloud CLI, Log Anonymizer, Nginx stats & HTTP(S) URL watchers for load balanced web farms, Dockerfiles & SQL ReCaser (MySQL, PostgreSQL, AWS Redshift, Snowflake, Apache Drill, Hive, Impala, Cassandra CQL, Microsoft SQL Server, Oracle, Couchbase N1QL, Dockerfiles, Pig Latin, Neo4j, InfluxDB), Ambari FreeIPA Kerberos, Datameer, Linux...
HAProxy Configs - 80+ HAProxy Configs for Hadoop, Big Data, NoSQL, Docker, Elasticsearch, SolrCloud, HBase, Cloudera, Hortonworks, MapR, MySQL, PostgreSQL, Apache Drill, Hive, Presto, Impala, ZooKeeper, OpenTSDB, InfluxDB, Prometheus, Kibana, Graphite, SSH, RabbitMQ, Redis, Riak, Rancher etc.
Dockerfiles - 50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Mesos, Consul, Riak, OpenTSDB, Jython, Advanced Nagios Plugins & DevOps Tools repos on Alpine, CentOS, Debian, Fedora, Ubuntu, Superset, H2O, Serf, Alluxio / Tachyon, FakeS3
PyLib - Python library leveraged throughout the programs in this repo as a submodule
Perl Lib - Perl version of above library
You might also be interested in the really nice Jupyter notebook for HDFS space analysis created by another Hortonworks guy, Jonas Straub.