My collection of bioinformatics one-liners that are useful in my day-to-day work, plus some of my own tricks.
05/21/2015.
## read length distribution of a gzipped fastq file
zcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'
## add a prefix (here "xxx") to every sequence line of a fastq file
cat test.fq | awk 'NR%4 == 2 {$0="xxx"$0}{print}'
@D00365:1187:HMM2FBCX2:1:1103:1258:2132 1:N:0:CGTGCAGA
xxxTATTACCAGATGAGAGCATGGTTAGG
+
DDDDDIIIIIIIHIIIIIIIIIIIII
@D00365:1187:HMM2FBCX2:1:1103:1472:2136 1:N:0:CGTGCAGA
xxxAACCATGAGTGTCCCGCTGGCATCGC
+
DDDADGHHIIHIIGIHHHFCHHIIII
@D00365:1187:HMM2FBCX2:1:1103:1822:2139 1:N:0:CGTGCAGA
xxxGTGCATATCATGTAGCGTATTATACT
+
DDDDDIIIIIIIIIIIIIIIIIIIII
@D00365:1187:HMM2FBCX2:1:1103:1943:2145 1:N:0:CGTGCAGA
xxxGATTCAGTCTCCAACCTCTCCTTTGT
+
DDDDDHIIIIIIIIIIHIIIIHIIII
@D00365:1187:HMM2FBCX2:1:1103:1917:2147 1:N:0:CGTGCAGA
xxxCCTTCGACAAGTTGTCAGGTGCGGTC
+
DDDDDHIIIIIIIIIIIIIIGIIHHH
echo 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'
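# the command above prints the reverse complement: ANNNAGCATAGCAAT
# extend the tr sets (e.g. tr 'ACGTacgt' 'TGCAtgca') if your sequence may contain lowercase bases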
## split a multi-fasta file into individual files, one sequence per file
csplit -z -q -n 4 -f sequence_ sequences.fasta /\>/ {*}
awk '/^>/{s=++d".fa"} {print > s}' multi.fa
## linearize a fasta file: one record per line, header and sequence separated by a tab
cat file.fasta | awk '/^>/{if(N>0) printf("\n"); ++N; printf("%s\t",$0);next;} {printf("%s",$0);}END{printf("\n");}'
awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file.fa
## convert fastq to fasta (strip the leading "@" so the header starts with ">")
zcat file.fastq.gz | paste - - - - | perl -F'\t' -ane 'print ">", substr($F[0],1), "\n$F[1]\n";' | gzip -c > file.fasta.gz
samtools view file.bam | perl -F'\t' -ane '$strand=($F[1]&16)?"-":"+";$length=1;$tmp=$F[5];$tmp =~ s/(\d+)[MD]/$length+=$1/eg;print "$F[2]\t$F[3]\t".($F[3]+$length)."\t$F[0]\t0\t$strand\n";' > file.bed
samtools mpileup -BQ0 file.sorted.bam | perl -pe '($c, $start, undef, $depth) = split;if ($c ne $lastC || $start != $lastStart+1) {print "fixedStep chrom=$c start=$start step=1 span=1\n";}$_ = $depth."\n";($lastC, $lastStart) = ($c, $start);' | gzip -c > file.wig.gz
cat file.fq | echo $((`wc -l`/4))
awk -v FS= '/^>/{print;next}{for (i=0;i<=NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}}' file
fold -w 60 file
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa
cat file.fq | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' | tr '\t' '\n' > out.fq
cat file.fq | paste - - - - - - - - | tee >(cut -f1-4 | tr '\t' '\n' > out1.fq) | cut -f5-8 | tr '\t' '\n' > out2.fq
samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 24 sh -c "samtools mpileup -BQ0 -d 100000 -uf yourGenome.fa -r {} yourFile.bam | bcftools view -vcg - > tmp.{}.vcf"
samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].vcf >> yourFile.vcf");'
awk '{print >> $1; close($1)}' input_file
cat nexterarapidcapture_exome_targetedregions_v1.2.bed | sort -k1,1 -k2,2n | sed 's/^chr//' | awk '{close(f);f=$1}{print > f".bed"}'
#or
awk '{print $0 >> $1".bed"}' example.bed
cat my.vcf | awk '$0~"^#" { print $0; next } { print $0 | "sort -k1,1V -k2,2n" }'
for file in *gz
do zcat $file > ${file/bed.gz/bed}
done
## show non-printing characters (tabs, line endings, etc.)
cat my_file | sed -n 'l'
cat -A my_file
## terminate an unresponsive ssh session (the ssh escape sequence, typed at the start of a line)
~.
## copy the directory itself into to_dir (creates to_dir/from_dir)
rsync -av from_dir to_dir
## copy every file inside from_dir to to_dir (note the trailing slash)
rsync -av from_dir/ to_dir
## re-copy the files, skipping the ones already completed:
rsync -avhP /from/dir /to/dir
## make a directory named with today's date (YYYY-MM-DD)
mkdir $(date +%F)
## disk usage
du -h --max-depth=1
du -ch
du -sh .
df -h
## show the column names of a csv file (csvkit)
csvcut -n
top -M
free -mg
## keep only the first occurrence of each (column1, column2) pair
awk '!a[$1,$2]++' input_file
## or sort, keeping only lines that are unique in the first two columns
sort -u -k1,2 file
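## a toy illustration (hypothetical input); the second "a 1" line is dropped:
printf 'a\t1\tx\na\t1\ty\nb\t2\tz\n' | awk '!a[$1,$2]++'
# a 1 x
# b 2 z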
less -S
fold -w 60
cat file.txt | column -t | less -S
-t $'\t'
## keep the header line plus the rows passing the filter
awk 'NR == 1 || ($10 > 1 && $11 > 0 && $18 > 0.001)' input_file
## delete empty lines
sed '/^$/d'
## delete the last line
sed '$d'
awk to join files based on several columns: see my github repo
### select lines from a file based on columns in another file
## http://unix.stackexchange.com/questions/134829/compare-two-columns-of-different-files-and-print-if-it-matches
awk -F"\t" 'NR==FNR{a[$1$2$3]++;next};a[$1$2$3] > 0' file2 file1
Finally learned about !$ in unix: it expands to the last word of the previous command.
echo hello, world
echo !$
gives 'world'
Create a script of the last executed command: echo "!!" > foo.sh
Reuse all parameters of the previous command line: !*
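A quick illustration (interactive bash only, hypothetical file names; history expansion is off in scripts):
mkdir -p results/2020
cd !$                  # expands to: cd results/2020
touch sample1.txt sample2.txt
wc -l !*               # expands to: wc -l sample1.txt sample2.txt
echo "!!" > foo.sh     # foo.sh now contains: wc -l sample1.txt sample2.txt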
find bam files in the current folder (searching recursively) and copy them to a new directory using 5 CPUs
find . -name "*bam" | xargs -P5 -I{} rsync -av {} dest_dir
ls -X
will group files by extension.
loop through all the chromosomes
for i in {1..22} X Y
do
echo $i
done
for i in {01..22}
will expand to 01 02 ...
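for example, in bash 4 or newer the leading zeros are kept:
echo {01..05}
# 01 02 03 04 05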
change every other newline to tab:
paste
is used to concatenate corresponding lines from files: paste file1 file2 file3 .... If one of the "file" arguments is "-", then lines are read from standard input. If there are 2 "-" arguments, then paste takes 2 lines from stdin. And so on.
cat test.txt
0 ATTTTATTNGAAATAGTAGTGGG
0 CTCCCAAAATACTAAAATTATAA
1 TTTTAGTTATTTANGAGGTTGAG
1 CNTAATCTTAACTCACTACAACC
2 TTATAATTTTAGTATTTTGGGAG
2 CATATTAACCAAACTAATCTTAA
3 GGTTAATATGGTGAAATTTAAT
3 ACCTCAACCTCNTAAATAACTAA
cat test.txt| paste - -
0 ATTTTATTNGAAATAGTAGTGGG 0 CTCCCAAAATACTAAAATTATAA
1 TTTTAGTTATTTANGAGGTTGAG 1 CNTAATCTTAACTCACTACAACC
2 TTATAATTTTAGTATTTTGGGAG 2 CATATTAACCAAACTAATCTTAA
3 GGTTAATATGGTGAAATTTAAT 3 ACCTCAACCTCNTAAATAACTAA
ORS: output record separator in awk
var = condition ? value_if_true : value_if_false is the ternary operator.
cat test.txt| awk 'ORS=NR%2?"\t":"\n"'
0 ATTTTATTNGAAATAGTAGTGGG 0 CTCCCAAAATACTAAAATTATAA
1 TTTTAGTTATTTANGAGGTTGAG 1 CNTAATCTTAACTCACTACAACC
2 TTATAATTTTAGTATTTTGGGAG 2 CATATTAACCAAACTAATCTTAA
3 GGTTAATATGGTGAAATTTAAT 3 ACCTCAACCTCNTAAATAACTAA
We can also use the concept of a conditional operator in print statement of the form print CONDITION ? PRINT_IF_TRUE_TEXT : PRINT_IF_FALSE_TEXT. For example, in the code below, we identify sequences with lengths > 14:
cat data/test.tsv
blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2 ACTTTATATATT
blah_C3 ACTTATATATATATA
blah_C4 ACTTATATATATATA
blah_C5 ACTTTATATATT
awk '{print (length($2)>14) ? $0">14" : $0"<=14";}' data/test.tsv
blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG>14
blah_C2 ACTTTATATATT<=14
blah_C3 ACTTATATATATATA>14
blah_C4 ACTTATATATATATA>14
blah_C5 ACTTTATATATT<=14
awk 'NR==3{print "";next}{printf $1"\t"}{print $1}' data/test.tsv
blah_C1 blah_C1
blah_C2 blah_C2
blah_C4 blah_C4
blah_C5 blah_C5
You can also use getline to load the contents of another file in addition to the one you are reading. In the statement below, the while loop loads each line from data/test.tsv into k until no more lines are left to read:
awk 'BEGIN{while((getline k <"data/test.tsv")>0) print "BEGIN:"k}{print}' data/test.tsv
BEGIN:blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
BEGIN:blah_C2 ACTTTATATATT
BEGIN:blah_C3 ACTTATATATATATA
BEGIN:blah_C4 ACTTATATATATATA
BEGIN:blah_C5 ACTTTATATATT
blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2 ACTTTATATATT
blah_C3 ACTTATATATATATA
blah_C4 ACTTATATATATATA
blah_C5 ACTTTATATATT
see post
linearize.awk:
/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}
paste <(awk -f linearize.awk file1.fa ) <(awk -f linearize.awk file2.fa )| tr "\t" "\n"
grep -A 2 -B 1 'AAGTTGATAACGGACTAGCCTTATTTT' file.fq | sed '/^--$/d' > out.fq
# or
zcat reads.fq.gz \
| paste - - - - \
| awk -v FS="\t" -v OFS="\n" '$2 ~ "AAGTTGATAACGGACTAGCCTTATTTT" {print $1, $2, $3, $4}' \
| gzip > filtered.fq.gz
cat file.tsv | head -1 | tr "\t" "\n" | wc -l
csvcut -n -t file.tsv (from csvkit)
awk '{print NF; exit}' file.tsv
awk -F "\t" 'NR == 1 {print NF}' file.tsv
cat myfasta.txt
>Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Blap_contig7988
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Bluc_contig1223663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
cat my_info.txt
info1
info2
info3
info4
paste <(cat my_info.txt) <(cat myfasta.txt| paste - - | cut -c2-) | awk '{printf(">%s_%s\n%s\n",$1,$2,$3);}'
>info1_Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info2_Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info3_Blap_contig7988
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info4_Bluc_contig1223663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
## emulate csvcut -n -t
less file.tsv | head -1 | tr "\t" "\n" | nl
awk -F "\t" 'NR == 1 {print NF}' file.tsv
awk '{print NF; exit}'
see https://www.biostars.org/p/53212/
The fasta header is like ">7 dna:chromosome chromosome:GRCh37:7:1:159138663:1"; convert it to ">7":
cat Homo_sapiens_assembly19.fasta | gawk '/^>/ { b=gensub(" dna:.+", "", "g", $0); print b; next} {print}' > Homo_sapiens_assembly19_reheader.fasta
## make a directory and cd into it ($_ holds the last argument of the previous command)
mkdir blah && cd $_
http://crazyhottommy.blogspot.com/2016/10/cutting-out-500-columns-from-26g-file.html
#! /bin/bash
set -e
set -u
set -o pipefail
#### Author: Ming Tang (Tommy)
#### Date 09/29/2016
#### I got the idea from this stackOverflow post http://stackoverflow.com/questions/11098189/awk-extract-columns-from-file-based-on-header-selected-from-2nd-file
# show help
show_help(){
cat << EOF
This is a wrapper for extracting columns from a (big) dataframe based on a list of column names in another
file. The column names must be one per line. The output goes to stdout. For small files (< 2G) one
can load the data into R and do it easily, but when the file is big (> 10G) R is quite cumbersome.
Using unix commands, on the other hand, is better because the file does not have to be loaded into memory at once.
e.g. subsetting a 26G file for 700 columns takes around 30 mins. The memory footprint is very low (~4MB).
usage: ${0##*/} -f <dataframe> -c <colNames> -d <delimiter of the file>
-h display this help and exit.
-f the file you want to extract columns from. Must contain a header with column names.
-c a file with one column name per line.
-d delimiter of the dataframe: , or \t. Default is tab.
e.g.
for tsv file:
${0##*/} -f mydata.tsv -c colnames.txt -d $'\t', or simply omit the -d; the default is tab.
for csv file: Note you have to specify -d , if your file is csv, otherwise all columns will be cut out.
${0##*/} -f mydata.csv -c colnames.txt -d ,
EOF
}
## if there are no arguments provided, show help
if [[ $# == 0 ]]; then show_help; exit 1; fi
while getopts ":hf:c:d:" opt; do
case "$opt" in
h) show_help;exit 0;;
f) File2extract=$OPTARG;;
c) colNames=$OPTARG;;
d) delim=$OPTARG;;
'?') echo "Invalid option $OPTARG"; show_help >&2; exit 1;;
esac
done
## set the default delimiter to tab; note how the tab is specified
delim=${delim:-$'\t'}
## get the number of columns in the data frame that match the column names in the colNames file.
## change the output to 2,5,6,22,... and get rid of the last comma so cut -f can be used
cols=$(head -1 "${File2extract}" | tr "${delim}" "\n" | grep -nf "${colNames}" | sed 's/:.*$//' | tr "\n" "," | sed 's/,$//')
## cut out the columns
cut -d"${delim}" -f"${cols}" "${File2extract}"
or use csvtk from Shen Wei:
csvtk cut -t -f $(paste -s -d , list.txt) data.tsv
awk '{print $0 "\t" FILENAME}' *bed
# add chr
sed 's/^/chr/' my.bed
# or
awk 'BEGIN {OFS = "\t"} {$1="chr"$1; print}'
# remove chr
sed 's/^chr//' my.bed
awk '{print NF}' test.tsv | sort -nu | head -n 1
https://www.biostars.org/p/134331/
BAM="yourFile.bam"
REF="reference.fasta"
samtools view -H $BAM | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 24 sh -c "samtools mpileup -BQ0 -d 100000 -uf $REF -r \"{}\" $BAM | bcftools call -cv > \"{}\".vcf"
This is better than tr "\n" "\t" because sometimes I do not want to convert the last newline to a tab.
cat myfile.txt | paste -s
I usually do it in R, but I like this quick solution.
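## for example, with three lines "a", "b", "c" (hypothetical input):
printf 'a\nb\nc\n' | tr '\n' '\t'    # a<TAB>b<TAB>c<TAB>  (trailing tab, no final newline)
printf 'a\nb\nc\n' | paste -s        # a<TAB>b<TAB>c       (ends with a newline instead)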
## merge multiple csv files, keeping the header only from the first file
awk 'FNR==1 && NR!=1{next;}{print}' *.csv
# or
awk '
FNR==1 && NR!=1 { while (/^<header>/) getline; }
1 {print}
' file*.txt >all.txt
cut -f1-4 F5.hg38.enhancers.expression.usage.matrix | head
CNhs11844 CNhs11251 CNhs11282 CNhs10746
chr10:100006233-100006603 1 0 0
chr10:100008181-100008444 0 0 0
chr10:100014348-100014634 0 0 0
chr10:100020065-100020562 0 0 0
chr10:100043485-100043744 0 0 0
chr10:100114218-100114567 0 0 0
chr10:100148595-100148922 0 0 0
chr10:100182422-100182522 0 0 0
chr10:100184498-100184704 0 0 0
sed '1 s/^/enhancer\t/' F5.hg38.enhancers.expression.usage.matrix | cut -f1-4 | head
enhancer CNhs11844 CNhs11251 CNhs11282
chr10:100006233-100006603 1 0 0
chr10:100008181-100008444 0 0 0
chr10:100014348-100014634 0 0 0
chr10:100020065-100020562 0 0 0
chr10:100043485-100043744 0 0 0
chr10:100114218-100114567 0 0 0
chr10:100148595-100148922 0 0 0
chr10:100182422-100182522 0 0 0
chr10:100184498-100184704 0 0 0
## keep the header lines plus the PASS variants
cat my.vcf | awk -F '\t' '{if($0 ~ /^#/) print; else if($7 == "PASS") print}' > my_PASS.vcf
## replace a pattern only in column 5
awk '{gsub(/pattern/, "replacement", $5)}1' in.file
## http://bioinf.shenwei.me/csvtk/usage/#replace
csvtk replace -f 5 -p pattern -r replacement
https://www.linkedin.com/pulse/move-running-process-screen-bruce-werdschinski/
1. Suspend: Ctrl+z
2. Resume in the background: bg
3. Disown: disown %1
4. Launch screen
5. Find the pid: pgrep BLAH
6. Reparent the process: reptyr <pid>
A worked sketch of the whole sequence is shown below.
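## a sketch of the whole sequence (the command name and pid are hypothetical):
./my_long_job.sh            # started in a plain terminal, outside screen/tmux
# press Ctrl+z              # suspend it
bg                          # resume it in the background
disown %1                   # detach it from the current shell
screen                      # start a screen (or tmux) session
pgrep -f my_long_job.sh     # find its pid, say 12345
reptyr 12345                # re-attach the process to the current terminal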
## count the number of unique elements in the comma-separated list in column 2
# input
blabla_1 A,B,C,C
blabla_2 A,E,G
blabla_3 R,Q,A,B,C,R,Q
# output
blabla_1 3
blabla_2 3
blabla_3 5
awk '{split(x,C); n=split($2,F,/,/); for(i in F) if(C[F[i]]++) n--; print $1, n}' file
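## a more verbose but equivalent version (a sketch; `delete seen` works in gawk and mawk):
awk '{
    delete seen                        # reset the set of seen elements for this line
    n = split($2, parts, ",")          # split the comma-separated list in column 2
    uniq = 0
    for (i = 1; i <= n; i++)
        if (!seen[parts[i]]++) uniq++  # count each element only the first time it appears
    print $1, uniq
}' file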
https://twitter.com/David_McGaughey/status/1106371758142173185
Create TSS bed from GTF in one line:
zcat gencode.v29lift37.annotation.gtf.gz | awk '$3=="gene" {print $0}' | grep protein_coding | awk -v OFS="\t" '{if ($7=="+") {print $1, $4, $4+1} else {print $1, $5-1, $5}}' > tss.bed
or 5kb flanking tss
zcat gencode.v29lift37.annotation.gtf.gz | awk '$3=="gene" {print $0}' | grep protein_coding | awk -v OFS="\t" '{if ($7=="+") {print $1, $4, $4+5000} else {print $1, $5-5000, $5}}' > promoters.bed
caveat: some genes are near the ends of chromosomes, so adding or subtracting 5000 bp can run past the chromosome boundary; use bedtools slop with a genome size file to avoid that.
download fetchChromSizes
from http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
fetchChromSizes hg19 > chrom_size.txt
zcat gencode.v29lift37.annotation.gtf.gz | awk '$3=="gene" {print $0}' | awk -v OFS="\t" '{if ($7=="+") {print $1, $4, $4+1} else {print $1, $5-1, $5}}' | bedtools slop -i - -g chrom_size.txt -b 5000 > promoter_5kb.bed
reverse column 3 and put it in column 5
awk -v OFS="\t" '{"echo "$3 "| rev" | getline $5}{print $0}'
#or use perl to reverse the third column
perl -lane 'BEGIN{$,="\t"}{$rev=reverse $F[2]; print $F[0],$F[1],$rev,$F[3]}'
## get the full (absolute) path of a file
realpath file.txt
readlink -f file.txt
https://github.com/Piezoid/pugz
Unlike pigz, which does single-threaded decompression (see https://github.com/madler/pigz/blob/master/pigz.c#L232), pugz does truly parallel decompression.
#! /bin/bash
set -euo pipefail
module load singularity
# Need a unique /tmp for this job for /tmp/rstudio-rsession & /tmp/rstudio-server
WORKDIR=/liulab/${USER}/singularity_images
mkdir -m 700 -p ${WORKDIR}/tmp2
mkdir -m 700 -p ${WORKDIR}/tmp
PASSWORD='xyz' singularity exec --bind "${WORKDIR}/tmp2:/var/run/rstudio-server" --bind "${WORKDIR}/tmp:/tmp" --bind="/liulab/${USER}" geospatial_4.0.2.simg rserver --www-port 8888 --auth-none=0 --auth-pam-helper-path=pam-helper --www-address=127.0.0.1
Add the following to the top of your ~/.ssh/config to keep ssh sessions from dropping:
Host *
ServerAliveInterval 60
I also use screen/tmux, as well as mosh.