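MITIE is an NLP toolkit built on top of the dlib machine learning library. It makes use of distributional word embeddings and structural SVMs, and provides pretrained language models for English, Spanish, and German. The MITIE core is written in C++, with support for integration from Python, R, Java, C, and MATLAB.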
# Clone the source and install the Python package
git clone https://github.com/mit-nlp/MITIE.git
cd MITIE
python setup.py install
You can now use the mitie package in your Python code:
from mitie import *
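Before going further, it can help to confirm that the install step actually made the package importable. The following is a minimal sketch using only the standard library; the helper name `is_importable` is hypothetical and not part of MITIE.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate a top-level module with this name."""
    # find_spec returns None when no module of that name is on the import path.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "mitie" is the package name used in the import statement above.
    print("mitie importable:", is_importable("mitie"))
```

If this reports False, the install step most likely ran under a different Python interpreter than the one used here.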
# Build the wordrep tool (the path is relative to the directory that contains the MITIE checkout)
cd MITIE/tools/wordrep
mkdir build
cd build
cmake ..
make
Contents of MITIE/tools/wordrep/build after the build:
- CMakeCache.txt
- CMakeFiles
- cmake_install.cmake
- dlib_build
- Makefile
- mitie_build
- wordrep
# Set up a working directory (assumes MITIE was cloned into the home directory)
cd ~
mkdir temp
cd temp
mv ~/MITIE/tools/wordrep/build/wordrep .
mkdir zh
cp ~/text.txt ./zh
./wordrep -e ./zh
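The commands above first create a directory, temp, as the working directory for this training run, and then move wordrep into it.
Next, the zh directory is created to hold the training corpus. The corpus format is shown below (one sentence per line, already word-segmented, with tokens separated by spaces):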
你 可 不要 小看 我
你 的 身高 是 多少
我 只有 二十厘米 高 , 现在 还 在 买 半价票 呢
你 有 多 高
我 个子 不高 呢 , 只有 二十厘米 , 出门 要 把 我 抱 在 怀里
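The corpus format above can be produced from already-segmented sentences with a few lines of Python. This is a minimal sketch, assuming segmentation has been done elsewhere and that the file is stored as UTF-8; `format_corpus` and `write_corpus` are hypothetical helper names, not part of MITIE.

```python
from pathlib import Path


def format_corpus(tokenized_sentences):
    """Join each token list with single spaces, one sentence per line."""
    lines = []
    for tokens in tokenized_sentences:
        # Drop empty tokens so no double spaces appear in the output.
        cleaned = [t.strip() for t in tokens if t.strip()]
        if cleaned:
            lines.append(" ".join(cleaned))
    return "".join(line + "\n" for line in lines)


def write_corpus(tokenized_sentences, path):
    """Write the formatted corpus as UTF-8 text (the encoding is an assumption)."""
    Path(path).write_text(format_corpus(tokenized_sentences), encoding="utf-8")


if __name__ == "__main__":
    sentences = [
        ["你", "可", "不要", "小看", "我"],
        ["你", "的", "身高", "是", "多少"],
    ]
    write_corpus(sentences, "text.txt")
```

Empty sentences are skipped rather than written as blank lines, which keeps the one-sentence-per-line layout intact.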
After training finishes, the directory structure is:
/temp
- substring_set.dat
- top_word_counts.dat
- total_word_feature_extractor.dat
- wordrep
- zh/
- substrings.txt
- top_words.txt
- word_morph_feature_extractor.dat
- word_vects.dat
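Before loading the model, a quick check that training produced the files listed above can save some debugging. The sketch below searches the working directory recursively, so it does not assume which subdirectory each file ends up in; `find_outputs` and `EXPECTED_OUTPUTS` are hypothetical names, and the file list is copied from the listing above.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_OUTPUTS = [
    "substring_set.dat",
    "top_word_counts.dat",
    "total_word_feature_extractor.dat",
    "substrings.txt",
    "top_words.txt",
    "word_morph_feature_extractor.dat",
    "word_vects.dat",
]


def find_outputs(workdir, names=EXPECTED_OUTPUTS):
    """Map each expected file name to its path under workdir, or None if absent."""
    root = Path(workdir)
    found = {}
    for name in names:
        # Search recursively, since the exact subdirectory is not assumed here.
        matches = sorted(root.rglob(name))
        found[name] = matches[0] if matches else None
    return found


if __name__ == "__main__":
    for name, path in find_outputs(".").items():
        print(f"{name}: {path if path else 'MISSING'}")
```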
from mitie import total_word_feature_extractor

print("loading Total Word Feature Extractor...")
# Adjust this path if total_word_feature_extractor.dat was written elsewhere
twfe = total_word_feature_extractor('zh/total_word_feature_extractor.dat')
# Get fingerprint of feature dictionary
print("Fingerprint of feature dictionary", twfe.fingerprint)
print()
# Get number of dimensions of feature vectors
print("Number of dimensions of feature vectors", twfe.num_dimensions)
print()
# Get number of words in the dictionary
print("Number of words in the dictionary", twfe.num_words_in_dictionary)
print()
# Get list of words in the dictionary and show the first 200 entries
words = twfe.get_words_in_dictionary()
print("First 200 words in dictionary", words[0:200])
print()
# Get the full feature vector for one word
feats = twfe.get_feature_vector("我")
print("All features of word '我'", feats[0:])
# The total word feature extractor will generate feature vectors for words not
# in its dictionary as well. It does this by looking at word morphology.
feats = twfe.get_feature_vector("_word_not_in_dictionary_")
print("First 50 features of word '_word_not_in_dictionary_'", feats[0:50])
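Output captured from a run of the script. The label text in this capture ("First 10 words", "First 5 features of word 'home'") is looser than the data it precedes: the capture shows 200 dictionary entries, the full feature vector for 我, and the first 50 features of the out-of-dictionary token: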
loading Total Word Feature Extractor...
Fingerprint of feature dictionary 1810187658478185215
Number of dimensions of feature vectors 271
Number of words in the dictionary 200000
First 10 words in dictionary [b'\x1f', b'!', b'"', b'#', b'##', b'###', b'####', b'#####', b'######', b'#######', b'########', b'#########', b'##########', b'###########', b'############', b'#############', b'##############', b'###############', b'################', b'#################', b'##################', b'###################', b'############.#', b'##########.##', b'#########.##', b'#########X', b'########.#', b'########.##', b'#######.#', b'#######.##', b'#######A', b'######.#', b'######.##', b'######W', b'#####.#', b'#####.##', b'#####.###', b'#####A', b'#####F##', b'#####G', b'#####Hz', b'#####K', b'#####KG', b'#####M#', b'#####kg', b'#####km', b'#####m', b'#####m#', b'#####mm', b'#####rpm', b'####.#', b'####.##', b'####.###', b'####.####', b'####A', b'####A###', b'####AA######', b'####AA##Z###', b'####ABG', b'####AGN', b'####B', b'####BASE', b'####Base', b'####C', b'####CB######', b'####CCTV', b'####CPU', b'####D', b'####DN', b'####E', b'####F', b'####G', b'####GB', b'####GE', b'####GS', b'####GSO', b'####GT', b'####GTS', b'####GTX', b'####H', b'####Hz', b'####I', b'####J', b'####K', b'####KB', b'####KBS', b'####KG', b'####KVA', b'####KW', b'####Kg', b'####L', b'####LE', b'####M', b'####M#', b'####MB', b'####MBC', b'####MHz', b'####MM', b'####MPa', b'####MW', b'####Mbps', b'####Mhz', b'####N', b'####P', b'####R', b'####RPM', b'####S', b'####SBS', b'####SE', b'####T', b'####U', b'####V', b'####W', b'####WW', b'####X', b'####XM', b'####XT', b'####a', b'####b', b'####bps', b'####c', b'####cc', b'####dpi', b'####g', b'####h', b'####i', b'####k', b'####kW', b'####kg', b'####km', b'####kw', b'####m', b'####m#', b'####mA', b'####mAH', b'####mAh', b'####mL', b'####mg', b'####ml', b'####mm', b'####n', b'####nm', b'####p', b'####ppm', b'####px', b'####r', b'####rpm', b'####s', b'####t', b'####w', b'####x###', b'####x####', b'####x####dpi', b'####x###CPU', b'###.#', b'###.##', b'###.###', b'###.####', b'###.#####', b'###.######', b'###.############', 
b'###.#############', b'###A', b'###A#', b'###B', b'###BASE', b'###Base', b'###C', b'###CC', b'###CM', b'###CPU', b'###D', b'###DPI', b'###E', b'###ER', b'###F', b'###FX', b'###G', b'###GB', b'###GC', b'###GM', b'###Gbps', b'###H', b'###HB', b'###HP', b'###HZ', b'###Hz', b'###I', b'###IDE', b'###IU', b'###J', b'###K', b'###KB', b'###KG', b'###KHz', b'###KM', b'###KN', b'###KV', b'###KVA', b'###KW']
First 5 features of word 'home' [0.0, 5.97073221206665, 0.9316330552101135, 2.238185405731201, 4.006947040557861, -1.1626752614974976, -0.7516040205955505, 1.227917194366455, -1.4145559072494507, -0.2640715539455414, 0.6591814160346985, 0.8076657056808472, -0.9249187707901001, -1.1740055084228516, -1.4045637845993042, 0.09082894027233124, 1.1813067197799683, -3.0874106884002686, 0.3911340534687042, 0.5611420273780823, 0.5512977242469788, 0.35278433561325073, 1.3899575471878052, -0.717283308506012, 0.6040155291557312, 0.18827565014362335, 0.034780170768499374, 1.6348782777786255, -0.29165568947792053, -0.5576142072677612, -0.0736689567565918, -0.10574998706579208, 0.134212926030159, -0.20430435240268707, 0.021542036905884743, -1.15532648563385, -0.7548226118087769, -0.9946640729904175, -1.7020361423492432, 0.5003638863563538, 0.44695088267326355, -0.4260140359401703, 0.5145021677017212, 0.2267107367515564, -0.7658609747886658, 0.6321385502815247, 0.21326670050621033, -0.17607718706130981, 0.028666986152529716, -0.5636664032936096, -0.06806807965040207, -0.13893628120422363, -0.08074233680963516, -0.14667809009552002, -0.33493536710739136, 0.14767462015151978, -0.859512448310852, -1.0438836812973022, 0.45696118474006653, 0.35149556398391724, -0.3600770831108093, 0.0042051272466778755, -0.29487869143486023, -0.38836199045181274, -0.6497148275375366, 1.0274734497070312, -0.36141809821128845, 0.5086874961853027, 0.6933434009552002, -0.03857343643903732, 0.20878709852695465, -0.5905593037605286, -0.24104253947734833, -0.34866392612457275, -0.02311673015356064, 0.3223198354244232, 0.013303913176059723, -0.001131782541051507, -0.3987281918525696, -0.067450612783432, 0.05423640459775925, -0.2831002175807953, -0.17328380048274994, -0.41849902272224426, -0.13344939053058624, -0.00761751551181078, 0.038340575993061066, -0.13267484307289124, -0.00775308720767498, 0.1571117639541626, -0.10469522327184677, 4.675013065338135, 0.2411571741104126, 1.4947019815444946, 
0.5157005786895752, 0.2781795263290405, -0.05498442053794861, 1.0271788835525513, -0.8578872084617615, -1.7238513231277466, -0.31369906663894653, 1.062556505203247, -0.9695983529090881, -0.012757161632180214, 0.05192200094461441, -0.04511411488056183, 0.45501765608787537, -1.540871024131775, 0.7663669586181641, -0.08107294142246246, 0.8695929646492004, -0.47152483463287354, 0.12559320032596588, -0.31935498118400574, 0.6184312105178833, -0.3377828598022461, -0.05853196978569031, 0.49122726917266846, -1.096035122871399, -0.9589917063713074, -0.016103968024253845, -0.2760816216468811, 0.041003793478012085, -0.28233757615089417, -0.12200267612934113, 0.6446003913879395, 0.3652326762676239, -0.6992666721343994, -0.3834816813468933, 0.06997451186180115, -0.23047509789466858, 0.09853935241699219, 0.42411935329437256, 0.6126387715339661, 0.6209908723831177, 0.1988573521375656, 0.7293605804443359, -0.8312186002731323, 0.2242833375930786, 0.07237271964550018, 0.8411544561386108, 0.7087806463241577, 0.17473521828651428, -0.37091606855392456, -0.39927002787590027, 0.19043025374412537, -0.32584795355796814, -1.7074172496795654, 0.35049715638160706, -0.5427274107933044, 0.3344540297985077, 0.8149847388267517, -0.1877124309539795, -0.6143186092376709, -0.33549371361732483, -0.010993757285177708, -0.13488821685314178, 0.12333866208791733, -0.3305477797985077, -0.8460304141044617, 0.9234035611152649, 0.14197398722171783, 0.17400005459785461, -0.5954909324645996, 0.3150802254676819, 0.2863784730434418, 0.05985362455248833, 0.28527897596359253, 0.13381527364253998, -0.44788849353790283, 0.03262010216712952, 1.1596838235855103, -0.8093892335891724, -0.33267325162887573, -0.1744241863489151, -0.005359921604394913, -0.12817354500293732, 0.1359061896800995, -0.08499933034181595, 0.27943363785743713, -0.006659158039838076, 1.3432492017745972, -0.23027761280536652, -0.348193496465683, 1.4949716329574585, -0.6175737380981445, 0.6155247092247009, 0.10847197473049164, 0.28908929228782654, 
0.2870860993862152, 0.07618990540504456, 0.8690569400787354, -0.17588752508163452, 0.9899291396141052, 0.3552875220775604, -0.5823841691017151, 0.33167511224746704, 0.9118424654006958, 0.1751895248889923, 0.3398944139480591, -0.7944103479385376, -0.05951415002346039, 0.4463740289211273, 0.7732852697372437, -0.7164537310600281, -0.8260955214500427, -0.8753208518028259, 0.5310025811195374, -0.2737683653831482, -0.5334843993186951, 0.2802194654941559, 0.3884824514389038, 0.13959674537181854, -0.23291979730129242, -0.21691960096359253, 0.3409668207168579, -0.780008852481842, -0.23220616579055786, 1.1251206398010254, -0.3358197808265686, -0.7508742213249207, -0.6566134095191956, 0.3255535066127777, 0.7322554588317871, -0.20198243856430054, -0.6569051146507263, 0.23611800372600555, -0.1041901707649231, 0.22667613625526428, 1.0130470991134644, 0.8066354393959045, -0.8881375789642334, 1.4128897190093994, -1.049863338470459, -0.44621843099594116, 1.2763875722885132, -1.6735546588897705, -0.9210537075996399, 0.23736755549907684, 0.07718025147914886, -1.6013381481170654, -0.6766175627708435, -0.570293128490448, -0.21253947913646698, -0.8021684885025024, -0.39882969856262207, -0.5555001497268677, -0.9446086883544922, 0.750538170337677, -0.2971731722354889, -1.1391425132751465, -0.4305364489555359, 0.4248833954334259, 1.117802381515503, 0.2961348593235016, 0.8358013033866882, 1.8770716190338135, 0.09662353992462158, -0.43155574798583984, 0.5047833919525146, -0.2876260578632355, 0.20243491232395172, -0.5778147578239441, 0.038383230566978455, -0.5310025215148926, -0.3335713744163513, 0.5339454412460327, 0.729848325252533, 1.0905710458755493, 0.49858012795448303, 0.5788264870643616]
First 5 features of word '_word_not_in_dictionary_' [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
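The dictionary entries in the output above are bytes objects, and the feature vectors are plain lists of floats. As a hedged sketch of how such results might be post-processed, the helpers below decode the entries to text and compare two vectors by cosine similarity. Both function names are hypothetical, UTF-8 is an assumed encoding, and the source does not state that cosine similarity is the intended way to compare these features.

```python
import math


def decode_words(words, encoding="utf-8"):
    """Turn bytes entries into str; entries that are already str are kept."""
    return [
        w.decode(encoding, errors="replace") if isinstance(w, bytes) else w
        for w in words
    ]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Avoid division by zero for all-zero vectors.
        return 0.0
    return dot / (norm_a * norm_b)
```

With the extractor loaded as twfe, a call such as cosine_similarity(twfe.get_feature_vector("我"), twfe.get_feature_vector("你")) would give a rough similarity score between the two words.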
4. References
Building Your Own Chinese NLU System with Rasa_NLU (用Rasa_NLU构建自己的中文NLU系统)