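compromise

modest natural-language processing
License: MIT License
Language: JavaScript
Category: Neural networks / AI, natural language processing
Software type: open source
Region: unknown
Submitted by: 秦锐 (Qin Rui)
Operating system: cross-platform
Open-source organization:
Audience: unknown
Software overview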
npm install compromise

isn't it weird how we can write text, but not parse it?
and how we can't get the information back out?

it's like we've agreed that
text is a dead-end.
and the knowledge in it
should not really be used.

compromise tries its best to parse text.
it is small, quick, and often good-enough.
it is not as smart as you'd think.

.match():

interpret and match text:

let doc = nlp(entireNovel)
doc.match('the #Adjective of times').text()
// "the blurst of times?"
if (doc.has('simon says #Verb') === false) {
  return null
}

.verbs():

conjugate and negate verbs in any tense:

let doc = nlp('she sells seashells by the seashore.')
doc.verbs().toPastTense()
doc.text()
// 'she sold seashells by the seashore.'

.nouns():

play between plural, singular and possessive forms:

let doc = nlp('the purple dinosaur')
doc.nouns().toPlural()
doc.text()
// 'the purple dinosaurs'

.numbers():

interpret plain-text numbers

nlp.extend(require('compromise-numbers'))

let doc = nlp('ninety five thousand and fifty two')
doc.numbers().add(2)
doc.text()
// 'ninety five thousand and fifty four'
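
To give a feel for what "interpreting plain-text numbers" involves, here is a toy parser in plain JavaScript. It is only a sketch: `parseNumberWords` is a hypothetical helper written for this illustration, not how compromise-numbers is implemented, and it only handles simple cardinal numbers.

```javascript
// Toy parser for English number words (illustration only,
// not the compromise-numbers implementation).
const SMALL = new Map(Object.entries({
  zero: 0, one: 1, two: 2, three: 3, four: 4, five: 5, six: 6, seven: 7,
  eight: 8, nine: 9, ten: 10, eleven: 11, twelve: 12, thirteen: 13,
  fourteen: 14, fifteen: 15, sixteen: 16, seventeen: 17, eighteen: 18,
  nineteen: 19, twenty: 20, thirty: 30, forty: 40, fifty: 50, sixty: 60,
  seventy: 70, eighty: 80, ninety: 90,
}))
const LARGE = new Map(Object.entries({ thousand: 1e3, million: 1e6 }))

function parseNumberWords(text) {
  let total = 0   // finished groups (thousands, millions)
  let current = 0 // the group currently being built
  for (const word of text.toLowerCase().split(/[\s-]+/)) {
    if (word === '' || word === 'and') continue
    if (SMALL.has(word)) {
      current += SMALL.get(word)
    } else if (word === 'hundred') {
      current = (current || 1) * 100
    } else if (LARGE.has(word)) {
      total += (current || 1) * LARGE.get(word)
      current = 0
    } else {
      throw new Error('unknown number word: ' + word)
    }
  }
  return total + current
}

const value = parseNumberWords('ninety five thousand and fifty two')
console.log(value + 2) // 95054
```

Writing the result back out as words ('ninety five thousand and fifty four') is the reverse problem, which this sketch leaves out.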

.topics():

names/places/orgs, tldr:

let doc = nlp(buddyHolly)
doc.people().if('mary').json()
// [{text:'Mary Tyler Moore'}]

doc = nlp(freshPrince)
doc.places().first().text()
// 'West Philadelphia'

doc = nlp('the opera about richard nixon visiting china')
doc.topics().json()
// [
//   { text: 'richard nixon' },
//   { text: 'china' }
// ]

.contractions():

handle implicit terms:

let doc = nlp("we're not gonna take it, no we ain't gonna take it.")

// match an implicit term
doc.has('going') // true

// transform
doc.contractions().expand()
doc.text()
// 'we are not going to take it, no we are not going to take it.'

Use it on the client-side:

<script src="https://unpkg.com/compromise"></script>
<script src="https://unpkg.com/compromise-numbers"></script>
<script>
  nlp.extend(compromiseNumbers)

  var doc = nlp('two bottles of beer')
  doc.numbers().minus(1)
  document.body.innerHTML = doc.text()
  // 'one bottle of beer'
</script>

as an es-module:

import nlp from 'compromise'

var doc = nlp('London is calling')
doc.verbs().toNegative()
doc.text()
// 'London is not calling'

compromise is 180kb (minified).

it's pretty fast. It can run on keypress.

it works mainly by conjugating all forms of a basic word list.

The final lexicon is ~14,000 words.

you can read more about how it works, here. it's weird.
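
The "conjugate a basic word list" idea can be sketched in a few lines of plain JavaScript. This is an assumption-heavy toy: `conjugate` and `buildLexicon` are hypothetical names, the rules only cover regular verbs, and the real library's data and rules are far richer.

```javascript
// Naive conjugation of a regular verb (hypothetical helper, toy rules only).
function conjugate(verb) {
  const stem = verb.endsWith('e') ? verb.slice(0, -1) : verb
  return {
    Infinitive: verb,
    PresentTense: verb + 's',
    PastTense: stem + 'ed',
    Gerund: stem + 'ing',
  }
}

// Expand a short base list into a lookup table of every generated form.
function buildLexicon(baseVerbs) {
  const lexicon = new Map()
  for (const verb of baseVerbs) {
    for (const [tag, form] of Object.entries(conjugate(verb))) {
      lexicon.set(form, tag)
    }
  }
  return lexicon
}

const lexicon = buildLexicon(['walk', 'bake'])
console.log(lexicon.get('walked')) // 'PastTense'
console.log(lexicon.get('baking')) // 'Gerund'
console.log(lexicon.size)          // 8 forms from 2 base words
```

The point is the ratio: a small hand-kept list fans out into a much larger lookup table, which is one way a compact library can recognise many word forms.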

.extend():

decide how words get interpreted:

let myWords = {
  kermit: 'FirstName',
  fozzie: 'FirstName',
}
let doc = nlp(muppetText, myWords)

or make heavier changes with a compromise-plugin.

const nlp = require('compromise')

nlp.extend((Doc, world) => {
  // add new tags
  world.addTags({
    Character: {
      isA: 'Person',
      notA: 'Adjective',
    },
  })

  // add or change words in the lexicon
  world.addWords({
    kermit: 'Character',
    gonzo: 'Character',
  })

  // add methods to run after the tagger
  world.postProcess(doc => {
    doc.match('light the lights').tag('#Verb . #Plural')
  })

  // add a whole new method
  Doc.prototype.kermitVoice = function () {
    this.sentences().prepend('well,')
    this.match('i [(am|was)]').prepend('um,')
    return this
  }
})
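
One way to picture what `isA` and `notA` mean in the tag definition above: applying a tag also applies its ancestors, and removes anything it conflicts with. The sketch below is plain JavaScript with a made-up `applyTag` helper and a tiny tag table. It mirrors the names in the plugin example, but the logic is an assumption, not the library's tagger.

```javascript
// Toy tag table. Names follow the plugin example; the behaviour is assumed.
const TAGS = {
  Noun: {},
  Adjective: {},
  Person: { isA: 'Noun' },
  Character: { isA: 'Person', notA: 'Adjective' },
}

// Apply a tag to a term's tag-set: add its ancestors, drop its conflicts.
function applyTag(termTags, tag) {
  const result = new Set(termTags)
  let current = tag
  while (current) {
    result.add(current)
    const def = TAGS[current] || {}
    if (def.notA) result.delete(def.notA)
    current = def.isA // walk up the isA chain
  }
  return result
}

const tags = applyTag(['Adjective'], 'Character')
console.log([...tags]) // [ 'Character', 'Person', 'Noun' ]
```

Under this reading, a term tagged `Character` would also answer to `#Person` and `#Noun` in a match.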

Docs:

gentle introduction:
Documentation:
Concepts          API                    Plugins
Accuracy          Accessors              Adjectives
Caching           Constructor-methods    Dates
Case              Contractions           Export
Filesize          Insert                 Hash
Internals         Json                   Html
Justification     Lists                  Keypress
Lexicon           Loops                  Ngrams
Match-syntax      Match                  Numbers
Performance       Nouns                  Paragraphs
Plugins           Output                 Scan
Projects          Selections             Sentences
Tagger            Sorting                Syllables
Tags              Split                  Pronounce
Tokenization      Text                   Strict
Named-Entities    Utils                  Penn-tags
Whitespace        Verbs                  Typeahead
World data        Normalization
Fuzzy-matching    Typescript
Talks:
Articles:
Some fun Applications:

API:

Constructor

(these methods are on the nlp object)

  • .tokenize() - parse text without running POS-tagging
  • .extend() - mix in a compromise-plugin
  • .fromJSON() - load a compromise object from .json() result
  • .verbose() - log our decision-making for debugging
  • .version() - current semver version of the library
  • .world() - grab all current linguistic data
  • .parseMatch() - pre-parse any match statements for faster lookups
Utils
  • .all() - return the whole original document ('zoom out')
  • .found [getter] - is this document empty?
  • .parent() - return the previous result
  • .parents() - return all of the previous results
  • .tagger() - (re-)run the part-of-speech tagger on this document
  • .wordCount() - count the # of terms in the document
  • .length [getter] - count the # of characters in the document (string length)
  • .clone() - deep-copy the document, so that no references remain
  • .cache({}) - freeze the current state of the document, for speed-purposes
  • .uncache() - un-freezes the current state of the document, so it may be transformed
Accessors
Match

(all match methods use the match-syntax.)

  • .match('') - return a new Doc, with this one as a parent
  • .not('') - return all results except for this
  • .matchOne('') - return only the first match
  • .if('') - return each current phrase, only if it contains this match ('only')
  • .ifNo('') - Filter-out any current phrases that have this match ('notIf')
  • .has('') - Return a boolean if this match exists
  • .lookBehind('') - search through earlier terms, in the sentence
  • .lookAhead('') - search through following terms, in the sentence
  • .before('') - return all terms before a match, in each phrase
  • .after('') - return all terms after a match, in each phrase
  • .lookup([]) - quick find for an array of string matches
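
For intuition about how a match like 'the #Adjective of times' can work, here is a tiny matcher in plain JavaScript. It is a sketch under assumptions: the term shape `{ text, tags }` and the `matchTerms` helper are invented for this example, and it supports only literal words, `#Tag` and the `.` wildcard, a small slice of the real match-syntax.

```javascript
// Does one pattern token accept one term?
function tokenMatches(token, term) {
  if (token === '.') return true // wildcard: any single term
  if (token.startsWith('#')) return term.tags.includes(token.slice(1))
  return term.text.toLowerCase() === token.toLowerCase()
}

// Return the first run of adjacent terms matching the pattern, or null.
function matchTerms(terms, pattern) {
  const tokens = pattern.trim().split(/\s+/)
  for (let start = 0; start + tokens.length <= terms.length; start++) {
    const slice = terms.slice(start, start + tokens.length)
    if (tokens.every((token, i) => tokenMatches(token, slice[i]))) return slice
  }
  return null
}

// Hand-tagged terms; a real tagger would produce these.
const terms = [
  { text: 'the', tags: ['Determiner'] },
  { text: 'best', tags: ['Adjective'] },
  { text: 'of', tags: ['Preposition'] },
  { text: 'times', tags: ['Noun', 'Plural'] },
]

const found = matchTerms(terms, 'the #Adjective of .')
console.log(found.map(t => t.text).join(' ')) // 'the best of times'
console.log(matchTerms(terms, 'simon says #Verb') !== null) // false
```

In these terms, a `.has('')` check is just "did the matcher return anything?".
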
Case
Whitespace
  • .pre('') - add this punctuation or whitespace before each match
  • .post('') - add this punctuation or whitespace after each match
  • .trim() - remove start and end whitespace
  • .hyphenate() - connect words with hyphen, and remove whitespace
  • .dehyphenate() - remove hyphens between words, and set whitespace
  • .toQuotations() - add quotation marks around these matches
  • .toParentheses() - add brackets around these matches
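
The whitespace methods are easier to reason about if each term is pictured as carrying its own leading and trailing characters. The sketch below assumes a term shape of `{ pre, text, post }` and uses made-up standalone functions; it is an illustration of the idea, not the library's internals.

```javascript
// Assumed term shape for this sketch: { pre, text, post }.
const render = terms => terms.map(t => t.pre + t.text + t.post).join('')

// Strip whitespace from the outer edges only.
function trim(terms) {
  const last = terms.length - 1
  return terms.map((t, i) => ({
    ...t,
    pre: i === 0 ? t.pre.trimStart() : t.pre,
    post: i === last ? t.post.trimEnd() : t.post,
  }))
}

// Join the terms with hyphens, dropping the whitespace between them.
function hyphenate(terms) {
  const last = terms.length - 1
  return terms.map((t, i) => ({
    ...t,
    pre: i === 0 ? t.pre : '',
    post: i === last ? t.post : '-',
  }))
}

const phrase = [
  { pre: ' ', text: 'merry', post: ' ' },
  { pre: '', text: 'go', post: ' ' },
  { pre: '', text: 'round', post: ' ' },
]

console.log(JSON.stringify(render(phrase)))                  // " merry go round "
console.log(JSON.stringify(render(trim(phrase))))            // "merry go round"
console.log(JSON.stringify(render(hyphenate(trim(phrase))))) // "merry-go-round"
```
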
Tag
  • .tag('') - Give all terms the given tag
  • .tagSafe('') - Only apply tag to terms if it is consistent with current tags
  • .unTag('') - Remove this tag from the given terms
  • .canBe('') - return only the terms that can be this tag
Loops
  • .map(fn) - run each phrase through a function, and create a new document
  • .forEach(fn) - run a function on each phrase, as an individual document
  • .filter(fn) - return only the phrases that return true
  • .find(fn) - return a document with only the first phrase that matches
  • .some(fn) - return true or false if there is one matching phrase
  • .random(n) - sample a subset of the results
Insert
Transform
Output
Selections
Subsets

Plugins:

These are some helpful extensions:

Adjectives

npm install compromise-adjectives

Dates

npm install compromise-dates

Numbers

npm install compromise-numbers

Export

npm install compromise-export

  • .export() - store a parsed document for later use
  • nlp.load() - re-generate a Doc object from .export() results
Html

npm install compromise-html

  • .html({}) - generate sanitized html from the document
Hash

npm install compromise-hash

  • .hash() - generate an md5 hash from the document+tags
  • .isEqual(doc) - compare the hash of two documents for semantic-equality
Keypress

npm install compromise-keypress

Ngrams

npm install compromise-ngrams

Paragraphs

npm install compromise-paragraphs

this plugin creates a wrapper around the default sentence objects.

Sentences

npm install compromise-sentences

Strict-match

npm install compromise-strict

Syllables

npm install compromise-syllables

  • .syllables() - split each term by its typical pronunciation
Penn-tags

npm install compromise-penn-tags


Typescript

we're committed to typescript/deno support, both in main and in the official-plugins:

import nlp from 'compromise'
import ngrams from 'compromise-ngrams'
import numbers from 'compromise-numbers'

const nlpEx = nlp.extend(ngrams).extend(numbers)

nlpEx('This is type safe!').ngrams({ min: 1 })
nlpEx('This is type safe!').numbers()

Partial-builds

or if you don't care about POS-tagging, you can use the tokenize-only build: (90kb!)

<script src="https://unpkg.com/compromise/builds/compromise-tokenize.js"></script>
<script>
  var doc = nlp('No, my son is also named Bort.')

  //you can see the text has no tags
  console.log(doc.has('#Noun')) //false

  //the rest of the api still works
  console.log(doc.has('my .* is .? named /^b[oa]rt/')) //true
</script>

Limitations:

  • slash-support: We currently split slashes up as different words, like we do for hyphens, so things like this don't work:
      nlp('the koala eats/shoots/leaves').has('koala leaves') //false

  • inter-sentence match: By default, sentences are the top-level abstraction. Inter-sentence, or multi-sentence, matches aren't supported without a plugin:
      nlp("that's it. Back to Winnipeg!").has('it back') //false

  • nested match syntax: the beauty (and danger) of regex is that you can recurse indefinitely. Our match syntax is much weaker. Things like this are not (yet) possible:
      doc.match('(modern (major|minor))? general')
    complex matches must be achieved with successive .match() statements.

  • dependency parsing: Proper sentence transformation requires understanding the syntax tree of a sentence, which we don't currently do. We should! Help wanted with this.
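
The inter-sentence limitation listed above follows naturally once sentences are the unit of matching. Here is a rough plain-JavaScript sketch of that behaviour. `splitSentences`, `words` and `hasPhrase` are hypothetical helpers, and the sentence splitter is deliberately naive.

```javascript
// Naive sentence splitter: break after ., ! or ? followed by whitespace.
function splitSentences(text) {
  return text.split(/(?<=[.!?])\s+/).filter(Boolean)
}

// Lowercased words with surrounding punctuation stripped.
function words(sentence) {
  return sentence
    .toLowerCase()
    .split(/\s+/)
    .map(w => w.replace(/^\W+|\W+$/g, ''))
    .filter(Boolean)
}

// True only if the phrase occurs as adjacent words inside ONE sentence.
function hasPhrase(text, phrase) {
  const target = ' ' + words(phrase).join(' ') + ' '
  return splitSentences(text).some(sentence =>
    (' ' + words(sentence).join(' ') + ' ').includes(target)
  )
}

const text = "that's it. Back to Winnipeg!"
console.log(hasPhrase(text, 'it back')) // false: the words sit in different sentences
console.log(hasPhrase(text, 'back to')) // true
```

Because each sentence is searched on its own, 'it' and 'Back' are never neighbours, which is the same outcome the example in the list reports.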

FAQ

    ☂️ Isn't javascript too...

      yeah it is!
      it wasn't built to compete with NLTK, and may not fit every project.
      string processing is synchronous too, and parallelizing node processes is weird.
      See here for information about speed & performance, and here for project motivations

    Can it run on my arduino-watch?

      Only if it's water-proof!
      Read quick start for running compromise in workers, mobile apps, and all sorts of funny environments.

    Compromise in other Languages?

      we've got work-in-progress forks for German and French, in the same philosophy.
      and need some help.

    Partial builds?

      we do offer a compromise-tokenize build, which has the POS-tagger pulled-out.
      but otherwise, compromise isn't easily tree-shaken.
      the tagging methods are competitive, and greedy, so it's not recommended to pull things out.
      Note that without full POS-tagging, the contraction-parser won't work perfectly (compare "spencer's cool" with "spencer's house").
      It's recommended to run the library fully.
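
Why the contraction-parser leans on POS-tags: "'s" can mean "is" or mark possession, and the word that follows is usually what settles it. A toy plain-JavaScript sketch, with a hypothetical `expandApostropheS` helper and a two-word tag table standing in for a real tagger:

```javascript
// Toy part-of-speech table; a real tagger would supply these tags.
const POS = new Map([
  ['cool', 'Adjective'],
  ['house', 'Noun'],
])

// Expand "<name>'s <word>" only when the next word's tag suggests "is".
function expandApostropheS(phrase) {
  const [first, next] = phrase.split(' ')
  if (!first.endsWith("'s")) return phrase
  const owner = first.slice(0, -2)
  if (POS.get(next) === 'Adjective') return `${owner} is ${next}`
  return phrase // treated as possessive (or unknown), so left alone
}

console.log(expandApostropheS("spencer's cool"))  // 'spencer is cool'
console.log(expandApostropheS("spencer's house")) // "spencer's house"
```

With no tags at all, both phrases look identical to the parser, which is the caveat the note above describes.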

See Also:

MIT
