1.2
How to use the documentation
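artifact: text, audio, video, images, etc.
Analysis Engines (AEs): analyze artifacts.
Analysis results: the output produced by AEs; they are meta-data about the original artifact. Depending on what you want to analyze, you get a series of statements, e.g. "Bush" denotes a person; the topic of the document is "Bush and Golf".
Type: a pre-defined term, e.g. person, or the topic of the document. It names the kind of result you want the AE to produce.
Annotation Type: has begin and end features, which are positions in the document.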
tightly-coupled: running in the same process
loosely-coupled: running in separate processes or even on different machines
step 2:
Different component analytics solve different parts of an analysis task, e.g. one detects person names and another detects relationships between persons. These component analytics must be easy to assemble.
Annotators: the core of an AE. The analysis work is done by Annotators; inside an Annotator, developers define what analysis to perform.
CAS (Common Analysis Structure): represents analysis results.
Component Descriptors (CD): written in XML; they contain metadata describing the component, its identity, structure and behavior.
delegate analysis engines: The internal AEs specified in an aggregate are also called the delegate analysis engines.
Analysis Engine Assembler: We refer to the development role associated with building an aggregate from delegate AEs as the Analysis Engine Assembler.
Collection Reader: its job is to connect to and iterate through a source collection, acquiring documents and initializing CASes for analysis.
CAS Consumers: Their job is to do the final CAS processing. A CAS Consumer may be implemented, for example, to index CAS contents in a search engine.
Collection Processing Engine (CPE): an aggregate component that specifies a “source to sink” flow from a Collection Reader through a set of analysis engines and then to a set of CAS Consumers.
CPE Descriptors: CPEs are specified by XML files called CPE Descriptors.
2. UIMA is a bridge between unstructured information and structured information. Its role is to analyze the content of unstructured information and produce structured information from it.
3. For each person found in the body of a document, the AE would
create a Person object in the CAS and link it to the span of text where the person was mentioned in the document.
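The idea of linking a Person object to a span of text can be sketched in plain Java. This is only a model of the concept and does not use the UIMA API: the `Person` class, `findPersons`, and the sample sentence are all hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model only; this is NOT the UIMA API.
class PersonSpanSketch {

    // Hypothetical stand-in for a Person annotation: character offsets into the document.
    static final class Person {
        final int begin; // offset of the first character of the mention
        final int end;   // offset just past the last character of the mention

        Person(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        // Recover the mentioned text from the span.
        String coveredText(String document) {
            return document.substring(begin, end);
        }
    }

    // Toy "analysis": record every occurrence of one known name.
    static List<Person> findPersons(String document, String name) {
        List<Person> found = new ArrayList<>();
        int from = document.indexOf(name);
        while (from >= 0) {
            found.add(new Person(from, from + name.length()));
            from = document.indexOf(name, from + name.length());
        }
        return found;
    }

    public static void main(String[] args) {
        String doc = "Bush played golf. Later Bush gave a speech.";
        for (Person p : findPersons(doc, "Bush")) {
            System.out.println(p.begin + "-" + p.end + ": " + p.coveredText(doc));
        }
    }
}
```

The object stores only offsets, not a copy of the text, which is why the document itself is needed to recover the mention.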
4. UIMA Context: can ensure that different annotators working together in an aggregate flow may share the same instance of an external file.
5. A simple or primitive UIMA Analysis Engine (AE) contains a single annotator. AEs, however, may be defined to contain other AEs organized in a workflow.
6. Users of this AE need not know how it is constructed internally but only need its name and its published input requirements and output types. These must be declared in the aggregate AE's descriptor. Aggregate AE's descriptors declare the components they contain and a flow specification. The flow specification defines the order in which the internal component AEs should be run.
7. We refer to the development role associated with building an aggregate from delegate AEs as the Analysis Engine Assembler.
8. The UIMA framework implementation has two built-in flow implementations: one that supports a linear flow between components, and one with conditional branching based on the language of the document. It also supports user-provided flow controllers.
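A linear flow can be pictured as an aggregate that hands the same shared result container to each delegate in the declared order. The sketch below is a plain-Java model under that assumption, not UIMA's flow controller API; `Delegate`, `Aggregate`, and the two toy delegates are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of a fixed (linear) flow; NOT the UIMA flow controller API.
class LinearFlowSketch {

    // Hypothetical delegate: reads from and appends to a shared list of results.
    interface Delegate {
        void process(List<String> sharedResults);
    }

    // Hypothetical aggregate: runs its delegates in the order given by the flow.
    static final class Aggregate {
        private final List<Delegate> flow;

        Aggregate(List<Delegate> flow) {
            this.flow = flow;
        }

        List<String> run() {
            List<String> shared = new ArrayList<>(); // plays the role of the CAS
            for (Delegate d : flow) {
                d.process(shared);
            }
            return shared;
        }
    }

    static Aggregate example() {
        Delegate tokenizer = shared -> shared.add("tokens");
        // The second delegate builds on what the first one left behind.
        Delegate personFinder =
            shared -> shared.add("persons(after " + shared.get(shared.size() - 1) + ")");
        return new Aggregate(Arrays.asList(tokenizer, personFinder));
    }

    public static void main(String[] args) {
        System.out.println(example().run());
    }
}
```

Callers of `Aggregate.run()` never see the individual delegates, which mirrors the point that users of an aggregate need not know how it is built internally.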
9. The application then decides what to do with the returned CAS. There are many possibilities. For instance the application could: display the results, store the CAS to disk for post processing, extract and index analysis results as part of a search or database application, etc.
10. An Analysis Engine (AE) may contain a single annotator (this is referred to as a Primitive AE), or it may be a composition of others and therefore contain multiple annotators (this is referred to as an Aggregate AE).
Annotators produce their analysis results in the form of typed Feature Structures, which are simply data structures that have a type and a set of (attribute, value) pairs.
All feature structures, including annotations, are represented in the UIMA Common Analysis Structure (CAS).
UIMA provides a native Java interface to the CAS called the JCas. The JCas represents each feature structure as a Java object.
Keep in mind that the CAS can represent arbitrary types of feature structures, and feature structures can refer to other feature structures.
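To make "a type and a set of (attribute, value) pairs" concrete, here is a minimal plain-Java sketch. It is not how the CAS or JCas is implemented; the `FeatureStructure` class and the `Person`, `Organization`, and `worksFor` names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of a feature structure; NOT the CAS or JCas API.
class FeatureStructureSketch {

    static final class FeatureStructure {
        final String type;                                          // the type name
        final Map<String, Object> features = new LinkedHashMap<>(); // (attribute, value) pairs

        FeatureStructure(String type) {
            this.type = type;
        }

        FeatureStructure set(String name, Object value) {
            features.put(name, value);
            return this;
        }

        Object get(String name) {
            return features.get(name);
        }
    }

    static FeatureStructure example() {
        FeatureStructure org = new FeatureStructure("Organization").set("name", "ExampleOrg");
        FeatureStructure person = new FeatureStructure("Person").set("begin", 0).set("end", 4);
        // A feature value may itself be another feature structure.
        person.set("worksFor", org);
        return person;
    }

    public static void main(String[] args) {
        FeatureStructure p = example();
        FeatureStructure employer = (FeatureStructure) p.get("worksFor");
        System.out.println(p.type + " worksFor " + employer.type);
    }
}
```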
UIMA defines basic primitive types such as Boolean, Byte, Short, Integer, Long, Float, and Double, as well as Arrays of these primitive types. UIMA also defines the built-in types TOP, which is the root of the type system, analogous to Object in Java; FSArray, which is an array of Feature Structures (i.e. an array of instances of TOP); and Annotation.
The built-in Annotation type declares three fields (called Features in CAS terminology). The features begin and end store the character offsets of the span of text to which the annotation refers. The feature sofa (Subject of Analysis) indicates which document the begin and end offsets point into.
Annotator implementations all implement a standard interface (AnalysisComponent), having several methods, the most important of which are: 1) initialize, 2) process, 3) destroy.
initialize is called by the framework once when it first creates an instance of the annotator class. process is called once per item being processed. destroy may be called by the application when it is done using your annotator. There is a default implementation of this interface for annotators using the JCas, called JCasAnnotator_ImplBase, which has implementations of all required methods except for the process method.
We call annotation.addToIndexes() to add the new annotation to the indexes maintained in the CAS. By default, the CAS implementation used for analysis of text documents keeps an index of all annotations in their order from beginning to end of the document. Subsequent annotators or applications use the indexes to iterate over the annotations.
If you don't add the instance to the indexes, it cannot be retrieved by down-stream annotators.
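The effect of indexing can be illustrated with a small plain-Java model. It is a simplification and not the CAS index implementation: it orders by begin offset only, and the `Annotation` and `Index` classes here are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of an annotation index; NOT the CAS index implementation.
class AnnotationIndexSketch {

    static final class Annotation {
        final String type;
        final int begin;
        final int end;

        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    static final class Index {
        private final List<Annotation> indexed = new ArrayList<>();

        // Simplified: keep annotations ordered by begin offset only.
        void addToIndexes(Annotation a) {
            indexed.add(a);
            indexed.sort(Comparator.comparingInt((Annotation x) -> x.begin));
        }

        // What a downstream component can iterate over.
        List<Annotation> all() {
            return new ArrayList<>(indexed);
        }
    }

    public static void main(String[] args) {
        Index index = new Index();
        index.addToIndexes(new Annotation("RoomNumber", 10, 16));
        index.addToIndexes(new Annotation("RoomNumber", 2, 8));
        Annotation forgotten = new Annotation("RoomNumber", 20, 26); // created but never indexed
        for (Annotation a : index.all()) {
            System.out.println(a.type + " " + a.begin + "-" + a.end);
        }
    }
}
```

In this model the `forgotten` annotation exists as an object but never shows up in `all()`, which is the situation the note above warns about.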
On the Capabilities page, we define our annotator's inputs and outputs, in terms of the types in the type system. Although capabilities come in sets, having multiple sets is deprecated; here we're just using one set. The RoomNumberAnnotator is very simple. It requires no input types, as it operates directly on the document text -- which is supplied as a part of the CAS initialization.
UIMA allows annotators to declare configuration parameters in their descriptors. The descriptor also specifies default values for the parameters, though these can be overridden at runtime.
The initialize method is a good place to read configuration parameter values.
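The pattern of descriptor defaults, runtime overrides, and reading values once in initialize can be sketched in plain Java. This does not use the UIMA context or descriptor API; the `minLength` parameter, the `Annotator` class, and the `effective` helper are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of configuration parameters; NOT the UIMA context or descriptor API.
class ConfigParamSketch {

    // Start from the descriptor's defaults, then apply any runtime overrides on top.
    static Map<String, Object> effective(Map<String, Object> defaults,
                                         Map<String, Object> overrides) {
        Map<String, Object> merged = new HashMap<>(defaults);
        merged.putAll(overrides);
        return merged;
    }

    // Hypothetical annotator that reads its parameter once, in initialize.
    static final class Annotator {
        private int minLength;

        void initialize(Map<String, Object> params) {
            minLength = (Integer) params.get("minLength");
        }

        boolean accepts(String token) {
            return token.length() >= minLength;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> defaults = new HashMap<>();
        defaults.put("minLength", 3);
        Map<String, Object> overrides = new HashMap<>();
        overrides.put("minLength", 2);

        Annotator annotator = new Annotator();
        annotator.initialize(effective(defaults, overrides));
        System.out.println(annotator.accepts("ab"));
    }
}
```

Reading the value in initialize rather than in process means the lookup happens once per annotator instance instead of once per document.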
The UIMA framework ensures that an Annotator instance is called by only one thread at a time. An instance never has to worry about running some method on one thread, and then asynchronously being called using another thread. When multiple threading is wanted, for performance, multiple instances of the Annotator are created, each one running on just one thread.
Original article: http://gushuizerotoone.iteye.com/blog/705983