turkish-morphology

A two-level morphological analyzer for Turkish.
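License: Apache-2.0
Development language: Python
Category: Neural networks / artificial intelligence, natural language processing
Software type: Open source
Region: Unknown
Submitted by: 单于承
Operating system: Cross-platform
Open-source organization:
Intended audience: Unknown
Software overview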

Turkish Morphology

This is not an official Google product.

Components

This implementation is composed of three layers:

  • Lexicons:

    This layer includes wide-coverage Turkish lexicons which are manually annotated and validated for part-of-speech and morphophonemic irregularities. They are intended to be used in building Turkish natural language processing tools, such as morphological analyzers. The set of base lexicons that we provide includes annotated lexical items for 47,202 words. The tagsets and the annotation scheme are described in the lexicon annotation guidelines.

  • Morphotactics:

    This layer includes a set of FST definitions implemented in a custom format that is similar to the AT&T FSM format (the only difference being that we can use strings, instead of integers, as state names and as input/output labels for each transition). With each of these FSTs we define the suffixation patterns and the morpheme inventories together with their corresponding output morphological feature category-value pairs for a given part-of-speech. The overall morphotactic model and the morphological feature category-value tagsets are described in the morphotactic model guidelines.

  • Morphophonemics:

    This layer includes a set of Thrax grammars, each of which implements a standalone morphophonemic process (such as vowel harmony, vowel drop, consonant voicing, consonant drop and so on). Composition of the exported FSTs defined in these Thrax grammars yields the morphophonemic model of Turkish.
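To make the string-labelled FST format of the morphotactics layer more concrete, here is a minimal sketch of reading arc lines. It assumes that the four-column arc layout of the AT&T text format (source state, target state, input label, output label) carries over unchanged. The `Arc` type, the `parse_arcs` helper, and the state names and labels in `EXAMPLE` are all hypothetical and are not taken from the repository.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    source: str
    target: str
    input_label: str
    output_label: str


def parse_arcs(text: str) -> list[Arc]:
    """Parses whitespace-separated arc lines with string states and labels."""
    arcs = []
    for line in text.splitlines():
        columns = line.split()
        if not columns:
            continue  # Skip blank lines.
        if len(columns) != 4:
            raise ValueError(f"expected 4 columns, got: {line!r}")
        arcs.append(Arc(*columns))
    return arcs


# Hypothetical arcs: state names and labels are made up for illustration.
EXAMPLE = """
START      NOUN-STEM  <eps>  <eps>
NOUN-STEM  NOUN-CASE  +NDA   +NDA[Case=Loc]
"""

arcs = parse_arcs(EXAMPLE)
print(arcs[1].output_label)
```

Because states and labels are plain strings, no separate symbol table is needed to read such a definition, which is the convenience the custom format is described as adding over integer-based AT&T files.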

The first level of the morphological analysis is implemented by the morphophonemic model, which takes a Turkish word and transforms it into the intermediate representation. The output of the first level is the set of all possible hypotheses of word stem annotations with morphophonemic irregularities, followed by the meta-morphemes that correspond to the suffixes that are realized in the surface form.

Input: affında
Output: af"+SH+NDA

Lexicon entries and morphotactic FST definitions are composed and compiled into a single FST which acts as the second level of the morphological analysis, namely the morphotactic model. The morphotactic model takes the intermediate tape as its input and transforms it into all possible human-readable morphological analyses that can be generated from the hypotheses produced by the first level.

Input: af"+SH+NDA
Output: (af[NN]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=False]

See the Interpreting Human-Readable Morphological Analysis section for a description of such human-readable morphological analyses.
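The intermediate representation passed between the two levels can be taken apart with plain string operations. The sketch below assumes that `+` only ever separates the stem annotation from the meta-morphemes and never occurs inside the stem annotation itself. The `split_intermediate` helper is hypothetical and not part of the project.

```python
def split_intermediate(tape: str) -> tuple[str, list[str]]:
    """Splits an intermediate tape into stem annotation and meta-morphemes."""
    stem, *meta_morphemes = tape.split("+")
    return stem, meta_morphemes


# The intermediate tape from the example above.
stem, meta_morphemes = split_intermediate('af"+SH+NDA')
print(stem, meta_morphemes)
```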

How to Parse Words

To morphologically parse a word, run the following command from the project root directory:

bazel run -c opt scripts:print_analyses -- --word=[WORD_TO_PARSE]

This will morphologically parse the input word against the two-level morphological analyzer and output a set of human-readable morphological analyses, for example:

bazel run -c opt scripts:print_analyses -- --word=geldiğinde
> Morphological analyses for the word 'geldiğinde':
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3pl])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3pl])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3sg])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3sg])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3pl])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3pl])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3sg])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3sg])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=True]

If the input string is not accepted as a Turkish word, the morphological analyzer outputs an empty result.

bazel run -c opt scripts:print_analyses -- --word=foo
> 'foo' is not accepted as a Turkish word
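If you want to call the command above from a Python script, one option is a thin `subprocess` wrapper. This is only a sketch: the `build_command` and `print_analyses` helpers are hypothetical names, the script must be run from the project root directory with Bazel on the `PATH`, and no assumption is made here about the exact layout of the text the tool prints.

```python
import subprocess


def build_command(word: str) -> list[str]:
    """Builds the Bazel invocation shown above for a single word."""
    return ["bazel", "run", "-c", "opt", "scripts:print_analyses", "--",
            f"--word={word}"]


def print_analyses(word: str) -> str:
    """Runs the analyzer and returns whatever it writes to standard output."""
    result = subprocess.run(build_command(word), capture_output=True,
                            text=True, check=True)
    return result.stdout


# Only the command line is printed here; call print_analyses() to run it.
print(" ".join(build_command("geldiğinde")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with non-ASCII input words.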

Interpreting Human-Readable Morphological Analysis

An example human-readable morphological analysis is as follows:

Input Word (evlerindekilerin = those that belong to the ones in their homes):

bazel run -c opt scripts:print_analyses -- --word=evlerindekilerin

Sample Output Morphological Analysis String:

(ev[NN]+[PersonNumber=A3sg]+lArH[Possessive=P3pl]+NDA[Case=Loc])([PRF]-ki[Derivation=Pron]+lAr[PersonNumber=A3pl]+[Possessive=Pnon]+NHn[Case=Gen])+[Proper=False]

Human-readable morphological analyses can be decomposed into parts:

  • Inflectional groups:

    Each human-readable morphological analysis is composed of inflectional groups. An inflectional group is a sub-word span, and it is created by affixation of a derivational morpheme. Inflectional group analyses are enclosed in parentheses. The above example contains two inflectional groups:

    • (ev[NN]+[PersonNumber=A3sg]+lArH[Possessive=P3pl]+NDA[Case=Loc])
    • ([PRF]-ki[Derivation=Pron]+lAr[PersonNumber=A3pl]+[Possessive=Pnon]+NHn[Case=Gen])
  • Word stem:

    The first inflectional group contains the word stem (e.g. ev is the root form for the above example input word evlerindekilerin).

  • Analysis of morphemes:

    Within each inflectional group, meta-morphemes and their corresponding morphological feature category-value tags are separated with either + or - delimiters (e.g. +[PersonNumber=A3sg], +lArH[Possessive=P3pl], -ki[Derivation=Pron], etc.). Strings that immediately follow the delimiters + or - are the meta-morphemes (e.g. NDA is the meta-morpheme in the morpheme analysis +NDA[Case=Loc]). Morphological feature category-value tags are enclosed in brackets right after the meta-morphemes (e.g. Case is the feature category and Loc is the feature value in the morpheme analysis +NDA[Case=Loc]).

  • Part-of-speech:

    The part-of-speech tag of each inflectional group is the first bracketed tag of the inflectional group (e.g. NN is the part-of-speech of the first inflectional group and PRF is that of the second inflectional group).

  • Inflectional vs. Derivational morphemes:

    Meta-morphemes that are separated with the + delimiter do not create a new inflectional group. They are inflectional morphemes (e.g. +[PersonNumber=A3sg], +NDA[Case=Loc], +[Possessive=Pnon], etc.). Meta-morphemes that are separated with the - delimiter create a new inflectional group. They are the derivational morphemes (e.g. -ki[Derivation=Pron]). Therefore, the first meta-morpheme in an inflectional group created by derivation (that is, any group after the first) always follows the delimiter -, never +.

  • Surface realization of inflections:

    Some meta-morphemes are not realized in the surface form. These meta-morphemes do not correspond to a span of characters in the input word, and for them we do not output any meta-morpheme in the morpheme analysis. For example, +lArH[Possessive=P3pl] and +NDA[Case=Loc] are realized in the surface form, so they have the explicit meta-morphemes lArH and NDA in their morpheme analyses. However, +[PersonNumber=A3sg] and +[Possessive=Pnon] are not realized in the surface form, so only morphological feature category-value tags are output for them.

  • Surface realization of derivations:

    Derivational morphemes must always be realized in the surface form. They always correspond to a span of characters in the input word. Therefore, we always output non-empty meta-morphemes in the morpheme analyses of derivational morphemes. This means that no zero-derivations are allowed in the morphotactic model.

  • Proper noun analysis:

    An optional proper noun feature analysis is output at the end of each inflectional group (e.g. +[Proper=False], which follows the second inflectional group). The proper noun feature category can take two values, True or False. If it is specified as True, the inflectional group that it follows is considered to be part of a proper noun. This feature is used to capture the internal structure of proper nouns that are composed of multiple words (e.g. for a multi-word movie name, the true part-of-speech and morphological features of its component words can be annotated, while this feature marks the fact that they are part of a proper noun).

    Proper noun feature analysis is omitted for some of the inflectional groups to keep the representation compact and to minimize the number of morphological analyses generated by the morphological analyzer. In such cases, the proper noun feature analysis of an inflectional group applies to all preceding inflectional groups that do not have one (e.g. the first inflectional group of the above example inherits its proper noun feature analysis Proper=False from the second inflectional group).
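The decomposition rules above can be sketched as a small regular-expression parser. This is an illustrative reading of the notation, not the parser that ships with the project: the `Morpheme` and `InflectionalGroup` types and the `parse_analysis` function are hypothetical names, the patterns assume that meta-morphemes never contain `+`, `-` or brackets, and input that does not match the patterns is not fully validated.

```python
import re
from dataclasses import dataclass
from typing import Optional

# One parenthesized inflectional group, optionally followed by a Proper tag.
_GROUP = re.compile(r"\(([^()]*)\)(?:\+\[Proper=(True|False)\])?")
# Optional stem followed by the part-of-speech tag that opens a group.
_HEAD = re.compile(r"([^\[]*)\[([^\]=]+)\]")
# Delimiter, optional meta-morpheme, then a [Category=Value] tag.
_MORPHEME = re.compile(r"([+-])([^\[\]+-]*)\[([^=\]]+)=([^\]]+)\]")


@dataclass
class Morpheme:
    meta_morpheme: str  # Empty when not realized in the surface form.
    category: str
    value: str
    derivational: bool  # True when introduced by the "-" delimiter.


@dataclass
class InflectionalGroup:
    stem: str  # Non-empty only for the first group.
    pos: str
    morphemes: list[Morpheme]
    proper: Optional[bool] = None


def parse_analysis(human_readable: str) -> list[InflectionalGroup]:
    """Splits a human-readable analysis into inflectional groups."""
    groups = []
    for match in _GROUP.finditer(human_readable):
        body, proper = match.groups()
        head = _HEAD.match(body)
        if head is None:
            raise ValueError(f"no part-of-speech tag in group: {body!r}")
        morphemes = [
            Morpheme(meta, category, value, delimiter == "-")
            for delimiter, meta, category, value
            in _MORPHEME.findall(body[head.end():])
        ]
        groups.append(InflectionalGroup(
            stem=head.group(1), pos=head.group(2), morphemes=morphemes,
            proper=None if proper is None else proper == "True"))
    # A group without a Proper tag inherits it from the nearest later group.
    inherited = None
    for group in reversed(groups):
        if group.proper is None:
            group.proper = inherited
        else:
            inherited = group.proper
    return groups


ANALYSIS = ("(ev[NN]+[PersonNumber=A3sg]+lArH[Possessive=P3pl]+NDA[Case=Loc])"
            "([PRF]-ki[Derivation=Pron]+lAr[PersonNumber=A3pl]"
            "+[Possessive=Pnon]+NHn[Case=Gen])+[Proper=False]")

for group in parse_analysis(ANALYSIS):
    realized = [m.meta_morpheme for m in group.morphemes if m.meta_morpheme]
    print(group.pos, group.stem or "-", realized, group.proper)
```

The backward pass at the end of `parse_analysis` implements the inheritance rule: a group with no explicit Proper tag takes the value of the closest following group that has one.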

Python API

We also provide a Python API that can be used to morphologically analyze Turkish words, generate Turkish word forms from morphological analyses, parse human-readable morphological analyses into protobuf messages, validate their structural well-formedness, and generate human-readable analyses from them. You can see some example use cases in //examples.

If you are using Bazel, you can depend on this repository as an external dependency of your project by adding the following to your WORKSPACE file:

git_repository(
  name = "google_research_turkish_morphology",
  remote = "https://github.com/google-research/turkish-morphology.git",
  tag = "{version-tag}",
)

Then, you can simply use @google_research_turkish_morphology//turkish_morphology:analyze (or other modules of the API) as a dependency of your relevant py_library or py_binary BUILD targets.

The API is also available on PyPI. To install the latest release from PyPI, run:

python3 -m pip install turkish-morphology

To install from source, run the following commands from the project root directory (preferably within a Python virtual environment):

bazel build //...
bazel-bin/setup install

Requirements

To build and run the morphological analyzer, install Bazel version 4.1.0 and Python 3.9. All other intrinsic dependencies will be imported, built and taken care of by Bazel according to the WORKSPACE setup during the first invocation of the morphological analyzer runtime. If you are installing from PyPI, you need pip.

Citing

If you use or discuss the code, data or tools from this repository in your work, please cite:

Öztürel, A., Kayadelen, T. & Demirşahin, I. (2019, September). A syntactically expressive morphological analyzer for Turkish. In Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing (pp. 65-75).

@inproceedings{
    title = "A Syntactically Expressive Morphological Analyzer for Turkish",
    author = "\"{O}zt\"{u}rel, Adnan and Kayadelen, Tolga and Demir\c{s}ahin,
        I\c{s}{\i}n",
    booktitle = "Proceedings of the 14th International Conference on Finite-State
        Methods and Natural Language Processing",
    month = "23--25" # sep,
    year = "2019",
    address = "Dresden, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-3110",
    pages = "65--75",
}

License

Unless otherwise noted, all original files are licensed under an Apache License, Version 2.0.
