{
    "componentChunkName": "component---src-templates-blog-post-jsx",
    "path": "/post/goya-yet-another-japanese-morphological-analyzer/",
    "result": {"data":{"site":{"siteMetadata":{"title":"WEB EGG","author":"Leko - CTO at Yuimedi"}},"markdownRemark":{"id":"074908a1-15cd-506d-bbd5-04996dafc30f","excerpt":"Goyaという形態素解析器を Rust で作りました。本記事は利用者目線で Goya の紹介をします。技術的な詳細については別途記事を書きます。 形態素解析とは？ （このセクションは形態素解析の基礎の話なので知ってる方は読み飛ばしてください） 形態素解析（けいたいそかいせき、Morphological Analysis…","html":"<p><a href=\"https://github.com/Leko/goya\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Goya</a>という形態素解析器を Rust で作りました。本記事は利用者目線で Goya の紹介をします。技術的な詳細については別途記事を書きます。</p>\n<h2 id=\"形態素解析とは\" style=\"position:relative;\"><a href=\"#%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90%E3%81%A8%E3%81%AF\" aria-label=\"形態素解析とは permalink\" class=\"autolink-header before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>形態素解析とは？</h2>\n<p>（このセクションは形態素解析の基礎の話なので知ってる方は読み飛ばしてください）</p>\n<blockquote>\n<p>形態素解析（けいたいそかいせき、Morphological Analysis）とは、文法的な情報の注記の無い自然言語のテキストデータ（文）から、対象言語の文法や、辞書と呼ばれる単語の品詞等の情報にもとづき、形態素（Morpheme, おおまかにいえば、言語で意味を持つ最小単位）の列に分割し、それぞれの形態素の品詞等を判別する作業である。</p>\n<p>— <a href=\"https://ja.wikipedia.org/wiki/%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">形態素解析 - Wikipedia</a></p>\n</blockquote>\n<p>例えば早口言葉の”すもももももももものうち”（スモモも桃も桃のうち）という言葉を形態素解析すると以下のような結果が得られます。スモモや桃が名詞、間にある”も・の”は助詞と解析されました。</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code 
class=\"language-text\">すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\nも\t助詞,係助詞,*,*,*,*,も,モ,モ\nもも\t名詞,一般,*,*,*,*,もも,モモ,モモ\nも\t助詞,係助詞,*,*,*,*,も,モ,モ\nもも\t名詞,一般,*,*,*,*,もも,モモ,モモ\nの\t助詞,連体化,*,*,*,*,の,ノ,ノ\nうち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ</code></pre></div>\n<p>この解析結果は日本語として正しいのかについては言語学の専門家に委ねるとして、技術的に重要なことは<strong>単語の境界を機械的に判定できる</strong>ことです。文章を形態素に分解することで全文検索用のインデックスを生成したり、品詞解析や構文解析・係受け解析、キーワード抽出や文章要約など様々な自然言語処理が適用可能になります。よく知られた形態素解析ライブラリとしては<a href=\"https://github.com/taku910/mecab\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">MeCab</a>や<a href=\"https://chasen-legacy.osdn.jp/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">ChaSen</a>、<a href=\"https://github.com/ku-nlp/jumanpp\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Juman++</a>、<a href=\"https://www.atilika.com/ja/kuromoji/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">kuromoji</a>、<a href=\"https://github.com/takuyaa/kuromoji.js/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">kuromoji.js</a>などが挙げられます。</p>\n<h2 id=\"goya-とは\" style=\"position:relative;\"><a href=\"#goya-%E3%81%A8%E3%81%AF\" aria-label=\"goya とは permalink\" class=\"autolink-header before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Goya とは？</h2>\n<p>Goya は Rust で実装された形態素解析ライブラリです。形態素解析ライブラリの大御所<a href=\"https://github.com/taku910/mecab\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">MeCab</a> から実装のアイデアを多く頂いています。</p>\n<ul>\n<li><strong>WebAssembly(WASM) でブラウザや Node.js でも動作</strong></li>\n<li>WASI がある言語ならなんでも動作する可能性</li>\n<li>MeCab の IPA 
辞書を解析に使用</li>\n<li>未知語を含む解析も可能</li>\n</ul>\n<p>WASM 版のオンラインデモはこちらです。</p>\n<p><a href=\"https://goya.pages.dev\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">https://goya.pages.dev</a></p>\n<p>（CDN でレスポンス時に動的に Brotli 圧縮をかけてるため初回の読み込みが遅いことがありますが、２回目以降の読み込み・解析は比較的高速です）</p>\n<h2 id=\"cli-を試す\" style=\"position:relative;\"><a href=\"#cli-%E3%82%92%E8%A9%A6%E3%81%99\" aria-label=\"cli を試す permalink\" class=\"autolink-header before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>CLI を試す</h2>\n<p>CLI からも Goya を利用できます。</p>\n<ol>\n<li><a href=\"https://taku910.github.io/mecab/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">MeCab の公式サイト</a>から IPA 辞書をダウンロードし、解凍します</li>\n<li>cargo 経由で CLI をインストールします\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">cargo install goya-cli</code></pre></div>\n</li>\n<li><code>goya compile</code>コマンドで解析に使用する辞書をコンパイルします。環境によりますが 1-2 分かかります\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">goya compile /path/to/mecab/ipadic</code></pre></div>\n</li>\n<li>goya コマンドに標準入力からテキストを与えると形態素解析の結果が出力されます\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">$ echo すもももももももものうち | 
goya\nすもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\nも\t助詞,係助詞,*,*,*,*,も,モ,モ\nもも\t名詞,一般,*,*,*,*,もも,モモ,モモ\nも\t助詞,係助詞,*,*,*,*,も,モ,モ\nもも\t名詞,一般,*,*,*,*,もも,モモ,モモ\nの\t助詞,連体化,*,*,*,*,の,ノ,ノ\nうち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ\nEOS</code></pre></div>\n</li>\n</ol>\n<p>複数行のテキストを一度に与えることもできます。改行区切りでそれぞれの行を処理します。goya CLI は現状プロセス起動時のバイナリ辞書を読み込むオーバーヘッドが大きいため、１プロセスに複数のテキストをまとめて解析させる方が効率的です。</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">$ cat in.txt\nれこと申します\n東京特許許可局\n\n$ goya &lt; in.txt\nれこ    名詞,一般,*,*,*,*,れこ,レコ,レコ\nと      助詞,格助詞,引用,*,*,*,と,ト,ト\n申し    動詞,自立,*,*,五段・サ行,連用形,申す,モウシ,モーシ\nます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス\nEOS\n東京    名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー\n特許    名詞,サ変接続,*,*,*,*,特許,トッキョ,トッキョ\n許可    名詞,サ変接続,*,*,*,*,許可,キョカ,キョカ\n局      名詞,接尾,一般,*,*,*,局,キョク,キョク\nEOS</code></pre></div>\n<h2 id=\"用語素性feature\" style=\"position:relative;\"><a href=\"#%E7%94%A8%E8%AA%9E%E7%B4%A0%E6%80%A7feature\" aria-label=\"用語素性feature permalink\" class=\"autolink-header before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>用語：素性（feature）</h2>\n<p>以降の説明に”素性”という用語がたびたび登場します。英語では feature と言います。混乱を避けるために明示しますが、<strong>ここでいう feature は feature request などの feature（機能）や、Rust でコンパイル内容を制御する feature でもなく、言語学の用語です。</strong> 一言で説明するなら「形態素解析の動作には必要ない形態素ごとのメタ情報」です。具体例として形態素解析の結果の一行を抜粋します。</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ</code></pre></div>\n<blockquote>\n<p>左から,\n<code>表層形\\t品詞,品詞細分類 1,品詞細分類 
2,品詞細分類 3,活用型,活用形,原形,読み,発音</code>\nとなっています。</p>\n<p>— <a href=\"https://taku910.github.io/mecab/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">MeCab: Yet Another Part-of-Speech and Morphological Analyzer</a></p>\n</blockquote>\n<p>“表層形”とは見出し語のことだと解釈して問題ありません。この結果のうち、表層形を除いたその他全ての情報を素性（feature）と呼びます。 MeCab の IPA 辞書にはデフォルトで上記の素性が定義されていますが、仕様としてはユーザが任意個のフィールドを独自に定義可能な任意項目で、全項目が省略可能です。</p>\n<h2 id=\"プログラムからの利用\" style=\"position:relative;\"><a href=\"#%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0%E3%81%8B%E3%82%89%E3%81%AE%E5%88%A9%E7%94%A8\" aria-label=\"プログラムからの利用 permalink\" class=\"autolink-header before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>プログラムからの利用</h2>\n<p>CLI 以外では Goya は以下のユースケースを想定しています。</p>\n<ul>\n<li>WebAssembly（npm パッケージ）としての利用\n<ul>\n<li><code>goya-core</code>: 形態素解析のコア。分かち書きなどの素性が不要なタスクならこれ単独でも使える</li>\n<li><code>goya-features</code>: 解析結果から品詞や読み仮名などの素性（feature）を得たいときに使用</li>\n</ul>\n</li>\n<li>Rust の crate としての利用\n<ul>\n<li><a href=\"https://crates.io/crates/goya\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">crates.io で公開しています</a>がまだ API が安定してないので紹介しません。CLI のソースコードを参照</li>\n</ul>\n</li>\n</ul>\n<p><code>goya-core</code>と<code>goya-features</code>が分かれている理由は WASM のサイズ削減のためです。素性は IPA 辞書に登録された数十万件の語彙のメタデータなのでかなりデータ量が大きいです。分かち書きなどの素性を必要としないユースケースでは core だけ使用し、品詞などの素性が必要なユースケースでは goya-features を併用する想定です。</p>\n<h2 id=\"wasm-での利用\" style=\"position:relative;\"><a href=\"#wasm-%E3%81%A7%E3%81%AE%E5%88%A9%E7%94%A8\" aria-label=\"wasm での利用 permalink\" class=\"autolink-header before\"><svg 
aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>WASM での利用</h2>\n<p>Node.js なら普通の npm パッケージのように使えます。ブラウザでは ES Modules か何かしらの bundler を使用することになると思います。<code>.d.ts</code>をパッケージに含めているため TS の型も効きます。</p>\n<p>詳しいインストール方法やその他サンプルコードは<a href=\"https://github.com/Leko/goya\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">リポジトリ</a>を参照してください。</p>\n<h3 id=\"分かち書き\" style=\"position:relative;\"><a href=\"#%E5%88%86%E3%81%8B%E3%81%A1%E6%9B%B8%E3%81%8D\" aria-label=\"分かち書き permalink\" class=\"autolink-header before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>分かち書き</h3>\n<p>goya-core を import して <code>parse</code> 関数を使用します。parse メソッドの戻り値から各種メソッドを呼べるようにしています。\n分かち書きをするなら<code>wakachi</code>メソッドを使用します。</p>\n<div class=\"gatsby-highlight\" data-language=\"ts\"><pre class=\"language-ts\"><code class=\"language-ts\"><span class=\"token keyword\">import</span> core <span class=\"token keyword\">from</span> <span class=\"token string\">'goya-core'</span>\n\n<span class=\"token keyword\">const</span> lattice <span class=\"token operator\">=</span> core<span class=\"token punctuation\">.</span><span 
class=\"token function\">parse</span><span class=\"token punctuation\">(</span><span class=\"token string\">'すもももももももものうち'</span><span class=\"token punctuation\">)</span>\nlattice<span class=\"token punctuation\">.</span><span class=\"token function\">wakachi</span><span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span> <span class=\"token comment\">// => [\"すもも\", \"も\", \"もも\", \"も\", \"もも\", \"の\", \"うち\"]</span></code></pre></div>\n<h3 id=\"形態素解析\" style=\"position:relative;\"><a href=\"#%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90\" aria-label=\"形態素解析 permalink\" class=\"autolink-header before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>形態素解析</h3>\n<p>形態素解析の結果を得るには<code>find_best</code>メソッドを使用します。find_best は形態素の配列を返します。各形態素は以下のフィールドを持っています。サイズ削減のためこのオブジェクトは品詞や読み仮名などの素性を持っていません。</p>\n<ul>\n<li>wid: 語彙 ID。goya-features で使用（後述）</li>\n<li>is_known: 既知語なら true、未知語なら false</li>\n<li>surface_form: 表層形</li>\n</ul>\n<div class=\"gatsby-highlight\" data-language=\"ts\"><pre class=\"language-ts\"><code class=\"language-ts\">lattice<span class=\"token punctuation\">.</span><span class=\"token function\">find_best</span><span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">[</span><span class=\"token number\">0</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>surface_form <span class=\"token comment\">// => \"すもも\"</span>\nlattice<span class=\"token punctuation\">.</span><span class=\"token 
function\">find_best</span><span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">[</span><span class=\"token number\">0</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>is_known <span class=\"token comment\">// => true</span>\nlattice<span class=\"token punctuation\">.</span><span class=\"token function\">find_best</span><span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">[</span><span class=\"token number\">0</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>wid <span class=\"token comment\">// => 次項で説明</span></code></pre></div>\n<h3 id=\"素性featuresの取得\" style=\"position:relative;\"><a href=\"#%E7%B4%A0%E6%80%A7features%E3%81%AE%E5%8F%96%E5%BE%97\" aria-label=\"素性featuresの取得 permalink\" class=\"autolink-header before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>素性（features）の取得</h3>\n<p>品詞や読み仮名などの素性を得るには<code>goya-features</code>パッケージの<code>get_features</code>関数を利用します。各形態素が持つ<code>wid</code>の配列を渡し対応する素性の配列を得ます。<br>\n戻り値は渡した<code>wid</code>ごとに素性（<code>string[]</code>）の配列（つまり<code>string[][]</code>）となります。素性の各要素は MeCab IPA 辞書を何も改変せず使った場合、その通りの順序（<code>品詞,品詞細分類 1,品詞細分類 2,品詞細分類 3,活用型,活用形,原形,読み,発音</code>）になっています。辞書のカスタマイズや容量削減のため不要な素性を削るケースを考慮しているため、あえてプロパティ名を付けず辞書の CSV 通りの順序をそのまま返しています。特定の品詞を取りたいケースでは、お使いの辞書に合わせて添字を定数化しておくと多少なり可読性が増すと思います。ただし、辞書はカスタマイズ可能であり添字は可変のためこの定数は goya としては提供できません。</p>\n<div 
class=\"gatsby-highlight\" data-language=\"ts\"><pre class=\"language-ts\"><code class=\"language-ts\"><span class=\"token keyword\">import</span> <span class=\"token punctuation\">{</span> get_features <span class=\"token punctuation\">}</span> <span class=\"token keyword\">from</span> <span class=\"token string\">'goya-features'</span>\n\n<span class=\"token comment\">// MeCab IPA辞書のデフォルトでは品詞(Part of Speech)は添字0</span>\n<span class=\"token keyword\">const</span> <span class=\"token constant\">INDEX_POS</span> <span class=\"token operator\">=</span> <span class=\"token number\">0</span>\n\n<span class=\"token keyword\">const</span> morphemes <span class=\"token operator\">=</span> lattice<span class=\"token punctuation\">.</span><span class=\"token function\">find_best</span><span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"token comment\">// widの配列から素性の配列を得る</span>\n<span class=\"token keyword\">const</span> features <span class=\"token operator\">=</span> <span class=\"token function\">get_features</span><span class=\"token punctuation\">(</span>morphemes<span class=\"token punctuation\">.</span><span class=\"token function\">map</span><span class=\"token punctuation\">(</span>morph <span class=\"token operator\">=></span> morph<span class=\"token punctuation\">.</span>wid<span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n<span class=\"token comment\">// 1要素ずつ取得してもいいが、まとめて取得する方がオーバーヘッドが少なく高速</span>\n<span class=\"token function\">get_features</span><span class=\"token punctuation\">(</span><span class=\"token punctuation\">[</span>morphemes<span class=\"token punctuation\">[</span><span class=\"token number\">0</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">.</span>wid<span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span>\n\nmorphemes<span class=\"token punctuation\">.</span><span class=\"token 
function\">forEach</span><span class=\"token punctuation\">(</span><span class=\"token punctuation\">(</span><span class=\"token punctuation\">{</span> surface_form <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span> i<span class=\"token punctuation\">)</span> <span class=\"token operator\">=></span> <span class=\"token punctuation\">{</span>\n  <span class=\"token keyword\">const</span> feature <span class=\"token operator\">=</span> features<span class=\"token punctuation\">[</span>i<span class=\"token punctuation\">]</span> <span class=\"token comment\">// 渡したwid通りの順序で素性が得られる</span>\n  <span class=\"token keyword\">const</span> line <span class=\"token operator\">=</span> surface_form <span class=\"token operator\">+</span> <span class=\"token string\">'\\t'</span> <span class=\"token operator\">+</span> feature<span class=\"token punctuation\">.</span><span class=\"token function\">join</span><span class=\"token punctuation\">(</span><span class=\"token string\">','</span><span class=\"token punctuation\">)</span>\n  <span class=\"token builtin\">console</span><span class=\"token punctuation\">.</span><span class=\"token function\">log</span><span class=\"token punctuation\">(</span>line<span class=\"token punctuation\">)</span> <span class=\"token comment\">// => \"すもも\\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\"</span>\n  <span class=\"token builtin\">console</span><span class=\"token punctuation\">.</span><span class=\"token function\">log</span><span class=\"token punctuation\">(</span>feature<span class=\"token punctuation\">[</span><span class=\"token constant\">INDEX_POS</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span> <span class=\"token comment\">// => \"名詞\"</span>\n<span class=\"token punctuation\">}</span><span class=\"token punctuation\">)</span></code></pre></div>\n<h2 id=\"実行速度\" style=\"position:relative;\"><a href=\"#%E5%AE%9F%E8%A1%8C%E9%80%9F%E5%BA%A6\" aria-label=\"実行速度 permalink\" 
class=\"autolink-header before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>実行速度</h2>\n<p>最後に実行速度の比較です。動作確認に使用したマシンは以下の通りです。<br>\n実行環境によってパフォーマンスは変わると思うので、ご自身の環境でも試してもらえればと思います。</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">MacBook Pro (13-inch, 2020, Four Thunderbolt 3 ports)\n2.3 GHz Quad-Core Intel Core i7\n32 GB 3733 MHz LPDDR4X\nIntel Iris Plus Graphics 1536 MB</code></pre></div>\n<h3 id=\"cli\" style=\"position:relative;\"><a href=\"#cli\" aria-label=\"cli permalink\" class=\"autolink-header before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>CLI</h3>\n<p>まず Goya CLI と MeCab CLI の速度を比較します。<a href=\"https://github.com/mmorise/ita-corpus/tree/fece1d56bacb942b9c30ef01179243847cd65fbc\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">ITA コーパスの文章リスト公開用リポジトリ</a>にて掲載されている 424 文を整形してテキストファイルに書き出し、１プロセスで 424 文全て解析した時の実行時間を比較してみました。</p>\n<p>MeCab は 25ms くらいです。</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">time mecab &lt; 
ita-corpus.txt > /dev/null\n\nreal    0m0.024s\nuser    0m0.014s\nsys     0m0.007s</code></pre></div>\n<p>Goya は 165ms くらいでした。遅い。</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">time goya &lt; ita-corpus.txt > /dev/null\n\nreal    0m0.165s\nuser    0m0.104s\nsys     0m0.064s</code></pre></div>\n<p>Goya が遅い主な原因はプロセス起動時のバイナリ辞書の読み込みです。辞書を全てメモリ上に展開する処理が初期化にて発生するため、空のテキストファイル（初期化だけして何もせず終了）でも 140ms ほどかかっています。特に素性はデータ量がかなり大きいのでこれの復元が遅いです。</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">touch empty.txt\ntime goya &lt; empty.txt\n\nreal    0m0.140s\nuser    0m0.075s\nsys     0m0.063s</code></pre></div>\n<p>例えば Rust の軽量 KVS の<a href=\"https://github.com/spacejam/sled\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">sled</a>などを用いて、辞書をメモリ上に復元しないアプローチで初期化コストを削れば MeCab に近いパフォーマンスが出せそうです。ただ、sled は WASM で動作しないので、あくまで CLI や Rust での使用に限った改善案ですが。</p>\n<h3 id=\"nodejs-wasm\" style=\"position:relative;\"><a href=\"#nodejs-wasm\" aria-label=\"nodejs wasm permalink\" class=\"autolink-header before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Node.js (WASM)</h3>\n<p>次に Node.js での速度比較です。ベンチマークとして kuromoji.js と速度を比較します。まずはプロセスの起動から終了までを含めたプロセス全体の速度の比較です。測定に使うテキストは CLI と同じです。ベンチマークのコードは<a href=\"https://github.com/Leko/goya/blob/main/benchmarks\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">リポジトリにあげてるのでそちらを参照</a>、検証に使用した Node.js のバージョンは <code>v16.11.1</code> です。</p>\n<div 
class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">$ time node goya.js &lt; ita-corpus.txt\n$ time node kuromoji.js &lt; ita-corpus.txt</code></pre></div>\n<ul>\n<li>goya:\n<ul>\n<li>time: 609ms (SD: 50ms)</li>\n<li>memory: 203 MiB (SD: 1 MiB)</li>\n</ul>\n</li>\n<li>kuromoji.js:\n<ul>\n<li>time: 714ms (SD: 63ms)</li>\n<li>memory: 402 MiB (SD: 6 MiB)</li>\n</ul>\n</li>\n</ul>\n<p>この条件なら Goya の方が高速で、メモリ使用量も kuromoji.js と比較して 50％程度に抑えられています。どちらもバイナリの辞書をランタイムで復元するアプローチで、かつ MeCab の IPA 辞書をベースにしています。ただし Goya ではデータを損なわない範囲でバイナリ辞書の構造を最適化しており、バイナリサイズをかなり小さくできています（未圧縮時で Goya 36 MB、kuromoji.js 95 MB）。これが初期化コスト及びメモリ使用量に効いていると思います。</p>\n<p>kuromoji.js の作者の方が MAST というアルゴリズムの可能性について言及しており、これを実装すれば初期化のコストをさらに大きく削れるかもしれません。</p>\n<blockquote>\n<p>現在は Double-Array Trie というトライ木の一種を使っていますが、Minimal Acyclic Subsequential Transducer という FST の一種を使うことで、サイズを 1/10 くらいにできるという報告を聞いています。FST の実装については、Go で FST を書いた @ikawaha さんのエントリが参考になります。実装手法も面白いので、ぜひ fst.js を実装してみたいと思っています。<br>\n<a href=\"https://qiita.com/ikawaha/items/be95304a803020e1b2d1\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Lucene で使われてる FST を実装してみた（正規表現マッチ：VM アプローチへの招待） - Qiita</a></p>\n<p>— <a href=\"http://stp-the-wld.blogspot.com/2015/01/javascriptkuromojijs.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">stop-the-world: ブラウザで自然言語処理 - JavaScript の形態素解析器 kuromoji.js を作った</a></p>\n</blockquote>\n<p>次に初期化コストを無視して形態素解析だけの速度で比較してみます。bench.js も<a href=\"https://github.com/Leko/goya/blob/main/benchmarks/bench.js\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">同リポジトリにあるのでそちらを参照</a>してください。</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">$ node bench.js &lt; ita-corpus.txt\ngoya x 0.80 ops/sec ±11.92% (6 runs sampled)\nkuromoji x 21.37 ops/sec ±3.45% (39 runs sampled)\nFastest is kuromoji</code></pre></div>\n<p>Goya の惨敗です。Goya も 424 文 x 0.80 = 339 文/秒 くらいパースできていますが、kuromoji.js に 
20 倍以上差をつけられています。形態素解析の速度で見ると kuromoji.js の方が圧倒的に早いです。これは単純に形態素解析アルゴリズムの良し悪しの差なので、形態素解析だけでみても kuromoji.js に負けないよう改良していきたいです。</p>\n<h2 id=\"おわりに\" style=\"position:relative;\"><a href=\"#%E3%81%8A%E3%82%8F%E3%82%8A%E3%81%AB\" aria-label=\"おわりに permalink\" class=\"autolink-header before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>おわりに</h2>\n<p>以上、Goya の紹介でした。最後にリンクを再掲して終わります。Rust の方も JS の方も WASM の方も NLPer の方も試していただいて何かあれば GitHub issue などでフィードバックいただけたら幸いです。</p>\n<ul>\n<li><a href=\"https://github.com/Leko/goya\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">GitHub リポジトリ</a></li>\n<li><a href=\"https://goya.pages.dev\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">オンラインデモ</a></li>\n</ul>","timeToRead":14,"frontmatter":{"title":"WebAssemblyの形態素解析器GoyaをRustで作った","tags":["JavaScript","TypeScript","WebAssembly","Rust","NLP","形態素解析"],"date":"November 27, 2021","featuredImage":{"childImageSharp":{"fluid":{"tracedSVG":"data:image/svg+xml,%3csvg%20xmlns='http://www.w3.org/2000/svg'%20width='400'%20height='236'%20viewBox='0%200%20400%20236'%20preserveAspectRatio='none'%3e%3cpath%20d='M14%20189l34%201%2036-1-35-1-35%201'%20fill='%23d3d3d3'%20fill-rule='evenodd'/%3e%3c/svg%3e","aspectRatio":1.6954314720812182,"src":"/static/2f5360409c74a20d0fb513eb75283c58/3a46b/2021-10-23-13-31-45.png","srcSet":"/static/2f5360409c74a20d0fb513eb75283c58/1ec58/2021-10-23-13-31-45.png 334w,\n/static/2f5360409c74a20d0fb513eb75283c58/3a46b/2021-10-23-13-31-45.png 
593w","srcWebp":"/static/2f5360409c74a20d0fb513eb75283c58/74aa6/2021-10-23-13-31-45.webp","srcSetWebp":"/static/2f5360409c74a20d0fb513eb75283c58/cd98f/2021-10-23-13-31-45.webp 334w,\n/static/2f5360409c74a20d0fb513eb75283c58/74aa6/2021-10-23-13-31-45.webp 593w","sizes":"(max-width: 593px) 100vw, 593px"}}}}}},"pageContext":{"slug":"/goya-yet-another-japanese-morphological-analyzer/","previous":{"fields":{"slug":"/what-gadgets-i-bought-in-2020/"},"frontmatter":{"title":"2020年に買って良かったガジェット・家電・備品たち","tags":null}},"next":{"fields":{"slug":"/2021-javascript-typescript-trending-history/"},"frontmatter":{"title":"GitHubのトレンドで振り返る2021年のJavaScript/TypeScript","tags":["JavaScript","TypeScript","GitHub"]}}}},
    "staticQueryHashes": ["2585454260","2954598359"]}