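As the saying goes, even the cleverest cook cannot make a meal without rice, and NLP tasks likewise need good data to build on. In practice, the lack of data shows up in two scenarios:

No data at all
A large amount of unlabeled, dirty data, with very few labels or none

I plan to discuss this problem over several blog posts. This post covers the scenario with no data at all and introduces Chatito as a way to generate it.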

Introduction to Chatito

Chatito uses a simple, easy-to-learn DSL to generate data for several kinds of NLP tasks. In the project's own words:

Generate datasets for AI chatbots, NLP tasks, named entity recognition or text classification models using a simple DSL!

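In my own tests it is indeed convenient for generating a fair amount of data. However, the training and test sets come from the same templates (composition rules), so they share the same source and the same structure, which easily leads to severe overfitting. The typical symptom is that accuracy, F1 and similar metrics approach 1 on the test set while generalization to unseen data is poor.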
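One rough way to see how much the two sets share is to replace every entity span with a placeholder and count how many test sentences reduce to a skeleton that also occurs in training. The sketch below is a diagnostic of my own, not part of Chatito; the example records and their field names (text, entities, start, end, entity) are assumptions chosen for illustration.

```python
def skeleton(example):
    """Replace each entity span with an @[entity] placeholder."""
    text = example["text"]
    # Work right to left so earlier offsets stay valid after each replacement.
    spans = sorted(example.get("entities", []), key=lambda e: e["start"], reverse=True)
    for ent in spans:
        text = text[:ent["start"]] + "@[" + ent["entity"] + "]" + text[ent["end"]:]
    return text


def skeleton_overlap(train, test):
    """Fraction of test examples whose skeleton also occurs in training."""
    if not test:
        return 0.0
    seen = {skeleton(ex) for ex in train}
    hits = sum(1 for ex in test if skeleton(ex) in seen)
    return hits / len(test)


# Hand-written records for illustration, not real generator output.
train = [
    {"text": "你好，这是小红", "entities": [{"start": 5, "end": 7, "entity": "username"}]},
    {"text": "他叫小明", "entities": [{"start": 2, "end": 4, "entity": "username"}]},
]
test = [
    {"text": "你好，这是大黄", "entities": [{"start": 5, "end": 7, "entity": "username"}]},
    {"text": "我叫小花", "entities": [{"start": 2, "end": 4, "entity": "username"}]},
]

print(skeleton_overlap(train, test))
```

A value close to 1 suggests that the test set mostly re-checks patterns the model has already seen, so high test scores say little about generalization.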

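Clearly, generating data this way is not ideal. But when there really is no data, how can the same-source problem be mitigated? Two ideas come to mind:

During generation, write only a few typical patterns, use dictionaries for items of the same type, and build dictionary features in the downstream task
After generation, apply data augmentation (synonym replacement, position swapping, etc.) to increase the diversity of the training data

This post does not cover these two approaches in detail; they will be written up separately with concrete examples. All examples, including the ones in this post, are in the repo DataGeneratorForNLP.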
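To give a feel for the second idea, here is a minimal sketch of synonym replacement. The synonym dictionary, the function name and the prob parameter are all hypothetical; a real setup would use a proper synonym resource and would have to recompute entity offsets after each replacement.

```python
import random

# Hypothetical synonym dictionary, for illustration only.
SYNONYMS = {
    "介绍": ["引荐", "推荐"],
    "朋友": ["伙伴", "同事"],
}


def synonym_replace(text, synonyms, rng, prob=0.5):
    """Replace each known word with a random synonym with probability `prob`."""
    for word, options in synonyms.items():
        if word in text and rng.random() < prob:
            text = text.replace(word, rng.choice(options))
    return text


rng = random.Random(0)  # fixed seed so runs are repeatable
print(synonym_replace("给你介绍一个新朋友", SYNONYMS, rng, prob=1.0))
```

Applying this to each generated sentence, and keeping both the original and the variant, adds surface variety that the templates alone do not provide.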

Preparation

Installing node.js

First install node.js >= v8.11:

Download the prebuilt package from the official website
Extract it
Create symlinks

ln -s /usr/software/nodejs/bin/npm   /usr/local/bin/ 
ln -s /usr/software/nodejs/bin/node   /usr/local/bin/

On macOS, simply install it with Homebrew. The node formula already bundles npm, so a separate npm install is not needed:

brew install node

Configuring npm

npm config set registry https://registry.npm.taobao.org --global
npm config set disturl https://npm.taobao.org/dist --global
# Update npm itself
npm install -g npm

Installing the chatito npm package

npm i chatito --save

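Writing the generation scripts

I keep one type of data per script: in a classification problem with several classes, for example, it is best to use one generation file per class. It therefore makes sense to create a dedicated folder, such as chatito, to hold all the scripts.

The following uses the scenario of introducing a new friend in a conversation as an example to show how a script is written. For the complete syntax, see the DSL specification.

Create a file ending in .chatito, named intro_new_user.chatito, with the following content: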

import ./common.chatito

%[intro_new_user]('training': '100', 'testing': '50')
    *[60%] ~[hi?]，~[pre1?]~[pre2]~[pre3]，~[indicate?]@[username]
    *[20%] ~[hi?]，~[indicate]@[username]
    *[20%] ~[indicate]@[username]

~[pre1]
    给你

~[pre2]
    介绍
    认识
    了解

~[pre3]
    一个新朋友
    一位新朋友
    个新朋友
    位新朋友
    一个朋友
    一位朋友
    个朋友
    位朋友
    一下
    下

~[indicate]
    这是
    他是
    她是
    他叫
    她叫
    我是
    我叫

@[username]
    小红
    小花
    大黄
    小明

Here common.chatito is a separate script that provides shared building blocks. Its content is:

~[hi]
    你好
    嗨
    嘿
    哈喽
    hi
    hello

~[please]
    请

~[thanks]
    谢谢
    谢了
    thx
    谢谢你

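After the import, hi for example can be referenced directly as ~[hi].

Because Chatito was originally designed to generate data for dialogue systems, its scripts have three concepts: intents (%[intent_name]), slots (@[slot_name]) and aliases (~[alias_name]). An intent can be viewed as a class in a classification problem, and a slot as an entity in an NER problem. An alias exists only to make composition easier: one of its values is picked at random, somewhat like the contents of square brackets in a regular expression (e.g. [a-zA-Z]). Note that alias values are not treated as entities.

In the example above, %[intro_new_user]('training': '100', 'testing': '50') declares that the intent to generate is intro_new_user, with 100 samples for the training set and 50 for the test set.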

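Now consider the next line, *[60%] ~[hi?]，~[pre1?]~[pre2]~[pre3]，~[indicate?]@[username]:

*[60%]: the share of the final generated data that this rule should account for
~[hi?]: picks one value of the alias hi at random (for example 你好); the ? means the alias may be omitted, similar to the same symbol in regular expressions
@[username]: picks one value of the slot username at random; in the generated data the chosen value is marked as an entity with position information, so it can be used for entity recognition

One thing to note: any spaces between the parts also appear in the generated output. Generation only substitutes placeholders such as ~[hi?]; anything between ~[hi?] and the following ~[pre1?], such as the comma here, is kept verbatim. So when word-segmentation errors matter, it is better not to separate the parts with spaces and to keep the natural, unspaced flow of Chinese text.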
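The behaviour described above can be imitated in a few lines of Python. This is a toy simulation under my own assumptions, not Chatito's actual implementation: the rule representation, the 50% chance of dropping an optional alias and the entity field names are all made up for illustration.

```python
import random

ALIASES = {
    "hi": ["你好", "嗨"],
    "indicate": ["这是", "他叫"],
}
SLOTS = {"username": ["小红", "小明"]}

# One rule as a list of parts: a literal string, or (kind, name, optional).
RULE = [
    ("alias", "hi", True),
    "，",
    ("alias", "indicate", False),
    ("slot", "username", False),
]


def expand(rule, rng):
    """Expand one rule into (text, entities), recording slot positions."""
    text, entities = "", []
    for part in rule:
        if isinstance(part, str):
            text += part  # literal text is copied verbatim
            continue
        kind, name, optional = part
        if optional and rng.random() < 0.5:  # assumed skip probability
            continue
        if kind == "alias":
            text += rng.choice(ALIASES[name])
        else:
            value = rng.choice(SLOTS[name])
            entities.append({
                "start": len(text),
                "end": len(text) + len(value),
                "entity": name,
                "value": value,
            })
            text += value
    return text, entities


print(expand(RULE, random.Random(0)))
```

In this simulation the comma is kept even when the optional ~[hi?] is skipped, which is why literal text between placeholders deserves the same care as spaces.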

These basics are enough for most needs; for other features, explore the DSL specification yourself.

Generating the data

The command syntax is:

npx chatito <pathToFileOrDirectory> --format=<format> --formatOptions=<formatOptions> --outputPath=<outputPath> --trainingFileName=<trainingFileName> --testingFileName=<testingFileName> --defaultDistribution=<defaultDistribution> --autoAliases=<autoAliases>

<pathToFileOrDirectory> path to a .chatito file or a directory that contains chatito files. If it is a directory, will search recursively for all *.chatito files inside and use them to generate the dataset. e.g.: lightsChange.chatito or ./chatitoFilesFolder
<format> Optional. default, rasa, luis, flair or snips.
<formatOptions> Optional. Path to a .json file that each adapter optionally can use
<outputPath> Optional. The directory where to save the generated datasets. Uses the current directory as default.
<trainingFileName> Optional. The name of the generated training dataset file. Do not forget to add a .json extension at the end. Uses <format>_dataset_training.json as default file name.
<testingFileName> Optional. The name of the generated testing dataset file. Do not forget to add a .json extension at the end. Uses <format>_dataset_testing.json as default file name.
<defaultDistribution> Optional. The default frequency distribution if not defined at the entity level. Defaults to regular and can be set to even.
<autoAliases> Optional. The generator behavior when finding an undefined alias. Valid options are allow, warn, restrict. Defaults to 'allow'.

Chatito can generate data in the Rasa, Flair, LUIS and Snips NLU formats. Taking Rasa as an example:

npx chatito ./chatito --format=rasa --outputPath=./data

The results are written to the data folder:

rasa_dataset_testing.json
rasa_dataset_training.json
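If generation needs to be part of a Python pipeline, the same command can be assembled as an argument list. This is only a convenience sketch: build_chatito_command is a hypothetical helper, and it uses nothing beyond the --format and --outputPath flags listed above.

```python
import shlex


def build_chatito_command(source, fmt="rasa", output_path="./data"):
    """Build the argument list for the npx chatito call shown above."""
    return ["npx", "chatito", source, f"--format={fmt}", f"--outputPath={output_path}"]


cmd = build_chatito_command("./chatito")
print(shlex.join(cmd))
# To actually run it (requires node.js and chatito to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.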

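The generated file is a single line of JSON. To inspect it on macOS, copy it to the clipboard with pbcopy < rasa_dataset_testing.json and paste it into the formatter at http://www.totootool.com/json.html.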
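The file can also be formatted locally without a website. A minimal sketch using only the Python standard library follows; the one-line sample is hand-written for illustration and its exact layout is an assumption, not real Chatito output.

```python
import json


def pretty(json_text):
    """Re-indent a one-line JSON string and keep Chinese characters readable."""
    return json.dumps(json.loads(json_text), ensure_ascii=False, indent=2)


# Hand-written sample; in practice, read the generated file instead.
one_line = (
    '{"rasa_nlu_data":{"common_examples":'
    '[{"text":"他叫小明","intent":"intro_new_user","entities":[]}]}}'
)
print(pretty(one_line))
```

Setting ensure_ascii=False matters here: without it, every Chinese character is printed as a \u escape and the result is hard to read.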

The training set contains two extra fields:

"regex_features":[],"entity_synonyms":[]

These are specific to Rasa and are not covered further here.
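For downstream training it helps to flatten the file into (text, intent, entities) records and to check that every entity offset really points at its value. The sketch below assumes the Rasa JSON layout with a top-level rasa_nlu_data object holding common_examples; the sample is hand-written, so verify the keys against an actual generated file before relying on it.

```python
import json


def load_rasa_examples(json_text):
    """Return (text, intent, entities) tuples from Rasa-style NLU JSON."""
    data = json.loads(json_text)["rasa_nlu_data"]
    rows = []
    for example in data.get("common_examples", []):
        entities = example.get("entities", [])
        for ent in entities:
            # The character offsets should slice out exactly the slot value.
            span = example["text"][ent["start"]:ent["end"]]
            if span != ent["value"]:
                raise ValueError(f"offset mismatch in {example['text']!r}")
        rows.append((example["text"], example["intent"], entities))
    return rows


# Hand-written sample in the assumed layout, not real Chatito output.
sample = json.dumps({
    "rasa_nlu_data": {
        "regex_features": [],
        "entity_synonyms": [],
        "common_examples": [{
            "text": "他叫小明",
            "intent": "intro_new_user",
            "entities": [{"start": 2, "end": 4, "value": "小明", "entity": "username"}],
        }],
    }
}, ensure_ascii=False)

for text, intent, entities in load_rasa_examples(sample):
    print(text, intent, [e["value"] for e in entities])
```

The offset check is cheap and catches problems early, for example when augmentation has changed the text without updating the entity positions.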

The next post will show how to use Snorkel for NLP data augmentation and for generating weakly supervised training data.

