
Business scenario: task-oriented dialogue understanding with only a small amount of training data.

Dialogue systems fall into three categories:

Question answering
Task-oriented
Chit-chat

1. Rule-based methods

1.1 Intent recognition

Dictionary (keyword lexicon) matching
CFG (context-free grammar)
JSGF (JSpeech Grammar Format)
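
To make the dictionary approach concrete, here is a minimal sketch of keyword-based intent matching. The intent names, the keyword lists, and the function `match_intent` are all hypothetical and only illustrate the idea; a real system would usually add word segmentation and normalization first.

```python
from typing import Dict, List, Optional

# Hypothetical keyword dictionary: intent name -> trigger keywords.
INTENT_KEYWORDS: Dict[str, List[str]] = {
    "book_flight": ["flight", "plane ticket", "fly to"],
    "check_weather": ["weather", "temperature", "rain"],
}


def match_intent(text: str, lexicon: Dict[str, List[str]]) -> Optional[str]:
    """Return the intent with the most matching keywords, or None if none match."""
    lowered = text.lower()
    best_intent, best_hits = None, 0
    for intent, keywords in lexicon.items():
        # Plain substring test, so "rain" would also fire inside "train".
        hits = sum(1 for kw in keywords if kw in lowered)
        if hits > best_hits:
            best_intent, best_hits = intent, hits
    return best_intent
```

Substring matching is the simplest possible choice here. Grammar-based formats such as CFG or JSGF can additionally express word order and optional parts, which this sketch does not attempt.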


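1.2 Named entity recognition

A dictionary has to be built first.

Aho–Corasick automaton algorithm
Aho-Corasick automaton combined with DoubleArrayTrie for very fast multi-pattern matching
Rule-based models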
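
As an illustration of dictionary-based entity matching, the following is a minimal Aho-Corasick sketch in pure Python. It stores the trie in ordinary dicts rather than a double-array trie, so it shows the matching logic only and not the memory layout that makes DoubleArrayTrie fast. The class name and the `(word, label)` dictionary format are assumptions made for this example.

```python
from collections import deque
from typing import Dict, List, Tuple


class AhoCorasick:
    """Minimal Aho-Corasick automaton over a {word: label} dictionary."""

    def __init__(self, entries: Dict[str, str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]         # trie transitions per state
        self.fail: List[int] = [0]                     # failure link per state
        self.out: List[List[Tuple[str, str]]] = [[]]   # words ending at each state
        for word, label in entries.items():
            self._insert(word, label)
        self._build_failure_links()

    def _insert(self, word: str, label: str) -> None:
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append((word, label))

    def _build_failure_links(self) -> None:
        # Breadth-first traversal; depth-1 states keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit words that end at the failure target (proper suffixes).
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, text: str) -> List[Tuple[int, int, str, str]]:
        """Return (start, end, word, label) for every dictionary hit in text."""
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for word, label in self.out[state]:
                hits.append((i - len(word) + 1, i + 1, word, label))
        return hits
```

The automaton reports every overlapping hit in a single pass over the text. Choosing among overlapping candidates (for example, longest match first) is left to the caller.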


2. Model-based methods

A dataset survey about task-oriented dialogue, including recent datasets and SoA results & papers.

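2.1 Pipeline

The pipeline approach treats intent recognition and slot filling as two independent components that are trained separately.

2.1.1 Intent recognition

Intent recognition is essentially short-text classification, so any general-purpose text classification algorithm can be applied.

Traditional algorithms: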

LR
SVM
KNN
RF
GBDT
…

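Deep learning methods:

fastText
TextCNN
GRU
LSTM
IDCNN
TextRNN

In our survey, pre-trained fastText word vectors combined with a single-layer TextCNN did comparatively well on both classification quality and speed, so this is the preferred choice.

Improvements to TextCNN: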

K-max pooling
DPCNN
…
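
For reference, k-max pooling replaces TextCNN's max-over-time pooling by keeping the k largest activations of each feature map in their original order. The sketch below works on a plain list of floats; the function name `k_max_pooling` is chosen for this example and is not taken from any library.

```python
from typing import List, Sequence


def k_max_pooling(values: Sequence[float], k: int) -> List[float]:
    """Keep the k largest activations, preserving their original order."""
    if k <= 0:
        return []
    # Indices of the k largest values; ties go to the earlier position.
    top = sorted(range(len(values)), key=lambda i: (-values[i], i))[:k]
    return [values[i] for i in sorted(top)]
```

With `k=1` this reduces to ordinary max pooling. Larger values of k keep some information about how often and in what order a feature fires.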

2.1.2 Slot filling

CRF
RNN/LSTM/CNN+CRF
BiLSTM+CRF
BiLSTM+CNN+CRF
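
All of the models above share a CRF output layer, which at inference time picks the best tag sequence with Viterbi decoding. The sketch below shows that decoding step only, assuming the per-token emission scores (for example from a BiLSTM) and the tag transition scores are already given as plain dicts. The function name and data layout are assumptions for this example.

```python
from typing import Dict, List, Sequence


def viterbi(
    emissions: Sequence[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: Sequence[str],
) -> List[str]:
    """Return the highest-scoring tag sequence under a linear-chain model.

    emissions[t][tag] is the score of `tag` at position t;
    transitions[prev][cur] is the score of moving from `prev` to `cur`.
    """
    if not emissions:
        return []
    # best[t][tag]: best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers starting from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

A strongly negative transition score, for instance from `O` directly to an inside tag, is how the CRF layer discourages invalid tag sequences that a per-token classifier could still emit.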

2.2 Joint model

The model mentioned in the third item: Convolutional Sequence to Sequence Learning

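3. Industry practice

3.1 AliMe (阿里小蜜)

arXiv: AliMe Assist: An Intelligent Assistant for Creating an Innovative E-commerce Experience

Note: according to people inside the company, this framework is too old and has been retired.

Business rule parser: a trie-based matching structure built from a large number of patterns.
Intention classifier: scenario classification, using fastText for pre-training and a single-layer CNN for classification.
- requesting for assistance
- asking for information or solution
- chatting
Semantic parser: trie-based; matches entities in the knowledge graph.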

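3.2 Meituan

Reference: 美团对话理解技术及实践 (Meituan Dialogue Understanding Technology and Practice)

Context-free grammars, tooling, and how the rules are written.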

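4. Data

[Corpus] WebQA, Baidu's Chinese question-answering dataset
SophonPlus/ChineseNlpCorpus
candlewill/Dialog_Corpus: corpora for training Chinese and English dialogue systems (Datasets for Training Chatbot System)
brightmart/nlp_chinese_corpus: large-scale Chinese corpus for natural language processing (Large Scale Chinese Corpus for NLP)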

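5. Open-source tools

5.1 ChatterBot

GitHub: 9.1k stars

ChatterBot has no NLU module. It works by matching: the training input is a set of complete conversations, which are stored in a database.

Responses are produced by logic adapters:

BestMatch
TimeLogicAdapter
MathematicalEvaluation

The framework mainly runs similarity matching on the question text to find a predefined answer in the store, which makes it a reasonable fit for knowledge Q&A scenarios.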
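
The matching idea can be sketched with the standard library alone. The code below is not ChatterBot's API; it is an illustrative stand-in that uses `difflib.SequenceMatcher` as the similarity measure, with a hypothetical in-memory question-to-answer store and an arbitrary threshold of 0.6.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple

# Hypothetical stored question -> answer pairs.
KNOWLEDGE_BASE: Dict[str, str] = {
    "What are your opening hours?": "We are open from 9am to 6pm.",
    "How do I reset my password?": "Use the 'Forgot password' link on the login page.",
}


def best_match(
    question: str, kb: Dict[str, str], threshold: float = 0.6
) -> Optional[Tuple[str, float]]:
    """Return (answer, similarity) for the closest stored question, or None."""
    scored = [
        (SequenceMatcher(None, question.lower(), known.lower()).ratio(), answer)
        for known, answer in kb.items()
    ]
    if not scored:
        return None
    score, answer = max(scored)
    # Below the threshold, treat the question as unknown instead of guessing.
    return (answer, score) if score >= threshold else None
```

Because the answer is looked up rather than generated, quality depends entirely on how well the stored questions cover what users actually ask.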

5.2 rasa

Data

Corpus annotation tool: rasa-nlu-trainer
Data generation tool: chatito

Intent recognition

KeywordIntentClassifier: This classifier is mostly used as a placeholder. It is able to recognize hello and goodbye intents by searching for these keywords in the passed messages.
MitieIntentClassifier: This classifier uses MITIE to perform intent classification. The underlying classifier is using a multi-class linear SVM with a sparse linear kernel.
SklearnIntentClassifier: The sklearn intent classifier trains a linear SVM which gets optimized using a grid search. It requires a feature extractor earlier in the pipeline.
EmbeddingIntentClassifier: The embedding intent classifier embeds user inputs and intent labels into the same space. Supervised embeddings are trained by maximizing similarity between them. This algorithm is based on StarSpace.

Entity recognition

MitieEntityExtractor: The underlying classifier is using a multi class linear SVM with a sparse linear kernel and custom features.
SpacyEntityExtractor: Using spaCy this component predicts the entities of a message. spaCy uses a statistical BILOU transition model.
EntitySynonymMapper: Maps synonymous entity values to the same value. The synonyms are supplied through the value field in the training data.
CRFEntityExtractor: spaCy has to be installed. The CRF itself is implemented with sklearn-crfsuite; spaCy is needed for features such as POS tags.
DucklingHTTPExtractor: Duckling lets you extract common entities like dates, amounts of money, distances, and others in a number of languages.
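
The synonym-mapping step is simple enough to sketch without Rasa. The code below is not Rasa's implementation; the `SYNONYMS` table, the entity dict layout, and both function names are assumptions that only illustrate normalizing extracted values to one canonical form.

```python
from typing import Dict, List

# Hypothetical synonym table: canonical value -> surface forms seen in the data.
SYNONYMS: Dict[str, List[str]] = {
    "New York City": ["NYC", "New York", "the big apple"],
}


def build_lookup(synonyms: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert canonical -> surface forms into a lowercase surface -> canonical map."""
    lookup: Dict[str, str] = {}
    for canonical, forms in synonyms.items():
        for form in forms:
            lookup[form.lower()] = canonical
    return lookup


def map_synonyms(
    entities: List[Dict[str, str]], lookup: Dict[str, str]
) -> List[Dict[str, str]]:
    """Return copies of `entities` with known synonym values replaced."""
    return [
        {**ent, "value": lookup.get(ent["value"].lower(), ent["value"])}
        for ent in entities
    ]
```

Running this after entity extraction means downstream slot logic only ever sees one spelling per entity value.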

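Slot filling

Official documentation: using slots

References:

GaoQ1/rasa-nlp-architect: Chinese intent extraction and slot filling for rasa-nlu, implemented with nlp-architect
Building contextual assistants with Rasa Forms: original, translation

Each of these stages can be replaced with a custom component: Enhancing Rasa NLU models with Custom Components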

5.3 DeepPavlov

deepmipt/DeepPavlov: 3.6k

An open source library for deep learning end-to-end dialog systems and chatbots. https://deeppavlov.ai

Supports English and Russian. It is feature-rich and useful as a learning reference.

Basic concepts

Agent is a conversational agent communicating with users in natural language (text).
Skill fulfills user’s goal in some domain. Typically, this is accomplished by presenting information or completing transaction (e.g. answer question by FAQ, booking tickets etc.). However, for some tasks a success of interaction is defined as continuous engagement (e.g. chit-chat).
Model is any NLP model that doesn’t necessarily communicate with user in natural language.
Component is a reusable functional part of Model or Skill.
Rule-based Models cannot be trained.
Machine Learning Models can be trained only stand alone.
Deep Learning Models can be trained independently and in an end-to-end mode being joined in a chain.
Skill Manager performs selection of the Skill to generate response.
Chainer builds an agent/model pipeline from heterogeneous components (Rule-based/ML/DL). It allows to train and infer models in a pipeline as a whole.

Models:

NER model [docs]: BERT-based and Bi-LSTM+CRF.
Slot filling models [docs]:
Classification model [docs]
Automatic spelling correction model [docs]
Ranking model [docs]
TF-IDF Ranker model [docs]
Question Answering model [docs]
Morphological tagging model [docs]
Frequently Asked Questions (FAQ) model [docs]

Intent recognition

BERT classifier (see here) builds BERT architecture for classification problem on Tensorflow.
Keras classifier (see here) builds neural network on Keras with tensorflow backend.
Sklearn classifier (see here) builds most of sklearn classifiers.

A rich selection of models is available.

NER

standard RNN based and BERT based.
Multilingual BERT Zero-Shot Transfer
Few-shot Language-Model based

Slot filling

Official documentation: Neural Named Entity Recognition and Slot Filling

This model solves Slot-Filling task using Levenshtein search and different neural network architectures for NER.

Slotfiller will perform fuzzy search through all variations of all entity values of given entity type. The entity type is determined by the NER component.
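
The fuzzy search described above can be sketched as follows. This is not DeepPavlov's code: the `slot_values` layout (entity type, then canonical value, then a list of variations), the function names, and the `max_distance` cut-off of 2 are all assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = cur
    return prev[-1]


def fill_slot(
    mention: str,
    entity_type: str,
    slot_values: Dict[str, Dict[str, List[str]]],
    max_distance: int = 2,
) -> Optional[str]:
    """Map a NER mention to the closest canonical value of its entity type."""
    best: Optional[Tuple[str, int]] = None
    for canonical, variations in slot_values.get(entity_type, {}).items():
        for variant in variations:
            dist = levenshtein(mention.lower(), variant.lower())
            if dist <= max_distance and (best is None or dist < best[1]):
                best = (canonical, dist)
    return best[0] if best else None
```

The NER component decides which entity type to search, so a misspelled mention such as "chinesse" is only compared against the variations of that one type.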

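Usage blog posts: DeepPavlov articles with Python code

Writing rules

The only rule authoring found so far is for dialogue rules: PatternMatchingSkill lets you define patterns and responses using regular expressions.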
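
The pattern-and-response idea can be sketched with plain `re`. This is not the PatternMatchingSkill API; the `RULES` list and the `respond` function are hypothetical and only show first-match-wins regex rules.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical rules: (compiled pattern, canned response), checked in order.
RULES: List[Tuple[re.Pattern, str]] = [
    (re.compile(r"\b(hi|hello|hey)\b", re.IGNORECASE), "Hello! How can I help?"),
    (re.compile(r"\b(bye|goodbye)\b", re.IGNORECASE), "Goodbye!"),
]


def respond(utterance: str, rules: List[Tuple[re.Pattern, str]]) -> Optional[str]:
    """Return the response of the first rule whose pattern matches, else None."""
    for pattern, response in rules:
        if pattern.search(utterance):
            return response
    return None
```

The `\b` word boundaries keep "hi" from matching inside words such as "this" or "high".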

There is also a Rasa Skill that wraps rasa.

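Problems with DeepPavlov

Environment dependencies
- DeepPavlov is built on TensorFlow and Keras and cannot reuse model implementations from other frameworks (such as PyTorch).
Language support
- Pre-trained models and evaluation datasets are mainly English and Russian; Chinese is not supported.
Production deployment
- At runtime DeepPavlov depends on the entire framework source. If the framework is modified in the development environment, the whole framework has to be updated in production.
- Individual components cannot be exported directly as standalone services, which makes it a poor fit for production deployment and release.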

5.4 Snips-nlu

snipsco/snips-nlu: 3k

Snips Python library to extract meaning from text https://snips-nlu.readthedocs.io

Does not support Chinese.

Tutorial: both intents and slot values are declared in the training data.

# turnLightOn intent
---
type: intent
name: turnLightOn
slots:
  - name: room
    entity: room
utterances:
  - Turn on the lights in the [room](kitchen)
  - give me some light in the [room](bathroom) please
  - Can you light up the [room](living room) ?
  - switch the [room](bedroom)'s lights on please
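
To show how the `[slot](value)` annotations in the utterances above carry slot spans, here is a small sketch that strips the markup and records character offsets. It is not part of snips-nlu; `parse_utterance` and the output dict keys are hypothetical, and the regex only handles the plain `[slot](value)` form shown in this example.

```python
import re
from typing import Dict, List, Tuple

# Matches "[slot_name](slot value)" as used in the utterances above.
SLOT_PATTERN = re.compile(r"\[(?P<slot>[^\]]+)\]\((?P<value>[^)]+)\)")


def parse_utterance(annotated: str) -> Tuple[str, List[Dict[str, object]]]:
    """Strip slot markup; return the plain text and the slot spans within it."""
    parts: List[str] = []
    slots: List[Dict[str, object]] = []
    cursor, length = 0, 0
    for match in SLOT_PATTERN.finditer(annotated):
        before = annotated[cursor:match.start()]
        parts.append(before)
        length += len(before)
        value = match.group("value")
        slots.append({
            "slot": match.group("slot"),
            "value": value,
            "start": length,               # offsets refer to the plain text
            "end": length + len(value),
        })
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(annotated[cursor:])
    return "".join(parts), slots
```

The plain text is what an intent classifier would see, while the spans are the labels a slot filler would be trained on.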

This parser parses text in two steps: first it classifies the intent using an IntentClassifier, and once the intent is known, it uses a SlotFiller to extract the slots.

IntentClassifier

Logistic Regression
Feature extractor for text classification relying on ngrams tfidf and optionally word cooccurrences features
scikit-learn TfidfVectorizer
Featurizer that takes utterances and extracts ordered word cooccurrence features matrix from them

SlotFiller

Linear-Chain Conditional Random Fields

5.5 Other
