[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-10708":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":16,"starSnapshotCount":16,"syncStatus":31,"lastSyncTime":32,"discoverSource":33},10708,"DeepQA","Conchylicultor\u002FDeepQA","Conchylicultor","My tensorflow implementation of \"A neural conversational model\", a Deep learning based chatbot","",null,"Python",2912,1157,203,90,0,1,31.19,"Apache License 2.0",false,"master",true,[24,25,26,27],"chatbot","deep-learning","seq2seq","tensorflow","2026-06-12 02:02:25","# Deep Q&A\n[![Join the chat at https:\u002F\u002Fgitter.im\u002Fchatbot-pilots\u002FDeepQA](https:\u002F\u002Fbadges.gitter.im\u002Fchatbot-pilots\u002FDeepQA.svg)](https:\u002F\u002Fgitter.im\u002Fchatbot-pilots\u002FDeepQA?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)\n\n#### Table of Contents\n\n* [Presentation](#presentation)\n* [Installation](#installation)\n* [Running](#running)\n    * [Chatbot](#chatbot)\n    * [Web interface](#web-interface)\n* [Results](#results)\n* [Pretrained model](#pretrained-model)\n* [Improvements](#improvements)\n* [Upgrade](#upgrade)\n\n## Presentation\n\nThis work tries to reproduce the results of [A Neural Conversational Model](http:\u002F\u002Farxiv.org\u002Fabs\u002F1506.05869) (aka the Google chatbot). It uses a RNN (seq2seq model) for sentence predictions. 
It is written in Python with TensorFlow.\n\nThe corpus-loading part of the program is inspired by the Torch [neuralconvo](https:\u002F\u002Fgithub.com\u002Fmacournoyer\u002Fneuralconvo) from [macournoyer](https:\u002F\u002Fgithub.com\u002Fmacournoyer).\n\nFor now, DeepQA supports the following dialog corpora:\n * [Cornell Movie Dialogs](http:\u002F\u002Fwww.cs.cornell.edu\u002F~cristian\u002FCornell_Movie-Dialogs_Corpus.html) corpus (default). Already included when cloning the repository.\n * [OpenSubtitles](http:\u002F\u002Fopus.lingfil.uu.se\u002FOpenSubtitles.php) (thanks to [Eschnou](https:\u002F\u002Fgithub.com\u002Feschnou)). Much bigger corpus (but also noisier). To use it, follow [these instructions](data\u002Fopensubs\u002F) and use the flag `--corpus opensubs`.\n * Supreme Court Conversation Data (thanks to [julien-c](https:\u002F\u002Fgithub.com\u002Fjulien-c)). Available using `--corpus scotus`. See the [instructions](data\u002Fscotus\u002F) for installation.\n * [Ubuntu Dialogue Corpus](https:\u002F\u002Farxiv.org\u002Fabs\u002F1506.08909) (thanks to [julien-c](https:\u002F\u002Fgithub.com\u002Fjulien-c)). Available using `--corpus ubuntu`. See the [instructions](data\u002Fubuntu\u002F) for installation.\n * Your own data (thanks to [julien-c](https:\u002F\u002Fgithub.com\u002Fjulien-c)) by using a simple custom conversation format (see [here](data\u002Flightweight) for more info).\n\nTo speed up training, it's also possible to use pre-trained word embeddings (thanks to [Eschnou](https:\u002F\u002Fgithub.com\u002Feschnou)). 
More info [here](data\u002Fembeddings).\n\n## Installation\n\nThe program requires the following dependencies (easy to install using pip: `pip3 install -r requirements.txt`):\n * python 3.5\n * tensorflow (tested with v1.0)\n * numpy\n * CUDA (for using GPU)\n * nltk (natural language toolkit for tokenizing the sentences)\n * tqdm (for the nice progress bars)\n\nYou might also need to download additional data to make nltk work.\n\n```\npython3 -m nltk.downloader punkt\n```\n\nThe Cornell dataset is already included. For the other datasets, look at the readme files in their respective folders (inside `data\u002F`).\n\nThe web interface requires some additional packages:\n * django (tested with 1.10)\n * channels\n * Redis (see [here](http:\u002F\u002Fredis.io\u002Ftopics\u002Fquickstart))\n * asgi_redis (at least 1.0)\n\nA Docker installation is also available. More detailed instructions [here](docker\u002FREADME.md).\n\n## Running\n\n### Chatbot\n\nTo train the model, simply run `main.py`. Once trained, you can test the results with `main.py --test` (results generated in 'save\u002Fmodel\u002Fsamples_predictions.txt') or `main.py --test interactive` (more fun).\n\nHere are some flags which could be useful. For more help and options, use `python main.py -h`:\n * `--modelTag \u003Cname>`: gives a name to the current model, to differentiate between models when testing\u002Ftraining.\n * `--keepAll`: use this flag when training if you want to see the predictions at different steps when testing (it can be interesting to see the program change its name and age as the training progresses). Warning: It can quickly take a lot of storage space if you don't increase the `--saveEvery` option.\n * `--filterVocab 20` or `--vocabularySize 30000`: Limit the vocabulary size to optimize performance and memory usage. 
Replace the words used fewer than 20 times with the `\u003Cunknown>` token and set a maximum vocabulary size.\n * `--verbose`: when testing, will print the sentences as they are computed.\n * `--playDataset`: show some dialogue samples from the dataset (can be used together with `--createDataset` if this is the only action you want to perform).\n\nTo visualize the computational graph and the cost with [TensorBoard](https:\u002F\u002Fwww.tensorflow.org\u002Fhow_tos\u002Fsummaries_and_tensorboard\u002F), just run `tensorboard --logdir save\u002F`.\n\nBy default, the network architecture is a standard encoder\u002Fdecoder with two LSTM layers (hidden size of 256) and an embedding size for the vocabulary of 32. The network is trained using ADAM. The maximum sentence length is set to 10 words, but can be increased.\n\n### Web interface\n\nOnce trained, it's possible to chat with it using a more user-friendly interface. The server will look at the model copied to `save\u002Fmodel-server\u002Fmodel.ckpt`. The first time you want to use it, you'll need to configure it with:\n\n```bash\nexport CHATBOT_SECRET_KEY=\"my-secret-key\"\ncd chatbot_website\u002F\npython manage.py makemigrations\npython manage.py migrate\n```\n\nThen, to launch the server locally, use the following commands:\n\n```bash\ncd chatbot_website\u002F\nredis-server &  # Launch Redis in background\npython manage.py runserver\n```\n\nAfter launch, the interface should be available at [http:\u002F\u002Flocalhost:8000\u002F](http:\u002F\u002Flocalhost:8000\u002F). If you want to deploy the program on a server, use `python manage.py runserver 0.0.0.0` instead. 
More info [here](https:\u002F\u002Fdocs.djangoproject.com\u002Fen\u002F1.10\u002Fhowto\u002Fdeployment\u002Fchecklist\u002F).\n\n[![Chatbot web interface](chatbot_miniature.png)](http:\u002F\u002Fe-pot.xyz\u002Fcv\u002Fchatbot.png)\n\n## Results\n\nSurprisingly, it's possible to get some results after only 1 or 2 hours of training (on a GeForce GT 740M), by drastically reducing the input sequence to 5 words and output to 3 (plus the `go` and `eos` tokens) and by using a small embedding size (something like 25). Since then, I modified the code, and the output length now has to match the input length, but you can still reproduce the original results using [this version](https:\u002F\u002Fgithub.com\u002FConchylicultor\u002FDeepQA\u002Ftree\u002F92863e2929580818f866f16969909fe2093d41d1).\nOf course, the network won't be really chatty:\n\n    Q: Hi\n    A: Hi.\n\n    Q: What is your name ?\n    A: Laura.\n\n    Q: What does that mean ?\n    A: I dunno.\n\n    Q: How old are you ?\n    A: thirty-five.\n\n    Q: Will Google hire me ?\n    A: No.\n\n    Q: Tell me the alphabet\n    A: Fuck you.\n\n    Q: That's not nice\n    A: Yeah.\n\nHere are some cases where it fails:\n\n    Q: Who is Laura ?\n    A: My brother.\n\n    Q: Say 'goodbye'\n    A: Alright.\n\n    Q: What is cooking ?\n    A: A channel.\n\n    Q: Can you say no ?\n    A: No.\n\n    Q: Two plus two\n    A: Manny...\n\nWith longer sentences, the network is much slower to train. 
After 250000 iterations with maxLength=10, the network still mainly gives evasive answers (by answering most of the open-ended questions with 'I don't know') but it seems to have integrated some basic semantics:\n\n    Q: When are the fireworks ?\n    A: Tomorrow.\n\n    Q: What color ?\n    A: Gray.\n\n    Q: Which apple ?\n    A: That's the first one.\n\n    Q: What time is it ?\n    A: Eleven o'clock.\n\n    Q: Who is Skywalker ?\n    A: She's a freelance bio-exorcist.\n\n    Q: Who is John ?\n    A: The old man.\n\n    Q: Who is Jenny ?\n    A: The man who killed herself.\n\n    Q: When were you born ?\n    A: Soon.\n\nAs shown by the last questions, the network still confuses genders and the present\u002Ffuture. I also tried some deeper philosophical questions with more or less success.\n\n    Q: What is the purpose of being intelligent ?\n    A: The man are obsolete.\n\n    Q: What is immoral ?\n    A: The truth.\n\n    Q: What is morality ?\n    A: The worst.\n\n    Q: What is the purpose of existence ?\n    A: It 's private.\n\n    Q: What happens if machines can think ?\n    A: I think I'll just lie with you.\n\nThe model I trained is quite limited by the embedding size I chose, by its own size, and by the training corpus size. Its 'thought vector' is also probably too small to answer questions of the following kind:\n\n    Q: My name is David. What is my name ?\n    A: Debbie.\n\n    Q: My name is John. What is my name ?\n    A: Carl Smith.\n\n    Q: How many legs does a cat have ?\n    A: Four.\n\n    Q: How many legs does a spider have ?\n    A: Four.\n\n    Q: Do you prefer football or soccer ?\n    A: Whiskey.\n\nIt also seems to overfit, as sometimes it will just pop out sentences from its training set unrelated to the question. Adding some dropout should reduce the issue.\n\n    Q: I play tennis. 
What do I play ?\n    A: I'm going to get laid.\n\n    Q: What color is the sky ?\n    A: One of the deputies is everywhere.\n\n## Pretrained model (not working)\n\nNote: Links do not work anymore.\n\n~~You can find a pre-trained model [here](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F0Bw-phsNSkq23OXRFTkNqN0JGUU0\u002Fview?usp=sharing)~~, trained on the default corpus. To use it:\n 1. Extract the zip file inside `DeepQA\u002Fsave\u002F`\n 2. Copy the preprocessed dataset from `save\u002Fmodel-pretrainedv2\u002Fdataset-cornell-old-lenght10-filter0-vocabSize0.pkl` to `data\u002Fsamples\u002F`.\n 3. Run `.\u002Fmain.py --modelTag pretrainedv2 --test interactive`.\n\nThanks to Nicholas C., [here](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F0Bw-phsNSkq23c29ZQ2N6X3lyc1U?resourcekey=0-u1yOQjC701db6O9FfCW9EA&usp=sharing) ([original](https:\u002F\u002Fmcastedu-my.sharepoint.com\u002Fpersonal\u002Fnicholas_cutajar_a100636_mcast_edu_mt\u002F_layouts\u002F15\u002Fguestaccess.aspx?folderid=077576c4cf9854642a968f67909380f45&authkey=AVt2JWMPkf2R_mWBpI1eAUY)) are some additional pre-trained models (compatible with TF 1.2) for diverse datasets. The folder also contains the pre-processed datasets for Cornell, OpenSubtitles, Ubuntu and Scotus (to move inside `data\u002Fsamples\u002F`). Those are required if you don't want to process the datasets yourself.\n\nIf you have a high-end GPU, don't hesitate to play with the hyper-parameters\u002Fcorpus to train a better model. From my experiments, it seems that the learning rate and dropout rate have the most impact on the results. Also, if you want to share your models, don't hesitate to contact me and I'll add them here.\n\n## Improvements\n\nIn addition to trying larger\u002Fdeeper models, there are a lot of small improvements which could be tested. Don't hesitate to send a pull request if you implement one of those. 
Here are some ideas:\n\n* For now, the predictions are deterministic (the network just takes the most likely output) so when answering a question, the network will always give the same answer. By adding a sampling mechanism, the network could give more diverse (and maybe more interesting) answers. The easiest way to do that is to sample the next predicted word from the SoftMax probability distribution. By combining that with the `loop_function` argument of `tf.nn.seq2seq.rnn_decoder`, it shouldn't be too difficult to add. After that, it should be possible to play with the SoftMax temperature to get more conservative or exotic predictions.\n* Adding attention could potentially improve the predictions, especially for longer sentences. It should be straightforward by replacing `embedding_rnn_seq2seq` with `embedding_attention_seq2seq` in `model.py`.\n* Having more data usually doesn't hurt. Training on a bigger corpus should be beneficial. [Reddit comments dataset](https:\u002F\u002Fwww.reddit.com\u002Fr\u002Fdatasets\u002Fcomments\u002F59039y\u002Fupdated_reddit_comment_dataset_up_to_201608\u002F) seems to be the biggest for now (and is too big for this program to support it). Another trick to artificially increase the dataset size when creating the corpus could be to split the sentences of each training sample (e.g. from the sample `Q:Sentence 1. Sentence 2. => A:Sentence X. Sentence Y.` we could generate 3 new samples: `Q:Sentence 1. Sentence 2. => A:Sentence X.`, `Q:Sentence 2. => A:Sentence X. Sentence Y.` and `Q:Sentence 2. => A:Sentence X.`. Warning: other combinations like `Q:Sentence 1. => A:Sentence X.` won't work because it would break the transition `2 => X` which links the question to the answer)\n* The testing curve should really be monitored as done in my other [music generation](https:\u002F\u002Fgithub.com\u002FConchylicultor\u002FMusicGenerator) project. This would greatly help to see the impact of dropout on overfitting. 
For now it's just done empirically by manually checking the testing prediction at different training steps.\n* For now, the questions are independent from each other. To link questions together, a straightforward way would be to feed all previous questions and answers to the encoder before giving the answer. Some caching could be done on the final encoder state to avoid recomputing it each time. To improve the accuracy, the network should be retrained on entire dialogues instead of just individual QA pairs. Also, when feeding the previous dialogue to the encoder, new tokens `\u003CQ>` and `\u003CA>` could be added so the encoder knows when the interlocutor is changing. I'm not sure though that the simple seq2seq model would be sufficient to capture long term dependencies between sentences. Adding a bucket system to group similar input lengths together could greatly improve training speed.\n","DeepQA is a deep learning based chatbot project that implements the model from the paper \"A Neural Conversational Model\". Built with Python and TensorFlow, its core function is predicting sentences with an RNN (seq2seq model). It supports several dialogue corpora, such as Cornell Movie Dialogs and OpenSubtitles, and lets users supply their own datasets. It also supports pre-trained word embeddings to speed up training. It is suited to scenarios that require building intelligent dialogue systems, such as customer service automation and virtual assistant development. The project is open source and easy to extend, making it suitable for researchers and developers to explore and improve.",2,"2026-06-11 03:29:50","top_topic"]