Forum Text Processing and Summarization

Yen-Wei Mak - Multimedia University, Cyberjaya, 63100, Malaysia.
Hui-Ngo Goh - Multimedia University, Cyberjaya, 63100, Malaysia.
Amy Hui-Lan Lim - Multimedia University, Cyberjaya, 63100, Malaysia.

Frequently Asked Questions (FAQs) have been studied extensively in domains such as medicine, but comparable frameworks are lacking in software engineering and open-source communities. This research aims to bridge that gap by laying the foundations of an automated FAQ generation and retrieval framework tailored to the software engineering domain. The framework analyzes, ranks, and summarizes content from open forums such as Stack Overflow and GitHub issues, and applies sentiment analysis to it. A corpus of Stack Overflow posts is collected to evaluate the proposed framework and the selected models. The framework integrates state-of-the-art string-matching, sentiment analysis, and summarization models with the proprietary ranking formula proposed in this paper to facilitate developers' work. The string-matching, sentiment analysis, and summarization models achieved F1 scores of 71.31%, 74.90%, and 53.4%, respectively. Because evaluation in this context is subjective, a human review further validates the effectiveness of the overall framework, with assessments of relevancy, preferred ranking, and preferred summarization. Future work includes improving the summarization models by incorporating text classification and summarizing each class of text individually (Kou et al., 2023), as well as proposing feedback loop systems based on human reinforcement learning. Further efforts will optimize the framework by using knowledge graphs for dimension reduction, enabling it to handle larger corpora effectively.
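The abstract does not say which string-matching model the framework uses. As a rough illustration only, the sketch below assumes a Levenshtein-based similarity (a technique that appears in the reference list) to match a developer query against forum post titles. The function names, the normalisation by the longer string, and the sample titles are all hypothetical and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions, and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(query: str, titles: list[str]) -> tuple[str, float]:
    """Return the title most similar to the query, ignoring case."""
    scored = [(t, similarity(query.lower(), t.lower())) for t in titles]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical post titles, for illustration only.
    titles = ["How to reverse a list in Python", "How to sort a dict by value"]
    print(best_match("reverse a list in python", titles))
```

A character-level measure like this is cheap and needs no training, but it ignores word meaning, so a real system would probably combine it with other signals such as the ranking and sentiment steps the abstract mentions.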


Keywords: Deep Learning (DL); Forum Processing; Natural Language Processing (NLP); Summarization; Sentiment Analysis

References:

N. Patel, “Microsoft thinks AI can beat Google at search — CEO Satya Nadella explains why.” Accessed: Jun. 06, 2023. [Online]. Available:

L. Tamine and L. Goeuriot, “Semantic information retrieval on medical texts: Research challenges, survey, and open issues,” ACM Computing Surveys (CSUR), vol. 54, no. 7, pp. 1–38, 2021, doi: 10.1145/3462476.

S. Gupta and V. R. Carvalho, “FAQ retrieval using attentive matching,” in Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 929–932. doi: 10.1145/3331184.3331294.

F. Raazaghi, “Auto-FAQ-Gen: automatic frequently asked questions generation,” in Advances in Artificial Intelligence: 28th Canadian Conference on Artificial Intelligence, Canadian AI 2015, Halifax, Nova Scotia, Canada, June 2-5, 2015, Proceedings 28, Springer, 2015, pp. 334–337. doi: 10.1007/978-3-319-18356-5_30.

W.-C. Hu, D.-F. Yu, and H. C. Jiau, “A FAQ finding process in open source project forums,” in 2010 Fifth International Conference on Software Engineering Advances, IEEE, 2010, pp. 259–264. doi: 10.1109/ICSEA.2010.46.

S. Henß, M. Monperrus, and M. Mezini, “Semi-automatically extracting FAQs to improve accessibility of software development knowledge,” in 2012 34th International Conference on Software Engineering (ICSE), IEEE, 2012, pp. 793–803. doi: 10.1109/ICSE.2012.6227139.

F. Razzaghi, H. Minaee, and A. A. Ghorbani, “Context free frequently asked questions detection using machine learning techniques,” in 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), IEEE, 2016, pp. 558–561. doi: 10.1109/WI.2016.0095.

A. Virani, R. Yadav, P. Sonawane, and S. Jawale, “Automatic Question Answer Generation using T5 and NLP,” in 2023 International Conference on Sustainable Computing and Smart Systems (ICSCSS), IEEE, 2023, pp. 1667–1673. doi: 10.1109/ICSCSS57650.2023.10169726.

S. Gangopadhyay and S. M. Ravikiran, “Focused questions and answer generation by key content selection,” in 2020 IEEE Sixth International Conference on Multimedia Big Data (BigMM), IEEE, 2020, pp. 45–53. doi: 10.1109/BigMM50055.2020.00017.

S. Dutta, H. Assem, and E. Burgin, “Sequence-to-sequence learning on keywords for efficient FAQ retrieval,” arXiv preprint arXiv:2108.10019, 2021, doi: 10.48550/arXiv.2108.10019.

V. Jijkoun and M. de Rijke, “Retrieving answers from frequently asked questions pages on the web,” in Proceedings of the 14th ACM international conference on Information and knowledge management, 2005, pp. 76–83. doi: 10.1145/1099554.1099571.

T. Makino, T. Noro, and T. Iwakura, “An FAQ search method using a document classifier trained with automatically generated training data,” in PRICAI 2016: Trends in Artificial Intelligence: 14th Pacific Rim International Conference on Artificial Intelligence, Phuket, Thailand, August 22-26, 2016, Proceedings 14, Springer, 2016, pp. 295–305. doi: 10.1007/978-3-319-42911-3_25.

S. Vasisht, V. Tirthani, A. Eppa, P. Koujalgi, and R. Srinath, “Automatic FAQ generation using text-to-text transformer model,” in 2022 3rd International Conference for Emerging Technology (INCET), IEEE, 2022, pp. 1–7. doi: 10.1109/INCET54531.2022.9823967.

G. Kothari, S. Negi, T. A. Faruquie, V. T. Chakaravarthy, and L. V. Subramaniam, “SMS based interface for FAQ retrieval,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, pp. 852–860.

S. Zhang, Y. Hu, and G. Bian, “Research on string similarity algorithm based on Levenshtein Distance,” in 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), IEEE, 2017, pp. 2247–2251. doi: 10.1109/IAEAC.2017.8054419.

G. Zhou, Y. Liu, F. Liu, D. Zeng, and J. Zhao, “Improving question retrieval in community question answering using world knowledge,” in Twenty-third international joint conference on artificial intelligence, 2013.

M. Gerlach, H. Shi, and L. A. N. Amaral, “A universal information theoretic approach to the identification of stopwords,” Nat Mach Intell, vol. 1, no. 12, pp. 606–612, 2019, doi: 10.1038/s42256-019-0112-6.

S. Sarica and J. Luo, “Stopwords in technical language processing,” PLoS One, vol. 16, no. 8, p. e0254937, 2021, doi: 10.1371/journal.pone.0254937.

C. P. Chai, “Comparison of text preprocessing methods,” Nat Lang Eng, vol. 29, no. 3, pp. 509–553, 2023, doi: 10.1017/S1351324922000213.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” Adv Neural Inf Process Syst, vol. 26, 2013.

Y. Wang, J. Guo, C. Yuan, and B. Li, “Sentiment analysis of Twitter data,” Applied Sciences, vol. 12, no. 22, p. 11775, 2022, doi: 10.3390/app122211775.

Y.-C. Fung, L.-K. Lee, K. T. Chui, G. H.-K. Cheung, C.-H. Tang, and S.-M. Wong, “Sentiment analysis and summarization of Facebook posts on news media,” in Data Mining Approaches for Big Data and Sentiment Analysis in Social Media, IGI Global, 2022, pp. 142–154. doi: 10.4018/978-1-7998-8413-2.ch006.

C.-P. Chan and J.-H. Yang, “Instagram Text Sentiment Analysis Combining Machine Learning and NLP,” ACM Transactions on Asian and Low-Resource Language Information Processing, doi: 10.1145/3606370.

O. Alqaryouti, N. Siyam, A. Abdel Monem, and K. Shaalan, “Aspect-based sentiment analysis using smart government review data,” Applied Computing and Informatics, 2020, doi: 10.1016/j.aci.2019.11.003.

D. Yadav, J. Desai, and A. K. Yadav, “Automatic text summarization methods: A comprehensive review,” arXiv preprint arXiv:2204.01849, 2022, doi: 10.48550/arXiv.2204.01849.

L. Banarescu et al., “Abstract meaning representation for sembanking,” in Proceedings of the 7th linguistic annotation workshop and interoperability with discourse, 2013, pp. 178–186.

F. Calefato, F. Lanubile, F. Maiorano, and N. Novielli, “Sentiment polarity detection for software development,” in Proceedings of the 40th International Conference on Software Engineering, 2018, p. 128. doi: 10.1145/3180155.3182519.

M. Lewis et al., “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” arXiv preprint arXiv:1910.13461, 2019, doi: 10.48550/arXiv.1910.13461.

B. Kou, Y. Di, M. Chen, and T. Zhang, “SOSum: a dataset of Stack Overflow post summaries,” in Proceedings of the 19th International Conference on Mining Software Repositories, 2022, pp. 247–251. doi: 10.1145/3524842.3528487.

N. Novielli, F. Calefato, and F. Lanubile, “A gold standard for emotion annotation in Stack Overflow,” in Proceedings of the 15th international conference on mining software repositories, 2018, pp. 14–17. doi: 10.1145/3196398.3196453.

A. A. Syed, F. L. Gaol, and T. Matsuo, “A survey of the state-of-the-art models in neural abstractive text summarization,” IEEE Access, vol. 9, pp. 13248–13265, 2021, doi: 10.1109/ACCESS.2021.3052783.

B. Kou, M. Chen, and T. Zhang, “Automated Summarization of Stack Overflow Posts,” arXiv preprint arXiv:2305.16680, 2023, doi: 10.48550/arXiv.2305.16680.