ON INFORMATICS VISUALIZATION

In this paper, we survey recent publications on topic modeling and analyze the forms of visualization and the tools they used. We expect this information to help Natural Language Processing (NLP) researchers make better decisions about which types of visualization suit their work and which tools can support them. It could also spark further development of existing visualizations, or the emergence of new ones where a gap is present. Topic modeling is an NLP technique used to identify topics hidden in a collection of documents. Visualizing these topics permits a faster understanding of the underlying subject matter in terms of its domain. This survey covers publications from 2017 to early 2022. The PRISMA methodology was used to review the publications. One hundred articles were collected, and 42 were found eligible for this study after filtration. Two research questions were formulated. The first asks, "What are the different forms of visualizations used to display the result of topic modeling?" and the second asks, "What visualization software or API is used?" Our results show that different forms of visualization serve different display purposes. We categorized them as maps, networks, evolution-based, charts, and others. We also found that LDAvis is the most frequently used software/API, followed by R language packages and D3.js. The primary limitation of this survey is that it is not


I. INTRODUCTION
Topic modeling builds abstract models of the hidden topics in a document collection, where each topic is represented by a distribution of words highly correlated with it [1]. According to Kherwa and Bansal [1], visualization is one of the challenges of topic modeling, alongside interpretability, computational complexity, and output stability. Visualization makes the identified topics easier to understand. Kherwa and Bansal also reported that top-term lists and word clouds are the two most common visualizations in topic modeling. In this paper, we review recent literature on topic modeling that presented its results visually, and we seek to answer the following research questions.
RQ1: What are the different visualizations used to display the result of topic modeling?
RQ2: What visualization software or API is used?
By answering these questions, we aim to help NLP researchers decide which forms of visualization suit their work and which tools can assist them. The answers could also prompt improvements to existing visualizations or, where a suitable form is lacking, the creation of new ones.
In this study, we adopted the PRISMA [2] systematic literature review methodology. Papers published from 2017 to 2022 were extracted, filtered, and analyzed. The search was done via the Web of Science search engine because of its high standards of publication indexing. We therefore expect the extracted papers to be of high quality and to have undergone a rigorous peer-review process.
Topic modeling approaches map a text collection into a low-dimensional topic subspace, in which each topic is a group of words [3]. Topic models are statistical models that reveal the hidden structure of text data. They allow users to search through a large text collection, a digital library, or web material, among other things [3]. They can also describe the corpus in several ways, such as the percentage of each topic in a document, the number of documents that cover a particular topic, and the percentage of the collection that fits into various categories of themes [3].
The remainder of this paper is organized as follows: Section II explains the PRISMA methodology. Section III presents the results of the survey, discusses its limitations, and outlines future directions. Section IV concludes the study.
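As an illustration of the corpus summaries described above (the share of each topic in a document, the number of documents per topic, and the share of the collection per topic), the following is a minimal Python sketch. The document-topic proportions are invented for illustration and do not come from any reviewed paper; the function names are our own, and taking the largest proportion as a document's "dominant" topic is an assumption.

```python
# Hypothetical document-topic proportions: one row per document, one column
# per topic, each row summing to 1. These values are invented for illustration.
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.25, 0.25, 0.50],
    [0.60, 0.30, 0.10],
]


def dominant_topic(row):
    """Return the index of the topic with the largest proportion in a document."""
    return max(range(len(row)), key=lambda k: row[k])


def documents_per_topic(matrix):
    """Count the documents whose dominant topic is each topic."""
    counts = [0] * len(matrix[0])
    for row in matrix:
        counts[dominant_topic(row)] += 1
    return counts


def corpus_share(matrix):
    """Average each topic's proportion over all documents in the collection."""
    n_docs = len(matrix)
    n_topics = len(matrix[0])
    return [sum(row[k] for row in matrix) / n_docs for k in range(n_topics)]


docs_by_topic = documents_per_topic(doc_topic)
shares = corpus_share(doc_topic)
print("documents per dominant topic:", docs_by_topic)
print("corpus share per topic:", [round(s, 4) for s in shares])
```

Summaries of this kind are typically what a chart-type visualization of topic modeling results displays.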

II. MATERIALS AND METHODS
This section explains the steps taken to conduct this systematic literature review based on the PRISMA methodology. The PRISMA method comprises the identification, screening, eligibility, and inclusion phases. In the identification phase, research questions are formulated to reflect the review's focus and motivation. Next, one or more search queries are constructed and the time range is determined. The queries are then run, and the matching publications are extracted into a collection of papers. Any duplicate papers are removed. Each paper is then filtered using predetermined inclusion and exclusion criteria. Papers that meet these criteria are filtered again for eligibility based on their full text. Only the papers that remain are included in the review to answer the formulated research questions. Figure 1 shows the PRISMA method. The following subsections describe the activities performed at each phase of the PRISMA methodology.

A. Research Questions Formulation
In this phase, we formulated research questions intended to help researchers choose a suitable type of visualization by showing which types earlier work on topic modeling has already adopted. The answers also make researchers aware of existing software tools and API libraries.

B. Publication Search and Extraction
For the publication search, we used the Web of Science search engine, which we regard as holding a high standard in academic publication indexing and which adopts the Journal Citation Reports (JCR) ranking. We built a search query that combines the Boolean operators AND and OR. We aimed to find all papers on topic modeling that also presented a visualization of their results. Therefore, in the Advanced Search we searched for "topic modeling" in the publication title and for "visualization" in the abstract. The reasoning is that a topic modeling paper presenting a visualization as a secondary contribution would usually highlight it in the abstract if it were worth mentioning.
In these papers, we expect a useful explanation, justification, or description of the visualization. Where the visualization was not highlighted in the abstract, we considered the visualization effort too minimal to include in the review. The query we used is as follows.
(TI=(topic modelling)) AND ((AB=(visualize)) OR AB=(visualization))
The search was conducted from 23rd to 29th May 2022, with publication years scoped from 2017 to 2022. Finally, the publications were extracted and uploaded to the Mendeley reference manager for filtration.
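To make the Boolean structure of the query concrete, the sketch below applies the same logic (title condition AND either abstract condition, within the 2017 to 2022 range) to a few invented records. It is a simplified substring match and is not how Web of Science evaluates its field tags. The record fields, the function name, and the decision to accept both the "modelling" and "modeling" spellings are assumptions made for illustration.

```python
def matches_query(record, first_year=2017, last_year=2022):
    """Mimic the query logic: TI condition AND (AB condition OR AB condition).

    `record` is a hypothetical dict with "title", "abstract", and "year" keys.
    """
    title = record["title"].lower()
    abstract = record["abstract"].lower()
    # Assumption: accept both spellings of the title phrase.
    in_title = "topic modelling" in title or "topic modeling" in title
    in_abstract = "visualize" in abstract or "visualization" in abstract
    in_range = first_year <= record["year"] <= last_year
    return in_title and in_abstract and in_range


# Invented records used only to exercise the predicate.
records = [
    {"title": "Topic Modeling of News Articles",
     "abstract": "We visualize topics with a map.", "year": 2019},
    {"title": "Topic modelling for tweets",
     "abstract": "A visualization of topics is presented.", "year": 2016},
    {"title": "Deep learning for vision",
     "abstract": "We present a visualization.", "year": 2020},
    {"title": "Topic modeling at scale",
     "abstract": "We report perplexity only.", "year": 2021},
]

selected = [r["title"] for r in records if matches_query(r)]
print(selected)
```

Only the first record satisfies all three conditions: the second falls outside the year range, the third lacks the title phrase, and the fourth does not mention a visualization in its abstract.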

C. Publication Filtration
The next phase of the PRISMA method consists of a two-level filtration of the collected publications to determine their eligibility (see Fig. 2). The first level applied predefined inclusion and exclusion criteria. The second level involved a careful reading of each paper's full text to determine its true relevance to the focus of this review. To be included, a paper must clearly show the visualization, justify its selection, and explain it.
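The bookkeeping behind this two-level filtration can be sketched in a few lines of Python. The counts are those reported in Section III of this review (100 papers identified, no duplicates, six excluded at the first level, and 52 at the second); the function and stage names are our own illustrative choices.

```python
def prisma_flow(identified, duplicates, excluded_level_1, excluded_level_2):
    """Return the number of papers remaining after each stage of the flow."""
    after_dedup = identified - duplicates
    after_level_1 = after_dedup - excluded_level_1
    included = after_level_1 - excluded_level_2
    if included < 0:
        raise ValueError("more papers were excluded than were identified")
    return {
        "identified": identified,
        "after duplicate removal": after_dedup,
        "after criteria-based filtration": after_level_1,
        "included after full-text filtration": included,
    }


# Counts as reported in the results of this review.
flow = prisma_flow(identified=100, duplicates=0,
                   excluded_level_1=6, excluded_level_2=52)
for stage, count in flow.items():
    print(f"{stage}: {count}")
```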

III. RESULTS AND DISCUSSIONS
This section presents the result of this review, namely the list of collected publications found relevant to the review's focus (Table II) and the analysis conducted to answer the formulated research questions. First, Figure 2 shows the number of papers acquired and found eligible throughout the PRISMA phases. Our initial collection identified 100 candidate papers from the Web of Science. No duplicates were found. Filtration was then conducted to ascertain the articles' eligibility. At the first level of filtration, based on the inclusion and exclusion criteria determined earlier, six articles were excluded, leaving 94. At the second level, after a more detailed reading, a further 52 papers were excluded. The number of articles found eligible is therefore 42.

2) Networks:
Studies that involve connections or associations used networks to display them. The networks found in the review (Fig. 4) are the semantic graph (i), network diagram (ii), and sociogram (iii). A semantic graph stores data in a rich format shaped by context and concepts, so that it can reflect complex real-world data. A network diagram is a type of data visualization that helps users comprehend the relationships in the data. A sociogram is a graph that illustrates the connections between individuals in a group to map that group's social network.
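As a rough illustration of the data that sits behind a network-type visualization, the sketch below links two topics whenever they share top terms and weights the edge by the number of shared terms. This is one possible construction and is not taken from any reviewed paper; the topics, their terms, and the `shared_term_edges` helper are hypothetical.

```python
from itertools import combinations

# Hypothetical top terms per topic, invented for illustration.
top_terms = {
    "topic_0": {"health", "patient", "care", "hospital"},
    "topic_1": {"patient", "drug", "trial", "care"},
    "topic_2": {"market", "price", "trade", "stock"},
}


def shared_term_edges(topics, min_shared=1):
    """Build weighted edges between topics that share at least `min_shared` terms."""
    edges = []
    for a, b in combinations(sorted(topics), 2):
        shared = topics[a] & topics[b]
        if len(shared) >= min_shared:
            edges.append((a, b, len(shared)))
    return edges


edges = shared_term_edges(top_terms)
for source, target, weight in edges:
    print(f"{source} -- {target} (shared terms: {weight})")
```

An edge list of this form is a common input for network tools; a topic with no shared terms, such as `topic_2` here, would appear as an isolated node.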

4) Charts:
We categorized all visualizations that do not involve geoinformation, connections, or time progression as charts. They include bar charts, stacked bar charts, bubble charts, and scatterplots, among others. These charts have at least two axes. As they are common, no further elaboration is necessary.

5) Others:
This category contains visualization forms created specifically to serve the purpose of the underlying domain. They are the relation matrix, visual clusters, concentration function, and hierarchy tree. As they are specially purposed, interested readers are invited to refer to the original publications.

B. RQ2: What visualization software or API is used?
Our review shows that LDAvis [4] is the software most frequently used by NLP researchers to visualize topics. It was built using a combination of the R language and D3.js [5]. LDAvis is web-based and interactive: it offers an overview of the discovered topics and also lets users drill into the terms related to each topic. This gives users the flexibility to investigate topic-term relationships based on their relevance.
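The relevance-based ranking mentioned above can be sketched as follows. As we understand the measure from the LDAvis paper [4], the relevance of a term w to a topic t is a weighted combination of the log probability of the term within the topic and the log of its lift, λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)), with the weight λ between 0 and 1. The code below is our own minimal Python illustration, not the LDAvis implementation, and the probabilities are invented.

```python
import math


def relevance(p_term_given_topic, p_term, lam):
    """Weighted sum of log p(w|t) and log lift, where lift = p(w|t) / p(w)."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)


def rank_terms(term_probs, lam):
    """Sort terms by decreasing relevance for one topic.

    `term_probs` maps each term to a pair (p(w|t), p(w)).
    """
    return sorted(
        term_probs,
        key=lambda term: relevance(*term_probs[term], lam=lam),
        reverse=True,
    )


# Invented probabilities: "the" is frequent everywhere, "gene" is rarer
# overall but concentrated in this topic.
term_probs = {
    "the": (0.05, 0.05),
    "gene": (0.03, 0.005),
}

print(rank_terms(term_probs, lam=1.0))  # ranks by p(w|t) only
print(rank_terms(term_probs, lam=0.0))  # ranks by lift only
```

With λ = 1 the ranking follows the within-topic probability, so the generic term comes first; with λ = 0 it follows lift, so the topic-specific term comes first. Intermediate values trade off the two.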
The second most frequently used tools are R packages. The packages used are stm [6], which focuses on structural topic modeling; streamgraph [7], which plots and constructs stream graphs; smacof [8], which implements tools for multidimensional scaling; machine learning packages that offer a variety of algorithms, such as caret (Classification and Regression Training); and statistical packages for statistical analysis.
Another popular API is D3.js, a JavaScript library for interactive data visualizations in web browsers. D3 stands for Data-Driven Documents, and it makes use of Scalable Vector Graphics (SVG), HTML5, and Cascading Style Sheets (CSS). D3 is known to be lightweight and compatible with most web standards, which makes it suitable for web-based visualization of topic models. Other, less frequently used software/APIs are Bokeh, UMAP, Gephi, KNIME, VOSviewer, and NetMiner. The remaining articles either developed their own tools or did not state what was used.

C. Limitations to Review Assessment
This section discusses the limitations of this study. The review is not exhaustive, so there is room for improvement, and future researchers can approach the topic from a different perspective.
The coverage of the search query may appear limited. However, the terms chosen for the query are widely used and were able to tap into a large pool of publications implementing various forms and tools to visualize topic modeling. We do not rule out that publications using less common terms were missed.
The search was executed only on the Web of Science (WoS) platform. We acknowledge this objection and the restriction it entails. Nevertheless, this review was intended to inform researchers of the choices available for visualizing and implementing topic modeling, not to provide a comprehensive list of every existing variety. WoS is a prestigious platform on which one can expect to find state-of-the-art studies.

D. Future Directions
This exercise allowed us to look closely at the wide range of options used to visualize topic modeling results. Their creation appears to have been motivated by the goal of highlighting the essential aspects discovered. Maps, networks, and charts each have features that the others do not share. Research and technological advancement are the underlying forces driving the design of these visualizations, so to guess the future direction is to guess future research interests and technological opportunities. With the ever-increasing generation of data, we expect network-type visualizations to grow in sophistication more quickly, in order to support the expanding and increasingly complex relationships and links in open data. Tools that visualize topic modeling results through APIs are, to a degree, chosen based on the algorithm a researcher used; LDAvis is one example. Preexisting general-purpose tools such as R, which are the building blocks behind these user-friendly APIs, require considerable programming knowledge to build visualizations from scratch. Data science attracts practitioners with diverse backgrounds, some technically adept and others functionally adept. As data science reaches more domains, the latter group will grow to outnumber the former, and more user-friendly tools will be demanded.

IV. CONCLUSION
In conclusion, this paper presented a survey of the different forms of visualization and tools used in recent topic modeling work. The period covered was from 2017 to early 2022. After applying the PRISMA method of systematic review, 42 papers were found eligible for this study. We found that different types of visualization serve different display purposes. They were classified as maps, networks, evolution-based, charts, and others. We also found that the most commonly used software/API is LDAvis, followed by R language packages and D3.js. The primary limitation of this survey is that the final list of reviewed articles is not exhaustive, so some eligible publications may have been missed. The list depended substantially on our university's subscriptions to online databases. Additionally, eligible publications that appeared after the time of writing may not be covered.