Automatic Summarization of Court Decision Documents Over Narcotic Cases Using BERT

Reviewing court decision documents for references when handling similar cases can be time-consuming. A system that summarizes court decision documents is therefore needed to enable adequate information extraction. This study used 50 court decision documents on narcotics and psychotropics cases taken from the official website of the Supreme Court of the Republic of Indonesia. The dataset was divided into two types: court decision documents with the defendant's identity and court decision documents without the defendant's identity. We used BERT, specifically the IndoBERT model, to summarize the court decision documents. This study uses four IndoBERT models: IndoBERT-Base Phase 1, IndoBERT-Lite-Base Phase 1, IndoBERT-Large Phase 1, and IndoBERT-Lite-Large Phase 1. It also uses three summary ratios (20%, 30%, and 40%) and three ROUGE-N measures (ROUGE-1, ROUGE-2, and ROUGE-3). The results show that the IndoBERT pre-trained model performed better in summarizing court decision documents, with or without the defendant's identity, at a 40% summary ratio. The highest ROUGE score was produced by the IndoBERT-Lite-Base Phase 1 model, with a ROUGE value of 1.00 for documents with the defendant's identity and 0.970 for documents without the defendant's identity at a ratio of 40% in R-1. For future research, it is expected to be able to use other types of BERT models such as IndoBERT Phase-2, LegalBert


I. INTRODUCTION
In the legal context, the term court decision refers to the process by which the court decides on legal disputes and to the record of this process. The term case is also used synonymously. The source is the judge's correct decision. Based on data from the website of the Indonesian Supreme Court, www.mahkamahagung.go.id, the total number of court decisions is 2,400,121, with an average of 206,832 new decisions per year. As of December 2021, the number of decision documents on narcotic and psychotropic cases in the Indonesian Supreme Court's decision directory had reached 276,349 and is still growing. This growth shows that as long as there are legal proceedings, the number of decision documents will continue to increase. The large number of case decision documents indicates widespread narcotic and psychotropic abuse in Indonesia, and people still need to understand the law's impact. Decision documents also have a lengthy structure that takes considerable effort to understand. Therefore, it is necessary to summarize narcotic case decision documents [1].
Automatic document summarization aims to transform the input text into a condensed form that presents the most critical information to users [2]. The task is commonly divided into two paradigms: abstractive summarization and extractive summarization. In abstractive summarization, target summaries contain words or expressions that were not in the original text, which requires various text-rewriting operations to generate them. In contrast, extractive approaches form summaries by copying and concatenating the most important sentences in a document [3].
Automatic summarization has previously been carried out in [4]. In that study, the data were single Arabic documents collected independently and translated from English to Arabic using Google Translate. The summarizer combines statistical and semantic features and applies several pre-processing steps, such as tokenization, normalization, stop-word removal, and stemming. Evaluated with ROUGE-2, the method achieved an F-score of 0.617. This result shows that the proposed machine learning method performs better than Naive Bayes, SVM (with RBF kernel), a two-layer neural network, J48, and Random Forest in recall and F-score, with average increases of 33% and 14%, respectively.
Research conducted by Meena et al. [5] uses both supervised and unsupervised learning algorithms. In this case, the data used for summarization were collected from Amazon product reviews, and the method is TFRSP (Text Frequency Ranking Sentence Prediction). In the first phase, the dataset is processed using the TF-IDF-TR (Term Frequency-Inverse Document Frequency-Text Rank) algorithm, an unsupervised learning algorithm, to produce extractive summaries. In the second phase, the extractive summary is passed to a seq2seq model (a supervised learning algorithm), with training and test datasets, to obtain an abstractive summary. The second phase aims to obtain a practical summary, with performance calculated using Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores. By combining unsupervised (extractive summary) and supervised (abstractive summary) techniques, the TFRSP algorithm provides an accuracy increase of 87.58% compared to traditional text summarization methods.
Automatic summarization of multiple documents sharing the same information was carried out in [6]. In multi-document summaries, two or more introductory sentences may share similar information. Incorporating all the meaningful sentences into the summary yields too much information and may lead to repetition. That study aims to remove similar sentences from multiple documents that share identical information in order to obtain a more concise text summary. The data are online news articles sourced from Tribunnews (www.tribunnews.com) and Detik (detik.com). Thirty articles were used, divided into six categories. Each article category consists of 200 to 700 words. After the data are prepared, the TextRank algorithm extracts important sentences using text similarity measurements. TextRank provides a summary, but the summary still contains similar sentences, so the study uses the Maximal Marginal Relevance (MMR) calculation to reduce them. The MMR output is the final summary, which is then evaluated using ROUGE-1 and ROUGE-2, with average F-scores of 0.5103 and 0.4257, respectively.
In this research, the summarization method is extractive single-document summarization of court decision documents on narcotic and psychotropic substances. However, because there are many types of narcotics and narcotic crimes, this study focused on legally binding court decision documents with ecstasy as evidence and an indictment under Article 114 of Law of the Republic of Indonesia No. 35 of 2009.
The algorithm underlying this study is BERT (Bidirectional Encoder Representations from Transformers). BERT is conceptually simple and empirically robust, and it obtained new state-of-the-art results on eleven natural language processing tasks. BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create a strong model for a given task [7]. Applying pre-trained language models such as BERT to extractive summarization models has also been found beneficial [8]. This study aims to apply pre-trained IndoBERT to automatic summarization of Indonesian documents, specifically narcotic and psychotropic court decisions.

II. MATERIALS AND METHOD
Sophisticated extractive text summarization methods consider a document (or collection of documents) as a set of textual units (e.g., sentences, clauses, phrases) and formulate summarization as a combinatorial optimization problem, i.e., selecting a subset of textual units that maximizes an objective without exceeding any length restriction [9]. For example, Table I shows that a court decision document has a structure that can be viewed as a set of textual units [10]. The method implemented in this research was extractive summarization with the BERT algorithm. BERT is a language representation model that pre-trains deep bidirectional (two-way) representations from unlabeled text by jointly conditioning on both left and right context in all layers [7]. BERT is built on layers of the bidirectional Transformer encoder [11]; in other words, BERT is a stack of Transformer encoders, so the pre-trained BERT model can be fine-tuned by adding only one output layer. This study chose BERT because it is simple to apply, empirically powerful [7], highly beneficial [8], and can further boost the performance of extractive summarization [3].
Several studies have applied BERT to text summarization. For example, one study summarized the CNN / Daily Mail and New York Times datasets using BERT and a Transformer-based decoder. On the CNN / Daily Mail dataset it obtained an R1 of 41.71 and an R2 of 19.49, while on the New York Times dataset it obtained an R1 of 45.33 and an R2 of 26.5 [12].
Another study focused on abstractive and extractive single-document summarization on the CNN / Daily Mail dataset using BERT as the pre-trained encoder. Evaluation using ROUGE F1 yielded a ROUGE-1 of 41.76, a ROUGE-2 of 19.31, and a ROUGE-L of 38.86 [13]. Other research combined summarization models using BERT and OpenAI GPT-2 to provide abstractive and comprehensive keyword-based information from a collection of scientific articles in the COVID-19 Open Research Dataset Challenge. Its extractive summarization produced a higher ROUGE score than its abstractive summarization [14].
The last study created a model to summarize lecture material using BERT for extractive summarization and K-Means clustering to identify the sentences to select. The extractive summarization is still imperfect, but compared to TextRank, BERT shows a steady increase in summary quality and better integration of context with meaningful sentences [15]. Various other studies using BERT for text summarization in various languages have also been carried out [16]-[19].
For summarization in the Indonesian language, IndoBERT provides pre-trained models for document summarization [20]. Some research has been done to verify IndoBERT's accuracy [21], but few studies have used BERT to summarize legal documents. To address this gap, the authors built on previous research using a different dataset and different pre-trained models. Court decision documents were used as the dataset, and the monolingual pre-trained BERT algorithm was used to summarize them. This study aims to assess the performance of IndoBERT in summarizing legal documents. The research was conducted in several stages, as shown in Figure 1.

A. Dataset
The dataset used in this study consists of legal documents dealing with narcotics and psychotropics cases. Narcotics and psychotropics are a class of drugs that the government regulates under strict laws. The law is strict on these drugs because they have great potential for abuse and dependency. Under Indonesian law, narcotics and psychotropics offenses are classified as special criminal acts, as are corruption, money laundering, terrorism, and others. This research focuses on narcotics and psychotropics because these crimes are quite numerous and can be committed by people of all ages, both young and old. The strict law regarding narcotics and psychotropics also aims to prevent harm to the next generation of Indonesians.
The legal basis for narcotics is the Law of the Republic of Indonesia No. 35 of 2009. According to Article 1, paragraph 1 of Law No. 35 of 2009 concerning narcotics, "Narcotics are substances or drugs derived from plants or non-plants, both synthetic and semi-synthetic, which can cause a decrease or change in consciousness, loss of taste, reduce to eliminate pain, and can cause dependence, which is divided into groups as attached to this Law." Narcotics crimes are divided into three groups based on the type of narcotics, and each has its own provisions for sanctions. Article 127, paragraph 1 sets out the punishment that the perpetrator will serve: a maximum prison sentence of 4 years for Group I narcotics, a maximum of 2 years for Group II, and a maximum of 1 year for Group III.
The legal basis for psychotropics is the Law of the Republic of Indonesia No. 5 of 1997. According to Article 1, paragraph 1 of Law No. 5 of 1997 concerning psychotropics, "Psychotropics are substances or drugs, both natural and synthetic, which are not narcotics, which have psychoactive properties through a selective influence on the central nervous system which causes certain changes in mental activity and behavior." Psychotropic users are not immediately imprisoned but are rehabilitated, as stated in Article 39, paragraph 1. Despite their harms, narcotics and psychotropics both have benefits in the medical field when used according to the provisions.
The dataset of court decision documents was taken from the official website of the Supreme Court of the Republic of Indonesia. Because of the large number of narcotics decision documents, this study took only a sample of 50 documents: decisions on special criminal cases of narcotic and psychotropic substances that contain criminal charges under Article 114, involve ecstasy as evidence, and have permanent legal force.
The documents in the dataset are in PDF format. Each document is converted into TXT format using Python before being processed by the model. The conversion uses components provided by the pdfminer library: PDFResourceManager, PDFPageInterpreter, PDFPage, TextConverter, and LAParams. The purpose of using this library is to change the file format from PDF to TXT and to organize the sentence structure. The conversion from PDF to TXT goes through several stages, as shown in Figure 2. The initial step is case folding.
Case folding is a simple text-processing step, and despite its simplicity, it works effectively. Its purpose is to change all uppercase letters in a document to lowercase. In this process, only the letters a to z are accepted; characters other than these letters are omitted and treated as delimiters. The case folding process does not use external libraries, only the functions and modules available in Python.
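The case folding rule described above can be illustrated with a minimal sketch. It assumes the rule is applied exactly as stated (lowercase everything, keep only a to z, treat anything else as a delimiter) and uses only the Python standard library; the function name `case_fold` is hypothetical and is not taken from the study's code.

```python
import re


def case_fold(text: str) -> str:
    """Lowercase the text and treat every character outside a-z as a delimiter."""
    lowered = text.lower()
    # Each run of non-letter characters (digits, punctuation, whitespace)
    # collapses into a single space that separates the remaining words.
    return re.sub(r"[^a-z]+", " ", lowered).strip()


print(case_fold("Terdakwa DITAHAN sejak 12-03-2021!"))  # terdakwa ditahan sejak
```

Note that applying this rule literally also removes digits and sentence-ending punctuation, so in a pipeline that later splits text into sentences it would have to run after sentence boundaries are recorded.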
The extracted data sometimes contain double spacing that can ruin the structure and layout of the document to be processed by the model, which in turn can affect model performance in automatic document summarization. This research uses a whitespace-removal function to solve the double spacing problem.
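A whitespace-removal function of the kind mentioned above might look like the following sketch. The exact function used in the study is not given, so the name `collapse_spaces` and the decision to also drop empty lines are assumptions made for illustration.

```python
import re


def collapse_spaces(text: str) -> str:
    """Collapse runs of spaces and tabs within each line and drop blank lines."""
    cleaned_lines = []
    for line in text.splitlines():
        # Replace any run of spaces or tabs with one space, then trim the ends.
        cleaned = re.sub(r"[ \t]+", " ", line).strip()
        if cleaned:
            cleaned_lines.append(cleaned)
    return "\n".join(cleaned_lines)


raw = "Menimbang,  bahwa   terdakwa\n\n  hadir "
print(collapse_spaces(raw))
```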
In addition to case folding and white space removal, the headers and footers of the documents used in this study were also removed. Headers and footers were eliminated because every legal document has them, whereas automatic summarization of decision documents focuses on the document's contents.
Data already in TXT format were then pre-processed again before being input into the BERT model. This second pre-processing stage came in two versions: version 1 used the pre-processing included in the IndoBERT library, and version 2 added stemming and case folding to the existing pre-processing. Pre-processing was carried out on the dataset separated into documents with the defendant's identity and documents without the defendant's identity.

1) Tokenization:
Tokenization is the first step in most text-processing jobs [22]. It is the task of separating a full-text string into a list of separate words [23]. Another study defines tokenization as a type of lexical analysis that breaks text down into words, phrases, symbols, or other meaningful elements called tokens [24].
2) Word Embedding:
IndoBERT was trained on the Indo4B dataset, which contains around 4 billion words and 250 million sentences [25]. First, this dataset was used to build the fastText model. The fastText embeddings were pre-trained using a skip-gram word representation and produce a 300-dimensional embedding vector [26]. All the embeddings needed for each task were then created from the pre-trained fastText model and cover the whole vocabulary [20]. The tokenization output was then converted into vectors using word embedding.
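To make the tokenization step in 1) concrete, the sketch below splits a sentence into word and punctuation tokens with a regular expression. This is only an illustration of the general definition quoted from [23] and [24]: IndoBERT ships its own subword tokenizer, which is not reproduced here, and the function name `tokenize` is hypothetical.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenize("Terdakwa membeli 10 butir ekstasi."))
# ['Terdakwa', 'membeli', '10', 'butir', 'ekstasi', '.']
```

In the actual pipeline, each token (or subword piece) would then be mapped to its embedding vector as described in 2).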

B. BERT Extractive Summarization
This study uses pre-trained monolingual (Indonesian) models. We used four pre-trained models in this research, namely 1) IndoBERT-Base Phase 1; 2) IndoBERT-Lite-Base Phase 1; 3) IndoBERT-Large Phase 1; 4) IndoBERT-Lite-Large Phase 1. These pre-trained models were chosen because the monolingual model learns sentiment-level semantics better in everyday and formal language than the multilingual model. The pre-trained models differ from one another in vocabulary size. The experimental results also showed that the larger model had a performance advantage over the smaller model [20].
In addition to vocabulary size, several other characteristics differentiate the BERT pre-trained models: the number of layers, hidden units, attention heads, and parameters [27]. These characteristics can be seen in Table II. The summarization process using BERT was split into two cycles. The first summarized the court decision documents with the defendant's identity, while the second summarized the court decision documents without the defendant's identity. Documents were divided into two types because identity is fixed information that cannot be summarized. The separation was performed to see how well BERT extracts meaningful information from a document when the identity is present and when it is absent.
This study tried three summary ratios (20%, 30%, and 40%) to find the best performance of BERT in summarizing legal documents. These ratio values were chosen to avoid too large a difference in the number of sentences between the system summaries and the expert reference summaries.
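The text does not spell out how sentences are chosen for a given ratio, so the following is only a sketch under stated assumptions: sentence embeddings are already available (toy 2-D vectors stand in for IndoBERT outputs), sentences are ranked by cosine similarity to the document centroid, and the number kept is the ratio times the sentence count, rounded, with a minimum of one. The names `cosine` and `select_sentences` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_sentences(sentences: List[str],
                     embeddings: List[Sequence[float]],
                     ratio: float) -> List[str]:
    """Keep the `ratio` share of sentences closest to the document centroid."""
    n = len(sentences)
    k = max(1, round(n * ratio))  # number of sentences in the summary
    dim = len(embeddings[0])
    centroid = [sum(vec[i] for vec in embeddings) / n for i in range(dim)]
    ranked = sorted(range(n),
                    key=lambda i: cosine(embeddings[i], centroid),
                    reverse=True)
    # Restore document order so the summary reads naturally.
    return [sentences[i] for i in sorted(ranked[:k])]


sentences = ["s0", "s1", "s2", "s3", "s4"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2], [-1.0, 0.0]]
print(select_sentences(sentences, embeddings, 0.4))  # ['s1', 's3']
```

With five sentences, ratios of 20% and 40% keep one and two sentences respectively, which shows how the ratio directly controls summary length relative to the source document.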

C. Testing Scenario
The summarization results were measured using Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [28]. The ROUGE score measures the similarity between two summaries by counting the number of overlapping units [29]. We chose ROUGE-N, as represented in (1).
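As an illustration of how an n-gram overlap score of this kind can be computed, the following is a minimal sketch of recall-style ROUGE-N for one system summary and one reference summary: the number of reference n-grams that also appear in the system summary (with clipped counts), divided by the total number of reference n-grams. It assumes whitespace tokenization; the function names `ngrams` and `rouge_n_recall` are hypothetical, and the evaluation tooling actually used in the study is not specified in the text.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int) -> float:
    """Share of reference n-grams that also occur in the candidate summary."""
    ref_counts = ngrams(reference.split(), n)
    cand_counts = ngrams(candidate.split(), n)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Clip each match so a repeated n-gram is not counted more often
    # than it appears in the candidate.
    overlap = sum(min(count, cand_counts[gram])
                  for gram, count in ref_counts.items())
    return overlap / total


reference = "terdakwa terbukti bersalah menjual ekstasi"
candidate = "terdakwa terbukti menjual ekstasi"
print(rouge_n_recall(candidate, reference, 1))  # 0.8
print(rouge_n_recall(candidate, reference, 2))  # 0.5
```

In this toy pair the score drops from R-1 to R-2 because bigrams must match in word order as well as in content.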

A. Data Processing
The total data collected from the court's decision directory website was 50 court decision documents on specific crimes of Narcotics and Psychotropics. Before converting the document dataset from pdf to txt, the watermark, header, and footer were manually removed. Furthermore, the dataset was divided into two. Document A represents a document with an identity, and document B represents a document without an identity. After the data processing, the data were converted before being submitted into the system.

D. Evaluation
The evaluation scenario used in this study was intended to test each pre-trained model for each summary ratio on both types of documents. The number of NGRAM used was N=4.
The tests were conducted using ROUGE-N on both document datasets. Doc A represents the court decision documents with the defendant's identity, and Doc B represents the court decision documents without the defendant's identity. The ideal ROUGE score was defined as a balanced ROUGE score between Doc A and Doc B at the same summarization ratio and the same N-gram order. Of the four pre-trained models tested, the IndoBERT-Base Phase 1 model, as shown in Table III, reached the ideal ROUGE score at all summary ratios (20%, 30%, and 40%) for R-1. The highest ROUGE score for Doc A also fell under this criterion and occurred at R-1 with a 40% summary ratio. In the IndoBERT-Lite-Large Phase 1 model, as shown in Table IV, the ideal ROUGE score increased and still occurred at a summary ratio of 40% in R-1. Meanwhile, the highest ROUGE scores for both Doc A and Doc B increased but remained within the same criteria.
In the IndoBERT-Large Phase 1 model shown in Table V, there was a decrease in the ideal ROUGE score; however, the decreased score was insignificant and still occurred at the 30% summary ratio in R-1. In the ROUGE scores generated in Table VI, the pre-trained model used was IndoBERT-Lite-Base Phase 1. This model found the ideal and highest ROUGE score at a summary ratio of 40% at R-1.
The 20% and 30% summary ratios showed a pattern that indicates the best N-gram order for evaluating the system summaries against the expert summaries for both types of documents: the greater the value of N, the lower the evaluation result. R-1 had the best evaluation value for these two ratios in each record because, for larger N, ROUGE-N takes into account the order of words in the sentence as well as the similarity of words. Meanwhile, the 40% ratio did not show a specific pattern indicating which N-gram order gave the best value when evaluating the document summaries against the expert summaries. However, the best result from the ROUGE-N evaluation at the 40% ratio was still R-1 for court decision documents both with and without the defendant's identity.
Next, we examined which document type yielded the best BERT summary performance. At the 20% and 30% ratios, the ROUGE scores showed no consistent pattern, making it difficult to determine which type of decision document should be used as input for the BERT summarization program. At the 40% ratio, however, the ROUGE score for Doc A (documents with the defendant's identity) was clearly higher than for Doc B (documents without the defendant's identity). With a 40% summarization ratio, Doc A had the highest ROUGE score among all ratios and document types.
From the analysis of Table VI, it can be concluded that 40% is the most suitable ratio for summarizing court decision documents using the IndoBERT pre-trained model, and that R-1 is the best N-gram setting when ROUGE-N is used as the evaluation method. To improve the summarization result, the defendant's identity should be separated from the document's contents before the document enters the system, so that unimportant information is not extracted and included in the summary.

B. Summary Result
At every ratio, the summaries of Doc A and Doc B remain readable and capture the essential points of the court decision document. However, some unimportant information still enters the summaries. Based on the results of this study, BERT can be implemented in courts to improve the digitization and filtering of documents based on evidence or certain types of cases.
Some sentence-extraction errors can distort the summary results. These may be caused by the manual removal of document watermarks, headers, and footers and by the manual document conversion. In addition, some unimportant information was not completely erased and became mixed into the text after conversion to a .txt file.
For further similar research, it is essential that cleaning and conversion be performed carefully during data processing, bearing in mind that the template of court decision documents of the Republic of Indonesia has a watermark on every page. Without proper cleaning, the system will fail to read the data properly, and summary results will be compromised.

IV. CONCLUSION
The rapid development of digital storage technology has triggered a surge in electronic documents, including court decision documents. The large number of documents can be a problem for law enforcement officers, such as lawyers or judges, who intend to find or compare similar cases. This research used the BERT (Bidirectional Encoder Representations from Transformers) algorithm to summarize documents automatically. The dataset consisted of 50 criminal court decision documents on narcotic and psychotropic substances containing criminal charges under Article 114 and evidence of ecstasy. Only Indonesian-language documents that have permanent legal force were used. The results show that IndoBERT-Lite-Base Phase 1 performed better in summarizing court decision documents, with or without the defendant's identity, at a 40% summary ratio, and that the best evaluation setting for the summarization result is ROUGE-N with N=1. This research is expected to serve as a reference for building an automatic document summarization system and as a basis for decision-making by law enforcement officers investigating similar cases.
There are many opportunities for further research using a monolingual pre-trained encoder such as IndoBERT. The BERT algorithm is still a relatively new language processing approach and can be fine-tuned for tasks such as summarization, classification, and question answering. When conducting similar research, it is important to carefully clean watermarks, headers, and footers at the early stages and when converting documents from .pdf to .txt files. Additional pre-processing steps and different types of datasets can also be explored to assess BERT's performance in summarizing other types of text or documents.