Document Similarity of Czech Supreme Court Decisions

Tereza Novotná

Abstract

Retrieving court decisions that deal with a similar legal matter is a common task for lawyers, as it is part of reviewing the relevant decision-making practice. Despite the natural language processing methods currently available, this legal research is still mostly done through Boolean searches or contextual retrieval. This study experimentally verifies whether the doc2vec method, together with cosine similarity, can automatically retrieve Czech Supreme Court decisions dealing with a legal issue similar to that of a given decision. Furthermore, the limits and challenges of these methods and of their application to Czech Supreme Court decisions are discussed.
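The retrieval step named in the abstract, ranking decisions by cosine similarity to a given decision, can be illustrated with a minimal sketch. It assumes that each decision has already been mapped to a fixed-length vector by some doc2vec model, which is not shown here. The function names, decision identifiers and vector values below are hypothetical toy data, not the study's implementation or results.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_id, vectors, top_n=3):
    """Rank all other documents by cosine similarity to the query document."""
    query = vectors[query_id]
    scored = [
        (doc_id, cosine_similarity(query, vec))
        for doc_id, vec in vectors.items()
        if doc_id != query_id
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical document vectors (assumption: in practice these would be
# inferred by a trained doc2vec model and have far more dimensions).
decision_vectors = {
    "decision_A": [0.9, 0.1, 0.0],
    "decision_B": [0.8, 0.2, 0.1],
    "decision_C": [0.0, 0.1, 0.9],
    "decision_D": [0.1, 0.9, 0.2],
}

ranking = most_similar("decision_A", decision_vectors, top_n=2)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Cosine similarity depends only on the direction of the vectors, not on their length, which is one common reason for pairing it with document embeddings.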

Keywords

Automatic Court Decisions Processing; Cosine Similarity; Czech Supreme Court; Document Semantic Similarity; Doc2Vec


https://doi.org/10.5817/MUJLT2020-1-5



Copyright (c) 2020 Masaryk University Journal of Law and Technology