Automated Comparison of Texts and Plagiarism Search Based on Frequency Analysis

Marek Kowalski¹, Dorota Narojczyk², Marek Szczepański¹
¹ Cardinal Stefan Wyszyński University in Warsaw
² University of Finance and Management in Warsaw

This presentation aims to show methods of overcoming the most important difficulties in automated text comparison based on frequency analysis. The most important one is that by using the standard cosine similarity measure we “consider the global similarity of documents which may not lead to detecting plagiarism”, see [1, p. 9]. We’ll deal with a refined mathematical model for information gathering, processing and similarity evaluation involving the term frequency (TF) and inverse document frequency (IDF) vectors. We’ll focus on the following issues:

1. Copyright and other legal limitations;
2. Safe indexing of the reference texts database (RTD);
3. Cascade clustering of the RTD;
4. Automated fragmenting of input texts;
5. Fast detection of most relevant clusters (MRC) based on their centroids;
6. Fast preselection (i.e., similarity search in the MRC) involving modified cosine measures;
7. Eliminating floating point operations from the preselection;
8. Eliminating false-positive similarities.

In practical implementations the initial clustering of the RTD has to be created according to appropriate standards, e.g., the International Standard Classification of Education, see [5]. Once the initial clustering is made, the process of clustering can be automated by using the Rocchio algorithm, see [3, 4].

Given an input text $T$ we define the MRC as the union of tokenized and lemmatized clusters $B_{i_1}, \dots, B_{i_k}$ of $RTD = B_1 \cup \dots \cup B_n$ such that the values of the maxima

$$\max_{1 \le i \le n} \frac{\sum_{s \in t} w_t(s)\, c_i(s)}{\sqrt{\sum_{s \in t} (c_i(s))^2}\ \sqrt{\sum_{s \in t} (w_t(s))^2}}$$

are bigger than given numbers and attained for $i = i_1, \dots, i_k$ when $t$ ranges over the fragments of $T$.
Here $s$ is a lemmatized word in $t$, the numbers $\{c_i(s)\}_{s \in t}$ form the centroid of $B_i$, see [2, 4], and

$$w_t(s) = tf(s, t) \cdot \log\frac{\#RTD}{\#RTD(s)},$$

where $tf(s, t)$ is the number of appearances of $s$ in $t$, $RTD(s)$ is the subset of RTD consisting of the texts containing $s$, and $\#X$ denotes the number of elements in $X$.

After qualifying a tokenized and lemmatized text $t$ to a cluster $B$ we search for similar (tokenized and lemmatized) elements $y$ in $B$ employing the quantities $I(t, y)$, $C(t, y)$, $R(t, y)$ given below:

$$I(t, y) = \frac{\sum_{s \in t \cap y} w_t(s)\, w_y(s)}{\sqrt{\sum_{s \in t \cap y} (w_t(s))^2\ \sum_{s \in t \cap y} (w_y(s))^2}},$$

$$C(t, y) = \frac{\sum_{s \in t \cap y} \min(w_t(s), w_y(s))}{\min\left(\sum_{s \in t \cap y} w_t(s),\ \sum_{s \in t \cap y} w_y(s)\right)}.$$

Here $t \cap y$ stands for the set of those words which appear in both $t$ and $y$, and the weights are given by the formula $w_x(s) = tf(s, x) \cdot r(s)$ for $x \in \{t, y\}$, where $r(s)$ is the rank of $s$ in $B$. In the standard formulation

$$r(s) = \log(IDF(s)), \qquad IDF(s) = \frac{\#B}{\#B(s)},$$

where $B(s)$ is the subset of $B$ consisting of the texts containing $s$. Alternatively one may use discrete ranks assuming the values $0, 2^{0}, 2^{-1}, \dots, 2^{-k}$ with a fixed $k \in \mathbb{N}$, which leads to a serious reduction of computational costs.

To define $R(t, y)$ we assume that $t \cap y = \{s_1, s_2, \dots, s_m\}$ and we consider the text $ind(t, y) = \{i(s_1), i(s_2), \dots, i(s_m)\}$ consisting of the words 0, 1, 2 formed according to the rule

$$i(s) = \begin{cases} 0, & \text{if } w_t(s) = w_y(s), \\ 1, & \text{if } w_t(s) \ne w_y(s) \text{ and } w_t(s) = \min\{w_t(s), w_y(s)\}, \\ 2, & \text{if } w_t(s) \ne w_y(s) \text{ and } w_y(s) = \min\{w_t(s), w_y(s)\}. \end{cases}$$

For $e \in \{1, 2\}$ we consider the text $ind(t, y, e)$ created from $ind(t, y)$ by eliminating all appearances of $e$. We now set

$$R(t, y) = \frac{2 \max\{\mathrm{length}(ind(t, y, 1)),\ \mathrm{length}(ind(t, y, 2))\}}{\mathrm{length}(t \cap y)} - 1.$$

To measure similarity between $t$ and $y$ we can use any mapping $\varphi: [0,1]^3 \to [0,1]$ which is an increasing function of each argument when the other two arguments are fixed.
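As a rough illustration of the quantities defined so far, the following Python sketch computes the weights $w_x(s)$, the cosine between a fragment and a cluster centroid, and the triple $(I, C, R)$ for two tokenized and lemmatized texts. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dictionary representation of ranks and centroids, and the handling of empty or zero-weight intersections are hypothetical choices. For $R$ it assumes that the maximum is taken over $ind(t, y, 1)$ and $ind(t, y, 2)$, which keeps the value in $[0, 1]$.

```python
import math
from collections import Counter


def weights(tokens, rank):
    """w_x(s) = tf(s, x) * r(s) for every lemma s occurring in the text x.

    `rank` maps a lemma to r(s); lemmas missing from it get rank 0
    (an assumption made for this sketch).
    """
    tf = Counter(tokens)
    return {s: n * rank.get(s, 0.0) for s, n in tf.items()}


def centroid_cosine(w_t, centroid):
    """Cosine between fragment weights and a centroid, summed over s in t."""
    dot = sum(w * centroid.get(s, 0.0) for s, w in w_t.items())
    norm_t = math.sqrt(sum(w * w for w in w_t.values()))
    norm_c = math.sqrt(sum(centroid.get(s, 0.0) ** 2 for s in w_t))
    return dot / (norm_t * norm_c) if norm_t and norm_c else 0.0


def preselection_scores(w_t, w_y):
    """Return (I, C, R) computed over the common vocabulary t ∩ y."""
    common = [s for s in w_t if s in w_y]
    m = len(common)
    if m == 0:
        return 0.0, 0.0, 0.0  # assumption: no shared words means no similarity

    dot = sum(w_t[s] * w_y[s] for s in common)
    sq_t = sum(w_t[s] ** 2 for s in common)
    sq_y = sum(w_y[s] ** 2 for s in common)
    denom_i = math.sqrt(sq_t * sq_y)
    score_i = dot / denom_i if denom_i else 0.0

    overlap = sum(min(w_t[s], w_y[s]) for s in common)
    denom_c = min(sum(w_t[s] for s in common), sum(w_y[s] for s in common))
    score_c = overlap / denom_c if denom_c else 0.0

    # ind(t, y): 1 where w_t is strictly smaller, 2 where w_y is strictly smaller.
    n1 = sum(1 for s in common if w_t[s] < w_y[s])
    n2 = sum(1 for s in common if w_y[s] < w_t[s])
    # Removing all 1s (or all 2s) leaves m - n1 (or m - n2) symbols.
    score_r = 2 * max(m - n1, m - n2) / m - 1

    return score_i, score_c, score_r


# Usage with discrete ranks of the form 2^-j (exactly representable as floats).
rank = {"a": 1.0, "b": 0.5, "c": 0.25}
w_t = weights(["a", "a", "b", "c"], rank)
w_y = weights(["a", "b", "b", "d"], rank)
print(preselection_scores(w_t, w_y))
```

The weight comparison in $R$ relies on exact equality, which is one reason discrete ranks are convenient here: with logarithmic ranks a tolerance would probably be needed.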
The texts $t$ and $y$ are considered to be similar if

$$\varphi(I(t, y), C(t, y), R(t, y)) > 0.5.$$

In extensive tests and simulations we obtained very good results for

$$\varphi(x, y, z) = g(\max\{x, yz\}),$$

where

$$g(u) = 1 - \left(1 - \left(1 - \frac{2 \arccos(u)}{\pi}\right)^{q}\right)^{1/q},$$

with $q \approx 2$.

Bibliography

[1] S. Alzahrani, N. Salim, A. Abraham, Understanding plagiarism linguistic patterns, textual features, and detection methods, IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, Vol. XX, No. XX, pp. 1-17, 2011.
[2] C. Buckley, G. Salton, J. Allan, The effect of adding relevance information in a relevance feedback environment, International ACM SIGIR Conference, pp. 292-300, 1994.
[3] J. Rocchio, Relevance feedback in information retrieval, in: The SMART Retrieval System: Experiments in Automatic Document Processing, G. Salton, ed., Prentice-Hall, pp. 313-323, 1971.
[4] M. Szczepański, Algorytmy klasyfikacji tekstów i ich wykorzystanie w systemie wykrywania plagiatów, Oficyna Wydawnicza Politechniki Warszawskiej, ISBN 978-83-7814-189-1, 2014.
[5] http://www.uis.unesco.org/Education/Pages/international-standard-classification-of-education.aspx (opened March 20, 2015).