Comparison of Compression Algorithms for Text Data in Data Mining

Text data compression is the process of reducing the size of text by encoding it. In everyday practice we regularly reach the point where the physical limits of storing or transmitting text must be handled more efficiently. Text compression techniques therefore need to be adaptive and robust. Compressing text data without losing the original content is important because it significantly reduces storage and communication costs. This study focuses on different techniques for encoding text files and compares them in the context of data processing. Several memory-efficient encoding schemes are analyzed and implemented in this work: Shannon-Fano coding, Huffman coding, Repeated Huffman coding, LZW coding, and Run-length coding. The analysis shows how these coding techniques work, how much compression each achieves, and how much memory each requires, so that they can be compared to determine which technique is preferable under which conditions. Experiments show that Repeated Huffman coding achieves the highest compression ratio. In addition, the modified Run-length (RLE) coding performs better than the standard version when compressing text data.
Keywords— data compression, data mining, encoding, text data.


I. INTRODUCTION
In computer science and information theory, data compression encodes information using fewer bits or symbols than the original representation. Data compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth [3]. In the present situation, where we face an abundance of data, data mining relies on data refinement and compression for this purpose. The problem, however, is that aggressive compression may cause data loss [2]. Data is one of the most valuable assets of any organization, and the loss of even part of it can lead to serious problems in the organization's decisions and strategies. The purpose here is to present and efficiently implement several data compression algorithms and to compare them with the methods used for data compression in data mining. This involves calculating the important compression factors for each of these algorithms, comparing the different encoding techniques, improving the performance of different data compression techniques, and selecting an encoding method suited to the textual data considered in data mining.
Data mining, which has become very popular in recent years, refers to the use of data analysis tools to discover previously unknown valid patterns and relationships [13]. These tools may include statistical models, mathematical algorithms, and machine learning methods that improve automatically with experience, such as neural networks or decision trees. Data mining is not limited to data collection and management; it also involves the analysis, refinement, and compression of data. Applications that mine data by examining text or multimedia files feed data warehouses, which have become popular [10]. This type of warehouse stores and processes a variety of data types, from text to images, for use in organizational strategies and for a better understanding of data as information. Their advantage over conventional databases is the ability to refine and compress data, preserving important data and discarding trivial data [5]. Each database has its own query language. A long-standing problem in data analysis is the growing size of data sets. This growth demonstrates the need for more efficient compression schemes and for analytical operations that work directly on compressed data. Efficient compression schemes can be designed by exploiting patterns and intrinsic structures in the data [9]. Redundancy in the data is one such feature that can significantly enhance compression.
Text compression reduces text size by encoding information more compactly, so that fewer bits and bytes are needed to store the data [1]. It requires a lossless technique that does not discard any data when compressing a text file, so that decompression returns the text file to its original state [4]. Text compression and decompression techniques are intended for natural-language data, such as large, highly redundant English data sets, and for other data with a similar sequential structure, such as program source code. However, these methods can be applied to any data type to achieve some compression [11]. The importance of text compression lies in reducing storage hardware requirements, data transfer time, and bandwidth: file compression can significantly reduce the cost of hard disk or solid-state storage, improve effective network bandwidth, and save costs [7].
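As a minimal illustration of the lossless round trip described above, the following Python sketch uses the standard-library zlib module; zlib is not one of the algorithms compared in this study, and the sample text is made up, but the sketch shows the essential property that decompression must reproduce the original bytes exactly.

import zlib

# Illustrative input only: a repetitive English sentence, as a byte string.
original = ("Text compression must be lossless: the decompressed output has to "
            "match the original input exactly. " * 20).encode()

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless: the original data is recovered bit for bit.
assert restored == original
print(len(original), "bytes before,", len(compressed), "bytes after compression")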
English text data of various sizes are used to apply the compression and decompression techniques. Text compression reduces data storage space by reducing or eliminating redundancy. The encoding uses fewer bits than the original representation and removes unnecessary information. This is a lossless process in which the number of bits and bytes needed to store the information is cut down.

Previous investigations
Various methods for compressing data have been developed over roughly the last forty years, beginning with the compression algorithms of Shannon, Fano, and Huffman in the late 1940s and early 1950s [10]. In 1949, Shannon and Fano devised a systematic method of assigning code words based on the probabilities of blocks, known as Shannon-Fano coding. Another method of compressing data was introduced in 1951 by Huffman and is known as the Huffman algorithm; it has been widely used for compression ever since.
A review [13] has shown that text data compression with the Shannon-Fano algorithm performs the same as the Huffman algorithm when all characters in a string are repeated, and likewise when the expression is short and only a single character repeats in the text. When the input text is long and the strings or words contain a greater variety of characters, the Huffman algorithm achieves better compression.
In another review [2], the compression ratio, compression time, and decompression time of the RLE, Huffman, Shannon-Fano, Adaptive Huffman, Arithmetic, and Lempel-Ziv-Welch (LZW) coding algorithms were compared using random text files. The results show that compression time increases as text file size increases. On average, the Huffman and Shannon-Fano approaches were less time consuming than the other algorithms. The LZW algorithm worked well only for small files. The compression time of the Huffman algorithm scaled with the size of the encoding and was among the highest. The decompression time of all algorithms was less than 500,000 milliseconds, except for the Adaptive Huffman and LZW algorithms. The compression ratio was similar for small files except for RLE coding, with LZW coding performing best.
In another study [9], a comparison was made between RLE, Huffman, arithmetic coding, and LZW, as well as LZW followed by Huffman and finally RLE, on random doc, txt, BMP, TIFF, GIF, and JPG files. The study showed that LZW and Huffman give almost the same results when used for compressing text files.
In another study [4], data compression algorithms such as LZW, Huffman, fixed-length code (FLC), and Huffman after using fixed-length code (HFLC) were examined on English text files in terms of compressed size, compression ratio, time (speed), and entropy. LZW performed best on all compression measures, followed by Huffman, Huffman after using a fixed-length code (HFLC), and fixed-length code (FLC), with entropies of 4.719, 4.855, 5.014, and 6.889, respectively.
A similar study [5] analyzed the Huffman algorithm and compared it with other common compression techniques such as arithmetic coding, LZW, and RLE, based on their use in different applications and their benefits. Arithmetic coding is very efficient, and RLE significantly reduces file size when runs of identical pixels are frequent. The LZW algorithm, used mostly for TIFF, GIF, and text files, is a fast, lossless algorithm that is very easy to apply, while the Huffman algorithm is used in JPEG compression, producing optimal and compact codes but running relatively slowly.
In another paper [7], Shannon-Fano coding, Huffman coding, Adaptive Huffman coding, RLE, arithmetic coding, LZ77, LZ78, and LZW were tested on the Calgary corpus. Among the statistical compression techniques, arithmetic coding was more effective than the other methods. In a further review, the entropy of an English text file was calculated for Shannon-Fano, Huffman, Run-length (RLE), and Lempel-Ziv-Welch (LZW) coding. The compression ratios of Shannon-Fano and Huffman coding were almost the same, and the two algorithms can save 54.7% of space. The compression performance of the Lempel-Ziv-Welch algorithm is low compared to the Huffman and Shannon-Fano algorithms, and it was concluded that Huffman coding gives the best results for text files.
In another study [12], data was first compressed with run-length-based codes such as the Golomb code, FDR code, EFDR, MFDR, SAFDR, and OLEL coding, and the compressed data was then compressed again with the Huffman code. This double compression using the Huffman code achieved a 50.8% compression ratio, and better results were obtained for data sets with incremental data.
A study [1] analyzed execution time, compression ratio, and compression efficiency in a distributed client-server environment using four compression algorithms: the Huffman algorithm, the Shannon-Fano algorithm, the LZW algorithm, and Run-length coding. A client's data is distributed across multiple processors/servers, compressed by servers at remote locations, and sent back to the client. The Sigrid framework was used, and the results showed that the LZ algorithm achieves better efficiency/scalability and a better compression ratio, but is slower than the other algorithms. Huffman coding, LZW coding, and Huffman-based LZW coding were also compared for single and multiple compression. This showed that Huffman-based LZW coding can compress data more than the other three techniques: the maximum compression ratio of Huffman-based LZW was 4.41, whereas that of plain LZW was 4.17. Huffman-based LZW compression is therefore in some cases better than LZW compression.

II. METHOD
For the experiments, five compression and decompression techniques were used: Shannon-Fano coding, Huffman coding, Repeated Huffman coding, Lempel-Ziv-Welch (LZW) coding, and Run-length (RLE) coding together with its modified version. Text files of different sizes were used as the data set, and the compression ratio was measured to determine which compression method is best.
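Of these techniques, Shannon-Fano coding can be sketched compactly as follows: symbols are sorted by decreasing frequency and recursively split into two groups of roughly equal total frequency, with one group receiving a 0 prefix and the other a 1 prefix. This is an illustrative Python sketch, not the implementation used in the experiments.

from collections import Counter

def shannon_fano_code(text: str) -> dict:
    """Build a Shannon-Fano code table for the symbols of `text`."""
    freq = Counter(text)
    symbols = sorted(freq, key=freq.get, reverse=True)

    def assign(syms: list) -> dict:
        if len(syms) == 1:
            return {syms[0]: ""}
        total = sum(freq[s] for s in syms)
        best_diff, split = float("inf"), 1
        # Find the split point that divides the total frequency most evenly.
        for i in range(1, len(syms)):
            running = sum(freq[s] for s in syms[:i])
            diff = abs(total - 2 * running)
            if diff < best_diff:
                best_diff, split = diff, i
        left, right = assign(syms[:split]), assign(syms[split:])
        return {**{s: "0" + c for s, c in left.items()},
                **{s: "1" + c for s, c in right.items()}}

    return assign(symbols) if len(symbols) > 1 else {symbols[0]: "0"}

print(shannon_fano_code("aaabbc"))   # e.g. {'a': '0', 'b': '10', 'c': '11'}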

III. RESULTS
The compression ratio is the size of the output file relative to the size of the original input file: the smaller the compressed output, the lower the ratio. When the compression ratio exceeds 1, the output file is larger than the original input file.
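As a minimal sketch of how this metric behaves (the byte counts below are made up purely for illustration):

def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Size of the compressed output divided by the size of the original input.
    Values below 1 indicate compression; values above 1 indicate expansion."""
    return compressed_size / original_size

# A 10,000-byte file compressed to 4,530 bytes gives a ratio of 0.453,
# i.e. a 54.7% space saving; 12,500 bytes of output means expansion.
print(compression_ratio(10_000, 4_530))    # 0.453 -> compressed
print(compression_ratio(10_000, 12_500))   # 1.25  -> expanded beyond the input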
Here we can see that, for most input files, the RLE-based codings give a compression ratio greater than 1, meaning that these algorithms expand the original files rather than compressing them. RLE coding works best for files containing runs of consecutive duplicate characters, which typical text files do not have; for this reason the compression ratio is greater than 1.
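A minimal character-level RLE sketch (one of several possible ways to write out the runs) makes this expansion effect visible: text without long runs of repeated characters grows rather than shrinks.

from itertools import groupby

def rle_encode(text: str) -> str:
    """Replace each run of identical characters with <count><character>."""
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(text))

print(rle_encode("aaaaabbbcccccc"))   # '5a3b6c' -> 6 characters instead of 14
print(rle_encode("compression"))      # '1c1o1m1p1r1e2s1i1o1n' -> 20 instead of 11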
The results also show that Shannon-Fano coding performs better than Huffman coding for most inputs, since for some input files the symbol code lengths in Huffman coding become so large that the output files grow beyond the original input files. Overall, Repeated Huffman coding shows the best compression ratio for most files.
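For reference, a compact sketch of Huffman code construction using Python's heapq module is shown below; the input string and resulting table are illustrative only. Repeated Huffman coding, as described in this study, applies such a pass again to the already encoded output, whereas the sketch shows a single pass.

import heapq
from collections import Counter

def huffman_code(text: str) -> dict:
    """Build a Huffman code table: frequent symbols get shorter codewords."""
    freq = Counter(text)
    if len(freq) == 1:                      # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie breaker, {symbol: partial codeword}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_code("this is an example of a huffman tree")
# The most frequent symbols receive the shortest codewords:
print(sorted(codes.items(), key=lambda kv: len(kv[1]))[:3])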
Shorter codes mean the algorithms require less memory. Considering the average code length, there is no value for Run-length coding, because Run-length coding does not build a code table. Among the other techniques, Shannon-Fano coding produces a shorter average code length than Huffman coding and modified Run-length coding, while Repeated Huffman coding shows the best average code length for most files. Likewise, a smaller standard deviation of the code lengths means the algorithm occupies less memory. Again there is no standard deviation for Run-length coding, since no code table is created. Among the other techniques, Shannon-Fano coding has a smaller standard deviation than Huffman coding and modified Run-length coding, and Repeated Huffman coding provides the best standard deviation for most files.
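Given a code table and the symbol frequencies, the two memory-related statistics discussed here can be computed as in the following sketch. The study does not spell out the exact definitions; the sketch assumes a frequency-weighted mean and standard deviation of the codeword lengths, and the toy code table is hypothetical.

from collections import Counter
from math import sqrt

def code_length_stats(text: str, codes: dict) -> tuple:
    """Frequency-weighted mean codeword length and its standard deviation."""
    freq = Counter(text)
    total = sum(freq.values())
    mean = sum(freq[s] * len(codes[s]) for s in freq) / total
    var = sum(freq[s] * (len(codes[s]) - mean) ** 2 for s in freq) / total
    return mean, sqrt(var)

# Toy code table for the string "aaabbc", as a Huffman coder might assign it.
codes = {"a": "0", "b": "10", "c": "11"}
avg_len, std_dev = code_length_stats("aaabbc", codes)
print(round(avg_len, 3), round(std_dev, 3))   # 1.5 0.5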
From the above analysis, we can say that Repeated Huffman coding works best in most cases: it has the smallest compression ratio, average code length, and standard deviation, and therefore appears to be the best algorithm among those tested. Ultimately, it requires less memory than the others for the output file after the final pass.
Repeated Huffman coding requires extra runtime; if that runtime cannot be afforded, Shannon-Fano coding works quite well at a lower runtime. RLE coding can be very effective for files with consecutive duplicate characters. Modified RLE coding can be very useful if the intermediate file created after applying Huffman coding to the input contains consecutive runs of 0s and 1s (see the sketch below). Finally, each compression technique has its pros and cons, and how much compression can be achieved depends on the content of the input files. Table 1 shows the four input files and their sizes after applying the compression algorithms.
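One possible reading of the modified RLE step is sketched below: since the intermediate Huffman output is a bit string, runs of 0s and 1s strictly alternate, so it suffices to store the first bit together with the run lengths. This is an assumption about the variant, not the authors' exact implementation.

from itertools import groupby

def bit_rle_encode(bits: str) -> tuple:
    """Encode an alternating bit string as (first bit, list of run lengths)."""
    if not bits:
        return "", []
    runs = [len(list(group)) for _, group in groupby(bits)]
    return bits[0], runs

print(bit_rle_encode("000011110000000"))   # ('0', [4, 4, 7])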

IV. CONCLUSION
Text compression is very important for data refinement in data mining, because textual data is an important part of the data being mined. It therefore requires a lossless technique, so that no useful information is lost when the data is compressed. The investigations performed in this study computed and compared the compression ratio, mean code length, and standard deviation for Shannon-Fano coding, Huffman coding, Repeated Huffman coding, Run-length coding, and modified Run-length coding, examining how each technique compresses, so that the most effective algorithm can be chosen based on the size of the input text file, the type of content, the available memory, and the execution time in order to obtain the best results. The modified Run-length approach to data compression offers much better compression than the standard Run-length algorithm.