Introduction – Anomaly Detection
Anomaly detection (also known as outlier analysis) is a step in data mining that identifies outliers or irregular patterns that do not conform to expected behaviour. It has a wide range of commercial uses, because anomalous data often reveals crucial events. Use cases range from technical glitches and intrusion detection to health surveillance and fraud detection.
What is an Anomaly?
An anomaly is an unusual behavior or pattern in the data. It often indicates the presence of an error in the system: the observed result differs from the expected result, so the applied model does not fit the given assumptions. Anomalies are divided into three categories, described below:
- Point Anomalies: A single data instance is considered an anomaly when it lies far from the rest of the data.
- Contextual Anomalies: An instance is anomalous only within a particular context in the data. This type is commonly observed in time series problems.
- Collective Anomalies: A collection of related data instances is anomalous as a group, even if the individual instances are not. System logs are a typical source: they contain information about the state of the system, and by analyzing the log data, anomalies can be detected so that the system can be kept secure. This analysis is performed with data mining techniques, because it requires dynamic rules alongside the data mining approach.
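As a rough illustration of the point anomaly category, the sketch below flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold. The function name `point_anomalies`, the threshold of 3.0, and the sample readings are all assumptions made for this example; real data usually needs a more robust method.

```python
from statistics import mean, stdev


def point_anomalies(values, threshold=3.0):
    """Return the indices of values whose absolute z-score exceeds threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings: eleven values near 10 and one far away.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.1, 9.9, 25.0]
print(point_anomalies(readings))  # -> [11]
```

Note that the outlier itself inflates the mean and standard deviation, so with very small samples a plain z-score can fail to cross the threshold; median-based scores are a common alternative.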
Anomaly Detection Techniques
Supervised Learning: Under supervision, a machine learns a function that maps input features to outputs from example input-output pairs. Supervised anomaly detection algorithms are designed to integrate application-specific knowledge into the anomaly detection process.
Unsupervised Learning: Unsupervised learning applies when the machine has no example input-output pairs from which to learn a function that maps input features to outputs. It instead learns by looking for structure in the input features. Since labeled anomalous data is comparatively uncommon, unsupervised approaches are more prevalent in anomaly detection than supervised approaches.
Semi-supervised Learning: Semi-supervised learning techniques are a middle ground, employing a collection of tools that can use both vast volumes of unlabeled data and small amounts of labeled anomalous data. Many real-world anomaly detection use cases are well suited to semi-supervised learning, since there are a large number of regular instances to learn from but only a few labeled anomalous examples.
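One simple way to look for structure without labels is to score each point by how far it sits from its nearest neighbours. The sketch below is only an illustration of that idea; the function name `knn_scores`, the choice of `k=2`, and the sample points are assumptions, and the brute-force search is suitable only for small data sets.

```python
import math


def knn_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical 2-D feature vectors: four clustered points and one isolated point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8.0, 8.0)]
scores = knn_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(most_anomalous)  # -> 4
```

No labels are involved: the isolated point receives the highest score purely because of where it sits relative to the rest of the data.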
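The sketch below illustrates one possible semi-supervised setup, under stated assumptions: a profile (mean and standard deviation) is learned from regular instances only, and a handful of labeled anomalies is used just to calibrate the decision threshold. All names (`fit_profile`, `score`, `pick_threshold`) and the data are hypothetical, and the midpoint rule is one of many possible calibration choices.

```python
from statistics import mean, stdev


def fit_profile(regular):
    """Learn a profile (mean, standard deviation) from regular instances."""
    return mean(regular), stdev(regular)


def score(x, profile):
    """Anomaly score: distance from the profile mean in standard deviations."""
    mu, sigma = profile
    return abs(x - mu) / sigma


def pick_threshold(profile, regular, labeled_anomalies):
    """Midpoint between the highest regular score and the lowest anomaly score."""
    highest_regular = max(score(x, profile) for x in regular)
    lowest_anomaly = min(score(x, profile) for x in labeled_anomalies)
    return (highest_regular + lowest_anomaly) / 2


# Hypothetical data: many regular readings, only two labeled anomalies.
regular = [50, 52, 49, 51, 50, 48, 51, 49, 50, 50]
labeled_anomalies = [60, 38]

profile = fit_profile(regular)
threshold = pick_threshold(profile, regular, labeled_anomalies)


def is_anomaly(x):
    return score(x, profile) > threshold


print(is_anomaly(58), is_anomaly(51))  # -> True False
```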
Applications of Anomaly Detection with Deep Learning
Network Intrusion Detection: The use of computers has increased greatly, and with it the likelihood of cybercrime. Network intrusion detection systems were developed to secure computer systems against such attacks. These systems use data mining techniques together with a signature database. The process is described below:
- First, log files are collected from all sources.
- The log files are pre-processed, i.e., the log data is converted into a structured form.
- Data mining techniques such as Support Vector Machine (SVM), Random Forest, etc. are then applied to the log data to identify patterns.
- The identified patterns are looked up in the common log database and the attack log database.
- If a pattern does not match the common log database, it is classified as an attack log data pattern.
- From the collected patterns, the user identifies the unusual ones as attacks.
- Once the unusual patterns are identified, the attack patterns are stored in the signature database (attack log database).
- If a pattern is already present in the signature database, the system raises an alert.
- Finally, clustering is performed multiple times to identify security attacks on the operating system.
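The lookup part of the steps above can be sketched as follows. This is a simplified illustration, not the system described: it replaces the SVM or Random Forest pattern mining with a hand-written pattern extractor, and the log format, the regular expression, and the names `parse`, `pattern_of`, and `process` are all assumptions made for the example.

```python
import re

# Hypothetical log format: "<ip> <METHOD> <path> <status>"
LOG_LINE = re.compile(r"(?P<ip>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})")


def parse(line):
    """Pre-processing: turn a raw log line into a structured record, or None."""
    match = LOG_LINE.fullmatch(line.strip())
    return match.groupdict() if match else None


def pattern_of(record):
    """Reduce a record to a pattern, masking digits in the path."""
    return (record["method"], re.sub(r"\d+", "<n>", record["path"]), record["status"])


def process(lines, common_db, signature_db):
    """Return (alerts, candidates) for a batch of raw log lines."""
    alerts, candidates = [], []
    for line in lines:
        record = parse(line)
        if record is None:
            continue  # skip lines that cannot be structured
        pattern = pattern_of(record)
        if pattern in signature_db:
            alerts.append(record)       # known attack: raise an alert
        elif pattern not in common_db:
            candidates.append(pattern)  # unknown: pass to the user for review
    return alerts, candidates


common_db = {("GET", "/index.html", "200"), ("GET", "/item/<n>", "200")}
signature_db = {("GET", "/etc/passwd", "403")}
lines = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /item/42 200",
    "10.0.0.9 GET /etc/passwd 403",
    "10.0.0.9 POST /admin/login 401",
    "malformed line",
]

alerts, candidates = process(lines, common_db, signature_db)
print(len(alerts))   # -> 1
print(candidates)    # -> [('POST', '/admin/login', '401')]

# After the user confirms a candidate as an attack, it joins the signature database.
signature_db.update(candidates)
```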
Fraud Detection in Banking Sector: Banks are organizations for depositing and withdrawing money and for obtaining loans. These services are available to everyone, so a proper security mechanism must be in place. Some points to consider before performing fraud detection are listed below:
- Fraudsters have studied the bank's entire procedure.
- They are even expert at copying a customer's signature without raising any doubt.
First, the bank stores historical information about each customer in its database. This storage of data is known as data warehousing. The next step is to analyze the data. After that, association rules are created in the form of if/then patterns. Support and confidence are used to identify relationships in the data. Support indicates how frequently the items occur in the database. Confidence indicates how often the if/then statement is found to be true.
For example, while analyzing the bank's database, association rules are derived for each customer. Suppose a customer named Nilu Sharma never withdraws more than 1 lakh, and her transactions typically occur at two-month intervals. Here, the withdrawal limit of 1 lakh is the support field, and the recurrence of the operation after two months is the confidence field. Therefore, any transaction by Nilu Sharma above 1 lakh will be examined, and further authentication will be performed. If authentication fails, an alert is raised and the bank cancels the transaction.
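To make support and confidence concrete, the sketch below computes both for a single if/then rule using their standard definitions: support is the fraction of transactions containing the items, and confidence is the fraction of transactions matching the "if" part that also match the "then" part. The transaction encoding (string labels such as "amount<=1lakh") and the data are hypothetical.

```python
def count(transactions, items):
    """Number of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)


def support(transactions, items):
    """Fraction of all transactions that contain the items."""
    return count(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions matching the 'if' part, the fraction matching 'then' too."""
    n_antecedent = count(transactions, antecedent)
    if n_antecedent == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / n_antecedent


# Hypothetical transaction history for one customer, already discretized.
transactions = [
    {"amount<=1lakh", "gap~2months"},
    {"amount<=1lakh", "gap~2months"},
    {"amount<=1lakh", "gap~2months"},
    {"amount<=1lakh", "gap<2months"},
    {"amount>1lakh", "gap<2months"},
]

# Rule: if amount<=1lakh then gap~2months
print(support(transactions, ["amount<=1lakh"]))                      # -> 0.8
print(confidence(transactions, ["amount<=1lakh"], ["gap~2months"]))  # -> 0.75
```

A transaction that violates a high-support, high-confidence rule for a customer is the kind of event that would be sent for further authentication.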
Fraud Detection with Deep Learning: Banks have to analyze millions of monetary transactions a day. Without advanced techniques, they cannot examine them properly, because it is difficult to pick out a few fraudulent activities among millions of transactions. A scalable technique that updates the system automatically is therefore needed. A deep learning neural network can detect fraudulent activity automatically, and the system can learn from new data as it arrives without human intervention. Larger institutions and organizations engage in large financial transactions, so open-source deep learning has been introduced to help them fight fraud at an economical cost, using Skymind's tools with deep learning neural networks. Deeplearning4j is an open-source deep-learning library that supports distributed deep learning by integrating with Apache Hadoop and Apache Spark. The library not only detects fraud, anomalies, and patterns in real time but also learns from new data in parallel.
Medical Diagnosis: A number of data points suggestive of health status (e.g., X-rays, MRIs, ECGs) are obtained as part of diagnostic procedures. Medical equipment that patients use (e.g., glucose monitors, pacemakers, smart watches) also gathers some of these data points. Anomaly detection methods can be used to highlight suspicious readings that may indicate health issues or be precursors to medical accidents.
Manufacturing Defect Detection: Quality assurance in the manufacturing sector requires an automated approach to detecting defects, particularly in products produced in large volumes. This task can be seen as an anomaly detection exercise, with the aim of identifying products that differ dramatically, or even marginally, from ideal products that have passed quality assurance checks.