Introduction
Software developers are vital to any organization, and suspicious or unusual behavior often arises during software development. New approaches are needed to detect and prevent such behavior. This thesis proposes using knowledge graphs to detect and identify suspect, wrong, or unusual behavior by software developers. Knowledge graphs provide a powerful way to represent, store, and analyze connected data. I plan to use this technology to develop an anomaly detection system that identifies anomalous developer behavior, and to explore how far knowledge graphs can support that task.
Preliminary Literature
The literature review will explore the use of knowledge graphs for anomaly detection and look at existing solutions that could detect anomalies in software developers. This section will also compare existing approaches to solving suspicious, wrong, or unusual behavior in software developers. By doing this, the researcher will better understand the current state of anomaly detection in software developers using a knowledge graph.
Forestiero (2021) describes anomaly detection as identifying patterns and behaviors that are unusual or out of the ordinary, in order to reveal potential system issues or malicious activity. According to Zheng et al. (2022), knowledge graphs have increasingly been used to detect these anomalies because they provide a way to explore relationships between entities. By examining the relationships between entities such as software developers, programming languages, and repositories, it is possible to identify patterns and behaviors that indicate an anomaly. This review also discusses the challenges associated with using knowledge graphs for anomaly detection and suggests potential solutions to address them.
The use of knowledge graphs in software development has been studied extensively in recent years, especially concerning anomaly detection. Farahani et al. (2019) noted that anomaly detection is a process that seeks to identify unusual behavior or patterns in data sets, a definition consistent with Forestiero (2021). In software development, anomalies can refer to malicious or suspicious activities, such as code injection or backdoors, and to more subtle behaviors, such as unexpected coding styles or refactoring activities (Farahani et al., 2019). Previous research has attempted to detect these anomalies by analyzing code repositories, such as those found on GitHub, with various machine-learning algorithms (Wartschinski et al., 2022). However, this approach often yields inaccurate results due to the limited data available from these sources.
In order to improve accuracy, more recent research has sought to leverage knowledge graphs for anomaly detection in software development. A knowledge graph collects connected data points, usually structured as nodes and relationships, enabling efficient mapping of complex datasets (Liu et al., 2022). Researchers can use knowledge graphs to identify unusual or suspicious activities by analyzing the relationships between data points.
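The node-and-relationship structure described above can be illustrated with a minimal sketch. The class name, the entity identifiers (such as dev:alice), and the relation names are hypothetical and chosen only for illustration; a real system would more likely use a dedicated graph database or library.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory store of (subject, relation, object) triples."""

    def __init__(self):
        # subject -> set of (relation, object) pairs
        self._out = defaultdict(set)

    def add(self, subject, relation, obj):
        """Record one directed, labeled edge."""
        self._out[subject].add((relation, obj))

    def objects(self, subject, relation):
        """Return the sorted objects linked to `subject` by `relation`."""
        edges = self._out.get(subject, set())
        return sorted(o for r, o in edges if r == relation)


# Hypothetical example data: developers, repositories, and languages.
kg = KnowledgeGraph()
kg.add("dev:alice", "commits_to", "repo:payments")
kg.add("dev:alice", "writes_in", "lang:python")
kg.add("dev:bob", "commits_to", "repo:payments")
kg.add("dev:bob", "commits_to", "repo:infra")

bob_repos = kg.objects("dev:bob", "commits_to")
```

Storing edges per subject keeps neighbor lookups cheap, which is the operation most relationship-based anomaly checks rely on.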
Deecke et al. (2021) proposed one of the most prominent approaches in this area, “Semantic Anomaly Detection.” This method uses semantic similarity metrics to identify anomalies in software development projects by comparing source code and its associated metadata. Another approach is the “Cumulative Anomaly Detection” framework (Xu et al., 2022). This method uses knowledge graphs to analyze changes in software development activities over time. The framework can identify suspicious activities by monitoring for changes in the frequency and intensity of activities.
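The idea of monitoring changes in activity frequency can be sketched as follows. This is not the method of Xu et al. (2022); it is a simple z-score check used as a stand-in, and the function name, the weekly counts, and the threshold of 3.0 are assumptions made for illustration.

```python
from statistics import mean, pstdev


def frequency_spikes(weekly_counts, threshold=3.0, min_history=3):
    """Return indices of weeks whose activity count deviates from the
    mean of all preceding weeks by more than `threshold` standard
    deviations (a plain z-score test)."""
    flagged = []
    for i in range(min_history, len(weekly_counts)):
        history = weekly_counts[:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # No variation in the history, so the z-score is undefined;
            # this sketch simply skips such weeks.
            continue
        if abs(weekly_counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical commits per week for one developer.
counts = [5, 6, 5, 6, 5, 40, 6]
spikes = frequency_spikes(counts)
```

Comparing each week only against earlier weeks keeps the check usable in a streaming setting, at the cost of a spike inflating the baseline for the weeks that follow it.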
Sagar et al. (2021) presented a method for using knowledge graphs to detect anomalies in software development processes. They created a knowledge graph from commit logs and source code, which was then used to identify anomalies in a project’s development cycle. Moreover, Zoppi et al. (2023) proposed a system for detecting anomalies in software repositories using knowledge graphs. Their system used unsupervised and supervised machine learning models to analyze software metrics. It was able to accurately identify anomalies in software projects, as well as determine their cause.
Research has also been conducted into applying knowledge graphs to anomaly detection in software defect prediction. For example, Bai et al. (2020) proposed a system that utilizes graph embeddings and knowledge graphs to predict software defects. The system was effective in accurately predicting software defects and their causes.
These studies demonstrate the potential of using knowledge graphs for anomaly detection in software development and defect prediction. By leveraging the rich information in software repositories, it is possible to detect suspicious, wrong, or unusual behavior that may otherwise be missed. In this thesis, we aim to further explore this potential by applying knowledge graphs to anomaly detection of wrong or unusual behavior in software developers.
Overall, knowledge graphs show considerable potential for anomaly detection in software development, and the literature reports high accuracy and precision for both Semantic Anomaly Detection and Cumulative Anomaly Detection. With more research on the use of knowledge graphs in software development, further advances in anomaly detection are likely. For example, researchers have suggested combining different methods (e.g., semantic and cumulative) to create hybrid models that could provide more accurate results than either approach alone. Additionally, advances in natural language processing (NLP) could enable machines to better understand source code and its related artifacts, improving their ability to detect anomalies accurately.
Research Questions
- How does anomaly detection using knowledge graphs compare with existing approaches for detecting suspicious or wrong behavior in software developers?
- What factors can be used to measure the accuracy and precision of anomaly detection models based on knowledge graphs?
- How can knowledge graph-based anomaly detection be applied to detect wrong or unusual behavior in software developers?
Methodology
Data Collection
The data for this thesis will be collected from multiple sources, including software development tools, such as GitHub and Stack Overflow, and various software development blogs and forums. This data will be collected using web scraping techniques, allowing for the capture of relevant information.
Data Analysis
Once the data has been collected, it will be analyzed to identify unusual behavior by software developers. To do this, a knowledge graph will be created that contains relationships between different entities, such as developers, technologies, topics, and other resources. Machine learning algorithms will then be applied to this graph to detect anomalies in the data. In addition, natural language processing techniques will be used to analyze text-based data and detect patterns that may indicate unusual behavior. This will include analyzing text from blog posts and forum discussions, as well as code snippets from open-source repositories. Finally, statistical methods will be used to validate the analysis results and draw conclusions about the data.
The Implementation Plan of the Research Methods
The implementation plan for this methodology will involve the following steps:
- Collecting data from software development tools, blogs, and forums using web scraping techniques.
- Creating a knowledge graph with the collected data to visualize relationships between different entities.
- Applying machine learning algorithms on the knowledge graph to identify anomalies and patterns which may indicate suspect or unusual behavior of software developers.
- Applying natural language processing techniques on text-based data to detect patterns that may indicate unusual behavior.
- Applying statistical methods to validate the analysis results further and draw conclusions about the data.
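The graph-building and anomaly-flagging steps above can be sketched in a few lines. Everything here is an assumption for illustration: the event records are invented, the only feature is the number of distinct edges per developer, and a robust z-score based on the median absolute deviation (MAD) stands in for the machine learning algorithms that the thesis will select later.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (developer, relation, target) records, standing in for
# data that would be collected from development tools and forums.
EVENTS = [
    ("alice", "commits_to", "repo-a"), ("alice", "commits_to", "repo-b"),
    ("bob", "commits_to", "repo-a"), ("bob", "commits_to", "repo-b"),
    ("bob", "commits_to", "repo-c"),
    ("carol", "commits_to", "repo-a"), ("carol", "commits_to", "repo-c"),
    ("dave", "commits_to", "repo-b"), ("dave", "commits_to", "repo-c"),
    ("dave", "commits_to", "repo-d"),
] + [("eve", "commits_to", f"repo-{i}") for i in range(12)]


def build_graph(events):
    """Group edges by developer: developer -> {(relation, target)}."""
    graph = defaultdict(set)
    for dev, relation, target in events:
        graph[dev].add((relation, target))
    return graph


def degree_features(graph):
    """One feature per developer: the number of distinct outgoing edges."""
    return {dev: len(edges) for dev, edges in graph.items()}


def mad_outliers(features, cutoff=3.5):
    """Flag developers whose feature has a robust z-score above `cutoff`."""
    values = list(features.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread in the data, nothing can be ranked
    return sorted(
        dev for dev, v in features.items()
        if 0.6745 * abs(v - med) / mad > cutoff
    )


flagged = mad_outliers(degree_features(build_graph(EVENTS)))
```

The median-based score is used here because a single extreme developer barely shifts the baseline, unlike a mean and standard deviation computed over the same small sample.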
Results
The results of this study will be evaluated by analyzing the effectiveness of the anomaly detection model in detecting wrong or unusual behavior of software developers. The evaluation will be based on how accurately the model can identify anomalies in the knowledge graphs used to train the system. The model’s accuracy can be measured by counting the false positives and false negatives that occur while using the model (Ko et al., 2021). The model’s performance can also be tested with other datasets to determine its generalizability. Once the model’s effectiveness is determined, this study’s results will be discussed in terms of their implications for the software development community. Specifically, the findings from this research could provide software developers with an automated tool for promptly detecting potentially suspicious activity. This could help prevent security threats before they cause damage to an organization or individual. Furthermore, the results of this research could lead to further improvements in the technology related to security and knowledge graphs, allowing more accurate models to be developed in the future. Finally, this research could provide insight into techniques that could be used in other security areas, such as financial fraud or data leakage detection. All these aspects should be addressed to ensure that this proposal fulfills its primary goal, which is to enhance security in software development.
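The false-positive and false-negative counts mentioned above are commonly summarized as precision, recall, and F1. The sketch below shows one way to compute them; the function names and the example labels are hypothetical, with 1 marking an anomaly and 0 marking normal behavior.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, and
    true negatives for binary labels (1 = anomaly, 0 = normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN),
    F1 = harmonic mean of the two; 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical ground truth and model output for eight developers.
y_true = [1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

Because anomalies are usually rare, precision and recall tend to be more informative than plain accuracy, which a model can inflate by predicting "normal" for every developer.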
Conclusion and Direction of Future Research
The study will explore the possibility of using knowledge graphs to detect wrong or unusual behavior in software developers. The research is intended to show the potential of this technology for monitoring and managing software development, although more research is needed to determine its effectiveness in this field. Based on the reviewed literature, knowledge graphs have the potential to be an effective tool for identifying unusual behavior within software development processes. By analyzing the data held in a knowledge graph, developers can identify anomalies and react to mitigate the associated risks. Future research should investigate the use of knowledge graphs in other fields, such as security and fraud detection. Additionally, further experiments should test the accuracy and effectiveness of knowledge graphs when applied to different datasets. Finally, further studies should determine how well knowledge graphs scale to larger datasets and what their limitations are in certain situations. With these studies, it will be possible to better understand how knowledge graphs can be applied to anomaly detection.
References
Bai, Y., Xing, Z., Li, X., Feng, Z., & Ma, D. (2020, June). Unsuccessful story about a few-shot malware family classification and siamese network to the rescue. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (pp. 1560–1571). https://dl.acm.org/doi/abs/10.1145/3377811.3380354
Deecke, L., Ruff, L., Vandermeulen, R. A., & Bilen, H. (2021, July). Transfer-based semantic anomaly detection. In International Conference on Machine Learning (pp. 2546-2558). PMLR. https://proceedings.mlr.press/v139/deecke21a.html
Farahani, I. V., Chien, A., King, R. E., Kay, M. G., & Klenz, B. (2019, December). Time series anomaly detection from a Markov chain perspective. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA) (pp. 1000-1007). IEEE. https://doi.org/10.1109/ICMLA.2019.00170
Forestiero, A. (2021). Metaheuristic algorithm for anomaly detection in the Internet of Things leveraging on a neural-driven multiagent system. Knowledge-Based Systems, 228, 107241. https://doi.org/10.1016/j.knosys.2021.107241
Ko, Y., Lee, Y., Azam, S., Munir, F., Jeon, M., & Pedrycz, W. (2021). Key points estimation and point instance segmentation approach for lane detection. IEEE Transactions on Intelligent Transportation Systems, 23(7), 8949-8958. https://doi.org/10.1109/TITS.2021.3088488
Liu, K., Wang, F., Ding, Z., Liang, S., Yu, Z., & Zhou, Y. (2022). Recent Progress of Using Knowledge Graph for Cybersecurity. Electronics, 11(15), 2287. https://doi.org/10.3390/electronics11152287
Sagar, P. S., Alomar, E. A., Mkaouer, M. W., Ouni, A., & Newman, C. D. (2021). Comparing commit messages and source code metrics for the prediction refactoring activities. Algorithms, 14(10), 289. https://doi.org/10.3390/a14100289
Wartschinski, L., Noller, Y., Vogel, T., Kehrer, T., & Grunske, L. (2022). Vudenc: Vulnerability detection with deep learning on a natural codebase for python. Information and Software Technology, 144, 106809. https://doi.org/10.1016/j.infsof.2021.106809
Xu, Z., Yang, T., & Najafi, M. L. (2022). Method of Cumulative Anomaly Identification for Security Database Based on Discrete Markov chain. Security and Communication Networks, 2022. https://doi.org/10.1155/2022/5113725
Zheng, H., Tian, B., Liu, X., Zhang, W., Liu, S., & Wang, C. (2022, August). Data Quality Identification Model for Power Big Data. In Data Science: 8th International Conference of Pioneering Computer Scientists, Engineers, and Educators, ICPCSEE 2022, Chengdu, China, August 19–22, 2022, Proceedings, Part II (pp. 20–29). Singapore: Springer Nature Singapore. https://link.springer.com/chapter/10.1007/978-981-19-5209-8_2
Zoppi, T., Ceccarelli, A., Puccetti, T., & Bondavalli, A. (2023). Which Algorithm can Detect Unknown Attacks? Comparison of Supervised, Unsupervised, and Meta-Learning Algorithms for Intrusion Detection. Computers & Security, 103107. https://doi.org/10.1016/j.cose.2023.103107