Code analysis tools for taint tracking – statically, dynamically or hybrid – are only as good as the definition of sources and sinks. The tools check if there is a potential flow between a source and a sink and inform the analyst about their findings. We checked different code analysis tools in the area of Android and found out that all tools do only contain a hand-picked amount of sources and sinks. This gave us the motivation to create a novel tool for the fully automated generation of Android sources and sinks.
We wrote a technical report SuSi: A Tool for the Fully Automated Classification and Categorization of Android Sources and Sinks that describes the details of our approach.
Abstract:
Today’s smartphone users face a security dilemma: many apps they install operate on privacy-sensitive data, although they might originate from developers whose trustworthiness is hard to judge. Researchers have proposed more and more sophisticated static and dynamic analysis tools as an aid to assess the behavior of such applications. Those tools, however, are only as good as the privacy policies they are configured with. Policies typically refer to a list of sources of sensitive data as well as sinks which might leak data to untrusted observers. Sources and sinks are a moving target: new versions of the mobile operating system regularly introduce new methods, and security tools need to be re- configured to take them into account.
In this work we show that, at least for the case of Android, the API comprises hundreds of sources and sinks. We propose SuSi, a novel and fully automated machine-learning approach for identifying sources and sinks directly from the Android source code. On our training set, SuSi achieves a recall and precision of more than 92%. To provide more fine-grained information, SuSi further categorizes the sources (e.g., unique identifier, location information, etc.) and sinks (e.g., network, file, etc.), with an average precision and recall of about 89%. We also show that many current program analysis tools can be circumvented because they use hand-picked lists of source and sinks which are largely incomplete, hence allowing many potential data leaks to go unnoticed.
Where can I find more information?
More information can be found here.
Is the tool available online?
Yes! The tool is open source tool and can be downloaded from GitHub and here.