This is basically a data collection paper. How did the authors collect more than three million free Android apps (more than 20 terabytes)? The answer: it’s somewhat more delicate than one might have thought. In particular, the crawler must avoid triggering each source’s defenses. Deduplication is also a problem, as is telling a source that has not changed apart from one that has changed in a way the crawler fails to detect.
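To make the deduplication problem concrete, here is a minimal sketch of one common approach: grouping files by a cryptographic hash of their contents, so that the same app fetched from two markets is stored once. This is an illustration only; the review does not say that the authors deduplicate this way, and the function name `dedupe_by_hash` and the sample file names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def dedupe_by_hash(apps: Iterable[Tuple[str, bytes]]) -> Dict[str, List[str]]:
    """Group app names by the SHA-256 digest of their raw bytes.

    Hypothetical helper: byte-identical files share a digest, so each
    key in the result stands for one unique file, and the list holds
    every name under which that file was seen.
    """
    groups: Dict[str, List[str]] = {}
    for name, content in apps:
        digest = hashlib.sha256(content).hexdigest()
        groups.setdefault(digest, []).append(name)
    return groups


if __name__ == "__main__":
    # Made-up sample data: two markets serving the same bytes, plus one other file.
    sample = [
        ("market_a/example.apk", b"identical bytes"),
        ("market_b/example.apk", b"identical bytes"),
        ("market_a/other.apk", b"different bytes"),
    ]
    for digest, names in dedupe_by_hash(sample).items():
        print(digest[:12], names)
```

Exact hashing only catches byte-identical copies. An app that a market has repackaged or re-signed would hash differently, which is part of why deduplication is harder than it first looks.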
They also give some statistics: 60 percent of the apps are from Google Play, and two Chinese markets account for a little less than 20 percent each. While 22 percent of the Google Play apps trigger at least one of the antivirus products at VirusTotal, less than one percent trigger ten or more of them. This is in marked contrast to the Chinese stores (where 33 percent and 17 percent, respectively, trigger at least ten) or another store where 100 percent do.
The dataset is available, though the authors have some important caveats based on “the lack of a clear, universal copyright exemption for research.” The authors use this tool in their own research, which makes it all the more valuable to have the data collection methodology described so clearly.