Microsoft-Malware-Dataset-Visualizations

Microsoft Malware dataset Visualization

Exploratory Data Analysis Of Dataset

Pallavi Yadkikar

Dataset

The data is available online at:

Initial Analysis Questions

  1. Which Microsoft product is most prone to malware attacks?
  2. Does installing antivirus products help to prevent malware attacks?
  3. Do we need a firewall, or are antivirus products alone sufficient to prevent malware attacks?
  4. Does the SmartScreen type play any role in malware attacks?

Data Wrangling using pandas

Before starting the data exploration, I wanted to find out which attributes contribute the most in the dataset. I therefore used Python (pandas, matplotlib, etc.) to implement a Random Forest algorithm for feature selection. This algorithm calculates the importance of each attribute in a dataset. The dataset initially had 85 attributes. After applying the Random Forest algorithm, I obtained the 39 most important attributes, and then used pandas to narrow these down to the top 13 features. The above image shows the top 39 features plotted against their importance scores.
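The feature-ranking step can be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact code used for this project: it assumes scikit-learn's RandomForestClassifier, integer-codes non-numeric columns, and fills missing values with -1. The helper name rank_features and the small demo frame are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def rank_features(df, target, n_estimators=100, seed=0):
    """Return random-forest feature importances, highest first."""
    X = df.drop(columns=[target])
    # Assumption: non-numeric columns are integer-coded; the encoding
    # used in the original analysis is not documented.
    for col in X.columns:
        if not pd.api.types.is_numeric_dtype(X[col]):
            X[col] = X[col].astype("category").cat.codes
    # Assumption: missing values are replaced with a sentinel value.
    X = X.fillna(-1)
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    model.fit(X, df[target])
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False)


# Synthetic demo data (not the real dataset): the label depends only on SmartScreen.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "SmartScreen": rng.choice(["On", "Off", "Warn"], size=n),
    "Firewall": rng.integers(0, 2, size=n),
    "AVProductsInstalled": rng.integers(1, 4, size=n),
})
demo["HasDetections"] = (demo["SmartScreen"] == "Off").astype(int)

ranking = rank_features(demo, "HasDetections")
top_features = ranking.head(2).index.tolist()  # keep the top-k names
print(ranking)
```

On the real data, calling head(39) and then head(13) on the ranking would correspond to the two selection steps described above.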

The above image shows how the attributes are correlated with each other. It reveals relationships between attributes that we can use for further exploration.
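A correlation view like this one can be approximated in pandas with DataFrame.corr. The snippet below is a minimal sketch, assuming the selected features are numeric; the helper top_correlations and the tiny demo frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def top_correlations(df, target, k=5):
    """Return the k columns with the strongest absolute correlation to target."""
    corr = df.corr(numeric_only=True)  # pairwise Pearson correlation
    return corr[target].drop(target).abs().sort_values(ascending=False).head(k)


# Made-up values for illustration only.
demo = pd.DataFrame({
    "HasDetections": [0, 1, 0, 1, 1, 0],
    "IsProtected":   [1, 0, 1, 0, 0, 1],
    "Firewall":      [1, 1, 0, 1, 0, 1],
})

strongest = top_correlations(demo, "HasDetections", k=2)
print(strongest)
```

The full matrix returned by df.corr can also be passed to a heatmap plot to reproduce a picture similar to the one described.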

Discoveries & Insights

This visualization shows which Microsoft application versions are most vulnerable to malware attacks even when they are protected. Color and size show the average of malware detections for each application version. The marks are labeled by AppVersion. The data is filtered on IsProtected.

The above image shows the average of HasDetections for each product name. Color shows details about the product name. The data is filtered on the average of Firewall, which keeps non-Null values only.

This view shows the average malware detection and the average protection enabled for each Platform. Color shows details about Avg HasDetections and Avg IsProtected.

Since SmartScreen was one of the important attributes I found during data wrangling, I wanted to know how SmartScreen is related to malware detection. The image shows the average of HasDetections for each SmartScreen value. Color shows details about SmartScreen. The view is filtered on SmartScreen, which excludes Null values.
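The same aggregation can be reproduced in pandas. This is a small sketch, assuming the column names SmartScreen and HasDetections; the helper mean_detection_rate and the demo rows are made up for illustration.

```python
import pandas as pd


def mean_detection_rate(df, by, target="HasDetections"):
    """Average of the target for each value of `by`, ignoring rows where `by` is null."""
    subset = df.dropna(subset=[by])  # mirrors the "excludes Null values" filter
    return subset.groupby(by)[target].mean().sort_values(ascending=False)


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "SmartScreen":   ["On", "Off", None, "Off", "On", "Warn"],
    "HasDetections": [0, 1, 1, 1, 1, 0],
})

rates = mean_detection_rate(demo, "SmartScreen")
print(rates)
```

Passing a different column name as by (for example a product name or platform column) gives the per-category averages used in the other views.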

The image shows the average of ScreenSize and the average of HasDetections for each SmartScreen value. The view is filtered on SmartScreen, which excludes Null values. From the above image we can say that the larger the SmartScreen size, the more malware attacks there are.

The image shows IsGamer and HasDetections for each Sku Edition. Color shows details about IsGamer and HasDetections. The view is filtered on Sku Edition, which excludes Invalid. If we exclude some outliers such as the Education edition, we find that being a gamer carries a good chance of a malware attack.

The image shows that, across Sku editions, installing a firewall does not really help against malware attacks. This insight might be surprising, because people generally feel that a firewall secures their system from possible malware attacks. The average of Firewall and the average of HasDetections are shown for each Platform. The data is filtered on Microsoft Editions.

The answer to the antivirus question above is yes, because the higher the average AntiVirus ProductStatesIdentifier, the fewer the malware detections. The image shows the trends of the average of AntiVirus ProductStatesIdentifier and the average of HasDetections by number of AntiVirus Products Installed. The view is filtered on the average of AntiVirusProductStatesIdentifier, which keeps non-Null values only.

We know that the windows8 platform had the most malware detections of all Microsoft platforms, so I explored windows8 further. The image shows the trends of Avg. IsProtected and HasDetections for AntiVirus Products Installed, broken down by Platform. Color shows details about Avg. IsProtected and HasDetections. The view is filtered on Platform and AntiVirus Products Installed.

From the above image, we cannot really say what causes Windows 8 to be most prone to malware attacks. Moreover, when the maximum number of antivirus products was installed, malware attacks were also at their highest. Exploring Windows 8 machines with 5 antivirus products further might therefore yield some more solid insights.

The plot shows the average of HasDetections for AV Products Installed, broken down by Platform. The view is filtered on AV Products Installed, which keeps non-Null values only.
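A breakdown like this can be sketched with a pandas pivot table. The column names AVProductsInstalled, Platform and HasDetections are assumptions about how the fields are spelled in the data file, and the helper detections_by_platform and the demo rows are hypothetical.

```python
import pandas as pd


def detections_by_platform(df):
    """Average HasDetections per number of AV products, one column per platform."""
    subset = df.dropna(subset=["AVProductsInstalled"])  # keep non-Null values only
    return subset.pivot_table(
        index="AVProductsInstalled",
        columns="Platform",
        values="HasDetections",
        aggfunc="mean",
    )


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "Platform": ["windows8", "windows8", "windows10",
                 "windows10", "windows10", "windows8"],
    "AVProductsInstalled": [1, 2, 1, 1, None, 2],
    "HasDetections": [1, 0, 0, 1, 1, 1],
})

table = detections_by_platform(demo)
print(table)
```

Each column of the resulting table is one line in the plot; cells with no matching machines are left as NaN.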

Summary

This analysis helped me gain some solid insights into my dataset. The pandas library is really helpful and fast when dealing with a large dataset. I found some really interesting facts and relationships that will help me in further work. In this dataset, attributes such as IsProtected, SmartScreen, SmartScreenSize, AntiVirusProducts and Antivirus State Identifiers are the most important for exploring HasDetections. Surprisingly, Firewall does not provide much protection against malware.