In the field of cybersecurity, automation is not just a convenience; it’s a necessity. Whether you’re dealing with a handful or a plethora of files, manually scanning each one is neither efficient nor practical. This article aims to guide you through the process of automating file scans by calculating their hash values and leveraging the VirusTotal API with Python for swift and efficient checks.
At a glance, this might seem like the ultimate solution. After all, if a sample’s hash is already catalogued in the VirusTotal database, you’ve saved yourself significant time and effort. But here’s the catch: a hash lookup only identifies known threats, so it’s not foolproof. Sophisticated malware, especially polymorphic or metamorphic variants, is designed to alter its own code so that each instance produces a different hash. This tactic lets it slip past basic hash-based detection, potentially leading to dangerous false negatives.
So, while automating your initial scans with VirusTotal can be a powerful first line of defence, it’s crucial to approach negative results with a healthy dose of scepticism. Dive in as we explore this automation process and delve deeper into the nuances of malware that can elude basic hash recognition.
File Hashes in Malware Analysis
Before starting, I want to briefly introduce file hashing and one way we can use it in malware analysis. Hashes are like digital fingerprints for files: a hash algorithm processes a file’s content and produces a fixed-length string of characters, the hash, that for practical purposes identifies that exact content. Common algorithms include MD5, SHA-1, and SHA-256. VirusTotal accepts any of these hashes when you look up a file.
By submitting a hash, you’re essentially asking VirusTotal, “Have you seen this file before?” If it has, it’ll provide details on any associated threats.
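For example, a widely circulated sample keeps the same hash no matter what the file is called, because the hash depends only on the content, so a renamed “malware.exe” is still recognized. This makes hashes a quick way to spot known malicious files. However, with advanced malware that can change its code, relying solely on hashes can be tricky.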
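To make the fingerprint idea concrete, here is a minimal sketch using Python’s standard hashlib module. The helper name digests and the sample bytes are illustrative assumptions and are not part of the script presented later.

```python
import hashlib


def digests(data: bytes) -> dict:
    """Return the MD5, SHA-1 and SHA-256 hex digests of a byte string."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


original = digests(b"hello")
# Change a single character, as a code-mutating sample effectively does.
mutated = digests(b"hellp")

for name in original:
    print(f"{name}: {original[name]}")
    print(f"{name} changed after mutation: {original[name] != mutated[name]}")
```

Even a one-character change yields completely different digests, which is why a hash lookup alone cannot recognize a sample that rewrites itself.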
A Quick Overview Of VirusTotal
VirusTotal isn’t just another online tool; it’s a powerhouse in the cybersecurity landscape. Acting as a meta-scanner, it aggregates data from multiple antivirus engines, providing a holistic view of potential threats. When you submit a file or hash to VirusTotal, it cross-references it with its vast database, offering insights into any associated risks. But the real magic lies in its API. With the VirusTotal API, Python enthusiasts can automate and streamline their malware analysis processes. Instead of manually uploading each file, scripts can be crafted to batch-process files, making the task efficient and thorough. But remember, while VirusTotal is a formidable ally, it’s essential to combine its insights with other analysis techniques, ensuring no threat goes undetected.
Setting Up Your Environment for VirusTotal API with Python
In this guide, our spotlight is on fetching reports using file hashes via the VirusTotal API. We won’t delve into unknown files or tackle polymorphic or metamorphic variants. Consider this a foundational step, paving the way for your future explorations.
Safety first! When dealing with malware samples, always work in a secure environment. Need guidance? Check out this step-by-step tutorial on setting up FlareVM: How to Install FlareVM on VirtualBox.
Next, head to the VirusTotal website to register and grab your API key. This key is essential for our “VirusTotal API with Python” journey.
Post-registration, navigate to your profile section. Spot the “API Key” item on the top right.
Click the “eye” icon. Voilà! Your API key appears. This key grants you access to the API. Right below, you’ll notice details like your daily quota. For free users, it’s capped at 500 lookups/day.
Finally, I posted the code in the GitHub repository. You can download the whole project or, if you prefer to follow the step-by-step tutorial, download only the XLSX file and put it into your project’s folder.
Setup complete! You’re now geared up to harness the VirusTotal API with Python.
Harnessing the Power of vt-py: Python’s Client for VirusTotal
For the task of automation Python stands out as a versatile ally. Specifically, when interfacing with VirusTotal, the Python client vt-py is a game-changer.
This library streamlines connecting to the VirusTotal API, making it straightforward to submit files, URLs, or hashes for analysis. With just a few lines of code, you can initiate scans, fetch reports and more.
The beauty of vt-py lies in its simplicity and efficiency. Instead of crafting API requests manually, this client handles the heavy lifting, allowing you to focus on interpreting results and refining your malware detection strategies. Integrating vt-py into your Python scripts not only accelerates your workflow but also ensures you’re leveraging the full potential of the VirusTotal API.
To install it type in your terminal:
pip install vt-py
If you want, you can read the full documentation for that library here.
I also suggest installing openpyxl to read the hashes from an XLSX file. You can do it with the following command:
pip install openpyxl
And finally, install jinja2 to manage the templates:
pip install jinja2
Now we are ready to continue, but before writing the first line of code, let’s take a quick look at the API call we are interested in.
Diving into the ‘Get a File Report’ API
Our focus for this tutorial is the ‘Get a File Report’ API, detailed here. This API expects a GET request with the file’s hash (MD5, SHA1, or SHA256) as input.
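If the call succeeds, you get a 200 OK response. If not, expect an error status instead: for example, 400 for a malformed request, 401 when the API key is missing or wrong, 404 (NotFoundError) when VirusTotal has never seen the hash, and 429 when the quota is exhausted.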
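For a feel of what the client library handles for you, the request can be sketched with the standard library alone. This is an illustration under assumptions: the helper names hash_type and build_file_report_request are hypothetical, the status descriptions reflect my understanding of the v3 API’s common cases, and the function only builds the request without sending it.

```python
import re
import urllib.request

API_BASE = "https://www.virustotal.com/api/v3"

# Hex digest lengths of the hash types the endpoint accepts.
HASH_LENGTHS = {32: "MD5", 40: "SHA-1", 64: "SHA-256"}

# Common status codes and what they usually mean for this call.
STATUS_MEANINGS = {
    200: "report found",
    400: "malformed request",
    401: "missing or wrong API key",
    404: "hash unknown to VirusTotal",
    429: "quota or rate limit exceeded",
}


def hash_type(file_hash: str):
    """Return the hash type name, or None if the string is not a valid digest."""
    if not re.fullmatch(r"[0-9a-fA-F]+", file_hash):
        return None
    return HASH_LENGTHS.get(len(file_hash))


def build_file_report_request(api_key: str, file_hash: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for a file report."""
    if hash_type(file_hash) is None:
        raise ValueError(f"not an MD5, SHA-1 or SHA-256 digest: {file_hash!r}")
    return urllib.request.Request(
        f"{API_BASE}/files/{file_hash}",
        headers={"x-apikey": api_key},  # the API key travels in this header
        method="GET",
    )
```

Passing the result to urllib.request.urlopen would send it, raising urllib.error.HTTPError for the non-200 cases. Validating the hash locally first avoids spending quota on strings that can never match.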
A successful response returns a ‘File’ type. Its detailed structure is available here. But let’s simplify and highlight the key attributes for our “VirusTotal API with Python” exploration:
- meaningful_name: A significant name among all of the file’s names.
- reputation: A score derived from the VirusTotal community’s votes.
- popular_threat_classification: This section offers insights into the file’s threat classification. It includes:
  - popular_threat_name: A list detailing how many AV engines identified a specific threat.
  - popular_threat_category: A more generic list, categorizing malware types.
  - suggested_threat_label: A combination of the above two attributes.
- total_votes: Community votes split into “harmless” and “malicious”:
  - harmless: Count of positive votes.
  - malicious: Count of negative votes.
This is a foundational overview. You can dive deeper into the attributes using the official documentation.
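As a rough illustration of how these attributes nest, the sketch below summarizes a plain dictionary shaped like the attributes of a File object. The sample values are invented for the example, and the summarize helper is a hypothetical name, not something provided by the API or by vt-py.

```python
# Made-up sample shaped like the attributes of a File object.
sample_attributes = {
    "meaningful_name": "invoice.exe",
    "reputation": -42,
    "popular_threat_classification": {
        "suggested_threat_label": "trojan.examplefamily",
        "popular_threat_category": [{"value": "trojan", "count": 20}],
        "popular_threat_name": [{"value": "examplefamily", "count": 15}],
    },
    "total_votes": {"harmless": 1, "malicious": 12},
}


def summarize(attributes: dict) -> dict:
    """Pick the key fields, tolerating attributes that are absent."""
    classification = attributes.get("popular_threat_classification") or {}
    votes = attributes.get("total_votes") or {}
    return {
        "name": attributes.get("meaningful_name"),
        "reputation": attributes.get("reputation"),
        "label": classification.get("suggested_threat_label"),
        "harmless_votes": votes.get("harmless", 0),
        "malicious_votes": votes.get("malicious", 0),
    }


print(summarize(sample_attributes))
print(summarize({}))  # a sparse record still yields a complete summary
```

Not every file carries every attribute, so reading them defensively keeps a batch run from stopping on the first sparse record.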
Remember, our script will store the full response. This ensures you don’t repeat calls, especially since the API has call limits. Saving responses helps avoid hitting those limits unnecessarily.
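One way to take this a step further is to check for a saved response before calling the API at all. The sketch below is an assumption-laden illustration: load_or_fetch is a hypothetical helper, fetch stands for any callable that returns a JSON-serializable dictionary, and the .json extension is my choice rather than what the script uses.

```python
import json
import os


def load_or_fetch(file_hash, fetch, cache_dir="responses"):
    """Return the cached response for file_hash, calling fetch only on a miss.

    fetch is any callable that takes a hash and returns a JSON-serializable
    dict (hypothetical here; in the script it would wrap the API call).
    """
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, f"{file_hash}.json")

    # Cache hit: reuse the stored response and spend no quota.
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)

    # Cache miss: fetch once, then persist for the next run.
    data = fetch(file_hash)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)
    return data
```

With this in place, re-running the script over the same hash list only queries VirusTotal for hashes it has not stored yet.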
VirusTotal API with Python: A Step-by-Step Guide
Our script is designed to automate the process of scanning files using the VirusTotal API. Let’s break down its key components:
- Configuration Constants:
API_KEY_PATH = "api_key.txt"
TEMPLATE_DIRECTORY = 'templates'
REPORT_TEMPLATE_NAME = 'hash_report_template.md'
HASH_LIST_FILENAME = "hashes.xlsx"
GENERATED_REPORTS_DIR = "reports"
API_RESPONSES_DIR = "responses"
We define paths for the API key, the template directory, the report template, the hash list, and the two output folders (the API key is read from the file pointed to by API_KEY_PATH). You have to create the folders beforehand; if you prefer to do it in the script, these few lines of code are enough:
for path in (GENERATED_REPORTS_DIR, API_RESPONSES_DIR):
    if not os.path.exists(path):
        os.makedirs(path)
- Report Generation:
class ReportGenerator:
    def __init__(self):
        # Initialize Jinja2 environment and load the template
        self.env = Environment(loader=FileSystemLoader(TEMPLATE_DIRECTORY))
        self.template = self.env.get_template(REPORT_TEMPLATE_NAME)

    def generate(self, response):
        # Render the template with the given response data
        return self.template.render(
            meaningful_name=response.get('meaningful_name'),
            label=response.get('popular_threat_classification', {}).get('suggested_threat_label'),
            reputation=response.get('reputation'),
            sandbox_verdicts=response.get('sandbox_verdicts'),
            total_votes=response.get('total_votes')
        )
The ReportGenerator class uses the Jinja2 template engine. It takes the response from VirusTotal and renders it into a readable report format.
- Generating Hash List:
def generate_hash_list_from_folder(folder_path, xlsx_filename):
    # Create an Excel file with filenames and their MD5 hashes
    wb = openpyxl.Workbook()
    ws = wb.active
    for i, filename in enumerate(os.listdir(folder_path), start=1):
        with open(os.path.join(folder_path, filename), "rb") as file:
            ws.cell(row=i, column=1).value = filename
            ws.cell(row=i, column=2).value = hashlib.md5(file.read()).hexdigest()
    wb.save(xlsx_filename)
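This is just a bonus function that I didn’t use in the example, but you can adapt it to your needs to create an XLSX file like the one I provided.
In particular, this function computes the MD5 hash of each file in a specified folder and saves the results in an Excel file, which can then serve as the list of hashes to query VirusTotal. Note that it writes the filename in the first column and the hash in the second, with no header row, whereas extract_hashes_from_excel in the full script reads the first column starting from row 2, so adjust one or the other before feeding the generated file to the script.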
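If you adapt the function, two variations may be worth considering: hashing in chunks so that large samples are never loaded fully into memory, and skipping subdirectories, which os.listdir also returns. The sketch below shows both using only the standard library. The names sha256_of_file and hash_folder are hypothetical, and SHA-256 is my substitution for the MD5 used above.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in fixed-size chunks so large samples never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_folder(folder_path):
    """Return sorted (filename, sha256) pairs for the regular files in a folder."""
    results = []
    for name in sorted(os.listdir(folder_path)):
        full_path = os.path.join(folder_path, name)
        if os.path.isfile(full_path):  # skip subdirectories
            results.append((name, sha256_of_file(full_path)))
    return results
```

The resulting pairs can be written to a spreadsheet in whatever column layout your reading code expects.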
- Fetching Report from VirusTotal:
def fetch_report_from_virustotal(client, file_hash):
    # Fetch the report of a file from VirusTotal using its hash
    return client.get_object(f"/files/{file_hash}")
Given a file’s hash, this function retrieves its report from VirusTotal.
- Reading and Writing Data:
def read_json_file(filename):
    # Load data from a JSON file
    with open(filename, "r") as f:
        return json.load(f)

def save_data(filename, data, is_json_format=False):
    # Save data to a file, either as JSON or plain text
    with open(filename, "w") as f:
        if is_json_format:
            json.dump(data, f, indent=4)
        else:
            f.write(data)
These utility functions help in reading from and writing to files. The save_data function handles both JSON and plain text formats so that it can be used to save the response and the report.
- Main Execution:
def main():
    api_key = retrieve_api_key(API_KEY_PATH)
    with vt.Client(api_key) as client:
        hashes = extract_hashes_from_excel(HASH_LIST_FILENAME)
        report_gen = ReportGenerator()
        for h in hashes:
            try:
                response = fetch_report_from_virustotal(client, h)
                save_data(f"{API_RESPONSES_DIR}/{h}", response.to_dict(), is_json_format=True)
                save_data(f"{GENERATED_REPORTS_DIR}/{h}.md", report_gen.generate(response))
            except vt.APIError as e:
                print(f"Error with hash {h}: {e}")
This is where the magic happens! We load our API key, iterate through our list of file hashes, fetch their reports from VirusTotal, and save the results. If there’s an error with a particular hash, it’s printed out.
Remember, while this script provides a solid foundation for interacting with the VirusTotal API in Python, it’s just a starting point. You can expand upon it, tailoring it to your specific needs and integrating more advanced features.
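One possible extension is pacing the loop so a long hash list does not trip the API’s rate limits. The sketch below is a minimal throttle under assumptions: the Throttle class is hypothetical, and the default of 4 calls per 60 seconds reflects the per-minute limit documented for the public API as I understand it, so check your own quota page before relying on it.

```python
import time


class Throttle:
    """Space out calls so that at most max_calls happen per period seconds."""

    def __init__(self, max_calls=4, period=60.0, clock=time.monotonic, sleep=time.sleep):
        # Evenly spaced calls: one every period / max_calls seconds.
        self.min_interval = period / max_calls
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

In the main loop you would create one Throttle before iterating and call its wait() method right before each fetch_report_from_virustotal call. The clock and sleep parameters exist only so the class can be exercised without real delays.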
Crafting a Custom Template for VirusTotal API Results
To present the results from the VirusTotal API in a structured and readable format, we utilize a template. This template is designed using the Jinja2 templating engine, which allows for dynamic content rendering based on the data provided.
- File Identification:
# Virus Total API Result
**Meaningful Name**: {{ meaningful_name }}
Here, we display a title and the most recognizable name of the file, making it easy to identify at a glance.
- Threat Classification:
## Threat Classification
{% if label %}
Labels: {{ label }}
{% else %}
No labels found.
{% endif %}
This section showcases the threat labels associated with the file. If no labels are found, it indicates as such.
- Reputation Score:
## Reputation
Reputation Score: {{ reputation }}
The reputation score gives a quick insight into the file’s standing within the VirusTotal community.
- Sandbox Analysis:
## Sandbox Verdicts
{% if sandbox_verdicts %}
Verdicts:
{% for verdict, details in sandbox_verdicts.items() %}
- {{ verdict }}
  - Category: {{ details["category"] }}
  - Confidence: {{ details["confidence"]|default("N/D") }}
{% endfor %}
{% else %}
No sandbox verdicts found.
{% endif %}
Sandbox verdicts provide detailed analysis results. If the file has been analyzed in a sandbox environment, this section lists the verdicts, their categories, and confidence levels.
- Community Votes:
## Total Votes
{% for k, v in total_votes.items() %}
- {{ k }}: {{v}}
{% endfor %}
This section displays the total votes from the VirusTotal community, categorized as “harmless” or “malicious”.
To use this template, save the provided code in a .md file within the templates directory specified in the main script. When the script runs, it will dynamically populate this template with the VirusTotal API results, generating a comprehensive report for each file.
Dive into the Code
For those eager to get hands-on, here’s the entire script we’ve meticulously crafted. This code embodies the principles we’ve discussed, offering a seamless integration with the VirusTotal API. Don’t just skim through it; run it, tweak it, and witness firsthand the power of automation in cybersecurity. Let this be your stepping stone to more advanced projects. Grab the code below and set your cybersecurity prowess in motion!
The Script
import vt
import os
import hashlib
import openpyxl
import json
from jinja2 import Environment, FileSystemLoader

# Constants for configuration
API_KEY_PATH = "api_key.txt"
TEMPLATE_DIRECTORY = 'templates'
REPORT_TEMPLATE_NAME = 'hash_report_template.md'
HASH_LIST_FILENAME = "hashes.xlsx"
GENERATED_REPORTS_DIR = "reports"
API_RESPONSES_DIR = "responses"

class ReportGenerator:
    def __init__(self):
        # Initialize Jinja2 environment and load the template
        self.env = Environment(loader=FileSystemLoader(TEMPLATE_DIRECTORY))
        self.template = self.env.get_template(REPORT_TEMPLATE_NAME)

    def generate(self, response):
        # Render the template with the given response data
        return self.template.render(
            meaningful_name=response.get('meaningful_name'),
            label=response.get('popular_threat_classification', {}).get('suggested_threat_label'),
            reputation=response.get('reputation'),
            sandbox_verdicts=response.get('sandbox_verdicts'),
            total_votes=response.get('total_votes')
        )

def generate_hash_list_from_folder(folder_path, xlsx_filename):
    # Create an Excel file with filenames and their MD5 hashes
    wb = openpyxl.Workbook()
    ws = wb.active
    for i, filename in enumerate(os.listdir(folder_path), start=1):
        with open(os.path.join(folder_path, filename), "rb") as file:
            ws.cell(row=i, column=1).value = filename
            ws.cell(row=i, column=2).value = hashlib.md5(file.read()).hexdigest()
    wb.save(xlsx_filename)

def fetch_report_from_virustotal(client, file_hash):
    # Fetch the report of a file from VirusTotal using its hash
    return client.get_object(f"/files/{file_hash}")

def read_json_file(filename):
    # Load data from a JSON file
    with open(filename, "r") as f:
        return json.load(f)

def extract_hashes_from_excel(xlsx_filename):
    # Extract file hashes from an Excel file
    wb = openpyxl.load_workbook(xlsx_filename)
    ws = wb.active
    return [ws.cell(row=i, column=1).value for i in range(2, ws.max_row + 1)]

def save_data(filename, data, is_json_format=False):
    # Save data to a file, either as JSON or plain text
    with open(filename, "w") as f:
        if is_json_format:
            json.dump(data, f, indent=4)
        else:
            f.write(data)

def retrieve_api_key(filename):
    # Load the API key from a file
    with open(filename, 'r') as f:
        return f.read().strip()

def main():
    api_key = retrieve_api_key(API_KEY_PATH)
    with vt.Client(api_key) as client:
        hashes = extract_hashes_from_excel(HASH_LIST_FILENAME)
        report_gen = ReportGenerator()
        for h in hashes:
            try:
                response = fetch_report_from_virustotal(client, h)
                save_data(f"{API_RESPONSES_DIR}/{h}", response.to_dict(), is_json_format=True)
                save_data(f"{GENERATED_REPORTS_DIR}/{h}.md", report_gen.generate(response))
            except vt.APIError as e:
                print(f"Error with hash {h}: {e}")

if __name__ == "__main__":
    main()
The Template
# Virus Total API Result
**Meaningful Name**: {{ meaningful_name }}
## Threat Classification
{% if label %}
Labels: {{ label }}
{% else %}
No labels found.
{% endif %}
## Reputation
Reputation Score: {{ reputation }}
## Sandbox Verdicts
{% if sandbox_verdicts %}
Verdicts:
{% for verdict, details in sandbox_verdicts.items() %}
- {{ verdict }}
  - Category: {{ details["category"] }}
  - Confidence: {{ details["confidence"]|default("N/D") }}
{% endfor %}
{% else %}
No sandbox verdicts found.
{% endif %}
## Total Votes
Total Votes:
{% for k, v in total_votes.items() %}
- {{ k }}: {{v}}
{% endfor %}
Conclusion
Our dive into automating file checks using the “VirusTotal API with Python” is just a glimpse of what’s possible when you combine coding prowess with powerful APIs. If you found this guide helpful and are eager to explore further, don’t forget to follow our updates for more enlightening content.
For those keen on diving right in, the complete code for this tutorial, along with an XLSX file containing sample hashes for testing, is readily available on StackZero’s GitHub profile. Dive in, experiment, and remember: the cybersecurity journey is as much about the tools you use as the knowledge you possess.
Stay connected, and until next time, happy coding!