Monitor all public NPM packages.
- `config/default.json` is the main configuration file.
- Download all public metadata
  - Creates a folder named `metadata_yyyyMMdd` (for example `metadata_20240809`) in `targetMetadataDirectory`.
  - Saves a snapshot of the metadata in a file named `npm_packages_snapshot_yyyyMMdd.json` (for example `npm_packages_snapshot_20240809.json`).
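
The naming scheme above can be sketched as follows. This is only an illustration: the helper names `formatDateStamp` and `buildSnapshotPaths` are hypothetical, the date is taken in local time, and placing the snapshot file inside the dated folder is an assumption rather than something the project confirms.

```javascript
const path = require('node:path');

// Format a Date as yyyyMMdd using local time, e.g. 20240809.
function formatDateStamp(date) {
  const yyyy = String(date.getFullYear());
  const mm = String(date.getMonth() + 1).padStart(2, '0');
  const dd = String(date.getDate()).padStart(2, '0');
  return `${yyyy}${mm}${dd}`;
}

// Build the folder and snapshot file paths for one download run.
// ASSUMPTION: the snapshot file is stored inside the dated folder.
function buildSnapshotPaths(targetMetadataDirectory, date = new Date()) {
  const stamp = formatDateStamp(date);
  const folder = path.join(targetMetadataDirectory, `metadata_${stamp}`);
  const snapshotFile = path.join(folder, `npm_packages_snapshot_${stamp}.json`);
  return { folder, snapshotFile };
}
```

Calling `buildSnapshotPaths('data', new Date(2024, 7, 9))` yields a folder ending in `metadata_20240809` and a file named `npm_packages_snapshot_20240809.json`.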
  - A retry mechanism handles download or network failures. It accepts two optional parameters:
    - `maxRetries`: maximum number of retries.
    - `delay`: delay between retries, in milliseconds.
  - On a retry, the `readLastDocumentId` function reads the JSON file containing the metadata downloaded so far and returns the `lastDocumentId`.
  - `lastDocumentId` is then used as the `startkey`, so the download continues from the last package whose metadata was downloaded.
  - The retry mechanism terminates once the data has been fetched and saved successfully.
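
A minimal sketch of this retry-and-resume loop is shown below. It is not the project's implementation: `downloadWithRetry` and the injected `downloadFrom` callback are hypothetical names, `readLastDocumentId` is passed in as a function rather than imported, and the default values for `maxRetries` and `delay` are assumptions.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry `downloadFrom(startkey)` until it succeeds or the retries run out.
// `downloadFrom` (hypothetical) performs the download and save;
// `readLastDocumentId` returns the id of the last document saved so far.
// ASSUMPTION: defaults of 3 retries and 1000 ms are illustrative only.
async function downloadWithRetry(
  downloadFrom,
  readLastDocumentId,
  { maxRetries = 3, delay = 1000 } = {},
) {
  let startkey; // undefined on the first attempt: start from the beginning
  for (let attempt = 0; ; attempt += 1) {
    try {
      // Success ends the loop: the data was fetched and saved.
      return await downloadFrom(startkey);
    } catch (error) {
      if (attempt >= maxRetries) throw error; // give up after maxRetries
      await sleep(delay);
      // Resume from the last document that reached the file.
      startkey = await readLastDocumentId();
    }
  }
}
```

If the registry treats `startkey` as inclusive, as CouchDB views do by default, the resumed download begins with the last saved document again, so the caller may need to skip one duplicate.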
- Download source code for all public NPM packages
  - Uses the `all-the-package-names` package to retrieve a list of all public package names on npm. The list includes scoped packages and is updated daily.
  - Replaces every `/` in a package name with `+` to avoid unnecessary sub-directories for scoped packages.
  - Uses `pacote`, which fetches package manifests and tarballs from the npm registry.
  - Uses worker threads to download multiple packages at the same time; `CONCURRENCY_LIMIT` uses all available CPU cores.
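
Two of the steps above, the directory naming and the concurrency cap, can be illustrated with a short sketch. The helper names `toDirectoryName` and `runWithLimit` are hypothetical, and for brevity the sketch limits concurrency with a simple promise pool instead of worker threads, so it shows the idea of a per-core limit rather than the project's actual threading code.

```javascript
const os = require('node:os');

// One slot per CPU core reported by the OS (at least one).
const CONCURRENCY_LIMIT = Math.max(1, os.cpus().length);

// '@scope/name' -> '@scope+name', so each package maps to one directory.
function toDirectoryName(packageName) {
  return packageName.replace(/\//g, '+');
}

// Run `task` over `items` with at most `limit` tasks in flight.
// Results are returned in the same order as `items`.
async function runWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const index = next;
      next += 1;
      results[index] = await task(items[index]);
    }
  }
  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A download step could then be driven with `runWithLimit(packageNames, CONCURRENCY_LIMIT, downloadOne)`, where `downloadOne` is whatever function fetches a single package.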
- Find Malware Packages using downloaded metadata
This module provides a function to process a large JSON file and identify any packages that are flagged as "security holding packages." It reads the JSON file line by line, extracts the relevant package names, and writes them to an output file. The module is designed to handle very large JSON files (e.g., 180 GB) with efficient memory usage and logging for monitoring progress.
- Efficient Processing: Processes large JSON files line by line to avoid high memory usage.
- Error Handling: Catches and logs errors during JSON parsing, including the line number where the error occurred.
- Progress Logging: Logs the progress of the operation at regular intervals, making it easier to monitor the process.
- Output File: Extracted package names are written to a specified output file.
Parameters
- `inputFilePath`: The path to the input JSON file that contains the package metadata.
- `outputFilePath`: The path to the output file where the names of security holding packages will be written.
- `targetSecurityHoldingPackagesDirectory`: The directory where the output file will be stored.
Monitoring
- The function logs the start time, its progress every 100,000 lines, and the end time. Logging only at this interval avoids the performance cost of excessive logging; the interval can be adjusted if needed.
- If a line cannot be processed, the error is logged together with its line number, which makes it easier to find and correct problems in the JSON file.
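
The line-by-line processing, error logging, and progress logging described above could look roughly like this. The sketch rests on several assumptions that the text does not confirm: each line of the input holds one JSON row (optionally followed by a comma), the package name is in the row's `id` field, and a holding package is recognised by the version string `0.0.1-security`. The function names are hypothetical.

```javascript
const fs = require('node:fs');
const readline = require('node:readline');

// ASSUMPTION: holding packages are recognised by this version marker.
const SECURITY_MARKER = '0.0.1-security';

// Return the package name if the line describes a security holding
// package, otherwise null. Throws if the line is not valid JSON.
// ASSUMPTION: one JSON row per line, with the package name in `id`.
function extractHoldingPackageName(line) {
  const trimmed = line.trim().replace(/,$/, '');
  if (trimmed === '') return null;
  const row = JSON.parse(trimmed);
  return trimmed.includes(SECURITY_MARKER) ? row.id : null;
}

async function findSecurityHoldingPackages(inputFilePath, outputFilePath, logInterval = 100000) {
  // Stream the file so only one line is held in memory at a time.
  const lines = readline.createInterface({
    input: fs.createReadStream(inputFilePath, { encoding: 'utf8' }),
    crlfDelay: Infinity,
  });
  const output = fs.createWriteStream(outputFilePath);
  let lineNumber = 0;
  let found = 0;
  console.log(`Started at ${new Date().toISOString()}`);
  for await (const line of lines) {
    lineNumber += 1;
    try {
      const name = extractHoldingPackageName(line);
      if (name !== null) {
        output.write(`${name}\n`);
        found += 1;
      }
    } catch (error) {
      // Keep going: one bad line should not stop the whole run.
      console.error(`Line ${lineNumber}: ${error.message}`);
    }
    if (lineNumber % logInterval === 0) {
      console.log(`Processed ${lineNumber} lines`);
    }
  }
  // Wait until all buffered names have been flushed to disk.
  await new Promise((resolve) => output.end(resolve));
  console.log(`Finished at ${new Date().toISOString()}`);
  return found;
}
```

Parsing every line, rather than only the lines that contain the marker, costs more CPU time but is what allows malformed lines to be reported with their line numbers.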
- refactor, clean up and fix the code
- document everything
- scan the downloaded source code for security holding packages and create a dataset of them
- analyse the source code of these packages
- contact NPM for API key
- getLatestPackageVersion from local JSON file when possible to reduce NPM requests
- implement vulnerability scans
- implement a count of how many times a package is used as a dependency
- implement a count of how many times a package is downloaded