Data Feedback Loop for Autonomous Vehicles
Background
Self-driving cars rely on a combination of sensors, cameras, and software algorithms to navigate safely.
Two technologies are central to this: computer vision and machine learning.
- Computer Vision uses cameras and sensors to capture images and video of everything around the AV, including both static and dynamic objects such as lane markings, traffic lights, pedestrians, and other vehicles. The vehicle then interprets these images and videos using dedicated techniques.
- Object Detection: the vehicle uses advanced computer vision methods to detect and classify the objects on and around its path, quickly, accurately, and in real time.
- Object Tracking: once an object is detected, object tracking monitors the dynamic objects over time, which is crucial for path planning and collision avoidance.
- Machine Learning analyses the data from the cameras and sensors and uses algorithms to find patterns, make predictions, and learn from new information. This helps the vehicle make sound decisions, handle new situations, and improve over time.
The scope of this analysis is Computer Vision data only, i.e., visual perception.
A Data Feedback Loop is essential in any real-time application to improve system performance: any misidentified object or edge case is corrected and fed back so the model can learn from the updated data.
There are scenarios, described below, where the model needs real-world data to predict objects accurately because such examples are absent from the training dataset.
Examples
- Heavily occluded ‘Stop’ signs are sometimes misidentified.
- A person holding a ‘Stop’ sign that is not actually intended to stop the AV.
- A school-bus ‘Stop’ sign arm may cause the AV to stop when it should not.
In addition, the following edge-case scenarios also need human labelling for better model prediction; without it, the model may struggle to classify the object correctly and may not respond appropriately.
Examples
- Cargo Bikes or Bikes with Trailers: Cargo bikes, bikes with trailers, or bikes carrying large objects can be difficult for models to recognize correctly. The model may interpret the bike as an extended vehicle, especially if the trailer or cargo changes the bike’s shape significantly.
- Motorcycles with Sidecars: Motorcycles with sidecars or small trailers present an unusual shape that can be easily confused with other types of vehicles like compact cars or three-wheeled vehicles.
- Reflections of Vehicles and Bicycles on Shiny Surfaces: Reflections on wet roads, building windows, or shiny car surfaces can create “phantom” objects, where the model might mistakenly interpret reflections as actual vehicles or bicycles.
- Animals Being Transported in or on Vehicles: Animals (like dogs) in the bed of a truck or in a car window can be interpreted as independent objects or mistakenly classified as pedestrians or moving obstacles.
This is where the Data Feedback Loop comes into the picture.
A data feedback loop for an autonomous vehicle system plays a central role in ensuring continuous learning, adaptability, and improvement in vehicle performance and safety. This feedback loop serves as the engine for refining perception, decision-making, and control systems within autonomous vehicles.
At a high level, the process flow is as follows.
Initially, the CV model is trained on a commercially available or open-source dataset that lacks the real-world scenarios described above. Once the model is trained and deployed in the AV, it starts capturing the real-time scenarios where it fails to predict or where prediction accuracy is below a threshold. The model is then retrained on the newly captured data with human-labelled ground-truth values, yielding an enriched dataset and better prediction accuracy.
Problem Statement
To create a dynamic and intelligent data feedback loop that continuously enhances autonomous vehicle systems through real-world data, providing real-time insights, improving decision-making, and supporting iterative learning.
Scope of the Document
- Define the vision, strategy, and roadmap for developing a scalable, secure and robust vision perception data feedback loop that will be responsible for the success of a fully autonomous driving system.
- Design the functional architecture of this system, to enhance system accuracy and responsiveness under diverse driving conditions.
- Identify and allocate necessary resources, including personnel, technology, and budget, to support development and deployment.
- Finally, establish a comprehensive strategy for long-term management and maintenance, focusing on continuous improvement, operational efficiency, and adherence to safety and regulatory standards.
Vision
To build a secure, scalable, and intelligent data platform that empowers autonomous vehicle systems with real-time data insights, enhances decision-making, and accelerates continuous improvement, all while prioritizing safety, privacy, and compliance.
Phase I (Timeline: up to 1 yr)
- Stream-data-ingestion-based Data Feedback Loop with Active Learning for existing CV models
- Train existing CV models to handle edge cases in the same architecture, using synthetic data generated through GAN models
- Monitor CV models
- Secure the system by adopting the required AI risk mitigation controls
- MLOps (covered with a limited scope)
Phase II (Timeline: 1 yr - 2 yrs)
- Additional enhancements + bug fixes with the existing architecture + POC for the new architecture
Phase III (Timeline: 2 yrs and beyond)
- More refined architecture to support SAM or Large Vision Models with RAG & Vision Language Models
Assumption
- Initially, the CV model is trained on a commercially available or open-source dataset that lacks the real-world scenarios. We therefore need to build a data platform with real-time data capture once the CV model is deployed in the AV and starts collecting data; the data lake/platform is scoped to the Data Feedback Loop only.
- This data lake is for only image data and to build corresponding Observability system. It excludes other telemetry data emitted by the sensors of AV.
- MLOps is covered in a limited version, assuming that this architecture will use Bosch’ existing MLOps platform.
Business Value [Assigned some hypothetical values]
- Enhanced Autonomy and Safety: Real-world data collected from vehicles helps Bosch identify edge cases and retrain models to better recognize and respond to various driving scenarios, enhancing both driver and pedestrian safety by 15%.
- Reduced Development Time and Increased Model Precision: The rapid iteration enabled by the feedback loop improves the precision of the computer vision models, enabling 20% faster deployment of refined updates to the fleet and giving Bosch a competitive edge in the autonomous driving space.
- Scalability of Autonomous Fleet Learning: The DFL will leverage data from the global fleet, allowing the company to scale its machine learning efforts across a diverse set of driving conditions and environments, improving system scalability by up to 20%.
- Cost Savings and Efficiency: Bosch will reduce the need for extensive manual testing by using the data feedback loop to automatically detect and correct model errors, thereby saving cost up to 15%.
Workflow for Real time data capture with Data Feedback Loop
- Detect the failure scenario: The data workflow uses real-world driving examples to iteratively run machine learning algorithms, which are then used to train the AV’s CV models. This model constantly runs in “shadow mode.” When the driver does something different from what the model would have done, or the neural network signals that it does not know what to do in the presented scenario, the event is noted as an inaccuracy. The Data Engine logs these inaccuracies so they can be retroactively collected.
- Collect image data having similar failure scenario or contextual examples: Suppose Bosch detects enough inaccuracies under similar circumstances. In that case, Bosch can then search for similar driving conditions found in other cars in the Bosch fleet, even if it didn’t detect an inaccuracy. Bosch can then harvest similar contextual examples.
- Ground truth of the data: Next, the vehicle identifies an inaccuracy. That inaccuracy enters Bosch’s Unit Tests to verify its legitimacy and that it’s not the result of subpar driving by the human driver. If the inaccuracy is deemed legitimate, Bosch then asks its fleet for more examples of where the inaccuracies are found. Those examples are then correctly labelled by a human, and used to train the neural network. The network is then redeployed to the data source to collect more inaccuracies.
- Prepare a dataset combining the predicted labels and the ground-truth data (GTD).
- Train the CV models with this new dataset: Using this newly formed, well-labelled data set, Bosch can re-train its neural network to better react to the scenario in which those inaccuracies were presented. Once the neural network is re-trained, it can deploy the newly revised self-driving neural network to “shadow mode” and collect new data examples for further inaccuracies.
- Deploy these newly trained models into AVs.
- Iterate the process.
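To make the failure-detection step above concrete, the sketch below shows one way the shadow-mode disagreement check could look in Python. All names here (Detection, log_inaccuracy, the confidence threshold) are illustrative assumptions, not an existing Bosch interface.

# Minimal sketch of shadow-mode failure capture; all names and the threshold are illustrative.
import json
import time
from dataclasses import dataclass
from typing import List

CONFIDENCE_THRESHOLD = 0.6  # assumed cut-off below which a prediction counts as "uncertain"

@dataclass
class Detection:
    label: str
    confidence: float

def log_inaccuracy(frame_id: str, reason: str, detections: List[Detection]) -> None:
    """Append the failure event to a local buffer for later upload to the data lake."""
    event = {
        "frame_id": frame_id,
        "timestamp": time.time(),
        "reason": reason,
        "detections": [d.__dict__ for d in detections],
    }
    with open("/tmp/inaccuracy_log.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")

def check_frame(frame_id: str, detections: List[Detection], driver_action: str, shadow_action: str) -> None:
    # Case 1: the human driver did something different from what the shadow model would have done.
    if driver_action != shadow_action:
        log_inaccuracy(frame_id, "driver_disagreement", detections)
    # Case 2: the network signals it is uncertain about the presented scene.
    elif any(d.confidence < CONFIDENCE_THRESHOLD for d in detections):
        log_inaccuracy(frame_id, "low_confidence", detections)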
The workflow consists of the following phases.
Ph-1: Stream data ingestion based Data Feedback Loop with Active Learning for existing CV models
SPh-01: Build Data Feedback Loop
SPh-02: Prepare dataset with new data through data catalogue
SPh-03: Re-train the CV model with new dataset (OOS of this document)
SPh-04: Deploy the model into edge
SPh-05: Get the model inference and model evaluation metrics
SPh-06: Continue with Phase-1.
SPh-07: AI Risk Mitigation Control
SPh-08: Non-Functional Requirements
Strategy
1. Prioritize High-Quality Data Collection
- Through AWS Kinesis Video Streams or Apache Kafka
2. Develop a Scalable Data Ingestion and Storage System
- Cloud-Native, Scalable Infrastructure: Build a highly scalable, cloud-native platform with an architecture that can handle billions of data points generated daily. Utilize a hybrid storage approach that supports both edge storage for rapid processing and cloud storage for long-term analysis.
- Data Compression and Efficient Transmission: Implement advanced compression techniques and adaptive data transmission protocols to optimize bandwidth and reduce latency, ensuring data is transmitted without overwhelming networks.
- Apply a required data retention and wipe-out policy, if applicable (a minimal lifecycle-policy sketch appears after this strategy list).
3. Set up a Data Cleaning and enrichment process
- Centralized Data Lake with Distributed Processing: Develop a data lake for centralized storage, allowing diverse data types (structured, semi-structured, unstructured) and enabling distributed processing frameworks for efficient analysis. These are for building both Data Feedback Loop and Observability system for it.
4. Set up Data Catalogue for centralized governance of data platform
- Maintain data quality, lineage, data discovery and data access framework.
- Initiate data labelling process.
5. Foster a Collaborative Ecosystem and Open Standards
- Open APIs and SDKs for Third-Party Integrations: Develop APIs that enable integration with third-party applications and analytics platforms to support V2X (Vehicle-to-Everything) communications and real-time information exchange.
- Encourage Industry Collaboration: Participate in open-source and industry-wide initiatives to set standardized data formats and protocols, fostering innovation and creating an ecosystem where autonomous data can be leveraged collectively.
6. Build a Transparent, Robust, and Secure Platform
- Compliance-First Design: Design the platform to align with data privacy regulations (GDPR, CCPA) and industry standards (ISO 26262, UL 4600). Implement privacy-preserving techniques like anonymization and differential privacy to protect user data.
- Transparent and Auditable System: Introduce transparency in data handling practices, enabling users to access data logs related to their journeys, creating trust with end-users and aligning with regulatory demands.
- End-to-End Security: Secure data at every layer, with encryption protocols for data at rest and in transit, and stringent access controls to prevent unauthorized data access.
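As a concrete illustration of the retention and wipe-out policy mentioned in item 2, the sketch below applies an S3 lifecycle rule with boto3. The bucket name, prefix, 30-day Glacier transition and 180-day expiry are assumptions, not agreed retention periods.

# Minimal sketch: retention / wipe-out policy on the raw-image bucket via an S3 lifecycle rule.
# Bucket name, prefix and the day counts are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="av-raw-image-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-raw-frames",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],  # move to cold storage first
                "Expiration": {"Days": 180},  # wipe out raw frames after 180 days
            }
        ]
    },
)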
Phase-1
Technical Architecture (SPh-01, SPh-02, SPh-03)
- Image data is collected from the AV through AWS IoT Greengrass.
- This data is then ingested through AWS Outposts for local data processing.
- In a streaming data pipeline for autonomous vehicles, object detection is typically performed in the cloud using AWS services:
- Cloud Processing (for complex tasks):
- AWS SageMaker or AWS Lambda: For more resource-intensive models, raw image data is sent to the cloud. AWS Lambda functions or SageMaker inference endpoints perform object detection, tagging detected objects in metadata before storing or streaming it further.
- The data then reaches the AWS cloud infrastructure through Kinesis Video Streams or Apache Kafka. Kinesis Video Streams extracts frames from the live video into an S3 bucket. Alternatively, a Lambda function extracts frames from the uploaded video clips and stores them in PNG format, with timestamps, in a separate AWS S3 raw-data bucket.
- AWS EMR shall be used for
- Image preprocessing tasks such as resizing, normalizing, and filtering images.
- Data cleansing operations to remove noisy data, correct image labels, or eliminate corrupted files.
- Extracting metadata (e.g., timestamps, GPS coordinates, lighting conditions) associated with each image. This information is valuable for labelling and training
- Normalizing thousands of images to standardize pixel values, resolutions, and colour spaces for consistency in model training.
- Dimension reduction of high-resolution images to optimize storage and computational requirements using algorithms like PCA on Spark MLlib.
- Using EC2 Spot Instances to reduce costs for compute-heavy tasks, making it suitable for image data processing pipelines that may need significant computational resources.
- Allocating resources dynamically based on data volume and processing demand, providing flexibility for scaling up during peak processing times.
- Perform a usability test on the image data, using either DenseNet or ResNet for glare and blur detection.
- Extract only the usable data and store it in the AWS S3 usable-data bucket.
- To query this data, we perform ETL and register the metadata in AWS Glue. The detailed database design is described in the following section.
- Store the cleaned data in another S3 silver bucket. The complete process should be orchestrated by Airflow (a minimal DAG sketch is given after this list).
- The next step is to label these images. This can be done in three ways:
- Amazon Rekognition, a managed image and video analysis service offered by AWS.
- Third-party annotation software such as Scale AI or V7.
- Active Learning (discussed in the next section).
- Once images are labelled, the corresponding metadata will be stored in a NoSQL DB and the images in an AWS S3 bucket.
- Now, we need to check whether sufficient images are available for model training.
- An AWS Lambda function can be used to check this condition.
- If there are not enough images, we either augment the available images or create synthetic images. The detailed synthetic-image creation process is described in the next section. (Note: if synthetic image generation is not ready for the first release, we can use augmented images for this release and integrate synthetic images in the next one.) The integration is done by calling the synthetic-image model API endpoint, exposed through FastAPI, inside the AWS Lambda function.
- Then, using AWS Lambda and Amazon Rekognition, apply data anonymisation to adhere to data privacy principles.
- The next step is to prepare the dataset for model training. This can happen in two ways:
- Manually: by building a data catalogue for data discovery, a data scientist can create a dataset by searching with the desired criteria. For this, the data should be stored in a graph database with a knowledge graph.
- Different sampling techniques are available for image data selection;
- Python code can be written to select the data while avoiding class imbalance.
- Active Learning: Active Learning is a machine learning approach where the model intelligently selects the most informative data points from a large pool of unlabelled data for human annotation. Instead of passively learning from all available data, the model actively queries the data that it finds most challenging or uncertain, thereby improving its performance more efficiently.
- Train models to detect new objects and track the following model evaluation metrics
- Evaluate performance
- Precision,
- Recall,
- F1 Score,
- Mean Average Precision (mAP) and
- IoU.
- Find cases in which performance is low
- Add those to the data unit test. The objective of the unit test is to check that performance on those failed cases has improved enough to be accepted.
- Deploy models to car fleet in shadow mode to fetch similar edge cases
- Retrieve cases from AV fleet.
- Review and label collected data
- Retrain models
- Repeat the above steps.
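As referenced in the orchestration step above, the sketch below shows a minimal Airflow 2.x DAG for the raw-to-silver image flow. The DAG id, task names and the callables they wrap are placeholders; a real implementation would invoke the EMR, usability-check and Glue steps described earlier.

# Minimal Airflow DAG sketch for the raw -> usable -> silver image flow (names are placeholders).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_preprocessing(**context):
    """Resize, normalize and extract metadata on EMR/Spark."""
    ...

def run_usability_check(**context):
    """Filter out frames with glare or blur (e.g., with a ResNet/DenseNet classifier)."""
    ...

def write_silver_bucket(**context):
    """Store cleaned frames in the silver bucket and register metadata in AWS Glue."""
    ...

with DAG(
    dag_id="av_image_feedback_loop",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    preprocess = PythonOperator(task_id="preprocess_emr", python_callable=run_preprocessing)
    usability = PythonOperator(task_id="usability_check", python_callable=run_usability_check)
    to_silver = PythonOperator(task_id="write_silver", python_callable=write_silver_bucket)

    preprocess >> usability >> to_silver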
Database Design
To store and query image data from autonomous vehicles in AWS Glue with images stored in AWS S3, here’s a schema design and sample queries.
1. Schema Design in AWS Glue
Define a Glue table with the following columns for structured storage of image metadata in Amazon S3:
- image_id (String): Unique identifier for each image.
- vehicle_id (String): Identifier for the vehicle that captured the image.
- timestamp (Timestamp): Date and time the image was captured.
- location (String): GPS coordinates or area name.
- day_night (String): ‘Day’ or ‘Night’ based on timestamp or lighting.
- objects_detected (Array of Strings): Objects identified in the image, e.g., [“car”, “bicycle”, “pedestrian”].
- file_path (String): S3 path to the actual image file.
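One way to register this schema programmatically is through the Glue create_table API, sketched below; the database name, table name, S3 location and Parquet format are assumptions for illustration.

# Minimal sketch: register the image-metadata table in the AWS Glue Data Catalog.
# Database/table names, S3 location and the Parquet SerDe are illustrative assumptions.
import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="av_feedback_loop",
    TableInput={
        "Name": "image_data",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "image_id", "Type": "string"},
                {"Name": "vehicle_id", "Type": "string"},
                {"Name": "timestamp", "Type": "timestamp"},
                {"Name": "location", "Type": "string"},
                {"Name": "day_night", "Type": "string"},
                {"Name": "objects_detected", "Type": "array<string>"},
                {"Name": "file_path", "Type": "string"},
            ],
            "Location": "s3://av-image-metadata/image_data/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)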
2. Sample Glue Queries
To query images with cars and bicycles taken at night (the queries below assume Athena/Presto SQL over the Glue table, where contains() tests array membership):
-- Retrieve all images containing cars and bicycles at night
SELECT image_id, file_path
FROM image_data
WHERE contains(objects_detected, 'car')
  AND contains(objects_detected, 'bicycle')
  AND day_night = 'Night';
-- Count the number of nighttime images with cars and bicycles
SELECT COUNT(*) AS car_bicycle_night_images
FROM image_data
WHERE contains(objects_detected, 'car')
  AND contains(objects_detected, 'bicycle')
  AND day_night = 'Night';
AWS Lambda function to check the image count
import boto3

# Initialize S3 client
s3 = boto3.client('s3')

# Constants
S3_BUCKET = 'your-image-bucket'
S3_PREFIX = 'images/'  # Folder path in S3

# Function to check image count
def check_image_count():
    # Note: list_objects_v2 returns at most 1,000 keys per call; paginate for larger prefixes.
    response = s3.list_objects_v2(Bucket=S3_BUCKET, Prefix=S3_PREFIX)
    return response.get('KeyCount', 0)

def lambda_handler(event, context):
    image_count = check_image_count()
    print(f"Image count in S3 bucket: {image_count}")
    return {"image_count": image_count}
Integration of Synthetic image generation model API endpoint
- Create the FastAPI endpoint
pip install fastapi uvicorn torch torchvision

# gan_api.py
from io import BytesIO

import torch
from torchvision.utils import save_image
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

# Load the pre-trained GAN model here (replace with your model path)
model = torch.load("path_to_your_trained_gan_model.pth")
model.eval()  # Set to evaluation mode

@app.get("/generate-image")
async def generate_image():
    # Generate a synthetic image
    noise = torch.randn(1, 100)  # Example: noise vector for GAN input
    with torch.no_grad():
        fake_image = model(noise)
    # Convert to an image response
    buffer = BytesIO()
    save_image(fake_image, buffer, format="JPEG")
    buffer.seek(0)
    return StreamingResponse(buffer, media_type="image/jpeg")

Run the server with:
uvicorn gan_api:app --reload
- API integration

import boto3

# Initialize AWS clients
s3 = boto3.client('s3')
sagemaker = boto3.client('sagemaker')

# Constants
MIN_IMAGE_COUNT = 1000  # Define the minimum required number of images
s3_bucket = 'your-image-bucket'
s3_prefix = 'images/'

# Check image count in S3
def check_image_count():
    response = s3.list_objects_v2(Bucket=s3_bucket, Prefix=s3_prefix)
    return response.get('KeyCount', 0)

# Trigger synthetic image generation
def trigger_synthetic_generation():
    response = sagemaker.start_notebook_instance(NotebookInstanceName='synthetic-image-generator')
    return response

def lambda_handler(event, context):
    image_count = check_image_count()
    if image_count < MIN_IMAGE_COUNT:
        print(f"Only {image_count} images found. Generating synthetic images...")
        trigger_synthetic_generation()
    else:
        print("Sufficient images available.")
Active learning (needs to be explored more in depth)
This process will be used for both data selection and data labelling.
It is an ML approach that involves an iterative process of selecting and annotating the most informative data to train a model. Given a small set of labelled data and a large set of unlabelled data, active learning improves model performance, reduces labelling effort, and integrates human expertise for robust results; it has been reported to improve mean average precision by roughly 3x compared with training without active learning, in a cost-effective manner.
- Training Pipeline: At first, an image labelling model is set up, trained on a small set of manually labelled data, and will be used in the labelling pipeline.
- Labelling Pipeline: Then, the labelling pipeline takes a small subset of unlabelled data from AWS S3 bucket and outputs annotated images with the cooperation of above image labelling model and human expertise.
- Then, the labelling pipeline and training pipeline can be iterated gradually with more labelled data to enhance the model’s performance.
- In the labelling pipeline, an Amazon S3 Event Notification is invoked when a new batch of images comes into the Unlabelled Datastore S3 bucket, activating the labelling pipeline. The model produces the inference results on the new images. A customized judgement function selects parts of the data based on the inference confidence score or other user-defined functions. This data, with its inference results, is sent for a human labelling job on Amazon SageMaker Ground Truth created by the pipeline. The human labelling process helps annotate the data, and the modified results are combined with the remaining auto annotated data, which can be used later by the training pipeline.
- Model retraining happens in the training pipeline, where we use the dataset containing the human-labelled data to retrain the model. A manifest file is produced to describe where the files are stored, and the same initial model is retrained on the new data. After retraining, the new model replaces the initial model, and the next iteration of the active learning pipeline starts.
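A minimal sketch of the customised judgement function mentioned above is given below: it routes low-confidence inferences to the human labelling job and keeps the rest as auto-annotated data. The threshold and record structure are assumptions.

# Minimal sketch of the judgement function in the labelling pipeline; the threshold and
# record fields are illustrative assumptions.
from typing import Dict, List, Tuple

CONFIDENCE_THRESHOLD = 0.7

def split_for_labelling(inferences: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Each record is assumed to look like
    {"image_id": "...", "predictions": [{"label": "car", "score": 0.91}, ...]}."""
    needs_human, auto_labelled = [], []
    for record in inferences:
        scores = [p["score"] for p in record["predictions"]] or [0.0]
        # Send to the SageMaker Ground Truth job when the least confident prediction is weak.
        if min(scores) < CONFIDENCE_THRESHOLD:
            needs_human.append(record)
        else:
            auto_labelled.append(record)
    return needs_human, auto_labelled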
Synthetic Image Generation (needs to be explored more in depth)
Gather high-quality AV images (roads, vehicles, bicycles, etc.). Use a dataset like Waymo, BDD100K, or similar.
Preprocess the images (resize, align) and format them to fit StyleGAN2’s input specifications.
Train the model and generate the images.
Run a QC check on the quality of the generated images.
Pass a sample of these images for a manual human check.
If the QC check is satisfied, apply Active Learning for data selection to train the CV model.
Integrate with the current pipeline
Once the model is ready, we can integrate it with the existing pipeline to generate images, as explained above.
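As a small illustration of the preprocessing step, the sketch below resizes and centre-crops AV frames to a square resolution before StyleGAN2 training; the 1024x1024 target, file pattern and directory paths are assumptions.

# Minimal sketch: resize and centre-crop frames to a square resolution for StyleGAN2 training.
# Target resolution, file pattern and directories are illustrative assumptions.
from pathlib import Path

from PIL import Image
from torchvision import transforms

TARGET_RESOLUTION = 1024

preprocess = transforms.Compose([
    transforms.Resize(TARGET_RESOLUTION),      # scale the shorter edge
    transforms.CenterCrop(TARGET_RESOLUTION),  # crop to a square
])

def prepare_dataset(src_dir: str, dst_dir: str) -> None:
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for img_path in Path(src_dir).glob("*.png"):
        img = Image.open(img_path).convert("RGB")
        preprocess(img).save(Path(dst_dir) / img_path.name)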
Deploy the model (SPh-04)
We can refer to the existing process flow for ML model deployment into edge.
- Train an existing ML model and make it available in ONNX format
- Create AWS IoT Greengrass components using the generated files in ONNX format
- Deploy IoT Greengrass components to target edge devices.
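A minimal sketch of the first step (exporting a trained PyTorch CV model to ONNX before packaging it as a Greengrass component) is shown below; the ResNet-50 stand-in, checkpoint path and input resolution are assumptions.

# Minimal sketch: export a trained PyTorch CV model to ONNX for Greengrass packaging.
# The ResNet-50 stand-in, checkpoint path and input shape are illustrative assumptions.
import torch
import torchvision

model = torchvision.models.resnet50(weights=None)          # stand-in for the trained CV model
model.load_state_dict(torch.load("trained_cv_model.pth"))  # hypothetical checkpoint
model.eval()

dummy_input = torch.randn(1, 3, 640, 640)  # assumed input resolution
torch.onnx.export(
    model,
    dummy_input,
    "cv_model.onnx",
    input_names=["images"],
    output_names=["predictions"],
    opset_version=13,
    dynamic_axes={"images": {0: "batch"}},
)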
Monitor model inference and model evaluation metrics (SPh-05)
Once the model is deployed into production, prediction starts: the model classifies all the objects found in the pathway using the corresponding object detection model. In this phase, the following model evaluation metrics are continuously monitored in order to detect any degradation.
Performance Analysis
- Precision,
- Recall,
- F1 Score,
- Mean Average Precision (mAP)
- Intersection-over-union (IoU),
- Panoptic quality
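To make the monitored metrics concrete, a minimal IoU computation for axis-aligned boxes is sketched below; the (x1, y1, x2, y2) box format is an assumption.

# Minimal sketch: Intersection-over-Union for axis-aligned boxes in (x1, y1, x2, y2) format.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: prediction shifted 3 px to the right of a 10 x 10 ground-truth box
print(iou((0, 0, 10, 10), (3, 0, 13, 10)))  # ~0.54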
The performance of these models deteriorates over time for various reasons. In machine-learning terminology this process is called drift: computer vision models gradually become less effective as environmental factors and target objects change.
Drifts
There are various kinds of drift, as well as related failure modes, such as:
- Data drift: a change in the statistical distribution of the data.
- Image drift: Image data drift occurs when image properties change over time. For instance, certain images may have poor lighting, different background environments, different camera angles, etc.
- Occlusion: Occlusion happens when another object blocks or hides an image’s primary object of interest. It causes object detection models to classify objects wrongly and reduces model performance.
- Lack of annotated samples: CV models often require labeled images for training. However, finding sufficient domain-specific images with correct labels is challenging.
- Sensitive use cases: CV models usually operate in safety-critical applications like medical diagnosis and self-driving cars. Minor errors can lead to disastrous consequences.
- Model drift: drop in model performance due to drift.
- Concept drift: the relationship between the target variable and the input features changes.
Explainability
- Integrated gradients.
- XRAI
- Grad-CAM
Over time, if the production data distribution diverges from the training data distribution, the model should be retrained on the new production data to counter the drift. Various statistical tests, such as the Kolmogorov-Smirnov (K-S) test, the Population Stability Index (PSI) and the Page-Hinkley method, can be applied to detect these drifts. For instance, in autonomous vehicles the PSI can monitor the distributional changes in object categories between training and real-world driving datasets, ensuring the model's performance remains stable.
Model drift refers to the phenomenon where a machine learning model’s performance deteriorates over time due to changes in the underlying data distribution. Correcting model drift involves updating the model to adapt to these changes and maintain accuracy.
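A minimal PSI sketch over object-category frequencies is shown below, matching the monitoring example above; the category counts, smoothing constant and the common 0.2 alert threshold are assumed heuristics.

# Minimal sketch: Population Stability Index between training-time and production-time
# distributions of detected object categories. Counts, smoothing and the 0.2 threshold
# are illustrative assumptions.
import numpy as np

def psi(expected_counts: np.ndarray, actual_counts: np.ndarray, eps: float = 1e-6) -> float:
    expected = np.clip(expected_counts / expected_counts.sum(), eps, None)
    actual = np.clip(actual_counts / actual_counts.sum(), eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# Example: category counts for [car, pedestrian, bicycle, truck]
train_counts = np.array([5000, 2000, 800, 1200])
prod_counts = np.array([4000, 3500, 300, 1500])
value = psi(train_counts, prod_counts)
if value > 0.2:  # > 0.2 is commonly treated as significant drift
    print(f"PSI = {value:.3f}: distribution shift detected, consider retraining")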
- Automation Pipeline, Model Training and Production Monitoring
- Cost Management:
The core components of Kubecost are
1. Frontend
2. Prometheus
3. Costmodel
4. AWS Sig V4 proxy
When Kubecost is deployed in AWS EKS:
1. The cost model retrieves public pricing data from the AWS billing API.
2. Prometheus scrapes Kubernetes cluster and cost-analyzer metrics.
3. Prometheus remote-writes the metrics into the Amazon Managed Service for Prometheus (AMP) workspace.
4. The cost model queries the metrics from AMP through a sidecar container that uses the AWS SigV4 proxy to authenticate with AMP. It performs cost-allocation calculations and exposes the metrics.
5. The frontend routes requests to the cost model to query cost-allocation data, then exposes AWS EKS cluster cost and efficiency on the Kubernetes dashboard.
AI Risk Mitigation Control (SPh-07)
The following is a snapshot of different stages of AI system development, system components in each phase, potential security risks for each component and corresponding mitigation control
- Data Operations
- Raw data
- Insufficient access controls
- Missing data classification
- Poor data quality
- Ineffective storage and encryption
- Lack of data versioning
- Insufficient data lineage
- Lack of data trustworthiness
- Data legal
- Stale data
- Lack of data access logs
- Data preparation
- Preprocessing integrity
- SSO with IdP and MFA to limit who can access your data and AI platform
- Sync users and groups to inherit your organizational roles to access data
- Restrict access using IP access lists to limit IP addresses that can authenticate to your data and AI platform
- Restrict access using private link as a strong control that limits the source for inbound requests
- Control access to data and other objects for permissions model across all data assets to protect data and sources
- Enforce data quality checks on batch and streaming datasets for data sanity checks and automatically detect anomalies before they make it to the datasets
- Capture and view data lineage to capture the lineage all the way to the original raw data sources
- Explore datasets and identify problems
- Source Code Control
- Secure model features to reduce the risk of malicious actors manipulating the features that feed into Model training
- Data-centric MLOps and LLMOps promote models as code
- Monitor audit logs
- Feature manipulation
- SSO with IdP and MFA to limit who can access your data and AI platform
- Sync users and groups to inherit your organizational roles to access data
- Restrict access using IP access lists to limit IP addresses that can authenticate to your data and AI platform
- Restrict access using private link as a strong control that limits the source for inbound requests
- Secure model features to prevent and track unauthorized updates to features and for lineage or traceability
- Data-centric MLOps.
- Raw data criteria
- SSO with IdP and MFA to limit who can access your data and AI platform
- Sync users and groups to inherit your organizational roles to access data
- Restrict access using IP access lists to restrict the IP addresses that can authenticate to Databricks
- Restrict access using private link as strong controls that limit the source for inbound requests
- Use access control lists to control access to data, data streams and notebooks
- Data-centric MLOps for unit and integration testing.
- Adversarial partitions
- SSO with IdP and MFA to limit who can access your data and AI platform
- Sync users and groups to inherit your organizational roles to access data
- Restrict access using IP access lists to restrict the IP addresses that can authenticate to Databricks
- Restrict access using private link as strong controls that limit the source for inbound requests
- Track and reproduce the training data used for ML model training to track and reproduce the training data partitions and the human owner accountable for ML model training, as well as identify ML models and runs derived from a particular dataset
- Data-centric MLOps for unit and integration testing
- Datasets
- Data poisoning
- SSO with IdP and MFA to limit who can access your data and AI platform
- Sync users and groups to inherit your organizational roles to access data
- Restrict access using IP access lists to restrict the IP addresses that can authenticate to your data and AI platform
- Restrict access using private link as strong controls that limit the source for inbound requests
- Control access to data and other objects for permissions model across all data assets to protect data and sources
- Enforce data quality checks on batch and streaming datasets for data sanity checks, and automatically detect anomalies before they make it to the datasets
- Capture and view data lineage to capture the lineage all the way to the original raw data sources
- Secure model features
- Track and reproduce the training data used for ML model training and identify ML models and runs derived from a particular dataset
- Share data and AI assets securely
- Audit actions performed on datasets
- Monitor audit logs
- Ineffective storage and encryption
- Encrypt data at rest
- Encrypt data in transit
- Control access to data and other
- Label flipping
- Encrypt data at rest
- Encrypt data in transit
- Control access to data and other objects for metadata encryption across all data assets
- Catalog and governance
- Lack of traceability and transparency of model assets
- Control access to data and other objects for permissions model across all data assets to protect data and sources
- Enforce data quality checks on batch and streaming datasets for data sanity checks, and automatically detect anomalies before they make it to the datasets
- Capture and view data lineage to capture the lineage all the way to the original raw data sources
- Secure model features
- Track and reproduce the training data used for ML model training and identify ML models and runs derived from a particular dataset
- Govern model assets for traceability
- Monitor audit logs
- Lack of end-to-end ML lifecycle
- Manage end-to-end machine learning lifecycle for measuring, versioning, tracking model artifacts, metrics and results
- Data-centric MLOps unit and integration testing
- Monitor data and AI system from a single pane of glass
- Model Operations
- ML algorithm
- Lack of tracking and reproducibility of experiments
- Track ML training runs for documenting, measuring, versioning, tracking model artifacts including algorithms, training environment, hyperparameters, metrics and results
- Data-centric MLOps promote models as code and automate ML tasks for cross-environment reproducibility
- Monitor audit logs
- Model drift
- Track training data with MLflow and Delta Lake to track upstream data changes
- Secure model features to track changes to features
- Monitor data and AI system from a single pane of glass for changes and take action when changes occur. Have a feedback loop from a monitoring system and refresh models over time to help avoid model staleness.
- Hyperparameters stealing
- Track ML training runs in the model development process, including parameter settings securely
- Use access control lists via workspace access controls
- Data-centric MLOps employing separate model lifecycle stages by UC schema
- Malicious libraries
- Third-party library control to limit the potential for malicious third-party libraries and code to be used on mission-critical workloads.
- Evaluation
- Evaluation data poisoning
- SSO with IdP and MFA to limit who can access your data and AI platform
- Sync users and groups to inherit your organizational roles to access data
- Restrict access using IP access lists to restrict the IP addresses that can authenticate to your data and AI platform
- Restrict access using private link as strong controls that limit the source for inbound requests
- Control access to data and other objects for permissions model across all data assets to protect data and sources
- Enforce data quality checks on batch and streaming datasets for data sanity checks, and automatically detect anomalies before they make it to the datasets
- Capture and view data lineage to capture the lineage all the way to the original raw data sources
- Evaluate models to capture performance insights for language models
- Trigger actions in response to a specific event via automated jobs to notify human-in-the-loop (HITL)
- Data-centric MLOps unit and integration testing
- Insufficient evaluation data
- Build models with all representative, accurate and relevant data sources to evaluate on clean and sufficient data
- Model build
- Backdoor machine learning/Trojaned model
- SSO with IdP and MFA to limit who can access your data and AI platform
- Use access control lists to limit who can bring models and limit the use of public models
- Data-centric MLOps promote models as code using CI/CD. Scan third-party models continuously to identify hidden cybersecurity risks and threats such as malware, vulnerabilities and integrity issues to detect possible signs of malicious activity, including malware, tampering and backdoors. See resources section for third-party tools.
- Register, version, approve, promote and deploy models and scan models for malicious code when using thirdparty models or libraries
- Manage end-to-end machine learning lifecycle
- Control access to data and other objects
- Run models in multiple layers of isolation. Models are considered untrusted code: deploy models and custom code in isolated environments
- Restrict outbound connections from models to prevent attacks to exfiltrate data, inference requests and responses
- Monitor audit logs
- Model assets leak
- Control access to models and model assets
- SSO with IdP and MFA to limit who can access your data and AI platform
- Sync users and groups to inherit your organizational roles to access data
- Restrict access using IP access lists that can authenticate to your data and AI platform
- Restrict access using private link as strong controls that limit the source for inbound requests
- Control access to data and other objects for permissions model across all data assets to protect data and sources
- Data-centric MLOps to maintain separate model lifecycle stages
- Manage credentials securely to prevent credentials of data sources used for model training from leaking through models
- Monitor audit logs
- ML supply chain vulnerabilities
- Build models with all representative, accurate and relevant data sources to minimize third-party dependencies for models and data where possible
- Pretrain a large language model (LLM) on your own IP
- Use hardened runtime for machine learning
- Third-party library control
- Data-centric MLOps promote models as code using CI/CD. Scan third-party models continuously to identify hidden cybersecurity risks and threats such as malware, vulnerabilities and integrity issues to detect possible signs of malicious activity, including malware, tampering and backdoors. See resources section for third-party tools.
- Evaluate models and validate (aka, stress testing) to verify reported function and disclosed weaknesses in the models
- Restrict outbound connections from models to prevent attacks to exfiltrate data, inference requests and responses
- Source code control attack
- Source code control to control and audit your knowledge object integrity
- Third-party library control for third-party library integrity
- Restrict outbound connections from models to prevent attacks to exfiltrate data, inference requests and responses
- Model management
- Model attribution
- Control access to data and other objects for permissions model across all data assets to protect data and sources
- Create model aliases, tags and annotations for documenting and discovering models
- Build MLOps workflows with human-in-the-loop (HITL) , model stage management and approvals
- Share data and AI assets securely
- Model theft
- SSO with IdP and MFA to limit who can access your data and AI platform
- Sync users and groups to inherit your organizational roles to access data
- Restrict access using IP access lists that can authenticate to your data and AI platform
- Restrict access using private link as strong controls that limit the source for inbound requests
- Control access to data and other objects for permissions model across all data assets to protect data and sources
- Control access to models and model assets
- Encrypt models
- Secure model serving endpoints to prevent access and compute theft
- Share data and AI assets securely
- Streamline the usage and management of rate-limit APIs
- Manage credentials securely to prevent credentials of data sources used for model training from leaking through models
- Use clean rooms to collaborate in a secure environment
- Monitor audit logs
- Model lifecycle without HITL
- Control access to data and other objects for permissions model across all data assets to protect data and sources
- Control access to models and model assets
- Create model aliases, tags and annotations
- Build MLOps workflows with human-in-the-loop (HITL) with permissions, versions and approvals to promote models to production
- Data-centric MLOps promote models as code using CI/CD
- Model inversion
- SSO with IdP and MFA to limit who can access your data and AI platform
- Sync users and groups to inherit your organizational roles to access data
- Restrict access using IP access lists that can authenticate to your data and AI platform
- Restrict access using private link as strong controls that limit the source for inbound requests
- Control access to data and other objects for permissions model across all data assets to protect data and sources
- Control access to models and model assets
- Encrypt models
- Secure model serving endpoints
- Monitor audit logs
- Model Deployment & Serving
- Model Serving inference requests
- Model inversion
- SSO with IdP and MFA to limit who can access your data and AI platform
- Sync users and groups to inherit your organizational roles to access data
- Restrict access using IP access lists that can authenticate to your data and AI platform
- Restrict access using private link as strong controls that limit the source for inbound requests
- Control access to data and other objects for permissions model across all data assets to protect data and sources
- Control access to models and model assets
- Encrypt models
- Secure model serving endpoints
- Denial of service (DOS)
- SSO with IdP and MFA to limit who can access your data and AI platform
- Sync users and groups to inherit your organizational roles to access data
- Restrict access using IP access lists that can authenticate to your data and AI platform
- Restrict access using private link as strong controls that limit the source for inbound requests
- Control access to data and other objects for permissions model across all data assets to protect data and sources
- Control access to models and model assets
- Store and retrieve embeddings securely to integrate data objects for security-sensitive data that goes into
- Model Serving inference responses
- Lack of audit and monitoring inference quality
- Track model performance to evaluate quality
- Set up monitoring alerts
- Set up inference tables for monitoring and debugging models to capture incoming requests and outgoing responses to your model serving endpoint and log them in a table. Afterward, you can use the data in this table to monitor, debug and improve ML models and decide if these inferences are of quality to use as input to model training.
- Monitor audit logs
- Output manipulation
- Encrypt models for model endpoints with encryption in transit
- Secure model serving endpoints
- Black-box attacks
- Encrypt models for model endpoints with encryption in transit
- Secure model serving endpoints
- Operations and Platforms
- ML operations
- Lack of MLOps — repeatable enforced standards
- Evaluate models to capture performance insights for language models
- Trigger actions in response to a specific event to trigger automated jobs to keep human-in-the-loop (HITL)
- ML platform
- Lack of vulnerability management
- Platform security — vulnerability management to build, deploy and monitor AI/ML models on a platform that takes responsibility seriously and shares remediation timeline commitments
- Lack of penetration testing and bug bounty
- Platform security — penetration testing and bug bounty to build, deploy and monitor AI/ML models on a platform that takes responsibility seriously and shares remediation timeline commitments. A bug bounty program removes a barrier researchers face in working with Databricks.
- Lack of incident response
- Unauthorized privileged access
- Poor SDLC
- Lack of compliance
Non-Functional Requirements (SPh-08)
- Data Engineering
- Data Ingestion Latency: < 100 ms per image.
- Storage Reliability: 99.999% availability for cloud data storage.
- Data Throughput: > 500 images per second.
- Model Training
- Training Time: < 2 hours per model iteration.
- Resource Utilization: 85% GPU efficiency.
- Model Accuracy (Baseline): ≥ 90% on test datasets.
- Model Building and Optimization
- Inference Latency: < 50 ms per image.
- Model Size: ≤ 100 MB for deployment efficiency.
- Energy Consumption: ≤ 15 Watts on edge devices.
- Deployment
- Deployment Time: < 10 minutes per model version.
- Scalability: Capable of 10,000 simultaneous edge deployments.
- Rollback Time: < 5 minutes to revert to previous version.
- Monitoring
- Uptime: 99.9% availability of the monitoring system.
- Error Detection Latency: < 5 seconds to detect critical errors.
- Data Accuracy Drift Detection: Identify 5% drift within 1 hour.
Roadmap
https://docs.google.com/spreadsheets/d/10f6aHshluHjAPn0Q9lad2QRwPUFCfFMjaWJXivurrHM/edit?gid=0#gid=0
Technology Assessment
Project Management
Framework
We can introduce the Dynamic Systems Development Method (DSDM) in our project if Bosch does not follow the Scaled Agile Framework.
However, a major challenge usually arises in the requirement-finalisation phase. Through frequent stakeholder collaboration, required POCs and requirement-prioritisation techniques, this challenge can be addressed, and the requirements should be frozen at least 3 weeks before the start of a PI, so that the engineering team has enough time to brainstorm the scope of the next PI.
A simple approach that may help in addressing the challenge of technology selection and release deadlines:
- The POC for technology assessment should be done one PI ahead of the planned implementation PI and should be approved by the approver with the required InfoSec certificates. For instance, if implementation of the data pipeline is planned for PI-2, then the technology stack should be approved by the end of PI-1, along with the design doc.
- In the next PI, i.e., PI-2, the data pipeline implementation should start.
- Each feature must be clearly defined with acceptance criteria, OKRs and timelines.
- Give the engineering team enough latitude to produce effort estimates and to surface any technical dependencies/risks not yet identified.
- Provide the required trainings and supporting documents.
- Most important: collaborate and communicate continuously to identify any challenges the team is facing.
- Continuously follow up and monitor.
This framework has proven successful in practice.
Phase- II (Vision after 1 year)
Minor enhancements + Bug Fixes + Integration with Active Learning and Synthetic Image generation endpoints, if not integrated in Phase- I. Integration approaches are provided in respective sections.
Phase- III (Vision after 2 years)
This architecture is proposed based on LAECIPS, a Large Vision Model Assisted Adaptive Edge-Cloud Collaboration framework, developed to improve real-time perception in IoT applications, such as autonomous driving, by combining the strengths of edge devices and cloud resources. By blending edge and cloud models and continuously refining the edge model with cloud feedback, LAECIPS enables scalable, real-time, and accurate IoT-based perception.
Ref: https://arxiv.org/pdf/2404.10498
Limitation of existing edge-cloud deployment model
- The tight coupling between the large and small models limits the system flexibility of the current methods for fully leveraging large vision models.
- The collaboration strategy needs to be further optimized for both high accuracy and low latency, while demonstrating its capability to adapt to the dynamic IoT environment.
- Inference outputs from a large vision model (e.g., SAM) may lack semantic labels and thus need to be combined with the edge model inference results.
Features
- It enables flexible utilization of both large and small models in an online manner to solve this problem.
- We can design a hard input mining-based edge-cloud co-inference strategy that achieves higher accuracy and lower task processing latency.
- The technique enhances model robustness, as the model progressively learns from edge cases, becoming better at handling diverse or ambiguous data.
- A continual training of the small model to fit in with the dynamic environmental changes in the IoT environment.
- We can analyze the theoretical generalization capability of LAECIPS to prove the feasibility of incorporating large vision models, small edge models, and edge-cloud co-inference strategies into this framework in a plug-and-play manner.
| Aspect | LVMs (Large Vision Models) | CNNs (Convolutional Neural Networks) |
|---|---|---|
| Performance | Higher accuracy and generalization, but slower | Competitive performance, faster inference, task-specific |
| Computational Cost | High GPU/TPU requirements, costly at scale | More efficient, can run on mobile/edge devices |
| Scalability | Highly scalable with vast data and new tasks | Scales well on fixed-size tasks, less generalizable |
| Deployment Complexity | Higher infrastructure cost, complex to deploy | Mature tooling, easier optimization and deployment |
| Latency | Slower inference, needs optimization | Typically faster, suited to real-time applications |
| Interpretability | Harder to interpret, post-hoc analysis | More interpretable via feature/activation maps |
| Cost | Higher deployment and compute costs | Generally cheaper in both training and inference |
| Use Case Fit | Best for multimodal, few-shot, large-scale tasks | Suitable for specific, well-defined image tasks |
We will consider the example of SAM (Segment Anything Model) and how it can improve our AV system. SAM is a versatile vision model primarily designed for image segmentation tasks, where the goal is to identify and delineate specific objects or regions within an image.
Features of SAM: object segmentation, instance segmentation, interactive segmentation, semantic segmentation, zero-shot segmentation, part segmentation, video frame segmentation (with modification), foreground-background segmentation, complex scene segmentation, and fine-grained, detailed segmentation.
Workflow of LAECIPS
- In steps 1 and 2, a lightweight model performs inference on incoming data to produce preliminary results.
- In steps 3 and 4, the hard input mining module classifies this data into “easy” and “hard” inputs based on accuracy.
- Step 5 outputs results for easy inputs directly, reducing latency, while hard inputs are sent to the cloud for enhanced accuracy.
- In steps 6 and 7, the small and large models jointly process hard inputs, and their results are fused for co-inference.
- In steps 8 and 9, co-inference results are returned to the edge and saved in the cloud’s replay buffer. Once the buffer reaches a set threshold, the cloud retrains the edge model using this data, and
- In step 10, the updated model is deployed back to the edge.
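A minimal sketch of the hard-input mining routing (steps 1-7) is given below; the confidence threshold and the edge/cloud callables are assumptions rather than the LAECIPS reference implementation.

# Minimal sketch of LAECIPS-style hard-input mining: easy inputs are answered at the edge,
# hard (low-confidence) inputs go to the cloud for co-inference. Threshold and callables
# are illustrative assumptions.
from typing import Callable, Dict

HARD_INPUT_THRESHOLD = 0.75

def route_inference(frame,
                    edge_model: Callable[[object], Dict],
                    cloud_co_inference: Callable[[object, Dict], Dict]) -> Dict:
    edge_result = edge_model(frame)               # steps 1-2: lightweight edge inference
    if edge_result["confidence"] >= HARD_INPUT_THRESHOLD:
        return edge_result                        # step 5: easy input, answered with low latency
    # Steps 6-7: hard input, fuse edge and large-model results in the cloud.
    fused = cloud_co_inference(frame, edge_result)
    # Steps 8-9 would also append (frame, fused) to the cloud replay buffer for edge retraining.
    return fused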
In this phase, more ML-skilled resources are required compared with data engineers, as the data pipeline should be stable by this time.
References