Automating UI Testing with MultiModal LLMs: A Practical Walkthrough

Key Takeaways

By the end of this blog, you'll understand:

  • The importance of UI testing and its challenges
  • How multimodal AI can revolutionize the testing process
  • Three practical approaches to implementing AI for UI testing:
    1. Computer Vision + Multi-modal LLMs
    2. Gemini API Integration
    3. Inference with Llama 3.2-Vision
  • The benefits and limitations of each approach
  • Real-world examples of AI-driven UI testing
  • Future trends in AI-powered UI testing

The complete source code for all the approaches is available on GitHub (Approach 1 & 3) and Hugging Face Spaces (Approach 2).

Why UI Testing Matters More Than Ever

Have you ever launched a feature-rich app, only to watch users struggle with navigation? Or opened an app only to be greeted with frustrating glitches? If so, you're not alone.

UI testing exists to ensure that your product’s interface is not only functional but also intuitive and enjoyable.

From the most basic mobile app to the most sophisticated web platform, the way users interact with your product can make or break its success, which is why UI testing is essential for every software product.

The Fundamentals of UI Testing

UI testing ensures that every element, from buttons to menus to text fields, works as intended. But it’s not just about catching bugs; it's about creating an experience that is intuitive and frustration-free for users.

What Does UI Testing Involve?

Here’s what typical UI testing looks at:

  • Making sure buttons, menus, and forms work properly
  • Checking that text is easy to read and looks nice
  • Verifying that the app responds well to user actions
  • Ensuring the layout adjusts correctly on different screen sizes

Without thorough UI testing, even the most innovative app can frustrate users to the point of abandonment.

There are different types of UI testing, each designed to make sure the user interface works smoothly and feels right for the user. The major types are:

  1. Manual UI Testing: This is when testers explore the UI themselves, checking for bugs, usability, and how responsive everything feels.
  2. Automated UI Testing: Makes use of automated tools to simulate user actions.
  3. Visual Testing: As the name suggests, it focuses on how the UI looks, making sure layouts, colours, and fonts all appear correctly.
  4. Functional UI Testing: Checks that every interactive element, like buttons and forms, works properly and takes you where you are supposed to go.

The Challenges of Traditional UI Testing

While traditional UI testing is crucial, it comes with its own set of challenges:

  • Manual testing often takes significant time and resources.
  • Even the most diligent testers can overlook issues.
  • As applications grow, maintaining effective UI tests becomes increasingly difficult.
  • Modern interfaces are increasingly complex, making comprehensive testing a significant challenge.
  • Ensuring cross-device compatibility adds another layer of complexity.

The Role of Multimodal AI in UI Testing

To address the challenge of automating UI testing through multimodal approaches, three distinct solution strategies were implemented.

  1. Computer Vision + Multi-modal LLMs: The first strategy was a hybrid solution combining computer vision and multimodal large language models (LLMs), designed specifically for images.
  2. Gemini API Integration: The second strategy was a direct integration with the Gemini API, which works well with both videos and images.
  3. Inference with Llama 3.2-Vision: The third approach uses the latest Llama 3.2-Vision models, which excel at vision tasks.

All three methods make testing easier for developers. They just need to upload a screenshot and mention the UI element they want to test. The system then automatically creates the relevant test cases.

Implementing Multimodal AI for UI Testing

Approach 1: Computer Vision + Multi-modal LLMs

Architecture

The first approach uses a general-purpose, open-source Multimodal Large Language Model (MLLM), OpenGVLab/InternVL2-8B, to analyse the screenshot and generate valid test cases. To significantly improve the accuracy and efficiency of the solution, I incorporated computer vision (CV) techniques for annotating the UI elements within the images. Here's how it works:

  1. A computer vision model detects and highlights UI components in the screenshot by drawing bounding boxes around them. This preprocessing step helps the language model better understand the specific interface context.
  2. The annotated image, along with optional user-provided text, is passed to the multimodal LLM, OpenGVLab/InternVL2-8B. By guiding the model with both visual cues and textual information, we observed a remarkable 49% improvement in the accuracy and quality of the generated test cases. A minimal sketch of this two-step pipeline follows the list.
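
To make the flow concrete, here is a minimal sketch of the two-step pipeline. It assumes the UI-element detector is available as an Ultralytics YOLO checkpoint (the weights path ui_detector.pt and the screenshot filename are placeholders rather than the project's actual artefacts), and the final InternVL2-8B call is only indicated in a comment.

# Sketch: annotate UI elements with a CV detector, then hand the annotated image to the MLLM
from PIL import Image, ImageDraw
from ultralytics import YOLO

def annotate_ui_elements(image_path, weights_path="ui_detector.pt"):
    """Draw labelled bounding boxes around detected UI elements (Button, TextField, ...)."""
    detector = YOLO(weights_path)  # hypothetical fine-tuned UI-element checkpoint
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    result = detector(image)[0]
    for box, cls_id in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
        label = result.names[int(cls_id)]
        draw.rectangle(box, outline="red", width=3)
        draw.text((box[0], max(box[1] - 12, 0)), label, fill="red")
    return image

annotated = annotate_ui_elements("apple_music_dashboard.png")  # placeholder screenshot
annotated.save("annotated.png")
# The annotated image and the user's prompt are then passed to OpenGVLab/InternVL2-8B
# (via the chat interface from its model card) to generate the test cases.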

Computer Vision for UI Detection

The computer vision model used for detecting UI components was trained on 578 interface images and can recognize the following element classes:

  • Checkbox
  • ComboBox
  • Radio
  • TextArea
  • TextField
  • Toggle
  • Button

The key performance metrics of the CV model are as follows:

  • Mean Average Precision (mAP): 99.5% – This metric reflects the average precision across all element categories.
  • Precision: 95.1% – The accuracy of the positive detections.
  • Recall: 87.7% – The ability of the model to correctly detect all relevant UI elements.

You can view the model details and access it here.

Note that we are no longer passing just any screenshot; we're feeding the model an image with its UI elements annotated. This gives the language model detailed visual context, enhancing its understanding of the interface and leading to more accurate and relevant test case generation.

To illustrate this, let's take a look at a real-world example. Consider the Apple Music Dashboard screenshot below,

Input Image

After processing this image with the computer vision model, the output looks like this,

Annotated image

The model expertly identifies and annotates all the UI elements, providing a detailed visual context that enhances the language model's understanding of the interface.

For example, if we ask the model to test the 'Home' button, the generated output would look something like this,


**Test Case:** Home button

**Description:** The Home button allows users to navigate through different sections of a digital product, typically related to information, favorites, and other interactive features.

**Pre-conditions:**
- The digital product is accessible under the "New" section.
- The user navigates to the "Home" section, where they expect to be able to access their profile, favorite songs, and other frequently used features.

**Testing Steps:**
1. The test case simulates a user navigating to the "Home" section of a digital product using a device.
2. The user selects the "Home" button to open the corresponding section of the product.
3. A visual confirmation should appear on the screen, showing a user-friendly representation of the selected section.
4. The user expects to be able to access their profile icon, favorite songs, and other essential sections related to their preferences and interests.

**Expected Result:** Upon successfully navigating to the "Home" section using appropriate device settings, the user is expected to see a representation of their profile or a list of their favorite songs and other important sections accessible to them.

**Context for the Test Case:** The test case is designed to verify that the "Home" button enables the user to perform necessary navigational actions within a digital platform's user interface.

**Additional Context:** The target digital product shows a "Home" button in its user interface section. This section appears to contain user and feature-specific information. 

**Instructions:**
1. Use any device with a recognized user interface layout to simulate navigating to the "Home" section.
2. Confirm the visual display of selected sections, including the profile icon, favorite songs, and other relevant sections.
3. Ensure that all expected usages and functionalities within the "Home" section are visible to the user.   
  
**Follows the provided format and remains within the scope without additional explanations except for details provided in this task.**


With this approach, we observed a 49% improvement in the overall accuracy and quality of the output. This process ensures that the model is guided by both visual cues and textual context, leading to more precise and relevant test case generation.

Among the numerous open-source multimodal LLMs available, the one that stood out was OpenGVLab/InternVL2-8B. This model integrates visual and linguistic capabilities, enabling it to generate rich test cases based on the input UI data and annotated image components. It effectively handles the dual input (image and text), making it ideal for UI testing scenarios.

This approach represents a significant step forward in automating and enhancing the UI testing process for developers.

  • Vision part - **InternViT-300M-448px**  is an optimized vision foundation model, designed for tasks like feature extraction, especially from complex images. It is a smaller, more efficient version of the InternViT-6B-448px, known for its robust OCR capabilities and dynamic handling of high-resolution images.
  • Language part - **internlm2_5-7b-chat** is an advanced 7-billion parameter model designed to excel in practical applications, particularly in reasoning tasks and tool utilization. It surpasses competitors like Llama3 and Gemma2-9B in reasoning benchmarks, such as Math and General Knowledge tests.

The code was executed on Google Colab, with Gradio integrated to provide a user-friendly interface for processing the screenshots. You can find it on GitHub here.
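
For reference, a Gradio wrapper around this kind of flow can be as small as the sketch below; generate_test_cases here is a placeholder for the annotate-then-prompt pipeline described above, not the exact function from the repository.

import gradio as gr

def generate_test_cases(screenshot, element_description):
    # Placeholder: run the CV annotation + InternVL2-8B prompting pipeline here
    return f"Generated test cases for: {element_description}"

demo = gr.Interface(
    fn=generate_test_cases,
    inputs=[gr.Image(type="pil", label="UI screenshot"),
            gr.Textbox(label="UI element to test")],
    outputs=gr.Textbox(label="Generated test cases"),
    title="UI Test Case Generator",
)
demo.launch()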

Challenges with this Approach

  1. A key limitation of this method was that it only processed images, requiring developers to pass a screenshot for every UI element. This made the process quite tedious.
  2. The multimodal LLM occasionally misidentified or hallucinated non-existent UI elements despite the computer vision preprocessing, leading to inaccurate test case generation.
  3. Preprocessing images with computer vision and then passing them to the LLM significantly increased inference times, hindering efficiency when dealing with numerous UI screenshots.
  4. Employing 4-bit quantization to optimize the 8B model's performance reduced its size and computational requirements but noticeably decreased accuracy, compromising the reliability of generated test cases (a typical 4-bit loading configuration is sketched after this list).
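
For context, loading the 8B checkpoint in 4-bit typically looks like the bitsandbytes configuration sketched below; this is an assumed setup via Hugging Face transformers, and the exact settings used in the project may differ.

import torch
from transformers import AutoModel, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # 4-bit weights to fit the 8B model in limited VRAM
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL2-8B",
    quantization_config=quant_config,
    trust_remote_code=True,               # InternVL2 ships custom modelling code
    device_map="auto",
)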

Recognising these limitations, we decided to implement a more streamlined and effective approach, which led us to the Gemini API integration.

Approach 2: With Gemini API

The second approach we explored involved a direct integration with the Gemini API. One of the key advantages of this approach is its ability to accept videos, unlike the first approach, which only accepts images.

The Gemini API is capable of running inference on images, series of images, or videos. When provided with such content, Gemini can perform a variety of tasks, including:

  • Describing or answering questions about the content
  • Summarising the content
  • Extrapolating from the content

You can learn more about the Gemini API and how to get started with your own API keys here.

For our project, we utilised the gemini-1.5-pro-latest model. This model, along with Gemini Flash, supports up to approximately an hour of video data. The video must be in one of the following formats:

mp4, mpeg, mov, avi, x-flv, mpg, webm, wmv, or 3gpp.

Google offers a free tier for the Gemini API, which is perfect for testing purposes. This free tier offers lower rate limits, with 15 requests per minute. You can find more information about this here.
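
To give a sense of what the integration looks like, here is a hedged sketch using the official google-generativeai Python SDK; the API key and video filename are placeholders.

import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder; use your own key
model = genai.GenerativeModel("gemini-1.5-pro-latest")

# Video input: upload the file, then wait until the API has finished processing it
video = genai.upload_file(path="dashboard_walkthrough.mp4")  # placeholder path
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

prompt = "Generate UI test cases for the 'Home' button shown in this recording."
response = model.generate_content([video, prompt])
print(response.text)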

In our demo, we processed the user-uploaded videos at 30 frames per second (FPS) for faster inference. However, you can also choose to send a limited number of extracted frames from the uploaded video or pass every nth frame. For example, you can modify the following code to pass every 10th frame to the model:


# `frames` is the list of frames extracted from the uploaded video; `model` is the GenerativeModel
frames_to_pass = frames[::10]  # every 10th frame gets passed to the model
response = model.generate_content([prompt] + [frame for frame in frames_to_pass])
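
For context, the frames list used above is assumed to be a list of PIL images extracted from the uploaded video, roughly along these lines (the video path is a placeholder):

import cv2
from PIL import Image

def extract_frames(video_path):
    """Read a video with OpenCV and return its frames as PIL images."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = extract_frames("dashboard_walkthrough.mp4")  # placeholder path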

The request is paired with a well-structured prompt that defines the task to be performed, which keeps the model's response focused and quick.

Combined with a Gradio interface that provides a user-friendly UI for easy operation, this approach offers a powerful and flexible solution for our project.

To illustrate the output of the Gemini API, let's consider the same Apple Music Dashboard screenshot used in the first approach (you can pass images as well as videos here). When we pass this image to the Gemini API and ask it to generate test cases for the 'Home' button, the output looks something like this,

Output

To showcase our work, we have created a Hugging Face Space for this project. You can find it running here,

Hugging face Spaces - https://huggingface.co/spaces/mavihsrr/UI-Test_Case_Generator-MLLM

Feel free to test it out on your own dataset and let us know how it goes.

Approach 3: Inference with Llama 3.2-Vision

The third approach we explored makes use of the latest Llama 3.2-Vision models, which excel at various vision tasks. Developed by Meta, the Llama 3.2-Vision collection consists of pre-trained and instruction-tuned image reasoning generative models in 11B and 90B sizes, capable of processing both text and image inputs to generate text outputs.

Model Architecture and Training

Llama 3.2-Vision is built upon the Llama 3.1 text-only model, an auto-regressive language model that utilizes an optimized transformer architecture. The tuned versions employ supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

To support image recognition tasks, Llama 3.2-Vision incorporates a separately trained vision adapter that integrates with the pre-trained Llama 3.1 language model. This adapter consists of a series of cross-attention layers that feed image encoder representations into the core LLM. The models were trained on an extensive dataset of 6 billion (image, text) pairs, with a knowledge cutoff date of December 2023. Both the 11B and 90B parameter models support a context length of 128k tokens and utilize Grouped-Query Attention (GQA) for improved inference scalability.
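
As a rough sketch, running inference with the 11B instruct variant through Hugging Face transformers looks roughly like the following, based on the usage shown on the public model card (the checkpoint is gated, and the screenshot path is a placeholder):

import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # gated checkpoint; request access first
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("apple_music_dashboard.png")  # placeholder screenshot path
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Generate UI test cases for the 'Home' button in this screenshot."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(output[0], skip_special_tokens=True))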

Capabilities and Use Cases

Llama 3.2-Vision models are optimized for a wide range of tasks, including:

  • Visual recognition
  • Image reasoning
  • Captioning
  • Answering general questions about an image

These models outperform many available open-source and closed multimodal models on common industry benchmarks. Their ability to process both images and text as inputs enables additional use cases such as:

  • Visual Question Answering (VQA) and Visual Reasoning
  • Document Visual Question Answering (DocVQA)
  • Image Captioning
  • Image-Text Retrieval
  • Visual Grounding

Output Example

To demonstrate the capabilities of Llama 3.2-Vision, let's consider the same Apple Music Dashboard screenshot used in the previous approaches.

Input

When we pass this image to the Llama 3.2-Vision model and ask it to generate test cases for the 'Home' button, the output would look something like this,


UI Test Cases for the 'Home Button'

Test Case 1: Clicking the 'Home Button'

	Steps:
	1. Click the 'Home button' on the screen.
	2. Verify that the app navigates to the home page or the main menu.
	3. Expected Outcome: The app should smoothly transition to the home page or display the main menu options.

Test Case 2: Accessibility Testing

	Steps:
	1. Use assistive technologies (e.g., screen readers) to interact with the 'Home button'.
	2. Verify that the 'Home button' is accessible and can be navigated using assistive technologies.
	3. Expected Outcome: The 'Home button' should be accessible and usable for users with disabilities.

Test Case 3: Visual Hierarchy and Alignment

	Steps:
	1. Inspect the visual design of the 'Home button' on the screen.
	2. Verify that the 'Home button' is prominently displayed and has a clear visual hierarchy.
	3. Expected Outcome: The 'Home button' should be visually appealing and easy to locate on the screen.

Test Case 4: Color Scheme and Contrast

	Steps:
	1. Inspect the color scheme and contrast of the 'Home button' on the screen.
	2. Verify that the 'Home button' has sufficient contrast and is easily distinguishable from the background.
	3. Expected Outcome: The 'Home button' should have a color scheme and contrast that is visually appealing and accessible.

Test Case 5: Responsiveness

	Steps:
	1. Test the 'Home button' on different devices and screen sizes.
	2. Verify that the 'Home button' is responsive and functions correctly across various devices and screen sizes.
	3. Expected Outcome: The 'Home button' should be responsive and work seamlessly across different devices and screen sizes.

Test Case 6: User Experience
	
	Steps:
	1. Test the 'Home button' with real users.
	2. Evaluate the user experience and gather feedback.
	3. Expected Outcome: The 'Home button' should provide a positive user experience, and users should be able to easily navigate the app using the 'Home button'.

Test Case 7: Error Handling for Unexpected User Behavior
	
	Steps:
	1. Simulate unexpected user behavior (e.g., clicking the 'Home button' multiple times in quick succession).
	2. Verify that the app handles the unexpected behavior gracefully and does not crash or freeze.
	3. Expected Outcome: The app should handle unexpected user behavior without any issues and maintain a smooth user experience.

Test Case 8: Performance Testing

	Steps:
	1. Test the 'Home button' under heavy load or with multiple users.
	2. Verify that the 'Home button' performs well under heavy load or with multiple users.
	3. Expected Outcome: The 'Home button' should perform well under heavy load or with multiple users without any significant performance issues.

Test Case 9: Security Testing

	Steps:
	1. Test the 'Home button' for security vulnerabilities.
	2. Verify that the 'Home button' is secure and does not pose any security risks.
	3. Expected Outcome: The 'Home button' should be secure and free from any security vulnerabilities.

Test Case 10: Compatibility Testing
	
	Steps:
	1. Test the 'Home button' on different browsers and devices.
	2. Verify that the 'Home button' is compatible with various browsers and devices.
	3. Expected Outcome: The 'Home button' should be compatible with different browsers and devices without any issues.

Test Case 11: Usability Testing
	
	Steps:
	1. Conduct usability testing with real users.
	2. Evaluate the user experience and gather feedback.
	3. Expected Outcome: The 'Home button' should be user-friendly, and users should be able to easily navigate the app using the 'Home button'.

Test Case 12: Additional Accessibility Testing
	
	Steps:
	1. Evaluate the 'Home button' for compliance with accessibility standards (e.g., WCAG).
	2. Verify that the 'Home button' is usable for individuals with various disabilities.
	3. Expected Outcome: The 'Home button' should meet accessibility standards and be usable for all individuals, regardless of ability.
	



The Llama 3.2-Vision model accurately identifies the 'Home' button and generates comprehensive test cases, considering various scenarios and edge cases. The output is well-structured and easy to understand, making it a valuable tool for developers and testers.

I have got to say, the new Llama 3.2 collection has truly impressed me, especially the lightweight models. Their performance is exceptional!

Comparative Analysis

Let’s take a closer look at how these approaches stack up against each other.

| Aspect | Computer Vision + Multi-modal LLMs | Gemini API Integration | Inference with Llama 3.2-Vision |
| --- | --- | --- | --- |
| Input Types | Images only | Images and videos | Images, videos and text |
| Preprocessing | Computer vision model for UI element detection and annotation (mAP: 99.5%, Precision: 95.1%, Recall: 87.7%) | None | Separately trained vision adapter for image recognition |
| Model Architecture | OpenGVLab/InternVL2-8B: combination of InternViT-300M-448px (vision) and internlm2_5-7b-chat (language) | gemini-1.5-pro-latest | Llama 3.1 text-only model + vision adapter with cross-attention layers |
| Model Size | 8B | Mid-size multimodal model | 11B |
| Performance | 49% improvement in accuracy and quality of generated test cases compared to using the language model alone | Improved accuracy and efficiency in test case generation compared to the first approach | Outperforms the other two approaches |
| Inference Time | ~3 mins (increased due to CV preprocessing before passing to the LLM) | 15-30 secs (ran locally) | <1 min (ran on 1x A40 GPU) |
| Quantisation Impact | 4-bit quantisation of the 8B model reduced size and computational requirements but decreased accuracy | Not applicable | Not applicable |
| Key Strengths | Detailed visual context from annotated UI elements enhances the language model's understanding | Processes videos in addition to images; customisable API requests | Advanced architecture; strong performance on visual recognition, image reasoning, captioning, and answering questions about images |
| Limitations | Only processes images; occasional misidentification of UI elements; increased inference times; decreased accuracy with quantisation | May misidentify elements or confuse them with one another | Recent release; may require more integration effort due to the advanced architecture |
| Scalability | Challenging due to preprocessing and increased inference times across many UI screenshots | Scalable thanks to the API infrastructure; supports customising requests and utilising its AI capabilities | Promising, thanks to its ability to handle both image and text inputs and strong visual understanding |

In my experience, when it comes to UI testing, prioritising the quality and relevance of generated test cases is crucial. And that is why I find the Llama 3.2-Vision model to be the standout choice. Its sophisticated architecture and impressive performance in visual understanding make it incredibly effective for generating relevant test cases with just a straightforward prompt.

On the flip side, if you're working under tight time constraints and can afford slightly less comprehensive test cases, the Gemini API Integration approach could be a quicker option.

Ultimately, the best choice depends on the specific needs of your UI testing project. Balancing accuracy, speed, and resource utilisation is key, and taking the time to weigh these trade-offs will help guide your decision.

Future Trends

The following key trends and technologies are set to redefine UI testing:

  • AI-Powered Test Automation: AI streamlines automation by dynamically identifying UI elements, adapting to interface changes, and performing visual validations, resulting in more accurate testing.
  • Predictive Analytics in Testing: Making use of historical data, AI can predict potential defects, allowing testing teams to focus on high-risk areas. This proactive approach enhances test coverage and improves overall efficiency.
  • Enhanced Test Case Generation: AI algorithms automatically generate comprehensive test cases based on application requirements and past data, minimising manual effort.
  • Multimodal AI Integration: This approach combines text, images, and video analysis, enhancing testing capabilities to validate graphical interfaces and diverse inputs.

Conclusion

As we wrap things up, it's clear that the rise of multimodal large language models is set to transform how we approach UI testing and validation.

The journey doesn’t end here, it’s just the beginning.

If you're looking to build custom AI solutions for your organization or want to transform your AI product ideas into reality, we can help. At Ionio, we specialise in taking your concepts from idea to product. Contact us to start your AI journey today.

Behind the Blog 👀
Shivam Mitter
Writer

The guy on coffee who can do AI/ML.

lowkey, i am really enjoying the new Llama 3 collection of models, super fun to work with!
Pranav Patel
Editor

Good boi. He is a good boi & does ML/AI. AI Lead.