Introduction

In cognitive process automation, developing self-training modules is crucial. These modules can independently explore, learn, and adapt to complex and unfamiliar environments in the interface. They do this by utilizing the vision capabilities of multimodal models in conjunction with the symbolic programming paradigm.

The DoRa (Discovery and Mapping Operation for Graph Retrieval Agent) framework is a pioneering approach in this field. It combines multimodality, graph-network data representation, and Reinforcement Learning to create a generalist agent for exploration.

“We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.” ~ R. Sutton, “The Bitter Lesson”

While developing the DoRa system for learning, we were greatly inspired by Rich Sutton’s pivotal essay, “The Bitter Lesson.” His essay underlines the importance of leveraging computation in AI research [1]. However, our methodology diverges from the common trend of disregarding human knowledge-based methods in favor of scalability. We believe in the fusion of these concepts. While extensive computing power has frequently driven AI advancements, the insights derived from human knowledge can provide valuable direction, particularly in the early stages of learning.

“…the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.”

Therefore, DoRa adopts a hybrid approach, where the initial phases of guided exploration and learnable mapping are deeply informed by human-like understanding. DoRa’s approach is based on imitation learning: it learns by observing human interactions within an interface and aims to reproduce those actions accurately, even when the interface changes or when it is applied to similar environments. Unlike a black-box model that unpredictably generates actions and adapts during inference, DoRa offers a more transparent and observable “recipe” for action generation, and it presents a learning module built from guided demonstration involving exploration and mapping. This foundation is progressively complemented and ultimately superseded by computation-intensive methods such as Reinforcement Learning and neuro-symbolic programming. This integration aims to achieve not just computational scalability, but also a level of interpretability and adaptability often absent in purely computation-driven systems.

This article presents an overview of DoRa’s methodology, which comprises five integral components aimed at enhancing the cognitive capabilities of automation systems.

  • Firstly, the Guided Exploration module is introduced. This module is designed to navigate the interfaces of web and desktop applications, identifying relevant information and patterns that can be further processed, and laying the foundation for future work on a Generalised Explorer Agent using Reinforcement Learning (RL) [2,3].
  • Secondly, the Learnable Mapping and Annotation module is discussed. This component is responsible for establishing meaningful connections and structure within patterns and data elements, facilitating their interpretation and utilization in various automation flows. By enabling the system to learn and adapt these mappings, DoRa enhances its ability to handle diverse and evolving data structures within interface elements across workflows and environments.
  • Thirdly, the framework incorporates a Graph-aided Heuristic Search mechanism, which utilizes normalized scores to retrieve, encode, and reflect the learnable mappings. This approach ensures that the system can efficiently navigate the site-graph structure, prioritizing the most relevant information, minimizing computational overhead, and translating information into the action domain.
  • Fourthly, DoRa integrates Graph Augmented Language Modelling, a technique that leverages knowledge graphs for knowledge-grounded dialogue generation. This component is crucial for grounding and context retrieval from subgraphs, enabling the system to generate more contextually relevant and coherent responses in conversational interfaces.
  • Finally, the transition from Language Modelling to a Neuro-Symbolic Programming Paradigm is explored. This shift represents a significant advancement in cognitive automation, as it allows for the generalization of cognitive tasks through multimodal representation learning during training. By combining symbolic reasoning with graph neural network-based learning, DoRa aims to achieve a more holistic and flexible approach to cognitive process automation.

1 Guided Exploration Module

The Guided Exploration module in the DoRa framework serves as the foundation for autonomous data navigation and pattern identification, with the aim of modelling human actions on computer applications across the operating system. This component is crucial for enabling the system to efficiently traverse complex interfaces and extract relevant information and events $e_i$ across the OS environment, hereafter referred to interchangeably as the ‘World Interface’ or the ‘interface’, $W$.

At the core of the Exploration module is the concept of guided discovery, where the system is directed to record events of the interface landscape throughout the workflows spanning transformed action steps $a_i$, which need to be interacted with in order to leverage the ‘learning’ facility of the system, such that

where,

The further development of this module involves the integration of Reinforcement Learning (RL) to create a Generalised Explorer Agent in addition to the existing guided discovery abstraction [4]. RL provides a framework for the agent to learn optimal exploration strategies through trial and error, continuously improving its ability to navigate the data environment and make informed decisions about where to focus its attention. Exploration can help an agent gather more information about its environment, which can improve its ability to generalize to new tasks or environments within the interface [6]. It can also help agents learn about parts of the environment that may be useful at test time, even if they are not needed for the optimal policy on the training environments.

As the context space is sparse, the exploration is guided by a reward function, which quantifies the value of the information discovered during exploration. This encourages the agent to prioritize the areas of the interface that are most relevant to the task $t$ at hand, hereafter called the contextual Sample Space, $S_c$. A task may be composed of many subtasks $t_i$, which the RL agent may use to derive and abstract the optimal actions. For the $j^{th}$ workflow, the subtasks $t^j_i$ constitute the task $t^j$, and the action steps $a^j_i$ together form the optimal actions $a^j$ and the action vector $A$.
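To make the role of the reward function concrete, the following is a minimal sketch of reward-guided exploration on a toy interface. It is not a learned RL policy and not DoRa's implementation: it is a greedy stand-in that picks the neighbouring screen with the highest reward, where the reward is a task-relevance term plus a count-based novelty bonus. The screen names, the `INTERFACE` and `RELEVANCE` tables, and the functions `reward` and `explore` are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical toy interface: each screen lists the screens reachable from it.
INTERFACE = {
    "home": ["search", "settings"],
    "search": ["results", "home"],
    "results": ["detail", "search"],
    "settings": ["home"],
    "detail": [],
}

# Hypothetical task relevance per screen, a stand-in for membership in S_c.
RELEVANCE = {
    "home": 0.1,
    "search": 0.6,
    "results": 0.8,
    "settings": 0.0,
    "detail": 1.0,
}


def reward(screen, visits):
    """Task relevance plus a novelty bonus that shrinks with every revisit."""
    return RELEVANCE[screen] + 1.0 / (1 + visits[screen])


def explore(start, max_steps=10):
    """Greedily move to the highest-reward neighbour and record the path."""
    visits = defaultdict(int)
    visits[start] += 1
    path = [start]
    current = start
    for _ in range(max_steps):
        neighbours = INTERFACE[current]
        if not neighbours:
            break  # dead end: nothing left to interact with
        current = max(neighbours, key=lambda s: reward(s, visits))
        visits[current] += 1
        path.append(current)
    return path


if __name__ == "__main__":
    print(explore("home"))
```

In a full RL setting the greedy choice would be replaced by a policy learned from accumulated reward; the sketch only shows how a reward that mixes relevance and novelty steers the agent towards the task-relevant region of the interface.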

The Guided Exploration module is essential for the overall efficiency and effectiveness of the DoRa framework, as it lays the groundwork for subsequent stages of post-processing and analysis. By enabling the system to identify and focus on the most pertinent information, this module enhances the system’s ability to adapt to diverse environments and extract actionable insights.

2 Learnable Mapping and Annotation Module

The Learnable Mapping and Annotation module is a pivotal component of the DoRa framework, responsible for establishing and refining the relationships between different GUI elements. This module enables the system to interpret and organize the information from the events and action steps discovered during the guided exploration phase, facilitating its use in various automation tasks.

Mapping, $M$, in this context refers to the process of linking related data points to create a structured representation, $S_e$, of the information that can be easily navigated and analyzed. In a knowledge graph, mapping involves connecting entities $n$ and their attributes based on their relationships $r$. The learnable aspect of this module means that these mappings are not static: they can be updated and improved over time as the system encounters new data or as the relationships between data elements evolve.

Annotation, on the other hand, involves adding metadata and labels to the data points, providing additional context and categorization among the GUI elements across the OS. During post-processing, YOLO, Optical Character Recognition (OCR), and GPT-4V are used to annotate the metadata and to map the information gathered in the exploration phase, $S_e$, into transformed structured “Nodes” vectors, $S_n$, within the site-graph, $G$. The Learnable Mapping and Annotation module employs multimodal representation learning techniques to continuously refine its understanding of the contextual information within the interface $W$, the node vectors, and the relationships among them. This iterative learning process ensures that the system’s mappings and annotations remain accurate and relevant, even as the underlying data changes.
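The mapping from recorded events $S_e$ to annotated nodes $S_n$ in the site-graph $G$ can be pictured with a small data structure. The sketch below is an assumption about one possible shape, not DoRa's actual schema: the classes `Node` and `SiteGraph`, the function `map_events`, the event fields (`id`, `label`, `kind`, `screen`), and the `followed_by` relation are all hypothetical. The `upsert_node` method illustrates the "learnable" aspect in the simplest possible way, by letting later observations overwrite or extend earlier annotations.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One annotated GUI element, i.e. one entry of S_n."""
    node_id: str
    label: str  # e.g. text recovered by OCR
    kind: str   # e.g. element class from an object detector
    metadata: dict = field(default_factory=dict)


@dataclass
class SiteGraph:
    """Directed graph G: nodes keyed by id, edges as (source, relation, target)."""
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)

    def upsert_node(self, node):
        """Insert a node, or refresh the annotations of an existing one."""
        existing = self.nodes.get(node.node_id)
        if existing is None:
            self.nodes[node.node_id] = node
        else:
            existing.label = node.label
            existing.kind = node.kind
            existing.metadata.update(node.metadata)

    def relate(self, source, relation, target):
        """Add a typed edge between two nodes that are already in the graph."""
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before relating them")
        self.edges.add((source, relation, target))

    def neighbours(self, source):
        """Return the sorted (relation, target) pairs leaving a node."""
        return sorted((r, t) for s, r, t in self.edges if s == source)


def map_events(graph, events):
    """Map recorded events (S_e) into nodes and link consecutive events."""
    previous = None
    for event in events:
        node = Node(event["id"], event["label"], event["kind"],
                    {"screen": event["screen"]})
        graph.upsert_node(node)
        if previous is not None:
            graph.relate(previous, "followed_by", event["id"])
        previous = event["id"]
    return graph
```

For example, calling `map_events(SiteGraph(), events)` on two recorded events produces two nodes joined by a single `followed_by` edge, and a later `upsert_node` call with the same id updates the stored label and merges new metadata rather than creating a duplicate.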

This module is fundamental to the cognitive capabilities of the DoRa framework, as it enables the system to construct a coherent and adaptable representation of the data landscape. By continually learning and updating its mappings and annotations, the system can maintain a high level of accuracy and efficiency in its automation tasks.

3 Graph-aided Heuristic Search

The Graph-aided Heuristic Search component of the DoRa framework is designed to leverage the structured representation of data provided by the Learnable Mapping and Annotation module to efficiently navigate and retrieve relevant information. This search mechanism uses heuristic algorithms, guided by the graph structure and the normalized scores assigned to different data elements, to prioritize the most promising paths and reduce the search space.

The use of normalized scores is a key feature of this module, as it allows for a standardized comparison of different data points based on their relevance or importance to the task at hand. These scores can be derived from various factors, such as the frequency of occurrence, the strength of relationships, or the relevance to the user’s query. By assigning scores to nodes and edges in the graph, the system can quickly identify the most pertinent information and focus its search efforts accordingly.

The heuristic aspect of the search algorithm is crucial for its efficiency, as it enables the system to make informed decisions about which paths to explore based on the available information and the current context. This approach reduces the computational overhead associated with exhaustive search methods and ensures that the system can retrieve relevant information in a timely manner.
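As an illustration of how normalized scores can drive a heuristic search over the site-graph, here is a minimal sketch that min-max normalizes raw node scores into $[0, 1]$ and then runs a greedy best-first search, always expanding the highest-scoring node on the frontier. This is one common heuristic search, chosen here as an assumption; the source does not state which algorithm or normalization DoRa uses. The functions `normalise` and `best_first_search`, the adjacency-dict graph format, and the example scores are hypothetical.

```python
import heapq


def normalise(scores):
    """Min-max scale raw scores into [0, 1]; constant inputs all map to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def best_first_search(graph, scores, start, goal):
    """Greedy best-first search guided by normalized node scores.

    graph maps each node to a list of successor nodes. Returns the path
    from start to goal as a list of nodes, or None if goal is unreachable.
    """
    norm = normalise(scores)
    # heapq is a min-heap, so negate scores to pop the best node first.
    frontier = [(-norm[start], start, [start])]
    visited = set()
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for successor in graph.get(node, []):
            if successor not in visited:
                heapq.heappush(
                    frontier, (-norm[successor], successor, path + [successor])
                )
    return None


if __name__ == "__main__":
    graph = {
        "home": ["search", "settings"],
        "search": ["results"],
        "settings": ["profile"],
        "results": ["detail"],
        "profile": [],
        "detail": [],
    }
    scores = {"home": 1, "search": 8, "settings": 2,
              "results": 9, "profile": 1, "detail": 10}
    print(best_first_search(graph, scores, "home", "detail"))
```

Because low-scoring branches such as `settings` stay at the back of the priority queue, the search reaches the goal without expanding them, which is the saving over exhaustive traversal that the paragraph above describes.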

The Graph-aided Heuristic Search module is an essential component of the DoRa framework, as it directly impacts the system’s ability to quickly and accurately access the information required for various automation tasks. By optimizing the search process through the use of heuristics and normalized scores, this module enhances the overall performance and effectiveness of the cognitive automation system.

4 Knowledge Graph Augmented Language Modelling

The Knowledge Graph Augmented Language Modelling component of the DoRa framework represents a significant advancement in natural language processing and dialogue generation. This module integrates the structured information from knowledge graphs into the language modeling process, enabling the system to generate more contextually relevant and coherent responses in conversational interfaces.

Grounding and context retrieval from subgraphs are key aspects of this module. By leveraging the connections and relationships encoded in the knowledge graph, the system can ensure that its responses are grounded in the relevant context, providing more accurate and informative node selections.

The integration of knowledge graphs into language modeling also facilitates the generation of knowledge-grounded dialogue, where the system references the sub-graphs and nodes filtered by the graph-aided search methods to produce substantiated responses [8]. By grounding language generation in the rich context provided by knowledge graphs, this module ensures that the optimal node selection is relevant and informative for consumption by AUTONODE.
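Grounding on a subgraph can be sketched as two steps: retrieve the triples within a few hops of the seed nodes, then serialize them into the context passed to a language model. The code below is a minimal illustration under stated assumptions and does not call any model. The functions `k_hop_subgraph`, `to_context`, and `build_prompt`, the triple format, the arrow notation, and the prompt wording are all hypothetical.

```python
from collections import deque


def k_hop_subgraph(triples, seeds, k):
    """Collect the (source, relation, target) triples within k hops of seeds."""
    outgoing = {}
    for source, relation, target in triples:
        outgoing.setdefault(source, []).append((relation, target))

    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    kept = []
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # do not expand beyond the hop limit
        for relation, target in outgoing.get(node, []):
            kept.append((node, relation, target))
            if target not in depth:
                depth[target] = depth[node] + 1
                queue.append(target)
    return kept


def to_context(triples):
    """Serialize triples as one line of text per edge."""
    return "\n".join(f"{s} --{r}--> {t}" for s, r, t in triples)


def build_prompt(question, triples):
    """Prepend the retrieved subgraph to the question as grounding context."""
    return (
        f"Context:\n{to_context(triples)}\n\n"
        f"Question: {question}\n"
        "Answer with one node name."
    )


if __name__ == "__main__":
    triples = [
        ("login_page", "contains", "username_field"),
        ("login_page", "contains", "submit_button"),
        ("submit_button", "leads_to", "dashboard"),
        ("dashboard", "contains", "export_button"),
    ]
    subgraph = k_hop_subgraph(triples, ["login_page"], k=2)
    print(build_prompt("Which element opens the dashboard?", subgraph))
```

Limiting retrieval to a small neighbourhood keeps the context short and restricts the model's candidate answers to nodes that actually exist in the graph, which is the sense in which the generated node selection is grounded.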

The scholarly landscape in web automation and robotic process automation is rich with studies employing symbolic algorithms. These techniques have demonstrated notable success in specialized areas and have found application across a diverse range of products, from productivity applications to computer-aided design tools. However, these applications encounter a significant hurdle when it comes to generalizing their solutions. Often, a specialized formal specification needs to be designed for a specific problem, and the heuristics used to solve one problem do not translate well to others [9]. This limitation motivates the introduction of a neuro-symbolic programming paradigm.

5 From Language Modelling to Neuro-Symbolic Programming Paradigm

The transition from Language Modelling to a Neuro-Symbolic Programming Paradigm represents a fundamental shift in cognitive automation, as encapsulated in the DoRa framework. This shift involves integrating the flexibility and expressiveness of neural network-based language models with the structured reasoning capabilities of symbolic programming, creating a more holistic approach to cognitive automation.

Multimodal representation learning during training is a key aspect of this transition. By incorporating multiple data modalities, such as text, images, and site graphs, into the learning process, the system can develop a more comprehensive understanding of the task at hand. This multimodal approach enables the system to enhance its cognitive capabilities, applying its learning to a wider range of automation tasks.

The Neuro-Symbolic Programming Paradigm offers several advantages over traditional language modeling approaches. By combining neural networks’ ability to capture complex patterns and relationships with symbolic programming’s logical reasoning and interpretability, the system can achieve a more nuanced and accurate understanding of the data. This integration enables the system to perform tasks that require both deep understanding and precise reasoning, such as natural language understanding, decision-making, and problem-solving.
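The division of labour between the two halves can be illustrated with a deliberately small sketch: symbolic rules decide which GUI elements are admissible for an action, and a scorer ranks the admissible ones. Everything here is an assumption made for illustration. In particular, `neural_scores` is only a word-overlap stand-in for a learned model, and the `RULES` table, the element fields (`id`, `label`, `kind`, `enabled`), and `select_element` are hypothetical names, not part of DoRa.

```python
def neural_scores(instruction, elements):
    """Stand-in for a learned scorer: fraction of label words in the instruction."""
    words = set(instruction.lower().split())
    scores = {}
    for element in elements:
        label_words = set(element["label"].lower().split())
        overlap = len(words & label_words)
        scores[element["id"]] = overlap / len(label_words) if label_words else 0.0
    return scores


# Symbolic constraints: which element kinds each action verb may target.
RULES = {
    "click": lambda e: e["kind"] in {"button", "link"} and e["enabled"],
    "type": lambda e: e["kind"] == "textbox" and e["enabled"],
}


def select_element(verb, instruction, elements):
    """Filter elements with the symbolic rule, then rank the rest by score."""
    allowed = [e for e in elements if RULES[verb](e)]
    if not allowed:
        return None  # no element satisfies the rule for this verb
    scores = neural_scores(instruction, allowed)
    return max(allowed, key=lambda e: scores[e["id"]])["id"]


if __name__ == "__main__":
    elements = [
        {"id": "e1", "label": "Submit order", "kind": "button", "enabled": True},
        {"id": "e2", "label": "Order notes", "kind": "textbox", "enabled": True},
        {"id": "e3", "label": "Cancel", "kind": "button", "enabled": True},
        {"id": "e4", "label": "Submit order", "kind": "button", "enabled": False},
    ]
    print(select_element("click", "submit the order", elements))
```

The rule table makes the system's constraints inspectable and guarantees that a disabled or wrongly typed element is never chosen, while the scorer handles the fuzzy matching between the instruction and the element labels.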

The shift to a Neuro-Symbolic Programming Paradigm is a critical development in the DoRa framework, as it represents a significant step towards generalizing cognitive automation across different domains and tasks. By leveraging multimodal representation learning and the synergies between neural networks and symbolic programming, the system can achieve a more advanced and versatile level of cognitive automation.

Conclusion

The DoRa framework presents a comprehensive and innovative methodology for self-training modules in cognitive process automation. By integrating guided exploration, learnable mapping, graph-aided heuristic search, knowledge graph augmented language modeling, and a neuro-symbolic programming paradigm, DoRa sets a new standard for the development of intelligent automation systems that can be trained on interface-intensive workflows. In parallel, future research will focus on refining these components and exploring their applications in various domains, with the ultimate goal of achieving more autonomous, efficient, and adaptable cognitive processes on the way to a generalist explorer agent.