Take a Peak

Microsoft, OpenAI and the future

Since 2016, Microsoft has strived to become an AI powerhouse on the global scale. The goal is to transform Azure into an artificial intelligence augmented machine with superlative capabilities. To this end, they partnered with OpenAI to build their infrastructure and democratize data. As of now, there are several promising results. Such as the infrastructure used by the OpenAI to train its breakthrough models, deployed in Azure to power category-defining AI products like GitHub Copilot, DALL·E 2, and ChatGPT. And Microsoft is not shy about gloating about their progress.


Recently, BitPeak representatives were invited to an event, titled “Azure and OpenAI: Partners in transforming the world with AI”. In this article we will share with you the key points of the Webinar, such as Microsoft strategy, established implementations and use cases, as well as a quick peak into the future of GPT-4.


So, if you are interested in AI, as you should be, you are in luck! Without further ado – let us dive in.




The Microsoft strategy and investments



General Overview of the Strategy


The hosts started strong and put emphasis on the necessity of investments in AI for companies that do not want to be left behind, as constant development creates pressure to progress or become uncompetitive. It was quite an obvious prelude for further promotion of Microsoft’s product, but the sentiment itself is not wrong. AI has come to the mainstream, with decently reliable results and cost-efficiency – and the world is riding on its wave.


A slide from MS presentation representing the importance of the AI



In its 2022 report about AI, creatively titled “The state of AI in 2022—and a half decade in review” McKinsey supports this conclusion and gives their own insights about the future of artificial intelligence. Unfortunately for all the Luddites, the future with AI powered toasters and/or Skynet is confidently coming our way.

So, how does Microsoft prepare for the coming of our future computer overlords? The answer is simple:

  • Research & Technology
  • Partnerships
  • Ethical guidelines




Research & Technology


The obvious Microsoft flagship is the ChatGPT which conquered the globe in lightning-fast time, reaching 100M users in just two months. In comparison, Facebook took 4.5 years to do the same. The chatbot won the minds and hearts through a combination of its ability to conduct nearly human-like conversations, provide code snippets and explanations, as well as very confidently state very incorrect information. And those are some very human competencies that not every person I know possesses.


But, jokes aside, why is ChatGPT so special and different from other chatbots? The concept itself is not new. However, as demonstrated during the webinar, you can ask it to create a meal plan for a particular family with concrete specifications such as portions, cooking style and nutrition. The bot will create (not paste!) such a plan for you and even provide a shopping list if asked. The list may be wrong the first time, but after some prodding you will get what you need and be ready to go to the nearest supermarket.


The example shows that not only does the AI have some real day-to-day uses, not only can it correct itself (or at least provide the second most probable answer based on its parameters), but also provide assistance in a broad range of topics with various capabilities. But, after knowing “why”, let us look closer at “how”.


ChatGPT – one model to rule them all



The first part is its architecture. ChatGPT is a single model with multiple capabilities, often referred to as a „single model for multiple tasks”. This is the result of its underlying architecture and training methodology. Such an approach stands in contrast to the traditional solutions, which involve training separate models for each task. But how does it work exactly?


Transfer learning: ChatGPT leverages transfer learning, where it is pretrained on a large corpus of diverse text data, gaining a general understanding of language, facts, and reasoning abilities. This pretraining step enables the model to learn a wide range of features and patterns, which can be fine-tuned for specific tasks. The shared knowledge learned during pretraining allows the model to be flexible and adapt to various tasks without the need for individual task-specific models.


Zero-shot learning: Owing to its extensive pretraining, ChatGPT possesses the ability to perform zero-shot learning in which the model is trained on a set of labeled examples, but is then evaluated on a set of unseen examples that belong to new classes or concepts. This means it can handle tasks it has not been explicitly trained for, using only the knowledge acquired during pretraining. To achieve this, zero-shot learning relies on the use of semantic embeddings, which represent objects or concepts in a continuous vector space. By using these embeddings, the model can generalize from known classes to new classes based on their similarity in the vector space.


Few-shot learning: ChatGPT can also engage in few-shot learning, where it can learn to perform a new task with just a few examples. In this setting, the model is provided with examples in the form of a prompt, which helps it understand the task’s context and requirements. To achieve this, few-shot learning typically employs techniques like transfer learning, meta-learning, and episodic training. Transfer learning involves adapting a pre-trained model to a new task with limited data, while meta-learning involves training a model to learn how to learn new tasks quickly.


Thanks to this approach chatbot is more efficient when it comes to allocating resources, simpler to deploy, better at generalization and adaptation to new tasks, easier to maintain and able to find and use synergies between its capabilities. Why do other AI models either do not use this approach or are not as proficient in it?


The answer is simple – resources. ChatGPT benefits from an enormous amount of resources, both when it comes to infrastructure that supports its capabilities and the sourcing and parsing of training data.


But simple answers are usually not enough. Below are a few more tricks that the AI uses to answer questions ranging from Bar Exam tasks to trivia from the Eighties Show.


Safety: To increase safety, OpenAI employs Reinforcement Learning from Human Feedback (RLHF). During the fine-tuning process, an initial model is created using supervised fine-tuning with a dataset of conversations where human AI trainers provide responses. This dataset is then mixed with the InstructGPT dataset transformed into a dialog format. To create a reward model for reinforcement learning, AI trainers rank different model responses based on quality. The model is then fine-tuned using Proximal Policy Optimization, with this process iteratively repeated to improve safety.


Fine-tuning: Fine-tuning is achieved through a two-step process: pretraining and supervised fine-tuning. During pretraining, the model learns from a massive corpus of text, gaining a general understanding of language, facts, and reasoning abilities. In the supervised fine-tuning stage, custom datasets are created by OpenAI with the help of human AI trainers who engage in conversations and provide suitable responses. The model then fine-tunes its understanding by learning from these responses, improving its contextual understanding and coherence.


Scaling: Scaling is accomplished primarily by increasing the number of parameters in the model. ChatGPT in its newest iteration has billions of parameters that allow it to learn more complex patterns and relationships within the training data. The transformer architecture enables efficient scaling by leveraging parallelization and distributed computing, allowing the model to process vast amounts of data efficiently.


Reduced prompt bias: To reduce prompt bias, OpenAI explores techniques such as rule-based rewards, where biases in model-generated content are penalized. Another approach is to use counterfactual data augmentation, which involves creating variations of the same prompt and training the model on these diverse prompts to produce more consistent responses.


Transformer architecture: The transformer architecture, introduced by Vaswani et al. in 2017, is the foundation of GPT-4 and other state-of-the-art language models. Key features of this architecture include:

  • Self-attention mechanism: Transformers use a self-attention mechanism that allows the model to weigh different parts of the input sequence and focus on contextually relevant parts when generating output.
  • Positional encoding: Transformers do not have an inherent sense of sequence order. Positional encoding is used to inject information about the position of tokens in the input sequence, ensuring the model understands the order of words.
  • Layer normalization: This technique is used to stabilize and accelerate the training of deep neural networks by normalizing the input across layers.
  • Multi-head attention: This mechanism enables the model to focus on different parts of the input sequence simultaneously, learning multiple contextually relevant relationships in the data.
  • Feed-forward layers: These layers, used after the multi-head attention mechanism, consist of fully connected networks that help in learning non-linear relationships between input tokens.


By leveraging these advanced features, the transformer architecture empowers ChatGPT to generate more contextually accurate, coherent, and human-like text compared to other AI models.






To establish and retain a dominant position in the AI tech-sphere, Microsoft has been actively pursuing strategic partnerships with leading research institutions, startups, and other technology companies. These alliances enable Microsoft to tap into external expertise, share knowledge, and jointly develop cutting-edge AI solutions, broadening their offer of AI-augmented services and tailoring them to their infrastructure. The most important partner is obviously OpenAI, which together with Microsoft develops four main models.



Joint mission and results of the partnership



GPT series models, such as GPT-3 and GPT-4 are series of language models developed by OpenAI consisting of some of the largest and most powerful language models to date, with possibly up to 100 trillion parameters in the case of GPT-4 and respectable 175 billion in the case of GPT-3.


GPT-3 is capable of understanding and generating human-like text based on the input it receives. It can perform various tasks, including translation, summarization, question-answering, and even writing code, without the need for fine-tuning. GPT-3’s capabilities have opened up exciting possibilities in natural language processing and have garnered significant attention from the AI community opening it up to mainstream with obvious day-to-day uses.


Building on the success of GPT-3, OpenAI introduced GPT-3.5 and then GPT-4, with each new iteration bringing significant improvements. GPT-3.5 enhanced fine-tuning capabilities and context relevance, while GPT-4, surpassing all previous models, showcases superior complexity and performance. Leveraging the capabilities of GPT-3 like translation, summarization, and code writing, GPT-4 demonstrates heightened understanding and generation of human-like text, expanding the potential applications of AI in various sectors and daily life.


Codex is an AI model built on top of GPT-3, specifically designed to understand and generate code. It can interpret and respond to code-related prompts in natural language and can generate code snippets in various programming languages. The most notable application of Codex is GitHub Copilot, an AI-powered code completion tool developed by GitHub (a Microsoft subsidiary) in collaboration with OpenAI. Copilot assists developers by suggesting code completions, writing entire functions, and even recommending code snippets based on the context of the developer’s current work. Despite its recent legal troubles, it is no doubt a useful tool.


DALL-E is an AI model that combines the capabilities of GPT-3 with image generation techniques to create original images from textual descriptions. By inputting a text prompt, DALL-E can generate a wide array of creative and often surreal images, showcasing the model’s ability to understand the context of the prompt and generate relevant visual representations. DALL-E’s unique capabilities have implications for many creative industries, such as advertising, art, and entertainment, especially when it comes to lowering the entry threshold.


ChatGPT is a AI model fine-tuned specifically for generating conversational responses. It is designed to provide more coherent, context-aware, and human-like interactions in a chat-based environment. ChatGPT can be used for various applications, including customer support, virtual assistants, content generation, and more. By being more focused on conversation, ChatGPT aims to make AI-generated text more engaging, relevant, and useful in interactive scenarios. And while making jokes or understanding Norman McDonald’s humor may be beyond it (so far), the capability is still uncanny.



Microsoft prepared broad range of tools with obvious real-life uses



It is obvious that Microsoft decided to promote AI, seeing the potential to become a main facilitator and infrastructure provider, while also democratizing the whole process and fulfilling its mission of increasing productivity on a global scale. However, during the event it was strongly stated that the partnership with OpenAI, while productive and important, is only part of the range of services offered by Microsoft. The company uses its machine modeling muscles in a variety of ways, presented below, with both old services with AI augmentation and new propositions aimed at increasing productivity.



If ChatGPT is all-in-one shop, then Microsoft prepared whole commercial district






Now, with figures such as Elon Musk and Bill Gates cautioning against AI and its growth the question of ethics in research and development appears. And while it is rather improbable that ChatGPT, being just a weighed statistical model becomes Roko’s Basilisk – the dangers of automation, unethical data sourcing and increased dependence on quick and easy answers generated by ChatGPT – remain.


So what steps are taken during development of new generation of AI models to ensure that it does more good than bad and won’t go Skynet on the general populace?


Ethical principles: Microsoft has established a set of ethical principles that guide the development and deployment of AI. These principles include fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability.


Bias detection and mitigation: Microsoft uses a combination of algorithms and human reviewers to detect and mitigate bias in its AI services. For example, it has developed tools that can identify and correct biased language in chatbots like ChatGPT.


Data privacy and security: Microsoft has strict policies and procedures in place to protect the privacy and security of user data. It also provides users with tools and settings to control how their data is used.


Explainability and transparency: Microsoft aims to make its AI services more explicable and transparent to users. It has developed tools like the AI Explainability 360 toolkit, which allows developers to understand and explain the decisions made by AI models.


Partnerships and collaborations: Microsoft collaborates with governments, NGOs, and academic institutions to ensure that its AI services are used for the social good. For example, it partners with organizations like UNICEF and the World Bank to develop AI solutions that address social and environmental challenges.


Responsible AI initiative: Microsoft has launched a Responsible AI initiative to promote the development and deployment of AI that is ethical, transparent, and trustworthy. The initiative includes a set of tools and resources that developers can use to build responsible AI solutions.


But all of those did not prevent the chatbot from being implicated in a civil libel case filed by Victorian Mayor Brian Hood who claims the AI chatbot falsely describes him as someone who served time in prison as a result of a foreign bribery scandal. Additionally, there are some questions about the regulations about data privacy that may be breached by ChatGPT, which resulted in it being banned in Italy.


The watchdog organization being the bad referred to „the lack of a notice to users and to all those involved whose data is gathered by OpenAI” and said there appears to be „no legal basis underpinning the massive collection and processing of personal data in order to 'train’ the algorithms on which the platform relies”. It is also telling that the AI researcher apologized and committed to working diligently and rebuilding violated trust.


So, while artificial intelligence presents enormous opportunities, and both Microsoft and OpenAI try to conduct their research in an ethical way, it is important to stay informed and watchful about potential dangers and opportunities.


To end the section about Microsoft’s strategy and development of AI products, the most important part must be mentioned – pricing.


The answer for the questions about using GPT for business is simple – tokenization


The prices itself can and probably will change, as demand stabilizes, but the “pay-as-you-go” model is promising and allows for great flexibility as well as somehow predictable costs. Additionally, there are few AI models to choose from, either focusing on “reasoning” ability or cutting costs.






All in all, Microsoft’s AI strategy and partnership with OpenAI have the potential to significantly shape the future of AI technology and its applications across various industries. By democratizing AI, integrating AI capabilities into its products, and fostering strategic collaborations, Microsoft is poised to remain at the forefront of the AI revolution, driving innovation and enabling unprecedented advancements in the field. Most importantly for the company, they want users to depend on their productivity increasing services and providers of AI-based solutions to depend on their infrastructure and processing power.


This is a natural extension of Microsoft business strategy, but differently than Azure or Power BI – their hegemony in the AI-sphere is as of now nearly uncontested. Even Google seems to be unable to find the right answer, perhaps because their own AI, Bard, has a habit of providing the wrong ones. For us, mere mortals, all is left to do is keep abreast of developments, hope that ethics prevail during the research and be prepared for a world run with or by AI.







Artificial Intelligence Microsoft and OpenAI

How Microsoft acts to become the most important provider of AI backed services? Read and learn!

Read more arrow

Data Vault 3.0 – The summary


After the second part of the article series about Data Vault where we talked about data modelling and architecture, we return to you with quik look into naming conventions as well as the summary of the topic. It is great opportunity to learn something new, or just refresh your knowledge about Data Vault.




Naming convention


As we have already seen, the Data Vault is a multitude of tables with different structures and purposes. With hundreds of such objects in the warehouse, it is impossible to use them if we do not set the right naming rules.


Below is a sample set of prefixes for Data Vault objects:


Layer Data Vault object Name prefix
RDV Hub H_
RDV Satellite S_
RDV Multiactive satellite SM_
RDV Relational link L_
RDV Hierarchical link LH_
RDV Non-hierarchical link LT
BDV Satellite BS_
BDV Multiactive satellite BSM_
BDV Relational link BL_
BDV Hierarchical link BLH_
BDV Non-hierarchical link BLT
Other PIT PIT_
Other Bridge BR_
Other View V_<DV_object_prefix>


In addition to prefixes, it is worth standardizing the naming of related objects such as satellites around a common HUB and the naming of links. It is worth naming technical and business columns consistently. A dictionary of abbreviations and a dictionary of column prefixes and suffixes can be introduced.





If you’ve made it this far, you should already have a rough idea of what Data Vault is, how to create it, and what its advantages are. In my opinion, in order for the methodology to be used correctly it is also necessary to be aware of its disadvantages in order to prepare for their mitigation. For me, the fundamental disadvantage of Data Vault is the multiplicity of tables in the model and the difficulty in connecting them. Let’s say we want to write a cross-sectional query that retrieves data from three business hubs. Let’s say we need data from 2 satellites connected to each of these hubs (that’s already 9 tables). In addition, there are links between the hubs, and if there are satellites attached to the links, they also have to be included, which gives a total of (9+4) 13 tables that we have to involve.


This creates challenges in several areas:

  • Performance
  • Difficulty in writing SQL queries for the model
  • Difficulty in documenting the model


Of course, each of these points can be addressed, but it requires additional work that one should be aware of.


The fragmentation of tables is, on the one hand, a disadvantage that I mentioned above, but on the other hand, it also has its advantages. For data warehouses with multiple consumers, many sources, and many critical processes, fragmentation helps to minimize the impact of any errors in data feeding. For example, we read a small dictionary from a CSV file and based on it, calculate a column in the Data Vault satellite. When this file does not appear or appears with an error, we will not feed only that one satellite in the data warehouse.


The rest of the data warehouse will work correctly, and the processes based on it. In the case of choosing a different modeling approach, where broad tables are created, a problem with one small element can cause a problem with feeding one of the most important data warehouse tables, delaying most critical processes. Fragmentation also makes data storage more efficient – we store data immediately after it appears. There are no situations where we wait for data from, for example, five sources, which we then combine in ETL and store. It is clear that in such an approach, ETL can only start after all the input data has appeared, so the writing is delayed by this waiting time, unlike in Data Vault.


Fragmentation also helps in developing a data warehouse in many independent teams and releasing such changes. Data Vault is very „agile” and greater gradation of data and feeding processes means we have fewer dependencies between teams. It looks completely different when we have critical and broad tables in the model and many teams that modify them. In such cases, conflicts are not difficult, and the effort required for integration and regression testing is much greater.


How to effectively manage a Data Vault model? I don’t want to give advice on when to create a new satellite and under what rules, because in my opinion it must be tailored to the company and how the data warehouse is to be developed. However, I would like to draw attention to the elements that must be addressed in order not to fail during the development of a Data Vault model consisting of hundreds of tables.


First of all, the production process should be described, which establishes the rules for developing the data warehouse, from the moment the data requirements appear to the implementation stage and then maintenance. I will not go into details here because this is a topic for a separate article, but I will only emphasize the fact that the model must be properly documented, that the rules for development (adding additional tables to the model) should be defined, that object and column naming should be consistent, and that a framework should be created to automate the feeding of DV objects (calculating keys, hdif, partitioning, etc.). It is also best for such a fragmented model to refer to something at a more generalized level. In the company, a high-level Corporate Data Model should be created, which the fragmented model must be consistent with (we always model down: CDM -> Data Vault Model).


The Data Vault model is a business-oriented approach to data, not source systems. Business concepts are usually constant, while IT systems live and change much more often. If we want to have a consistent model that does not change with the exchange of the IT system underneath, then Data Vault is the right choice. However, is it recommended for every organization? Definitely not. If you want to integrate several dozen or hundreds of data sources in the company, and if the company does not have dozens or hundreds of critical processes, then Data Vault is unecessary. The overhead required for a proper solution preparation can also be significant. The larger planned data warehouse is, the more certain the Return On Investment (ROI). ROI increases when:

  • the number of source systems is large
  • source systems change frequently
  • the number of planned critical processes is significant
  • we plan to develop the model in many independent teams


So is Data Vault right for you? To answer your question you will need thorough understanding of your business needs and strategy, as well as knowledge about adventages and weaknesses od Data Vault. However, after reading our Data Vault series, you should be much better equipped to start answering the question.


This concludes the third and final part of our series of articles about Data Vault and its implementation. However, if you are curious about experts opinion and insights about data science, integration of data engineering solutions and synergizing technological and business strategy during data transformation – you are in luck!

Our experts create comperhensive and informative articles about the data analytic business. So tune in on our site and social media linked below to not miss valuable content.


And if you have additional questions about data – let’s talk about it!


Data Vault Part 3 - Summary

What is Data Vault? How to implement it in your organization and how to harness its power in your organization? Take a look and learn!

Read more arrow

Data Vault 2.0 – data model

After the first part of the article  series about Data Vault where we introduced the concept and the basicis of its architecture, we return to you with more in-depth look into data modeling. We will analyze concepts such as Business keys (BKEYs), hash keys (HKEYs), Hash diff (HDIF) and more!



Data Vault – technical columns



Business Key (BKEY)


In contrast to traditional data warehouses, Data Vault does not generate artificial keys on its own, nor does it use concepts such as sequences or key tables. Instead, it relies on a carefully selected attribute from the source system, known as the Business Key (BKEY). Ideally, the BKEY should not change over time and be the same across all source systems where the data is generated. While this may not always be possible, it greatly simplifies passive model integration. Furthermore, in the context of GDPR requirements, it is not advisable to choose business keys that contain sensitive data as it can be challenging to mask such data when exposing the data warehouse.


Examples of BKEYs may include the VAT invoice number, the accounting attachment number, or the account number. However, finding a suitable BKEY may not be an easy task. One best practice is to check how the business retrieves data from source systems and which values are used when entering data into the source system. Typically, these values, as they are known to the business, are good candidates for BKEYs. Often, the same data is processed in multiple source systems. For instance, in an organization with several systems for processing tax documents (invoices, receipts), natural document numbers (receipt/invoice numbers) may be used in some, while an artificial key (attachment number) may be used in others. In some cases, a sequential document number and an equivalent natural number are also used. In such situations, using an integration matrix can help identify the appropriate BKEY.


Matrix showcasing potential BKEY keys



As we can see from the matrix, there are several potential BKEY keys, but only the document number appears in the majority of the sources from which we retrieve document data. If we use a BKEY key based on the document number, the data in the Data Vault model will naturally integrate. However, what will we get for data from „System 2„? For this data, we need to design an appropriate same-as link (a Data Vault object) that will connect the same data. More on this in the later part of the article.


It is important that the same BKEY keys from different source systems are loaded in the same way. Even if we want to format such a key, for example, by adding a constant prefix, we should do it in the same way for data from all sources.



Hash key (HKEY)


In the DV model, all joins are performed using a hash key. The hash key is the result of applying a hash function (such as MD5) to the BKEY value. The hash key is ideal for use as a distribution key for architectures with multiple data nodes and/or buckets. Through distribution, we can efficiently scale queries (insert and select) and limit data shuffling, as data with the same BKEY values are stored on the same node (having received the same HKEY).


Example BKEY and HKEY:




Hash diff (HDIF)


In Data Vault objects that store historical data (SCD2), HDIF represents the next versions of a record. HDIF is calculated by computing a hash value on all the meaningful columns in the table.





Date and hour of record loading.





Indication that a record has been deleted. It is important to note that in Data Vault 2.0 it is not recommended to use validity periods (valid from – valid to) to maintain historical records. As this requires costly update operations that are not efficient, especially for real-time data. In addition, for some Big Data technologies, update operations may not be available, which further complicates the implementation of validity periods. Instead, Data Vault recommends an insert-only architecture based on technical columns such as LoadTime and DelFlag to indicate when a record has been deleted.





For Data Vault tables that receive data from multiple sources, the source column allows for additional partitioning (or sub-partitioning) to be established. Proper management of the physical structure of the table enables independent loading of data from multiple sources at the same time.

Different types of Data Vault objects have different sets of technical columns, which will be discussed further in the article.




Passive integration:


In classic warehouses, there are often so-called key tables in which keys assigned to business objects on a one-off basis are stored. Loading processes read the key table and, based on this, assign artificial keys in the warehouse. There are also sequences based on which keys are assigned, and sometimes a GUID is used.


All these solutions require additional logic to be implemented so that the value of the keys can be assigned consistently in the warehouse model. Often, these additional algorithms also limit the scalability of the warehouse resource. Passive integration is the opposite of this approach. Passive integration involves calculating a key on the fly during a table feed based only on the business key. With a deterministic transformation (hash function on BKEY), we can do this consistently in any dimension, e.g:


  • model dimension – the same BKEY in different warehouse objects will give us the same hkey so we can feed them independently and then combine them in any consistent way


  • time dimension – feeding the same BKEY at different points in time will give us the same result. Records powered up a year ago and today will get the same HKEY. Clearing the data and feeding it again will also have no effect on the calculated values (unlike, for example, in the case of sequences)


  • environment dimension – the same BKEY will have the same HKEY on different environments which facilitates testing and development.


The above is possible, but only if we choose the BKEY correctly, so the necessary effort should be made to make the choice optimal. We should consistently calculate it with the same algorithm for all HUB objects in the model. The exception can appear when we know that we have potential BKEYs in different formats in the source systems, but a simple transformation will make it consistent. It is important that this transformation is of the 'hard rule’ type.


For example:


In system 1 we have the key BKEY: „qwerty12345”


In system 2 we have the key BKEY: „QWERTY12345”


We know that business-wise they mean the same thing. In this case, we can apply a „hard rule” in the form of a LOWER or UPPER function to make the keys consistent.


Unfortunately, there are also situations where we have completely different BKEYs in different systems, for example:


In system 1 we have the key BKEY: „qwerty12345”


In system 2 we have the key BKEY: „7B9469F1-B181-400B-96F7-C0E8D3FB8EC0”


For such cases, we are forced to create so-called same-as links, which we will discuss later in this article.



Physical objects in Data Vault



Data Vault objects appear in the same form in both the RDV and BDV layers. The differences between them are only in the way the values in these objects are calculated (Hard rules and Soft rules). The objects of each layer should be distinguished at the level of naming convention and/or schema or database



  1. HUB
  2. LINK
    • Standard
    • Effectiveness
    • Multiactivity



  1. Business HUB
  2. Business LINK
  3. Business SATELLITE
    • Standard
    • Effectiveness
    • Multiactivity



HUB type objects


Hubs in the Data Vault warehouse are objects around which a grid of other related objects (satellites and links) is created. A Hub is a 'bag’ for business keys. A Hub cannot contain technical keys that the business does not understand, the keys must be unique. Examples of HUBs could be: customer, bill, document, employee, product, payment, etc.


We feed the Hubs with keys (BKEY) from the source systems, one BKEY can represent data from multiple source systems. We can use some rules to calculate BKEY but only those that meet the hard rules (usually UPPER, LOWER, TRIM). We never delete data from the HUB, if a record has disappeared from the source systems then its key should remain in the HUB. Even if the data is loaded into the hub in error, we do not need to delete unnecessary keys.



Example HUB structure, description of technical columns one chapter earlier.



Satellite type objects


It stores business attributes. We can have satellites with history (SCD2) or without history (SCD0/SCD1). We create a new satellite when we want to separate some group of attributes. We can do this for a number of reasons:


a) we want to store data of the same business importance (e.g. address data) in one place


b) we want to separate fast-changing attributes into a separate satellite. Fast-changing attributes are those that change frequently causing duplication of records in the satellite. Examples of such attributes could be e.g. interest rate, account balance, accrued interest, etc.


c) we want to segregate attributes with sensitive data for which we will apply restrictive permission policies or GDPR rules.


d) we want to add a new system to the warehouse and create a new satellite for it


e) others that for some reason will be optimal for us



Data Vault is very flexible in this respect. However, be sure to document the model well.



Example of a satellite with data recorded in SCD2 mode:




Multiactive satellite – a specific type of a satellite where the key is not only BKEY but also a special multiactivity determinant (one of the substantive attributes). An example of such a satellite is a satellite storing address data where the multiactivity determinant is the type of address (correspondence, main, residential).


We have one BKEY (e.g. login in the application) and several addresses. We can successfully replace the multiactivity satellite with a regular one by adding a multiactivity determinant column to the hashkey calculation. My experience shows that it is better to limit the use of multiactivity satellites for reasons of model readability and reading efficiency.


 Example of a multiactivity satellite with data recorded in SCD2 mode



Link type objects


Link objects come in several versions:


Relational link – represents relationships between two or more objects which can be powered by complex business logic. Relationships must be unique – this is achieved by generating a unique hash for the relationship which is calculated from the hashes of the records it links. A link does not contain business columns (the exception is an nonhistorized link).


If we want to show history then we need to attach a satellite with a timeline to the link (effectivity satellite). The performance satellite can also contain additional business columns describing relationships.





Hierarchical link – used to model parent-child relationships (e.g. organisational structure) This type of link can of course also store history. To achieve that – just add an efficiency satellite to the link.



An example of an organisational structure in the Data Vault model using a hierarchical link and an efficiency satellite:




Non-historicised link  (also known as transactional links) – a link that may contain business attributes within it, or may be associated with a satellite which has these attributes. The important thing is that it stores information about events that have occurred and will never be changed (like a classic fact table). Examples of such data are: system logs, invoice postings that can only be changed/withdrawn with another posting (storno accounting), etc.



and example of a Non-historicised link




Link same as – allows you to tag different BKEY keys in the HUB table that essentially mean the same thing business-wise. I have mentioned this in previous chapters when describing the selection of the optimal BKEY. It is very important to note that this link only combines BKEY keys that business mean the same thing, we do not use the same as to register relationships other than mutually explicit relationships. We can use advanced algorithms to calculate often non-obvious links and record the results of the calculation in the link.



Examples of “same as” link



Links such as „same as” can be used in situations when we want to indicate often non-obvious business relationships, but also in very mundane situations. For example, when two systems have completely different business keys that represent the same thing, or when a key changes over time and we want to capture and record that change.


PIT facility – The Data Vault model is fragmented, as we have many subject satellites correlated to HUBs. Queries in the warehouse often involve several HUBs and satellites correlated with them. Selecting data from a specific point in time can be a challenge for the database. To improve read performance we use Point In Time (PIT) objects. A PIT table is something like a business index.


The important point is that we create PITs for specific business requirements. We define a set of source data (hubs, satellites), combine selected tables of hubs, links and satellites in such an arrangement as the business expects, e.g. for a selected moment in time (selected timeline or other business parameter). These are objects that we can reload and clean at any time, depending on the requirements of the recipient and the limitations of the hardware/system platform. The PIT is constructed from keys that refer to the hub and satellites so that we can retrieve data from these objects with a simple „inner join„.


A PIT facility can also refer to links instead of HUBs and satellites attached to a link.


BRIDGE object – works similarly to the PIT object with the difference being that it does not speed up access to data on a specific date but speeds up reading of a specific HKEY. Like PIT objects, BRIDGE objects are also created for the specific requirements of the data recipient. Bridge objects contain keys from multiple links and associated HUBs.





The raw Data Vault model is not an easy model to use, it is difficult to navigate without documentation and therefore should not be made widely available to end users. The PIT as well as the Bridge objects help the end-user to read the DataVault data efficiently but it is important to remember that they are not a replacement for the Information Delivery (Data Mart) layers. They should be considered more as a bridge and/or optimisation object to produce higher layers. Of course, creating a PIT/Bridge object also costs money, so this optimisation method is used where there are many potential customers.


This concludes the second part of our series of articles about Data Vault and its implementation. Next week, you will be able to read about naming convention. Additionally, you will be able to find the summary of the information provided so far! To make sure you will not miss the next part of the series, be sure to follow us on our social media linked below. And if you have additional questions about data – let’s talk about it!


Data Vault Part 2 - Data modeling

What is Data Vault? How to implement it in your organization and how to harness its power in your organization? Take a look and learn!

Read more arrow



Data Vault, compared to other modelling methods is relatively new. There are not many specialists with experience when it comes to data warehouses in this architecture. The lack of practical knowledge often results in solutions that only partially comply with the guidelines. This results in achieved results not fulfilling expectations and not supporting business strategy properly. Implementation and performance are especially problematic and require in-depth consideration.


But if you are curious about enormous potential of Data Vault as a Data Governance tool – you came to right place. Tomasz Dratwa, BitPeak Senior Data Engineer and Data Governance expert with several years of experience in implementing and developing Data Vaults decided to write down the most vital issues that need to be considered while building DV in your organization. Issues such as implementation of modelling from the architecture level to the physical fields in the warehouse. We are sure that they will help anyone who considers a warehouse in a Data Vault architecture.


The article is mostly for people who have some experience in dealing with databases and data warehouses before. It does not explain the basics of creating a data warehouse, modeling, foreign keys, or what SCD1 and SCD2 are. For those unfamiliar with the concepts, the article may be a challenging lecture. However, for those well-versed in dealing with databases and data warehouses, or just determined and able to access the google – this will most certainly be a very valuable lecture.



What is Data Vault?


Data Vault is a set of rules/methodologies that allow for the comprehensive delivery of a modern, scalable data warehouse. Importantly, these methodologies are universal. For example, they allow for modeling both financial data warehouses where data is loaded on a daily basis, and where backward data corrections are important, as well as warehouses collecting user behavioral data loaded in micro-batches. Data Vault precisely defines the types of objects in which data is physically stored, how to connect them, and how to use them. Thanks to these rules, we can create a high-performance (in terms of reading and writing) fully scalable (in terms of computing power, space, and surprisingly, also manufacturing!) data warehouse. Proper use of Data Vault enables us to fully leverage the scaling capabilities of Cloud, Big Data, Appliance, RDBMS environments (in terms of space and computing power). Additionally, the structure of the model and its flexibility allows for parallel development of the data warehouse model by multiple teams simultaneously (e.g., in the Agile Nexus model).



The two logical layers of the integrated Data Vault model are:


  • Raw Data Vault – raw data organized based on business keys (BKEY) and „hard rules” transformations (explained later in the article).


  • Business Data Vault – transformed and organized data based on business rules.



Both layers can physically exist in one database schema, and it’s important to manage the naming convention of objects appropriately. An issue which I will explain later. The Information Delivery layer (Data Marts) should be built on top of the above layers in a way that corresponds to the business requirements. It doesn’t have to be in the Data Vault format, so I won’t focus on Information Delivery design in this article.


Currently, Data Vault is most popular in Scandinavian countries and the United States, but I believe it is a very good alternative to Kimball and Immon and will quickly gain popularity worldwide.


Data Vault is „Business Centric” data model, which follows the business relationships rather than the systems and technical data structure in the sources. The data is grouped into areas, of which the central points are the so-called Hub objects (which will be discussed later). The technical and business timelines are completely separated. We can have multiple timelines because the time attributes in Data Vault are ordinary attributes of the data warehouse and do not have to be technical fields. On the other hand, Data Vault ensures data retention in the format in which the source system produced it, without loss or unnecessary transformations. It seems impossible to reconcile, yet it can be done.


Data Vault is a single source of facts, but the information an often be multi-faceted. Variants are necessary, because the same data is often interpreted differently by different recipients, and all these interpretations are correct. Facts are data as it came from the source; Such data can be interpreted in many ways, and with time, new recipients may appear for whom calculated values are incomplete. With time, the algorithms used for calculations may also degrade. Data Vault is fully flexible and prepared for such cases.


Data Vault is based on three basic types of objects/tables:


  • Hub: stores only business keys (e.g. document number).
  • Relational Link: contains relationships between business keys (e.g. connection between document number and customer).
  • Satellite: stores data and attributes for the business key from the Hub. A satellite can be connected to either a Hub or a Link.



An example excerpt from a Data Vault model:


As you can see, the Data Vault model is not simple. Therefore, it is recommended to establish the appropriate rules for its development and documentation during the planning phase. It is also important to start modeling from a higher level. The best practice is to build a CDM (Corporate Data Model) in the company, which is a set of business entities and dependencies that function in the enterprise. The Data Vault model should refer to the high-level CDM in its detailed structure. Additionally, it is worth defining naming conventions for objects and columns. It is also necessary to document the model (e.g. in the Enterprise Architect tool).




Data Vault 2.0 – Architecture


In this article, we will focus only on the portion of the architecture highlighted on the diagram. To this end I will explain what the RDV and BDV layers are, how to model them logically and physically, and how to approach data modeling in relation to the entire organization. We will also discuss all types of Data Vault objects, good and bad practices for creating business keys, naming conventions, explain what passive integration is, and discuss hard rules and soft rules. I will try to cover all the key aspects of Data Vault, understanding of which enables the correct implementation of the data warehouse.


High-level diagram of a data warehouse architecture based on Data Vault.



Buisness hard and soft rules


A crucial aspect of a data warehouse is the storage and computation of facts and dimensions. To optimize this process, it’s very important to understand the differences between hard and soft rules transformations. Typically, the lower levels of any data warehouse store data in its least transformed state. This is due to practical considerations, as storing data in the form it was received in is crucial. Why? Because it allows us to use that data even after many years and calculate what we need at any given moment. On the other hand, some transformations are fully reversible and invariant over time, such as converting dates to the ISO format or converting decimal values from Decimal(14,2) to Decimal(18,4). These data transformations in Data Vault are called Hard Rules. Sometimes, we also consider irreversible transformations (for example trimming) as Hard Rules, but we must ensure that the data loss doesn’t have a business or technical impact. All other computations that involve column summation, data concatenation, dictionary-based calculations, or more complex algorithms fall under soft rule transformations. Data Vault clearly defines where we can apply specific transformations.



Raw Data Vault and Business Data Vault


In logical terms, the Data Vault model is divided into two layers:


Raw Data Vault (RDV) – Which contains raw data, with solely hard rules allowed for calculations. Despite this, the RDV model is fully business-oriented, with objects such as Hubs, Links, and Satellites arranged according to how the business understands the data. Technical data layouts, as found in the source system, are not allowed in this layer. This is known as the „Source System Data Vault (SSDV)”, which provides no benefits, such as passive model integration, which will be discussed later. This layer stores a longer history of data according to the needs of the data consumers. It is also a good practice to standardize the source system data types in this layer, for example, by having uniform date and currency formats.


Business Data Vault (BDV) – which allows for any type of data transformation (both hard and soft rules) and arranges the data in a business-oriented manner. The source of data for this layer is always the RDV layer. The fundamental rule of Data Vault is that the BDV layer can always be reconstructed based on the RDV layer. If all objects in the BDV layer are deleted, a well-constructed Data Vault model should allow for its re-population.


Both layers are accessible to users of the data warehouse and their objects can be easily combined. It is recommended to store tables from both the RDV and BDV layers in the same database (or schema) and differentiate them with an appropriate naming convention. 


This concludes the first part of our articles about Data Vault and its implementation. Next week, you will be able to read about data modelling. To make sure you will not miss the next part of the series, be sure to follow us on our social media linked below. And if you have additional questions about data – let’s talk about it!



Data Vault Part 1 - Introduction

What is Data Vault? How to implement it in your organization and how to harness its power in your organization? Take a look and learn!

Read more arrow


As Artificial Intelligence develops, the need for more and more complex models of machine learning and more efficient methods to deploy them arises. The will to stay ahead of the competition and the interest in the best achievable process automation require implemented methods to get increasingly effective. However, building a good model is not an easy task. Apart from all the effort associated with the collection and preparation of data, there is also a matter of proper algorithm configuration.


This configuration involves inter alia selecting appropriate hyperparameters – parameters which the model is not able to learn on its own from the provided data. An example of a hyperparameter is a number of neurons in one of the hidden layers of the neural network. The proper selection of hyperparameters requires a lot of expert knowledge and many experiments because every problem is unique to some extent. The trial and error method is usually not the most efficient, unfortunately. Therefore some ways to optimise the selection of hyperparameters for machine learning algorithms automatically have been developed in recent years.


The easiest approach to complete this task is grid search or random search. Grid search is based on testing every possible combination of specified hyperparameter values. Random search selects random values a specified number of times, as its name suggests. Both return the configuration of hyperparameters that got the most favourable result in the chosen error metric. Although these methods prove to be effective, they are not very efficient. Tested hyperparameter sets are chosen arbitrarily, so a large number of iterations is required to achieve satisfying results. Grid search is particularly troublesome since the number of possible configurations increases exponentially with the search space extension.


Grid search, random search and similar processes are computationally expensive. Training a single machine learning model can take a lot of time, therefore the optimisation of hyperparameters requiring hundreds of repetitions often proves impossible. In business situations, one can rarely spend indefinite time trying hundreds of hyperparameter configurations in search for the best one. The use of cross-validation only escalates the problem. That is why it is so important to keep the number of required iterations to a minimum. Therefore, there is a need for an algorithm, which will explore only the most promising points. This is exactly how Bayesian optimisation works. Before further explanation of the process, it is good to learn the theoretical basis of this method.



Mathematics on cloudy days

Imagine a situation when you see clouds outside the window before you go to work in the morning. We can expect it to rain during any day. On the other hand, we know that in our city there are many cloudy mornings, and yet the rain is quite rare. How certain can we be that this day will be rainy?


Such problems are related to conditional probability. This concept determines the probability that a certain event A will occur, provided that the event B has already occurred, i.e. P(A|B). In case of our cloudy morning, it can go as P(Rain| Clouds), i.e. the probability of precipitation provided the sky was cloudy in the morning. The calculation of such value may turn out to be very simple thanks to Bayes’ theorem.



Helpful Bayes’ theorem


This theorem presents how to express conditional probability using the probability of occurrence of individual events. In addition to P(A) and P(B), we need to know the probability of B occurring if A has occurred. Formally, the theorem can be written as:



This extremely simple equation is one of the foundations of mathematical statistics [1].


What does it mean? Having some knowledge of events A and B, we can determine the probability of A if we have just observed B. Coming back to the described problem, let’s assume that we had made some additional meteorological observations. It rains in our city only 6 times a month on average, while half of the days start cloudy. We also know that usually only 4 out of those 6 rainy days were foreshadowed by morning clouds. Therefore, we can calculate the probability of rain (P(Rain) = 6/30), cloudy morning (P(Clouds) = 1/2) and the probability that the rainy day began with clouds (P(Clouds|Rain) = 4/6). Basing on the formula from Bayes’ theorem we get:



The desired probability is 26.7%. This is a very simple example of using a priori knowledge (the right-hand part of the equation) to determine the probability of the occurrence of a particular phenomenon.



Let’s make a deal


An interesting application of this theorem is a problem inspired by the popular Let’s Make A Deal quiz show in the United States. Let’s imagine a situation in which a participant of the game chooses one of three doors. Two of them conceal no prize, while the third hides a big bounty. The player chooses a door blindly. The presenter opens one of the doors that conceal no prize. Only two concealed doors remain. The participant is then offered an option: to stay at their initial choice, or to take a risk and change the doors. What strategy should the participant follow to increase their chances of winning?


Contrary to the intuition, the probability of winning by choosing each of the remaining doors is not 50%. To find an explanation for this, perhaps surprising, statement, one can use Bayes’ theorem once again. Let’s assume that there were doors A, B and C to choose from. The player chose the first one. The presenter uncovered C, showing that it didn’t conceal any prize. Let’s mark this event as (Hc), while (Wb) should determine the situation in which the prize is behind the doors not selected by the player (in this case B). We look for the probability that the prize is behind B, provided that the presenter has revealed C:



The prize can be concealed behind any of the three doors, so (P(Wb) = 1/3). The presenter reveals one of the doors not selected by the player, therefore (P(Hc) = 1/2). Note also that if the prize is located behind B, the presenter has no choice in revealing the contents of the remaining doors – he must reveal C. Hence (P(Hc|Wb) = 1). Substituting into the formula:



Likewise, the chance of winning if the player stays at the initial choice is 1 to 3. So the strategy of changing doors doubles the chance of winning! The problem has been described in the literature dozens of times and it is known as the Monty Hall paradox from the name of the presenter of the original edition of the quiz show [2].


Bayesian optimisation


As it is not difficult to guess, the Bayesian algorithm is based on the Bayes’ theorem. It attempts to estimate the optimised function using previously evaluated values. In the case of machine learning models, the domain of this function is the hyperparameter space, while the set of values is a certain error metric. Translating that directly into Bayes’ theorem, we are looking for an answer to the question what will the f function value be in the point xₙ, if we know its value in the points: x₁, …, xₙ₋₁.


To visualize the mechanism, we will optimise a simple function of one variable. The algorithm consists of two auxiliary functions. They are constructed in such a way, that in relation to the objective function f they are much less computationally expensive and easy to optimise using simple methods.


The first is a surrogate function, with the task of determining potential f values in the candidate points. For this purpose, regression based on the Gaussian processes is often used. On the basis of the known points, the probable area in which the function can progress is determined. Figure 1 shows how the surrogate function has estimated the function f with one variable after three iterations of the algorithm. The black points present the previously estimated values of f, while the blue line determines the mean of the possible progressions. The shaded area is the confidence interval, which indicates how sure the assessment at each point is. The wider the confidence interval, the lower the certainty of how f progresses at a given point. Note that the further away we are from the points we have already known, the greater the uncertainty.



Figure 1: The progression of the surrogate function



The second necessary tool is the acquisition function. This function determines the point with the best potential, which will undergo an expensive evaluation. A popular choice, in an acquisition function, is the value of the expected improvement of f. This method takes into account both the estimated average and the uncertainty so that the algorithm is not afraid to „risk” searching for unknown areas. In this case, the greatest possible improvement can be expected at xₙ = -0.5, for which f will be calculated. The estimation of the surrogate function will be updated and the whole process will be repeated until a certain stop condition is reached. The progression of several such iterations is shown in Figure 3.



Figure 2: The progression of the acquisition function



Figure 3: The progression of the four iterations of the optimisation algorithm



The actual progression of the optimised function with the optimum found is shown in Figure 4. The algorithm was able to find a global maximum of the function in just a few iterations, avoiding falling into the local optimum.

Figure 4: The actual progression of the optimised function



This is not a particularly demanding example, but it illustrates the mechanism of the Bayesian optimisation well. Its unquestionable advantage is a relatively small number of iterations required to achieve satisfactory results in comparison to other methods. In addition, this method works well in a situation where there are many local optima [3]. The disadvantage may be the relatively difficult implementation of the solution. However, dynamically developed open source libraries such as Spearmint [4], Hyperopt [5] or SMAC [6] are very helpful. Of course, the optimisation of hyperparameters is not the only application of the algorithm. It is successfully applied in such areas as recommendation systems, robotics and computer graphics [7].




[1] „What Is Bayes’ Theorem? A Friendly Introduction”, Probabilistic World, February 22, 2016. (provided July 15, 2020).

[2] J. Rosenhouse, „The Monty Hall problem. The remarkable story of math’s most contentious brain teaser”, January. 2009.

[3] E. Brochu, V. M. Cora, i N. de Freitas, „A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning”, arXiv:1012.2599 [cs], December. 2010




[7] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, i N. de Freitas, „Taking the Human Out of the Loop: A Review of Bayesian Optimization”, Proc. IEEE, t. 104, nr 1, s. 148–175, January 2016, doi: 10.1109/JPROC.2015.2494218.



Smarter Artificial Intelligence with Bayesian Optimization

How to enhance Artifical Intelligence? Learn how to use Bayes’ theorem to optimize your machine learning models with us!

Read more arrow


Data Factory is a powerful tool used in Data Engineers’ daily work in Azure cloud service. The code-free and user-friendly interface helps to clearly design data processes and improve Developer experience. It has many functionalities and features, which are constantly developed and enhanced by Microsoft.


The tool is mainly used to create, manage and monitor ETL (Extract-Transform-Load) pipelines which are the essence of the data engineering world. Therefore, I can confidently say that Data Factory has become the most integral tool in this field in Azure. But have you ever thought about the cost, that the service generates each time it is run? Have you ever done a deep dive into consumption run details, in order to investigate and explain the final price you have to pay each month for the tool?


Whether you have hundreds of long-running daily pipelines or use Data Factory for 10 minutes, once a week in your organization, it generates costs. Therefore, it is a good practice to know how to deal with it and create well-designed, cost-effective pipelines. In this article, you will find out how the small details can double your monthly invoice for Data Factory service. Azure is a pay-as-you-go service, which means that you pay only for what you actually used. However, the pricing details might overwhelm at first sight, and I hope the article will help you understand it more deeply. When you open official website (here or here) you can see that costs are divided into two parts: Data Pipeline and SQL Server Integration Services. In this article I will discuss only the Data Pipeline part, so let’s analyze it together.



Data Pipeline

First of all, it is important to realize that you are not only charged for executing pipelines, but the cost for Data Pipeline is calculated based on the following factors:

  1. Pipeline orchestration and execution
  2. Data flow execution and debugging
  3. Number of Data Factory operations (e.g. pipeline monitoring)



Pipeline orchestration


You are charged for data pipeline orchestration (activity run and activity execution) by integration runtime hours. Azure offers three different integration runtimes which provide the computing resources to execute the activities in pipelines. The below table presents the cost for each integration runtime.



Type Azure Integration Runtime Price Azure Managed VNET Integration Runtime Price Self-Hosted Integration Runtime Price
Orchestration 1$ per 1 000 runs 1$ per 1 000 runs 1.5$ per 1 000 runs
*the presented prices are for West Europe region in March 2022, source.



Orchestration refers to activity runs, trigger executions and debug runs. If you run 1000 activities using Azure Integration Runtime you are charged $1. The price seems to be low, but if you have a process that runs a lot of activities in loops many times a day, you could be surprised how much it could cost at the end of the month.


If you want to study existing pipelines in Data Factory, I recommend you to check the value in Data Factory/Monitoring/Metrics section by displaying charts Succeeded activity runs and Failed activity runs. The sum of these values is a total number of activity runs. The below picture presents how you can check the statistics for Data Factory instance for last 24 hours.






As you can see in the above example, the pipelines are executed every 3 hours and the total number succeeded activity runs is 8320. How much does it cost? Let’s calculate:


Daily price: 8320/1000 * $1 = $8.32


Monthly price: 8320/1000 * $1 * 30 days = $249.6



Pipeline executions


Every pipeline execution generates cost. Pipeline activity is defined as an activity which is executed on integration runtime. The below table presents the pricing of execution Pipeline Activity and External Pipeline Activity. As demonstrated in the below table, the price is calculated based on the time of execution and the type of integration runtime.



Type Azure Integration Runtime Price Azure Managed VNET Integration Runtime Price Self-Hosted Integration Runtime Price
Pipeline Activity $0.005/hour $1/hour $0.10/hour
​External Pipeline Activity $0.00025/hour $1/hour $0.0001/hour
*the presented prices are for West Europe region in March 2022, source.



Depending on the type of activity that is executed in Data Factory, the price is different, as illustrated in Pipeline Activity and External Pipeline Activity sections in the table above. Pipeline Activities use computing configured and deployed by Data Factory, but External Pipeline Activities use computing configured and deployed externally to Data Factory. In order to show which activity belongs where, I prepared the below table.



Pipeline Activities External Pipeline Activities
Append Variable, Copy Data, Data Flow, Delete, Execute Pipeline, Execute SSIS Package, Filter, For Each, Get Metadata, If Condition, Lookup, Set Variable, Switch, Until, Validation, Wait, Web Hook Web Activity, Stored Procedure, HD Insight Streaming, HD Insight Spark, HD Insight Pig, HD Insight MapReduce, HD Insight Hive, U-SQL (Data Lake Analytics), Databricks Python, Databricks Jar, Databricks Notebook, Custom (Azure Batch), Azure ML, Execute Pipeline, Azure ML Batch Execution, Azure ML Update Resource, Azure Function, Azure Data Explorer Command



Rounding up


While executing pipelines, you need you know that execution time for all activities is prorated by minutes and rounded up. Therefore, if the accurate execution time for your pipeline run is 20 seconds, you will be charged for 1 minute. You can notice that in the activity output details in the billingReference section. The below pictures present an example of executing Copy Data activity.








The section billingReference in output details of execution of the activity holds information like meterType, duration, unit. The pipeline was executed on self-hosted integration runtime and lasted 1/60 min = 0.016666666666666666 hour, although the time of execution was 20 seconds.



Inactive pipelines


It was really surprising for me, that Azure charges for each inactive pipeline which has no associated trigger or zero runs within one month. The fee for it is $0.80 per month for every pipeline, so it is crucial to delete unused pipelines from Data Factory especially when you deal with hundreds of pipelines. If you have 100 unused pipelines in your project, the monthly fee is $80 and the yearly cost is $960.



Copy Data Activity




Copy Data Activity is one of the options in Data Factory. You can use it to move the data from one place to another. It is important to know that in Settings you can change the default Auto value to 2. By doing so, you can decrease the data integration unit to a minimum, if you copy small tables. In general, the value of units can be in the range of 2-256 and Microsoft has recently implemented a new feature for Auto option. When you choose Auto, it means that Data Factory dynamically applies the optimal DIU setting based on your source-sink pair and data pattern.


The below table presents the cost of consumption of one DIU per hour for different types of integration runtime.



Type ​Azure Integration Runtime Price Azure Managed VNET Integration Runtime Price Self-Hosted Integration Runtime Price
Copy Data Activity $0.25/DIU-hour $0.25/DIU-hour $0.10/hour
*The presented prices are for West Europe region in March 2022, source.



Let’s estimate cost of a pipeline that has only Copy Data Activity.




If Copy Data Activity lasts 48 seconds, the copy duration time is rounded up to 1 minute, so the cost is equal to:


1 minute * 4 DIUs * $0.25 = 0.0167 hours * 4DIUs * $0.25 = $0.0167


As you can see the price $0.0167 seems to be low, but let’s consider it more deeply. If you execute the pipeline for 100 tables every day, the monthly cost is equal to:


$0.0167 * 100 tables *30 days = $50.1


If you execute the pipeline for 100 tables every single hour, the monthly cost is equal to:


$0.0167 * 100 tables * 30 days * 24 hours = $1,202.4



The most crucial part of creating the pipeline solution is to keep in mind that even if you handle small tables, but do it very often, it could dramatically increase the total cost of the execution. If it is feasible, I recommend preparing the data upfront and using one large file instead. You can just code a simple Python script.





The next factor that could be relevant in regard to pricing is Bandwidth. If you want to transfer the data between Azure data centers or move in or out the data of Azure data centers you can be additionally charged. Generally, moving the data within the same region and inbound data transfer is free, but the situation could be different in other cases. The price depends on the region, internet Egress and differs for Intra-continental or Inter-continental data transfer.


For example, if you transfer 1000 GB data between regions within Europe, the price is $20, but in South America it is $160. When it is necessary to move 1000 GB from Europe to other continents the price is $50, but from Asia to other continents it’s $80. Therefore, think twice before you decide where to locate your data and how often you will have to transfer it. As you notice, there are many factors contributing to the bandwidth price. You can find the whole price list in Azure documentation.


Data Flow




Data Flow is a powerful tool in ETL process in Data Factory. You can not only copy the data from one place to another but also perform many transformations, as well as partitioning. Data Flows are executed as activities that use scale-out Apache Spark clusters. The minimum cluster size to run a Data Flow is 8 vCores. You are charged for cluster execution and debugging time per vCore-hour. The below table presents Data Flow cost by cluster type.



Type Price
General Purpose $0.268 per vCore-hour
Memory Optimized $0.345 per vCore-hour
*the presented prices are for West Europe region in March 2022, source.



It is recommended to create your own Azure Integration Runtimes with a defined region, Compute Type, Core Counts and Time To Live feature. What is really interesting, is that you can dynamically adjust the Core Count and Compute Type properties by sizing the incoming source dataset data. You can do it simply by using activities such as Lookup and Get Metadata. It could be a useful solution when you cope with different dataset sizes of your data.


To sum up, in terms of Data Flows in general you are charged only for cluster execution and debugging time per vCore-hour, so it is significant to configure these parameters optimally. If you want to use one basic cluster (general purpose) for one hour and use a minimum number of Core Count, the total price of execution is equal to:


$0.268 * 8 vCores * 1 hour = $2,144


The monthly price is equal to:

$0.268 * 8 vCores * 30 days * 1hour = $64.32



There are four bottlenecks that depend on total execution time of Data Flow:


  1. Cluster start-up time
  2. Reading from source
  3. Transformation time
  4. Writing to sink


I want to focus on the first factor: cluster start-up time. It is a time period that is needed to spin up an Apache Spark cluster, which takes approximately 3-5 minutes. By default, every data flow spins up a new Spark cluster, based on the Azure Integration Runtime configuration (cluster size etc.). Therefore, if you execute 10 Data Flows in a loop each time, a new cluster is spun up, ultimately it can last 30-50 minutes just for start-up clusters.


In order to decrease cluster start-up time, you can enable Time To Live option. The feature keeps a cluster alive for a certain period of time after its execution completes. So, in our example each Data Flow will reuse the existing cluster – it starts only once, and it takes 3-5 minutes instead of 30-50 minutes. Let’s assume that the cluster start-up lasts 4 minutes.



Scenario 1 – Estimated time of executing 10 Data Flows without Time To Live Scenario 2 – Estimated time of executing 10 Data Flows with Time To Live
Cluster start-up time 40 min 4 min (+ 10 min Time to Live)
Reading from source 10 min 10 min
Transformation time 10 min 10 min
Writing to sink 10 min 10 min



The table above presents two scenarios of execution 10 Data Flows in one pipeline, but the second option has Time To Live feature that lasts 10 minutes.


Cost of executing the pipeline in scenario 1:

70 mins/60 * $0.268 * 8 vCores = $2.5


Cost of executing the pipeline in scenario 2:

44mins/60 * $0.268 * 8 vCores = $1.57


It easy to see that the price in scenario 1 is much higher than in scenario 2.

The most crucial part of using Time to Live option is the way of executing the pipelines. It is highly recommended to use Time To Live only when pipelines contain multiple sequential Data Flows. Only one job can run on a single cluster at a time. When one Data Flow finishes, the second one starts. If you execute Data Flows in a parallel way, then only one Data Flow will use the live cluster and others will spin up their own clusters.


Moreover, each of them will generate extra cost from Time To Live feature, because clusters will wait unused for a certain period of time when they finish. In consequence, the cost could be higher than without Time To Live feature. In addition, before implementing the solution make sure if Quick Re-use option is turned on in integration runtime configuration. It allows to reuse a live cluster for many Data Flows.



Data Factory Operations


The next actions that generate cost are the „read”, „write” and „monitoring” options. The below table presents the pricing.


Type Price
Read/Write $0.50 per 50 000 modified/referenced entities
Monitoring $0.25 per 50 000 run records retrieved
the presented prices are for West Europe region in March 2022, source.


Read/write operations for Azure Data Factory entities include „create„, „read„, „update„, and „delete„. Entities include datasets, linked services, pipelines, integration runtime, and triggers. Monitoring operations include get and list for pipeline, activity, trigger, and debug runs. As you can see, every action in the data pipeline generates cost, but this factor is the least painful one when it comes to pricing, because 50 000 is really a huge number.





I would like to present you one feature that could be helpful in finding bottlenecks in your existing solution in Data Factory. First of all, every executed pipeline is logged in Monitor section in Data Factory tool. Logs contain a data of every step of the ETL process, including pipeline run consumption details, but there they are stored for only 45 days in Monitor. Nevertheless, it is feasible to calculate an estimated price of Pipeline orchestration and Pipeline execution.


I found PowerShell code on Microsoft community website that generates aggregated data of pipelines run consumption within one resource group for defined time range. I strongly believe that the code can be useful for costs estimation of your existing pipelines. It is worth mentioning that this method has some limitations and for example it doesn’t contain information about consumption of Time To Live in Data Flows. In the picture below you can see this information in the red box.





I hope you found this article helpful in furthering your understanding of pricing details and the features that could be significant in your solutions. Microsoft is still improving Data Factory and while preparing this paper I needed to change two paragraphs due to the changes in Azure documentation. For example, from January 2022, you will no longer need to manually specify Quick Re-use in Data Flows when you create an integration runtime and that is great information. I found a funny quote that could describe Azure pricing in general: You don’t pay for Azure services; you only pay for things you forget to turn off – or in this case – “turn on”.


The pricing explanation of Azure Data Factory

See how to optimize the costs of using Azure Data Factory!

Read more arrow

Digital Fashion — Clothes that aren’t there

Sitting in a cozy café in your favorite t-shirt, with one click you change into a shirt and put on a jacket. You can start a conference with your future client. Such perspective is becoming more and more real, and closer than ever, due to concept of Digital Fashion.



Pic. 1. Source



With the development of new technologies, especially 3D graphics (rendering, 3D models and fabric physics), the term is becoming increasingly popular. And what is Digital Fashion really? It is simply digital clothing – a virtual representation of clothing created using 3D software and then „superimposed” on a virtual human model.



Gif. 1. The Fabricant



Digital Fashion seems to be the next step in the development of the powerful e-commerce and fashion markets. Online stores started with descriptions and photos; now 360° product animations have become the norm, and digitally created models’ faces and bodies are increasingly being used for promotional graphics. The time for virtual fitting rooms and maybe even our own virtual wardrobes is coming. Actually, this (r)evolution has already taken its first steps. Let us just look at AR app projects of brands such as Nike (2019) or the collaboration of Italian fashion house Gucci with Snapchat (2020).


Gif. 2. Application for virtual shoe fitting. Source


Where did the need for this type of solution come from? The main, but not the only, factors giving rise to this type of application are:


On-line work and social relations – more and more events are moving or taking place simultaneously in the virtual world. The same applies to professions and even social gatherings. Remote working „via webcam” is no longer the domain of the IT industry, but increasingly appears in the entire sectors of the economy.


Environmental consciousness — digital clothes and accessories do not require farmland or animal husbandry for fabric and leather, as well as 93 billion cubic meters of water to produce textiles, laundry detergents, or global distribution routes. Designed once anywhere in the world, they can be globally available in no time.


The rapid increase in the popularity of items that do not exist in the real world – NFTs (non-fungible tokens) and people adopting digital alter egos.


The new generations are natives of technology. They largely communicate, and thus express themselves, in the virtual world. A perfect example of this trend is the success of fashion house Balenciaga’s campaign done in cooperation with the game Fortnite. Digital-to-Physical Partnerships will become more and more common.


Above, I have only outlined the emerging niche of Digital Fashion. It is also worth mentioning Polish achievements in this field – those interested may refer to the VOGUE article on the Nueno digital clothing brand and the article on Personally, I am extremely curious what virtual reality will bring to the e-commerce and fashion market in the coming years.


Pic.2. Digital Clothes made by STEPHY FUNG.


VR/DF Application — Big Picture

The rapid development of the Digital Fashion niche observed in recent years gives us huge, still largely undiscovered opportunities for the development of new products and services in this area. From designers specializing only in Digital Fashion, through professionals selecting textures for virtual fabrics, to programmers responsible for the unique physics of clothes. Personally, my favorite option would probably be to turn off gravity – you are sitting safely in a chair, and the shirt you’re wearing is acting like you’re in outer space. So naturally, space is created for apps that showcase emerging products and for marketplaces where customers will be able to view and purchase them.


For the purpose of this article, we will take on the challenge of creating just such a solution – an AR app connected to a digital clothing marketplace. The application will give the user the to create their own virtual styling, and clothing brands, as well as related brands, to officially sell their products and NFT.


Basic application principles

In theory, the operation is very simple – the application collects data about the user’s posture from the camera image, then processes it in real time using a library for human pose estimation (technology: OpenCV + Python). The collected data is actually just points in 3D space. They are transferred to the 3D engine, in which a virtual model of the User is created. The 3D model of the character itself is invisible, but interacts with visible clothes and/or accessories (technology: Blender 3D + Python). Ultimately, the user sees himself with the digital clothing superimposed.


Pic. 3. Diagram of the components of the application responsible for the virtual scene.


At this point, it is worth clarifying two terms:



POSE ESTIMATION — pose estimation is a computer vision technique that predicts movements and tracks the location of a person. We can also think of pose estimation as the problem of determining the position and orientation of a camera relative to an object. This is usually done by identifying, locating and tracking a number of key points on a person, such as the wrist, elbow or knee.



RIGGING(skeletal animation) means equipping a 3D model of a human, animal or other character with jointed limbs and virtual bones.These form a skeleton inside the model, which makes it much easier and more efficient for the animator to maneuver – movements of the bones affect the movement of the 3D model.


The exchange of information between the program making the pose estimation and the skeleton inside the human model is the basis of the created application. Data packets about the position of characteristic points on the body, which are x, y, z parameters in space, will be connected with the same points in rigging of the 3D model of the figure.



Pic. 4. Overlaying points from pose estimation on the joints of a 3D human model.



General guidelines for business objectives

The proposed solution does not go in the direction of a virtual avatar (i.e. it does not position itself as a replacement for a person’s image). We are interested in the environment around the person, in the surroundings – clothes, accessories, interiors, etc. – what is around is already a product. Following the proverb „closer to the body than the shirt”, the closest and always fashionable product are clothes – hence we will strongly focus on this segment of the market.


The question arises – what if the user wants to change their eye color? From there it’s close to swapping your hand for that of the Terminator after the fight in the final scene. I identify such needs as very interesting (e.g. in Messenger filters), but infantile. I would describe the proposed solution as a place of man + product, rather than man + visual modification of man. This is intended to imply an image of greater maturity, professionalism and brand awareness. In practice, it is meant to be a place where existing brands can sell products right away. The product focus is also meant to clearly differentiate this solution from the filters familiar from TikTok/Instagram, or animated emoticons on iOS.


Clothing in Metaverse

Just how fresh and hot the topic of digital clothing, and the entire emerging market associated with it is, is indicated by the huge interest generated by the Connect 2021 conference, during which the CEO of Facebook, or, for some, META, presented the Metaverse (’meta’- beyond, and 'universum’- world). This is the concept of a new internet combining the 'internet of things’ with the 'internet of people’. Mark Zuckerberg explained in an interview with The Verge that the Metaverse is „an embodied internet where instead of just viewing content – you are in it”. The author of the term itself is Neal Stephenson, who used it nearly thirty years ago in his cyberpunk book Snow Crash. In it, he describes the story of people living simultaneously in two realities – real and virtual.


The question is not „will it happen?” but rather „when and how it will happen?” As augmented, and virtual reality technologies become increasingly present in our lives, the world that now surrounds us on a daily basis will migrate into the Metaverse. Offices, pubs, gyms, flats are all now our mundane lives and will also be present in digital life. At the center, however, will always be people and their experiences. But what would interactions with others be like without the right attire? A „burning” t-shirt of your favorite band at a virtual concert; a waterfall dress during a New Year’s Eve meta-ball, or a golden shirt at a business meeting summarizing a successful project – although it sounds like science-fiction, this series of articles is an attempt to respond to such needs.



Gif.3. Digital clothing in Metaverse



The evolution of the e-commerce market towards Digital Fashion has already begun. This is possible thanks to the dynamic development of technologies such as Pose Estimation, 3D graphics, and hundreds of other smaller, but very important, innovations appearing every day. In this article, we’ve given an overview of what digital clothing is and the opportunities it presents – for software developers on the one hand, and designers and graphic designers on the other.


In the future articles we will focus on technical issues related to the created application and market. Those interested can count on a large dose of code in Python associated with Pose Estimation and Blender 3D. There will also be plenty of news related to Digital Fashion and Metaverse.


Clothes that aren't there. AR and Python in the Digital Fashion.

Sitting in a cozy café in your favorite t-shirt, with one click you change into a shirt and put on a jacket.

Read more arrow


AWS Glue is an arsenal of possibilities for data engineers to create ETL processes with Amazon resources. It supports a setup of calculating units where jobs can be in the form of Python or Spark scripts made from scratch or using AWS Glue Studio with an interactive visual designer. The designer has a simple interface and comes up with helpful set of ready to use transformations. Still, it also presents some limitation and problems.

The Limitations

The visual designer automatically generates a script for every added transformation. This script can be modified, however, any change to it will block the possibility for further visual development as user code cannot be translated into visual transformations.


Currently there are 15 available transformations, like Select Fields, Join, or Filter. Those basic operations cover up most of typical data operations, yet there is always a need for more complex calculations. In those situations, SQL and Custom transformations come to the rescue. First one extends the job’s capabilities only to SQL functions. Second one allows to create a new transformation with user made Python function that can only accept one parameter and always need to return DynamicFrameCollection.


If there is a need to extend a job with additional parameters they need to be added in the job’s configuration, yet they are also needed to be added manually to the script. If a developer builds the job with visual templates, it makes them impossible to do the development further in the visual designer, as a proper visual operation to add jobs’ parameters into script is not implemented.


The Problems 

Some transformations, like SelectFields, do not handle empty datasets in a proper manner. If empty dataset needs to be processed, those transformations will return an empty object without headers. This in turn will lead to an error in the next step, if any processing is applied on the indicated columns.


There are several problems with the web interface itself, i.e., a significant amount of used visual transformation leads to a complete slowdown of the designer, or if someone wants to change the data type for only one column in ApplyMapping with selection menu, this sometimes causes unexpected changes in all other columns.


Data preview is a great addition to AWS Glue Studio as it allows to observe how parts of data are processed through every transformation. However, if there is any error in a job, it prints a general error message and restarts itself to print the same message on and on. This does not allow to really validate the error, which sometimes forces you to stop viewing the Data preview and run the job in standard mode.


AWS Glue– Tips for Beginners Part II. Limitation of AWS Glue Studio

AWS Glue is an arsenal of possibilities for data engineers to create ETL processes with Amazon resources.

Read more arrow

Introduction to Case Study

AWS Glue is, amongst other AWS services, a great choice for a Big Data project. Alone or even with other services, like AWS Step Function and AWS EventBridge, it may help create a fully operational system for data analysis and reporting. The service provides ETL functionalities, facilitates integration with different data sources and allows a flexible approach to development.


In the following paragraphs I present a review of AWS Glue features and its functionalities based on a real example of integration with external databases and loading data form there to S3 buckets. Whole purpose of this exercise is to present technical side of the service using a practical case and building a simple solution step by step.

The Connection

In the reviewed case, the data source is a PostgreSQL database which is an external resource from AWS. It stores few tabular datasets that are supposed to be moved to Amazon S3. Someone could create a connection to scan this database directly In a form of a script, but here we can use AWS Glue Connections. It allows to create a static connection to databases which stores connection’s definition, the chosen user and its password. It delivers a possibility to connect external databases, Amazon RDS, Amazon Redshift, MongoDB and others.


Based on the established connection in AWS Glue, it is possible to scan databases to know what tables are available there. Developers can use AWS Glue Crawlers which may analyse whole databases model for a chosen database schema to create an internal representation of tables. A Crawler can be run manually or based on a schedule to scan one or more data sources. A successful scan of Crawler creates metadata in Data Catalog for Databases and Tables.

Databases and Tables

Databases in AWS Glue serve a purpose of containers for inferred Tables. Tables are just metadata and they reference actual data in an external source, i.e., their data are not saved in Amazon storage. In a situation where inferred Tables are created with Crawler scanning internal Amazon resources, those Tables would also act only as references. This means that deleting Tables in AWS Glue would only lead to deletion of metadata in Data Catalog, but not to deletion of physical resources on external databases or S3. What developers must also remember is that Tables from external resources are not available for ad-hoc queries using Amazon Athena, even though scanned Databases exists in Amazon Athena.

The Jobs

AWS Glue lets developers create Spark or simple Python jobs, where jobs’ settings can be modified to select type of workers, number of workers, timeouts, concurrency, additional libraries, job parameters and so on. Developers may create a job by writing and passing scripts using Amazon platform or using recent feature in AWS Glue Studio to create jobs with a visual designer.


Picture presents a Glue Studio job in a visual form (left) and its representation in code (right).



Continuing with the case study, in the above picture there is a visually created job that would import data from PostgreSQL databases into S3 bucket. In this simple example, there are only three operations used (left side of the picture): Data source, Transform and Data target. Those operations and additional other built-in transformations simplify the process of creating Glue jobs. First operation directly creates a data frame from an external table by simply indicating Database and Table created in the previous steps. Then, by “filter” transformation, only specific data are saved into S3 bucket with the last operation.


All those three steps can be done manually just by the means of passing parameters in the visual designer. Moreover, visual transformations will generate a ready to run script (right side of the picture). This script can be modified, but that irreversibly switches off a possibility of further modification using the visual designer. This limitation only allows creation of simplest jobs or a start-up of bigger jobs.


The above steps show the features of AWS Glue. Some of them could be omitted, if one would like to create his/her own way of connecting to a different data source using credentials stored in AWS Secrets Manager instead of creating Connection in AWS Glue. Additionally, there are a couple more useful functions of AWS Glue that were omitted in this article, like Workflow, or Triggers. Apart from the nice sides of AWS Glue, there are some disadvantages that need to be taken into consideration. Those will me mentioned in next article about AWS Glue.


AWS Glue – Tips for Beginners. Part I – Review of the Service

AWS Glue is, amongst other AWS services, a great choice for a Big Data project.

Read more arrow
Load more