
How does AIoT’s voice technology come to the fore?

AIoT combines AI technology with IoT technology. Beyond technical innovation, applying the core technologies and actually bringing them to market is also a key issue in this field.


New technologies and production processes (for example, early cars) are improved by being applied and adopted, and are then applied and adopted further, creating positive feedback, or increasing returns. – “The Nature of Technology”, Brian Arthur

In the last article, we argued that AIoT is not just a slogan: it has its own user value and business value logic. Starting with this article, we will talk about the main technologies used in AIoT. This is also my attempt to sort out and consolidate what I have learned since I got into (or fell into the pit of) this field. If you spot any problems, discussion is welcome.

In a field as technically demanding as AI, understanding the technical principles and boundaries, and combining them with market demand, lets us deliver product services more efficiently.

The core content of this article has the following points:

First, we start with the AIoT industry chain to get a macro-level understanding of the entire AIoT industry;

Second, we focus on the technical principles of voice technology in AIoT and on how to turn it into product services.

The remaining major technical modules will be updated in subsequent articles.

AIoT Industry Chain

The AIoT industry chain mainly includes the following parts:

Upstream: Hardware (chip manufacturers, communication modules, etc.); Software (AI technology, IoT technology)

Midstream: Operating System, App, Cloud Service

Downstream: Channels (online/offline)

AIoT Industry Chain

From the picture above, we can get a general view of the entire AIoT industry. It spans both hardware and software and involves a great many modules, so a product manager has a lot of room to work in this area.

What do AIoT products do?

Different types of AIoT products demand different levels of technical understanding. For platforms such as Alibaba Cloud IoT and Tencent Cloud IoT, the business goal is to build an ecosystem and act as basic infrastructure (the “water, electricity and coal” of the industry). Their main output is at the PaaS layer and serves developers directly, so the technical bar for their product managers is very high; they generally have several years of relevant development experience.

For experience-layer products that face the user directly through front-end interaction, the technical requirements are not as high; the further downstream in the industry chain, the lower the technical requirements for the product manager. Combined with the industry chain structure above, the work can be divided into three major blocks:

Hardware products: responsible for the entire terminal hardware experience. This requires understanding the full chain from hardware definition and design to final mass production. I will not expand on it here; hardware products will be covered later;

Software products: responsible for the entire IoT software service experience. This is a large module that further splits into app products, system products, and IoT platform products; if there are online channels, there are also e-commerce products, and so on;

AI algorithm products: responsible for the entire AI experience. Following the technology pipeline, they can be subdivided into acoustic front-end products, ASR products, NLP products, and TTS products. Let’s expand on this technology below.

AIoT product function and industry chain diagram

AIoT Voice Technology

For product managers, understanding the main technical aspects of voice technology lets you:

Narrow down issues quickly, helping developers locate and fix problems more efficiently;

Deliver stable products. Only by understanding the technical principles and boundaries can you quickly deliver stable product services, which is the most basic requirement for both C-end and B-end customers.

Here we take the example of a user controlling a light through a voice-controlled device (see the following flowchart for details):

Voice Control Smart Home Flow Chart

When the user issues the “Turn On Light” command, the following steps are taken:

Step 1 Pickup

Depending on the usage scenario, pickup is divided into near-field pickup (generally within 3 m) and far-field pickup (generally 3-5 m). This part is technically called the acoustic front end.

The main principle is to capture the user’s voice accurately through a single mic or a mic array, in preparation for the next step, ASR (speech recognition). It mainly includes the following technical points (the full pipeline involves many more; the list below covers only the main ones that affect product experience):

VAD (Voice Activity Detection). It analyzes audio features to determine the start and end points of speech. In products, a common symptom is that a command is only partially recognized. For example, only “turn” is recognized from “turn on the light”, so no skill is matched and the user’s intent cannot be completed. This may be a VAD problem in which the audio is cut off abnormally;

AEC (Acoustic Echo Cancellation). If the device is playing music or other audio while the mic is picking up sound, the mic also captures the device’s own playback; AEC removes that playback from the captured signal so it does not come back as echo. For any smart voice device that plays audio, this is an experience point that must be evaluated. For example, if echo problems often appear while music is playing, the AEC algorithm may not be performing well;

BF (Beamforming). It enhances sound from a single direction and weakens unrelated sound, making the captured voice cleaner. This is the core technique for improving recognition in noisy environments. If your product recognizes speech poorly in a noisy environment, this is a good place to start.
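To make the VAD truncation problem concrete, here is a minimal sketch of an energy-based detector. It is only an illustration under simple assumptions: real acoustic front ends use far more robust features, and the frame length, threshold, and `hangover` parameter below are hypothetical values chosen for the example, not taken from any particular product.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each full frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_speech(samples, frame_len=160, threshold=0.01, hangover=2):
    """Return (start, end) sample indices of the first speech segment, or None.

    hangover is the number of consecutive quiet frames tolerated before the
    segment is closed. Too small a value cuts an utterance off early.
    """
    start = None        # index of the first active frame
    last_active = None  # index of the most recent active frame
    quiet = 0           # consecutive quiet frames since last_active
    for idx, energy in enumerate(frame_energies(samples, frame_len)):
        if energy >= threshold:
            if start is None:
                start = idx
            last_active = idx
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover:
                break
    if start is None:
        return None
    return start * frame_len, (last_active + 1) * frame_len


# Synthetic "utterance": silence, a word, a short pause, a second word, silence.
samples = [0.0] * 320 + [0.5] * 480 + [0.0] * 160 + [0.5] * 320 + [0.0] * 480
full = detect_speech(samples, hangover=2)       # tolerates the short pause
truncated = detect_speech(samples, hangover=0)  # stops at the pause
```

With `hangover=2` the detected segment spans both words; with `hangover=0` it ends at the pause, so only the first word would reach ASR. This is the kind of behaviour behind a command being recognized only in part.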

Step 2 ASR (Automatic Speech Recognition)

This step converts the voice captured by the front end into text and passes the text to the next step, NLP (Natural Language Processing). The main evaluation metrics are recognition rate and false wake-up rate. hanniman has explained this point in more depth, so I will not go into detail here.

Step 3 NLP (Natural Language Processing)

The purpose of natural language processing is to convert text into a form the machine can understand, determine the user’s intent, and prepare for triggering that intent in the next step. On the product operations side, it mainly breaks down into the following parts:

Domain. Music and smart home, for example, each count as one domain. A domain is equivalent to a category. For example, if I want to create a TV control skill, I first create a TV domain;

Intent, that is, what the user wants the machine to do. In this article’s example, “turn on the light” is the user’s intent. The same intent can be phrased in different ways, such as “turn the light on” or “lights on”, so we need something called a Pattern to handle different wordings and sentence structures. Product operators configure several commonly used sentences, and the algorithm then enumerates and generalizes from them;

Slot, that is, the word slot. In this example, “turn on” and “light” are word slots.
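The relationship between domain, intent, pattern and slot can be sketched with a small rule-based matcher. This is a simplified illustration, not how any particular voice platform implements NLP: the `SKILLS` table, the intent names and the patterns are all hypothetical, and for brevity the action is carried by the intent name while only the device is captured as a slot.

```python
import re

# Hypothetical skill configuration: one domain, its intents, and the
# patterns an operator might configure for each intent.
SKILLS = {
    "smart_home": {
        "turn_on": [
            r"turn on the (?P<device>\w+)",
            r"turn the (?P<device>\w+) on",
            r"switch on the (?P<device>\w+)",
        ],
        "turn_off": [
            r"turn off the (?P<device>\w+)",
            r"turn the (?P<device>\w+) off",
        ],
    },
}


def parse(text):
    """Return domain, intent and slots for the first matching pattern, else None."""
    normalized = text.lower().strip(" .!?")
    for domain, intents in SKILLS.items():
        for intent, patterns in intents.items():
            for pattern in patterns:
                match = re.fullmatch(pattern, normalized)
                if match:
                    return {
                        "domain": domain,
                        "intent": intent,
                        "slots": match.groupdict(),
                    }
    return None


result = parse("Turn the light on.")
# result["intent"] is "turn_on" and result["slots"] is {"device": "light"}
```

Two different wordings resolve to the same intent because both are listed as patterns. An utterance that matches no configured pattern returns `None`, which is where algorithmic generalization beyond hand-written patterns becomes necessary.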

Step 4 Platform Forwarding

Voice vendor’s IoT platform → device vendor’s IoT platform → vendor’s device. The smart home field is somewhat special: from the user’s perspective, one user may own smart home devices from different brands; from the market’s perspective, the current smart home market has a huge variety of products and is fragmented.

Take Tmall Genie as an example: it now connects to 600+ brands. Connecting each of them only through individual skills is not conducive to vendor operations management or to user experience. Therefore, most voice vendors also build a management platform for smart home devices.

After the result of NLP processing is passed to the voice vendor’s IoT platform, that platform forwards it to the corresponding third-party vendor’s IoT platform according to the smart home brands the user has bound and the capabilities of the devices. The third-party IoT platform finally delivers the control command to the target device, completing the entire control link.
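The forwarding step can be sketched as a lookup over the user’s bound devices followed by a capability check. Everything named here (`USER_BINDINGS`, the vendor names, the device IDs, the message shape) is hypothetical and only illustrates the routing idea; real platforms define their own protocols.

```python
# Hypothetical bindings: which vendor platform each of a user's devices
# belongs to, and which intents that device supports.
USER_BINDINGS = {
    "user-1": {
        "light": {
            "vendor": "vendor_a",
            "device_id": "lamp-01",
            "capabilities": {"turn_on", "turn_off"},
        },
        "tv": {
            "vendor": "vendor_b",
            "device_id": "tv-07",
            "capabilities": {"turn_on", "turn_off", "set_channel"},
        },
    },
}


def route_command(user_id, device, intent, bindings=USER_BINDINGS):
    """Build the message the voice vendor's platform would forward to a vendor platform."""
    binding = bindings.get(user_id, {}).get(device)
    if binding is None:
        raise LookupError(f"no {device!r} bound for {user_id!r}")
    if intent not in binding["capabilities"]:
        raise ValueError(f"{device!r} does not support {intent!r}")
    return {
        "vendor": binding["vendor"],
        "device_id": binding["device_id"],
        "command": intent,
    }


message = route_command("user-1", "light", "turn_on")
# message["vendor"] is "vendor_a", so the command goes to that vendor's platform
```

Keeping the bindings and capabilities in one place is what lets a single voice entry point serve devices from many brands without a separate skill per brand.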

Step 5 TTS (Text To Speech)

As the name suggests, this step converts text into speech. If your central control device has a speaker, it can announce the result by voice once the control link is complete, which improves the product experience and closes the experience loop.

Voice Technology*AIoT

The technologies above can be combined in many ways. It is easy to count 25 different technical solutions that could be offered to customers, and countless services for C-end users. But the biggest problem for many companies today is how to find their own patch of sky in this vast ocean (even air conditioners now integrate voice capabilities, which many people find hard to understand).

Let me share my view on how voice technology lands in AIoT (the following method applies equally to C-end and B-end products):

First, efficiency. Efficiency comes first: the top-priority metric for applying any new product or technology is whether it is more efficient than the original service. What does efficient mean? Whoever spends the least time doing the same thing is the most efficient. Take the user scenario “I want to watch Hunan Satellite TV after turning on the TV”. Here is a comparison of the operation paths on three different types of TV:

Traditional TV: press the channel key on the remote → use the left and right buttons to flip through 3-4 pages (outside Hunan province, Hunan Satellite TV may be listed 3 or 4 pages in) → use the up and down buttons to select Hunan Satellite TV → press OK. This takes roughly 5-6 steps;

Smart TV (without voice): My Apps → TV Cat app → search for Hunan Satellite TV → click OK. This takes about 4 steps;

Voice TV: one sentence, “xxx, I want to watch Hunan Satellite TV”, and in some cases even the wake word can be skipped. Just 1 step.

Second, cost: consider the effort and money consumed per unit of time. Asking only “who spends the least time doing the same thing” is not enough, because fast does not mean cheap, so we must also consider the effort and money consumed per unit of time.

For example, if a task takes 2 hours for 20 yuan, and paying 60 yuan only shortens it to 1.5 hours, the extra spending is poor value for money.

Take smart air conditioners with integrated voice capabilities as an example. At present, such air conditioners are priced between ¥6999 and ¥9999 and mainly target the high-end market, while a voice module costs only a few dozen yuan. This cost is entirely affordable and even gives the product more room for a price premium.

In small household appliances, by contrast, unit prices are generally low, so this cost can create real pressure. That is why voice modules are currently more widely used in appliances such as televisions and air conditioners. So besides the user scenario, the cost dimension is also an important consideration;

Finally, influence: consider the external impact of doing this, that is, the feedback between your product and its users or customers. It is mainly divided into positive influence and negative influence:

Positive influence, for example a TTS experience close to a real human voice, or a natural human-machine dialogue experience;

Negative influence, for example the Amazon Echo “ghost” incident a few months ago.

Influence can be measured both qualitatively (satisfaction, etc.) and quantitatively (daily active users, retention, etc.), but to keep the following discussion simple, we just record positive influence as a positive number and negative influence as a negative number.

AIoT Product Service Formula

Summary: I judge the quality of an AIoT product service by combining these three elements, summarized in the formula shown in the figure:

AIoT Product Service = Efficiency / Cost * Influence

From this formula we can easily conclude that a good AIoT product service needs to be efficient, low-cost and positively influential, and that its value grows in multiples of its positive influence.

By the same token, we can quickly see which factors make for a poor AIoT product service.

To make this easier to understand, let’s go back to the example mentioned above: is integrating voice capability into air conditioners a good AIoT product service?

First, voice control is more convenient than an air conditioner’s physical remote control. To switch to cooling mode from the initial state, you have to press the “mode” button twice, whereas voice does it in one sentence, so efficiency improves by 50%;

Second, for the manufacturer, assume a module costs ¥50. Air conditioners with voice capability are currently priced between ¥6999 and ¥9999. Taking a price of ¥6999 and a gross margin of 35% (several air conditioner manufacturers are currently at this level, and high-end models certainly have a higher margin), the module accounts for about 1% of cost, which is easily absorbed;

Third, regarding influence: leaving aside that voice control is more efficient than a remote control in some scenarios, the user has bought a more expensive air conditioner with voice capability (besides control, it can also answer questions about the weather and so on). Even if the feature is rarely used day to day, it is at least something to show off. For example, when a guest visits, the owner can say “I can control the air conditioner by voice”. Compared with a high-end air conditioner without voice, it has some added value. Suppose satisfaction is scored from -5 to 5; it deserves at least a 3;

Finally, plugging these into the formula (50 / 1 * 3), the AIoT product service score for air conditioners with integrated voice capability is 150 points. From this perspective, integrating voice capability into air conditioners has positive value.
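The arithmetic behind this score can be reproduced in a few lines. This is a sketch of one reading of the example, with the function names invented for illustration: efficiency is the percentage gain (50), cost is the module’s share of production cost in percent (derived from the ¥50 module, the ¥6999 price and the 35% gross margin, then rounded to 1 as in the text), and influence is the satisfaction score (3).

```python
def module_cost_share(module_cost, retail_price, gross_margin):
    """Voice module cost as a percentage of the product's production cost."""
    production_cost = retail_price * (1 - gross_margin)
    return module_cost / production_cost * 100


def service_score(efficiency_gain_pct, cost_share_pct, influence):
    """Efficiency / Cost * Influence, in the units of the air conditioner example."""
    return efficiency_gain_pct / cost_share_pct * influence


share = module_cost_share(50, 6999, 0.35)   # about 1.1 percent
score = service_score(50, round(share), 3)  # 50 / 1 * 3 = 150
```

Because influence is a signed multiplier, a negative satisfaction score flips the whole result negative no matter how efficient or cheap the service is, which matches the point that negative influence makes for a poor product service.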


In “The Nature of Technology”, author Brian Arthur (one of the pioneers of complexity science) argues that:

New technologies and production processes (for example, early cars) are improved by being applied and adopted, and are then applied and adopted further, creating positive feedback, or increasing returns.

The IoT industry is still in its early stages, and understanding the “new technology” of voice can help us work in it with more confidence. I hope all my peers can use this “new technology” to create more positive feedback and increasing returns.

