What is DarkBERT?

DarkBERT is an AI language model that was trained specifically on data from the dark web. Large language models such as ChatGPT and Google Bard were trained on data from the open web, whereas the developers of DarkBERT used only dark web data for their training run. More specifically, DarkBERT was trained on text produced by hackers, cybercriminals and other fraudsters.

DarkBERT is based on the RoBERTa architecture, which Facebook researchers developed in 2019. RoBERTa, short for “Robustly Optimized BERT Pretraining Approach”, is a pretraining method for Natural Language Processing (NLP) systems that improves on BERT (Bidirectional Encoder Representations from Transformers), released by Google in 2018. A team of researchers from South Korea crawled the Tor network to collect the data for training this language model. They developed DarkBERT by continuing to train RoBERTa on this dark web data for almost 16 days.

Despite the unusual origin of its training data, DarkBERT has already outperformed other language models, such as BERT, in the researchers’ tests on dark web text. The researchers do not currently plan to release DarkBERT to the public, but they are accepting access requests for academic purposes. DarkBERT will likely help law enforcement and researchers to better understand the dark web as a whole.

DarkBERT could point the way for future AI models that are trained on a specific domain to make them more specialized. Given the attention it has received so far, it wouldn’t be surprising to see similar models developed in this way.

DarkBERT: Illustration of the pretraining process and evaluation scenarios (Image: DarkBERT: A Language Model for the Dark Side of the Internet)

Why is a DarkBERT even needed?

In the context of cybersecurity and law enforcement, DarkBERT is a remarkable tool. In tests on dark web text, its domain knowledge allowed it to easily outperform popular models such as BERT, which is itself somewhat dated compared with more powerful transformer models such as GPT. DarkBERT is a further-pretrained version of RoBERTa. It was trained for about two weeks on two different data sets: one containing the raw crawled data and one containing preprocessed data.
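
How the second data set was prepared is not described here. As a rough illustration only, the sketch below shows one plausible way to derive a cleaned corpus from raw crawled pages: exact duplicates are dropped, and e-mail addresses and URLs are replaced with placeholder tokens. The function names, the regular expressions and the `[EMAIL]`/`[URL]` placeholders are assumptions made for this example, not the researchers’ actual pipeline.

```python
import re

# Hypothetical patterns: which identifiers the researchers handled is not stated.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def prepare_page(text: str) -> str:
    """Mask URLs and e-mail addresses, then collapse runs of whitespace."""
    text = URL_RE.sub("[URL]", text)      # mask URLs first so they stay intact
    text = EMAIL_RE.sub("[EMAIL]", text)  # then mask remaining e-mail addresses
    return " ".join(text.split())


def build_corpora(pages):
    """Return (raw, prepared): raw keeps every page, prepared is deduplicated and masked."""
    raw = list(pages)
    seen = set()
    prepared = []
    for page in raw:
        if page in seen:  # skip exact duplicates of pages already processed
            continue
        seen.add(page)
        prepared.append(prepare_page(page))
    return raw, prepared
```

Calling `build_corpora(pages)` on a list of page texts returns both variants, so that a model could be trained on each and the results compared.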

DarkBERT’s primary target audience is not cybercriminals, but law enforcement and cybersecurity organizations that search the dark web to combat cybercrime. According to existing research, the predominant topics on the dark web are fraud and data theft, although it is also used for anonymous discussions within organized crime.

What is the Dark Web?

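The dark web, or darknet, is an area of the Internet that traditional search engines such as Google do not cover. It is often confused with the deep web, a broader term for all content that search engines do not index. The dark web is typically inaccessible to the average user because it requires specialized software, such as a browser for the Tor network.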

While DarkBERT can be an effective tool for combating cybercrime, the ability to surf the web anonymously is also of interest to many law-abiding people. These include people who value their privacy and do not want to hand their data to the big technology companies that have made data collection and personalized advertising their business model. Journalists, dissidents and politically persecuted people use the dark web, for example, to access regionally blocked and censored content.

Why DarkBERT makes sense but is not accessible

Overall, DarkBERT is a versatile and powerful tool that not only helps combat cybercrime, but can also help promote a better understanding of the darknet and the activities that take place there.

There are some notable benefits of DarkBERT:

  • It can identify websites that offer ransomware or leak sensitive data.
  • It can search forums on the dark web and flag illegal exchanges of information.
  • Although it was trained on dark web data, DarkBERT has already outperformed other language models, such as BERT, in the researchers’ tests.

Despite these positive aspects, there are also concerns about using DarkBERT:

  • Because DarkBERT was trained on data from the dark web, some of its applications, and some of the content it has learned from, could be ethically or legally questionable.
  • Data on the dark web is often inconsistent or incomplete, which can reduce the effectiveness of DarkBERT.

DarkBERT is a promising tool for studying the dark web and identifying cybercriminal activity. However, researchers and security experts must continue to improve DarkBERT and address its potential risks and ethical concerns, not least because politically persecuted individuals also rely on the dark web.
