Electronic Discovery and Information Governance – Tip of the Month: AI for an Eye – The Limits of Using Technology in Data Analytics
A company is the victim of a data breach by a malicious actor and is now required to identify and notify an unknown number of people affected by the breach. To do so, the company must identify documents containing protected information, extract data about the data subjects from those documents, and use that data to determine whom to notify and by what means. This process requires a large and complex review of document data from sources with varying degrees of consistency and accessibility, ranging from scanned paper files to spreadsheets containing data for thousands of people.
As with any data review project, the most important cost factor is the number of working hours required for individual document reviewers to review the documents. Technology can increase efficiency and reduce the overall cost of reviewing data, but it has limits. The types of technologies available and the limitations of those technologies are key considerations when deciding how much to rely on technology to structure a review of data arising from a cybersecurity incident.
What types of technologies are there?
Several types of technologies could be used in the data review project described above. Text recognition1 and search technology are familiar and proven technologies useful in every data or document review project. In addition to these technologies, new form and pattern recognition software2 driven by artificial intelligence (AI) can be useful for extracting data from uniformly structured and predictable document types. These technologies hold promise for identifying and extracting data from large document pools but, as noted below, suffer from significant shortcomings that may limit their usefulness.
Where is the technology most useful?
In data review projects where data mining is a project requirement, traditional text recognition and the use of search terms are effective and efficient tools for excluding documents unlikely to contain information that must be extracted. This threshold step is used in almost all data review or eDiscovery projects and is essential in establishing the scope of the data review.
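The threshold culling step described above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended term list: the patterns in `SEARCH_PATTERNS`, the helper names `needs_review` and `cull`, and the sample documents are all hypothetical, and the sketch assumes document text has already been made searchable.

```python
import re

# Hypothetical search terms; a real project would develop and test its own list.
SEARCH_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),             # US Social Security number layout
    re.compile(r"\bdate of birth\b", re.IGNORECASE),
    re.compile(r"\bpassport\b", re.IGNORECASE),
]


def needs_review(text):
    """Return True if any search pattern hits the document text."""
    return any(pattern.search(text) for pattern in SEARCH_PATTERNS)


def cull(documents):
    """Split {doc_id: text} into (ids kept for review, ids excluded)."""
    keep, excluded = [], []
    for doc_id, text in documents.items():
        (keep if needs_review(text) else excluded).append(doc_id)
    return keep, excluded


# Invented sample documents, for illustration only.
sample = {
    "memo-001": "Lunch menu for Friday.",
    "hr-014": "Employee SSN 123-45-6789, Date of Birth 01/02/1980.",
    "travel-007": "Passport renewal reminder for J. Doe.",
}
to_review, excluded = cull(sample)
```

Here `to_review` holds the two documents that hit a term and `excluded` holds the one that did not. Excluded documents would normally be sampled to confirm that the terms are not missing protected information.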
In addition to the proven use of text recognition and search terms, data engineers can create Python scripts3 for extracting data from documents with a high degree of uniformity. For example, some types of Excel spreadsheets or .csv files can be structured in exactly the same way, even though the spreadsheets contain huge amounts of data. These types of files often lend themselves to automated extraction and do not require manual review.
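As a rough sketch of the kind of script described above, the following Python reads a uniformly structured .csv export, refuses any file whose header differs from the expected layout, and removes duplicate records. The column names in `EXPECTED_COLUMNS`, the function names, and the sample rows are assumptions made for illustration; only the standard-library `csv` module is used.

```python
import csv
import io

# Hypothetical column layout, assumed identical across every export.
EXPECTED_COLUMNS = ["full_name", "email", "state"]


def extract_subjects(csv_text):
    """Return one dict per data subject, or raise if the layout differs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        # A non-uniform file should go to manual review, not be guessed at.
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return [
        {col: (row[col] or "").strip() for col in EXPECTED_COLUMNS}
        for row in reader
    ]


def deduplicate(subjects):
    """Keep the first record seen for each email address (case-insensitive)."""
    seen, unique = set(), []
    for subject in subjects:
        key = subject["email"].lower()
        if key not in seen:
            seen.add(key)
            unique.append(subject)
    return unique


# Invented sample export, for illustration only.
sample_export = (
    "full_name,email,state\n"
    "Ada Example, ada@example.com ,IL\n"
    "Ada Example,ADA@example.com,IL\n"
    "Bo Sample,bo@example.com,NY\n"
)
subjects = deduplicate(extract_subjects(sample_export))
```

The strict header check is deliberate: a script like this is only reliable when the files really are uniform, so anything that deviates is better rejected and routed to a reviewer than extracted incorrectly.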
Emerging form and pattern recognition technologies can also be used in data mining projects, but these technologies have significant limitations and their usefulness strongly depends on the nature of the documents involved.
Where do you need a pair of eyes?
AI-based technology abhors irregularity. The automated extraction technologies described above struggle with document sets that are not uniform. PDF files in which pages are oriented in different directions, or in which pages are missing or out of order, can defeat even advanced form and pattern recognition technologies. It is possible to carefully prepare documents for automated extraction, but doing so is a laborious and time-consuming process that leads to inconsistent and sometimes unsatisfactory results. For complex or irregular files, such as those scanned from hard copies or originally compiled by hand, it is best to rely on human reviewers.
Consider the source
Since the usefulness of the technology depends on the format and characteristics of the documents under review, it is important to consider how your data is stored and what types of documents might be disclosed during a cybersecurity incident. Files generated at a local office or franchise may be digitized from paper or otherwise compiled by individual office workers. This often contributes to a lack of uniformity in these documents, especially when data is collected from individual facilities at the national or global level. Even small variations in page order, orientation, or scan quality can cause problems for form and pattern recognition technology. On the other hand, if the project involves examining and extracting automatically generated or structured data in spreadsheets or .csv files, the technology can significantly reduce the number of working hours required to review and extract the data.
What are the timing considerations?
Technology can dramatically increase the speed at which data is extracted from a set of documents. However, it is not an instant process. Generating code and preparing documents for automated extraction can take weeks of dedicated work. Where most of the source documents lend themselves to automated extraction, that time is well spent. Manual review of a large number of documents can take several weeks and will be significantly more expensive than the full-time work of a small team of data engineers.
If you are unsure whether documents can be reviewed using technology, it may be a good idea to start a manual review early on and assess the documents’ suitability for automated extraction on an ongoing basis. eDiscovery vendors can set up a manual review team and start reviewing documents within days. It is faster to start a manual review and siphon appropriate documents into a separate, technology-driven workflow than it is to exhaust AI capabilities first and only then set up a manual review. This is especially true if the data review relates to a cybersecurity incident, as timely notification of affected individuals is a requirement in many jurisdictions.
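The manual-first approach can be pictured as a simple routing rule: every document starts in the manual queue, and only files that are both structured and confirmed to be uniform are siphoned into the automated workflow. The Python below is a hypothetical sketch of that rule. The suffix list, the `passed_uniformity_check` flag, and the file names are assumptions, and how a real project would test uniformity is left open.

```python
from pathlib import PurePath

# Hypothetical set of file types treated as candidates for automation.
STRUCTURED_SUFFIXES = {".csv", ".xlsx", ".xls"}


def route(filename, passed_uniformity_check=False):
    """Default to manual review; automate only confirmed-uniform structured files."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in STRUCTURED_SUFFIXES and passed_uniformity_check:
        return "automated"
    return "manual"


def build_queues(files):
    """files: iterable of (filename, passed_uniformity_check) pairs."""
    queues = {"manual": [], "automated": []}
    for name, passed in files:
        queues[route(name, passed)].append(name)
    return queues


# Invented file names, for illustration only.
files = [
    ("payroll_2019.csv", True),   # structured and confirmed uniform
    ("scan_0042.pdf", False),     # scanned hard copy
    ("roster.xlsx", False),       # structured, but not yet assessed
]
queues = build_queues(files)
```

Because manual review is the default, nothing waits on the data engineers: a file such as `roster.xlsx` stays with the reviewers until its suitability is confirmed, and can be moved across later.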
Using technology in a data review undoubtedly has its advantages, but the technology, as it currently exists, has limitations. The types of records involved and the time frame within which the review must be completed are key considerations in assessing whether the technology will benefit a particular data review project. The technology continues to improve, but, for now, it is important to carefully assess whether the technology can actually provide benefits in a data review project.
1 Text recognition technology, often referred to as OCR (“optical character recognition”), is a technology commonly used to make source documents searchable.
2 Form and pattern recognition technologies rely on OCR technologies to assess whether documents are forms, such as purchase orders, patient records, or invoices. These types of documents generally have the same categories of data in the same place in the form, which facilitates automated extraction.
3 Python is a popular general-purpose programming language used by data engineers to create scripts to perform automated processes, including data extraction, deduplication, and data cleansing.
Mayer Brown is a global provider of legal services comprising legal practices that are separate entities (the “Mayer Brown Practices”). The Mayer Brown Practices are: Mayer Brown LLP and Mayer Brown Europe – Brussels LLP, two limited liability partnerships established in Illinois in the United States; Mayer Brown International LLP, a limited liability partnership incorporated in England and Wales (authorized and regulated by the Solicitors Regulation Authority and registered in England and Wales under number OC 303359); Mayer Brown, a SELAS established in France; Mayer Brown JSM, a Hong Kong partnership and its associated entities in Asia; and Tauil & Chequer Advogados, a Brazilian law partnership with which Mayer Brown is associated. “Mayer Brown” and the Mayer Brown logo are registered trademarks of the Mayer Brown Practices in their respective jurisdictions.
© Copyright 2020. The Practices of Mayer Brown. All rights reserved.
This article by Mayer Brown provides information and commentary on legal issues and developments of interest. The foregoing does not constitute a complete treatment of the matter at hand and is not intended to provide legal advice. Readers should seek specific legal advice before taking any action on the matters discussed in this document.