NOMAD TCO Kaggle Part 1

Recently, a high quality dataset of formation energies and bandgaps for a family of sesquioxide materials was posted on Kaggle by the Fritz Haber Institute of the Max Planck Society (link). For those not in a Materials Science or Chemistry related field, transparent conducting oxides (TCOs) are a crucial class of functional materials which combine optical transparency with electrical conductivity. These materials enable everything from capacitive sensing in the touchscreens of smartphones, to the top electrode of solar cell devices. Given the scope of technological devices which rely on such properties, and there exists considerable demand for even more, there is a pressing need to develop new classes of materials which have favorable properties and are relatively inexpensive. Suitable TCOs today rely on rare earth elements, whose price and supply can fluctuate wildly.

In this competition, we are challenged to produce a quantum machine learning model for predicting the optical performance (related to the band gap) and the stability (represented by the formation energy) of candidate sesquioxide materials based on their stoichiometric ratios and unit cell structures. Direct analytical methods for solving for bandgaps and formation energies do not exist, due to the need to solve a many body Schrödinger equation. Approximations for these methods based on clever atomic potentials are also extremely computationally expensive. Thus, computationally tractable and efficient methods for ranking candidate materials by their properties would go a long way to accelerating research in this field.

In this blog, I’ll give some hints about working with geometry files and extracting meaningful material features based on these three-dimensional structures. I’ll mostly be following along with a recent publication by Ward et al..

Let’s start by importing some helper libraries and the main .csv file, which contains a structure index and the stoichiometry of the structure. Real materials are composed of a specific arrangement of the constituent atoms in a unit cell, which is then repeated to form a periodic lattice. The geometry file accompanying the main .csv contains the positions of the atoms in one-unit cell, along with the basis vectors used to construct a periodic array of cells. These are inherently three-dimensional structures and contain a rich set of information about bonding and symmetry. Here, I’ll show you how to load some of these files line by line, courtesy of Tony Y.

# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np

import plotly.plotly as py
import plotly.graph_objs as go
import plotly as plyt



train=pd.read_csv('./train.csv')

def get_xyz_data(filename):
    pos_data = []
    lat_data = []
    with open(filename) as f:
        for line in f.readlines():
            x = line.split()
            if x[0] == 'atom':
                pos_data.append([np.array(x[1:4], dtype=np.float),x[4]])
            elif x[0] == 'lattice_vector':
                lat_data.append(np.array(x[1:4], dtype=np.float))
    return pos_data, np.array(lat_data)

idx = train.id.values[100]
fn = "./train/{}/geometry.xyz".format(idx)
train_xyz, train_lat = get_xyz_data(fn)

Indium = go.Scatter3d(
    x=[i[0][0] for i in train_xyz if i[1]=='In'],
    y=[i[0][1] for i in train_xyz if i[1]=='In'],
    z=[i[0][2] for i in train_xyz if i[1]=='In'],
    mode='markers',
    name='Indium',
    marker=dict(
        size=12,
        line=dict(
            color='rgba(90,180,172, 0.14)',
            width=0.5
        ),
        opacity=0.8
    )
)

Oxygen = go.Scatter3d(
    x=[i[0][0] for i in train_xyz if i[1]=='O'],
    y=[i[0][1] for i in train_xyz if i[1]=='O'],
    z=[i[0][2] for i in train_xyz if i[1]=='O'],
    mode='markers',
    name='Oxygen',
    marker=dict(
        size=12,
        line=dict(
            color='rgba(216,179,101, 0.14)',
            width=0.5
        ),
        opacity=0.8
    )
)

Aluminum = go.Scatter3d(
    x=[i[0][0] for i in train_xyz if i[1]=='Al'],
    y=[i[0][1] for i in train_xyz if i[1]=='Al'],
    z=[i[0][2] for i in train_xyz if i[1]=='Al'],
    mode='markers',
    name='Aluminum',
    marker=dict(
        size=12,
        line=dict(
            color='rgba(166,97,26, 0.14)',
            width=0.5
        ),
        opacity=0.8
    )
)

Gallium = go.Scatter3d(
    x=[i[0][0] for i in train_xyz if i[1]=='Ga'],
    y=[i[0][1] for i in train_xyz if i[1]=='Ga'],
    z=[i[0][2] for i in train_xyz if i[1]=='Ga'],
    mode='markers',
    name='Gallium',
    marker=dict(
        size=12,
        line=dict(
            color='rgba(128,205,193, 0.14)',
            width=0.5
        ),
        opacity=0.8
    )
)



data = [Indium, Oxygen, Aluminum, Gallium]
layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='unitcell_orig')

Above, I have plotted the position of each atom in the unit cell according to its physical position. Feel free to rotate the plot around and get a feeling for the structure and the symmetry. Individual elements are color coded and can be activated/deactivated as desired.

The unit cells themselves may not occupy a perfect square, in fact this is very uncommon! When extracting information about these materials, we would like the features to be descriptive of the material, rather than just the unit cell. Thus, we need to package our cell in such a way that periodicity can be handled efficiently. Remember, earlier we mentioned how the periodicity is encoded in the basis vectors for the material. Transforming our structures to a reduced coordinate form, where the positions of the atoms are represented in terms of the basis vectors, rather than real space vectors, will encode all the unit cell information in a square with boundaries from zero to one. Repeating this reduced unit cell in unit steps along any dimension will allow us to encode periodicity without worrying about oblique shapes.

As mentioned above, we are given the three crystal lattice vectors, which together can form a basis for the crystal: $(a_1, a_2, a_3)^T$ . We can describe the position of an atom in the crystal R(x,y,z), where x,y,z are the real space coordinates, in terms of the basis vectors themselves:

$R(x,y,z)=a_1x+a_2y+a_3z$

Taking the inverse of the basis vector matrix, we can solve for the reduced form coordinates:

$r(x,y,z)=A^{-1}R(x,y,z)$

from numpy.linalg import inv

A=np.transpose(train_lat)
B = inv(A)

xyz=[]

for i in enumerate(train_xyz):
    
    r = np.matmul(B, i[1][0])
    xyz.append(r)
    
xyz=np.array(xyz)

trace1 = go.Scatter3d(
    x=xyz[:,0],
    y=xyz[:,1],
    z=xyz[:,2],
    mode='markers',
    marker=dict(
        size=12,
        line=dict(
            color='rgba(217, 217, 217, 0.14)',
            width=0.5
        ),
        opacity=0.8
    )
)


data = [trace1]
layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='simple-3d-scatter')

min_x=np.min(xyz[:,0])
max_x=np.max(xyz[:,0])
min_y=np.min(xyz[:,1])
max_y=np.max(xyz[:,1])
min_z=np.min(xyz[:,2])
max_z=np.max(xyz[:,2])

print(min_x, max_x, min_y, max_y, min_z, max_z)

-7.11507675694e-20 0.966 -5.42101086243e-20 0.966 -1.08420217249e-19 0.966

As promised, the reduced coordinates fill a unit square parameterized by the lattice vectors of the crystal. This will allow us to build periodic arrays of the unit cell simply by stacking unit squares, for an arbitrary unit cell/basis vectors.
We would like to encode this three-dimensional information that in some intuitive fashion gives us an idea about the stability or electronic structure of the material. The formation energy or stability of a compound is related to the energies contained in the bonds of the material. One could simply construct a graph for the crystal structure, where connectivity is determined by some cut off (or perhaps inspired by the ionic radii of the elements). However, this often requires extensive hand tuning for each structure, and is a therefore low throughput and fragile descriptor. Encoding periodicity is also relatively challenging, as this network analysis cannot be performed in reduced coordinates, since bond lengths are expressed in real space!

To overcome these challenges, we will consider the construction of Wigner-Seitz Cells for each element in the crystal. Each cell is composed of the points which are closer to the position of a particular atom than any other atom center. Such a construction gives an intuitive expression for the available volume of each element. Neighboring elements which share a large face should therefore participate in a relatively large amount of bonding. Efficient methods for calculating these cells can be done through a voronoi tessellation by constructing vectors from each atom to all others and taking the perpendicular bisectors.

We will use a python package tess, which is a python interface to the voro++ package. This package has built in support for periodic cells.

import tess as ts

cntr=ts.Container(xyz, limits=((0,0,0),(1,1,1)), periodic=True)

cells=[v.centroid() for v in cntr]
cells=np.array(cells)
np.shape(cells)

(80, 3)

As expected, for an 80 atom structure we have produced a tessellation with 80 unique cells.

We will now pick three random cells, [0,1,15] and plot their volume in 3D.

cell_vertices=[v.vertices() for v in cntr]
cell_vertices=np.array(cell_vertices)

cell_vertices0=np.array(cell_vertices[0])
cell_vertices1=np.array(cell_vertices[1])
cell_vertices2=np.array(cell_vertices[2])

import random
r = lambda: random.randint(0,255)

trace0 = go.Mesh3d(x=cell_vertices0[:,0],
                   y=cell_vertices0[:,1],
                   z=cell_vertices0[:,2],
                   alphahull=0,
                   opacity=0.4,
                   color='#%02X%02X%02X' % (r(),r(),r()))


trace1 = go.Mesh3d(x=cell_vertices1[:,0],
                   y=cell_vertices1[:,1],
                   z=cell_vertices1[:,2],
                   alphahull=0,
                   opacity=0.4,
                   color='#%02X%02X%02X' % (r(),r(),r()))


trace2 = go.Mesh3d(x=cell_vertices2[:,0],
                   y=cell_vertices2[:,1],
                   z=cell_vertices2[:,2],
                   alphahull=0,
                   opacity=0.4,
                   color='#%02X%02X%02X' % (r(),r(),r()))


py.iplot([trace0, trace1, trace2], filename = 'voronoi surface')

And now we’ll plot all of the cells:

trace=[]
i=0
for cell in cell_vertices:
    tmp=np.array(cell)
    trace.append(go.Mesh3d(x=tmp[:,0],
                   y=tmp[:,1],
                   z=tmp[:,2],
                   alphahull=0,
                   opacity=0.4,
                   color='#%02X%02X%02X' % (r(),r(),r())))
    

py.iplot(trace, filename = 'voronoi surface2')

Note that the space is still in reduced coordinates. In order to interpret the information in our vornoi cell and compare it to tessellations of other structures (with different lattice basis vectors), we will need to convert to a common set of coordinates: real space. To do this, we’ll simply multiply by the original crystal lattice vectors, as described above.

trace=[]
i=0
for cell in cell_vertices:
    tmp=np.array(cell)
    
    for i in range(np.shape(tmp)[0]):
        tmp[i,:]=np.matmul(A, tmp[i,:])
    
    trace.append(go.Mesh3d(x=tmp[:,0],
                   y=tmp[:,1],
                   z=tmp[:,2],
                   alphahull=0,
                   opacity=0.4,
                   color='#%02X%02X%02X' % (r(),r(),r())))
   

    

py.iplot(trace, filename = 'voronoi surface2_realspace')

Finally, we get our vornoi tessellated cells back in real space coordinates and plot the results. As you can see, this method neatly handles oblique and oddly shaped unit cells while accounting for periodicity. In the next following posts I will demonstrate how one can engineer features from these cells.

NOMAD TCO Kaggle Part 1

Crystal Voronoi Tessellation

NOMAD TCO Kaggle Part 1

Crystal Voronoi Tessellation