NOMAD TCO Kaggle Part 2

This is a continuation from Part 1 where I showed you, given a list of atomic positions and the basis vectors of the unit cell, one could produce a periodic tessellation of that crystal using the fantastic voro++ package. The container object from this tesslation carries a lot of useful information, including the list of neighboring cells, the face areas of those cells, and the list of vertices for each face of the cells. When we left off we had just converted the vertices back to real space.

As in part 1, go ahead and load structure 2100, convert the atomic positions into reduced form, and create a tess container with periodic boundary conditions. We will be calculating a lot of features based on the geometry of the cells in this container, and for the most part we will be using the amount of shared face area is a weighting factor for differences in elemental properties. This is motivated by the fact that elements which share a large degree of available surface area are likely participating in a high degree of bonding.

Within the container object we have two particularly important lists of lists, the positions of the verticies in each cell, and the sets of vertices which contribute to each face. Thus, we first need to convert all the verticies from reduced coordinates back into real space. Using these new coordinates and the list of which vertices correspond to each face, we will then recalculate the face areas in real space. We will loop through all the tessellated cells when performing this construction.

First we’ll define a few helper functions to determine the unit normal of a face with an arbitrary number of vertices, and the total area of the polygon using those vertices. I’m using the excellent Stack Overflow response by Tom Smilack, which you can find here.

def unit_normal(a, b, c):
    x = np.linalg.det([[1,a[1],a[2]],
         [1,b[1],b[2]],
         [1,c[1],c[2]]])
    y = np.linalg.det([[a[0],1,a[2]],
         [b[0],1,b[2]],
         [c[0],1,c[2]]])
    z = np.linalg.det([[a[0],a[1],1],
         [b[0],b[1],1],
         [c[0],c[1],1]])
    magnitude = (x**2 + y**2 + z**2)**.5
    return (x/magnitude, y/magnitude, z/magnitude)


#area of polygon poly
def poly_area(poly):
    if len(poly) < 3: # not a plane - no area
        return 0
    total = [0, 0, 0]
    N = len(poly)
    for i in range(N):
        vi1 = poly[i]
        vi2 = poly[(i+1) % N]
        prod = np.cross(vi1, vi2)
        total[0] += prod[0]
        total[1] += prod[1]
        total[2] += prod[2]
    result = np.dot(total, unit_normal(poly[0], poly[1], poly[2]))
    return abs(result/2)

Here we’ll read out the vertices and face_vertices of the container into two temporary lists, and define an empty list for the real space face areas (face_areas_r). Note that we perform the conversion from reduced to real space inside the for loop, and append the list of new areas for each cell into face_areas_r.

vertices=np.array([v.vertices() for v in cntr])
face_vertices=[v.face_vertices() for v in cntr]

face_areas_r=[]

for j in range(len(face_vertices)):    
    areas=[]    
    tmp=np.array(vertices[j])
    
    for i in range(len(face_vertices[j])):
        corrected=[]        
        dummy=tmp[face_vertices[j][i], :]
        
        for k in range(len(dummy)):    
            corrected.append(np.matmul(A, dummy[k]))
            
        areas.append(poly_area(corrected))
        
    face_areas_r.append(areas)

We now have the list of areas for every face in real space, which will allow for direct comparison between different unit cells/geometries. From Ward et al. some simple descriptors we can calculate are some statistics from the coordination each atom experiences in the structure (ie, number of faces shared for each tessellated cell). This tells us something about the local bonding environment experienced by each atom in the material.

However, some faces may have an exceptionally small shared area, and do not truly represent a bonding state despite being potentially close in real space. Thus, a naive counting of faces may significantly overestimate the local coordination of each element. Instead, for each tessellated cell we’ll calculate the square of the summed face areas divided by the sum of the square. Then we will calculate the maximum, minimum, mean, and mean absolute deviation for each cell in the structure.

coord=[]

for v in face_areas_r:
    
    num=np.sum(v)**2
    denom=np.sum(np.power(v, 2))
    coord.append(num/denom)
    
features_coord=[np.max(coord), np.min(coord), np.mean(coord), np.sum(np.abs(coord-np.mean(coord)))/len(coord)]

Next, let’s consider how elemental properties vary between face sharing cells in the structure. For example, given a central Oxygen atom, we can consider the electronegativity difference of all face sharing elements. To penalize the contribution of neighbors with very small shared areas, we’ll normalize all the descriptors by the area of the shared face.

Let’s define some dictionaries with interesting elemental properties:

electronegativities={"Al":1.61, "In":1.78, "Ga": 1.81, "O":3.44}
electron_affinities = {"Al" : 0.432, "In": 0.3, "Ga": 0.43,  "O": 1.461 }
covalent_radius = {"Al" : 125, "In": 155, "Ga": 130,  "O": 60 }
valence = {"Al" : 3, "In": 3, "Ga": 3,  "O": 2 }
melting_T = {"Al" :660.32, "In": 156.6, "Ga": 29.76,  "O": -218.3 }
mendeleev_num={"Al" :73, "In": 75, "Ga": 74,  "O": 87 }
atomic_weight={"Al" :26.98, "In": 114.8, "Ga": 69.72,  "O": 15.99 }
effective_local_charge={"Al" :4.066, "In": 8.470, "Ga": 6.222,  "O": 4.453 }
heat_capacity={'Al': 24.2, 'Ga': 25.86, 'In': 26.74, 'O': 29.378}   
heat_of_fusion={'Al': 10700.0, 'Ga': 5590.0, 'In': 3260.0, 'O': 222.0}
heat_of_vaporization={'Al': 294, 'Ga': 254, 'In': 231.8, 'O':  6.820}  
first_ionization = {'Al': 577.5, 'Ga': 578.8, 'In': 558.3, 'O': 1313.9}
second_ionization = {'Al': 1816.7, 'Ga': 1979.3, 'In': 1820.7, 'O': 3388.3}    
third_ionization = {'Al': 2744.8, 'Ga': 2963, 'In': 2704, 'O': 5300.5}      
thermal_conductivity = {'Al': 235.0, 'Ga': 29.0, 'In': 82.0, 'O': 0.02658}
molar_volume={'Al': 10.00, 'Ga': 11.803, 'In': 15.76, 'O': 22.4134}
chemical_hardness={'Al': 2.77, 'Ga': 2.9, 'In': 2.8, 'O': 6.08}  
polarizability = {'Al': 57.74, 'Ga': 51.4, 'In': 68.7, 'O': 6.1}  

And let’s define a helper function, which given a list of atoms, neighbors for each atom, and a look-up dictionary of a particular property of interest, will return some statistical metrics for the property differences experienced by each cell.

def calculate_local_properties(train_array, dictionary, neighbors, core_atom, face_areas_r):
    
    dict_core=[dictionary[x] for x in core_atom]
    dict_tot=[]
    
    for i in range(len(face_areas_r)):
        
        element_list_neigh=train_array[neighbors[i], 1]
        areas=face_areas_r[i]
        values_n=np.array([dictionary[x] for x in element_list_neigh])
        dict_tot.append(np.sum(areas*np.abs(values_n-dict_core[i]))/np.sum(areas))
        
    features=[np.max(dict_tot), np.min(dict_tot), np.mean(dict_tot), np.sum(np.abs(dict_tot-np.mean(dict_tot)))/len(dict_tot)]
    return features

Given a particular cell in the structure with a unique core atom, the function will grab all of its neighbors, the shared face areas for those atoms, and the corresponding elemental property of the core atom and each neighbor. We then take the difference between the core atom and neighbor property for all face sharing cells, weighted by the normalized face area. Summing over all faces we then have a list of the summed-face-weighted property difference for all cells.

Finally, the maximum, minimum, mean, and mean absolute deviation across all cells are tabulated and returned. We can then loop through our predefined dictionaries:

local_properties=[electronegativities,
                  electron_affinities,
                  covalent_radius,
                  valence,
                  melting_T,
                  mendeleev_num,
                  atomic_weight,
                  polarizability]

neighbors=[v.neighbors() for v in cntr]
train_array=np.array(train_xyz)
core_atom=train_array[:,1]



for dictionary in local_properties:
    
    property_name=[ k for k,v in locals().items() if v is dictionary][0]    
    features=calculate_local_properties(train_array, dictionary, neighbors, core_atom, face_areas_r)

Now let’s consider the actual bonding distances between neighboring atoms. It’s relatively intuitive to imagine that if all the atoms in a material have very long bonding distances, this structure is unlikely to be very unstable. The energy contained with a bond scales as a power of distance (depending on the potential involved), so having some statistical descriptors of the overall bonding scale within a structure is likely to be a very powerful predictor of formation energy.

We discussed in Part 1 that it is possible to define a network of bonding elements from the raw atomic positions in real space, simply using some cutoff to specify connectivity. This cutoff could be set against some physically inspired value, like covalent radii. One can then simply calculate the mean values of the edges of this weighted graph.

However, consider if two atoms are relatively close due to a very dense unit cell, but share a very small face area. We should not consider this bonding distance when tabulating statistics as these elements are unlikely to actual participate in any kind of bonding in the real material and will not affect the overall formation energy. Our tessellation provides us a facile way to weight the relative distances between elements by their likelihood to participate in a bond.

Consider the following short code, which computes the distance between a core atom and all of its neighbors, and weights it by the normalized shared face area. As before, we can then compute the maximum, mininum, mean, and mean absolute deviation as features in our model.

bond_dist=[]

for i,atom in enumerate(train_xyz):
    
    neighbor_atoms=train_array[neighbors[i], 0]
    dist=[]

    for k in range(len(neighbor_atoms)):
        dist.append( np.linalg.norm(atom[1][0]-neighbor_atoms[k]))
        
    bond_dist.append(np.dot(dist, face_areas_r[i])/np.sum(face_areas_r[i]))   

features=[np.max(bond_dist), np.min(bond_dist), np.mean(bond_dist), np.sum(np.abs(bond_dist-np.mean(bond_dist)))/len(bond_dist)]

That’s all for this time! In part 3, I’ll show you how to calculate some more exotic features based on network analysis of the graph defined by the neighboring voronoi cells.

NOMAD TCO Kaggle Part 2

Basic Feature Extraction from Tessellations

NOMAD TCO Kaggle Part 2

Basic Feature Extraction from Tessellations