How to use python to set a variable to a value from a vector attribute table for GRASS


I'm adding to a Python script for GRASS, and I want to set a variable equal to a value from a field in the vector's attribute table, using a for loop to cycle through each row. This is one of my first GRASS scripts; I'm much more comfortable with Arc. In arcpy I would use a search cursor and getValue:

vector = "river.shp"
lines = arcpy.SearchCursor(vector)
for line in lines:
    attributeValue = str(line.getValue("FIELD"))

The existing script already has the right for loop (for line in vector); it's just a matter of figuring out the last line: how to read the field in the attribute table of the line segment currently being cycled through.

You have two options: one in pure Python, and a second, more classical one, calling the GRASS GIS commands.

1) in pure Python

Looking at the GRASS Programmer's Manual: Python (for version 6.4.3), I still have not found a way to write data (attributes) to GRASS vectors from Python.

But it is possible, and easier, with version 7: see the GRASS 7 Programmer's Manual and PyGRASS.

2) more classical: calling the GRASS GIS commands

You can use Python to run the standard GRASS vector commands (db.execute, the v.db.* modules such as v.db.update, etc.); look, for example, at the Python scripts of Antonio Alliegro in Programmazione e GIS: Python (in Italian), at Python Scripts For GRASS GIS, or at pgis with the class gVect().

They use SQL (see SQL support in GRASS GIS), but if you use the dbf driver, some commands are not supported due to its limitations (no problem if you use the SQLite driver).

As an example of the process with the dbf driver:

  1. creating a new column with the sum of two attribute values

  2. changing an attribute value

A table:

# first column (the module name was lost in the original post; v.db.select
# matches the flags="c", map= and col= arguments used here)
ZN = grass.read_command("v.db.select", flags="c", map="geochimcal", col="ZN")
ZN = ZN.split("\n")
ZN = ZN[0:(len(ZN) - 1)]
print ZN
# ['40', '55', '65', '158', '44', '282', '62', '83', '84', '97', '61', '58', '40', '54', '75', '129', '77', '87', '74', '47', '58', '73', '64', '46', '63']

# second column
PB = grass.read_command("v.db.select", flags="c", map="geochimcal", col="PB")
PB = PB.split("\n")
PB = PB[0:(len(PB) - 1)]
print PB
# ['17', '9', '16', '40', '16', '166', '18', '22', '37', '69', '62', '19', '17', '23', '33', '72', '19', '19', '39', '21', '30', '8', '37', '21', '20']

# add a column to the table
grass.read_command("v.db.addcol", map="geochimcal", col="SOMME int")

# sum calculation
SOMME = range(len(PB))
for i in range(len(PB)):
    SOMME[i] = int(ZN[i]) + int(PB[i])
print SOMME
# [57, 64, 81, 198, 60, 448, 80, 105, 121, 166, 123, 77, 57, 77, 108, 201, 96, 106, 113, 68, 88, 81, 101, 67, 83]

# populate the new column
for i in range(len(ZN)):
    query = "UPDATE geochimcal SET SOMME=" + str(SOMME[i]) + " WHERE cat = " + str(i + 1)
    grass.write_command("db.execute", stdin=query)

Change a value:

query = "UPDATE geochimcal SET SOMME=" + str(0) + " WHERE PH = " + str(6.9)
grass.write_command("db.execute", stdin=query)

It is much easier with GRASS GIS 7 and PyGRASS (vector attributes); see also the PyGRASS workshop.
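In GRASS 7, the loop from the question maps almost directly onto PyGRASS. A minimal sketch (it assumes a running GRASS 7 session; the map name "river" and the column "FIELD" are taken from the question and are placeholders):

```python
# Read an attribute for each line of a vector map with PyGRASS.
# Requires a running GRASS 7 session; "river" and "FIELD" are placeholders.
from grass.pygrass.vector import VectorTopo

river = VectorTopo("river")
river.open(mode="r")
for line in river:
    # line.attrs maps column names to the values of the current feature
    attribute_value = str(line.attrs["FIELD"])
    print(attribute_value)
river.close()
```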

For a general kernel it is difficult to interpret the SVM weights; however, for the linear SVM there is a useful interpretation:

1) Recall that in a linear SVM, the result is a hyperplane that separates the classes as well as possible. The weights represent this hyperplane by giving you the coordinates of a vector orthogonal to the hyperplane; these are the coefficients given by svm.coef_. Let's call this vector w.

2) What can we do with this vector? Its direction gives us the predicted class: if you take the dot product of any point with w, you can tell on which side of the hyperplane it lies. If the dot product is positive, the point belongs to the positive class; if it is negative, it belongs to the negative class.

3) Finally, you can even learn something about the importance of each feature. This is my own interpretation, so convince yourself first. Say the SVM found only one feature useful for separating the data; then the hyperplane would be orthogonal to that axis. So you could say that the absolute size of a coefficient, relative to the others, indicates how important that feature was for the separation. For example, if only the first coordinate is used for separation, w will be of the form (x, 0) where x is some nonzero number, and then |x| > 0.
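A concrete sketch of points 1)–3), using scikit-learn's SVC with a linear kernel (the toy data below is invented, with only the first feature informative):

```python
# Interpreting linear-SVM weights: w is orthogonal to the hyperplane,
# the sign of w·x + b gives the predicted side, and |w_i| ranks features.
import numpy as np
from sklearn.svm import SVC

# Two classes separable along the first feature only.
X = np.array([[-2.0, 0.3], [-1.5, -0.4], [-1.0, 0.1],
              [ 1.0, 0.2], [ 1.5, -0.3], [ 2.0, 0.4]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
w = clf.coef_[0]          # vector orthogonal to the separating hyperplane
b = clf.intercept_[0]

# Sign of the dot product (plus intercept) tells us the predicted side.
point = np.array([1.2, 0.0])
side = np.sign(np.dot(w, point) + b)

# |w[0]| dominates |w[1]|: the first feature drives the separation.
print(w, side)
```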

Step 1: Examining the input data

1.1 Visual and statistical data inspection

In the Catalog, select the point map with the sample points. Press the right mouse button and select Properties from the context-sensitive menu. The Point Map - Properties dialog box is opened.

Double-click the point map in the Catalog. The Display Options - Point Map dialog box is presented. Accept the defaults by clicking the OK button. The point map is opened. Check the distribution of the sample points visually. Another way to investigate whether your points are randomly distributed, or appear clustered, regular, or paired, etc., is by doing a Pattern Analysis.

If you decide that there are enough sample points available and that the distribution of points is good enough to do a Kriging interpolation, measure the shortest and longest distance between two sample points in the point map.

You can measure the length of both the shortest and the longest point-pair vector with the Measure Distance button (the pair of compasses) from the toolbar of the map window. You need this distance information later on for defining the number of lags and the lag spacing in the Variogram Surface operation (step 1.2), the Spatial Correlation or the Cross Variogram operation (step 2.1).

If you have sufficient sample points to do a proper Kriging interpolation, split up your data set in two parts. Use one part for interpolation and the other part for verification of the interpolated map.

Next, calculate the variance of the sample data set.

  • When the value data that you want to interpolate is stored in a column of the map's attribute table: open the attribute table. When values are stored in the point map itself, open the point map as a table via the context-sensitive menu. From the Columns menu in the table window, choose the Statistics command. In the Column Statistics dialog box:
    • select the Variance function, and
    • select the input variable for which you want to calculate the variance.
    • Press the OK button. You can use this calculated variance as an indication for the sill when modelling the variogram (step 3.1).

    When the variation of the variable under study is not the same in all directions, then anisotropy is present. In case of suspected anisotropy, calculate a Variogram Surface with the Variogram Surface operation.

    • Select the point map in the Catalog. Press the right mouse button and choose Statistics, Variogram Surface from the context-sensitive menu.
    • The Variogram Surface dialog box is opened. In case the point map is linked to an attribute table, choose the attribute column with the sample data. Enter the lag spacing, the number of lags and a name for the output map.
    • Select the Show check box and press OK. The variogram surface map is calculated.

    The output map is best viewed in a map window using the Pseudo representation, once a histogram has been calculated. To view the coordinates and the position of the origin in the output raster map, you can add grid lines, where the grid distance equals the specified lag spacing. It is important to recognize the origin of the plot/output map.

    • Semi-variogram values close to the origin of the output map are expected to be small (blue in representation Pseudo ), as values of points at very short distances to each other are expected to be similar. When there is no anisotropy, semi-variogram values will gradually increase from the origin into all directions. You will thus find circle-like shapes from the origin outwards where the color gradually changes from blue at the origin to green and red further away from the origin.
    • Your input data is supposed to be anisotropic when you find an ellipse-like shape of low semi-variogram values (blue in representation Pseudo ) in a certain direction going through the origin. In this direction, the semi-variogram values do not increase much. However, in the perpendicular direction, you find a clear increase of semi-variogram values: from blue at the origin to green and red further away from the origin. If anisotropy is present you should use Anisotropic Kriging.

    You can measure the direction of anisotropy with the Measure Distance button from the toolbar of the map window, e.g. by following a 'line' of blue pixels going through the origin of the plot. You need this angle later on in step 2.1 (Spatial Correlation bidirectional method) and step 4.3 (Anisotropic Kriging).

    • The semi-variogram values in the output map can be compared to the overall variance of your input data (calculated in step 1.1)
    • When no points are encountered in a certain directional distance class, the semi-variogram value of that cell/pixel in the output surface will be undefined.
    • When you find many undefined semi-variogram surface values in between a few rather large semi-variogram values, you should consider increasing the lag spacing. Note that results will be more reliable when, say, more than 30 point pairs are found in the individual directional distance classes.
    • When you find very many undefined semi-variogram values, mainly at the outer parts of the surface, you should consider reducing the lag spacing.
    • When using an input point map with very many points, the calculation of a large surface may take quite long. It is advised to start using the operation with rather few lags and/or a rather small lag spacing.
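The semi-variogram values behind such a surface are half the mean squared differences of point-pair values, binned by separation. A minimal isotropic sketch in NumPy (the synthetic coordinates and values are invented for illustration; the real Variogram Surface operation also bins by direction):

```python
import numpy as np

def empirical_semivariogram(coords, values, lag_spacing, n_lags):
    """Half the mean squared difference of point-pair values,
    binned by separation distance (isotropic sketch)."""
    gamma = np.zeros(n_lags)
    counts = np.zeros(n_lags, dtype=int)
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            h = np.linalg.norm(coords[i] - coords[j])
            k = int(h // lag_spacing)
            if k < n_lags:
                gamma[k] += 0.5 * (values[i] - values[j]) ** 2
                counts[k] += 1
    # Lags with no point pairs stay undefined (NaN), as in the surface map.
    return np.where(counts > 0, gamma / np.maximum(counts, 1), np.nan)

# Synthetic example: 50 random points with a smooth spatial signal.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(50, 2))
values = np.sin(coords[:, 0] / 30.0) + 0.1 * rng.standard_normal(50)
gamma = empirical_semivariogram(coords, values, lag_spacing=10.0, n_lags=8)
```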

    One can perform a Kriging operation (i.e. Universal Kriging) while taking into account a local drift or trend that is supposed to exist within the limiting distance defined around each pixel to be interpolated. Very often you know the trend already:

    • when you carried out another interpolation technique like Moving surface or a Trend surface, before you decided that you wanted to use the Kriging interpolation method, or
    • when you know the natural behavior of the variable (e.g. soil textures near a river levee are very often more extreme than in a backswamp area).

    If a global trend is present in the sample set, subtract the trend from the input data with a TabCalc statement. Perform Ordinary Kriging on the de-trended data set and use MapCalc to add both output maps together again. This method is an alternative to Universal Kriging. However, a major disadvantage of this alternative is that the error map is incorrect.
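The de-trending alternative can be sketched as follows (a NumPy stand-in for the TabCalc/MapCalc steps; the sample values are invented, and the Ordinary Kriging of the residuals is left as a placeholder):

```python
import numpy as np

# Sketch of the de-trending alternative to Universal Kriging:
# fit a planar trend z = a + b*x + c*y, krige the residuals, add the trend back.
rng = np.random.default_rng(1)
xy = rng.uniform(0, 100, size=(40, 2))
z = 2.0 + 0.05 * xy[:, 0] - 0.03 * xy[:, 1] + 0.2 * rng.standard_normal(40)

A = np.column_stack([np.ones(len(xy)), xy])   # design matrix [1, x, y]
coef, *_ = np.linalg.lstsq(A, z, rcond=None)  # global trend coefficients
trend = A @ coef
residuals = z - trend      # interpolate these with Ordinary Kriging (not shown)

# After kriging the residuals onto a grid, add the trend surface back
# (the MapCalc step); at the sample points the reconstruction is exact:
reconstructed = residuals + trend
```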

    If you decide that the variable under study is sparsely sampled, find out if there is another variable that is better sampled and has many corresponding sample points (identical XY-coordinates).

    Open the attribute table that is linked to the point map. Find out if there are two columns with value domains and corresponding XY-coordinates. If there is a second variable, calculate the variance of both variables individually and the correlation between the two columns.

    When the two variables are highly correlated, you can use the better-sampled variable and the relationship between the two variables to help to interpolate the sparsely sampled one with CoKriging.

    If the correlation between the two variables is low it is advised to use another interpolation technique or not to interpolate at all.

    • The correlation between the two variables should make sense. In other words, it should be based on physical relationships/laws (e.g. temperature and relative humidity, temperature and height)
    • If the variable is poorly sampled and there is no other variable that can help to interpolate the sparsely sampled one, it is not wise to interpolate because the results do not make much sense. Taking the average of the sample values is as good as interpolating.
    • If the variable is sparsely sampled, you may consider going back into the field and taking more measurements.
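The variance and correlation pre-check for CoKriging can be sketched as follows (the co-located sample values below are invented for illustration):

```python
import numpy as np

# CoKriging pre-check: variances of each variable and their correlation
# at co-located sample points (identical XY-coordinates).
primary   = np.array([12.1, 14.3, 11.8, 15.0, 13.2, 16.1])  # sparsely sampled
secondary = np.array([30.4, 35.9, 29.7, 37.2, 33.0, 40.1])  # better sampled

var_primary = primary.var(ddof=1)     # sample variance (sill indication, step 1.1)
var_secondary = secondary.var(ddof=1)
r = np.corrcoef(primary, secondary)[0, 1]

# A high |r| (backed by a physically plausible relationship) supports CoKriging.
print(var_primary, var_secondary, r)
```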

    Visualise Categorical Variables in Python using Bivariate Analysis

    Bivariate Analysis finds out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level.

    Categorical & Continuous: To find the relationship between categorical and continuous variables, we can use boxplots.

    Boxplots are another type of univariate plot for summarising distributions of numeric data graphically. Let's make a boxplot of carat using the pandas boxplot() method:
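The boxplot call itself is missing from the text; a sketch with pandas follows (the diamonds data is not included here, so a small stand-in DataFrame with the same column names is constructed):

```python
# Sketch of the pandas boxplot calls; the stand-in frame mimics the
# diamonds data's carat, clarity and price columns.
import matplotlib
matplotlib.use("Agg")          # non-interactive backend for scripting
import pandas as pd

df = pd.DataFrame({
    "carat":   [0.3, 0.5, 0.7, 1.1, 0.4, 0.9, 1.5, 0.6],
    "clarity": ["SI2", "SI1", "VS2", "SI2", "IF", "VS1", "SI1", "IF"],
    "price":   [450, 900, 1500, 3200, 700, 2100, 5400, 1100],
})

ax = df.boxplot(column="carat")                   # distribution of carat
axes = df.boxplot(column="price", by="clarity")   # price split by clarity
```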

    The central box of the boxplot represents the middle 50% of the observations, the central bar is the median and the bars at the end of the dotted lines (whiskers) encapsulate the great majority of the observations. Circles that lie beyond the end of the whiskers are data points that may be outliers.

    The boxplot above is curious: we’d expect diamonds with better clarity to fetch higher prices and yet diamonds on the highest end of the clarity spectrum (IF = internally flawless) actually have lower median prices than low clarity diamonds!

    Impact Analysis of Land Use on Traffic Congestion Using Real-Time Traffic and POI

    This paper proposes a new method to describe, compare, and classify traffic congestion points in Beijing, China, using online map data, and further reveals the relationship between traffic congestion and land use. The point-of-interest (POI) data and the real-time traffic data were extracted from an electronic map of the area within the fourth ring road of Beijing. The POIs were quantified based on the architectural area of the land use, and the congestion points were identified based on real-time traffic. Then, a cluster analysis using the attributes of congestion time was conducted to identify the main traffic congestion areas. A linear regression analysis between congestion time and land use showed that the influence of a high proportion of commercial land use on traffic congestion was significant. We also considered five types of land use by performing a linear regression analysis between the congestion time and the ratios of four types of land use. The results showed that a reasonable ratio of land-use types could efficiently reduce congestion time. This study contributes to policy-making on urban land use.

    1. Introduction

    As the urbanization of China accelerates, the expense of urban land leads to an excessive concentration of public functions, causing a growing occurrence of congestion in urban traffic. Land use affects the attracted direction, the ratio of traffic flow, and the travel mode, which are factors related to public traffic demand. Reasonably planning urban land is essential to ensure the efficient operation of urban traffic. Therefore, understanding the correlation between land use and traffic congestion can help optimize urban traffic.

    The traffic data on congestion involves large-scale and complex space-time information, which makes mining the traffic data difficult. Besides, the source of the traffic data is not readily available. Previous studies [1–6] have focused on traffic flow through traditional methods (the conventional four-step travel demand model) without considering geographic information (the coordinates of longitude and latitude, the categories, and the specific location information). The traditional four-step travel demand model typically operates on individual survey data, which is of high cost, low accuracy, and low efficiency. It is therefore necessary to develop a method of judging congestion points using geographic information systems (GIS), which serves as a quick, precise alternative to the conventional four-step models.

    In the literature, limited models on traffic congestion have been proposed to investigate the relation between traffic congestion and urban land use. For example, Wingo Lowdon established an economic model of how transportation, location, and urban land use affected the travel of consumers from their residences to workplaces [7]. Alonso [8] improved this model by considering the value of the urban land, finding that the value of different urban plots was negatively correlated with the transportation cost to the city center. Izraeli and McCarthy [9] (1985) found that residential land had an effect on congestion: there was a significant positive correlation between population density and commuting time. Handy [10] analyzed the impact of land use on travel characteristics and discovered that the frequency of traveling decreased as the density of land use increased, and that the distance of traveling increased as the speed of traveling decreased. Gordon et al. [11] (1989) analyzed satellite data of 82 US metropolitan areas in 1980 to extract information on the densities of different types of land use (residential, industrial, and commercial). When the employment rate at that time was taken into account, it was found that increasing the industrial density, as well as the residential and commercial densities, would lower car commuting time. Ewing et al. [12] (2003) investigated the impact of land use on commuting time and pedestrian delay using cross-sectional data of 83 metropolitan statistical areas for the years 1990 and 2000. The results showed that the commuting time during these two years was negatively correlated with the mixed land-utilization index and positively correlated with street accessibility.

    Unlike the urban land attribute data that is complex for analyzing and classifying, the point of interest (POI) data, which is closely related to urban land attributes and urban planning guidance, can be easily quantified and analyzed. Yu and Ai [13] discussed the characteristics of the spatial distribution of urban POI data and proposed a model to estimate network kernel density for providing guidance to land planning. Ma et al. [14] proposed a visual search model for POIs of highway transportation to help reduce the costs of transportation. Liu et al. [15] computed the attractiveness of POIs according to the number of times of taxis stopping nearby the POIs.

    No previous work has integrated real-time traffic data with POI data. Using the urban road network data, the real-time traffic data, and the POI data, this study explored the correlation between traffic congestion and different attributes of urban land use and established a geographic model of the evolution of urban traffic congestion. The outcomes of this study should contribute to policy-making in the planning of urban land use.

    2. Introduction and Extraction of POI

    2.1. Introduction of POI

    In a Geographic Information System, POIs include houses, scenic spots, shops, and mailboxes. The data from the POIs contain the coordinates of longitude and latitude, the categories, the specific location information, and the User Identification (UID). This study used the electronic map of Beijing POIs since it records an enormous amount of information on city locations.

    2.2. Classification and Extraction of POI

    The default classification of Beijing POIs has 16 first-class categories and 96 second-class categories, most of which are not related to the traveling of residents. This study conducted a detailed survey to obtain a better classification of Beijing POIs that represents resident travel purposes well. The result shows that “work,” “school,” “shopping,” “leisure,” and “return home” are the primary traveling purposes of Beijing residents, accounting for about 85% of the total travel, as listed in Table 1. Thus, new POI categories for classification, including education, work, shopping, residential, and recreation, were created, as shown in Table 2.

    Let $u:\mathbb R^n\to\mathbb R$, $x\mapsto u(x)=x^Tx$. There exists a linear map $\ell_x:\mathbb R^n\to\mathbb R$, called the gradient of $u$ at $x$, such that $ u(x+z)=u(x)+\ell_x(z)+o(\|z\|) $ when $z\to0$.

    To compute $\ell_x$, note that $ u(x+z)=(x+z)^T(x+z)=x^Tx+z^Tx+x^Tz+z^Tz=u(x)+2x^Tz+o(\|z\|), $ hence $ \ell_x(z)=2x^Tz. $ Every linear form $\ell$ on $\mathbb R^n$ has the form $\ell:z\mapsto w^Tz$ for some $w$ in $\mathbb R^n$, hence one often identifies $\ell$ with $w$ (technically, this is identifying the dual of $\mathbb R^n$ with $\mathbb R^n$). In the present case, one may identify the gradient $\ell_x$ of $u$ at $x$ (a linear map from $\mathbb R^n$ to $\mathbb R$) with the vector $2x$ (an element of $\mathbb R^n$), and indeed one often reads the formula $ (\operatorname{grad} u)(x)=2x. $

    Conclusions and future work

    This paper suggests a method that predicts and detects cyber-attacks by using machine-learning algorithms together with data from previous cyber-crime cases. The model predicts the characteristics of the people who may be attacked and the methods of attack to which they may be exposed. The machine-learning methods were observed to be sufficiently successful, with the linear SVM the most successful among them. The success rate of predicting the attacker who will carry out a cyber-attack is around 60% in the model; other artificial-intelligence methods may be tried to increase this ratio. From our approach, we conclude that attention should be drawn especially to malware and social-engineering attacks. It was found that the higher the victim's education and income levels, the lower the probability of a cyber-attack. The primary aim of this study is to guide law enforcement agencies in the fight against cyber-crime and to provide faster and more effective solutions in detecting crime and criminals. New training and warning systems can be created for people with similar characteristics by evaluating the characteristics of the attack victims that emerged in our analysis.

    For future work, crime, criminal, and victim profiling and cyber-attacks can be predicted using deep-learning algorithms, and the results can be compared. Based on talks with other authorized units holding crime databases, cyber-crime data of other provinces may also be obtained for comparison with this study. Intelligent criminal-victim detection systems that are useful to law enforcement agencies in the fight against crime and criminals can be created to reduce crime rates.

    Classification of SDC methods¶

    SDC methods can be classified as non-perturbative and perturbative (see HDFG12).

    • Non-perturbative methods reduce the detail in the data by generalization or suppression of certain values (i.e., masking) without distorting the data structure.
    • Perturbative methods do not suppress values in the dataset but perturb (i.e., alter) values to limit disclosure risk by creating uncertainty around the true values.

    Both non-perturbative and perturbative methods can be used for categorical and continuous variables.

    We also distinguish between probabilistic and deterministic SDC methods.

    • Probabilistic methods depend on a probability mechanism or a random number-generating mechanism. Every time a probabilistic method is used, a different outcome is generated. For these methods it is often recommended that a seed be set for the random number generator if you want to produce replicable results.
    • Deterministic methods follow a certain algorithm and produce the same results if applied repeatedly to the same data with the same set of parameters.
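The probabilistic/deterministic distinction can be illustrated with a small sketch (Python stand-ins for the sdcMicro functions, which are R: an addNoise-style noise addition is probabilistic, top coding is deterministic):

```python
import numpy as np

# Probabilistic vs. deterministic SDC methods, in miniature.
income = np.array([21000.0, 35000.0, 48000.0, 52000.0, 75000.0])

def add_noise(values, noise_fraction, seed=None):
    """Probabilistic: adds Gaussian noise scaled to the data's std deviation."""
    rng = np.random.default_rng(seed)
    return values + rng.normal(0.0, noise_fraction * values.std(), size=values.shape)

def top_code(values, threshold):
    """Deterministic: the same output every run for the same input and threshold."""
    return np.minimum(values, threshold)

# Setting a seed makes the probabilistic method replicable:
a = add_noise(income, 0.1, seed=123)
b = add_noise(income, 0.1, seed=123)
# a and b are identical; without a fixed seed each run would differ.
```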

    SDC methods for microdata intend to prevent identity and attribute disclosure. Different SDC methods are used for each type of disclosure control. Methods such as recoding and suppression are applied to quasi-identifiers to prevent identity disclosure, whereas top coding a quasi-identifier (e.g., income) or perturbing a sensitive variable prevents attribute disclosure.

    As this practice guide is written around the use of the sdcMicro package, we discuss only SDC methods that are implemented in the sdcMicro package or can be easily implemented in R. These are the most commonly applied methods from the literature and used in most agencies experienced in using these methods. Table 6 gives an overview of the SDC methods discussed in this guide, their classification, types of data to which they are applicable and their function names in the sdcMicro package.

    Table 6 SDC methods and corresponding functions in sdcMicro
    Method                  Classification of SDC method      Data type                    Function in sdcMicro
    Global recoding         non-perturbative, deterministic   continuous and categorical   globalRecode, groupVars
    Top and bottom coding   non-perturbative, deterministic   continuous and categorical   topBotCoding
    Local suppression       non-perturbative, deterministic   categorical                  localSuppression, localSupp
    PRAM                    perturbative, probabilistic       categorical                  pram
    Microaggregation        perturbative, probabilistic       continuous                   microaggregation
    Noise addition          perturbative, probabilistic       continuous                   addNoise
    Shuffling               perturbative, probabilistic       continuous                   shuffle
    Rank swapping           perturbative, probabilistic       continuous                   rankSwap

    If you use an l1 penalty on the weight vector, it performs automatic feature selection, as the weights corresponding to irrelevant attributes are automatically set to zero. See this paper. The (absolute) magnitude of each non-zero weight can give an idea of the importance of the corresponding attribute.
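A sketch of this effect with scikit-learn's LinearSVC (the toy data is invented, with only the first feature informative):

```python
# l1-penalized linear SVM: weights of irrelevant features shrink to (near) zero,
# and the remaining magnitudes rank feature importance.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 5))
y = (X[:, 0] > 0).astype(int)          # only feature 0 determines the class

clf = LinearSVC(penalty="l1", dual=False, C=0.1).fit(X, y)
w = np.abs(clf.coef_[0])

# The informative feature dominates; irrelevant ones get (near-)zero weight.
print(w)
```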

    Also look at this paper which uses criteria derived from SVMs to guide the attribute selection.

    Isabelle Guyon, André Elisseeff, "An Introduction to Variable and Feature Selection", JMLR, 3(Mar):1157-1182, 2003.

    is well worth reading; it will give a good overview of approaches and issues. The one thing I would add is that feature selection doesn't necessarily improve predictive performance, and can easily make it worse (because it is easy to over-fit the feature selection criterion). One of the advantages of (especially linear) SVMs is that they work well with large numbers of features (provided you tune the regularisation parameter properly), so there is often no need for it if you are only interested in prediction.

    If you use R, the variable importance can be calculated with the Importance method in the rminer package.

    Inheritance is typically described as an "is-a" relationship. So when you derive a Dog from an Animal , we can say that a Dog is an Animal . However, when you derive a Cat from a Dog , unless this is some other planet, we can't correctly say that a Cat is a Dog . Further, we can, without error, invoke Tina.woof('ferocious') to produce "Tina just did a ferocious woof". Since no Persian cat I've ever seen has been known to "woof", ferociously or not, this is an alarming and surprising result.

    Better would be to derive both Dog and Cat types from Animal . If you have other animals which don't have names, breeds or genders, you could have some intermediate class such as Pet that would capture the additional detail not already in Animal . Otherwise, just add those attributes to Animal .

    Finally, we can put the speak method in the base Animal class. One simple illustration is this:
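A minimal sketch along those lines (the Pet intermediate and the specific attributes are illustrative choices, not the only way to do it):

```python
class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self, sound, manner=None):
        # Shared behaviour lives in the base class; subclasses supply the sound.
        if manner:
            return "%s just did a %s %s" % (self.name, manner, sound)
        return "%s says %s" % (self.name, sound)


class Pet(Animal):
    """Intermediate class for animals that also have a breed and gender."""
    def __init__(self, name, breed, gender):
        Animal.__init__(self, name)
        self.breed = breed
        self.gender = gender


class Dog(Pet):
    def woof(self, manner=None):
        return self.speak("woof", manner)


class Cat(Pet):
    def meow(self, manner=None):
        return self.speak("meow", manner)


tina = Cat("Tina", "Persian", "female")
print(tina.meow("gentle"))      # Tina just did a gentle meow
```

Now a Cat can only meow, a Dog can only woof, and both still share the naming and speaking machinery through Animal.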

    There's more that could be improved, but I hope that helps until others weigh in on your question and give more expansive answers.