Post 52 - Is that a log or a big snake?

02 Dec 2023

Task 8, Day 02, `Log analysis` O Data, All Ye Faithful

Data Science

Often incvolves programming, statistics, and the use of AI to examine and understand trends and patterns. Data scientist roles:

Role	Description
Data Collection	Collecting raw data (i.e. List of recent transactions)
Data Processing	Turning raw data from collection into standard format
Data Mining (Clustering/Classification)	Create relationships between data, find patterns and correlations
Analysis (Exploratory/Confirmatory)	Bulk of analysis. Look for answers and projections
Communication (Visualisation)	Visualize data as charts, tables, maps, etc.

Data Science in Cybersecurity

Analysing data like log events leads to an intelligent understanding of ongoing events in the org. Other uses include:

SIEM: SIEMs collect and correlate large amounts of data to give a wider understanging of an org.
Threat trend analysis: Track and understand emerging threats.
Predictive analysis: Analyse historical events to create a potential picture of future threats. Forwarned = prepared.

Jupyter Notebooks

Open source documents with code, text, and terminal functionality; easily shared and executed across systems. Good way to demonstrate and explain proof of concepts in Cybersecurity. Notebooks consist of cell that can be executed one at a time, step by step. Cells indicated by “In [#]” where number indexes to next higher number after successful completion. Can rearrange cells by grabbing around the cell number.

Lists are used to store a collection of values as a variable.

Pandas

Series - Data structure of key-value pair.

Turn data set from list variable (transportation) into a series by:

creating a new variable transportation_series
invoke Pandas’ Series function pandas.Series
provide variable transportation to function
print out series print(transportation_series)

transportation_series = pd.Series(transportation)

print(transportation_series)

Print also returns dtype: which include:

objects = strings(words)
int64 = integers(numbers)

DataFrames are a grouping of series similar to spreadsheet or database. Provide data similar to:

data = [['Ben', 24, 'UK'],
        ['Jacob', 32, 'US']
        ['Alice', 19, 'Germany']]

Convert to dataframe variable df by:

df = pd.DataFrame(data, columns=['Name', 'Age', 'Country of Residence'])

Return all data by calling variable (i.e. df).

Can also return a specific line with df.loc[#] where # is the line number (starting with 0).

Can then group dataframes by column, row, or comparing with groupby.

Group by two columns and sum them similar to:

df.groupby(['Department'])['Prize'].sum()

Group by two columns and provide summary of data in percentile:

df.groupby(['Department'])['Prize'].describe()

Matplotlib

Allows creation of visualisations of data such as bar charts, histograms, pie charts, waterfalls, etc. If using within Jupyter notebook begin with %matplotlib inline otherwise it generates plots elsewhere. Create plots with plot() function. Similar to:

plt.xlabel('Months of the year')
plt.ylabel('Number of toys produced')
plt.title('A Line Graph Showing Number of Toys Produced Monthly')
plt.plot(['January', 'February', 'March', 'April'],[8,14,23,40])

Combine Pandas and Matplotlib to create a bar graph based on a csv file similar to:

spreadsheet = pd.read_csv('drinks.csv')

drinks = spreadsheet['Drink']
votes = spreadsheet['Vote']

plt.figure(figsize=(10, 6)) # adjust as necessary for readability

plt.barh(drinks, votes, color='skyblue') # barh for horizontal, barv for vertical.
plt.xlabel('Number of Votes')
plt.ylabel('Name of Drink')
plt.title('Bar Graph Showing Favorite Drinks')
plt.gca().invert_yaxis() # optionally invert the y-axis.

The Task

Analyse a packet capture saved as a csv. Answers gotten.