02 Dec 2023
Log analysis O Data, All Ye Faithful
Data science often involves programming, statistics, and the use of AI to examine and understand trends and patterns. Data scientist roles:
| Role | Description |
|---|---|
| Data Collection | Collecting raw data (e.g. a list of recent transactions) |
| Data Processing | Turning raw data from collection into a standard format |
| Data Mining (Clustering/Classification) | Create relationships between data, find patterns and correlations |
| Analysis (Exploratory/Confirmatory) | Bulk of analysis. Look for answers and projections |
| Communication (Visualisation) | Visualize data as charts, tables, maps, etc. |
Analysing data like log events leads to an intelligent understanding of ongoing events in the org. Other uses include:
Jupyter Notebooks: open-source documents with code, text, and terminal functionality; easily shared and executed across systems. Good way to demonstrate and explain proofs of concept in cybersecurity. Notebooks consist of cells that can be executed one at a time, step by step. Cells are indicated by "In [#]", where the number moves to the next higher number after successful completion. Can rearrange cells by grabbing the area around the cell number.
Lists are used to store a collection of values as a variable.
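As a minimal sketch, a list such as the transportation variable mentioned in these notes could look like the following. The values are made up for illustration, since the notes do not record the actual contents.

```python
# Hypothetical values; the notes do not list what transportation contains.
transportation = ['Train', 'Plane', 'Car', 'Bus']

# Lists keep their order and are indexed from 0.
first_item = transportation[0]
item_count = len(transportation)
print(first_item, item_count)
```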
Series - a Pandas data structure of key-value pairs.
Turn a data set from a list variable (transportation) into a series by passing transportation to the pandas.Series function, storing the result in transportation_series, and printing it:
import pandas as pd
transportation_series = pd.Series(transportation)
print(transportation_series)
Print also returns the dtype of the series, i.e. the data type of its values.
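To illustrate the dtype that print reports, here is a short sketch with made-up values. Pandas infers the dtype from the values in the series, so whole numbers and decimals are reported differently.

```python
import pandas as pd

# Hypothetical values, chosen only to show how the dtype is inferred.
whole_numbers = pd.Series([1, 2, 3])
decimals = pd.Series([1.5, 2.5, 3.5])

print(whole_numbers.dtype)  # an integer dtype
print(decimals.dtype)       # a floating-point dtype
```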
DataFrames are a grouping of series, similar to a spreadsheet or database. Provide data similar to:
data = [['Ben', 24, 'UK'],
        ['Jacob', 32, 'US'],
        ['Alice', 19, 'Germany']]
Convert to dataframe variable df by:
df = pd.DataFrame(data, columns=['Name', 'Age', 'Country of Residence'])
Return all data by calling variable (i.e. df).
Can also return a specific line with df.loc[#] where # is the line number (starting with 0).
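Putting the DataFrame steps together, a self-contained sketch using the sample data from these notes might look like this. The variable names second_row and ages are only illustrative.

```python
import pandas as pd

data = [['Ben', 24, 'UK'],
        ['Jacob', 32, 'US'],
        ['Alice', 19, 'Germany']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Country of Residence'])

# With the default index the labels are 0, 1, 2, so df.loc[1] is the second row.
second_row = df.loc[1]
print(second_row['Name'])  # Jacob

# A single column can be selected by its name.
ages = df['Age']
print(ages.tolist())  # [24, 32, 19]
```

Note that df.loc looks rows up by index label. It matches the line number here only because the default index starts at 0 and counts up.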
Can then group dataframes by column, row, or comparing with groupby.
Group by one column and sum another, similar to:
df.groupby(['Department'])['Prize'].sum()
Group by one column and provide a statistical summary of another (count, mean, std, min, percentiles, max):
df.groupby(['Department'])['Prize'].describe()
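A runnable sketch of both groupby calls follows. Only the column names Department and Prize come from the notes; the rows and the variable name awards are invented for illustration.

```python
import pandas as pd

# Hypothetical data; only the column names come from the notes.
awards = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'IT', 'IT', 'IT'],
    'Prize': [10, 20, 5, 15, 25],
})

# Total prize value per department.
totals = awards.groupby(['Department'])['Prize'].sum()
print(totals)

# Count, mean, std, min, quartiles and max of Prize per department.
summary = awards.groupby(['Department'])['Prize'].describe()
print(summary)
```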
Matplotlib allows creation of visualisations of data such as bar charts, histograms, pie charts, waterfalls, etc. If using it within a Jupyter notebook, begin with %matplotlib inline; otherwise it generates plots elsewhere. Create plots with the plot() function, similar to:
import matplotlib.pyplot as plt
plt.xlabel('Months of the year')
plt.ylabel('Number of toys produced')
plt.title('A Line Graph Showing Number of Toys Produced Monthly')
plt.plot(['January', 'February', 'March', 'April'], [8, 14, 23, 40])
Combine Pandas and Matplotlib to create a bar graph based on a csv file similar to:
spreadsheet = pd.read_csv('drinks.csv')
drinks = spreadsheet['Drink']
votes = spreadsheet['Vote']
plt.figure(figsize=(10, 6)) # adjust as necessary for readability
plt.barh(drinks, votes, color='skyblue') # barh for horizontal, bar for vertical.
plt.xlabel('Number of Votes')
plt.ylabel('Name of Drink')
plt.title('Bar Graph Showing Favorite Drinks')
plt.gca().invert_yaxis() # optionally invert the y-axis.
Analyse a packet capture saved as a csv. Answers gotten.
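As a rough sketch of what such an analysis could look like with Pandas: the column names and rows below are invented for illustration and are not taken from the actual capture, which is read here from an in-memory string so the example runs without a file.

```python
import io
import pandas as pd

# Hypothetical capture export; column names and values are assumptions.
csv_text = """PacketNumber,Source,Destination,Protocol
1,10.0.0.1,10.0.0.2,TCP
2,10.0.0.1,10.0.0.3,ICMP
3,10.0.0.2,10.0.0.1,TCP
4,10.0.0.1,10.0.0.2,DNS
"""
packets = pd.read_csv(io.StringIO(csv_text))

# Total number of packets in the capture.
packet_count = len(packets)

# Number of packets sent by each source address.
per_source = packets.groupby(['Source']).size()

# Source address that sent the most packets.
top_source = per_source.idxmax()

# How often each protocol appears.
protocol_counts = packets['Protocol'].value_counts()

print(packet_count, top_source)
print(protocol_counts)
```

With a real export, pd.read_csv would be given the file name instead, and the column names would need to match whatever the capture tool wrote.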